Turning it in: You’ll turn in a zip file via Canvas that contains (a) a PDF write-up answering the specific questions described below, (b) all your code, and (c) the other items described below.

## Document similarity

Calculating document similarity is the first step in many text analytics tasks, including clustering and finding documents similar to one already known to be relevant. In this part of the assignment, you’ll compare different ways of computing document similarity.

1. Splitting, tokenization, normalization. In order to have a set of documents to calculate similarities between, split the classbios.txt file into one file per person. (Note that you can use regular expressions to do this.) Then, tokenize and normalize these files using any of the methods (in Lucene or CoreNLP) you used on the previous assignment. Next, remove stopwords and any words that contain non-alphabetic characters (e.g., punctuation or numbers), using any method you like. [No need to say anything about this problem in your write-up, but do include the code in your zip.]
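A minimal sketch of this step in plain Java. The `Name:` header used to split the bios is an assumption about classbios.txt’s format (yours may use a different delimiter), and the tiny stopword list stands in for whatever list your Lucene/CoreNLP pipeline uses:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BioPrep {
    // Toy stopword list; substitute the one from your real pipeline.
    static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("the", "a", "an", "and", "of", "in", "is", "to"));

    // Split the raw file into one string per person, using a regex on the
    // (assumed) per-bio header line.
    public static List<String> splitBios(String raw) {
        List<String> bios = new ArrayList<>();
        for (String chunk : raw.split("(?m)^Name:")) {
            if (!chunk.trim().isEmpty()) bios.add(chunk.trim());
        }
        return bios;
    }

    // Lowercase, split on runs of non-alphanumeric characters, then drop
    // stopwords and any token that is not purely alphabetic (e.g., numbers).
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (t.matches("[a-z]+") && !STOPWORDS.contains(t)) tokens.add(t);
        }
        return tokens;
    }
}
```

For example, `tokenize("Born in 1999, I have 2 dogs and a cat")` keeps only the alphabetic, non-stopword tokens.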

2. Boolean similarity. Use the normalized, tokenized bio documents to create a binary term-document matrix, where each row represents a vocabulary item, each column represents a document, and each element is a 0 or 1. In this matrix, each document is represented by a binary vector. From this binary matrix, create a document similarity matrix, where cell i, j gives the similarity of documents i and j, and where similarity is given by the cosine between the document vectors: $$\dfrac{I \cdot J}{\sqrt{|I|\,|J|}}$$ where $|I|$ and $|J|$ are the numbers of 1s in vectors $I$ and $J$ (for a binary vector, $\sqrt{|I|}$ is exactly its Euclidean norm, so this is ordinary cosine similarity). Save this matrix to a file boolean.txt, where values are separated by a space, and where the rows and columns are arranged in alphabetical order of filename (i.e., the same order the bios appear in classbios.txt). Do not include any row or column labels in this file. In your write-up, answer: Which three pairs of documents are the most similar? Which three pairs are the least similar?
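One way this step might look, continuing the plain-Java sketch (the matrix representation, a term-to-row map of 0/1 document vectors, is just one convenient layout):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BooleanSim {
    // Build a binary term-document matrix: each vocabulary item maps to a
    // 0/1 vector with one entry per document.
    public static Map<String, int[]> binaryMatrix(List<List<String>> docs) {
        Map<String, int[]> matrix = new TreeMap<>();
        for (int d = 0; d < docs.size(); d++) {
            for (String term : docs.get(d)) {
                matrix.computeIfAbsent(term, k -> new int[docs.size()])[d] = 1;
            }
        }
        return matrix;
    }

    // Cosine between binary columns i and j: shared terms divided by the
    // square root of the product of the documents' term counts.
    public static double cosine(Map<String, int[]> matrix, int i, int j) {
        double dot = 0, ni = 0, nj = 0;
        for (int[] row : matrix.values()) {
            dot += row[i] * row[j];
            ni += row[i];
            nj += row[j];
        }
        return dot / Math.sqrt(ni * nj);
    }
}
```

For instance, documents `[a, b, c]` and `[b, c, d]` share two of their three terms each, giving a cosine of 2/3.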

3. tf-idf similarity. Use the normalized, tokenized bio documents to create a term-document matrix of raw frequencies. Then transform this raw-frequency matrix into a tf-idf matrix using the formula $$(1+\log(tf))\left(\log\left(\dfrac{N}{df}\right)\right)$$ where $tf$ is the term’s raw frequency in the document, $df$ is the number of documents the term occurs in, and $N$ is the total number of documents; take the weight to be 0 whenever $tf = 0$. Then convert this term-document matrix into a document similarity matrix in the same way as before (cosine similarity, now with real-valued vectors) and save it as tf_idf.txt, in the same format as before. In your write-up, answer: Which three pairs of documents are the most similar now? Which three pairs are the least similar? Does this more advanced method seem to do a better job than binary similarity? Why?
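The tf-idf transform itself is a small amount of code. A sketch, assuming the raw-count matrix is stored as a term-by-document `int[][]`:

```java
public class TfIdf {
    // Weight for one cell: (1 + log tf) * log(N / df), with 0 when tf == 0
    // (the log of a zero count is undefined, so absent terms get weight 0).
    public static double weight(int tf, int df, int n) {
        if (tf == 0) return 0.0;
        return (1 + Math.log(tf)) * Math.log((double) n / df);
    }

    // Transform a term-by-document raw-count matrix into a tf-idf matrix
    // of the same shape, computing each term's df from its row.
    public static double[][] transform(int[][] raw) {
        int nDocs = raw[0].length;
        double[][] out = new double[raw.length][nDocs];
        for (int t = 0; t < raw.length; t++) {
            int df = 0;
            for (int count : raw[t]) if (count > 0) df++;
            for (int d = 0; d < nDocs; d++) {
                out[t][d] = weight(raw[t][d], df, nDocs);
            }
        }
        return out;
    }
}
```

Note that a term occurring in every document gets $\log(N/df) = \log(1) = 0$, so it contributes nothing to similarity, which is exactly the behavior that distinguishes tf-idf from the binary scheme.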

## Indexing and retrieval

On the last assignment, you used Lucene simply to tokenize and normalize text, but its main function is as a full industrial-strength indexing and retrieval library. Thus, indexing text using Lucene is very often the first step to doing analytics. In this part of the assignment, you’ll gain practice in using Lucene for indexing and retrieval.

• To accomplish this, first create a Lucene index. Then add the text of the Wikipedia entry for each capital city and each country. Specifically, add one document to the index per capital city you extract, with four Lucene fields: city_name, country_name, city_text, and country_text.
• There’s no need to use any fancy html parsing libraries for this assignment, but you will need to strip out the html from the web pages before sending them on to an EnglishAnalyzer and indexing them. One convenient way to do this is to use the HTMLStripCharFilter class.
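A sketch of how these two bullets might fit together (Lucene 8.x API assumed; the index path, sample city, and HTML strings are illustrative placeholders, and in the real assignment you would loop over every extracted capital/country pair):

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Paths;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class CapitalIndexer {

    public static void main(String[] args) throws IOException {
        IndexWriterConfig cfg = new IndexWriterConfig(new EnglishAnalyzer());
        try (IndexWriter writer =
                 new IndexWriter(FSDirectory.open(Paths.get("capital-index")), cfg)) {
            // Placeholder page contents; really these come from the web pages.
            String cityHtml = "<html><body><p>Athens is ...</p></body></html>";
            String countryHtml = "<html><body><p>Greece is ...</p></body></html>";

            Document doc = new Document();  // one document per capital city
            doc.add(new StringField("city_name", "Athens", Field.Store.YES));
            doc.add(new StringField("country_name", "Greece", Field.Store.YES));
            doc.add(new TextField("city_text", stripHtml(cityHtml), Field.Store.NO));
            doc.add(new TextField("country_text", stripHtml(countryHtml), Field.Store.NO));
            writer.addDocument(doc);
        }
    }

    // Drain HTMLStripCharFilter to get the page text with tags removed.
    static String stripHtml(String html) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[4096];
        try (HTMLStripCharFilter filter = new HTMLStripCharFilter(new StringReader(html))) {
            for (int n; (n = filter.read(buf)) != -1; ) sb.append(buf, 0, n);
        }
        return sb.toString();
    }
}
```

An alternative, slightly more idiomatic route is to wrap HTMLStripCharFilter into a custom Analyzer so the stripping happens inside the analysis chain rather than as a separate pass.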
Once the index is built, use it to retrieve the documents matching each of the following:

• ‘Greek’ and ‘Roman’ but not ‘Persian’: Use a BooleanQuery for this. In addition to the Lucene docs, you might find this page a useful reference.
• ‘Shakespeare’, even if it’s misspelled: Use a FuzzyQuery for this.
• the words ‘located below sea level’ near each other: Use a PhraseQuery with a slop factor of 10. In addition to the Lucene docs, you might find this page a useful reference.
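Sketches of the three query types (Lucene 8.x API assumed; searching the `city_text` field is one choice, not prescribed). Note that because the index was built with an EnglishAnalyzer, the terms in these queries should be the analyzed forms, i.e., lowercased and possibly stemmed:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class CapitalQueries {

    // 'greek' AND 'roman' but NOT 'persian'
    static Query greekRomanNotPersian() {
        return new BooleanQuery.Builder()
                .add(new TermQuery(new Term("city_text", "greek")), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("city_text", "roman")), BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("city_text", "persian")), BooleanClause.Occur.MUST_NOT)
                .build();
    }

    // matches 'shakespeare' and close misspellings (default edit distance 2)
    static Query shakespeareFuzzy() {
        return new FuzzyQuery(new Term("city_text", "shakespeare"));
    }

    // the phrase's terms may appear up to 10 positions apart
    static Query belowSeaLevel() {
        return new PhraseQuery(10, "city_text", "located", "below", "sea", "level");
    }
}
```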