parts adapted from Jorge Moraleda
Turning it in: You’ll turn in a zip file via Canvas that contains (a) a pdf write-up of the answers to the specific questions described below, (b) all your code, and (c) the other items described below.
Calculating document similarity is the first step in many text analytic tasks, including clustering and finding documents similar to one already known to be relevant. In this part of the assignment, you’ll compare different ways of computing document similarity.
Splitting, tokenization, normalization. In order to have a set of documents to calculate similarities between, split the classbios.txt file into one file per person. (Note that you can use regular expressions to do this.) Then, tokenize and normalize these files using any of the methods (in Lucene or CoreNLP) you used on the previous assignment. Next, remove stopwords and any words that contain non-alphabetic characters (e.g., punctuation or numbers), using any method you like. [No need to say anything about this problem in your write-up, but do include the code in your zip.]
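The separator format in classbios.txt isn’t specified here, so the sketch below assumes, purely hypothetically, that bios are delimited by a line of dashes; adjust the pattern to the file’s actual layout. It shows the regular-expression approach to splitting:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SplitBios {
    // Hypothetical delimiter: assumes bios are separated by a line of
    // three or more dashes. Change this pattern to match whatever
    // actually separates bios in classbios.txt.
    static final Pattern SEP = Pattern.compile("(?m)^-{3,}\\s*$");

    static List<String> split(String text) {
        return Arrays.stream(SEP.split(text))
                     .map(String::trim)
                     .filter(s -> !s.isEmpty())
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String sample = "Alice's bio.\n---\nBob's bio.\n";
        List<String> bios = split(sample);
        System.out.println(bios.size()); // 2
        // In the real assignment, read classbios.txt with Files.readString
        // and write each element of bios to its own file.
    }
}
```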
Boolean similarity. Use the normalized, tokenized bio documents to create a binary term-document matrix, where each row represents a vocabulary item, each column represents a document, and each element is a 0 or 1. In this matrix, each document is represented by a binary vector. From this binary matrix, create a document similarity matrix, where cell i, j gives the similarity of documents i and j, and where similarity is given by the cosine between the document vectors: \(\dfrac{I \cdot J}{|I| |J|}\). Save this matrix to a file boolean.txt, where values are separated by a space, and where the rows and columns are arranged in the same order they were given in classbios.txt. Do not include any row or column labels in this file. In your write-up, answer: Which three pairs of documents are the most similar? Which three pairs are the least similar?
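For binary vectors, the dot product is just the size of the term overlap, and each vector’s norm is the square root of the number of distinct terms in that document, so the cosine reduces to \(|I \cap J| / \sqrt{|I|\,|J|}\). A minimal sketch of one cell of the similarity matrix, representing each document as its set of terms:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BooleanSim {
    // Cosine similarity between two binary document vectors, represented
    // as sets of vocabulary terms: |I ∩ J| / sqrt(|I| * |J|).
    static double cosine(Set<String> i, Set<String> j) {
        if (i.isEmpty() || j.isEmpty()) return 0.0;
        Set<String> overlap = new HashSet<>(i);
        overlap.retainAll(j);
        return overlap.size() / Math.sqrt((double) i.size() * j.size());
    }

    public static void main(String[] args) {
        Set<String> d1 = new HashSet<>(List.of("likes", "hiking", "code"));
        Set<String> d2 = new HashSet<>(List.of("likes", "code", "music"));
        System.out.println(cosine(d1, d2)); // 2 / sqrt(9) ≈ 0.6667
    }
}
```

Computing every cell i, j of this over all document pairs yields the matrix to write to boolean.txt.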
tf-idf similarity. Use the normalized, tokenized bio documents to create a term-document matrix of raw frequencies. Then transform this raw frequency matrix into a tf-idf matrix by using the formula \((1+\log(tf))\left(\log\dfrac{N}{df}\right)\), where \(tf\) is the term’s raw frequency in this document, \(df\) is the number of documents the term occurs in, and \(N\) is the total number of documents. Now, again, convert this term-document matrix into a document similarity matrix in the same way as before and save it as tf_idf.txt, in the same format as before. In your write-up, answer: Which three pairs of documents are the most similar now? Which three pairs are the least similar? Does this more advanced method seem to be doing a better job than just using binary similarity? Why?
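The weighting formula can be sketched as below. Note that terms with \(tf = 0\) must get weight 0 (the formula is undefined there), and the base of the logarithm is not specified above — natural log is assumed here; swap in Math.log10 if the course expects base 10:

```java
public class TfIdf {
    // tf-idf weight per the assignment's formula:
    // (1 + log(tf)) * log(N / df), with weight 0 when tf == 0.
    // Assumes natural log; use Math.log10 for base-10 weighting.
    static double weight(int tf, int df, int nDocs) {
        if (tf == 0) return 0.0;
        return (1.0 + Math.log(tf)) * Math.log((double) nDocs / df);
    }

    public static void main(String[] args) {
        // A term occurring twice in one of 10 documents, and present
        // in 5 of those documents overall:
        System.out.println(weight(2, 5, 10)); // (1 + ln 2) * ln 2 ≈ 1.1736
        // A term occurring in every document gets weight 0, since
        // log(N / df) = log(1) = 0:
        System.out.println(weight(3, 10, 10)); // 0.0
    }
}
```

Applying this to every cell of the raw-frequency matrix, then computing cosines between the resulting real-valued columns, gives the matrix for tf_idf.txt.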
On the last assignment, you used Lucene simply to tokenize and normalize text, but it is, first and foremost, a full industrial-strength indexing and retrieval library. Thus, indexing text with Lucene is very often the first step in doing analytics. In this part of the assignment, you’ll gain practice in using Lucene for indexing and retrieval.
Index the documents, giving each one four fields: city_name, country_name, city_text, and country_text. Tokenize and normalize the text using the EnglishAnalyzer before indexing. One convenient way to strip HTML markup is to use the HTMLStripCharFilter class. Then implement the following queries:
- … Use a BooleanQuery for this. In addition to the Lucene docs, you might find this page a useful reference.
- … Use a FuzzyQuery for this.
- … Use a PhraseQuery with a slop factor of 10. In addition to the Lucene docs, you might find this page a useful reference.
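A hedged sketch of constructing the three query types named above is shown below. It assumes the Lucene core jar is on the classpath and follows the builder-style API of recent Lucene versions (older releases differ); the field names come from the index schema above, but the example search terms are purely hypothetical:

```java
// Sketch only: requires the Lucene core library on the classpath.
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QuerySketch {
    public static void main(String[] args) {
        // BooleanQuery: combine clauses with MUST / SHOULD / MUST_NOT.
        Query bq = new BooleanQuery.Builder()
            .add(new TermQuery(new Term("city_text", "river")),
                 BooleanClause.Occur.MUST)
            .add(new TermQuery(new Term("city_text", "desert")),
                 BooleanClause.Occur.MUST_NOT)
            .build();

        // FuzzyQuery: edit-distance matching on a single term,
        // e.g. tolerating the misspelling "sidney".
        Query fq = new FuzzyQuery(new Term("city_name", "sidney"));

        // PhraseQuery with a slop of 10: the terms may appear up to
        // 10 positions apart and still match.
        Query pq = new PhraseQuery(10, "city_text", "river", "bridge");
    }
}
```

Each query is then run through an IndexSearcher over the index built earlier.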