parts adapted from Jorge Moraleda
Turning it in: You’ll turn in a zip file via Canvas that contains a single folder, named with your first and last name. That single folder should contain (a) a pdf write-up of the answers to the specific questions described below and (b) the other items described below.
The high-level goal of this part of the assignment is to explore how differences in preprocessing can affect text analytic results. This also serves to introduce Apache Lucene Core, a powerful text indexing and retrieval library widely used in industry, which we will also be using for later assignments.
Tokenization. You will compare two widely used tokenizers on a short piece of text named wsj_0063. This text was originally published in the Wall Street Journal in 1989 and is now part of the Penn Treebank, a widely used corpus that has been annotated for grammatical structure. While the full Penn Treebank is not free, this file can be found in the corpus’s free sample as included in the NLTK project. Note that this data is licensed for non-commercial use only. For the purposes of this exercise we’ll be using the raw version of this text, available at raw/wsj_0063.
The two tokenizers to compare are the one in CoreNLP and the one in Lucene Core. Like CoreNLP, Lucene is Java software, so if you want to use it from Python, you’ll need a wrapper. The official one is PyLucene.
You will report the tokenization differences between the two approaches. For each line where the tokenization differs between the two approaches, show the original string and the two alternative tokenizations. What patterns do you see in the differences?
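The line-by-line comparison described above can be sketched as follows. This is only an illustration of the reporting step: `whitespace_tokenize` and `regex_tokenize` are hypothetical stand-ins so that the example runs without Java, and you would replace them with calls into the two tokenizers you are actually comparing.

```python
import re


def whitespace_tokenize(line):
    # Stand-in tokenizer A (an assumption): split on whitespace only.
    return line.split()


def regex_tokenize(line):
    # Stand-in tokenizer B (an assumption): runs of word characters,
    # or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", line)


def tokenization_differences(lines, tokenize_a, tokenize_b):
    """Return (original, tokens_a, tokens_b) for every line that the
    two tokenizers split differently."""
    differences = []
    for line in lines:
        text = line.rstrip("\n")
        tokens_a = tokenize_a(text)
        tokens_b = tokenize_b(text)
        if tokens_a != tokens_b:
            differences.append((text, tokens_a, tokens_b))
    return differences


if __name__ == "__main__":
    sample = ["Stocks rose sharply .", "The U.S. economy didn't slow."]
    for text, a, b in tokenization_differences(
            sample, whitespace_tokenize, regex_tokenize):
        print("ORIGINAL:", text)
        print("A:       ", " | ".join(a))
        print("B:       ", " | ".join(b))
        print()
```

Printing tokens with a visible separator such as `|` makes it easier to spot where one tokenizer splits a string that the other keeps whole.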
Normalization. You will compare four widely used normalization algorithms, also on wsj_0063. Note that in Lucene Core, normalization algorithms are referred to as ‘analyzers’. There is a list of analyzers for English. The four to compare are:
Lucene Core’s default for English, implementing the Porter Stemmer version 2
PorterStemFilter: implementing the original Porter Stemmer, which is also widely used in analytics
KStemFilter: a less aggressive stemmer
the lemma annotation in CoreNLP
As for the previous exercise, you will report the differences between these four approaches, for each line where the normalization differs between any of them. Describe the main ways the lemmatizer differs from the stemmers. Also describe the patterns you see in the differences between stemmers.
Note: tokenization must be performed before normalization. To make this simple, use Lucene to tokenize prior to applying the Lucene normalization algorithms and use CoreNLP to tokenize prior to applying the CoreNLP normalization algorithms.
Tokenization and normalization. You will decide on a tokenization and normalization algorithm to apply to classbios.txt from the previous assignment. Make an argument as to why you selected that particular tokenization and normalization algorithm. Format the output so that there is one sentence per line and each normalized token is separated by a space. Include your normalized classbios.txt file in the zip.
Basic frequency analysis. Write a program to calculate the frequency of each (normalized) word in the normalized classbios.txt corpus. Report the top 20 most frequent words, excluding stop words and punctuation, and their frequencies. Describe whether you think this list gives much information about this corpus. Did you learn anything from this list about the backgrounds of the class?
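One way to approach the frequency count is sketched below. It assumes the input is already normalized, with one sentence per line and tokens separated by spaces. The `STOP_WORDS` set is a tiny illustrative list, not a recommended one, and the helper names are made up for this example.

```python
from collections import Counter
import string

# Tiny illustrative stop list (an assumption); use a fuller list in practice.
STOP_WORDS = {"a", "an", "and", "the", "in", "of", "to", "i", "is", "am"}


def is_content_token(token):
    """True if the token is neither a stop word nor pure punctuation."""
    if token.lower() in STOP_WORDS:
        return False
    return not all(ch in string.punctuation for ch in token)


def top_words(lines, n=20):
    """Count space-separated tokens and return the n most frequent ones."""
    counts = Counter()
    for line in lines:
        counts.update(tok for tok in line.split() if is_content_token(tok))
    return counts.most_common(n)


if __name__ == "__main__":
    sample = ["i study linguistics .", "i study the brain and linguistics ."]
    for word, freq in top_words(sample):
        print(word, freq)
```

To run this on the real corpus, pass an open file handle for your normalized file in place of `sample`, since iterating over a file yields its lines.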
Basic bigram analysis. Write a program to calculate the frequency of each bigram in the corpus (i.e., sequence of two words). Report the top 20 most frequent bigrams, excluding bigrams that include a stop word or punctuation, and their frequencies. Describe whether you think this list gives much information about this corpus. Did you learn anything about the backgrounds of the class?
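A possible sketch of the bigram count follows. It assumes bigrams do not cross sentence boundaries (one sentence per line) and again uses a tiny illustrative `STOP_WORDS` set. Note that the filter is applied to each pair after pairing: removing stop words before pairing would create bigrams from words that were never adjacent in the text.

```python
from collections import Counter
import string

# Tiny illustrative stop list (an assumption); use a fuller list in practice.
STOP_WORDS = {"a", "an", "and", "the", "in", "of", "to", "i", "is", "am"}


def is_excluded(token):
    """True for stop words and for tokens made up only of punctuation."""
    return (token.lower() in STOP_WORDS
            or all(ch in string.punctuation for ch in token))


def top_bigrams(lines, n=20):
    """Count adjacent token pairs within each line, skipping any pair
    that contains a stop word or punctuation."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for first, second in zip(tokens, tokens[1:]):
            if not (is_excluded(first) or is_excluded(second)):
                counts[(first, second)] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample = ["i study computational linguistics .",
              "computational linguistics is fun"]
    for (first, second), freq in top_bigrams(sample):
        print(first, second, freq)
```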
Sentiment analysis. For the final exercise, you’ll create a naive Bayes classifier to perform binary sentiment analysis on movie reviews. You’ll use Pang & Lee’s (2004) polarity dataset 2.0, which consists of 1000 positive and 1000 negative movie reviews. Note that these reviews have already been pre-processed so that tokenization has already been done. Each review is in its own file, each sentence is on its own line, and each token is followed by a space. Specifically:
To apply the naive Bayes classifier, you’ll calculate the log probability of a test review under each class by summing the log probabilities of each word in the review. (Ignore new words in the test reviews that weren’t in training, since your models don’t assign them a probability.) Finally, classify according to whichever log probability is higher. (Recall that log probabilities are always negative, and -2 is higher than -4.)
After training on the reviews with filenames that begin with cv0, classify the 200 reviews with filenames that begin with cv6 and cv7 and report performance: precision, recall, and F score. Then do the same for training on the first 300 examples, the first 500 examples, and finally the first 600 examples (cv0 to cv5), in each case testing on those that begin with cv6 and cv7. Do you think performance of this classifier would improve with even more than 600 training examples?
Bonus. Build a new system that improves on your original classifier. Test it on the cv6 and cv7 files, which we’re technically using as a validation set, and make sure that your new system is improving performance on those. Report how much you’ve improved performance on this set (precision, recall, F score).
Final evaluation. Test on the files that begin with cv8 and cv9. For this evaluation, train on the first 800 files (cv0 to cv7). Report performance of the original classifier (and your new classifier if you did the bonus) on this data. If your original classifier performance was substantially different evaluated on cv8 and cv9 than it was previously on cv6 and cv7, why do you think that was? If you did the bonus, and your new system didn’t outperform your original classifier on this new data, but did outperform it previously on the cv6 and cv7 files, what do you think happened?
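The classification and scoring steps can be sketched roughly as below. Several choices here are assumptions rather than requirements of the assignment: add-one smoothing over the training vocabulary (so that a word seen in only one class still gets a probability in the other), a class prior term (which makes no difference when the classes are balanced), the balanced F1 as the F score, and "pos" as the class for which precision and recall are computed. The function names and the toy data are made up for illustration.

```python
import math
from collections import Counter


def train(docs_by_class):
    """docs_by_class maps a label to a list of reviews (lists of tokens)."""
    counts = {label: Counter(tok for doc in docs for tok in doc)
              for label, docs in docs_by_class.items()}
    vocab = set().union(*counts.values())
    total_docs = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for label, docs in docs_by_class.items():
        # Add-one smoothing over the training vocabulary (an assumption).
        denom = sum(counts[label].values()) + len(vocab)
        model[label] = {
            "log_prior": math.log(len(docs) / total_docs),
            "log_likelihood": {w: math.log((counts[label][w] + 1) / denom)
                               for w in vocab},
        }
    return model


def classify(model, tokens):
    """Return the label with the highest summed log probability."""
    scores = {}
    for label, params in model.items():
        likelihood = params["log_likelihood"]
        # Words never seen in training are skipped.
        scores[label] = params["log_prior"] + sum(
            likelihood[w] for w in tokens if w in likelihood)
    return max(scores, key=scores.get)


def precision_recall_f1(gold, predicted, positive="pos"):
    """Precision, recall, and F1 with respect to the `positive` label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    toy = {"pos": [["good", "great", "fun"], ["good", "film"]],
           "neg": [["bad", "boring", "film"], ["bad", "awful"]]}
    model = train(toy)
    print(classify(model, ["good", "fun", "unseenword"]))
    print(precision_recall_f1(["pos", "pos", "neg", "neg"],
                              ["pos", "neg", "pos", "neg"]))
```

For the actual experiments, each review file would be read into a list of tokens with `split()`, and the training and test sets would be selected by filename prefix.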