Text analytics

Klinton Bicknell /// Fall 2015

Huge amounts of valuable data are stored outside of structured databases as human language: text and speech. This course covers modern techniques to extract useful information from this language data.

Schedule


Week | Date | Topic | Reading | Materials | Assignments
1 | Sep 24 | introduction; regular expressions; finite-state automata | SLP3 2.1 | slides 1, slides 2 | HW0 out
2 | Oct 1 | finite-state transducers; text normalization (e.g., tokenization, stemming) | SLP2 2.2–2.3, IR 2.1–2.2 | slides 1, slides 2 | HW0 due (Tue), HW1 out
3 | Oct 8 | n-gram models; frequency analysis; cooccurrence analysis; edit distance; spelling correction; noisy channel models | SLP3 4, SLP3 6 | slides 1, slides 2 |
4 | Oct 15 | document classification; naive Bayes; logistic regression; sentiment analysis | SLP3 7 | slides 1, slides 2 | HW1 due, HW2 out
5 | Oct 22 | indexing and retrieval; Lucene | IR 1, 6 | slides 1, slides 2, slides 3 |
6 | Oct 29 | similarity and clustering; latent semantic analysis; latent Dirichlet allocation; distributed word representations | SLP3 19, Blei (2012) | slides 1, slides 2 |
7 | Nov 5 | class cancelled | | | HW2 due, HW3 out
8 | Nov 12 | part-of-speech tagging; hidden Markov models; Viterbi algorithm; maximum entropy models | SLP3 8, SLP3 9 | slides 1, slides 2 |
9 | Nov 19 | named entity recognition; relation extraction; advanced maximum entropy models; coreference; formal grammars | SLP3 20 | slides 1, slides 2, slides 3, slides 4 | HW3 due, HW4 out
10 | Nov 26 | Thanksgiving holiday: no class! | | |
11 | Dec 3 | syntactic parsing; wrap-up; speech recognition for automatic transcription | | slides 1, slides 2, slides 3, slides 4 |
12 | Dec 10 | [finals week] | | | HW4 due 4pm



Class meetings
Thursdays 1–4, North Campus Garage, Padula Room 1430
There is no required textbook, but the following are good references we will be drawing from in this course.


Instructor
Klinton Bicknell
Teaching assistant
Papis Wongchaisuwat
TA office hours
Tuesdays 1–2, North Campus Garage, Office 1421
Instructor office hours
Thursdays 12–1, North Campus Garage, Office 1421


Questions that are not personal should be posted on the Piazza forum (where they can be posted anonymously if desired). To contact the TA or instructor directly, coming to office hours is encouraged. For questions that are personal, students can email the TA or instructor.
This course will explore techniques to analyze unstructured text such as that found in emails, text messages, conversation transcripts, web pages, books, and scientific journals. The course strives to balance breadth and depth, presenting both an overview of the field and some insight into the mathematical underpinnings of a few representative techniques.
Academic integrity
Violations of academic integrity will be referred to the Dean’s office, per WCAS policies. Sanctions can be quite severe, including suspension or permanent expulsion from the university. For details and discussion of how to avoid plagiarism, see the Academic Integrity section of the WCAS undergraduate handbook.


Course grade
  • 75% homework (4 assignments)
  • 25% participation
There will be four homework assignments throughout the quarter. These assignments will involve a combination of programming and short answer responses. The first three assignments should be done individually. For the final assignment, which is more involved, working in pairs is encouraged. Homework must be handed in through Canvas.
A substantial portion of the grade is based on participation, both in class (including regular attendance) and on the Piazza forum.
Keeping up
The syllabus (topics, assignments, due dates) may change. These changes will be announced in class, via Piazza, and on the course website. It is students' responsibility to keep up with them.
All assignments are posted on a Friday, and will generally be due on a Friday at 5pm (unless otherwise specified). Students will have 3 late days that can be used without penalty at any time in the quarter. (Weekend days count half.) These 3 can be distributed in any way: for one assignment 3 days late, for three assignments each 1 day late, or for one assignment 1 day late and one assignment 2 days late. After the 3 late days are used up, late work will only be accepted in extraordinary circumstances. Common occurrences such as exams in other classes or computer problems are not extraordinary. (In the event of actual extraordinary circumstances that will cause you to have trouble meeting a deadline, please contact the instructor as early as possible.) To avoid such problems, it's recommended that students start on all assignments soon after they're available.
Any student requesting accommodations related to a disability or other condition is required to register with AccessibleNU (accessiblenu@northwestern.edu; 847-467-5530) and provide professors with an accommodation notification from AccessibleNU, preferably within the first two weeks of class. All information will remain confidential.