Text analytics

Klinton Bicknell /// Fall 2017

Huge amounts of valuable data are stored outside of structured databases as human language: text and speech. This course covers modern techniques to extract useful information from this language data.


Schedule

Week | Date | Topic | Reading | Materials | Assignments
1 | Sep 21 | introduction; regular expressions; finite-state automata | SLP3 2.1 | slides 1, slides 2 | HW0 out
2 | Sep 28 | finite-state transducers; text normalization (e.g., tokenization, stemming) | SLP2 2.2–2.3, IR 2.1–2.2 | slides 1, slides 2, slides 3 | HW0 due (Tue), HW1 out
3 | Oct 5 | n-gram models; frequency analysis; cooccurrence analysis; edit distance; spelling correction; noisy channel models | SLP3 4, SLP3 5 | slides 1, slides 2 |
4 | Oct 12 | document classification; naive Bayes; logistic regression; sentiment analysis | SLP3 6 | slides 1, slides 2 | HW1 due
5 | Oct 19 | indexing and retrieval; Lucene | IR 1, 6 | slides 1, slides 2, slides 3 | HW2 out
6 | Oct 26 | similarity and clustering; latent semantic analysis; latent Dirichlet allocation; distributed word representations | SLP3 15, SLP3 16, Blei (2012) | slides 1, slides 2 |
7 | Nov 2 | Class cancelled | | | HW2 due, HW3 out
8 | Nov 9 | part-of-speech tagging; hidden Markov models; Viterbi algorithm; maximum entropy models | SLP3 9, SLP3 10 | slides 1, slides 2 |
9 | Nov 16 | named entity recognition; relation extraction; advanced maximum entropy models; coreference; formal grammars; syntactic parsing; wrap-up | SLP3 21 | slides 1, slides 2, slides 3, slides 4, slides 5 | HW3 due, HW4 out
10 | Nov 23 | Thanksgiving holiday! | | |
11 | Nov 30 | speech recognition for automatic transcription | | slides 1, slides 2 |
12 | Dec 4 | | | | HW4 due 5pm

Logistics

Course

Time
Thursdays 1–4
Location
North Campus Garage, Padula Room 1430
Textbooks
There is no required textbook, but the following are good references we will be drawing from in this course.
Website
www.klintonbicknell.com/ling400fall2017

Instructors

Instructor
Klinton Bicknell
Teaching assistant
Alexandros Nathan
TA office hours
Mondays 2–3, North Campus Garage, Office 1421
Instructor office hours
Thursdays 12–1, North Campus Garage, Office 1421

Policies

Email
Questions that are not personal should be posted on the Piazza forum (where they can be posted anonymously if desired). To contact the TA or instructor directly, coming to office hours is encouraged. For questions that are personal, students can email the TA or instructor.
Description
This course will explore techniques to analyze unstructured text such as that found in emails, text messages, conversation transcripts, web pages, books, and scientific journals. The course strives to balance breadth and depth, presenting both an overview of the field and some insight into the mathematical underpinnings of a few representative techniques.
Academic integrity
Violations of academic integrity will be referred to the Dean’s office, per WCAS policies. Sanctions can be quite severe, including suspension or permanent expulsion from the university. For details and discussion of how to avoid plagiarism, see the Academic Integrity section of the WCAS undergraduate handbook.

Requirements

Course Grade
  • 75% homeworks (4)
  • 25% participation
Homeworks
There will be four homework assignments throughout the quarter. These assignments will involve a combination of programming and short answer responses. The first three assignments should be done individually. The final assignment, which is more involved, must be done in pairs. Homework must be handed in through Canvas.
Participation
A substantial portion of the grade is based on participation, both in class (including regular attendance) and on the Piazza forum.
Keeping up
The syllabus (topics, assignments, due dates) may change. These changes will be announced in class, via Piazza, and on the course website. It is students' responsibility to keep up with them.
Deadlines
All assignments are posted on a Friday, and will generally be due on a Friday at 5pm (unless otherwise specified). Students will have 3 late days that can be used without penalty at any time in the quarter. (Weekend days count half.) These 3 can be distributed in any way: for one assignment 3 days late, for three assignments each 1 day late, or for one assignment 1 day late and one assignment 2 days late. After the 3 late days are used up, late work will only be accepted in extraordinary circumstances. Common occurrences such as exams in other classes or computer problems are not extraordinary. (In the event of actual extraordinary circumstances which will cause you to have trouble meeting a deadline, please contact the instructor as early as possible.) To avoid such problems, it's recommended that students start on all assignments soon after they're available.
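As a rough illustration of the late-day arithmetic, the sketch below tallies late days for a single assignment. The counting rule is an assumption, not stated policy: every calendar date after the due date, up to and including the submission date, is charged 1 day on a weekday and 0.5 on a Saturday or Sunday. The function name late_days_used and the example dates are hypothetical; check with the instructor for how partial days are actually counted.

```python
from datetime import date, timedelta

ALLOWANCE = 3.0  # late days available per student for the quarter


def late_days_used(due: date, submitted: date) -> float:
    """Late days charged for one assignment under the assumed counting rule."""
    used = 0.0
    day = due + timedelta(days=1)
    while day <= submitted:
        # weekday() is 5 for Saturday and 6 for Sunday; those count half.
        used += 0.5 if day.weekday() >= 5 else 1.0
        day += timedelta(days=1)
    return used


# Hypothetical example: due Friday, Oct 13, 2017; submitted Monday, Oct 16, 2017.
used = late_days_used(date(2017, 10, 13), date(2017, 10, 16))
remaining = ALLOWANCE - used
print(used, remaining)  # 2.0 1.0
```

Under this assumed rule, a submission on the Monday after a Friday deadline costs 2 late days (half for Saturday, half for Sunday, one for Monday), leaving 1 of the 3.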
AccessibleNU
Any student requesting accommodations related to a disability or other condition is required to register with AccessibleNU (accessiblenu@northwestern.edu; 847-467-5530) and provide professors with an accommodation notification from AccessibleNU, preferably within the first two weeks of class. All information will remain confidential.