Linguistics/CSE 256: Statistical Natural Language Processing

Course information

Lecture Times: TuTh 5:30-7pm
Lecture Location: AP&M 2452
Class webpage: http://grammar.ucsd.edu/courses/lign256/

Instructor information

Instructor: Roger Levy (rlevy@ling.ucsd.edu)
Instructor's office: AP&M 4220
Instructor's office hours: Th 3:15-5:15pm

Course Description

The goal of this course is to train you, the students, to do research in natural language processing — work that can potentially be published in the leading conferences and journals of the field. In addition to helping you succeed academically in this field (and related fields including AI, machine learning, and psycholinguistics), this is also great training if you are interested in doing NLP work in industry, either in a research lab (Google, Microsoft, Powerset, Yahoo, etc.) or in a startup.

Intended Audience

Graduate students in linguistics, computer science, engineering, cognitive science, psychology, and any other discipline who are interested in how to process natural language by computer. Highly motivated undergraduates are also welcome, but please talk to the instructor before enrolling.

Reading material

We're going to be using the two premier textbooks in the field for this course:

  1. Jurafsky and Martin 2008: Speech and Language Processing (2nd Edition). Prentice Hall. This is a brand new edition; we beta tested it last year, and it's really nice!
  2. Manning and Schütze 1999: Foundations of Statistical Natural Language Processing. MIT Press. Here's the online version.

There will also be at least one reading from the following new textbook:

  1. Manning, Raghavan, and Schütze 2008: Introduction to Information Retrieval. Available here.

Finally, we may occasionally read recent papers from the literature, and I have some alpha-version book chapters that we may also make use of.

Skills required & suggested background

Working in natural language processing requires putting several different types of skills together:

You may not have all of these skills yet, but hopefully you have a substantial subset of them. It may require a bit of extra work for you to strengthen your background in any area where you're deficient — the focus within the class will be on how to put them together.

Mailing List

The mailing list for the class is ligncse256@ling.ucsd.edu. Sign up for the mailing list here.

Syllabus (subject to change!)

Week Day Topic Readings Materials Homework Assignments
Week 1 6 Jan Class Introduction M&S 1, 2 Lecture 1
8 Jan No class, Roger out of town
Week 2 13 Jan Language Modeling I M&S Chapter 6, J&M Chapter 4 Lecture 2 Programming Assignment 1 (due 27 Jan)
15 Jan Language Modeling II Chen and Goodman 1998 (an absolute classic) Lecture 3; Kneser-Ney mini-example
Week 3 20 Jan Text Categorization: supervised methods MRS 2008, Chapter 13 Lecture 4
22 Jan Unsupervised learning I: topic models for text categorization Griffiths & Steyvers, 2004 Lecture 5
Week 4 27 Jan Unsupervised Learning II: Word segmentation Goldwater et al., 2006 Lecture 6 Written assignment 1; Mini example of unsupervised word segmentation
29 Jan Formalisms: weighted finite state automata & context-free grammars J&M Chapter 3; Levy, 2008 Final project guidelines go out; Short intro to directed graphical models
Week 5 3 Feb Part-of-speech Tagging M&S Chapter 9, J&M Chapter 6 Lecture 8 Programming Assignment 2 (due 19 Feb); HMM Viterbi inference mini-example
5 Feb Syntax M&S Chapter 10, J&M Chapter 12 Lecture 9
Week 6 10 Feb Parsing I J&M Chapter 13, M&S Chapter 11 Lecture 10
12 Feb Parsing II J&M Chapter 14, M&S Chapter 12 Lecture 11
Week 7 17 Feb Computational Psycholinguistics and Incremental Parsing Hale 2001; Levy et al., 2009 Lecture 12 PDF component; Lecture 12 PPT component
19 Feb Word-sense disambiguation and semantic roles M&S Chapter 7, J&M Chapters 19 & 20 Lecture 13 You should show me a draft final project proposal by this point
Week 8 24 Feb Compositional Semantics J&M Chapter 18, handout Lecture 14 Short written homework assignment to be handed out
26 Feb Discourse Processing J&M Chapter 21 Lecture 15
Week 9 3 Mar Unsupervised Learning III: POS induction, morphology Clark 2000, Goldsmith 2001, Goldwater & Griffiths, 2007 Lecture 16
5 Mar Unsupervised Learning IV: Syntactic acquisition Klein & Manning 2002, 2004; Johnson et al., 2007 Lecture 17
Week 10 10 Mar Finish grammar induction, talk a bit about machine translation Lecture 18
12 Mar No class, Roger out of town
Finals 20 Mar Final Projects Due

Requirements & grading

Your grade will be based on the following criteria:

  1. Several written homework assignments, to be distributed at various times during the class (approx 15%);
  2. Several programming assignments, to be distributed at various times during the class (approx 50%);
  3. A final project (approx 35%).

Collaboration is encouraged for homework assignments and final projects, but you must be explicit about who you collaborated with and what the division of labor was.

Late homework policy

You have seven late days to use on your assignments, at your discretion. No more than five days can be used per assignment. After those days are used up, you lose 10% of your grade for that assignment per day late. I reserve the right to increase the number of late days if that seems appropriate (it can be challenging to correctly assess the difficulty and length of an assignment), but don't count on it!
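To make the arithmetic of this policy concrete, here is a minimal Python sketch of how it might play out. It is an illustration only, not an official grading script. The function name `apply_late_policy` and its parameters are hypothetical, and two interpretations are assumptions: the 10% penalty is taken off the score you earned, once per penalized day (linear, not compounding), and a score cannot drop below zero. Because late days are used at your discretion, the sketch lets you say how many free days to spend on a given assignment; by default it spends as many as the policy allows.

```python
TOTAL_LATE_DAYS = 7          # stated in the policy
MAX_FREE_PER_ASSIGNMENT = 5  # stated in the policy
PENALTY_PER_DAY = 0.10       # stated in the policy


def apply_late_policy(raw_score, days_late, free_days_left=TOTAL_LATE_DAYS,
                      free_days_requested=None):
    """Return (adjusted_score, free_days_left_afterwards).

    ASSUMPTIONS (not stated in the policy): the penalty is linear, i.e. each
    penalized day removes 10% of the earned score, and the adjusted score is
    floored at zero.
    """
    if days_late < 0 or free_days_left < 0:
        raise ValueError("days_late and free_days_left must be non-negative")
    if free_days_requested is None:
        # Default: spend as many free days as the policy permits.
        free_days_requested = days_late

    # Free days are capped by the request, the lateness, the remaining
    # budget, and the five-day per-assignment limit.
    free_days_used = max(0, min(free_days_requested, days_late,
                                free_days_left, MAX_FREE_PER_ASSIGNMENT))
    penalized_days = days_late - free_days_used
    factor = max(0.0, 1.0 - PENALTY_PER_DAY * penalized_days)
    return raw_score * factor, free_days_left - free_days_used


if __name__ == "__main__":
    # First assignment: 6 days late with the full budget available.
    score, left = apply_late_policy(90.0, days_late=6)
    print(score, left)
    # Second assignment: 3 days late with whatever budget remains.
    score, left = apply_late_policy(80.0, days_late=3, free_days_left=left)
    print(score, left)
```

In the first call, the per-assignment cap means only five of the six late days are free, so one day is penalized and two free days remain. In the second call, those two remaining days cover two of the three late days, leaving one penalized day and no free days.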

Leading Conferences and Journals in the field

Computational linguistics/NLP is a very conference-oriented field; many of the classic articles in the literature never wind up getting published in journals. The top conferences include:

There are also some excellent workshops and conferences run regularly by "special interest groups", including

and others. Finally, excellent work in computational linguistics/NLP also appears in machine learning, artificial intelligence, and other conferences, notably the Conference on Neural Information Processing Systems (NIPS).

The flagship and leading journal of the field is Computational Linguistics. Other excellent journals in the field include: