Linguistics 252: Advanced Probabilistic Models of Language (Spring 2012)

1 Instructor info

Instructor: Roger Levy (rlevy@ucsd.edu)
Office: Applied Physics & Math (AP&M) 4220
Office hours: Fridays 10am-12pm
Class Time: TuTh 2:00-3:50pm (in general, Thursdays 3-3:50pm will be practicum times)
Class Location: AP&M 4301
Class webpage: http://grammar.ucsd.edu/courses/lign252/

2 Course Description

Probabilistic modeling is transforming the study of human language, ranging from novel theories of linguistic cognition to sophisticated techniques for statistical analysis of complex, structured linguistic data to practical methods for automated processing of language. Doing cutting-edge research in these areas requires skill with probability and statistics, familiarity with formalisms from computational linguistics, the ability to use and develop new computational tools, and comfort with handling complex datasets. This course will cover both theory and application, introducing conceptual fundamentals and providing hands-on opportunities for skill building across a number of important topics in this area. We'll start out with maximum-entropy models and hierarchical regression models, then move on to latent-variable models including Latent Dirichlet Allocation ("topics" models) and the Dirichlet Process, and finally to weighted grammar formalisms including probabilistic finite-state automata and probabilistic context-free grammars. We'll apply these techniques to both data analysis and modeling on a variety of problems and datasets. Both maximum-likelihood and Bayesian approaches will be covered.
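
As a concrete taste of the first topic, note that a conditional maximum-entropy model over two outcomes is mathematically equivalent to a logistic regression, so the simplest possible instance can already be fit with base R. The sketch below is purely illustrative; the data frame and variable names are hypothetical, not taken from any course dataset.

    # Toy conditional maximum-entropy (= binary logistic regression) model in base R.
    # Hypothetical data: was an optional relativizer omitted, as a function of
    # the log frequency of the head noun?  (Made-up numbers, for illustration only.)
    d <- data.frame(
      omitted  = c(1, 0, 1, 1, 0, 0, 1, 0),
      log_freq = c(3.2, 1.1, 2.8, 3.9, 0.7, 1.5, 2.2, 0.9)
    )
    m <- glm(omitted ~ log_freq, data = d, family = binomial)  # maxent over two classes
    summary(m)   # the feature weight is the estimated coefficient on log_freq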

We'll be using a variety of computational tools including the open-source R programming language, the Bayesian graphical-modeling toolkit JAGS, packages for R that implement Latent Dirichlet Allocation and other hierarchical models, OpenFST for weighted finite-state machines, and my implementations of an incremental parser for probabilistic context-free grammars, as well as general weighted finite-state automaton/context-free grammar intersection. Comfort with programming in a high-level language such as Python may also come in useful during the course.
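
To give a sense of how the Bayesian graphical-modeling side of this toolkit looks in practice, here is a minimal sketch of calling JAGS from R. It assumes the rjags interface package (not named above, so treat that as one possible choice) and a made-up binomial dataset; it is meant only to show the shape of the workflow, not any particular course assignment.

    library(rjags)   # assumes JAGS and the rjags package are installed

    # Toy Bayesian model: infer the probability p of a binary linguistic choice
    # (e.g., use of a contracted form) from k "successes" out of n trials.
    model_string <- "
    model {
      k ~ dbin(p, n)     # binomial likelihood
      p ~ dbeta(1, 1)    # flat Beta prior on p
    }"
    dat <- list(k = 37, n = 50)   # hypothetical counts, for illustration only

    m <- jags.model(textConnection(model_string), data = dat, n.chains = 2)
    update(m, 1000)                                      # burn-in
    samples <- coda.samples(m, variable.names = "p", n.iter = 5000)
    summary(samples)                                     # posterior summary for p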

3 Target audience

Students should have an interest in, and some background in, studying language using quantitative and modeling techniques. You should also have some background in probability theory, statistics, and/or machine learning, and you should know how to program. Anyone who has taken my course Linguistics 251 (Probabilistic Methods in Linguistics) fulfills all of these background prerequisites; if you haven't taken Linguistics 251 but are interested in taking this course, just talk to me.

4 Reading material

The main reading material will be draft chapters of a textbook-in-progress, Probabilistic Models in the Study of Language, that I am writing. These draft chapters can be found here. There are also a number of other reference texts that may be of use in the course, including:

Finally, we may supplement these with additional readings, both from statistics texts and pertinent linguistics articles.

5 Syllabus

Week | Day | Topic | Textbook chapters | Other reading | Homework assignments
Week 1 | 3 Apr | Brief review of probability theory & statistics | PMSL Chapters 2-4 | | Homework 1 (due 5 April)
 | 5 Apr | Complete review of probability & statistics | PMSL Chapter 5 | | Homework 2 (due 10 April)
Week 2 | 10 Apr | Maximum Entropy models I | PMSL Chapter 6 | Berger et al., 1996 |
 | 12 Apr | Maximum Entropy models II | PMSL Chapter 6 | Hayes & Wilson, 2008 | Homework 4; example megam input file; word suffixes
Week 3 | 17 Apr | Maximum Entropy models III | | |
 | 19 Apr | Hierarchical regression models I | PMSL Chapter 8 | Baayen et al., 2008 |
Week 4 | 24 Apr | Hierarchical regression models II | PMSL Chapter 8 | Jaeger, 2008 |
 | 26 Apr | Hierarchical regression models III | PMSL Chapter 8 | Barr, Levy, Scheepers, & Tily, in revision |
Week 5 | 1 May | Roger out of town, no class | | |
 | 3 May | Roger out of town, no class | | |
Week 6 | 8 May | Latent-variable models I: mixtures of Gaussians | PMSL Chapter 9 | Vallabha et al., 2007; Feldman et al., 2009 |
 | 10 May | Latent-variable models II: latent Dirichlet allocation | PMSL Chapter 9 | Blei et al., 2003; Griffiths & Steyvers, 2004 |
Week 7 | 15 May | Latent-variable models III | PMSL Chapter 9 | | Homework 5
 | 17 May | Nonparametric models I: Dirichlet Process | PMSL Chapter 10 | Teh et al., 2006; Teh, 2006 |
Week 8 | 22 May | Nonparametric models II: Hierarchical Dirichlet Process | PMSL Chapter 10 | Goldwater et al., 2009 |
 | 24 May | Probabilistic grammar formalisms I: Probabilistic Finite-State Machines | PMSL Chapter 11 | |
Week 9 | 29 May | Probabilistic grammar formalisms II: Probabilistic Finite-State Machines, cont'd | PMSL Chapter 11 | |
 | 31 May | Probabilistic grammar formalisms III: Probabilistic context-free grammars | PMSL Chapter 11 | Charniak, 1997 |
Week 10 | 5 Jun | Probabilistic grammar formalisms IV: applications | PMSL Chapter 12; Probabilistic Earley Algorithm slides | Levy, 2008 |
 | 7 Jun | Probabilistic grammar formalisms V: Fragment grammars | TBD | O'Donnell et al., 2011 |
Finals | 14 Jun | Final projects due! | | |

6 Requirements

If you are taking the course for credit, there are four things expected of you:

  1. Regular attendance in class.
  2. Doing the assigned readings and coming ready to discuss them in class.
  3. Doing several homework assignments, to be assigned throughout the quarter. Email submission of homework assignments is encouraged, but please send them to lign252-homework@ling.ucsd.edu rather than to me directly; if you send them to me directly, I may lose track of them.

    You can find some guidelines on writing good homework assignments here. The source file for this PDF is here.

  4. A final project which will involve computational modeling and/or data analysis in some area relevant to the course. Final project guidelines are here.

7 Mailing List

There is a mailing list for this class, lign252-l@mailman.ucsd.edu. Please make sure you're subscribed to the mailing list by filling out the form at https://mailman.ucsd.edu/mailman/listinfo/lign252-l! We'll use it to communicate with each other.

8 Programming help

For this class I'll be maintaining an FAQ. Read the FAQ here.

I also run the R-lang mailing list. I suggest that you subscribe to it; it's a low-traffic list and is a good clearinghouse for technical and conceptual issues that arise in the statistical analysis of language data.

In addition, the searchable R mailing lists are likely to be useful.

Date: 2012-05-21 11:53:29 PDT

Author: Roger Levy
