# Linguistics 252: Advanced Probabilistic Models of Language (Spring 2012)

## 1 Instructor info

Instructor | Roger Levy (rlevy@ucsd.edu) |

Office | Applied Physics & Math (AP&M) 4220 |

Office hours | Fridays 10am-12pm |

Class Time | TuTh 2:00-3:50pm (in general, Thursdays 3-3:50pm will be practicum times) |

Class Location | AP&M 4301 |

Class webpage | http://grammar.ucsd.edu/courses/lign252/ |

## 2 Course Description

Probabilistic modeling is transforming the study of human language, with contributions ranging from novel theories of linguistic cognition, to sophisticated techniques for statistical analysis of complex, structured linguistic data, to practical methods for automated processing of language. Doing cutting-edge research in these areas requires skill with probability and statistics, familiarity with formalisms from computational linguistics, the ability to use and develop new computational tools, and comfort with handling complex datasets. This course will address both theory and application, presenting conceptual fundamentals and giving hands-on opportunities for skill building across a number of important topics in this area. We'll start out with maximum-entropy models and hierarchical regression models, then move on to latent-variable models including Latent Dirichlet Allocation ("topic" models) and the Dirichlet Process, and finally to weighted grammar formalisms including probabilistic finite-state automata and probabilistic context-free grammars. We'll apply these techniques to both data analysis and modeling on a variety of problems and datasets. Both maximum-likelihood and Bayesian approaches will be covered.

We'll be using a variety of computational tools including the open-source R programming language, the Bayesian graphical-modeling toolkit JAGS, packages for R that implement Latent Dirichlet Allocation and other hierarchical models, OpenFST for weighted finite-state machines, and my implementations of an incremental parser for probabilistic context-free grammars, as well as general weighted finite-state automaton/context-free grammar intersection. Comfort with programming in a high-level language such as Python may also come in useful during the course.

## 3 Target audience

Students should have an interest and some background in studying language using quantitative and modeling techniques. You should also have some background in probability theory, statistics, and/or machine learning, and you should know how to program. Anyone who has taken my course Linguistics 251 (Probabilistic Methods in Linguistics) fulfills all these background prerequisites; if you haven't taken Linguistics 251 but are interested in taking the course, just talk to me.

## 4 Reading material

The main reading material will be draft chapters of a textbook-in-progress, Probabilistic Models in the Study of Language, that I am writing. These draft chapters can be found here. There are also a number of other reference texts that may be of use in the course, including:

- Harald Baayen's book: *Analyzing Linguistic Data. A Practical Introduction to Statistics*. Cambridge University Press. Available online here.
- Shravan Vasishth's book draft: The foundations of statistics: A simulation-based approach (free download)
- Keith Johnson's book on quantitative methods in linguistics ($40 on Amazon; no longer available as a free download)
- David MacKay's Information Theory, Inference, and Learning Algorithms – a great text available freely online
- Manning & Schuetze's Foundations of Statistical Natural Language Processing, available online through UC Libraries here.
- Jurafsky & Martin's Speech and Language Processing
- Christopher Bishop's Pattern Recognition and Machine Learning
- Gelman, Carlin, Stern, and Rubin's Bayesian Data Analysis
- John Rice's Mathematical Statistics and Data Analysis – a good general book for introductory statistics (mostly classical).
- Brian Roark and Richard Sproat's Computational Approaches to Morphology and Syntax for the Probabilistic Grammar Formalisms section

Finally, we may supplement these with additional readings, both from statistics texts and pertinent linguistics articles.

## 5 Syllabus

Week | Day | Topic | Textbook chapters | Other reading | Homework Assignments |
---|---|---|---|---|---|
Week 1 | 3 Apr | Brief review of probability theory & statistics | PMSL Chapters 2-4 | | Homework 1 (due 5 April) |
| 5 Apr | Complete review of probability & statistics | PMSL Chapter 5 | | Homework 2 (due 10 April) |
Week 2 | 10 Apr | Maximum Entropy models I | PMSL Chapter 6 | Berger et al., 1996 | |
| 12 Apr | Maximum Entropy models II | PMSL Chapter 6 | Hayes & Wilson, 2008 | Homework 4; example_megam_input_file.txt; word_suffixes |
Week 3 | 17 Apr | Maximum Entropy models III | | | |
| 19 Apr | Hierarchical regression models I | PMSL Chapter 8 | Baayen et al., 2008 | |
Week 4 | 24 Apr | Hierarchical regression models II | PMSL Chapter 8 | Jaeger 2008 | |
| 26 Apr | Hierarchical regression models III | PMSL Chapter 8 | Barr, Levy, Scheepers, & Tily, in revision | |
Week 5 | 1 May | Roger out of town, no class | | | |
| 3 May | Roger out of town, no class | | | |
Week 6 | 8 May | Latent-variable models I: mixtures of Gaussians | PMSL Chapter 9 | Vallabha et al., 2007; Feldman et al., 2009 | |
| 10 May | Latent-variable models II: latent Dirichlet allocation | PMSL Chapter 9 | Blei et al., 2003; Griffiths & Steyvers, 2004 | |
Week 7 | 15 May | Latent-variable models III | PMSL Chapter 9 | | Homework 5 |
| 17 May | Nonparametric models I: Dirichlet Process | PMSL Chapter 10 | Teh et al., 2006; Teh, 2006 | |
Week 8 | 22 May | Nonparametric models II: Hierarchical Dirichlet Process | PMSL Chapter 10 | Goldwater et al., 2009 | |
| 24 May | Probabilistic grammar formalisms I: Probabilistic Finite-State Machines | PMSL Chapter 11 | | |
Week 9 | 29 May | Probabilistic grammar formalisms II: Probabilistic Finite-State Machines cont'd | PMSL Chapter 11 | | |
| 31 May | Probabilistic grammar formalisms III: Probabilistic context-free grammars | PMSL Chapter 11 | Charniak, 1997 | |
Week 10 | 5 Jun | Probabilistic grammar formalisms IV: applications | PMSL Chapter 12; Probabilistic Earley Algorithm slides | Levy, 2008 | |
| 7 Jun | Probabilistic grammar formalisms V: Fragment grammars | TBD | O'Donnell et al., 2011 | |
Finals | 14 Jun | Final projects due! | | | |

## 6 Requirements

If you are taking the course for credit, there are four things expected of you:

- Regular attendance in class.
- Doing the assigned readings and coming ready to discuss them in class.
- Doing several homework assignments to be assigned throughout the quarter. Email submission of the homework assignments is encouraged, but *please* send them to lign252-homework@ling.ucsd.edu instead of to me directly. If you send them to me directly I may lose track of them. You can find some guidelines on writing good homework assignments here. The source file to this PDF is here.

- A final project which will involve computational modeling and/or data analysis in some area relevant to the course. Final project guidelines are here.

## 7 Mailing List

There is a mailing list for this class, lign252-l@mailman.ucsd.edu. Please make sure you're subscribed by filling out the form at https://mailman.ucsd.edu/mailman/listinfo/lign252-l. We'll use the list to communicate with each other.

## 8 Programming help

For this class I'll be maintaining an FAQ. Read the FAQ here.

I also run the R-lang mailing list. I suggest that you subscribe to it; it's a low-traffic list and is a good clearinghouse for technical and conceptual issues that arise in the statistical analysis of language data.

In addition, the searchable R mailing lists are likely to be useful.