| | |
|---|---|
| Instructor | Roger Levy (rlevy@ling.ucsd.edu) |
| Office | Applied Physics & Math (AP&M) 4220 |
| Office hours | W 2-3:50pm |
| Class Time | MW 12:00-1:50pm (in general, Wednesdays 1-1:50pm will be practicum times) |
| Class Location | AP&M 4218 |
| Class webpage | http://grammar.ucsd.edu/courses/lign274/ |

This course is about computational approaches to problems in psycholinguistics, focusing on probabilistic approaches to language knowledge, acquisition, and use. Today, research in this area requires skill with probability and statistics, familiarity with formalisms from computational linguistics, the ability to use and develop new computational tools, and comfort with handling complex datasets. This course will involve hands-on skill building, covering several important topics in this area. We'll start with maximum-entropy models and hierarchical regression models, then move on to latent-variable models, including Latent Dirichlet Allocation ("topic" models) and the Dirichlet Process, and finish with weighted grammar formalisms, including probabilistic finite-state automata and probabilistic context-free grammars. We'll apply these techniques to both data analysis and modeling on a variety of problems and datasets. Both maximum-likelihood and Bayesian approaches will be covered.

We'll be using a variety of computational tools, including the open-source R programming language, the Bayesian graphical-modeling toolkit JAGS, packages for R that implement Latent Dirichlet Allocation and other hierarchical models, OpenFST for weighted finite-state machines, and my own implementations of an incremental parser for probabilistic context-free grammars and of general weighted finite-state automaton/context-free grammar intersection. Comfort with programming in a high-level language such as Python may also prove useful during the course.

Students should have an interest in, and some background in, studying language using quantitative and modeling techniques. You should also have some background in probability theory and/or statistics, and you should know how to program. Anyone who has taken my course Linguistics 251 (Probabilistic Methods in Linguistics) fulfills all these background prerequisites; if you haven't taken Linguistics 251 but are interested in taking the course, just talk to me.

The main reading material will be draft chapters of a textbook-in-progress that I am writing, **Probabilistic Models in the Study of Language** (abbreviated PMSL in the schedule). These draft chapters can be found here. There are also a number of other reference texts that may be of use in the course, including:

- Harald Baayen's book: **Analyzing Linguistic Data. A Practical Introduction to Statistics.** Cambridge University Press. Available online here.
- Shravan Vasishth's book draft: The foundations of statistics: A simulation-based approach (free download)
- Keith Johnson's book on quantitative methods in linguistics ($40 on Amazon; no longer available as a free download)
- David MacKay's Information Theory, Inference, and Learning Algorithms — a great text available freely online
- Manning & Schuetze's Foundations of Statistical Natural Language Processing, available online through UC Libraries here.
- Jurafsky & Martin's Speech and Language Processing
- Christopher Bishop's Pattern Recognition and Machine Learning
- Gelman, Carlin, Stern, and Rubin's Bayesian Data Analysis
- John Rice's Mathematical Statistics and Data Analysis — a good general book for introductory statistics (mostly classical).

Finally, we may supplement these with additional readings, both from statistics texts and pertinent linguistics articles.

| Week | Day | Topic & Reading | Textbook chapters | Other reading | Homework Assignments |
|---|---|---|---|---|---|
| Week 1 | 4 Jan | Brief review of probability theory & statistics | PMSL Chapters 2-5 | | |
| | 6 Jan | Roger out of town for LSA Annual Meeting, no class | | | |
| Week 2 | 11 Jan | Complete review of probability & statistics | PMSL Chapter 5 | | |
| | 13 Jan | Maximum Entropy models I | PMSL Chapter 6 | Berger et al., 1996 | |
| Week 3 | 18 Jan | Martin Luther King Day, no class | | | |
| | 20 Jan | Maximum Entropy models II | PMSL Chapter 6 | Hayes & Wilson, 2008 | Homework 1 |
| Week 4 | 25 Jan | Hierarchical regression models I | PMSL Chapter 8 | Baayen et al., 2008 | |
| | 27 Jan | Hierarchical regression models II | PMSL Chapter 8 | | |
| Week 5 | 1 Feb | Hierarchical regression models III | PMSL Chapter 8 | | |
| | 3 Feb | Latent-variable models I: mixtures of Gaussians | PMSL Chapter 9 | Vallabha et al., 2007 | |
| Week 6 | 8 Feb | Latent-variable models II: latent Dirichlet allocation | PMSL Chapter 9 | Blei et al., 2003; Griffiths & Steyvers, 2004 | Homework 2 |
| | 10 Feb | Latent-variable models III | PMSL Chapter 9 | Final project guidelines | |
| Week 7 | 15 Feb | President's Day, no class | | | |
| | 17 Feb | Nonparametric models I: Dirichlet Process | PMSL Chapter 10 | | |
| Week 8 | 22 Feb | Nonparametric models II: Dirichlet Process, cont'd | PMSL Chapter 10 | Teh et al., 2006 | |
| | 24 Feb | Nonparametric models III: Hierarchical Dirichlet Processes | PMSL Chapter 10 | Goldwater et al., 2009 | |
| Week 9 | 1 Mar | Probabilistic grammar formalisms I: Probabilistic Finite-State Machines | PMSL Chapter 11 | | |
| | 3 Mar | Probabilistic grammar formalisms II: Probabilistic Finite-State Machines cont'd | PMSL Chapter 11 | | |
| Week 10 | 8 Mar | Probabilistic grammar formalisms III: Probabilistic context-free grammars | PMSL Chapter 11 | Charniak, 1997 | |
| | 10 Mar | Probabilistic grammar formalisms IV: applications | PMSL Chapter 9; Probabilistic Earley Algorithm slides | Levy, 2008 | |
| Finals | 19 Mar | Final projects due! | | | |

If you are taking the course for credit, there are four things expected of you:

1. Regular attendance in class.

2. Doing the assigned readings and coming ready to discuss them in class.

3. Doing several homework assignments, to be assigned throughout the quarter. Email submission of the homework assignments is encouraged, but **please** send them to lign274-homework@ling.ucsd.edu instead of to me directly. If you send them to me directly I may lose track of them.

You can find some guidelines on writing good homework assignments here. The source file to this PDF is here.

4. A final project which will involve computational modeling and/or data analysis in some area relevant to the course.

There is a mailing list for this class, lign274@ling.ucsd.edu. Please subscribe to it by filling out the form at http://pidgin.ucsd.edu/mailman/listinfo/lign274. We'll use the list to communicate with each other.

For this class I'll be maintaining an FAQ. Read the FAQ here.

I also run the R-lang mailing list. I suggest that you subscribe to it; it's a low-traffic list and is a good clearinghouse for technical and conceptual issues that arise in the statistical analysis of language data.

In addition, the searchable R mailing lists are likely to be useful.