| Instructor | Roger Levy (rlevy@ucsd.edu) |
| Office | Applied Physics & Math (AP&M) 4220 |
| Office hours | Tuesdays 2-3pm, Thursdays 10-11am |
| Class Time | TuTh 12:30-2pm |
| Class Location | AP&M 4301 |
| Class webpage | http://grammar.ucsd.edu/courses/lign251/ |
This course is about probabilistic approaches to language knowledge, acquisition, and use. Today, studying language from a probabilistic perspective requires mastery of the fundamentals of probability and statistics, as well as familiarity with more recent developments in probabilistic modeling. In this course we'll move quickly through basic probability theory, then cover fundamental ideas in statistics—parameter estimation and hypothesis testing. We'll then cover a fundamental class of probabilistic models—the linear model—which as a side effect will familiarize you with the most widely used tools in statistics: linear regression, analysis of variance (ANOVA), and generalized linear models (including logistic regression). We'll cover these topics using both frequentist methods (what you need to use in order to write publishable data analyses) and Bayesian methods (which are becoming increasingly popular in all sorts of settings, especially in cognitive modeling of language). We'll then move on to the more advanced topic of hierarchical (a.k.a. multilevel or mixed-effects) modeling, and perhaps even a bit of probabilistic grammars if we have a chance.
The course will involve a hands-on approach to data, and we'll be using the open-source R programming language (and a bit of JAGS, which interfaces nicely with R, for Bayesian modeling). You'll learn the basics of data visualization and statistical analysis in R, and the class will involve periodic programming practica to ensure that your R programming questions are adequately addressed. Transcripts of programming practica will also be put up online. I encourage you to download R here as soon as you can, get it running on your own computer, and go through the R tutorial found in Chapter 1 of Harald Baayen's new book, or this hands-on introduction to R. You can also download JAGS here.
The primary text for this course will be a book that I'm currently in the process of writing, Probabilistic Models in the Study of Language. The current draft is always available here. The goal is to go from the beginning through to the material in Chapter 8.
| Week | Day | Topic | Reading | Materials | Homework Assignments |
|---|---|---|---|---|---|
| Week 0 | 23 Sep | Introduction and motivating material; Fundamentals of probability theory: probability spaces, conditional probability | Chapter 2.1-2.5 | Introduction/Motivation Slides | HW 1 |
| Week 1 | 28 Sep | Bayes' rule; discrete random variables; the Bernoulli and multinomial distributions; cumulative distribution functions | | | |
| | 30 Sep | Continuous random variables; the uniform distribution; expectation and variance | Chapter 2.6-2.8 | | |
| Week 2 | 5 Oct | The normal distribution | Chapter 2.9 | | Homework 2; Peterson & Barney's vowel dataset |
| | 7 Oct | Estimating probability densities; first R practicum | Chapter 2.10 | | |
| Week 3 | 12 Oct | First R practicum | | R transcript | |
| | 14 Oct | Joint probability distributions; marginalization; linearity of the expectation; covariance | Chapter 3.1-3.4 | R transcript | |
| Week 4 | 19 Oct | Correlation; the binomial distribution; multivariate normal distributions | Chapter 3.5 | R transcript | Homework 3; brown-counts-lengths-nsyll file for HW 3 |
| | 21 Oct | Introductory parameter estimation; consistency, bias, and variance of estimators | Chapter 4.1-4.2 | | |
| Week 5 | 26 Oct | The method of maximum likelihood | Chapter 4.3 | | |
| | 28 Oct | Bayesian parameter estimation and density estimation | Chapter 4.4-4.5 | | Homework 4 |
| Week 6 | 2 Nov | Bayesian confidence intervals and hypothesis testing | Chapter 5.1-5.2 | | |
| | 4 Nov | Frequentist confidence intervals and hypothesis testing | Chapter 5.3-5.4 | | Homework 5; spillover_word_rts data file for Homework 5 |
| Week 7 | 9 Nov | Intro to generalized linear models: linear models | Chapter 6.1-6.2 | | |
| | 11 Nov | Veterans Day, no class | | | |
| Week 8 | 16 Nov | Linear models II | Chapter 6.3-6.5 | R transcript | |
| | 18 Nov | Linear models III | | R transcript | |
| Week 9 | 23 Nov | Analysis of Variance I | Chapter 6.6 | | |
| | 25 Nov | Thanksgiving vacation, no class | | | |
| Week 10 | 30 Nov | Analysis of Variance II | | | Homework 6; Problem 1 dataset; Problem 2 dataset |
| | 2 Dec | Logit models | Chapter 6 | | |
| Finals | 10 Dec | Final projects due! | | | |
If you are taking the course for credit, there are four things expected of you:
1. Regular attendance in class.
2. Doing the assigned readings and coming ready to discuss them in class.
3. Completing several homework assignments, which will be given out throughout the quarter. Email submission of homework is encouraged, but please send it to lign251-homework@ling.ucsd.edu rather than to me directly; if you send it to me directly, I may lose track of it.
You can find some guidelines on writing good homework assignments here. The source file for this PDF is here.
4. A final project which will involve the analysis and/or modeling of some dataset, either your own or a "standard" dataset that I will provide. Guidelines for the final project can be found here.
There is a mailing list for this class, lign251-l@mailman.ucsd.edu. Please subscribe by filling out the form at https://mailman.ucsd.edu/mailman/listinfo/lign251-l. We'll use the list to communicate with each other.
For this class I'll be maintaining an FAQ for our use of R. Read the FAQ here.
I also run the R-lang mailing list. I suggest that you subscribe to it; it's a low-traffic list and is a good clearinghouse for technical and conceptual issues that arise in the statistical analysis of language data.
In addition, the searchable R mailing lists are likely to be useful.
In addition to my own textbook, there are lots of new and useful books for statistics, both in linguistics and more generally. We may be making direct use of some of them. Also, it's always good to read about the same idea or method as described by multiple authors. Here are some sources:
Harald Baayen's textbook Analyzing Linguistic Data: A Practical Introduction to Statistics. Cambridge University Press. It's available online here (get this free while it's still online!), or get it on Amazon.
Keith Johnson's book on quantitative methods in linguistics ($40 on Amazon; no longer available as a free download)
Shravan Vasishth's book draft: The foundations of statistics: A simulation-based approach (free download)
John Rice's Mathematical Statistics and Data Analysis — a good general book for introductory statistics (mostly classical).
David MacKay's Information Theory, Inference, and Learning Algorithms (free download of a first-rate published book!!!)
Chapter 2 of Manning & Schütze's Foundations of Statistical Natural Language Processing