| Instructor | Roger Levy (rlevy@ling.ucsd.edu) |
| Office | Applied Physics & Math (AP&M) 4220 |
| Office hours | M 2-3pm |
| R Programming session | W 2-3pm (AP&M 4301) |
| Class Time | MW 12pm-1:30pm |
| Class Location | AP&M 4301 |
| Class webpage | http://ling.ucsd.edu/courses/lign251/ |
This course is about probabilistic approaches to language knowledge, acquisition, and use. Today, studying language from a probabilistic perspective requires mastery of the fundamentals of probability and statistics, as well as familiarity with more recent developments in probabilistic modeling. In this course we'll move quickly through basic probability theory, then cover fundamental ideas in statistics: parameter estimation and hypothesis testing. We'll then cover a fundamental class of probabilistic models, the linear model, which as a side effect will familiarize you with the most widely used tools in statistics: linear regression, analysis of variance (ANOVA), and generalized linear models (including logistic regression). We'll then move on to more advanced topics, including elementary Bayesian learning & inference, and multilevel (mixed-effects) modeling.
The course will involve a hands-on approach to data, and we'll be using the open-source R programming language. You'll learn the basics of data visualization and statistical analysis in R. I encourage you to download R here as soon as you can, get it running on your own computer, and go through the R tutorial found in Chapter 1 of Harald Baayen's new book, or this hands-on introduction to R.
I'll supply you with lecture notes on a regular basis. In addition, we'll be making heavy use of Harald Baayen's in-press book, Analyzing Linguistic Data: A Practical Introduction to Statistics (Cambridge University Press). It's available online here. Finally, we'll supplement these with additional readings, both from statistics texts and pertinent linguistics articles.
Now that the course is over, here is a link to all the lectures from the course in one big PDF file.
| Week | Day | Topic & Reading | Materials | Homework Assignments |
|---|---|---|---|---|
| Week 1 | 1-3 Oct | Introduction to probability theory & discrete distributions | | |
| | 1 Oct | Definitions and axioms; basic data visualization & empirical distributions | Lecture notes; Baayen chapter 1 + 2.1 | Homework 1 |
| | 3 Oct | Counting methods; binomial & Poisson distributions | Lecture notes; Baayen 3.1 + 3.2 | |
| Week 2 | 8-10 Oct | Continuous distributions, expected values, and joint distributions | | |
| | 8 Oct | The normal distribution; mean and variance | Lecture notes; Baayen 3.3 + 3.3.1 | Homework 2 data file |
| | 10 Oct | Joint distributions & visualising them; covariance and correlation | Lecture notes; Baayen 2.3 | |
| Week 3 | 15-17 Oct | Parameter Estimation, frequentist and Bayesian | | |
| | 15 Oct | Parameter estimation, the method of maximum likelihood, and statistical bias | Lecture notes | Homework 3 |
| | 17 Oct | Bayesian parameter estimation and confidence intervals | Lecture notes; Baayen 4.1, 4.2; MacKay chapter 3, esp. section 3.2 (optional) | |
| Week 4 | 22-24 Oct | Campus closed due to wildfires, no class | | |
| | 22 Oct | Campus closed | | |
| | 24 Oct | Campus closed | | |
| Week 5 | 29-31 Oct | Confidence intervals and frequentist hypothesis testing | | |
| | 29 Oct | The chi-squared and t distributions; frequentist confidence intervals | Lecture notes; Baayen 3.3.2; Rice 9.1, 9.2; MacKay section 23.2 (optional, for the t distribution) | |
| | 31 Oct | The Neyman-Pearson Paradigm and frequentist hypothesis testing; contingency tables; Fisher's exact test; chi-squared and likelihood-ratio tests | Lecture notes; Baayen 4.3 | Homework 4 supplementary data file for problem 3c |
| Week 6 | 5-7 Nov | Bayesian hypothesis testing; linear models | | |
| | 5 Nov | Bayesian hypothesis testing | Lecture notes; Griffiths & Yuille 2006, pages 1-5 | |
| | 7 Nov | Linear models I: Linear regression | Lecture notes; Baayen 4.3.2, 6.1, 6.2 | Programming session log & notes; Sample project outlines |
| Week 7 | 12 Nov | Veterans Day Holiday, no class | | |
| | 14 Nov | Linear models II: Analysis of variance (ANOVA); application of linear models to magnitude estimation | Lecture notes; Bard, Ellen Gurman, Robertson, Dan, and Sorace, Antonella. 1996. Magnitude Estimation of Linguistic Acceptability. Language 72: 32-68. | Programming session log |
| Week 8 | 19-21 Nov | Finish linear models, start generalized linear models | | |
| | 19 Nov | Linear models III: Analysis of variance in the real world | Lecture notes | |
| | 21 Nov | Introduction to generalized linear models & logistic regression | Lecture notes; Baayen 6.3 | Homework 5; Programming session: how to construct barplots with standard error lines in R |
| Week 9 | 26-28 Nov | Multilevel (mixed-effects) modeling | | |
| | 26 Nov | Introduction to multilevel modeling | Lecture notes; Baayen 7.1; E. Judith Weiner and William Labov. 1983. Constraints on the agentless passive. Journal of Linguistics 19: 29-58. | |
| | 28 Nov | Multilevel modeling continued | Lecture notes; Baayen 7.2.3, 7.4; Bresnan et al., 2007 | Homework 6 (due 5 Dec); Supplementary data file for Homework 6 |
| Week 10 | 3-5 Dec | Regression techniques; course wrap-up | | |
| | 3 Dec | Splines and breakpoints in regression | Lecture notes; Baayen 6.2.1, 6.4 | |
| | 5 Dec | Course wrap-up & review of key points | | |
If you are taking the course for credit, there are four things expected of you:
1. Regular attendance in class.
2. Doing the assigned readings and coming ready to discuss them in class.
3. Doing several homework assignments to be assigned throughout the quarter. Email submission of the homework assignments is encouraged, but please send them to lign251-homework@ling.ucsd.edu instead of to me directly. If you send them to me directly, I may lose track of them.
4. A final project which will involve the analysis and/or modeling of some dataset, either your own or a "standard" dataset that I will provide.
There is a mailing list for this class, lign251@ling.ucsd.edu. Please subscribe to the mailing list by filling out the form at http://pidgin.ucsd.edu/mailman/listinfo/lign251. We'll use it to communicate with each other.
For this class I'll be maintaining an FAQ for our use of R. Read the FAQ here.
In addition, the searchable R mailing lists might be useful.
There are lots of new and useful books for statistics, both in linguistics and more generally. We'll be making direct use of some of them. Also, it's always good to read about the same idea or method as described by multiple authors. Here are some sources:
Harald Baayen's textbook — get this while it's still online!
Keith Johnson's book on quantitative methods in linguistics (free download)
Shravan Vasishth's book draft: The foundations of statistics: A simulation-based approach (free download)
John Rice's Mathematical Statistics and Data Analysis — a good general book for introductory statistics (mostly classical).
David MacKay's Information Theory, Inference, and Learning Algorithms (free download of a first-rate published book!!!)
Chapter 2 of Manning & Schütze's Foundations of Statistical Natural Language Processing