Instructor | Roger Levy (rlevy@ling.ucsd.edu) |

Office | Applied Physics & Math (AP&M) 4220 |

Office hours | Th 2:15pm-4:15pm |

Class Time | TuTh 5:30pm-7pm |

Class Location | AP&M 2452 |

R Programming session | Th 7-8pm, AP&M 2452 |

Class webpage | http://grammar.ucsd.edu/courses/lign251/ |

This course is about probabilistic approaches to language knowledge,
acquisition, and use. Today, studying language from a probabilistic
perspective requires mastery of the fundamentals of probability and
statistics, as well as familiarity with more recent developments in
probabilistic modeling. In this course we'll move quickly through
basic probability theory, then cover fundamental ideas in
statistics: parameter estimation and hypothesis testing. We'll then
cover a fundamental class of probabilistic models, the *linear
model*, which as a side effect will familiarize you with the most
widely used tools in statistics: linear regression, analysis of
variance (ANOVA), and generalized linear models (including logistic
regression). We'll then move on to more advanced topics, including
elementary Bayesian learning & inference, multilevel (mixed-effects)
modeling, and a little bit of probabilistic grammars if we have a
chance.

The course will involve a hands-on approach to data, and we'll be using the open-source R programming language (and a bit of JAGS, which interfaces nicely with R, for Bayesian modeling). You'll learn the basics of data visualization and statistical analysis in R. I encourage you to download R here as soon as you can, get it running on your own computer, and go through the R tutorial found in Chapter 1 of Harald Baayen's new book, or this hands-on introduction to R. You can also download JAGS here.

I'll supply you with lecture notes/draft book chapters on a regular
basis. In addition, we'll be making some use of Harald Baayen's
in-press book: **Analyzing Linguistic Data. A Practical Introduction to
Statistics.** Cambridge University Press. It's available online here.
Finally, we'll supplement these with additional readings, both from
statistics texts and pertinent linguistics articles.

The text for this course is a continually evolving set of extensive lecture notes/book chapters. You can always find the current version here. You can also look at the complete lecture notes from the time I taught the course in Fall 2007 here.

Week | Day | Topic & Reading | Materials | Homework Assignments |
---|---|---|---|---|
Week 0 | 25 Sep | Introduction and motivating material | Lecture notes; Baayen chapter 1 + 2.1 | Homework 1 |
Week 1 | 30 Sep | Fundamentals of probability theory: probability spaces, discrete and continuous random variables; the Bernoulli and uniform distributions | Lecture notes; Baayen 3.1 + 3.2 | |
Week 1 | 2 Oct | Expectation and variance; joint probability distributions; marginalization; covariance and correlation; basic density estimation | Lecture notes; Baayen 3.3 + 3.3.1 | Transcript of R programming practicum number 1 |
Week 2 | 7 Oct | Kernel density estimation; the binomial distribution | Lecture notes; Baayen 2.3 | |
Week 2 | 9 Oct | The normal distribution; intro to parameter estimation | Lecture notes | Homework 2; Brown corpus data file for Homework 2; R programming practicum 2 transcript |
Week 3 | 14 Oct | The method of maximum likelihood; consistency and bias; the beta distribution | | |
Week 3 | 16 Oct | Bayesian prediction, parameter estimation, confidence intervals | R programming practicum 3 transcript | |
Week 4 | 21 Oct | Bayesian hypothesis testing | | |
Week 4 | 23 Oct | Frequentist confidence intervals and hypothesis testing | Finish Chapter 4 | Homework 3; Spillover word RT file; R programming practicum 4 transcript |
Week 5 | 28 Oct | Roger out of town, no class | | |
Week 5 | 30 Oct | Intro to generalized linear models | R programming practicum 5 transcript | |
Week 6 | 4 Nov | Linear models | | |
Week 6 | 6 Nov | Analysis of variance | R programming practicum 6 transcript | |
Week 7 | 11 Nov | Veterans Day, no class | | Homework 4 |
Week 7 | 13 Nov | Generalized linear models again; logistic regression | Finish Chapter 4 | R programming practicum 6 transcript |
Week 8 | 18 Nov | Hierarchical models I | Start Chapter 6 | |
Week 8 | 20 Nov | Hierarchical models II | Finish Chapter 6 | |
Week 9 | 25 Nov | Hierarchical models III | Chapter 7 | Homework 5; Dataset 1 for homework 5; Dataset 2 for homework 5 |
Week 9 | 27 Nov | Thanksgiving vacation, no class | | |
Week 10 | 2 Dec | Hierarchical models IV: logit models | | |
Week 10 | 4 Dec | Roger out of town, no class (we may schedule a make-up class) | | One-page (double-sided) list of main concepts to remember from class! |
Finals | 12 Dec | Final projects due! | | |

If you are taking the course for credit, there are four things expected of you:

1. Regular attendance in class.

2. Doing the assigned readings and coming ready to discuss them in class.

3. Doing several homework assignments to be assigned throughout the
quarter. Email submission of the homework assignments is encouraged,
but **please** send it to lign251-homework@ling.ucsd.edu instead of
to me directly. If you send it to me directly I may lose track of it.

You can find some guidelines on writing good homework assignments here. The source file to this PDF is here.

4. A final project which will involve the analysis and/or modeling of some dataset, either your own or a "standard" dataset that I will provide. Guidelines for the final project can be found here.

There is a mailing list for this class, lign251@ling.ucsd.edu. Please subscribe to the mailing list by filling out the form at http://pidgin.ucsd.edu/mailman/listinfo/lign251! We'll use it to communicate with each other.

For this class I'll be maintaining an FAQ for our use of R. Read the FAQ here.

I also run the R-lang mailing list. I suggest that you subscribe to it; it's a low-traffic list and is a good clearinghouse for technical and conceptual issues that arise in the statistical analysis of language data.

In addition, the searchable R mailing lists are likely to be useful.

There are lots of new and useful books for statistics, both in linguistics and more generally. We'll be making direct use of some of them. Also, it's always good to read about the same idea or method as described by multiple authors. Here are some sources:

Harald Baayen's textbook — get this free while it's still online! Or get it on Amazon.

Keith Johnson's book on quantitative methods in linguistics ($40 on Amazon; no longer available as a free download)

Shravan Vasishth's book draft: The foundations of statistics: A simulation-based approach (free download)

John Rice's Mathematical Statistics and Data Analysis — a good general book for introductory statistics (mostly classical).

David MacKay's Information Theory, Inference, and Learning Algorithms (free download of a first-rate published book!!!)

Chapter 2 of Manning & Schuetze