Linguistics 165: Computational Linguistics (Fall 2010)

Course information

Lecture Times: TuTh 6:30pm-8:00pm
Lecture Location: AP&M Room 3402
Class webpage: http://grammar.ucsd.edu/courses/lign165/
Final Exam: Tu 7 Dec 7pm-10pm

Instructor information

Instructor: Roger Levy (rlevy@ucsd.edu)
Instructor's office: AP&M Room 4220
Instructor's office hours: Tuesdays 2pm-3pm, Thursdays 10-11am
Teaching Assistants (TAs): Gwen Gillingham (ggilling@ucsd.edu)
TA's office hours: Wed 10am-12pm
TA's office: AP&M Room 3331B

Course Description

Computational linguistics is the study of how to get computers to do useful things with human language (a.k.a. "natural language"). Linguistics 165 is a brief introduction to this rich and exciting field, and includes coverage of topics such as:

During the course you'll both learn the theory underlying these topics and gain hands-on experience using computational tools and programming computers to perform some of the tasks described above. Class meetings will be a mixture of lecture and programming practicum sessions. Homework assignments and final projects will involve both written and programming components.

Course organization

The twice-weekly course meetings will be a combination of lecture and hands-on programming practice. Interrupting (politely!) to ask questions is highly encouraged.

Intended Audience

Upper-division students interested in how to get computers to do useful things with language. Computational linguistics is a highly interdisciplinary field, and students from a variety of majors, including (but not limited to) linguistics, computer science, cognitive science, psychology, and electrical engineering, are likely to find the content interesting. There are no prerequisites for the course, but there will be quite a bit of computer programming involved, and you will find the course easier if you have taken a programming course (e.g., CSE 5A, 8A/B, or 11). You'll also find the course easier if you've taken introductory linguistics (LIGN 101), and potentially syntax as well (LIGN 121).

Course objectives

Computational linguistics is a rich and deep field, and we can only scratch the surface in a single quarter. By the end of this course you will have a sense of the flavor of some of the major subfields, you will be familiar with some of the key models and techniques for processing human language by computer, and you will have had practical experience working with these models and techniques on a computer.

Textbook

The following textbook is required for the class:

Jurafsky, Daniel, and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall. Second edition.

This textbook is the single most comprehensive and up-to-date introduction available to the field of computational linguistics. It is expensive ($117.45 on Amazon.com), but it is worth it. Please do get the second edition, not the first edition.

We will also be doing some of our programming in the Python programming language, and will make quite a bit of use of the Natural Language Toolkit (or NLTK) for Python. There is a book for NLTK:

Bird, Steven, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.

You can buy this book, or you can freely access it on the Web at http://www.nltk.org/book.

Labwork

This class is being held in AP&M 3432, a computer lab in the Linguistics Department's Library, and we use the computers in the lab in conjunction with UCSD's instructional Linux servers to do practical programming work for the class. In addition to classroom time, you're welcome to come in to the departmental Library and use either the lab computers in 3432 when the room is not occupied with a class, or the regular computers in the library. Additionally, by installing an X server (e.g., Xming for Windows or X11 for Mac) and a secure shell program (e.g., PuTTY for Windows or ssh for Mac) onto your own computer, you can set yourself up to do computer work for the class on your own computer.

WebCT

We will be using WebCT for administering homework assignments and various surveys, and as a discussion forum for all participants in the class. Most of you should be familiar with WebCT from another class; if you aren't, take a look at http://iwdc.ucsd.edu/docs/step1_webct_fa07.pdf.

Discussion boards

There will be discussion boards on the course WebCT site for the major topics covered in this class. If you have a question about course content that may be relevant to other students in the course, we strongly encourage you to post it to the WebCT discussion board for this class. We encourage you to read the discussion boards regularly, and if you know the answer to a question, to post the answer! Active, positive contributions to the discussion boards will be given favorable consideration in determining final grades.

Syllabus (subject to modification)

| Week | Day | Topic | Reading | In-class materials | Homework Assignments |
|---|---|---|---|---|---|
| Week 0 | 23 Sep | Class Introduction: what is computational linguistics? Admin. Getting started with Unix, Python, and NLTK. Word frequencies and word frequency distributions. | SLP Chapter 1, NLTK Chapter 1 | Introduction: what is computational linguistics? | Beginning of Class Survey; Homework 1 |
| Week 1 | 28 Sep | More Python; word concordances. | | | |
| | 30 Sep | Word frequencies. Collocations. | NLTK Chapter 2 | | Homework 2 |
| Week 2 | 5 Oct | Basic probability theory. Joint and conditional probability. Counting collocations and estimating probabilities in Python. | | | |
| | 7 Oct | Class catch-up | | | |
| Week 3 | 12 Oct | Elementary text classification with Naive Bayes | NLTK Chapter 6.5, MRS Chapter 13 | | |
| | 14 Oct | Regular expressions and finite-state automata. | SLP Chapter 2, NLTK Chapter 3 | | Homework 3 |
| Week 4 | 19 Oct | Basic morphology. Getting started with OpenFST. | SLP Chapter 3 (and start working through NLTK Chapter 4 on your own) | | |
| | 21 Oct | Finite-state transducers. | | | |
| Week 5 | 26 Oct | Review. | | | |
| | 28 Oct | Finite-state transducers, continued. | | | |
| Week 6 | 2 Nov | Midterm Exam! | | | |
| | 4 Nov | Syntax and context-free grammars. | SLP Chapter 12.1-12.3, Chapter 16; NLTK Chapter 8.1-8.3 | | |
| Week 7 | 9 Nov | Syntax and context-free grammars, continued. Searching trees with Tregex. | SLP Chapter 12.4-12.9, Chapter 13; NLTK Chapter 8.4-8.5 | | Homework 4 |
| | 11 Nov | Veterans Day, no class | | | |
| Week 8 | 16 Nov | Probabilistic context-free grammars. | SLP Chapter 14.1, 14.3, NLTK Chapter 8.6 | | |
| | 18 Nov | Probabilistic context-free grammars, continued. Parsing with PCFGs and evaluating the results. Improving probabilistic parsers. | SLP Chapter 14.2, 14.4, 14.5, 14.7 | | |
| Week 9 | 23 Nov | Improving PCFGs, continued | | | Homework 5 |
| | 25 Nov | Thanksgiving Day, no class | | | |
| Week 10 | 30 Nov | Computational psycholinguistics: modeling human language understanding | SLP Chapter 14.10 | | |
| | 2 Dec | Course wrapup | | | |
| Finals | 7 Dec | Final exam! | | | |

Instructor contact policy

Coming to talk to the instructor or TA during their office hours is highly encouraged. Electronic communications about course content should be made through the WebCT discussion board (see above). We ask that you use email contact only for communications that are not relevant to other students (e.g., specific learning circumstances or medical/personal emergency).

Academic Integrity

Please take some time to read the UCSD Policy on Integrity of Scholarship. We will be conducting this course in full accordance with this policy. In particular, any suspected cheating or plagiarism in the course will be taken very seriously and investigated. If we determine that cheating or plagiarism has taken place, it will be reported to UCSD's Office of the Academic Integrity Coordinator, in accordance with UCSD policy. Please note that it is not at our discretion whether or not to report instances of academic dishonesty: we are obligated by UCSD policy to report such instances.

Examples of academic integrity violations

Here are some examples of academic integrity violations. DO NOT DO THESE!!!

This is not an exhaustive list — please read the UCSD Policy on Integrity of Scholarship and use your common sense!

Requirements & grading

Your grade will be based on five criteria:

1. Homework assignments: there will be a number of homework assignments (somewhere between five and eleven) throughout the quarter; they will be worth 50% of your grade. You will be allowed to drop the assignment for which you receive the lowest score; we'll average the rest of the assignments to determine your overall homework grade.

You can turn in homework assignments either as physical copies in class or by e-mail to lign165-homework@ling.ucsd.edu. Please don't send homework directly to me or to Gwen, as it will be much more likely to get lost in the shuffle.

2. Midterm exam: the midterm exam on November 2 is worth 15% of your grade. This will be a pen-and-paper exam; you won't have to implement anything on a computer, though you may have to write out a sketch of a computer program or two.

3. Final exam: the final exam on December 7 is worth 25% of your grade. This will be a pen-and-paper exam; you won't have to implement anything on a computer, though you may have to write out a sketch of a computer program or two (or more).

4. In-class participation: actively contributing to class in a positive way, through asking questions and/or responding to questions I pose, is an important component of your work in the class. This is worth 5% of your grade.

5. One of the two following options (5% of your grade; no extra credit for doing both!):

a. Participation in 4 hours of the Human Subject Pool (http://experimetrix2.com/ucsd/): each hour of participation counts as 1% of your grade, plus a 1% bonus for participation in all four hours. You are encouraged to participate in language-related experiments, and to participate in these experiments early—the last day for participation is 3 December, and there is no guarantee that there will be experiment slots open for participation in the latest part of the quarter.

b. Writing a research paper (1000-1500 words) on some topic covered in the class. The due date for such a paper is November 30, and no late papers will be accepted. If you choose this option, you must discuss your research topic with Professor Levy or his teaching assistant, Gwen Gillingham, and get an OK on it by November 10, before writing the paper and turning it in. We will not accept papers whose topic has not been cleared in advance.

In addition, positive participation on WebCT forums (including asking well-thought-out questions and/or answering other students' questions) is looked on favorably: it increases the chance that your final grade may get bumped up a notch, if it is borderline.
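
If you'd like to see how the five criteria above combine, here is a minimal Python sketch of the arithmetic. It is an illustration only, not the official grade calculation: the function names are made up for this example, every score is assumed to be on a 0-100 scale, and the credit for option 5 is assumed to be passed in directly as points out of 5. Borderline adjustments for forum participation are not modeled.

```python
def homework_average(scores):
    """Average the homework scores (0-100) after dropping the single lowest one."""
    if len(scores) < 2:
        raise ValueError("need at least two assignments to drop one")
    kept = sorted(scores)[1:]  # discard the lowest score
    return sum(kept) / len(kept)


def subject_pool_credit(hours):
    """Option 5a: 1 point per hour (at most 4), plus a 1-point bonus for all four."""
    hours = max(0, min(int(hours), 4))
    return hours + (1 if hours == 4 else 0)


def course_grade(homework_scores, midterm, final, participation, option_points):
    """Weighted total out of 100. option_points is the option-5 credit (0-5)."""
    return (0.50 * homework_average(homework_scores)
            + 0.15 * midterm
            + 0.25 * final
            + 0.05 * participation
            + option_points)


if __name__ == "__main__":
    # Hypothetical student: three homeworks, four hours in the subject pool.
    print(course_grade([60, 80, 100], midterm=80, final=90,
                       participation=100,
                       option_points=subject_pool_credit(4)))
```

In this sketch the lowest homework score (60) is dropped before averaging, so it does not affect the total at all.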

Homework grading policy

Homework assignments may be turned in up to six days late, but they will be downgraded 10% per day. Furthermore, nothing may be turned in after December 7.

Exceptions to the late policy will only be granted for medical or personal emergencies, and the instructor or his TA must be notified as soon as possible (not several days after the emergency is over).
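
The late policy can be sketched in a few lines of Python as well. This is one reading of the policy, not an official calculator: it assumes that "10% per day" means 10% of the earned score is deducted for each calendar day late, and that work which is more than six days late or submitted after December 7 earns nothing. The function name and the `LAST_DAY` constant are made up for this example, and emergency exceptions are not modeled.

```python
from datetime import date

# Assumed hard cutoff: nothing may be turned in after December 7 (Fall 2010).
LAST_DAY = date(2010, 12, 7)


def late_adjusted_score(raw_score, due, submitted):
    """Apply the assumed late penalty to a raw homework score."""
    days_late = max(0, (submitted - due).days)
    if days_late > 6 or submitted > LAST_DAY:
        return 0.0  # too late to be accepted
    # Deduct 10% of the earned score per day late.
    return raw_score * (100 - 10 * days_late) / 100


if __name__ == "__main__":
    # A hypothetical homework due 14 Oct and turned in two days late.
    print(late_adjusted_score(90, date(2010, 10, 14), date(2010, 10, 16)))
```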

Regrading/correction policy

We all make mistakes—TAs and professors as well as students—so please do look over your returned work. In addition to helping ensure that you get the credit you deserve, this checking will improve your retention of the material. However, there is a statute of limitations: all grading mistakes must be brought to our attention within one week of our returning the work. This prevents us from getting a backlog of corrections at the end of the quarter, which would interfere with the time-consuming activities of preparing lectures and grading. Thank you in advance for your cooperation!