| Lecture Times | TuTh 6:30pm-8:00pm |
| Lecture Location | AP&M Room 3402 |
| Class webpage | http://grammar.ucsd.edu/courses/lign165/ |
| Final Exam | Tu 7 Dec 7pm-10pm |
| Instructor | Roger Levy (rlevy@ucsd.edu) |
| Instructor's office | AP&M Room 4220 |
| Instructor's office hours | Tuesdays 2pm-3pm, Thursdays 10-11am |
| Teaching Assistants (TAs) | Gwen Gillingham (ggilling@ucsd.edu) |
| TA's office hours | Wed 10am-12pm |
| TA's office | AP&M Room 3331B |
Computational linguistics is the study of how to get computers to do useful things with human language (a.k.a. "natural language"). Linguistics 165 is a brief introduction to this rich and exciting field, and includes coverage of topics such as:
During the course you'll both learn the theory underlying these topics and gain hands-on experience using computational tools and programming computers to perform some of the tasks described above. Class meetings will be a mixture of lecture and programming practicum sessions. Homework assignments and final projects will involve both written and programming components.
The twice-weekly course meetings will be a combination of lecture and hands-on programming practice. Interrupting (politely!) to ask questions is highly encouraged.
This course is intended for upper-division students interested in how to get computers to do useful things with language. Computational linguistics is a highly interdisciplinary field, and students with a variety of majors, including (but not limited to) linguistics, computer science, cognitive science, psychology, and electrical engineering, are likely to find the content interesting. There are no prerequisites for the course, but there will be quite a bit of computer programming involved, and you will find the course easier if you have taken a programming course (e.g., CSE 5A, 8A/B, or 11). You'll also find the course easier if you've taken introductory linguistics (LIGN 101), and potentially syntax as well (LIGN 121).
Computational linguistics is a rich and deep field, and we can only scratch the surface in a single quarter. By the end of this course you will have a sense of the flavor of some of the major subfields, you will be familiar with some of the key models and techniques for processing human language by computer, and you will have had practical experience working with these models and techniques on a computer.
The following textbook is required for the class:
Jurafsky, Daniel, and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall. Second edition.
This textbook is the single most comprehensive and up-to-date introduction available to the field of computational linguistics. It is expensive ($117.45 on Amazon.com), but it is worth it. Please do get the second edition, not the first edition.
We will also be doing some of our programming in the Python programming language, and will make quite a bit of use of the Natural Language Toolkit (or NLTK) for Python. There is a book for NLTK:
Bird, Steven, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media.
You can buy this book, or you can freely access it on the Web at http://www.nltk.org/book.
This class is being held in AP&M 3432, a computer lab in the Linguistics Department's Library, and we will use the computers in the lab in conjunction with UCSD's instructional Linux servers to do practical programming work for the class. Outside of classroom time, you're welcome to come in to the departmental Library and use either the lab computers in 3432, when the room is not occupied with a class, or the regular computers in the library. Additionally, by installing an X server (e.g., Xming for Windows or X11 for Mac) and a secure shell program (e.g., PuTTY for Windows or ssh for Mac) on your own computer, you can set yourself up to do work for the class from home.
We will be using WebCT for administering homework assignments and various surveys, and as a discussion forum for all participants in the class. Most of you should be familiar with WebCT from another class; if you aren't, take a look at http://iwdc.ucsd.edu/docs/step1_webct_fa07.pdf.
There will be discussion boards on the course WebCT site for the major topics covered in this class. If you have a question about course content that may be relevant to other students in the course, we strongly encourage you to post it to the WebCT discussion board for this class. We encourage you to read the discussion boards regularly, and if you know the answer to a question, to post the answer! Active, positive contributions to the discussion boards will be given favorable consideration in determining final grades.
| Week | Day | Topic | Reading | In-class materials | Homework Assignments |
|---|---|---|---|---|---|
| Week 0 | 23 Sep | Class Introduction: what is computational linguistics? Admin. Getting started with Unix, Python, and NLTK. Word frequencies and word frequency distributions. | SLP Chapter 1, NLTK Chapter 1 | Introduction: what is computational linguistics? | Beginning of Class Survey; Homework 1 |
| Week 1 | 28 Sep | More Python; word concordances. | | | |
| | 30 Sep | Word frequencies. Collocations. | NLTK Chapter 2 | | Homework 2 |
| Week 2 | 5 Oct | Basic probability theory. Joint and conditional probability. Counting collocations and estimating probabilities in Python. | | | |
| | 7 Oct | Class catch-up | | | |
| Week 3 | 12 Oct | Elementary text classification with Naive Bayes | NLTK Chapter 6.5, MRS Chapter 13 | | |
| | 14 Oct | Regular expressions and finite-state automata. | SLP Chapter 2, NLTK Chapter 3 | | Homework 3 |
| Week 4 | 19 Oct | Basic morphology. Getting started with OpenFST. | SLP Chapter 3 (and start working through NLTK Chapter 4 on your own) | | |
| | 21 Oct | Finite-state transducers. | | | |
| Week 5 | 26 Oct | Review. | | | |
| | 28 Oct | Finite-state transducers, continued. | | | |
| Week 6 | 2 Nov | Midterm Exam! | | | |
| | 4 Nov | Syntax and context-free grammars. | SLP Chapter 12.1-12.3, Chapter 16; NLTK Chapter 8.1-8.3 | | |
| Week 7 | 9 Nov | Syntax and context-free grammars, continued. Searching trees with Tregex. | SLP Chapter 12.4-12.9, Chapter 13; NLTK Chapter 8.4-8.5 | | Homework 4 |
| | 11 Nov | Veterans Day, no class | | | |
| Week 8 | 16 Nov | Probabilistic context-free grammars. | SLP Chapter 14.1, 14.3, NLTK Chapter 8.6 | | |
| | 18 Nov | Probabilistic context-free grammars, continued. Parsing with PCFGs and evaluating the results. Improving probabilistic parsers. | SLP Chapter 14.2, 14.4, 14.5, 14.7 | | |
| Week 9 | 23 Nov | Improving PCFGs, continued | | | Homework 5 |
| | 25 Nov | Thanksgiving Day, no class | | | |
| Week 10 | 30 Nov | Computational psycholinguistics: modeling human language understanding | SLP Chapter 14.10 | | |
| | 2 Dec | Course wrapup | | | |
| Finals | 7 Dec | Final exam! | | | |
Coming to talk to the instructor or TA during their office hours is highly encouraged. Electronic communications about course content should be made through the WebCT discussion board (see above). We ask that you use email contact only for communications that are not relevant to other students (e.g., specific learning circumstances or medical/personal emergency).
Please take some time to read the UCSD Policy on Integrity of Scholarship. We will be conducting this course in full accordance with this policy. In particular, any suspected cheating or plagiarism in the course will be taken very seriously and investigated. If we determine that cheating or plagiarism has taken place, it will be reported to UCSD's Office of the Academic Integrity Coordinator, in accordance with UCSD policy. Please note that it is not at our discretion whether or not to report instances of academic dishonesty: we are obligated by UCSD policy to report such instances.
Here are some examples of academic integrity violations. DO NOT DO THESE!!!
This is not an exhaustive list — please read the UCSD Policy on Integrity of Scholarship and use your common sense!
Your grade will be based on five criteria:
1. Homework assignments: there will be a number of homework assignments (somewhere between five and eleven) throughout the quarter; they will be worth 50% of your grade. You will be allowed to drop the assignment for which you receive the lowest score; we'll average the rest of the assignments to determine your overall homework grade.
You can turn in homework assignments either as physical copies in class or by e-mail to lign165-homework@ling.ucsd.edu. Please don't send them directly to me or Gwen, as they will be much more likely to get lost in the shuffle.
2. Midterm exam: the midterm exam on November 2 is worth 15% of your grade. This will be a pen-and-paper exam; you won't have to implement anything on a computer, though you may have to write out a sketch of a computer program or two.
3. Final exam: the final exam on December 7 is worth 25% of your grade. This will be a pen-and-paper exam; you won't have to implement anything on a computer, though you may have to write out a sketch of a computer program or two (or more).
4. In-class participation: actively contributing to class in a positive way, through asking questions and/or responding to questions I pose, is an important component of your work in the class. This is worth 5% of your grade.
5. One of the two following options (5% of your grade; no extra credit for doing both!):
a. Participation in 4 hours of the Human Subject Pool (http://experimetrix2.com/ucsd/): each hour of participation counts as 1% of your grade, plus a 1% bonus for participation in all four hours. You are encouraged to participate in language-related experiments, and to participate early: the last day for participation is 3 December, and there is no guarantee that experiment slots will be open late in the quarter.
b. Writing a research paper (1000-1500 words) on some topic covered in the class. The due date for such a paper is November 30, and no late papers will be accepted. If you choose this option, you must discuss your research topic with Professor Levy or his teaching assistant, Gwen Gillingham, and get it approved by November 10, before writing the paper and turning it in. We will not accept papers whose topic has not been cleared in advance.
In addition, positive participation on WebCT forums (including asking well-thought-out questions and/or answering other students' questions) is looked on favorably: it increases the chance that your final grade may get bumped up a notch, if it is borderline.
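The weighting in the five criteria above can be sketched in Python as follows. This is only an illustration of the arithmetic, not an official grade calculator: the function names are hypothetical, every component is assumed to be on a 0-100 scale, and the borderline bump for WebCT participation is not modeled.

```python
# Weights taken from the five grading criteria (fractions of the final grade).
WEIGHTS = {
    "homework": 0.50,
    "midterm": 0.15,
    "final": 0.25,
    "participation": 0.05,
    "option": 0.05,  # subject pool hours OR research paper, not both
}


def homework_average(scores):
    """Average the homework scores after dropping the single lowest one."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to drop the lowest")
    kept = sorted(scores)[1:]  # sorted ascending, so index 0 is the lowest
    return sum(kept) / len(kept)


def subject_pool_credit(hours):
    """Percentage points for option (a): 1 per hour, plus 1 bonus at 4 hours."""
    hours = max(0, min(int(hours), 4))  # only the first four hours count
    return hours + (1 if hours == 4 else 0)


def course_score(homework, midterm, final, participation, option):
    """Weighted course score; each argument is assumed to be on a 0-100 scale."""
    parts = {
        "homework": homework,
        "midterm": midterm,
        "final": final,
        "participation": participation,
        "option": option,
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


# Example with made-up scores: six homeworks, the 60 is dropped.
hw = homework_average([90, 70, 100, 80, 60, 95])
option = subject_pool_credit(4) / 5 * 100  # convert 0-5 points to a 0-100 scale
print(round(course_score(hw, 85, 80, 100, option), 2))
```

Note that three subject pool hours earn 3 of the 5 available points, while the fourth hour brings the total to 5 because of the bonus.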
Homework assignments may be turned in up to six days late, but they will be downgraded 10% per day. Furthermore, nothing may be turned in after December 7.
Exceptions to the late policy will only be granted for medical or personal emergencies, and the instructor or his TA must be notified as soon as possible (not several days after the emergency is over).
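As a rough illustration of the late policy, the sketch below applies the 10% per day downgrade to a homework score. It assumes the penalty is linear (10% of the earned score per day late) and that work more than six days late earns no credit; the syllabus does not spell out either detail, and the function name is hypothetical. The December 7 cutoff is not modeled.

```python
LATE_PENALTY_PERCENT_PER_DAY = 10  # from the late policy
MAX_DAYS_LATE = 6                  # from the late policy


def late_adjusted_score(raw_score, days_late):
    """Return the score after the late penalty (assumed linear, see lead-in)."""
    if days_late < 0:
        raise ValueError("days_late cannot be negative")
    if days_late > MAX_DAYS_LATE:
        return 0.0  # assumption: work later than six days is not accepted
    remaining_percent = 100 - LATE_PENALTY_PERCENT_PER_DAY * days_late
    return raw_score * remaining_percent / 100


# Example: a homework scored at 90, handed in on time and then progressively later.
for days in (0, 2, 6, 7):
    print(days, late_adjusted_score(90, days))
```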
We all make mistakes—TAs and professors as well as students—so please do look over your returned work. In addition to helping ensure that you get the credit you deserve, this checking will improve your retention of the material. However, there is a statute of limitations: all grading mistakes must be brought to our attention within one week of our returning the work. This prevents us from getting a backlog of corrections at the end of the quarter, which would interfere with the time-consuming activities of preparing lectures and grading. Thank you in advance for your cooperation!