Linguistics 165: Computational Linguistics (Winter 2015)

1 Course information

Class Times MWF 11-11:50am
Class Location AP&M 4452 (MW); AP&M B432 (F)
Class webpage http://grammar.ucsd.edu/courses/lign165

2 Instructor information

Instructor Roger Levy (rlevy@ucsd.edu)
Instructor's office AP&M Room 4220
Instructor's office hours Thursdays 1-3pm
Teaching Assistant Meilin Zhan (mezhan@ucsd.edu)
Teaching Assistant's office 3351E
Teaching Assistant's office hours Mondays 9-11am

3 Course Description

Computational linguistics (CL; also called natural language processing, or NLP) is the study of how to get computers to do useful things with human language (a.k.a. "natural language"). Linguistics 165 is a brief introduction to this rich and exciting field, and includes coverage of topics such as:

  • formal tools for describing and computing with fragments of language structure at multiple levels – phonology, morphology, syntax, semantics
  • automatically classifying documents into different categories (financial versus sports news articles, positive versus negative movie reviews, spam versus non-spam email)
  • how speech recognition works
  • automated spelling and grammar correction
  • automatically identifying phrases and clause structure in sentences (parsing)
  • automatic translation between natural languages (machine translation)
  • using ideas from computational linguistics to shed light on how we humans use natural language

During the course you'll both learn the theory underlying these topics and gain hands-on experience with Python software used to perform some of the tasks above. You'll also gain hands-on experience programming your own NLP tools. Class meetings will be a mixture of lecture and programming practicum sessions. Homework assignments and final projects will involve both written and programming components.

4 Course organization

The course meets three times a week – Mondays, Wednesdays, and Fridays. Mondays and Wednesdays will be lecture sessions, and Fridays will be hands-on computer lab sessions where we do programming in real time. Interrupting (politely!) to ask questions is highly encouraged in both lectures and lab sessions.

5 Intended Audience

Upper-division students interested in how to get computers to do useful things with language. Computational linguistics is a highly interdisciplinary field, and students from a variety of majors, including (but not limited to) linguistics, computer science, cognitive science, psychology, and electrical engineering, are likely to find the content interesting. There are no prerequisites for the course, but there will be quite a bit of computer programming involved, and you will find the course easier if you have taken a programming course (e.g., CSE 5A, 8A/B, or 11). You'll also find the course easier if you've taken introductory linguistics (LIGN 101), and potentially syntax as well (LIGN 121).

6 Course objectives

Computational linguistics is a rich and deep field, and we can only scratch the surface in a single quarter. By the end of this course you will have a sense of the flavor of some of the major subfields, you will be familiar with some of the key models and techniques for processing human language by computer, and you will have had practical experience working with these models and techniques on a computer.

7 Textbook

The following books are required for the class:

  1. Jurafsky, Daniel, and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall. Second edition. (I refer to this book as "SLP" in the syllabus.)

    This textbook is the single most comprehensive and up-to-date introduction available to the field of computational linguistics. It is expensive ($117.45 on Amazon.com), but it is worth it. Please do get the second edition, not the first edition.

  2. Bird, Steven, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media. (I refer to this book as "NLTK" in the syllabus.)

    This is the book for the Natural Language Toolkit (or NLTK), which we will be using extensively to do programming. We will be doing some of our programming in the Python programming language, and will make quite a bit of use of NLTK for Python. You can buy this book, or you can freely access it on the Web at http://www.nltk.org/book.

We will also be occasionally using other readings linked to directly from the syllabus. These include selections from the following textbooks:

  • Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press. (I refer to this book as "MRS" in the syllabus.)

8 Ted

We will be using Ted for administering homework assignments and various surveys, and as a discussion forum for all participants in the class. Most of you should be familiar with Ted/Blackboard from another class (in years past it was called WebCT); if you aren't, poke around at http://acms.ucsd.edu/units/iwdc/students.html.

9 Discussion boards

There will be discussion boards on the course Ted site for the major topics covered in this class. If you have a question about course content that may be relevant to other students in the course, we strongly encourage you to post it to the Ted discussion board for this class. We encourage you to read the discussion boards regularly, and if you know the answer to a question, to post the answer! *Active, positive contributions to the discussion boards will be given favorable consideration in determining final grades.*

10 Syllabus (subject to modification)

| Week | Day | Topic | Reading | In-class materials | Homework Assignments |
|---|---|---|---|---|---|
| Week 1 | 5 Jan | Class Introduction: what is computational linguistics? Admin. | SLP Chapter 1, NLTK Chapter 1 | Introduction: what is computational linguistics? | |
| | 7 Jan | Regular expressions. | SLP Chapter 2.1-2.1.6 | Lecture 2 | Homework 1 |
| | 9 Jan | Introduction to Python and NLTK (Given by Emily Morgan) | NLTK Chapters 2 & 3 | | |
| Week 2 | 12 Jan | Finite-state automata (FSAs) | SLP Chapter 2.2 | Lecture 3 | |
| | 14 Jan | Finish FSAs | SLP Chapter 2.3-2.4 | Lecture 4 | |
| | 16 Jan | Python control & data structures; FSAs and FSTs in NLTK | NLTK Chapter 4 | Lecture 5 (practicum) | Homework 2 |
| Week 3 | 19 Jan | No class due to Martin Luther King, Jr. Holiday | | | |
| | 21 Jan | Morphology and finite-state transducers (FSTs) | SLP Chapter 3; Manning & Schütze 3.1 | Lecture 6 | |
| | 23 Jan | Elementary text classification in NLTK. | MRS Chapter 13 | Lecture 7 | |
| Week 4 | 26 Jan | Finish finite-state transducer lectures. | | Lecture 8 | |
| | 28 Jan | Basic probability theory. Naive Bayes. Use in text classification. | NLTK Chapter 6.1-6.5 | | |
| | 30 Jan | Finite-state transducers in OpenFst. | | Lecture 10 | |
| Week 5 | 2 Feb | Finish Naive Bayes text classification | | | Homework 3 (due 9 Feb) |
| | 4 Feb | N-gram modeling | SLP Chapter 4.1-4.5 | | |
| | 6 Feb | N-gram modeling in NLTK | | Lecture 11 | |
| Week 6 | 9 Feb | More advanced issues in N-gram modeling | SLP Chapter 4.6-4.12 | Lecture 12 | |
| | 11 Feb | Midterm Exam | | | |
| | 13 Feb | N-gram modeling in SRILM | | Lecture 13 | |
| Week 7 | 16 Feb | No class due to Presidents' Day Holiday | | | |
| | 18 Feb | Part-of-speech tagging | SLP Chapter 5.1-5.5 | Lecture 14 | Homework 4 (due 25 Feb) |
| | 20 Feb | Part-of-speech tagging, continued | SLP Chapter 5.5-5.10 (yes, re-read 5.5) | Lecture 15 | |
| Week 8 | 23 Feb | Hidden Markov Models for part-of-speech tagging | NLTK Chapter 5; SLP Chapter 6.1-6.4 | Lecture 16 | |
| | 25 Feb | Syntax and context-free grammars. | SLP Chapter 12.1-12.3, Chapter 16; NLTK Chapter 8.1-8.3 | | |
| | 27 Feb | Syntax and context-free grammars, continued. | SLP Chapter 12.4-12.9, Chapter 13; NLTK Chapter 8.4-8.5 | Lecture 18 | Homework 5 |
| Week 9 | 2 Mar | Parsing with context-free grammars. | SLP Chapter 14.1, 14.3, NLTK Chapter 8.6 | | |
| | 4 Mar | Dynamic programming in CFG parsing: the CKY algorithm. | SLP Chapter 14.2, 14.4, 14.5, 14.7 | Lecture 20 | |
| | 6 Mar | Large-scale CFGs. Searching trees with Tregex. Probabilistic context-free grammars. | | Lecture 21 | Homework 6 |
| Week 10 | 9 Mar | Parsing with PCFGs. | SLP Chapter 14.10 | | |
| | 11 Mar | Guest lecture by Titus von der Malsburg on machine translation | | | |
| | 13 Mar | Course wrapup | | | |
| Finals | 16 Mar | Final exam!!! From 11:30am to 2:30pm. | | | |

11 Instructor contact policy

Coming to talk to the instructor or TA during their office hours is highly encouraged. Electronic communications about course content should be made through the Ted discussion board (see above). We ask that you use email contact only for communications that are not relevant to other students (e.g., specific learning circumstances or medical/personal emergency).

12 Academic Integrity

Please take some time to read the UCSD Policy on Integrity of Scholarship. We will be conducting this course in full accordance with this policy. In particular, any suspected cheating or plagiarism in the course will be taken very seriously and investigated. If we determine that cheating or plagiarism has taken place, it will be reported to UCSD's Office of the Academic Integrity Coordinator, in accordance with UCSD policy. Please note that it is not at our discretion whether or not to report instances of academic dishonesty: we are obligated by UCSD policy to report such instances.

12.1 Examples of academic integrity violations

Here are some examples of academic integrity violations. DO NOT DO THESE!!!

  • Copying from or looking at a neighbor's exam during the midterm or final.
  • Copying a friend's or roommate's homework assignment.
  • Changing a graded homework assignment or exam and returning it for a regrade.
  • Smuggling notes into a closed-book exam.
  • Finding the answer key to a homework assignment (e.g., on the Web) and copying it.
  • Giving a false reason (e.g., death of a relative) for missing an exam or turning in an assignment late.

This is not an exhaustive list – please read the UCSD Policy on Integrity of Scholarship and use your common sense!

13 Requirements & grading

Your grade will be based on five criteria:

  1. Homework assignments: there will be a number of homework assignments (somewhere between five and eleven) throughout the quarter; they will be worth 50% of your grade. You will be allowed to drop the assignment for which you receive the lowest score; we'll average the rest of the assignments to determine your overall homework grade.

You can turn in homework assignments either as physical copies in class or by e-mail to lign165-homework@ling.ucsd.edu. Please don't send e-mail directly to either me or Meilin instead, as it will be much more likely to get lost in the shuffle.

  2. Midterm exam: the midterm exam on Wednesday 11 February is worth 15% of your grade. This will be a pen-and-paper exam; you won't have to implement anything on a computer, though you may have to write out a sketch of a computer program or two.
  3. Final exam: the final exam on Monday 16 March is worth 25% of your grade. This will be a pen-and-paper exam; you won't have to implement anything on a computer, though you may have to write out a sketch of a computer program or two (or more).
  4. In-class participation: actively contributing to class in a positive way, through asking questions and/or responding to questions I pose, is an important component of your work in the class. This is worth 5% of your grade.
  5. One of the two following options (5% of your grade; no extra credit for doing both!):

a. Participation in 4 hours of the Human Subject Pool (http://ucsd.sona-systems.com): each hour of participation counts as 1% of your grade, plus a 1% bonus for participation in all four hours. You are encouraged to participate in language-related experiments, and to participate in these experiments early – the last day for participation is 11 March, and there is no guarantee that there will be experiment slots open for participation in the latest part of the quarter.

b. Writing a research paper (1000-1500 words) on some topic covered in the class. The due date for such a paper is 13 March, and no late papers will be accepted. If you choose this option, you must discuss it and get an OK on your research topic from Professor Levy or his teaching assistant, Meilin Zhan, by 1 March – *before* writing the paper and turning it in. We will not accept papers whose topic has not been cleared in advance.

In addition, positive participation on the Ted forums (including asking well-thought-out questions and/or answering other students' questions) is looked on favorably: it increases the chance that your final grade may get bumped up a notch if it is borderline.
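
To make the weighting above concrete, here is a minimal Python sketch of how the five components combine. It is an illustration only, not an official grade calculator: it assumes every score is expressed on a 0-100 scale and that the kept homework assignments count equally, and the names WEIGHTS, homework_average, and course_percentage are made up for this example. How percentages map to letter grades is not specified here.

```python
# Illustrative sketch only: the names and the 0-100 scale are assumptions,
# not an official grade calculator.

# Component weights in percent, as listed in the grading criteria.
WEIGHTS = {
    "homework": 50,
    "midterm": 15,
    "final": 25,
    "participation": 5,
    "option": 5,  # Human Subject Pool hours OR research paper
}


def homework_average(scores):
    """Drop the single lowest homework score and average the rest."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to drop the lowest")
    kept = sorted(scores)[1:]  # everything except the lowest score
    return sum(kept) / len(kept)


def course_percentage(homework_scores, midterm, final, participation, option):
    """Combine the five components into an overall percentage (0-100)."""
    components = {
        "homework": homework_average(homework_scores),
        "midterm": midterm,
        "final": final,
        "participation": participation,
        "option": option,
    }
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS) / 100


if __name__ == "__main__":
    # Hypothetical student: the 60 is dropped, leaving a homework average of 90.
    print(course_percentage([60, 80, 100], midterm=80, final=90,
                            participation=100, option=100))
```

Note that this sketch leaves out the possible bump for borderline grades, since that is a judgment call rather than a formula.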

14 Notes on writing homework assignments

  • Sometimes you will find yourself including Python or other computer code in your homework writeups. Please use a fixed-width (also called monospaced, non-proportional, or fixed-pitch) font to present computer code in writeups – e.g., use Courier, Courier New, Menlo, or Monaco. It is extremely difficult to read computer code that is not in a fixed-width font but rather in a variable-width font. To get a sense of the difference, here's a comparison for you to look at!
  • We have FAQs for various homework assignments now:

15 Homework grading policy

Homework assignments may be turned in up to six days late, but they will be downgraded 10% per day. Furthermore, nothing may be turned in after December 7.

Exceptions to the late policy will only be granted for medical or personal emergencies, and the instructor or his TA must be notified as soon as possible (not several days after the emergency is over).
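
As a rough illustration of the late penalty (setting aside emergencies and the final turn-in date), here is a small Python sketch. It rests on two assumptions that the policy above does not spell out: that the 10% per day is taken from the score you earned and does not compound, and that work more than six days late earns no credit. The function and constant names are hypothetical.

```python
# Illustrative sketch only; see the assumptions stated above.

PENALTY_PERCENT_PER_DAY = 10  # "downgraded 10% per day"
MAX_DAYS_LATE = 6             # "up to six days late"


def late_adjusted_score(raw_score, days_late):
    """Apply the per-day late penalty to a raw homework score."""
    if days_late < 0:
        raise ValueError("days_late cannot be negative")
    if days_late > MAX_DAYS_LATE:
        # Assumption: work more than six days late earns no credit.
        return 0.0
    # Assumption: a non-compounding 10% of the raw score is lost per day.
    return raw_score * (100 - PENALTY_PERCENT_PER_DAY * days_late) / 100


if __name__ == "__main__":
    for days in range(8):
        print(days, late_adjusted_score(90, days))
```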

15.1 Regrading/correction policy

We all make mistakes – TAs and professors as well as students – so please do look over your returned work. In addition to helping ensure that you get the credit you deserve, this checking will improve your retention of the material. However, there is a statute of limitations: all grading mistakes must be brought to our attention within one week of our returning the work. This prevents us from getting a backlog of corrections at the end of the quarter, which would interfere with the time-consuming activities of preparing lectures and grading. Thank you in advance for your cooperation!

16 Useful links

Parts of this class will require you to use remote Linux compute servers, and the Linux command-line shell. This web page for CSE 11 has a bunch of useful links that can help you get familiarized with this process. In particular take a look at the Unix tutorial – go through it step by step as time allows! Here are some other useful links:

  • a tutorial on using PuTTY as a secure-shell (SSH) client to connect to remote compute servers (and Googling "PuTTY tutorial" will reveal many more such tutorials)

Author: Roger Levy

Created: 2015-03-07 Sat 08:52
