My main current project is documenting generational and regional variation in the tone sandhi of the dialects around Jinhua in Zhejiang, China. The area shows a remarkable amount of regional variation, and considerable generational difference as well. A fair amount of dialectological data exists, but not the kind of detailed data needed to model synchronic variation.
The other big semi-current project is my master's thesis for the
comp-ling program at SDSU, which I have basically finished researching
and have half written, but it has seen only fitful progress in the last
few years.
My thesis research developed a method for evaluating hierarchical
discourse segmentation, i.e. shallow discourse parsing or unlabeled
outlining. Such segmentation is difficult to evaluate because an
evaluation has to take into account the differing importance of the
section breaks and the intrinsic imprecision of their locations.
The research involved
recruiting several dozen students to annotate passages via a web form
interface, developing a method for deriving a gold standard from
conflicting annotations, adapting two segmentation programs to produce
hierarchical segmentations, and proposing a statistical measure
suited to the peculiarities of hierarchical discourse segmentation.
Lucien Carroll. Forthcoming. Evaluation of
Hierarchical Discourse Segmentation of Expository Speech.
Unpublished thesis, carried out under the supervision of Rob Malouf
and Eniko Csomay. Presented at the 29th Linguistics Students
Association Colloquium at SDSU, April 8, 2006.
Abstract: There is a large body of literature
describing work in linear discourse segmentation, especially of news
data, and some work describing algorithms for hierarchical discourse
segmentation. However, little work has been done on segmenting more
conversational genres, and even less on evaluating hierarchical
segmentation. I describe a method for compiling a gold standard for
tree segmentation of expository monolog, and I propose an error metric.
I then evaluate two hierarchical segmentation algorithms with that
metric. The segmentation algorithms both perform quite poorly on this
language variety, but one of the two is shown to be significantly
better than baseline segmentations.
In the coming year I hope to start cool stuff based on
stochastic optimality theory or information-theoretic models of
sentence processing, and continue a collaboration dealing with Chinese
discourse structure.