Discourse Segmentation

My thesis research at SDSU developed a method for evaluating hierarchical discourse segmentation, i.e. shallow discourse parsing or unlabeled outlining. Such segmentations are difficult to evaluate because the evaluation has to take into account both the differing importance of the section breaks and the intrinsic imprecision of their locations. In the thesis, I focused on evaluating segmentations of lectures, but deriving an annotation procedure for spoken corpora is itself a complicated problem, so in the paper presented at NAACL10, I used encyclopedia articles instead.

NAACL/HLT 2010

The version of the paper posted here includes a corrected bibliographic entry for Biber et al. and a change of "geometrically" to "exponentially" in footnote 7. Supplementary material for the NAACL paper includes a Python script for calculating the hierarchical atom-error rate, along with the hierarchically segmented reference corpus derived from Wikipedia articles.

Lucien Carroll. 2010. Evaluating Hierarchical Discourse Segmentation. In Proceedings of NAACL10. Presented at NAACL/HLT in Los Angeles, June 4, 2010. slides paper supplement

Abstract: Hierarchical discourse segmentation is a useful technology, but it is difficult to evaluate. I propose an error measure based on the word error rate of Beeferman et al. (1999). I then show that this new measure not only reliably distinguishes baseline segmentations from lexically-informed hierarchical segmentations and more informed segmentations from less informed segmentations, but it also offers an improvement over previous linear error measures.
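For readers unfamiliar with linear error measures, here is a minimal sketch of the windowed measure usually called P_k, which is the measure commonly attributed to Beeferman et al. (1999). I am assuming that this is the linear measure the abstract builds on. The sketch is not the hierarchical measure proposed in the paper, and it is not the supplementary script. The function names `segment_ids` and `pk`, and the representation of a segmentation as a list of break positions, are hypothetical choices made for illustration.

```python
def segment_ids(boundaries, n_units):
    """Map each unit (e.g. word or sentence) to the index of its segment.

    `boundaries` lists positions b such that a break falls just before unit b.
    (Hypothetical representation, chosen for this sketch.)
    """
    breaks = set(boundaries)
    ids = []
    current = 0
    for i in range(n_units):
        if i in breaks and i > 0:
            current += 1
        ids.append(current)
    return ids


def pk(reference, hypothesis, n_units, k=None):
    """Windowed linear segmentation error in the style of P_k.

    Slides a probe of width k over the text and counts how often the
    reference and hypothesis disagree about whether the two ends of the
    probe lie in the same segment.
    """
    ref = segment_ids(reference, n_units)
    hyp = segment_ids(hypothesis, n_units)
    if k is None:
        # Conventional choice: half the mean reference segment length.
        n_segments = ref[-1] + 1
        k = max(1, round(n_units / n_segments / 2))
    windows = n_units - k
    if windows <= 0:
        raise ValueError("text is too short for the chosen window size")
    errors = sum(
        (ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
        for i in range(windows)
    )
    return errors / windows


if __name__ == "__main__":
    reference = [4, 8]            # three segments of four units each
    print(pk(reference, [4, 8], 12))   # identical segmentation
    print(pk(reference, [5, 8], 12))   # one break misplaced by one unit
    print(pk(reference, [], 12))       # no breaks hypothesized
```

A measure like this treats every break as equally important, which is one reason a linear measure alone does not fit hierarchical segmentations, where breaks at higher levels of the outline matter more.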

MA Thesis

My thesis research involved recruiting several dozen students to annotate lecture passages via a web form interface, developing a method for deriving a gold standard from conflicting annotations, adapting two segmentation programs to produce hierarchical segmentations, and proposing a statistical measure suitable to the peculiarities of hierarchical discourse segmentation.

Lucien Carroll. Evaluation of Hierarchical Discourse Segmentation of Expository Speech. MA thesis carried out under the supervision of Rob Malouf and Eniko Csomay. Presented at the 29th Linguistics Students Association Colloquium at SDSU, April 8, 2006. slides

Abstract: There is a large body of literature describing work in linear discourse segmentation, especially of news data, and some work describing algorithms for hierarchical discourse segmentation. However, little work has been done on segmenting more conversational genres, and even less on evaluating hierarchical segmentation. I describe a method for compiling a gold standard for tree segmentation of expository monolog, and I propose an error metric. I then evaluate two hierarchical segmentation algorithms with that metric. The segmentation algorithms both perform quite poorly on this language variety, but one of the two is shown to be significantly better than baseline segmentations.