[nl-uiuc] Talk by Jason Baldridge at 2 pm, 3405 SC.
rsamdan2 at illinois.edu
Fri Apr 30 12:05:11 CDT 2010
This is a gentle reminder for today's AIIS seminar talk by Jason Baldridge (http://comp.ling.utexas.edu/people/jason_baldridge), April 30 at 2pm in 3405 SC.
Here are the title, abstract, and bio.
Using universal grammar and integer programming to improve
The last decade has seen a great deal of work in computational
linguistics on or using categorial grammar, especially Combinatory
Categorial Grammar (CCG). These efforts include wide-coverage
grammars/parsers based on CCGbank and domain-specific grammars developed with OpenCCG. A recurring theme in this work is that identifying the correct lexical categories for words (or pieces of
logical forms) is highly useful, whether it is for grammar development, supertagging to speed up parsing, hypertagging for
sentence realization, or using categories as the basis for features in
various other tasks, such as machine translation. However, building
models for labeling categories requires training material, which so far has meant using a resource such as CCGbank, which provides texts labeled with categories and derivations. In the context of OpenCCG, there is the related problem of assigning categories to words that are outside of the grammar. This reliance on labor-intensive resources limits the cross-linguistic applicability of categorial grammar for work in computational linguistics, so it would naturally be of interest to find ways to bootstrap at least some of this information.
In this talk, I'll discuss experiments on weakly supervised supertagging using Hidden Markov Models as a means for expanding
categorial lexicons. Applied naively to supertagging for CCGbank, HMMs perform quite poorly when given only a tag dictionary and standard EM training; I'll discuss two complementary strategies for improving the learned HMM that use no additional annotations or knowledge about the language or dataset being analyzed. The first is to use knowledge about the universal grammar of category combination to create a grammar-informed initialization for transition probabilities before starting EM. The second is to use an integer program that finds the smallest set of supertag bigrams that covers the text while obeying the constraints of the tag dictionary. Both strategies provide massive gains over standard, randomly initialized EM, for both English and Italian supertagging. In combination, they deliver further error reductions. The computational complexity of the integer program in the face of supertag ambiguity is very high, so we employ a two-stage method that, while not guaranteed to find the optimal solution, works very well in practice.
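[Annotation: the idea behind the first strategy can be sketched in a few lines of Python. This is an illustrative toy, not the talk's actual implementation: the function names are invented, only forward and backward application are checked (the real initialization presumably also considers composition and other combinators), and the boost factor is an arbitrary choice.]

```python
def split_outermost(cat):
    """Split a CCG category at its outermost slash, respecting parentheses.
    Returns (result, slash, argument), or None for an atomic category."""
    depth = 0
    for i in range(len(cat) - 1, -1, -1):
        c = cat[i]
        if c == ")":
            depth += 1
        elif c == "(":
            depth -= 1
        elif c in "/\\" and depth == 0:
            return cat[:i], c, cat[i + 1:]
    return None

def strip_parens(cat):
    return cat[1:-1] if cat.startswith("(") and cat.endswith(")") else cat

def can_combine(left, right):
    """True if `left` can combine with `right` by forward or backward
    application, two of CCG's universal combinatory rules."""
    sp = split_outermost(left)
    if sp and sp[1] == "/" and strip_parens(sp[2]) == strip_parens(right):
        return True  # forward application: X/Y  Y  =>  X
    sp = split_outermost(right)
    if sp and sp[1] == "\\" and strip_parens(sp[2]) == strip_parens(left):
        return True  # backward application: Y  X\Y  =>  X
    return False

def init_transitions(tags, boost=9.0):
    """Grammar-informed initial transition table P(t2 | t1): combinable
    category pairs get `boost` times the weight of non-combinable ones,
    then each row is normalized to a probability distribution."""
    table = {}
    for t1 in tags:
        weights = {t2: (boost if can_combine(t1, t2) else 1.0) for t2 in tags}
        z = sum(weights.values())
        table[t1] = {t2: w / z for t2, w in weights.items()}
    return table
```

The point of such an initialization is that EM then starts from transition probabilities that already prefer grammatically plausible category sequences, rather than from a random or uniform table.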
[This talk describes joint work with Sujith Ravi and Kevin Knight.]
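[Annotation: the second strategy's objective, the smallest set of supertag bigrams covering the corpus under the tag dictionary, is an instance of set cover. As an illustration only, here is a greedy approximation in Python; the talk itself uses an integer program with a two-stage solution, and the function name and toy lexicon below are hypothetical.]

```python
from itertools import product

def greedy_bigram_cover(sentences, tag_dict):
    """Greedy stand-in for the integer program: repeatedly pick the supertag
    bigram that covers the most still-uncovered adjacent word pairs, where a
    bigram (t1, t2) covers the pair (w1, w2) iff the tag dictionary allows
    t1 for w1 and t2 for w2."""
    uncovered = {(s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1)}
    chosen = set()
    while uncovered:
        best, best_covered = None, set()
        # Candidate bigrams are those licensed by some uncovered pair.
        for (w1, w2) in uncovered:
            for bg in product(tag_dict[w1], tag_dict[w2]):
                covered = {(a, b) for (a, b) in uncovered
                           if bg[0] in tag_dict[a] and bg[1] in tag_dict[b]}
                if len(covered) > len(best_covered):
                    best, best_covered = bg, covered
        chosen.add(best)
        uncovered -= best_covered
    return chosen

# Toy example with a hypothetical tag dictionary:
sents = [["the", "dog", "barks"], ["the", "cat", "barks"]]
lex = {"the": {"NP/N"}, "dog": {"N"}, "cat": {"N"}, "barks": {"S\\NP"}}
```

On this toy input, two bigrams, (NP/N, N) and (N, S\NP), cover all four adjacent word pairs. The greedy heuristic carries no optimality guarantee, which is precisely why the exact formulation in the talk is an integer program.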
Jason Baldridge is an assistant professor in the Department of
Linguistics at the University of Texas at Austin. He received his
Ph.D. from the University of Edinburgh in 2002 and was then a
post-doctoral researcher there on the ROSIE project until 2005. His
main research interests include categorial grammars, active learning,
discourse structure, coreference resolution, and georeferencing. He is
one of the co-creators of OpenNLP and has been active for many years in the creation and promotion of open source software for natural language processing.
Dept. of Computer Science,
University of Illinois at Urbana-Champaign.