Explaining Unintelligible Words by Means of their Context

Explaining unintelligible words is a practical problem for text obtained by optical character recognition, scraped from the Web (e.g., text containing misspellings), and similar noisy sources. Approaches to wikification, i.e., enriching text by linking words to Wikipedia articles, could help solve this problem. However, existing wikification methods assume that the text is correct, so they cannot wikify erroneous text. Because of errors, the disambiguation problem (identifying the appropriate article to link to) becomes large-scale: since the word to be disambiguated is unknown, the article to link to must be selected from among hundreds, perhaps thousands, of candidate articles. Existing approaches for the case where the word is known build upon the distributional hypothesis: words that occur in the same contexts tend to have similar meanings. The increased number of candidate articles aggravates the problem of spuriously similar contexts (contexts that are similar but belong to different articles). We propose a method that overcomes this difficulty by combining the distributional hypothesis with structured sparsity, a rapidly expanding area of machine learning research. Empirically, our approach based on structured sparsity compares favorably to various traditional classification methods.
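The abstract does not spell out the algorithm, but the combination of the distributional hypothesis with structured sparsity can be sketched as group-lasso sparse coding: represent each candidate article by its context vectors, reconstruct the observed context of the unintelligible word as a sparse combination of those vectors with one group per article, and pick the article whose group carries the most weight. The following is a minimal illustrative sketch under these assumptions, not the paper's actual method; the toy data, function name, and parameter values are all hypothetical.

```python
import numpy as np

def group_lasso_disambiguate(D, groups, x, lam=0.05, iters=300):
    """Pick the candidate article whose context vectors best reconstruct x.

    Hypothetical sketch: D is a (d, n) matrix whose columns are context
    vectors of candidate articles, groups is a list of column-index arrays
    (one per article), and x is the (d,) observed context. Minimizes
    0.5*||x - D w||^2 + lam * sum_g ||w_g||_2 by proximal gradient descent.
    """
    w = np.zeros(D.shape[1])
    lr = 1.0 / (np.linalg.norm(D, 2) ** 2)   # step size from the Lipschitz constant
    for _ in range(iters):
        w = w - lr * D.T @ (D @ w - x)       # gradient step on the squared loss
        for idx in groups:                   # proximal step: block soft-thresholding
            nrm = np.linalg.norm(w[idx])
            w[idx] = 0.0 if nrm <= lr * lam else w[idx] * (1 - lr * lam / nrm)
    norms = [np.linalg.norm(w[idx]) for idx in groups]
    return int(np.argmax(norms)), w

# Toy example: two candidate articles, two context vectors each.
D = np.array([[1.0, 0.9, 0.0, 0.0],
              [0.0, 0.1, 1.0, 0.9],
              [0.0, 0.0, 0.0, 0.1]])
groups = [np.array([0, 1]), np.array([2, 3])]
x = np.array([1.0, 0.05, 0.0])   # observed context resembling the first article
best, w = group_lasso_disambiguate(D, groups, x)
```

The group penalty is what makes the selection structured: entire candidate articles are zeroed out together, so a few spuriously similar individual context vectors from a wrong article are less likely to win than a coherently matching group.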
