A Framework for Collocation Error Correction in Web Pages and Text Documents

Much of the English in text documents today is written by non-native speakers, who also conduct a large number of Web searches. Though highly qualified in their respective fields, these speakers may make collocation errors, e.g., "dark money" and "stock agora" instead of the more appropriate English expressions "black money" and "stock market", respectively. Such errors may arise from literal translation out of the speaker's native language, among other factors. They can cause problems in contexts such as querying over Web pages and the correct understanding of text documents. This paper proposes a framework called CollOrder that detects such collocation errors and suggests correct collocations as responses, ranked in order of correctness, to improve the semantics. The framework integrates machine learning approaches with natural language processing techniques and proposes suitable heuristics for producing and ranking these responses. We discuss the proposed framework along with its algorithms and experimental evaluation. We claim that it would be useful for semantically enhancing Web querying in domains such as financial news and online shopping. It would also help in providing automated error correction for machine-translated documents and in assisting people who use ESL tools.
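The detect-then-rank idea described above can be illustrated with a minimal sketch. This is not the CollOrder algorithm, whose heuristics are not given in this abstract. It assumes a toy table of collocation counts (`CORPUS_COUNTS`) and a toy map of related words (`RELATED`), both hypothetical, and it ranks candidates by raw frequency only.

```python
from itertools import product

# Hypothetical toy data: counts of known collocations in a reference corpus.
CORPUS_COUNTS = {
    ("black", "money"): 120,
    ("stock", "market"): 950,
    ("stock", "exchange"): 400,
}

# Hypothetical related-word lists (e.g., drawn from a thesaurus or a bilingual lexicon).
RELATED = {
    "dark": ["black", "dim"],
    "agora": ["market", "exchange"],
}


def suggest(collocation, counts=CORPUS_COUNTS, related=RELATED):
    """Return ranked replacement suggestions, or [] if the input is already attested."""
    words = tuple(collocation.lower().split())
    # Detection step (assumption): an expression is acceptable if the corpus attests it.
    if counts.get(words, 0) > 0:
        return []
    # Candidate generation: substitute each word with itself or one of its related words.
    options = [[w] + related.get(w, []) for w in words]
    candidates = {
        c for c in product(*options)
        if c != words and counts.get(c, 0) > 0
    }
    # Ranking step (assumption): most frequent first, ties broken alphabetically.
    ranked = sorted(candidates, key=lambda c: (-counts[c], c))
    return [" ".join(c) for c in ranked]


if __name__ == "__main__":
    for query in ("dark money", "stock agora", "stock market"):
        print(query, "->", suggest(query))
```

With this toy data, "dark money" yields "black money", and "stock agora" yields "stock market" ahead of "stock exchange". A real system would replace the frequency lookup with the learned and heuristic scoring that the paper describes.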
