Annotation and verification of sense pools in OntoNotes

This paper describes OntoNotes, a multilingual (English, Chinese, and Arabic) corpus with large-scale semantic annotations, including predicate-argument structure, word senses, ontology linking, and coreference. The underlying semantic model of OntoNotes groups word senses into so-called sense pools, i.e., sets of near-synonymous word senses. Such information is useful for many applications, including query expansion for information retrieval (IR) systems, (near-)duplicate detection for text summarization systems, and alternative word selection for writing support systems. Although a sense pool provides a set of near-synonymous senses, it does not indicate whether two words in the pool are interchangeable in practical use. This paper therefore devises an unsupervised algorithm that combines Google n-grams with a statistical test to determine whether a word in a pool can be substituted by other words in the same pool. The n-gram features measure the degree of context mismatch caused by a substitution, and the statistical test then decides, based on that degree of mismatch, whether the substitution is adequate. The proposed method is compared with a supervised method, Linear Discriminant Analysis (LDA). Experimental results show that the proposed unsupervised method achieves performance comparable to that of the supervised method.
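The substitution check described above can be illustrated with a minimal sketch. The abstract does not specify the exact n-gram features or which statistical test is used, so everything below is an assumption made for illustration: mismatch is taken to be the number of context n-grams that become unseen after substitution, the decision uses a one-sided binomial test, and `TOY_COUNTS`, `p0`, and `alpha` are hypothetical values rather than Google n-gram data or the paper's settings.

```python
from math import comb

# Hypothetical toy n-gram counts standing in for a real n-gram resource.
TOY_COUNTS = {
    ("drink", "strong"): 40, ("strong", "tea"): 900,
    ("drink", "strong", "tea"): 12, ("strong", "tea", "now"): 2,
    ("drink", "hot"): 5, ("hot", "tea"): 50,
    ("drink", "hot", "tea"): 3, ("hot", "tea", "now"): 1,
}

def context_ngrams(tokens, index, n):
    """Return every n-gram of length n that covers position `index`."""
    grams = []
    for start in range(index - n + 1, index + 1):
        if start >= 0 and start + n <= len(tokens):
            grams.append(tuple(tokens[start:start + n]))
    return grams

def mismatch_counts(tokens, index, substitute, counts, max_n=3):
    """Count context n-grams (n = 2..max_n) that are unseen after substitution."""
    replaced = list(tokens)
    replaced[index] = substitute
    unseen = total = 0
    for n in range(2, max_n + 1):
        for gram in context_ngrams(replaced, index, n):
            total += 1
            if counts.get(gram, 0) == 0:
                unseen += 1
    return unseen, total

def binomial_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))

def is_substitutable(tokens, index, substitute, counts, p0=0.3, alpha=0.05):
    """Accept the substitution unless the mismatch is significantly above p0.

    p0 (assumed baseline rate of unseen n-grams) and alpha are illustrative.
    """
    unseen, total = mismatch_counts(tokens, index, substitute, counts)
    if total == 0:
        return False  # no context evidence available
    return binomial_upper_tail(unseen, total, p0) >= alpha

sentence = ["drink", "strong", "tea", "now"]
print(is_substitutable(sentence, 1, "hot", TOY_COUNTS))
print(is_substitutable(sentence, 1, "powerful", TOY_COUNTS))
```

In this sketch a substitution is rejected only when the share of unseen context n-grams is significantly higher than the assumed baseline rate, which keeps the decision unsupervised: no labeled substitution examples are needed, only n-gram counts.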
