Complex Terminology Extraction Model from Unstructured Web Text Based Linguistic and Statistical Knowledge

Textual data remain among the most valuable sources of information on the web. In this research, the authors focus on a specific kind of information, namely "complex terms". Complex terms are defined as semantic units composed of several lexical units that can describe the content of a text in a relevant and exhaustive way. In this paper, the authors present COTEM, a new model for complex terminology extraction that integrates linguistic and statistical knowledge. The paper makes three main contributions. First, it shows that a linear-chain Conditional Random Field (CRF) can be used for complex terminology extraction from a specialized text corpus. Second, it demonstrates that a CRF can model linguistic knowledge when grammatical observations are incorporated into its features. Finally, it presents the gains in terminology extraction quality obtained by integrating statistical knowledge.
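
The abstract does not spell out the feature set, so the following is only an illustrative sketch of how grammatical observations and a statistical score might be combined into per-token features for a linear-chain CRF. The function names (`pmi_table`, `token_features`, `sentence_features`), the feature keys, the toy corpus, and the choice of pointwise mutual information as the statistical measure are all assumptions made for this example, not details taken from COTEM.

```python
import math
from collections import Counter

def pmi_table(sentences):
    """Pointwise mutual information (log base 2) for adjacent word pairs.

    `sentences` is a list of sentences, each a list of (word, pos) pairs.
    PMI is an assumed stand-in for the paper's statistical knowledge.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = [w.lower() for w, _ in sent]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table

def token_features(sent, i, pmi):
    """Feature dict for token i: grammatical context plus a PMI score."""
    word, pos = sent[i]
    feats = {"word.lower": word.lower(), "pos": pos}
    if i > 0:
        prev_word, prev_pos = sent[i - 1]
        feats["prev.pos"] = prev_pos
        feats["pos.bigram"] = prev_pos + "_" + pos  # grammatical pattern
        # Association strength with the previous word (0.0 if unseen).
        feats["pmi.prev"] = pmi.get((prev_word.lower(), word.lower()), 0.0)
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        feats["next.pos"] = sent[i + 1][1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats

def sentence_features(sent, pmi):
    """One feature dict per token, in sentence order."""
    return [token_features(sent, i, pmi) for i in range(len(sent))]

# Toy POS-tagged corpus, invented purely for illustration.
corpus = [
    [("Conditional", "ADJ"), ("random", "ADJ"), ("fields", "NOUN")],
    [("random", "ADJ"), ("noise", "NOUN")],
]
pmi = pmi_table(corpus)
features = sentence_features(corpus[0], pmi)
```

Feature dictionaries of this shape, one per token, are the kind of input that common CRF toolkits accept alongside a label sequence (for instance a begin/inside/outside tagging of term boundaries, which is again an assumption here rather than something the abstract states).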
