Extraction: A New Combination of Statistical and Web Mining Approaches

The objective of this work is to combine statistical and web mining methods for the automatic extraction and ranking of biomedical terms from free text. We present new extraction methods that use linguistic patterns specialized for the biomedical field together with term extraction measures, such as C-value, and keyword extraction measures, such as Okapi BM25 and TFIDF. We propose several combinations of these measures to improve the extraction and ranking process, and we investigate which combinations are most relevant in different cases. Each measure gives us a ranked list of candidate terms, which we finally re-rank with a new web-based measure. Our experiments show, first, that an appropriate harmonic mean of C-value and a keyword extraction measure offers better precision than C-value used alone, for the extraction of both single-word and multi-word terms; and second, that the best precision results are often obtained when we re-rank using the web-based measure. We illustrate our results on the extraction of English and French biomedical terms from a corpus of laboratory tests available online in both languages. The results are validated using UMLS (for English) and MeSH only (for French) as reference dictionaries.
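As an illustration of the combination step described above, the following is a minimal sketch of ranking candidate terms by the harmonic mean of two scores. It assumes that both measures have already been computed and normalized to a comparable range; the function names, the tie-breaking rule, and the scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative scores; returns 0.0 when both are 0."""
    if a + b == 0:
        return 0.0
    return 2 * a * b / (a + b)


def rank_by_combined_score(
    c_value: Dict[str, float], keyword_score: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Rank terms scored by both measures, highest combined score first.

    Ties are broken alphabetically (an assumption made for determinism).
    """
    terms = c_value.keys() & keyword_score.keys()
    combined = {t: harmonic_mean(c_value[t], keyword_score[t]) for t in terms}
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up, pre-normalized scores for three candidate terms (illustration only).
c_value = {"blood test": 0.9, "test": 0.2, "white blood cell": 0.6}
okapi = {"blood test": 0.5, "test": 0.8, "white blood cell": 0.6}

ranked = rank_by_combined_score(c_value, okapi)
for term, score in ranked:
    print(f"{term}: {score:.3f}")
```

The harmonic mean rewards terms that score well on both measures: in this toy data, "test" has a high keyword score but a low C-value, so it ends up below the two multi-word candidates.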
