FlexiTerm: a flexible term recognition method

Background: The increasing amount of textual information in biomedicine requires effective term recognition methods to identify textual representations of domain-specific concepts as the first step toward automating its semantic interpretation. Dictionary look-up approaches may not always be suitable for dynamic domains such as biomedicine or for newly emerging types of media such as patient blogs, the main obstacles being the use of non-standardised terminology and a high degree of term variation.

Results: In this paper, we describe FlexiTerm, a method for automatic term recognition from a domain-specific corpus, and evaluate its performance against five manually annotated biomedical corpora. FlexiTerm performs term recognition in two steps: linguistic filtering is used to select term candidates, followed by calculation of termhood, a frequency-based measure used as evidence to qualify a candidate as a term. To improve the quality of termhood calculation, which may be affected by term variation, FlexiTerm uses a range of methods to neutralise the main sources of variation in biomedical terms. It manages syntactic variation by processing candidates using a bag-of-words approach. Orthographic and morphological variations are dealt with using stemming in combination with lexical and phonetic similarity measures. The highest values for precision (94.56%), recall (71.31%) and F-measure (81.31%) were achieved on a corpus of clinical notes.

Conclusions: FlexiTerm is an open-source software tool for automatic term recognition. It incorporates a simple term variant normalisation method. The method proved to be more robust than the baseline against less formally structured texts, such as those found in patient blogs or medical notes. The software can be downloaded freely at http://www.cs.cf.ac.uk/flexiterm.
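The bag-of-words idea described above can be illustrated with a minimal Python sketch. This is not the FlexiTerm implementation: the function names, the stopword list and the crude plural-stripping rule are assumptions made for illustration, and the plain occurrence count is only a placeholder for the termhood measure, whose formula is not given here. The sketch shows how candidates that differ in word order or inflection can be mapped to one normalised form so that their frequencies are pooled.

```python
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"of", "the", "in"}


def crude_stem(token: str) -> str:
    """Strip a trailing plural 's'; a stand-in for a real stemmer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def normalise(candidate: str) -> frozenset:
    """Map a term candidate to an order-independent set of stems."""
    tokens = candidate.lower().split()
    return frozenset(crude_stem(t) for t in tokens if t not in STOPWORDS)


def group_variants(candidates):
    """Pool occurrence counts and surface forms per normalised candidate."""
    counts = Counter()
    variants = {}
    for candidate in candidates:
        key = normalise(candidate)
        counts[key] += 1
        variants.setdefault(key, set()).add(candidate)
    return counts, variants


if __name__ == "__main__":
    occurrences = [
        "urinary tract infection",
        "infection of the urinary tract",
        "urinary tract infections",
        "chest pain",
    ]
    counts, variants = group_variants(occurrences)
    for key, count in counts.most_common():
        print(count, sorted(variants[key]))
```

In this sketch the three urinary tract variants share one key, so their counts are summed before any ranking takes place. The lexical and phonetic similarity matching mentioned in the abstract is not covered here.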
