The first step in the development of text mining technology for cancer risk assessment: identifying and organizing scientific evidence in risk assessment literature

Background: One of the most neglected areas of biomedical Text Mining (TM) is the development of systems based on carefully assessed user needs. We have recently investigated the user needs of an important task yet to be tackled by TM: Cancer Risk Assessment (CRA). Here we take the first step towards the development of TM technology for the task: identifying and organizing the scientific evidence required for CRA in a taxonomy which is capable of supporting extensive data gathering from biomedical literature.

Results: The taxonomy is based on expert annotation of 1297 abstracts downloaded from relevant PubMed journals. It classifies 1742 unique keywords found in the corpus into 48 classes which specify core evidence required for CRA. We report promising results with inter-annotator agreement tests and automatic classification of PubMed abstracts into taxonomy classes. We also report a simple user test in a near real-world CRA scenario which, together with the other evaluations, demonstrates that the resources we have built are well-defined, accurate, and applicable in practice.

Conclusion: We present our annotation guidelines and a tool which we have designed for expert annotation of PubMed abstracts. A corpus annotated for keywords and document relevance is also presented, along with the taxonomy which organizes the keywords into classes defining core evidence for CRA. As demonstrated by the evaluation, the materials we have constructed provide a good basis for classification of CRA literature along multiple dimensions. They can support current manual CRA as well as facilitate the development of an approach based on TM. We discuss extending the taxonomy further via manual and machine learning approaches and the subsequent steps required to develop TM technology for the needs of CRA.
