Text Mining for Literature Review and Knowledge Discovery in Cancer Risk Assessment and Research

Research in biomedical text mining is starting to produce technology that makes information in the biomedical literature more accessible to bio-scientists. A current challenge is to integrate and refine this technology so that it supports real-life scientific tasks in biomedicine, and to evaluate its usefulness in the context of such tasks. We describe CRAB, a fully integrated text mining tool designed to support chemical health risk assessment. This task is complex and time-consuming, requiring a thorough review of the existing scientific data on a particular chemical. Because these data cover human, animal, cellular and other mechanistic evidence from various fields of biomedicine, they are highly varied and therefore difficult to harvest manually from literature databases. Our tool automates the process by extracting relevant scientific data from the published literature and classifying them according to multiple qualitative dimensions. Developed in close collaboration with risk assessors, the tool lets users navigate the classified dataset in various ways and share the data with other users. We present a direct and user-based evaluation which shows that the technology integrated in the tool is highly accurate, and report a number of case studies which demonstrate how the tool can support scientific discovery in cancer risk assessment and research. Our work demonstrates the usefulness of a text mining pipeline in facilitating complex research tasks in biomedicine. We also discuss further development of the technology and its application to other types of chemical risk assessment.
