Scientific Discovery by Machine Intelligence: A New Avenue for Drug Research

The majority of big data is unstructured, and the largest share of that unstructured data is text. While data mining techniques are well developed and standardized for structured, numerical data, the realm of unstructured data is still largely unexplored. Current work focuses mainly on information extraction, which attempts to retrieve known information from text. The Holy Grail, however, is knowledge discovery, where machines are expected to unearth entirely new facts and relations not previously known to any human expert. Indeed, understanding the meaning of text is often considered one of the main characteristics of human intelligence. The ultimate goal of semantic artificial intelligence is to devise software that can understand the meaning of free text, at least in the practical sense of providing new, actionable information condensed out of a body of documents. As a stepping stone on the road to this vision, I will introduce a novel approach to drug research: identifying relevant information by employing a self-organizing semantic engine to text-mine large repositories of biomedical research papers, a technique pioneered by Merck with the InfoCodex software. I will describe the methodology and a first successful experiment for the discovery of new biomarkers and phenotypes for diabetes and obesity on the basis of PubMed abstracts, public clinical trials and Merck internal documents. The reported approach shows much promise and has the potential to fundamentally impact pharmaceutical research, both by shortening the time-to-market of novel drugs and by enabling early recognition of dead ends.
