Imitating Manual Curation of Text-Mined Facts in Biomedicine

Text-mining algorithms make mistakes when extracting facts from natural-language texts. In biomedical applications that rely on text-mined data, it is critical to assess the quality of individual facts (the probability that a fact was extracted correctly) in order to resolve data conflicts and inconsistencies. Using a large set of almost 100,000 manually produced evaluations (most facts were reviewed independently more than once), we implemented and tested a collection of algorithms that mimic human evaluation of facts provided by an automated information-extraction system. The performance of our best automated classifiers closely approached that of our human evaluators (ROC score close to 0.95). We hypothesize that, were we to use a larger number of human experts to evaluate any given sentence, we could implement an artificial-intelligence curator that would perform the classification job at least as accurately as an average individual human evaluator. We illustrated our analysis by visualizing the predicted accuracy of the text-mined relations involving the term cocaine.
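As a rough illustration of the reported metric, and assuming that "ROC score" refers to the area under the ROC curve, the sketch below computes that area as the probability that a randomly chosen correct fact receives a higher classifier score than a randomly chosen incorrect one. The function name `roc_auc`, the scores, and the labels are hypothetical and are not taken from the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Returns the fraction of (correct, incorrect) pairs in which the
    correct fact is scored higher; ties count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect fact")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: classifier confidence per extracted fact, and the
# curator's verdict (1 = correctly extracted, 0 = incorrectly extracted).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
print(f"AUC = {auc:.3f}")
```

A value of 1.0 would mean every correct fact outranks every incorrect one, and 0.5 corresponds to random ranking. The quadratic pairwise loop is chosen for clarity; a rank-based computation would scale better to a data set of the size described above.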