Identification of transcription factor contexts in literature using machine learning approaches

Background: Availability of information about transcription factors (TFs) is crucial for genome biology, as TFs play a central role in the regulation of gene expression. While manual literature curation is expensive and labour intensive, the development of semi-automated text mining support is hindered by the unavailability of training data. There have been no studies on how existing data sources (e.g. TF-related data from the MeSH thesaurus and GO ontology) or potentially noisy example data (e.g. protein-protein interaction, PPI) could be used to provide training data for identification of TF contexts in literature.

Results: In this paper we describe a text-classification system designed to automatically recognise contexts related to transcription factors in literature. The learning model is based on a set of biological features (e.g. protein and gene names, interaction words, other biological terms) that are deemed relevant for the task. We have exploited background knowledge from existing biological resources (MeSH and GO) to engineer such features. Weak and noisy training datasets have been collected from descriptions of TF-related concepts in MeSH and GO, PPI data and data representing non-protein-function descriptions. Three machine-learning methods are investigated, along with a vote-based merging of individual approaches and/or different training datasets. The system achieved highly encouraging results, with most classifiers achieving an F-measure above 90%.

Conclusions: The experimental results have shown that the proposed model can be used for identification of TF-related contexts (i.e. sentences) with high accuracy, with a significantly reduced set of features compared with a traditional bag-of-words approach. The results of considering existing PPI data suggest that TF and PPI contexts are less similar than we had expected. We have also shown that existing knowledge sources are useful both for feature engineering and for obtaining noisy positive training data.
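
The general idea of combining a small set of biological features with a vote over several classifiers can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the lexicons, the three rule-based stand-ins for trained classifiers, and the function names are hypothetical and are not taken from the paper, which does not specify its implementation at this level.

```python
from collections import Counter

# Hypothetical mini-lexicons; the described system derives its terms from MeSH and GO.
PROTEIN_NAMES = {"p53", "nf-kb", "gata1"}
INTERACTION_WORDS = {"binds", "activates", "represses", "regulates"}
TF_TERMS = {"transcription", "promoter", "enhancer"}


def extract_features(sentence):
    """Count how many tokens fall into each hand-picked biological category."""
    tokens = [t.strip(".,;()").lower() for t in sentence.split()]
    return {
        "protein_names": sum(t in PROTEIN_NAMES for t in tokens),
        "interaction_words": sum(t in INTERACTION_WORDS for t in tokens),
        "tf_terms": sum(t in TF_TERMS for t in tokens),
    }


# Three toy rules standing in for trained classifiers (assumption, not the paper's models).
def clf_a(f):
    return "TF" if f["tf_terms"] >= 1 else "other"


def clf_b(f):
    return "TF" if f["protein_names"] >= 1 and f["interaction_words"] >= 1 else "other"


def clf_c(f):
    return "TF" if sum(f.values()) >= 3 else "other"


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def classify(sentence):
    """Label a sentence by majority vote over the three toy classifiers."""
    features = extract_features(sentence)
    return majority_vote([clf(features) for clf in (clf_a, clf_b, clf_c)])
```

Calling `classify` on a sentence returns either `"TF"` or `"other"`. Compared with a bag-of-words representation, the feature vector here has only three dimensions, which mirrors, in a very reduced form, the reduced feature set mentioned above.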
