Combining Position Weight Matrices and Document-Term Matrix for Efficient Extraction of Associations of Methylated Genes and Diseases from Free Text

Background: In a number of diseases, certain genes are reported to be strongly methylated and can therefore serve as diagnostic markers in many cases. Scientific literature in digital form is an important source of information about methylated genes implicated in particular diseases. The large volume of electronic text makes it difficult and impractical to search for this information manually.

Methodology: We developed a novel text mining methodology based on a new concept of position weight matrices (PWMs) for text representation and feature generation. We applied PWMs in conjunction with the document-term matrix to extract associations between methylated genes and diseases from free text with high accuracy. Performance was evaluated on a large, manually classified dataset. Additionally, we developed a web tool, DEMGD, which automates extraction of these associations from free text. DEMGD presents the extracted associations in summary tables and full reports, and tags the text with evidence for genes, diseases and methylation words. The methodology developed in this study can be applied to similar problems of extracting associations from free text.

Conclusion: The new methodology developed in this study allows for efficient identification of associations between concepts. Our method, applied to methylated genes in different diseases, is implemented as a web tool, DEMGD, which is freely available at http://www.cbrc.kaust.edu.sa/demgd/. The data is available for online browsing and download.
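The abstract does not specify how the PWMs are built from text, so the following Python sketch is only one possible reading, borrowed from the way PWMs are used for sequence motifs. It assumes that fixed-width token windows are aligned on an anchor word (here the word "methylated"), that each PWM column holds log-odds of a term at that relative position against its overall frequency in the windows, and that a window is scored by summing the matching entries. The function names, the `<pad>` token, the pseudocount and the window width are all illustrative assumptions, not part of DEMGD.

```python
import math
from collections import Counter

PAD = "<pad>"  # hypothetical padding token for windows near sentence edges


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in doc i."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    index = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for tok in tokenize(doc):
            matrix[i][index[tok]] += 1
    return vocab, matrix


def anchored_windows(tokens, anchor, width):
    """Yield windows of 2*width+1 tokens centred on each occurrence of anchor."""
    padded = [PAD] * width + tokens + [PAD] * width
    for i, tok in enumerate(tokens):
        if tok == anchor:
            yield padded[i:i + 2 * width + 1]


def build_pwm(windows, vocab, pseudocount=1.0):
    """Build one log-odds column per window position (assumed PWM definition)."""
    terms = sorted(set(vocab) | {PAD})
    background = Counter(tok for w in windows for tok in w)
    bg_total = sum(background.values()) + pseudocount * len(terms)
    col_total = len(windows) + pseudocount * len(terms)
    pwm = []
    for pos in range(len(windows[0])):
        counts = Counter(w[pos] for w in windows)
        column = {}
        for term in terms:
            p = (counts[term] + pseudocount) / col_total    # positional frequency
            q = (background[term] + pseudocount) / bg_total  # background frequency
            column[term] = math.log2(p / q)
        pwm.append(column)
    return pwm


def score(pwm, window):
    """Sum the log-odds of each token at its position; unseen terms add 0."""
    return sum(col.get(tok, 0.0) for col, tok in zip(pwm, window))


docs = [
    "brca1 is methylated in breast cancer",
    "mlh1 is methylated in colorectal cancer",
    "the weather is nice in spring",
]
vocab, dtm = document_term_matrix(docs)
windows = [w for d in docs for w in anchored_windows(tokenize(d), "methylated", 2)]
pwm = build_pwm(windows, vocab)
```

In this sketch the document-term matrix captures which terms occur in a document regardless of order, while the PWM score rewards terms that occur at the expected position relative to the anchor. The two could be concatenated into one feature vector for a classifier, though whether the paper combines them this way is not stated in the abstract.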
