Generalized Query-Based Active Learning to Identify Differentially Methylated Regions in DNA

Active learning is a supervised learning technique that reduces the number of labeled examples needed to build a successful classifier, because the learner chooses the data it learns from. This technique holds promise for many biological domains in which labeled examples are expensive and time-consuming to obtain. Most traditional active learning methods pose very specific queries: they ask an oracle (e.g., a human expert) to label a single unlabeled example. The example may consist of numerous features, many of which are irrelevant. Removing such features yields a shorter query containing only relevant features, which is easier for the oracle to answer. We propose a generalized query-based active learning (GQAL) approach that constructs generalized queries based on multiple instances. By constructing appropriately generalized queries, we achieve higher accuracy than traditional active learning methods. We apply our active learning method to find differentially methylated regions (DMRs) in DNA. DMRs are locations in the genome that are known to be involved in tissue differentiation, epigenetic regulation, and disease. We also apply our method to 13 other data sets and show that it outperforms another popular active learning technique.
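
The idea of a generalized query can be illustrated with a minimal sketch. It is not the GQAL algorithm itself, whose details the abstract does not give. It only shows how dropping irrelevant features from one instance produces a shorter query that covers several pool instances at once. The feature names (`cpg_density`, `gc_content`, `strand`), the choice of relevant features, and the helper functions are all hypothetical.

```python
from typing import Dict, List, Optional, Set

Instance = Dict[str, str]
Query = Dict[str, Optional[str]]  # None marks a "don't care" feature


def generalize(instance: Instance, relevant: Set[str]) -> Query:
    """Keep the values of relevant features and wildcard the rest."""
    return {f: (v if f in relevant else None) for f, v in instance.items()}


def matches(query: Query, instance: Instance) -> bool:
    """An instance matches if it agrees with every non-wildcard feature."""
    return all(v is None or instance.get(f) == v for f, v in query.items())


def covered(query: Query, pool: List[Instance]) -> List[Instance]:
    """Return every pool instance that a single oracle answer would label."""
    return [x for x in pool if matches(query, x)]


# Hypothetical unlabeled pool; feature names are illustrative only.
pool = [
    {"cpg_density": "high", "gc_content": "high", "strand": "+"},
    {"cpg_density": "high", "gc_content": "high", "strand": "-"},
    {"cpg_density": "low", "gc_content": "high", "strand": "+"},
]

# Assume some relevance estimate has flagged these two features.
query = generalize(pool[0], relevant={"cpg_density", "gc_content"})
hits = covered(query, pool)
```

In this toy pool, the generalized query ignores `strand`, so one answer from the oracle applies to the first two instances rather than to a single example. How relevant features are actually selected is left open here.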
