Large classifier systems in bio- and cheminformatics

Large classifier systems are machine learning systems that combine multiple classifiers, with the goal of improving predictive performance on data sets of complex structure. This thesis introduces two algorithms that employ large classifier systems for multi-label and multi-relational classification. The usefulness of the approaches is demonstrated on two problems from bio- and cheminformatics: the prediction of biodegradation and the prediction of toxic effects.
