Positive unlabelled learning with applications in computational biology

As the amount of stored information grows, data mining techniques have become a key component in many fields. Classifier induction algorithms are very useful tools because they condense the information contained in databases into classifiers that can then be used to make predictions on new data. One application of classifier induction algorithms is information retrieval, which can be defined as the retrieval of objects of a given type (those we are interested in, usually called 'positives') from large sets of unlabelled objects (that is, objects whose class is unknown). Classical approaches require both positive examples (examples of the type of object we want to retrieve) and negative examples (examples of objects different from those we want to retrieve), but negative examples are not always available. For this reason, algorithms that learn binary classifiers in the absence of negative examples have been developed in recent years. The topic of this thesis is learning from positive and unlabelled examples. Its contributions span the induction of classification models, classifier averaging, feature selection and classifier evaluation. In the applied part, some of the proposed algorithms are used to solve two problems in biology: the identification of disease-associated genes and of genes involved in cancer.
