Active Learning Strategies for Phenotypic Profiling of High-Content Screens

High-content screening is a powerful method to discover new drugs and carry out basic biological research. Increasingly, high-content screens have come to rely on supervised machine learning (SML) to perform automatic phenotypic classification as an essential step of the analysis. However, this comes at a cost: the labeled examples required to train the predictive model. Classification performance increases with the number of labeled examples, and because labeling examples demands time from an expert, the training process represents a significant time investment. Active learning strategies attempt to overcome this bottleneck by presenting the most relevant examples to the annotator, thereby achieving high accuracy while minimizing the cost of obtaining labeled data. In this article, we investigate the impact of active learning on single-cell–based phenotype recognition, using data from three large-scale RNA interference high-content screens representing diverse phenotypic profiling problems. We consider several combinations of active learning strategies and popular SML methods. Our results show that active learning significantly reduces the time cost and can be used to reveal the same phenotypic targets identified using SML. We also identify combinations of active learning strategies and SML methods that perform better than others on the phenotypic profiling problems we studied.
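
To make the idea of "presenting the most relevant examples to the annotator" concrete, the following is a minimal sketch of one common active learning strategy, pool-based uncertainty sampling with a least-confidence criterion. It is not the article's implementation: the nearest-centroid classifier, the function names (`predict_proba`, `active_learning`), the `oracle` callback standing in for the expert annotator, and the toy data are all assumptions chosen for illustration.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def predict_proba(x, centroids):
    """Softmax over negative squared distances to each class centroid.

    This is a toy stand-in for a real SML classifier (assumption).
    """
    scores = [-sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def active_learning(pool, oracle, seed_idx, n_queries):
    """Label the seed examples, then repeatedly query the pool example
    whose most probable class has the lowest probability (least confidence).

    pool      -- list of feature vectors (e.g. per-cell measurements)
    oracle    -- function returning the true label; plays the annotator
    seed_idx  -- indices of the initially labeled examples
    n_queries -- number of additional labels to request
    """
    labeled = {i: oracle(pool[i]) for i in seed_idx}
    for _ in range(n_queries):
        candidates = [i for i in range(len(pool)) if i not in labeled]
        if not candidates:
            break
        # Retrain the toy classifier on everything labeled so far.
        classes = sorted(set(labeled.values()))
        centroids = [
            centroid([pool[i] for i, y in labeled.items() if y == c])
            for c in classes
        ]
        # Query the example the current model is least sure about.
        pick = min(
            candidates,
            key=lambda i: max(predict_proba(pool[i], centroids)),
        )
        labeled[pick] = oracle(pool[pick])
    return labeled


# Toy one-dimensional "phenotype" data: class 1 if the feature is >= 5.
pool = [[0.0], [1.0], [4.9], [9.0], [10.0]]
labels = active_learning(
    pool, oracle=lambda x: int(x[0] >= 5), seed_idx=[0, 4], n_queries=1
)
print(sorted(labels.items()))
```

In this toy run the single query goes to the point nearest the class boundary (`[4.9]`) rather than to points the model already classifies confidently, which is the mechanism by which active learning aims to reduce the number of labels an expert must provide.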
