Mathematical modeling of the diversity in human B and T cell receptors using machine learning

We propose an empirical approach, based on the analysis of next-generation sequencing (NGS) data, for describing the number of distinct clones of B and T cell receptors in the human immune system. The status of a human immune system is defined, among other features, by the diversity of these receptors. A well-known issue is that NGS data have a comparatively high error rate, so the number of distinct sequences found in sequencing data keeps rising with the number of sequences measured by second-generation sequencers. We present a modeling approach that formulates the number of distinct clones as a function of the number of read sequences and accounts for two effects. On the one hand, there is a true number of distinct sequences, which is reached asymptotically as the number of reads increases; on the other hand, the number of randomly found sequences rises linearly due to read errors. The parameters of this combined model are identified by parameter optimization using evolution strategies. The modeling approach is evaluated on immune status data of several human patients. Additionally, the results are compared to those produced by machine learning methods.
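The combined model described above can be illustrated with a minimal sketch. The abstract does not state the functional form, so the saturating term a·(1 − exp(−b·n)), the parameter names a, b and c, the synthetic data, and the simple (1+λ) evolution strategy below are all assumptions chosen for illustration; they are not the authors' implementation.

```python
import math
import random


def model(n, a, b, c):
    """Assumed form: saturating true diversity plus a linear read-error term."""
    return a * (1.0 - math.exp(-b * n)) + c * n


def mse(params, data):
    """Mean squared error between the model and (reads, distinct) pairs."""
    a, b, c = params
    total = 0.0
    for n, distinct in data:
        diff = model(n, a, b, c) - distinct
        total += diff * diff
    return total / len(data)


def fit_es(data, start, sigma=0.3, generations=500, offspring=10, seed=0):
    """Hypothetical (1+lambda) evolution strategy.

    Mutation is multiplicative (log-normal), so positive parameters stay
    positive. The step size grows after a successful generation and
    shrinks otherwise.
    """
    rng = random.Random(seed)
    best = list(start)
    best_err = mse(best, data)
    for _ in range(generations):
        improved = False
        parent = best
        for _ in range(offspring):
            child = [p * math.exp(sigma * rng.gauss(0.0, 1.0)) for p in parent]
            err = mse(child, data)
            if err < best_err:
                best, best_err, improved = child, err, True
        sigma *= 1.2 if improved else 0.9
        sigma = min(max(sigma, 1e-8), 1.0)  # keep the step size bounded
    return best, best_err


# Synthetic, noise-free data generated from assumed parameters
# (not patient data): a = 5000 clones, b = 1e-4, c = 0.01 errors per read.
data = [(n, model(n, 5000.0, 1e-4, 0.01)) for n in range(10000, 200001, 10000)]
start = [1000.0, 1e-3, 0.1]
best, best_err = fit_es(data, start)
print("fitted a, b, c:", best)
print("start MSE:", mse(start, data), "final MSE:", best_err)
```

In this sketch, the fitted a plays the role of the asymptotic number of true distinct sequences, while c estimates how many spurious sequences each additional read contributes. Because the strategy is elitist, the final error can never exceed the error of the starting point, although recovering the generating parameters is not guaranteed.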
