Convolutional Embedded Networks for Population Scale Clustering and Bio-ancestry Inferencing.

The study of genetic variants (GVs) can help find correlated population groups, identify cohorts that are predisposed to common diseases, and explain differences in disease susceptibility and in how patients react to drugs. Machine learning techniques are increasingly being applied to identify interacting GVs and to understand their complex phenotypic traits. Because the performance of a learning algorithm depends not only on the size and nature of the data but also on the quality of the underlying representation, deep neural networks (DNNs) are attractive here: they can learn non-linear mappings that transform GV data into representations better suited to clustering and classification than manually selected features. In this paper, we propose convolutional embedded networks (CEN), which combine two DNN architectures: convolutional embedded clustering (CEC) for clustering individuals and a convolutional autoencoder (CAE) classifier for predicting geographic ethnicity based on GVs. We applied CAE-based representation learning to 95 million GVs from the ‘1000 Genomes’ (covering 2,504 individuals from 26 ethnic origins) and ‘Simons Genome Diversity’ (covering 279 individuals from 130 ethnic origins) projects. Quantitative and qualitative analyses with a focus on accuracy and scalability show that our approach outperforms state-of-the-art approaches such as VariantSpark and ADMIXTURE. In particular, CEC can cluster targeted population groups in 22 hours with an adjusted Rand index (ARI) of 0.915, a normalized mutual information (NMI) of 0.92, and a clustering accuracy (ACC) of 89 percent. The CAE classifier, in turn, can predict the geographic ethnicity of unknown samples with an F1 score of 0.9004 and a Matthews correlation coefficient (MCC) of 0.8245. Further, to provide interpretations of the predictions, we identify significant biomarkers using gradient boosted trees (GBT) and SHapley Additive exPlanations (SHAP). Overall, our approach is transparent, faster than the baseline methods, and scalable from 5 to 100 percent of the full human genome.
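To make the clustering metrics quoted above concrete, the following is a minimal sketch of how ARI and clustering accuracy (ACC) can be computed from ground-truth population labels and predicted cluster ids. It is an illustration under stated assumptions, not the evaluation code used in the paper: the function names and the toy labels are hypothetical, and the best cluster-to-class mapping is found by brute force over permutations, which is only practical for a handful of clusters. For realistic cluster counts (such as the 26 populations mentioned above), an assignment solver like `scipy.optimize.linear_sum_assignment` would be the usual substitute.

```python
from collections import Counter
from itertools import permutations
from math import comb


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(true_labels)
    if n < 2 or n != len(pred_labels):
        raise ValueError("need two labelings of equal length >= 2")
    # Pairs of samples that share both a class and a cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    # Pairs that share a class, and pairs that share a cluster.
    sum_rows = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single class and a single cluster
    return (sum_cells - expected) / (max_index - expected)


def clustering_accuracy(true_labels, pred_labels):
    """Fraction of samples matched under the best one-to-one cluster/class mapping.

    Assumption: brute force over permutations stands in for the Hungarian
    method, so this is only usable for a small number of clusters.
    """
    n = len(true_labels)
    if n == 0 or n != len(pred_labels):
        raise ValueError("need two non-empty labelings of equal length")
    overlap = Counter(zip(pred_labels, true_labels))
    clusters = list(set(pred_labels))
    classes = list(set(true_labels))
    # Pad the shorter side with unique dummies so the mapping is one-to-one.
    k = max(len(clusters), len(classes))
    clusters += [object() for _ in range(k - len(clusters))]
    classes += [object() for _ in range(k - len(classes))]
    best = max(
        sum(overlap[(c, t)] for c, t in zip(clusters, perm))
        for perm in permutations(classes)
    )
    return best / n


# Hypothetical toy data: 9 samples, 3 populations, one sample misclustered.
toy_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
toy_pred = [1, 1, 1, 2, 2, 0, 0, 0, 0]
toy_ari = adjusted_rand_index(toy_true, toy_pred)
toy_acc = clustering_accuracy(toy_true, toy_pred)
```

Both functions ignore the actual values of the cluster ids, so a clustering that recovers every population under different ids still scores 1.0 on both metrics.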
