Artificial intelligence powered statistical genetics in biobanks

Large-scale, sometimes nationwide, prospective genomic cohorts that biobank rich biological specimens such as blood, urine, and tissues have been established in several countries and have released vast amounts of data. These genetic and epidemiological resources are expected to allow investigators to disentangle the genetic and environmental components that confer risk of common complex diseases. There are, however, two major challenges for statistical genetics in pursuing this goal: (1) small sample size combined with high dimensionality, and (2) multilayered, heterogeneous endophenotypes. Rather counterintuitively, biobank data generally have a small sample size relative to their dimensionality, which comprises genomic variation, lifestyle questionnaire items, and sometimes their interactions. This is a widely acknowledged difficulty in data analysis, known as the “p ≫ n problem” in statistics or the “curse of dimensionality” in machine learning. At the same time, biobanks hold very many measurements of individual health status, that is, endophenotypes, such as health check-up data, images, and psychological test scores, in addition to metabolomic and proteomic data. These endophenotypes are rich but hard to handle because they raise the dimensionality further and are substantially correlated with one another, which sometimes obscures the causal relationships among them. We have tried to overcome these problems inherent to biobank data using statistical machine-learning and deep-learning technologies.
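To make the “p ≫ n” difficulty concrete, the following is a minimal sketch on simulated data, not the authors' pipeline. It shows that a design matrix with more variables than samples cannot have rank above n, so an ordinary least-squares fit is not unique, and that ranking variables by marginal correlation with the outcome is one simple way to reduce p below n before fitting a model. The sizes `n` and `p`, the simulated effect sizes, and the helper `marginal_screen` are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: far more variables (p) than samples (n).
n, p = 50, 2000
X = rng.standard_normal((n, p))

# Simulated outcome: only the first 5 variables carry signal (an assumption).
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + 0.1 * rng.standard_normal(n)

# With p > n the design matrix has rank at most n, so the normal
# equations have infinitely many solutions.
rank = np.linalg.matrix_rank(X)


def marginal_screen(X, y, keep):
    """Return indices of the `keep` variables with the largest absolute
    marginal covariance between the standardized variable and centered y."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / len(y)
    return np.argsort(score)[::-1][:keep]


# Keep fewer variables than samples so a conventional model becomes estimable.
selected = marginal_screen(X, y, keep=n - 1)
X_reduced = X[:, selected]
```

After screening, `X_reduced` has fewer columns than rows, so standard regression or penalized methods can be applied to it. Whether the truly associated variables survive the screen depends on the signal strength and on n, which is why sample size relative to dimensionality matters so much in this setting.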
