DNA Sequencing via Quantum Mechanics and Machine Learning

Rapid sequencing of individual human genomes is a prerequisite to genomic medicine, in which diseases would be prevented by preemptive cures. Quantum-mechanical tunneling through single-stranded DNA in a solid-state nanopore has been proposed for rapid DNA sequencing, but the tunneling current alone cannot distinguish the four nucleotides because of large fluctuations in molecular conformation and solvent. Here, we propose a machine-learning approach applied to the tunneling current-voltage (I-V) characteristic for efficient discrimination between the four nucleotides. We first combine principal component analysis (PCA) and fuzzy c-means (FCM) clustering to learn the "fingerprints" of the electronic density of states (DOS) of the four nucleotides, which can be derived from the I-V data. We then apply a hidden Markov model and the Viterbi algorithm to sequence a time series of DOS data (i.e., to solve the sequencing problem). Numerical experiments show that the PCA-FCM approach can classify unlabeled DOS data with 91% accuracy. Furthermore, the classification is robust against moderate levels of noise: 70% accuracy is retained at a signal-to-noise ratio of 26 dB. The PCA-FCM-Viterbi approach provides a 4-fold increase in accuracy for the sequencing problem compared with PCA alone. In conjunction with recent developments in nanotechnology, this machine-learning method may pave the way to the much-awaited rapid, low-cost genome sequencer.
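
As an illustration of the Viterbi step mentioned above, the following is a minimal sketch of how a hidden Markov model can smooth a noisy time series of per-frame nucleotide scores. It is not the authors' implementation. The "sticky" transition matrix, the uniform prior, and the per-frame probabilities in `frame_probs` (standing in for soft class scores such as those a clustering step might produce) are hypothetical values chosen only for the example.

```python
import numpy as np

BASES = "ACGT"

def viterbi(log_pi, log_A, log_B):
    """Most likely state path for an HMM, computed in log space.

    log_pi: (S,) log initial-state probabilities
    log_A:  (S, S) log transition probabilities, row = from, column = to
    log_B:  (T, S) log emission likelihood of each frame under each state
    """
    T, S = log_B.shape
    delta = np.empty((T, S))            # best log score ending in each state
    psi = np.zeros((T, S), dtype=int)   # back-pointers to the best predecessor
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: come from i, go to j
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + log_B[t]
    # Backtrack from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path

# Hypothetical per-frame class probabilities (columns: A, C, G, T).
# The second frame is deliberately noisy: its largest score points to C.
frame_probs = np.array([
    [0.85, 0.05, 0.05, 0.05],
    [0.40, 0.45, 0.10, 0.05],
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.05, 0.05, 0.85, 0.05],
])

# Assumed "sticky" transitions: a state persists with probability 0.7.
A = np.full((4, 4), 0.1)
np.fill_diagonal(A, 0.7)
pi = np.full(4, 0.25)  # assumed uniform prior over the four bases

path = viterbi(np.log(pi), np.log(A), np.log(frame_probs))
decoded = "".join(BASES[i] for i in path)
framewise = "".join(BASES[i] for i in frame_probs.argmax(axis=1))
print("frame-wise argmax:", framewise)
print("Viterbi decoding: ", decoded)
```

In this toy setting the frame-by-frame argmax labels the noisy second frame as C, whereas the Viterbi path keeps it as A because the assumed transition model penalizes a brief switch away and back. The same mechanism is what allows temporal context to improve on independent per-frame classification.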
