SAVI: a statistical algorithm for variant frequency identification

Background

Many problems in biomedical research can be posed as a comparison between related samples (healthy vs. disease, subtypes of the same disease, longitudinal data representing the progression of a disease, etc.). In the cases in which the distinction has a genetic or epigenetic basis, next-generation sequencing technologies have become a major tool for obtaining the difference between the samples. A commonly occurring application is the identification of somatic mutations occurring in tumor tissue samples driving a single cell to expand clonally. In this case, the progression of the disease can be traced through the trajectory of the frequency of the oncogenic alleles. Thus, obtaining precise estimates of the frequency of abnormal alleles at various stages of the disease is paramount to understanding the processes driving it. Although the procedure is conceptually simple, technical difficulties arise due to inhomogeneous samples, existence of competing subclonal populations, and systematic and non-systematic errors introduced by the sequencing technologies.

Results

We present a method, Statistical Algorithm for Variant Frequency Identification (SAVI), to estimate the frequency of alleles in a set of samples. The method employs Bayesian analysis and uses an iterative procedure to derive empirical priors. The approach allows for the comparison of allele frequencies across several samples, e.g. normal/tumor pairs and more complex experimental designs comparing multiple samples in tumor progression, as well as analyzing sequencing data from RNA sequencing experiments.

Conclusions

Analyzing sequencing data through estimating allele frequencies using empirical Bayes methods is a powerful complement to the ever-increasing throughput of the sequencing technologies.
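The empirical Bayes idea named in the Results, deriving a prior iteratively from the data themselves, can be illustrated with a minimal sketch. It is not the SAVI implementation: it assumes a plain binomial read model with no sequencing-error or base-quality term, a discretized frequency grid, and a flat starting prior, and every function name and parameter here (`make_grid`, `site_posterior`, `empirical_prior`, `posterior_mean`, `num_bins`, `iterations`) is hypothetical. Each pass replaces the prior with the average of the per-site posteriors, which is one standard way to estimate a prior from the observed counts.

```python
import math


def make_grid(num_bins=100):
    """Midpoints of equal-width allele-frequency bins on (0, 1)."""
    return [(j + 0.5) / num_bins for j in range(num_bins)]


def site_posterior(variant_reads, depth, grid, prior):
    """Posterior over grid frequencies for one site.

    Assumption: reads are binomial draws with success probability f;
    the binomial coefficient cancels in the normalization and is omitted.
    """
    log_post = []
    for f, p in zip(grid, prior):
        if p > 0.0:
            log_lik = (variant_reads * math.log(f)
                       + (depth - variant_reads) * math.log(1.0 - f))
            log_post.append(math.log(p) + log_lik)
        else:
            log_post.append(-math.inf)
    peak = max(log_post)  # subtract the peak to avoid underflow
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def empirical_prior(counts, grid, iterations=20):
    """Iteratively re-estimate the prior as the mean of the site posteriors.

    counts is a list of (variant_reads, depth) pairs, one per site.
    """
    prior = [1.0 / len(grid)] * len(grid)  # flat starting prior
    for _ in range(iterations):
        accumulated = [0.0] * len(grid)
        for variant_reads, depth in counts:
            post = site_posterior(variant_reads, depth, grid, prior)
            for j, p in enumerate(post):
                accumulated[j] += p
        prior = [a / len(counts) for a in accumulated]
    return prior


def posterior_mean(variant_reads, depth, grid, prior):
    """Posterior mean allele frequency for one site under the given prior."""
    post = site_posterior(variant_reads, depth, grid, prior)
    return sum(f * p for f, p in zip(grid, post))


# Toy data (made up for illustration): mostly reference-only sites
# plus a few sites where about half the reads carry the variant.
toy_counts = [(0, 50)] * 20 + [(25, 50)] * 5
toy_grid = make_grid()
toy_prior = empirical_prior(toy_counts, toy_grid)
```

With a prior learned this way, a site with a single variant read is pulled toward low frequencies because most sites in the toy data show no variant at all, while a site with half its reads carrying the variant keeps an estimate near 0.5. Comparing frequencies across samples, as the abstract describes, would need a joint model that this sketch does not attempt.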
