Estimating Phred scores of Illumina base calls by logistic regression and sparse modeling

Background: Phred quality scores are essential for downstream DNA analyses such as SNP detection and DNA assembly, so a valid model for defining them is indispensable for any base-calling software. We recently developed 3Dec, a base-caller for Illumina sequencing platforms that reduces base-calling errors by 44-69% compared with existing base-callers. However, the model for predicting its quality scores has not yet been fully investigated.

Results: In this study, we used logistic regression models to estimate quality scores from predictive features, which cover different aspects of the sequencing signals as well as the local DNA content. Sparse models were then obtained by three methods: backward deletion with either AIC or BIC, and L1-regularized learning. The L1-regularized model was then compared with the Illumina scoring method.

Conclusions: Compared with the Illumina scoring method, the L1-regularized logistic regression improves the empirical discrimination power by as much as 14% and 25%, respectively, for two kinds of preprocessed sequencing signals. That is, the L1 method identifies more high-fidelity base calls. Computationally, the L1 method can handle large datasets and is efficient enough for daily sequencing. Meanwhile, the logistic model resulting from BIC is more interpretable. The modeling suggests that the most prominent quenching pattern in the current Illumina chemistry occurs at the dinucleotide "GT". In addition, a nucleotide was more likely to be miscalled as the previous base when that preceding base was not "G". This suggests that the phasing effect of bases after "G" differs somewhat from that after other nucleotide types.
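The approach summarized above can be illustrated with a minimal sketch: fit an L1-penalized logistic regression that predicts the probability that a base call is wrong, then convert that probability p to a Phred score with the standard definition Q = -10 log10(p). Everything specific here is an assumption made for illustration and is not taken from 3Dec: the three synthetic features, the penalty weight `lam`, the cap `q_max`, and the helper names. The solver is plain proximal gradient descent (ISTA), chosen only because it fits in a few lines of standard-library Python; it is not claimed to be the solver used in the study.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal operator of the L1 penalty: shrinks w toward 0 by t.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic(X, y, lam=0.05, step=0.5, n_iter=300):
    """Proximal gradient (ISTA) for L1-penalized logistic regression.

    y[i] = 1 marks a miscalled base, so the model predicts error
    probability. The intercept b is not penalized.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            r = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            gb += r
            for j in range(d):
                gw[j] += r * xi[j]
        b -= step * gb / n
        w = [soft_threshold(wj - step * gj / n, step * lam)
             for wj, gj in zip(w, gw)]
    return w, b


def phred(p_error, q_max=60.0):
    # Q = -10 * log10(p); p is floored so that Q never exceeds q_max
    # (the cap is an illustrative assumption).
    p = max(p_error, 10 ** (-q_max / 10.0))
    return -10.0 * math.log10(p)


def demo(seed=0, n=400):
    # Synthetic data (hypothetical): only feature 0 affects the error rate.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(3)]
        p_err = sigmoid(-2.0 + 2.0 * x[0])
        X.append(x)
        y.append(1 if rng.random() < p_err else 0)
    return fit_l1_logistic(X, y)


w, b = demo()
x_new = [-1.0, 0.0, 0.0]  # a hypothetical base call with a clean signal
p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, x_new)))
print("weights:", [round(v, 3) for v in w], "intercept:", round(b, 3))
print("predicted error prob: %.4f, Phred Q: %.1f" % (p, phred(p)))
```

The L1 penalty shrinks the weights of uninformative features toward zero, which is what makes the fitted model sparse. On real data the feature vector would instead hold signal-derived and sequence-context predictors of the kind the abstract describes.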
