PACE: Probabilistic Assessment for Contributor Estimation - A machine learning-based assessment of the number of contributors in DNA mixtures

The deconvolution of DNA mixtures remains one of the most critical challenges in forensic DNA analysis. Of all the data features required to perform such deconvolution, the number of contributors in the sample is widely considered the most important and, if incorrectly chosen, the most likely to negatively influence the interpretation of a mixed DNA profile. Unfortunately, most current approaches to mixture deconvolution assume that the analyst knows the number of contributors, an assumption that can prove especially faulty for increasingly complex mixtures of three or more contributors. In this study, we propose a probabilistic approach for estimating the number of contributors in a DNA mixture that leverages the strengths of machine learning. To assess this approach, we compare the classification performance of six machine learning algorithms and evaluate the model from the top-performing algorithm against the current state of the art in contributor number classification. Overall results show over 98% accuracy in identifying the number of contributors in DNA mixtures of up to four contributors. Comparative results show that, relative to the current best-in-field methodology, classification accuracy improved by over 6% for 3-person mixtures and by over 20% for 4-person mixtures. The Probabilistic Assessment for Contributor Estimation (PACE) also classifies mixtures of up to four contributors in less than 1 s on a standard laptop or desktop computer. Given these high classification accuracy rates, and given that the current state-of-the-art model requires a significant time commitment whereas a machine learning-derived model requires only seconds, the approach described herein provides a promising means of estimating the number of contributors and, subsequently, will lead to improved DNA mixture interpretation.
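To make the idea of a probabilistic estimate concrete, the following is a minimal sketch, not the PACE model itself. It assumes two hypothetical profile features (the maximum allele count at any locus and the total allele count across loci), an invented toy training set, and a simple k-nearest-neighbour vote standing in for whichever algorithm performed best in the study. It returns a probability for each candidate number of contributors instead of a single hard label.

```python
import math
from collections import Counter

# Hypothetical toy training data (illustrative values only, not from the study):
# (max alleles at any locus, total alleles across loci) -> number of contributors
TRAIN = [
    ((2, 28), 1), ((2, 30), 1), ((2, 26), 1),
    ((4, 48), 2), ((4, 52), 2), ((3, 45), 2),
    ((6, 66), 3), ((5, 62), 3), ((6, 70), 3),
    ((7, 80), 4), ((8, 84), 4), ((7, 78), 4),
]


def predict_proba(features, train=TRAIN, k=3):
    """Return {number_of_contributors: probability} from a k-nearest-neighbour vote.

    Features are compared with unscaled Euclidean distance, which is a
    simplification; a real model would normalise or weight its features.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    classes = sorted({label for _, label in train})
    return {label: votes.get(label, 0) / k for label in classes}


def predict(features, train=TRAIN, k=3):
    """Return the most probable number of contributors."""
    proba = predict_proba(features, train=train, k=k)
    return max(proba, key=proba.get)


if __name__ == "__main__":
    profile = (6, 75)  # hypothetical profile lying between the 3- and 4-person examples
    print(predict_proba(profile))
    print(predict(profile))
```

Reporting the full distribution lets an analyst see when two contributor counts are nearly equally plausible, which a single predicted label would hide.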