Efficient Maximum Likelihood Estimation for Pedigree Data with the Sum-Product Algorithm

Objective: We analyze data sets consisting of pedigrees with age at onset of colorectal cancer (CRC) as the phenotype. The occurrence of familial clusters of CRC suggests the existence of a latent, heritable risk factor. We aim to estimate the probability that a family carries this risk factor and the increase in hazard rate for carriers. Because the risk factor is heritable, estimation requires a costly marginalization of the likelihood.

Methods: We propose an improved EM algorithm that applies factor graphs and the sum-product algorithm in the E-step. This reduces the computational complexity from exponential to linear in the number of family members.

Results: In a simulation study and in a real family study on CRC risk, our algorithm is as precise as direct likelihood maximization. For 250 simulated families of size 19 and 21, our algorithm is faster by factors of 4 and 29, respectively. On the largest family in the real data (23 members), it is 6 times faster.

Conclusion: We introduce a flexible and runtime-efficient tool for statistical inference in biomedical event data with latent variables, which opens the door to advanced analyses of pedigree data.
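The reduction from exponential to linear cost can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy pedigree in which each person has at most one parent in the model (so the graph is a tree), a binary latent carrier status, and a hypothetical inheritance rule: a carrier passes the risk factor on with probability 0.5, a non-carrier never does. The names `transition`, `brute_force`, and `sum_product`, the prior, and the per-person likelihood values are all made up for illustration.

```python
from itertools import product


def transition(z_child, z_parent, t=0.5):
    """P(child status | parent status) under the assumed toy inheritance rule."""
    p_carrier = t if z_parent == 1 else 0.0
    return p_carrier if z_child == 1 else 1.0 - p_carrier


def brute_force(parent, lik, prior):
    """Marginal likelihood by summing over all 2**n carrier configurations."""
    n = len(parent)
    total = 0.0
    for z in product((0, 1), repeat=n):
        p = 1.0
        for i in range(n):
            if parent[i] is None:  # founder: prior carrier probability
                p *= prior if z[i] == 1 else 1.0 - prior
            else:  # non-founder: inherits from its parent
                p *= transition(z[i], z[parent[i]])
            p *= lik[i][z[i]]  # likelihood of person i's data given status
        total += p
    return total


def sum_product(parent, lik, prior):
    """Same marginal likelihood with one upward message per person.

    Assumes person 0 is the only founder and every parent has a smaller
    index than its children, so a reverse sweep visits children first.
    """
    n = len(parent)
    msg = [list(pair) for pair in lik]  # msg[i][z]: evidence in i's subtree
    for i in range(n - 1, 0, -1):
        par = parent[i]
        for zp in (0, 1):
            # Sum out the child's status and absorb the result into the parent.
            msg[par][zp] *= sum(transition(zc, zp) * msg[i][zc] for zc in (0, 1))
    return (1.0 - prior) * msg[0][0] + prior * msg[0][1]


# Toy family: person 0 is the founder; 1 and 2 are children of 0;
# 3 and 4 are children of 1; 5 is a child of 2.
parent = [None, 0, 0, 1, 1, 2]
# lik[i] = (likelihood of i's data if non-carrier, likelihood if carrier);
# illustrative numbers only.
lik = [(0.9, 0.4), (0.8, 0.5), (0.95, 0.2), (0.3, 0.7), (0.85, 0.45), (0.6, 0.6)]
prior = 0.1

direct = brute_force(parent, lik, prior)
fast = sum_product(parent, lik, prior)
```

Both functions return the same marginal likelihood up to floating-point error. The enumeration visits 2^n configurations, whereas the message-passing sweep performs a constant amount of work per family member. Real pedigrees with two parents per child need factor nodes that join both parents, which this sketch leaves out.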
