Paper information - Efficient Maximum Likelihood Estimation for Pedigree Data with the Sum-Product Algorithm

Objective: We analyze pedigree data sets in which the phenotype is the age at onset of colorectal cancer (CRC). Familial clustering of CRC suggests the existence of a latent, heritable risk factor. We aimed to estimate the probability that a family carries this risk factor and the increase in hazard rate for its carriers. Because the risk factor is inherited, the estimation requires a costly marginalization of the likelihood.

Methods: We propose an improved EM algorithm that uses factor graphs and the sum-product algorithm in the E-step. This reduces the computational complexity from exponential to linear in the number of family members.

Results: In a simulation study and a real family study on CRC risk, our algorithm is as precise as direct likelihood maximization. For 250 simulated families of size 19 and 21, our algorithm is faster by a factor of 4 and 29, respectively. On the largest family in the real data (23 members), it is 6 times faster.

Conclusion: We introduce a flexible and runtime-efficient tool for statistical inference on biomedical event data with latent variables, which opens the door to advanced analyses of pedigree data.
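The reduction from exponential to linear cost can be illustrated with a small sketch. It is not the paper's model. It assumes a binary latent carrier status, a tree in which each individual has a single modeled parent (real pedigrees have two parents per child), and made-up probabilities. The function names and all numbers are hypothetical. The sketch computes only the marginal likelihood with one upward sum-product pass and checks it against brute-force enumeration. An E-step would additionally need posterior marginals, which a second, downward pass would provide.

```python
from itertools import product


def likelihood_sum_product(parent, prior, trans, emit):
    """Marginal likelihood of a tree-structured binary latent model.

    parent[i] is the index of i's parent (None for the root, node 0).
    Parents must have smaller indices than their children.
    prior[z] is P(root = z), trans[zp][zc] is P(child = zc | parent = zp),
    and emit[i][z] is the likelihood of individual i's data given state z.
    """
    n = len(parent)
    # msg[i][z]: likelihood of all data in the subtree of i, given z_i = z.
    msg = [list(emit[i]) for i in range(n)]
    # Children are visited before their parents, so each message is complete
    # when it is sent. One pass over the nodes: cost is linear in n.
    for i in range(n - 1, 0, -1):
        p = parent[i]
        for zp in (0, 1):
            msg[p][zp] *= sum(trans[zp][zc] * msg[i][zc] for zc in (0, 1))
    return sum(prior[z] * msg[0][z] for z in (0, 1))


def likelihood_brute_force(parent, prior, trans, emit):
    """Same quantity by summing over all 2**n latent configurations."""
    n = len(parent)
    total = 0.0
    for z in product((0, 1), repeat=n):
        weight = prior[z[0]]
        for i in range(n):
            if i > 0:
                weight *= trans[z[parent[i]]][z[i]]
            weight *= emit[i][z[i]]
        total += weight
    return total


# Hypothetical six-person family: node 0 is the founder.
parent = [None, 0, 0, 1, 1, 2]
prior = [0.9, 0.1]                   # assumed P(non-carrier), P(carrier)
trans = [[0.99, 0.01], [0.5, 0.5]]   # assumed transmission probabilities
emit = [[0.8, 0.3], [0.9, 0.2], [0.1, 0.7],
        [0.85, 0.4], [0.2, 0.9], [0.95, 0.5]]  # assumed data likelihoods

fast = likelihood_sum_product(parent, prior, trans, emit)
slow = likelihood_brute_force(parent, prior, trans, emit)
print(fast, slow)
```

Both functions return the same value up to floating-point rounding, but the brute-force loop visits 2**n configurations while the message-passing loop touches each family member once.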
