Correlation for tree-shaped datasets and its Bayesian estimation

Abstract: Tree-shaped datasets arise in many research and industrial fields, for example gene expression data measured on a cell lineage tree and information spreading along tree-shaped paths. A correlation measure between two tree-shaped datasets, i.e., a measure of how the values increase or decrease together along corresponding paths of the two trees, is desirable, but the tree topology precludes the use of classical vector-based correlation measures such as the Pearson correlation coefficient. To this end, a statistical framework for measuring such tree correlation is proposed. As a specific model within this framework, a parametric model based on bivariate Gaussian distributions is provided, and a Bayesian approach to parameter estimation is introduced. The model allows the degree of coupling between corresponding nodes to change with the depth of the tree, and it provides an intuitive mapping from the trend similarity of the values along two trees to the classical Pearson correlation. A Metropolis-within-Gibbs algorithm is used to obtain the posterior estimates. Extensive simulations and in-depth sensitivity analyses demonstrate the validity and robustness of the method. Furthermore, an application to embryonic gene expression datasets shows that this tree similarity measure aligns well with the biological properties.
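To make the estimation idea concrete, the following is a minimal sketch of a Metropolis-within-Gibbs sampler for a much simpler problem than the one in the abstract. It is not the paper's model: it assumes the values at corresponding nodes can be treated as independent zero-mean bivariate Gaussian pairs with a shared variance `sigma2` and a single correlation `rho`, and it ignores the tree structure and the depth-dependent coupling. The priors (uniform on `rho`, inverse-gamma on `sigma2`), the function names, and all tuning values are assumptions chosen for illustration.

```python
import numpy as np

def simulate_pairs(n, rho, sigma2=1.0, seed=1):
    """Draw n zero-mean bivariate Gaussian pairs (hypothetical toy data)."""
    rng = np.random.default_rng(seed)
    cov = sigma2 * np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    return z[:, 0], z[:, 1]

def log_lik(rho, sigma2, sxx, syy, sxy, n):
    """Log-likelihood (up to a constant) of n pairs with covariance
    sigma2 * [[1, rho], [rho, 1]]."""
    q = sxx - 2.0 * rho * sxy + syy
    return (-n * np.log(sigma2)
            - 0.5 * n * np.log1p(-rho ** 2)
            - q / (2.0 * sigma2 * (1.0 - rho ** 2)))

def mwg_sampler(x, y, n_iter=4000, step=0.05, a=1.0, b=1.0, seed=0):
    """Metropolis-within-Gibbs for (rho, sigma2).

    Assumed priors: rho ~ Uniform(-1, 1), sigma2 ~ Inverse-Gamma(a, b).
    Returns an array of shape (n_iter, 2) with columns (rho, sigma2).
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    sxx, syy, sxy = np.sum(x * x), np.sum(y * y), np.sum(x * y)
    rho, sigma2 = 0.0, 1.0
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # Gibbs step: sigma2 | rho is Inverse-Gamma(a + n, b + q / (2 (1 - rho^2))).
        q = sxx - 2.0 * rho * sxy + syy
        rate = b + q / (2.0 * (1.0 - rho ** 2))
        sigma2 = 1.0 / rng.gamma(a + n, 1.0 / rate)

        # Metropolis step: random-walk proposal for rho; proposals outside
        # (-1, 1) have zero prior density and are rejected.
        prop = rho + step * rng.normal()
        if abs(prop) < 1.0:
            log_acc = (log_lik(prop, sigma2, sxx, syy, sxy, n)
                       - log_lik(rho, sigma2, sxx, syy, sxy, n))
            if np.log(rng.uniform()) < log_acc:
                rho = prop
        draws[t] = rho, sigma2
    return draws

# Usage: simulate toy data, run the sampler, and discard a burn-in period.
x, y = simulate_pairs(n=500, rho=0.7)
draws = mwg_sampler(x, y)
posterior_rho = draws[1000:, 0]
print("posterior mean of rho:", posterior_rho.mean())
```

The variance has a closed-form conditional and is updated by a Gibbs draw, while the correlation has no convenient conditional and is updated by a Metropolis step inside the same sweep. That mixture of update types is what the term Metropolis-within-Gibbs refers to.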
