Data Integration in Multi-dimensional Data Sets: Informational Asymmetry in the Valid Correlation of Subdivided Samples

Background: Flow cytometry is the only currently available high-throughput technology that can measure multiple physical and molecular characteristics of individual cells. It is common in flow cytometry to measure a relatively large number of characteristics, or features, by performing separate experiments on subdivided samples. Correlating data from multiple experiments through certain shared features (e.g., cell size) could provide useful information on how the unshared features combine. Such correlations, however, are not always reliable. Methods: We developed a method to assess correlation reliability by estimating the percentage of cells that can be unambiguously correlated between two samples. The method was evaluated using 81 pairs of subdivided samples of microspheres (artificial cells) with known molecular characteristics. Results: A strong correlation (R = 0.85) was found between the estimated and actual percentages of unambiguously correlated cells. Conclusion: The correlation reliability measure we developed can be used to support data integration across experiments on subdivided samples.
