Matching samples of multiple views

Multi-view learning studies how several views, that is, different feature representations of the same objects, can best be utilized in learning. In other words, multi-view learning is the analysis of co-occurrence data, where the observations are co-occurrences of samples in the views. Standard multi-view learning, such as joint density modeling, cannot be done in the absence of co-occurrence, that is, when the views are observed separately and the identities of the objects are not known. As a practical example, joint analysis of mRNA and protein concentrations requires a mapping between genes and proteins. We introduce a data-driven approach for learning the correspondence of the observations in the different views, in order to enable joint analysis even in the absence of known co-occurrence. The method finds a matching that maximizes statistical dependency between the views, which makes it particularly suitable for multi-view methods such as canonical correlation analysis, which have the same objective. We apply the method to translational metabolomics, where the goal is to identify differences and commonalities in metabolic processes in different species or tissues. The metabolite identities and roles in the different species are not generally known, so a matching must be searched for. Using different metabolomics measurement batches as the views, so that the ground truth is known, we show that the metabolite identities can be reliably matched by a consensus of several matching solutions.
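To make the idea of a dependency-maximizing matching concrete, the following is a minimal sketch and not the authors' implementation. It assumes one plausible scheme: alternate between fitting a regularized linear CCA on the currently paired samples and re-solving a linear assignment problem on the canonical projections, then combine several restarts by voting. The function names (`cca_projections`, `match_views`, `consensus_matching`) and the parameters `k`, `reg`, `n_iter`, and `seed` are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def cca_projections(X, Y, k, reg=1e-3):
    """Return projection matrices for the top-k canonical directions of paired X, Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    # Regularized covariance estimates (the regularizer is an assumption of this sketch).
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten both views with Cholesky factors, then take the SVD of the
    # whitened cross-covariance.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, _, Vt = np.linalg.svd(M)
    Wx = np.linalg.solve(Lx.T, U[:, :k])
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :k])
    return Wx, Wy


def match_views(X, Y, k=2, n_iter=20, seed=0):
    """Alternate CCA and assignment; perm[i] is the row of Y matched to row i of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    perm = rng.permutation(n)  # random initial matching
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    for _ in range(n_iter):
        Wx, Wy = cca_projections(X, Y[perm], k)
        A, B = Xc @ Wx, Yc @ Wy
        # Squared Euclidean distances between all projected sample pairs.
        cost = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        _, new_perm = linear_sum_assignment(cost)
        if np.array_equal(new_perm, perm):
            break  # the matching no longer changes
        perm = new_perm
    return perm


def consensus_matching(perms):
    """For each sample, return the most frequently proposed match and its vote share."""
    perms = np.asarray(perms)
    n = perms.shape[1]
    votes = np.zeros((n, n))
    for p in perms:
        votes[np.arange(n), p] += 1
    return votes.argmax(axis=1), votes.max(axis=1) / len(perms)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Z = rng.normal(size=(30, 2))  # shared latent signal (synthetic data)
    X = Z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(30, 4))
    Y = Z @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(30, 5))
    runs = [match_views(X, Y, seed=s) for s in range(5)]
    best, support = consensus_matching(runs)
    print(best, support)
```

An alternating scheme of this kind can stop at a local optimum that depends on the initial matching, which is one reason to keep only the pairs that several restarts agree on. Note that the per-sample vote winners returned by `consensus_matching` need not form a one-to-one matching; low-support pairs can simply be left unmatched.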
