A sparse integrative cluster analysis for understanding soybean phenotypes

Soybean is one of the most important crops for food, feed, and bioenergy worldwide. Studying soybean phenotypic variation across geographical locations can improve our understanding of soybean domestication and population structure, and can inform the conservation of soybean biodiversity. We investigate whether soybean varieties can be identified that differ from other varieties on multiple traits even when grown at different geographical locations. When a collection of traits is observed for the same soybean types at different locations (different views), a joint analysis of the multi-view data is required to identify the same soybean clusters from the data gathered at every location. We employ a new multi-view singular value decomposition approach that simultaneously decomposes the data matrix gathered at each location into sparse singular vectors. This approach groups soybean samples consistently across the different locations and, at the same time, identifies the phenotypes at each location on which the soybean samples within a cluster are most similar. Comparisons with several recent multi-view co-clustering methods demonstrate the superior performance of the proposed approach.
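To make the idea of a shared sample grouping with location-specific sparse trait loadings concrete, the following is a minimal sketch of a sparse rank-one multi-view decomposition. It is an illustrative simplification under stated assumptions, not the algorithm proposed in this work: the function names (`soft_threshold`, `sparse_multiview_rank1`), the penalty parameters `lam_u` and `lam_v`, and the alternating soft-thresholding updates are all hypothetical choices made for this example. Each view is assumed to be a samples-by-traits matrix with the same samples in the same row order.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink each entry of z toward zero by lam; entries within lam of zero become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def sparse_multiview_rank1(views, lam_u=0.0, lam_v=0.0, n_iter=100, tol=1e-8):
    """Illustrative sparse rank-one fit X_k ~ u v_k^T with one u shared by all views.

    views : list of (n_samples, n_traits_k) arrays, one per location (assumed layout).
    Returns the shared sample vector u and a list of per-view trait vectors v_k.
    The thresholds lam_u and lam_v are in the units of the data (an assumption
    of this sketch; they would need tuning on real data).
    """
    # Initialise u with the leading left singular vector of the concatenated views.
    u = np.linalg.svd(np.hstack(views), full_matrices=False)[0][:, 0]
    vs = [np.zeros(X.shape[1]) for X in views]
    for _ in range(n_iter):
        # Per-view update: sparse trait loadings given the shared sample vector.
        vs = [soft_threshold(X.T @ u, lam_v) for X in views]
        # Shared update: pool evidence from every view, then sparsify the samples.
        u_new = soft_threshold(sum(X @ v for X, v in zip(views, vs)), lam_u)
        norm = np.linalg.norm(u_new)
        if norm == 0.0:
            # Penalties removed everything; return the all-zero solution.
            return u_new, vs
        u_new = u_new / norm
        converged = np.linalg.norm(u_new - u) < tol
        u = u_new
        if converged:
            break
    return u, vs

# Synthetic usage: 20 samples observed at two "locations"; the first 8 samples
# form a cluster that stands out on traits 0 and 1 at location 1 and on
# trait 3 at location 2.
rng = np.random.default_rng(0)
X1 = 0.05 * rng.standard_normal((20, 6))
X1[:8, :2] += 3.0
X2 = 0.05 * rng.standard_normal((20, 5))
X2[:8, 3] += 3.0

u, (v1, v2) = sparse_multiview_rank1([X1, X2], lam_u=5.0, lam_v=1.0)
print("samples in cluster:", np.flatnonzero(u))
print("traits at location 1:", np.flatnonzero(v1))
print("traits at location 2:", np.flatnonzero(v2))
```

In this sketch the nonzero entries of `u` mark the samples assigned to the cluster at every location, while the nonzero entries of each `v_k` mark the traits that characterise that cluster at location `k`. Further clusters could in principle be extracted by subtracting the fitted rank-one layer from each view and repeating, although that deflation step is also an assumption rather than something stated in the text.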
