Predicting missing values with biclustering: A coherence-based approach

This work proposes a novel biclustering-based approach to data imputation. The approach builds on the Mean Squared Residue metric, which evaluates the degree of coherence among objects of a dataset, and presents an algebraic development that allows the predictor to be modeled as a quadratic programming problem. The proposed methodology is positioned within the field of missing data, its theoretical aspects are discussed, and artificial and real-case scenarios are simulated to evaluate its performance. Relevant properties introduced by the biclustering process are also explored in post-imputation analysis to highlight further advantages of the methodology, specifically confidence estimation and interpretability of the imputation process.
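To make the coherence measure concrete, the sketch below computes the Mean Squared Residue of a fully observed submatrix using the standard definition: the mean of the squared residues a_ij - (row mean) - (column mean) + (overall mean). It is only an illustration of the metric, not the imputation method of this work; the function name `mean_squared_residue` and the toy matrices are assumptions introduced here.

```python
import numpy as np


def mean_squared_residue(block: np.ndarray) -> float:
    """Mean Squared Residue of a fully observed submatrix (hypothetical helper).

    Each residue is a_ij - row_mean_i - col_mean_j + overall_mean;
    the score is the mean of the squared residues.
    """
    row_means = block.mean(axis=1, keepdims=True)  # shape (rows, 1)
    col_means = block.mean(axis=0, keepdims=True)  # shape (1, cols)
    overall_mean = block.mean()
    residue = block - row_means - col_means + overall_mean
    return float(np.mean(residue ** 2))


# Toy additive-coherent block: every row is the first row plus a constant shift.
coherent = np.array([
    [1.0, 2.0, 4.0],
    [3.0, 4.0, 6.0],
    [0.0, 1.0, 3.0],
])

# Same block with one entry perturbed, which breaks the additive pattern.
perturbed = coherent.copy()
perturbed[1, 1] += 3.0

msr_coherent = mean_squared_residue(coherent)
msr_perturbed = mean_squared_residue(perturbed)
print(msr_coherent, msr_perturbed)
```

A perfectly additive block scores zero (up to floating-point error), and any entry that departs from the row-plus-column pattern raises the score. This is what makes the metric usable as a criterion for choosing a missing value: the candidate that keeps the residue low is the one most coherent with the rest of the bicluster.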
