Maximal-Sum submatrix search using a hybrid constraint programming/linear programming approach

Abstract. A Maximal-Sum Submatrix (MSS) maximizes the sum of the entries in the Cartesian product of a subset of rows and a subset of columns of an original matrix (with positive and negative entries). Although the problem is NP-hard and was introduced only recently, it has already proven useful in practical data-mining applications. It has been used to identify biclusters in gene expression data and to extract a submatrix that is then visualized in a circular plot. The state-of-the-art results for MSS are obtained with an advanced Constraint Programming approach that combines a custom filtering algorithm with a Large Neighborhood Search. We improve on this approach by introducing new upper bounds based on linear and mixed-integer programming formulations, along with dedicated pruning algorithms. We experiment on both synthetic and real-life data, and show that our approach outperforms the previous methods.
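
To make the objective concrete, the following is a minimal brute-force sketch of the MSS definition and not the hybrid CP/LP method described in the abstract. It enumerates every row subset and, for each one, keeps exactly the columns whose sum over the selected rows is positive, which is optimal for that fixed row subset. The function name `max_sum_submatrix` and the example matrix are illustrative assumptions. The enumeration is exponential in the number of rows, so it is only usable on tiny inputs.

```python
from itertools import combinations


def max_sum_submatrix(matrix):
    """Brute-force MSS on a small matrix given as a list of equal-length rows.

    Returns (best_sum, rows, cols). The empty selection (sum 0) is the
    baseline, so a matrix with only negative entries yields (0, (), ()).
    Hypothetical helper for illustration; exponential in the number of rows.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    best = (0, (), ())

    for k in range(1, n_rows + 1):
        for rows in combinations(range(n_rows), k):
            # Contribution of each column when restricted to the chosen rows.
            col_sums = [sum(matrix[i][j] for i in rows) for j in range(n_cols)]
            # For fixed rows, the best columns are those contributing positively.
            cols = tuple(j for j, s in enumerate(col_sums) if s > 0)
            total = sum(col_sums[j] for j in cols)
            if total > best[0]:
                best = (total, rows, cols)
    return best


example = [
    [2, -1, 3],
    [-4, 5, 1],
    [1, 2, -6],
]
print(max_sum_submatrix(example))  # (8, (0, 1), (1, 2))
```

In the example, rows 0 and 1 combined with columns 1 and 2 select the entries -1, 3, 5 and 1, whose sum of 8 is higher than that of any other row and column selection.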
