Data mining using the crossing minimization paradigm

Our capacity to generate, record and store multi-dimensional, apparently unstructured data is increasing rapidly, while the cost of data storage is falling. The recorded data is not perfect: noise enters it from different sources, the most basic forms being incorrectly recorded values and missing values. The formal study of discovering useful hidden information in data is called Data Mining. Because of the size and complexity of the problem, practical data mining problems are best attempted by automatic means. Data Mining can be divided into two types: supervised learning, or classification, and unsupervised learning, or clustering. Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. A detailed analysis, or local view, requires biclustering (also called co-clustering or two-way clustering), which clusters the records and the attributes simultaneously. In this dissertation, a novel, fast and white-noise-tolerant data mining solution is proposed based on the Crossing Minimization (CM) paradigm; the solution works for one-way as well as two-way clustering and can discover overlapping biclusters. For decades the CM paradigm has been used in graph drawing and in VLSI (Very Large Scale Integration) circuit design to reduce wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques on simulated noisy data as well as real data from agriculture, biology and other domains. Two other interesting and hard problems are also addressed in this dissertation: (i) the Minimum Attribute Subset Selection (MASS) problem and (ii) the Bandwidth Minimization (BWM) problem for sparse matrices. The proposed CM technique is shown to give convincing results on these problems using real public-domain data.
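To make the link between crossing minimization and biclustering concrete, the sketch below treats a binary data matrix as a bipartite graph (records on one layer, attributes on the other, one edge per nonzero entry) and reorders both layers with the barycenter heuristic, a well-known crossing-minimization heuristic. The abstract does not say which heuristic the dissertation uses, so this is an illustrative assumption, not the author's method; the function names and the toy matrix are hypothetical.

```python
from itertools import combinations


def count_crossings(matrix):
    """Count edge crossings when rows and columns are drawn as two layers."""
    edges = [(r, c) for r, row in enumerate(matrix)
             for c, v in enumerate(row) if v]
    # Two edges cross when their row order and column order disagree.
    return sum(1 for (r1, c1), (r2, c2) in combinations(edges, 2)
               if (r1 - r2) * (c1 - c2) < 0)


def barycenter_order(matrix):
    """Order rows by the mean column index of their nonzero entries."""
    def key(i):
        cols = [c for c, v in enumerate(matrix[i]) if v]
        # Assumption: rows with no entries are pushed to the end.
        return sum(cols) / len(cols) if cols else float("inf")
    return sorted(range(len(matrix)), key=key)  # stable sort keeps ties in place


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def reorder(matrix, sweeps=10):
    """Alternate row and column barycenter passes until nothing moves."""
    current = [list(row) for row in matrix]
    rows = list(range(len(current)))
    cols = list(range(len(current[0])))
    for _ in range(sweeps):
        row_perm = barycenter_order(current)
        current = [current[i] for i in row_perm]
        rows = [rows[i] for i in row_perm]

        flipped = transpose(current)
        col_perm = barycenter_order(flipped)
        flipped = [flipped[j] for j in col_perm]
        cols = [cols[j] for j in col_perm]
        current = transpose(flipped)

        if (row_perm == list(range(len(row_perm)))
                and col_perm == list(range(len(col_perm)))):
            break  # both orders are stable
    return current, rows, cols


# Toy matrix with two interleaved biclusters (hypothetical data).
data = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
reordered, row_order, col_order = reorder(data)
for row in reordered:
    print(row)
```

On this toy input the reordering brings the two groups of records and attributes together into dense blocks along the diagonal, which is the visual signature a biclustering step would then pick up. The heuristic only reduces crossings; it does not guarantee the minimum, and it says nothing about how noisy data behaves.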
