Design and Implementation of Scalable Hierarchical Density Based Clustering

Clustering is a useful technique that divides data points into groups, also known as clusters, such that the data points of the same cluster exhibit similar properties. Typical clustering algorithms assign each data point to at least one cluster. However, in practical datasets such as microarray gene datasets, only a subset of the genes are highly correlated, and the dataset is often polluted with a large volume of irrelevant genes. In such cases, it is important to ignore the poorly correlated genes and cluster only the highly correlated ones. Automated Hierarchical Density Shaving (Auto-HDS) is a non-parametric density-based technique that partitions only the relevant subset of the dataset into multiple clusters while pruning the rest. Auto-HDS performs a hierarchical clustering that identifies dense clusters of different densities and finds a compact hierarchy of the clusters identified. Key features of Auto-HDS include selection and ranking of clusters using a custom stability criterion, and a topologically meaningful 2D projection and visualization of the clusters discovered in the higher-dimensional original space. However, a key limitation of Auto-HDS is that it requires O(n^2) storage and O(n^2 log n) computation, so it scales only to a few tens of thousands of points. In this thesis, two extensions to Auto-HDS are presented for lower-dimensional datasets; they generate clusterings identical to those of Auto-HDS but scale to much larger datasets. We first introduce Partitioned Auto-HDS, which significantly reduces time and space complexity and makes it possible to generate the Auto-HDS cluster hierarchy on much larger datasets with hundreds of millions of data points. We then describe Parallel Auto-HDS, which takes advantage of the inherent parallelism in Partitioned Auto-HDS to scale to even larger datasets without a corresponding increase in actual run time when a group of processors is available for parallel execution.
Partitioned Auto-HDS is implemented on top of GeneDIVER, a previously existing Java-based streaming implementation of Auto-HDS, and thus retains all the key features of Auto-HDS, including ranking, automatic selection of clusters, and 2D visualization of the discovered cluster topology. The Java-based Auto-HDS reduces memory usage by streaming the distance matrix to secondary storage; the storage required is nevertheless O(n^2). It is therefore limited by computation time rather than memory, since the O(n^2) storage resides on the hard drive. The Java-based implementation of Auto-HDS is available at http://www.ideal.ece.utexas.edu/~gunjan/genediver.
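To make the storage limitation concrete, the following minimal sketch estimates the size of a full pairwise distance matrix as the number of points grows. It is an illustration only and is not taken from GeneDIVER: the function name distance_matrix_bytes, the 4-byte entry size, and the choice to store only one triangle of a symmetric matrix are all assumptions.

```python
def distance_matrix_bytes(n_points: int, bytes_per_entry: int = 4,
                          symmetric: bool = True) -> int:
    """Bytes needed to store all pairwise distances among n_points.

    Hypothetical helper for illustration; the entry size and the
    symmetric (upper-triangle only) layout are assumptions.
    """
    if symmetric:
        # One entry per unordered pair, diagonal excluded.
        entries = n_points * (n_points - 1) // 2
    else:
        # Full n x n matrix.
        entries = n_points * n_points
    return entries * bytes_per_entry


if __name__ == "__main__":
    for n in (10_000, 100_000, 1_000_000):
        gib = distance_matrix_bytes(n) / 2**30
        print(f"{n:>9,} points -> {gib:,.1f} GiB")
```

Because the entry count grows quadratically, a tenfold increase in the number of points multiplies the storage by roughly one hundred, which is why streaming the matrix to disk relieves memory pressure without changing how the cost grows.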
