Large-Scale Data Analytics Using Ensemble Clustering

Data clustering is a widely used analysis technique in many application domains. From the end user’s perspective, the wide variety of available algorithms and their technical parameterization make it very difficult to obtain a satisfying clustering result. To overcome this issue in the context of large-scale analysis, we developed a novel feedback-driven clustering process. Besides presenting the theoretical concepts, we describe the infrastructure we developed to efficiently handle the still-increasing data volumes within this process.
