Merging distributed database summaries

The database summarization system called SaintEtiQ provides multi-resolution summaries of structured data stored in a centralized database. Summaries are computed online with a conceptual hierarchical clustering algorithm. However, most companies work in distributed legacy environments, and consequently the current centralized version of SaintEtiQ is either not feasible (because of privacy-preservation requirements) or not desirable (because of resource limitations). To address this problem, we propose new algorithms that generate a single summary hierarchy from two distinct hierarchies, without scanning the raw data. The Greedy Merging Algorithm (GMA) takes all leaves of both hierarchies and generates the optimal partitioning of the considered data set with regard to a cost function (compactness and separation). A hierarchical organization of summaries is then built by agglomerating or dividing clusters, such that the cost function may emphasize local or global patterns in the data. We thus obtain two different hierarchies, depending on the optimization performed. However, this approach becomes impractical because of its exponential time complexity. To tackle this problem, we propose two alternative approaches whose time complexity is constant with respect to the number of data items. The first, called the Merge by Incorporation Algorithm (MIA), relies on the SaintEtiQ engine, whereas the second, named the Merge by Alignment Algorithm (MAA), rearranges summaries level by level in a top-down manner. We then compare these approaches using an original quality measure that quantifies how good the merged hierarchies are. Finally, an experimental study on real data sets shows that the merging processes (MIA and MAA) are efficient in terms of computational time.
