Knowledge-Free Table Summarization

For an analyst working with relational tables, a summary provides a starting point for exploring the data. Table summarization typically aims to produce an informative summary by relying on metadata supplied by attribute taxonomies. However, such hierarchical knowledge is not always available, and when it exists it may be inadequate. To overcome these limitations, we propose a new framework, named cTabSum, that automatically generates attribute value taxonomies and performs table summarization based directly on the table's own content. Our approach takes a relational table as input and proceeds in two steps. First, a taxonomy is extracted for each attribute. Second, a new table summarization algorithm exploits the automatically generated taxonomies, with an information-theoretic measure guiding the summarization process. We also develop a prototype implementing the new algorithm. The prototype incorporates additional features to help the user become familiar with the data: (i) the summarized table produced by cTabSum can be used as a recommended starting point for browsing the data; (ii) easy-to-understand charts visualize how the taxonomies were built; (iii) standard OLAP operators, i.e. drill-down and roll-up, are implemented to navigate easily within the data set. In addition, we provide an objective evaluation of our table summarization strategy on real data.
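
The summarization step can be illustrated with a minimal sketch. It is not the cTabSum algorithm: the taxonomy below is written by hand (cTabSum derives it automatically), Shannon entropy stands in for the unspecified information-theoretic measure, and all names (`entropy`, `generalize`, `city_taxonomy`) and the sample rows are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def generalize(rows, column, taxonomy):
    """Replace each value of one column by its parent in the taxonomy."""
    return [
        tuple(taxonomy.get(v, v) if i == column else v for i, v in enumerate(row))
        for row in rows
    ]


# Toy relational table: (city, product). Illustrative data only.
rows = [
    ("Paris", "apple"),
    ("Lyon", "pear"),
    ("Rome", "apple"),
    ("Milan", "apple"),
]

# Assumed one-level taxonomy for the city attribute (child -> parent).
city_taxonomy = {"Paris": "France", "Lyon": "France", "Rome": "Italy", "Milan": "Italy"}

# Roll the city column up one level, then merge identical rows with a count.
generalized = generalize(rows, 0, city_taxonomy)
summary = Counter(generalized)

# Entropy lost on the generalized column: one possible cost of the roll-up.
loss = entropy([r[0] for r in rows]) - entropy([r[0] for r in generalized])

for row, count in summary.items():
    print(row, count)
print("entropy loss (bits):", loss)
```

Rolling up merges the two Italian rows into one summarized row, and the entropy difference gives one way to quantify the detail given up in exchange for the smaller table. A summarization procedure could compare such costs across candidate roll-ups, and reversing the mapping corresponds to a drill-down.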
