Data summarization: a survey

Summarization has proven to be a useful and effective technique for supporting the analysis of large amounts of data. Knowledge discovery from data (KDD) is time-consuming, and summarization is an important step that expedites KDD tasks by intelligently reducing the size of the data to be processed. This paper discusses different summarization techniques for structured and unstructured data. The key finding of this survey is that not all summarization techniques create a summary suitable for further analysis. Sampling techniques are highlighted as a viable way of creating a summary that supports further knowledge discovery, such as anomaly detection performed on the summary itself. Different summary evaluation metrics are also discussed.
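
To illustrate what a sampling-based summary can look like, the following is a minimal sketch of reservoir sampling, a classic single-pass scheme that keeps a uniform random sample of fixed size from a stream of unknown length. It is not taken from the survey itself: the function name `reservoir_sample` and the parameters `k` and `seed` are assumptions chosen for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Hypothetical helper for illustration only; the stream is read once
    and never held in memory as a whole.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Summarize one million records into a sample of 1000.
    summary = reservoir_sample(range(1_000_000), k=1000, seed=42)
    print(len(summary))
```

Because each record ends up in the sample with equal probability, the summary consists of original records rather than aggregates, which is one reason a sample of this kind can plausibly be passed on to later analysis such as anomaly detection.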
