Open challenges for data stream mining research

Every day, huge volumes of sensory, transactional, and web data are continuously generated as streams, which need to be analyzed online as they arrive. Streaming data can be considered one of the main sources of what is called big data. While predictive modeling for data streams and big data has received a lot of attention over the last decade, many research approaches are designed for well-behaved, controlled problem settings and overlook important challenges imposed by real-world applications. This article discusses eight open challenges for data stream mining. Our goal is to identify gaps between current research and meaningful applications, highlight open problems, and define new application-relevant research directions for data stream mining. The identified challenges cover the full cycle of knowledge discovery and involve problems such as protecting data privacy, dealing with legacy systems, handling incomplete and delayed information, analyzing complex data, and evaluating stream mining algorithms. The resulting analysis is illustrated by practical applications and provides general suggestions for future lines of research in data stream mining.
