Statistical methods for complex event processing and real time decision making

While big data, in which data is written to massive repositories for later analysis, has received a lot of attention recently, a rapidly increasing amount of data is also available in the form of data streams or events. Data streams typically represent very recent measurements or current system states. Events represent things that happen, often in the context of computer processing. When processing data streams or events, we often need to make decisions in real time. Complex event processing (CEP) is an important area of computer science that provides powerful tools for processing events and analyzing data streams. CEP deals with events that can be composed of other events, and it can model complex phenomena such as a user's interactions with a web site or a stock market crash. In the current literature, CEP is almost entirely deterministic; that is, it does not account for randomness or rely on statistical methods. However, statistics and machine learning have a critical role to play in the use of data streams and events. Likewise, understanding how CEP works is critical to analyzing data based on complex events. When processing data streams, a distinction must be made between analysis, the human activity in which we try to gain understanding of an underlying process, and decision making, in which we apply knowledge to data to decide what action to take. Useful statistical techniques for data streams include smoothing, generalized additive models, change point detection, and classification methods.

WIREs Comput Stat 2016, 8:5-26. doi: 10.1002/wics.1372
