Large-Scale Multi-Modal Data Exploration with Human in the Loop

A new trend in many scientific fields is to conduct data-intensive research by collecting and analyzing large amounts of high-density, high-quality, multi-modal data streams. In this chapter we present a research framework for analyzing and mining such data streams at large scale; in particular, we exploit parallel sequential pattern mining and iterative MapReduce to enable human-in-the-loop large-scale data exploration powered by High Performance Computing (HPC). One basic problem is that data scientists now work with datasets so large and complex that they are difficult to process using traditional desktop statistics and visualization packages, requiring instead “massively parallel software running on tens, hundreds, or even thousands of servers” (Jacobs, Queue 7(6):10:10–10:19, 2009). Meanwhile, discovering new knowledge requires the means to analyze datasets of this scale in an exploratory way, allowing us to freely “wander” around the data and make discoveries by combining bottom-up pattern discovery with top-down human knowledge, leveraging the power of the human perceptual system. In this work, we first exploit a novel interactive temporal data mining method that allows us to discover reliable sequential patterns and precise timing information in multivariate time series. For our principal test case of detecting and extracting human sequential behavioral patterns over multiple multi-modal data streams, this suggests a quantitative and interactive data-driven way to ground social interactions in a manner that has never been achieved before. After establishing the fundamental analytics algorithms, we proceed to a research framework that extracts reliable patterns from large-scale time series using iterative MapReduce tasks. Our work exploits visual-based information technologies to allow scientists to interactively explore, visualize, and make sense of their data.
For example, the parallel mining algorithm running on HPC is accessible to users through an asynchronous web service. In this way, scientists can compare the intermediate data to extract and propose new rounds of analysis for more scientifically meaningful and statistically reliable patterns, so that statistical computing and visualization can bootstrap each other. Finally, we show results from our principal user application that demonstrate our system’s capability of handling massive temporal event sets within just a few minutes. All these combine to reveal an effective and efficient way to support large-scale data exploration with human in the loop.

[1]  Rajeev Motwani,et al.  The PageRank Citation Ranking: Bringing Order to the Web , 1999, WWW 1999.

[2]  Mong-Li Lee,et al.  Mining relationships among interval-based events for classification , 2008, SIGMOD Conference.

[3]  Chen Yu,et al.  Visual Data Mining: An Exploratory Approach to Analyzing Temporal Patterns of Eye Movements , 2012, Infancy : the official journal of the International Society on Infant Studies.

[4]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.

[5]  Chen Yu,et al.  Sequential pattern mining of multimodal data streams in dyadic interactions , 2011, 2011 IEEE International Conference on Development and Learning (ICDL).

[6]  Ada Wai-Chee Fu,et al.  Discovering Temporal Patterns for Interval-Based Events , 2000, DaWaK.

[7]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[8]  Chen Yu,et al.  Visual mining of multimedia data for social and behavioral studies , 2008, 2008 IEEE Symposium on Visual Analytics Science and Technology.

[9]  Chen Yu,et al.  Real-time adaptive behaviors in multimodal human-avatar interactions , 2010, ICMI-MLMI '10.

[10]  Tomonobu Ozaki,et al.  Discovery of Quantitative Sequential Patterns from Event Sequences , 2009, 2009 IEEE International Conference on Data Mining Workshops.

[11]  Ramakrishnan Srikant,et al.  Mining sequential patterns , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[12]  Michael D. Ernst,et al.  HaLoop , 2010, Proc. VLDB Endow..

[13]  Thomas Guyet,et al.  Mining Temporal Patterns with Quantitative Intervals , 2008, 2008 IEEE International Conference on Data Mining Workshops.

[14]  Dmitriy Fradkin,et al.  Robust Mining of Time Intervals with Semi-interval Partial Order Patterns , 2010, SDM.

[15]  Guangchen Ruan,et al.  Parallel and quantitative sequential pattern mining for large-scale interval-based temporal data , 2014, 2014 IEEE International Conference on Big Data (Big Data).

[16]  Chen Yu,et al.  A Multimodal Real-Time Platform for Studying Human-Avatar Interactions , 2010, IVA.

[17]  Juan Carlos Guerri,et al.  A software tool to acquire, synchronise and playback multimedia data: an application in kinesiology , 2000, Comput. Methods Programs Biomed..

[18]  Sofian Maabout,et al.  Uncertainty Interval Temporal Sequences Extraction , 2012, ICISTM.

[19]  David Cunningham,et al.  M3R: Increased performance for in-memory Hadoop jobs , 2012, Proc. VLDB Endow..

[20]  Adam Jacobs,et al.  The pathologies of big data , 2009, Commun. ACM.

[21]  J. Leeuw Applications of Convex Analysis to Multidimensional Scaling , 2000 .

[22]  James F. Allen Maintaining knowledge about temporal intervals , 1983, CACM.

[23]  William R. Sherman,et al.  Reordering virtual reality: recording and recreating real-time experiences , 2012, Other Conferences.

[24]  Michael C. Carroll,et al.  Exploratory space-time analysis of local economic development , 2011 .

[25]  Qiming Chen,et al.  PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth , 2001, Proceedings 17th International Conference on Data Engineering.

[26]  Johannes Gehrke,et al.  Sequential PAttern mining using a bitmap representation , 2002, KDD.

[27]  Umeshwar Dayal,et al.  FreeSpan: frequent pattern-projected sequential pattern mining , 2000, KDD '00.

[28]  Kyuseok Shim,et al.  SPIRIT: Sequential Pattern Mining with Regular Expression Constraints , 1999, VLDB.

[29]  Geoffrey C. Fox,et al.  Twister: a runtime for iterative MapReduce , 2010, HPDC '10.

[30]  Dimitrios Gunopulos,et al.  Discovering frequent arrangements of temporal intervals , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[31]  Jia Liu,et al.  Managing uncertain temporal relations using a probabilistic Interval Algebra , 2008, 2008 IEEE International Conference on Systems, Man and Cybernetics.

[32]  Geoffrey C. Fox,et al.  A deterministic annealing approach to clustering , 1990, Pattern Recognit. Lett..

[33]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.