Research Challenges for Data Mining in Science and Engineering

With the rapid development of computer and information technology in the last several decades, an enormous amount of data in science and engineering has been and will continuously be generated in massive scale, either being stored in gigantic storage devices or flowing into and out of the system in the form of data streams. Moreover, such data has been made widely available, e.g., via the Internet. Such tremendous amount of data, in the order of terato petabytes, has fundamentally changed science and engineering, transforming many disciplines from data-poor to increasingly data-rich, and calling for new, data-intensive methods to conduct research in science and engineering. In this paper, we discuss the research challenges in science and engineering, from the data mining perspective, with a focus on the following issues: (1) information network analysis, (2) discovery, usage, and understanding of patterns and knowledge, (3) stream data mining, (4) mining moving object data, RFID data, and data from sensor networks, (5) spatiotemporal and multimedia data mining, (6) mining text, Web, and other unstructured data, (7) data cube-oriented multidimensional online analytical mining, (8) visual data mining, and (9) data mining by integration of sophisticated scientific and engineering domain knowledge.

[1]  Philip S. Yu,et al.  Mining concept-drifting data streams using ensemble classifiers , 2003, KDD '03.

[2]  S. Muthukrishnan,et al.  Data streams: algorithms and applications , 2005, SODA '03.

[3]  Wei Fan,et al.  Systematic data selection to mine concept-drifting data streams , 2004, KDD.

[4]  James Allan,et al.  Topic detection and tracking: event-based information organization , 2002 .

[5]  Stuart J. Russell,et al.  Identity Uncertainty and Citation Matching , 2002, NIPS.

[6]  Jiawei Han,et al.  On Appropriate Assumptions to Mine Data Streams: Analysis and Practice , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[7]  Soumen Chakrabarti,et al.  Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction , 2001, WWW '01.

[8]  Geoff Hulten,et al.  Mining high-speed data streams , 2000, KDD '00.

[9]  Ben Taskar,et al.  Learning Probabilistic Models of Relational Structure , 2001, ICML.

[10]  Philip S. Yu,et al.  A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions , 2007, SDM.

[11]  Philip S. Yu,et al.  Efficient classification across multiple database relations: a CrossMine approach , 2006, IEEE Transactions on Knowledge and Data Engineering.

[12]  Jiawei Han,et al.  Semantic annotation of frequent patterns , 2007, TKDD.

[13]  Padhraic Smyth,et al.  Prediction and ranking algorithms for event-based network data , 2005, SKDD.

[14]  Raghu Ramakrishnan,et al.  Exploratory mining in cube space , 2006, Data Mining and Knowledge Discovery.

[15]  Jae-Gil Lee,et al.  Sampling cube: a framework for statistical olap over sampling data , 2008, SIGMOD Conference.

[16]  Surajit Chaudhuri,et al.  Eliminating Fuzzy Duplicates in Data Warehouses , 2002, VLDB.

[17]  Jennifer Widom,et al.  SimRank: a measure of structural-context similarity , 2002, KDD.

[18]  Hans-Peter Kriegel,et al.  Visual classification: an interactive approach to decision tree construction , 1999, KDD '99.

[19]  S. Wasserman,et al.  Models and Methods in Social Network Analysis , 2005 .

[20]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[21]  Lei Liu,et al.  Survey of Biodata Analysis from a Data Mining Perspective , 2005, Data Mining in Bioinformatics.

[22]  Jiawei Han,et al.  Discriminative Frequent Pattern Analysis for Effective Classification , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[23]  Jimeng Sun,et al.  Relevance search and anomaly detection in bipartite graphs , 2005, SKDD.

[24]  Philip S. Yu,et al.  Truth Discovery with Multiple Conflicting Information Providers on the Web , 2007, IEEE Transactions on Knowledge and Data Engineering.

[25]  Jiawei Han,et al.  Frequent pattern mining: current status and future directions , 2007, Data Mining and Knowledge Discovery.

[26]  Dennis Shasha,et al.  High Performance Discovery In Time Series: Techniques And Case Studies (Monographs in Computer Science) , 2004 .

[27]  Jiawei Han,et al.  DataScope: Viewing Database Contents in Google Maps' Way , 2007, VLDB.

[28]  Yixin Chen,et al.  Multi-Dimensional Regression Analysis of Time-Series Data Streams , 2002, VLDB.

[29]  Bing Liu,et al.  Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data , 2006, Data-Centric Systems and Applications.

[30]  Tom M. Mitchell,et al.  Learning to construct knowledge bases from the World Wide Web , 2000, Artif. Intell..

[31]  Philip S. Yu,et al.  A Framework for Projected Clustering of High Dimensional Data Streams , 2004, VLDB.

[32]  Philip S. Yu,et al.  Cross-relational clustering with user's guidance , 2005, KDD '05.

[33]  Jiawei Han,et al.  Generating semantic annotations for frequent patterns with context analysis , 2006, KDD '06.

[34]  Inderjit S. Dhillon,et al.  Concept Decompositions for Large Sparse Text Data Using Clustering , 2004, Machine Learning.

[35]  Taher H. Haveliwala Topic-sensitive PageRank , 2002, IEEE Trans. Knowl. Data Eng..

[36]  Haim Levkowitz,et al.  From Visual Data Exploration to Visual Data Mining: A Survey , 2003, IEEE Trans. Vis. Comput. Graph..

[37]  Jayant Madhavan,et al.  Reference reconciliation in complex information spaces , 2005, SIGMOD '05.

[38]  Pier Luca Lanzi,et al.  Mining interesting knowledge from weblogs: a survey , 2005, Data Knowl. Eng..

[39]  C. Lee Giles,et al.  Self-Organization and Identification of Web Communities , 2002, Computer.

[40]  George Karypis,et al.  Frequent subgraph discovery , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[41]  Philip S. Yu,et al.  LinkClus: efficient clustering via heterogeneous semantic links , 2006, VLDB.

[42]  Jiawei Han,et al.  Geographic Data Mining and Knowledge Discovery , 2001 .

[43]  Tong Zhang,et al.  Text Mining: Predictive Methods for Analyzing Unstructured Information , 2004 .

[44]  Graham Cormode,et al.  What's hot and what's not: tracking most frequent items dynamically , 2003, TODS.

[45]  Jiawei Han,et al.  Discovery of Frequent Substructures , 2006 .

[46]  Jae-Gil Lee,et al.  Trajectory Outlier Detection: A Partition-and-Detect Framework , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[47]  Geoff Hulten,et al.  Mining time-changing data streams , 2001, KDD '01.

[48]  Jiawei Han,et al.  Mining Approximate Top-K Subspace Anomalies in Multi-Dimensional Time-Series Data , 2007, VLDB.

[49]  Sudipto Guha,et al.  Clustering Data Streams: Theory and Practice , 2003, IEEE Trans. Knowl. Data Eng..

[50]  Edoardo M. Airoldi,et al.  Sampling algorithms for pure network topologies: a study on the stability and the separability of metric embeddings , 2005, SKDD.

[51]  S. Muthukrishnan,et al.  Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries , 2001, VLDB.

[52]  Edward R. Tufte,et al.  The Visual Display of Quantitative Information , 1986 .

[53]  Charu C. Aggarwal,et al.  Data Streams - Models and Algorithms , 2014, Advances in Database Systems.

[54]  Sangkyum Kim,et al.  ROAM: Rule- and Motif-Based Anomaly Detection in Massive Moving Object Data Sets , 2007, SDM.

[55]  Yixin Chen,et al.  Regression Cubes with Lossless Compression and Aggregation , 2006, IEEE Transactions on Knowledge and Data Engineering.

[56]  Lise Getoor,et al.  Link mining: a survey , 2005, SKDD.

[57]  Jiawei Han,et al.  Extracting redundancy-aware top-k patterns , 2006, KDD '06.

[58]  Michael Stonebraker,et al.  Load Shedding in a Data Stream Manager , 2003, VLDB.

[59]  Philip S. Yu,et al.  A Framework for Clustering Evolving Data Streams , 2003, VLDB.

[60]  Shashi Shekhar,et al.  Spatial Databases: A Tour , 2003 .

[61]  Dragomir R. Radev,et al.  LexRank: Graph-based Lexical Centrality as Salience in Text Summarization , 2004, J. Artif. Intell. Res..

[62]  Jiawei Han,et al.  Flowcube: constructing RFID flowcubes for multi-dimensional analysis of commodity flows , 2006, VLDB.

[63]  Jiawei Han,et al.  Data Mining: Concepts and Techniques , 2000 .

[64]  Philip S. Yu,et al.  Clustering by pattern similarity in large data sets , 2002, SIGMOD '02.

[65]  Alfred Inselberg,et al.  Multidimensional detective , 1997, Proceedings of VIZ '97: Visualization Conference, Information Visualization Symposium and Parallel Rendering Symposium.

[66]  Jon M. Kleinberg,et al.  Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text , 1998, Comput. Networks.

[67]  A. John MINING GRAPH DATA , 2022 .

[68]  Philip S. Yu,et al.  Object Distinction: Distinguishing Objects with Identical Names , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[69]  ChengXiang Zhai,et al.  Discovering evolutionary theme patterns from text: an exploration of temporal text mining , 2005, KDD '05.

[70]  Geoff Hulten,et al.  A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering , 2001, ICML.

[71]  Anthony K. H. Tung,et al.  Mining top-K covering rule groups for gene expression data , 2005, SIGMOD '05.

[72]  Hendrik Blockeel,et al.  Web mining research: a survey , 2000, SKDD.

[73]  Yi Lin,et al.  Prediction Cubes , 2005, VLDB.

[74]  Matthew Richardson,et al.  The Intelligent surfer: Probabilistic Combination of Link and Content Information in PageRank , 2001, NIPS.

[75]  Markus Gross,et al.  Visualizing Informationon a Sphere , 1997 .

[76]  Chao Liu,et al.  Statistical Debugging: A Hypothesis Testing-Based Approach , 2006, IEEE Transactions on Software Engineering.

[77]  Jennifer Widom,et al.  Models and issues in data stream systems , 2002, PODS.

[78]  Daniel A. Keim,et al.  Information Visualization and Visual Data Mining , 2002, IEEE Trans. Vis. Comput. Graph..

[79]  Christos Faloutsos,et al.  Tools for large graph mining , 2005 .

[80]  Philip S. Yu,et al.  Mining Colossal Frequent Patterns by Core Pattern Fusion , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[81]  Jon Kleinberg,et al.  Authoritative sources in a hyperlinked environment , 1999, SODA '98.

[82]  Lise Getoor,et al.  A Latent Dirichlet Model for Unsupervised Entity Resolution , 2005, SDM.

[83]  Takashi Washio,et al.  State of the art of graph-based data mining , 2003, SKDD.

[84]  Marcus A. Maloof,et al.  Using additive expert ensembles to cope with concept drift , 2005, ICML.

[85]  Alexander S. Szalay,et al.  The World Wide Telescope: An Archetype for Online Science , 2004, ArXiv.

[86]  Jiawei Han,et al.  Adaptive Fastest Path Computation on a Road Network: A Traffic Mining Approach , 2007, VLDB.

[87]  Marcel Worring,et al.  Multimodal Video Indexing : A Review of the State-ofthe-art , 2001 .

[88]  Ishwar K. Sethi,et al.  eID: a system for exploration of image databases , 2003, Inf. Process. Manag..

[89]  Download Book,et al.  Information Visualization in Data Mining and Knowledge Discovery , 2001 .

[90]  John F. Roddick,et al.  An Updated Bibliography of Temporal, Spatial, and Spatio-temporal Data Mining Research , 2000, TSDM.