Chapter 1 Research Challenges for Data Mining in Science and Engineering

With the rapid development of computer and information technology over the last several decades, an enormous amount of data in science and engineering has been, and will continue to be, generated on a massive scale, either stored in gigantic storage devices or flowing into and out of systems in the form of data streams. Moreover, such data has been made widely available, e.g., via the Internet. This tremendous amount of data, on the order of tera- to petabytes, has fundamentally changed science and engineering, transforming many disciplines from data-poor to increasingly data-rich and calling for new, data-intensive methods to conduct research. In this paper, we discuss the research challenges in science and engineering from the data mining perspective, with a focus on the following issues: (1) information network analysis, (2) discovery, usage, and understanding of patterns and knowledge, (3) stream data mining, (4) mining moving object data, RFID data, and data from sensor networks, (5) spatiotemporal and multimedia data mining, (6) mining text, Web, and other unstructured data, (7) data cube-oriented multidimensional online analytical mining, (8) visual data mining, and (9) data mining by integration of sophisticated scientific and engineering domain knowledge.
