The EDAM project: Mining atmospheric aerosol datasets

There is a great need to better understand the sources, dynamics, and compositions of atmospheric aerosols. The traditional approach to particle measurement, collecting bulk samples of particulates on filters, is not adequate for studying particle dynamics and real‐time correlations. This has led to the development of a new generation of real‐time instruments that provide continuous or semicontinuous streams of data about certain aerosol properties. However, these instruments have added a significant level of complexity to atmospheric aerosol data and dramatically increased the amount of data to be collected, managed, and analyzed. Our ability to integrate the data from all of these new and complex instruments now lags far behind our data‐collection capabilities and severely limits our ability to understand the data and act upon it in a timely manner. In this article, we present an overview of EDAM (Exploratory Data Analysis and Management), a joint project between researchers in Atmospheric Chemistry and Computer Science. Important objectives include environmental monitoring and data quality assurance, and real‐time data mining offers great potential. While atmospheric aerosol analysis is an important and challenging domain, our objective is to develop techniques that have broader applicability. © 2005 Wiley Periodicals, Inc. Int J Int Syst 20: 759–787, 2005.
