Algorithms and Software for Collaborative Discovery from Autonomous, Semantically Heterogeneous, Distributed Information Sources

The development of high-throughput data acquisition technologies, together with advances in computing and communications, has resulted in explosive growth in the number, size, and diversity of potentially useful information sources. This growth has created unprecedented opportunities for data-driven knowledge acquisition and decision-making in a number of emerging, increasingly data-rich application domains, such as bioinformatics, environmental informatics, enterprise informatics, and social informatics, among others. However, the massive size, semantic heterogeneity, autonomy, and distributed nature of the data repositories present significant hurdles to acquiring useful knowledge from the available data. This paper introduces some of the algorithmic and statistical problems that arise in such a setting and describes algorithms for learning classifiers from distributed data that offer rigorous performance guarantees relative to their centralized or batch counterparts. It also describes how this approach can be extended to work with autonomous, and hence inevitably semantically heterogeneous, data sources by making explicit the ontologies (attributes and relationships between attributes) associated with the data sources and reconciling the semantic differences among the data sources from a user's point of view. This allows user- or context-dependent exploration of semantically heterogeneous data sources. The resulting algorithms have been implemented in INDUS, an open source software package for collaborative discovery from autonomous, semantically heterogeneous, distributed data sources.
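One way to picture the kind of guarantee described above is with a count-based learner such as naive Bayes. The sketch below is illustrative only and is not the INDUS implementation or its API: it assumes that each source can answer a counting query locally, and that the user supplies a mapping from each source's attribute values to the user's own vocabulary. The source names, the `MAPPINGS` table, and the functions `local_counts` and `merge` are all hypothetical. Under these assumptions, the counts gathered from the distributed sources are identical to the counts computed after pooling the translated data in one place, so any classifier built from those counts is the same in both settings.

```python
from collections import Counter

# Hypothetical user-supplied mappings from each source's local attribute
# values to the user's vocabulary (a stand-in for an ontology mapping).
MAPPINGS = {
    "site_a": {"Y": "yes", "N": "no"},
    "site_b": {"1": "yes", "0": "no"},
}


def local_counts(rows, mapping):
    """Run at a data source: return class counts and joint counts.

    Each row is a tuple (attribute values..., class label). Attribute values
    are translated into the user's vocabulary before counting; values with
    no entry in the mapping are kept unchanged.
    """
    class_counts = Counter()
    joint_counts = Counter()
    for *values, label in rows:
        class_counts[label] += 1
        for index, value in enumerate(values):
            joint_counts[(label, index, mapping.get(value, value))] += 1
    return class_counts, joint_counts


def merge(per_source_stats):
    """Run at the learner: add up the counts returned by every source."""
    total_class, total_joint = Counter(), Counter()
    for class_counts, joint_counts in per_source_stats:
        total_class.update(class_counts)
        total_joint.update(joint_counts)
    return total_class, total_joint


# Toy data held by two autonomous sources that encode the same attribute
# differently.
site_a = [("Y", "pos"), ("N", "neg"), ("Y", "pos")]
site_b = [("1", "pos"), ("0", "neg"), ("0", "pos")]

# Distributed setting: only counts leave each source.
distributed = merge([
    local_counts(site_a, MAPPINGS["site_a"]),
    local_counts(site_b, MAPPINGS["site_b"]),
])

# Centralized reference: translate every row, pool the data, count once.
pooled = [(MAPPINGS["site_a"][v], y) for v, y in site_a]
pooled += [(MAPPINGS["site_b"][v], y) for v, y in site_b]
centralized = local_counts(pooled, {})

# The two sets of counts agree exactly.
assert distributed == centralized
```

Only the counts are sent to the learner, so the raw records never leave their sources; a different user with a different mapping would obtain different counts from the same sources.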
