Toward Semantics-Enabled Infrastructure for Knowledge Acquisition from Distributed Data

We summarize progress on algorithms and software for knowledge acquisition from large, distributed, autonomous, and semantically disparate information sources. Key results include: scalable algorithms for constructing predictive models from data, based on a novel decomposition of learning algorithms that interleave queries for sufficient statistics from data with computations using those statistics; provably exact algorithms for learning from distributed data (relative to their centralized counterparts); and statistically sound approaches to learning predictive models from partially specified data, which arise in settings where the schema and the data semantics, and hence the granularity of the data, differ across sources.
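The sufficient-statistics decomposition can be illustrated with a minimal sketch. It assumes a naive Bayes learner and horizontally fragmented data (each source holds complete rows); the toy sources and the names count_query, combine, and predict are hypothetical and are not taken from the work summarized here. Each source answers a count query locally, the learner adds the answers, and the combined counts equal those from the pooled data, so the resulting model is exact relative to its centralized counterpart.

```python
from collections import Counter

# Hypothetical toy data: each source holds rows of (attribute tuple, class label).
SOURCE_A = [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes")]
SOURCE_B = [(("sunny", "mild"), "yes"), (("rainy", "hot"), "no"),
            (("sunny", "hot"), "no")]


def count_query(rows):
    """Answer a sufficient-statistics query locally.

    Returns class counts and (attribute index, value, class) counts.
    Only these counts, not the rows, leave the source.
    """
    class_counts, joint_counts = Counter(), Counter()
    for attrs, label in rows:
        class_counts[label] += 1
        for i, value in enumerate(attrs):
            joint_counts[(i, value, label)] += 1
    return class_counts, joint_counts


def combine(answers):
    """Add up the counts returned by the individual sources."""
    total_class, total_joint = Counter(), Counter()
    for class_counts, joint_counts in answers:
        total_class.update(class_counts)   # Counter.update adds counts
        total_joint.update(joint_counts)
    return total_class, total_joint


def predict(stats, attrs, values_per_attr):
    """Naive Bayes prediction with Laplace smoothing, using counts only."""
    class_counts, joint_counts = stats
    n = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label in sorted(class_counts):
        score = class_counts[label] / n
        for i, value in enumerate(attrs):
            score *= (joint_counts[(i, value, label)] + 1) / (
                class_counts[label] + values_per_attr[i])
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Statistics gathered from the separate sources versus from the pooled data.
distributed_stats = combine([count_query(SOURCE_A), count_query(SOURCE_B)])
centralized_stats = count_query(SOURCE_A + SOURCE_B)
```

Because the learner touches the data only through count_query, the same predict function runs unchanged whether the counts come from one table or from many sources. Exactness in this sketch follows from counts being additive over disjoint sets of rows; other learners and other fragmentations would need different statistics.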
