Discovering interesting information with advances in web technology

The Web is a steadily evolving resource comprising much more than mere HTML pages. With its ever-growing data sources in a variety of formats, it provides great potential for knowledge discovery. In this article, we shed light on some interesting phenomena of the Web: the deep Web, which surfaces database records as Web pages; the Semantic Web, which defines meaningful data exchange formats; XML, which has established itself as a lingua franca for Web data exchange; and domain-specific markup languages, which are designed based on XML syntax with the goal of preserving semantics in targeted domains. We detail these four developments in Web technology, and explain how they can be used for data mining. Our goal is to show that all these areas can be as useful for knowledge discovery as the HTML-based part of the Web.
