Supporting link analysis using advanced querying methods on semantic web databases

There is an increasing demand for technologies that can help organizations unearth actionable knowledge from their data assets. This demand continues to drive a flurry of activity in data mining research, where the emphasis is on technologies that can identify patterns in data. However, in addition to the "patterns" view of data, other data and knowledge perspectives are required to support the broad range of complex analytical tasks found in contemporary applications. For example, in homeland security, bioinformatics, business and other investigative domains, many tasks are focused on "connecting the dots". For this genre of applications, support for identifying, revealing and analyzing links or relationships between groups of entities (link analysis) is crucial. Currently, mainstream database systems do not support such analyses, and current solutions rely on exporting data from the database into custom applications to be analyzed. This incurs additional overhead and precludes the use of other mature technologies offered by today's database systems.

This thesis argues for database support for link analysis by providing an appropriate interpretation for such information requests in a graph database model. It addresses several key database issues with respect to supporting such queries. First, it identifies a number of querying constructs that are crucial to supporting link analysis applications and proposes a formal query language called SPARQ2L that allows their expression. A formal semantics and a characterization of the computational complexity of SPARQ2L's query constructs are also presented. Second, it proposes a database storage model that supports efficient processing of queries while being tolerant of data persistence. The storage model combines a graph linearization strategy rooted in algebraic techniques for solving path problems with a set of heuristics for node and edge clustering that aim to minimize external path lengths. Third, it proposes a novel relevance model, SemRank, which exploits the "machine processible semantics" of data in ascribing relative importance to query results and offers a flexible or "modulative ranking" model enabling serendipitous knowledge discovery.
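To make the idea of a link analysis request concrete, the following minimal Python sketch enumerates the simple directed paths that connect two entities in a small set of RDF-style triples. It is an illustration only and does not use SPARQ2L syntax, the thesis's storage model, or SemRank; the function name `simple_paths`, the `max_edges` bound, and the sample data are all hypothetical.

```python
from collections import defaultdict


def simple_paths(triples, source, target, max_edges=4):
    """Enumerate simple directed paths from source to target.

    Each path is returned as a list of (subject, predicate, object) triples.
    This is a plain depth-first search, used here only as an illustration.
    """
    # Index outgoing edges by subject for quick expansion.
    out_edges = defaultdict(list)
    for s, p, o in triples:
        out_edges[s].append((p, o))

    paths = []

    def extend(node, visited, path):
        if node == target and path:
            paths.append(list(path))
            return
        if len(path) == max_edges:
            return
        for p, o in out_edges[node]:
            if o in visited:
                continue  # keep the path simple: no repeated nodes
            visited.add(o)
            path.append((node, p, o))
            extend(o, visited, path)
            path.pop()
            visited.remove(o)

    extend(source, {source}, [])
    return paths


# Hypothetical sample data: who is connected to "globex", and how?
triples = [
    ("alice", "worksFor", "acme"),
    ("acme", "fundedBy", "globex"),
    ("alice", "knows", "bob"),
    ("bob", "directorOf", "globex"),
    ("bob", "knows", "alice"),
]

paths = simple_paths(triples, "alice", "globex")
for path in paths:
    print(" -> ".join(f"{s} [{p}] {o}" for s, p, o in path))
```

Unlike a pattern-matching query, the result here is a set of paths rather than a set of variable bindings, which is the kind of answer a "connecting the dots" task asks for. An exhaustive search like this does not scale to large graphs, which is the gap a dedicated storage model and ranking scheme are meant to address.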
