Declarative analysis of noisy information networks

There is growing interest in methods for analyzing data describing networks of all types, including information, biological, physical, and social networks. Typically, the data describing these networks is observational, and thus noisy and incomplete; it is often at the wrong level of fidelity and abstraction for meaningful data analysis. This has resulted in a growing body of work on extracting, cleaning, and annotating network data. Unfortunately, much of this work is ad hoc and domain-specific. In this paper, we present the architecture of a data management system that enables efficient, declarative analysis of large-scale information networks. We identify a set of primitives to support the extraction and inference of a network from observational data, and describe a framework that enables a network analyst to easily implement and combine new extraction and analysis techniques, and to apply them efficiently to large observation networks. The key insight behind our approach is to decouple, to the extent possible, (a) the operations that require traversing the graph structure (typically the computationally expensive step) from (b) the operations that modify and update the extracted network. We present an analysis language based on Datalog, and show how to use it to cleanly achieve such decoupling. We briefly describe our prototype system that supports these abstractions. We include a preliminary performance evaluation of the system and show that our approach scales well and can efficiently handle a wide spectrum of data cleaning operations on network data.
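The decoupling described above can be illustrated with a minimal sketch. It is written in Python rather than in the paper's Datalog-based language, whose syntax is not shown here. The function names `find_merge_candidates` and `apply_merges`, the merge criterion (same name ignoring case, plus at least `min_shared` common neighbors), and the toy data are all assumptions made for illustration, not the system's actual primitives. The first function only reads the graph and returns a list of proposed merges; the second only rewrites the edge set from that list.

```python
from itertools import combinations


def find_merge_candidates(edges, names, min_shared=2):
    """Query phase (hypothetical): read-only traversal returning node pairs to merge."""
    # Build an adjacency map from the undirected edge set.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    candidates = []
    for a, b in combinations(sorted(neighbors), 2):
        shared = (neighbors[a] & neighbors[b]) - {a, b}
        # Assumed criterion: same name ignoring case, and enough common neighbors.
        if names[a].lower() == names[b].lower() and len(shared) >= min_shared:
            candidates.append((a, b))
    return candidates


def apply_merges(edges, candidates):
    """Update phase (hypothetical): rewrite edges from the candidate list alone."""
    rep = {}  # maps a merged-away node to the node that replaces it

    def find(x):
        # Follow replacement links to the surviving representative.
        while x in rep:
            x = rep[x]
        return x

    for a, b in candidates:
        ra, rb = find(a), find(b)
        if ra != rb:
            rep[rb] = ra

    merged = set()
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:  # drop self-loops created by a merge
            merged.add((min(ru, rv), max(ru, rv)))
    return merged


# Toy observation network: nodes 1 and 2 are two spellings of one author.
names = {1: "J. Smith", 2: "j. smith", 3: "A. Jones", 4: "B. Lee"}
edges = {(1, 3), (1, 4), (2, 3), (2, 4)}

candidates = find_merge_candidates(edges, names)
cleaned = apply_merges(edges, candidates)
print(candidates)
print(sorted(cleaned))
```

Because the update step consumes only the candidate list, the expensive traversal could be swapped out or optimized independently of how the merges are applied, which is the separation the abstract argues for.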
