Systems for Graph Extraction from Tabular Data

Connections among real-world entities provide significant insights for numerous real-life applications in social networks, the semantic web, road maps, finance, and other domains. Graphs are perhaps the most natural way to model such connections in application data. However, in many enterprises, application data is still primarily stored in an RDBMS in a tabular format, and users extract graphs out of the RDBMS and store them in specialized graph processing systems. As a result, many users face two major challenges before conducting any graph analysis. First, extracting graphs from an RDBMS requires building an ETL pipeline, which can require a significant amount of time. Second, keeping the extracted graph in the graph processing system, such as a graph database management system (GDBMS), in sync with the original data in the RDBMS requires developing additional non-trivial synchronization code. In this thesis, we study these two challenges and present two software systems, GraphWrangler and R2GSync, that we have developed to address them. GraphWrangler is an interactive system that streamlines the ETL pipeline. Users connect to an RDBMS using GraphWrangler and describe table-to-graph mappings through several simple interactions, such as dragging and dropping rows and columns and drawing edges on the screen. This way, users can describe the graphs they would like to extract without writing any custom scripts. In addition, GraphWrangler allows users to immediately visualize their tables in the form of a graph. Our second system, R2GSync, uses the mappings of an extracted graph to maintain a consistent, i.e., in-sync, copy of this graph in a GDBMS as updates happen to the original RDBMS from which the graph was extracted. Querying the extracted graph inside the GDBMS requires a new querying functionality inside the GDBMS that we call edge views. We describe our implementation of edge views and several optimizations that make queries containing edge views more efficient.
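To make the idea of a table-to-graph mapping concrete, the following is a minimal sketch in Python using the standard `sqlite3` module. It is not GraphWrangler's or R2GSync's actual interface: the function `extract_graph`, the mapping format, and the `customer`/`orders` schema are all hypothetical and chosen only for illustration. Each mapped table contributes one node per row, and a foreign-key column pair contributes one edge per row.

```python
import sqlite3


def extract_graph(conn, node_tables, edge_mappings):
    """Build (nodes, edges) from relational tables (illustrative only).

    node_tables:   {node_label: (table_name, key_column)}
    edge_mappings: [(edge_label, table_name,
                     src_label, src_column, dst_label, dst_column)]
    Table and column names are interpolated into SQL, so they are
    assumed to come from a trusted mapping, not from user input.
    """
    nodes = {}
    for label, (table, key) in node_tables.items():
        cur = conn.execute(f"SELECT * FROM {table}")
        cols = [d[0] for d in cur.description]
        for row in cur:
            props = dict(zip(cols, row))
            # A node is identified by its label plus its key value.
            nodes[(label, props[key])] = props

    edges = []
    for edge_label, table, src_label, src_col, dst_label, dst_col in edge_mappings:
        query = f"SELECT {src_col}, {dst_col} FROM {table}"
        for src, dst in conn.execute(query):
            # Skip rows whose foreign key is NULL: they define no edge.
            if src is not None and dst is not None:
                edges.append(((src_label, src), edge_label, (dst_label, dst)))
    return nodes, edges


# Hypothetical schema: customers place orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

nodes, edges = extract_graph(
    conn,
    node_tables={"Customer": ("customer", "id"), "Order": ("orders", "id")},
    edge_mappings=[("PLACED", "orders", "Customer", "customer_id", "Order", "id")],
)
```

Re-running such an extraction after every change to the tables is the naive way to keep a graph copy current; the synchronization problem described above is about avoiding hand-written code of this kind.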
