Empowering Investigative Journalism with Graph-based Heterogeneous Data Management

Investigative Journalism (IJ, for short) is a staple of modern, democratic societies. IJ often requires working with large, dynamic sets of heterogeneous, schema-less data sources, which can be structured, semi-structured, or textual; this limits the applicability of classical data integration approaches. In prior work, we developed ConnectionLens, a system that integrates such sources into a single heterogeneous graph by leveraging Information Extraction (IE) techniques; users can then query the graph by means of keywords, and explore query results and their neighborhood using an interactive GUI. Our keyword search problem is complicated by the heterogeneity of the graph, and by the lack of a result score function that would allow pruning part of the search space. In this work, we describe an actual IJ application studying conflicts of interest in the biomedical domain, and we show how ConnectionLens supports it. Then, we present novel techniques addressing the scalability challenges raised by this application: the first reduces the significant IE costs incurred while building the graph, while the second is a novel, parallel, in-memory keyword search engine, which achieves orders of magnitude speed-up over our previous engine. Our experimental study on the real-world IJ application data confirms the benefits of our contributions.
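To make the notion of keyword search over an integrated graph concrete, the following minimal sketch finds a small tree connecting one matching node per keyword. It is an illustration only, not the ConnectionLens engine: it assumes a tiny undirected graph held in memory, matches keywords by substring, and ranks candidate roots by the sum of breadth-first distances, which is only a rough proxy for tree size. The node identifiers, labels, and function names (`bfs_from`, `keyword_search`) are hypothetical.

```python
from collections import deque


def bfs_from(graph, sources):
    """Multi-source BFS; returns {node: (distance, parent)} for reachable nodes."""
    seen = {s: (0, None) for s in sources}
    queue = deque(sources)
    while queue:
        node = queue.popleft()
        dist, _ = seen[node]
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = (dist + 1, node)
                queue.append(neighbor)
    return seen


def keyword_search(graph, labels, keywords):
    """Return (root, edges) of a tree linking one match per keyword, or None."""
    searches = []
    for kw in keywords:
        # Assumption: a node matches a keyword if its label contains it.
        matches = sorted(n for n, text in labels.items()
                         if kw.lower() in text.lower())
        if not matches:
            return None
        searches.append(bfs_from(graph, matches))
    # Candidate roots are the nodes reached from every keyword.
    common = set(searches[0]).intersection(*searches[1:])
    if not common:
        return None
    # Assumed score: sum of distances; ties are broken by node identifier.
    root = min(common, key=lambda n: (sum(s[n][0] for s in searches), n))
    edges = set()
    for s in searches:
        node = root
        while s[node][1] is not None:  # walk back to the matching node
            parent = s[node][1]
            edges.add(frozenset((node, parent)))
            node = parent
    return root, edges


# Hypothetical toy data: a person, a publication, and a company.
labels = {
    "p1": "Dr. Alice Martin",
    "doc1": "Study on drug X",
    "org1": "PharmaCorp",
    "org2": "Unrelated Institute",
}
graph = {
    "p1": ["doc1"],
    "doc1": ["p1", "org1"],
    "org1": ["doc1"],
    "org2": [],
}

result = keyword_search(graph, labels, ["alice", "pharmacorp"])
print(result)
```

On this toy graph the query connects the person to the company through the publication both are linked to, which is the kind of connection a conflict-of-interest investigation looks for. A realistic engine would also need a principled result score and a way to bound the search, which this sketch does not attempt.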
