PandaDB: Understanding Unstructured Data in Graph Database

Unstructured data (e.g., images, videos, and PDF files) contain semantic information, such as the facial features of a person or the plate number of a vehicle. Semantic relationships may exist among data items: for example, the same person's face may appear in two otherwise unrelated photos. Some data are also available in structured form (e.g., a person's name and age). End users naturally prefer to query unstructured and structured data together, based on the potential relationships among them. In this work, we build an open-source graph database named PandaDB to manage and query structured and unstructured data in a graph. We first introduce the graph as the data model for managing structured and unstructured data in one framework, then propose a query language extension for understanding the semantic information of the unstructured data in the graph. Next, we develop a new cost model and related query optimization techniques to speed up unstructured data processing in the graph. Finally, we optimize unstructured data storage and provide an index to speed up query processing over unstructured data. PandaDB is widely used in industrial applications such as FinTech, knowledge graphs, and recommendation systems. The results show that PandaDB can support large-scale query processing over unstructured data in a graph.
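The idea of querying structured and unstructured properties together can be illustrated with a minimal Python sketch. This is not PandaDB's API or query language: the `Node` class, the `toy_extract` feature extractor (a byte histogram standing in for a real face-embedding model), and the `similar_people` function are all hypothetical names introduced here for illustration. The sketch also applies the cheap structured predicate before the expensive extraction step, which is one plausible reading of why a cost model matters for such queries.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Dict, List


@dataclass
class Node:
    """A graph node holding structured and unstructured properties (hypothetical model)."""
    node_id: int
    props: Dict[str, object] = field(default_factory=dict)  # structured, e.g. name, age
    blobs: Dict[str, bytes] = field(default_factory=dict)   # unstructured, e.g. photo bytes


def toy_extract(blob: bytes) -> List[float]:
    """Hypothetical stand-in for a semantic extractor: a 4-bucket byte histogram."""
    counts = [0.0] * 4
    for b in blob:
        counts[b % 4] += 1.0
    return counts


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def similar_people(nodes: List[Node], query_photo: bytes,
                   min_age: int, threshold: float = 0.9) -> List[str]:
    """Names of people at least min_age old whose photo resembles query_photo."""
    query_vec = toy_extract(query_photo)
    matches = []
    for node in nodes:
        # Cheap structured filter first, so extraction runs on fewer nodes.
        if node.props.get("age", 0) < min_age:
            continue
        photo = node.blobs.get("photo")
        if photo is None:
            continue
        # Expensive step: extract features from the unstructured property.
        if cosine(toy_extract(photo), query_vec) >= threshold:
            matches.append(node.props["name"])
    return matches


people = [
    Node(1, {"name": "Alice", "age": 30}, {"photo": b"\x00\x00\x01"}),
    Node(2, {"name": "Bob", "age": 17}, {"photo": b"\x00\x00\x01"}),
    Node(3, {"name": "Carol", "age": 40}, {"photo": b"\x03\x03\x03"}),
]
result = similar_people(people, b"\x00\x01\x00", min_age=18)
print(result)
```

In a real system the extractor would be a trained model and the similarity search would typically be served by a vector index rather than a linear scan; the sketch only shows how the two kinds of predicates combine in one query.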
