Learned Data-aware Image Representations of Line Charts for Similarity Search

Finding line-chart images similar to a given query image is a common task in data exploration and image query systems, e.g., finding similar trends in stock markets or in medical electroencephalography (EEG) images. State-of-the-art approaches consider either data-level similarity (when the underlying data is available) or image-level similarity (when it is not). In this paper, we study the scenario in which only line-chart images are available at query time. Our goal is to train a neural network that turns these line-chart images into representations that are aware of the data used to generate the charts. Our key idea is to collect both data and line-chart images to train such a network, while at query (inference) time we support the case in which only line-chart images are provided. To this end, we present LineNet, a Vision Transformer-based Triplet Autoencoder model that learns data-aware image representations of line charts for similarity search. We design a novel pseudo-label selection mechanism to guide LineNet to capture both data-level and image-level similarity of line charts. We further propose a diversified training-sample selection strategy to optimize the learning process and improve performance. We conduct both quantitative evaluation and case studies, showing that LineNet significantly outperforms state-of-the-art methods for searching similar line-chart images.
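The train-with-data, query-with-images idea can be illustrated with a toy sketch. The abstract does not specify how pseudo labels are selected, which distance is used, or the exact loss, so everything below is an assumption made for illustration: the helper names (`select_pseudo_labels`, `triplet_loss`), the Euclidean distance on the underlying series, the closest/farthest selection rule, and the standard triplet margin loss. The embeddings stand in for the output of an image encoder and are hard-coded here.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_pseudo_labels(anchor_idx, series):
    """Hypothetical rule: the positive is the series closest to the anchor
    in data space, the negative is the farthest. Uses the underlying data,
    which is assumed to be available only at training time."""
    others = [i for i in range(len(series)) if i != anchor_idx]
    ranked = sorted(others, key=lambda i: euclidean(series[anchor_idx], series[i]))
    return ranked[0], ranked[-1]

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embedding vectors."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy underlying data for three charts (assumed values).
series = [[0, 1, 2, 3], [0, 1, 2, 4], [3, 2, 1, 0]]
# Placeholder image embeddings; in practice an image encoder would produce these.
embeddings = [[0.0, 0.0], [0.1, 0.0], [2.0, 0.0]]

pos, neg = select_pseudo_labels(0, series)
loss = triplet_loss(embeddings[0], embeddings[pos], embeddings[neg])
print(pos, neg, loss)
```

At query time only the embeddings would be compared, so the underlying series are not needed once training is done.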
