Interactive Repair of Tables Extracted from PDF Documents on Mobile Devices

PDF documents often contain rich data tables that offer opportunities for dynamic reuse in new interactive applications. We describe a pipeline for extracting, analyzing, and parsing PDF tables based on existing machine learning and rule-based techniques. Deploying this pipeline on a corpus of 447 documents containing 1,171 tables yields only 11 tables (under 1%) that are correctly extracted and parsed. To improve on these automatic results, we first present a taxonomy of the errors that arise in the analysis pipeline and discuss how cascading errors affect the user experience. We then contribute a system with two sets of lightweight interaction techniques (gesture and toolbar) for viewing and repairing extraction errors in PDF tables on mobile devices. In an evaluation with 17 users on both a phone and a tablet, participants effectively repaired common errors in 10 tables, taking about 2 minutes per table on average.
