Multi-Hypothesis Parsing of Tabular Data in Comma-Separated Values (CSV) Files

Tabular data on the web comes in many formats and shapes. Preparing such data for analysis and integration requires manual steps that go beyond simply parsing it: configuring the parser correctly, removing meaningless rows, casting data types, and reshaping the table structure. The goal of this thesis is to develop a robust, modular system that automatically transforms messy CSV data sources into a tidy tabular data structure. The highly diverse corpus of CSV files from the UK open data hub serves as the basis for evaluating the system.
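To make the idea of parsing under several hypotheses concrete, the following is a minimal sketch and not the system developed in the thesis. It assumes the only open parser parameter is the delimiter, scores each candidate by how consistently it splits rows into the same number of fields, and then drops rows that do not match the dominant width as a crude stand-in for removing meaningless rows. The names `CANDIDATE_DELIMITERS`, `score_hypothesis`, and `parse_best` are hypothetical, as is the scoring rule.

```python
import csv
import io
from collections import Counter

# Assumed hypothesis space: one parsing hypothesis per candidate delimiter.
CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]


def score_hypothesis(text, delimiter):
    """Parse `text` with one delimiter and score the result.

    Returns (score, modal_width, rows). The score is the fraction of
    non-empty rows that share the most common field count; hypotheses
    whose dominant width is a single column score 0.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if not rows:
        return 0.0, 0, rows
    widths = Counter(len(r) for r in rows)
    # Break ties between equally frequent widths in favour of the wider one.
    modal_width, modal_count = max(widths.items(), key=lambda kv: (kv[1], kv[0]))
    if modal_width < 2:
        return 0.0, modal_width, rows
    return modal_count / len(rows), modal_width, rows


def parse_best(text):
    """Evaluate every hypothesis and keep the best-scoring parse.

    Rows whose width differs from the dominant width (for example a
    free-text title line above the table) are discarded.
    """
    best_delimiter, best_score, best_width, best_rows = None, -1.0, 0, []
    for delimiter in CANDIDATE_DELIMITERS:
        score, width, rows = score_hypothesis(text, delimiter)
        if score > best_score:
            best_delimiter, best_score = delimiter, score
            best_width, best_rows = width, rows
    table = [r for r in best_rows if len(r) == best_width]
    return best_delimiter, table


# Illustrative input: a title line, then a semicolon-separated table
# that uses decimal commas.
SAMPLE = "Report generated 2015-01-01\nname;amount\nalice;1,5\nbob;2,25\n"

if __name__ == "__main__":
    delimiter, table = parse_best(SAMPLE)
    print(repr(delimiter))
    for row in table:
        print(row)
```

On the sample input the comma hypothesis splits only the rows that contain decimal commas, so it scores lower than the semicolon hypothesis, which splits every table row into two fields. A real system would need further hypotheses (quoting, encoding, header position) and a more careful ranking than this single heuristic.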
