DeExcelerator: a framework for extracting relational data from partially structured documents

Only a small part of the structured data published on the web, whether as datasets on Open Data platforms such as data.gov or as HTML tables on the general web, is in relational form. Instead, the data is intermingled with formatting, layout, and textual metadata; that is, it is contained in partially structured documents. Such data must therefore be transformed into a true relational form, which is a precondition for most forms of data analysis and data integration. Using data.gov as an example source of partially structured documents, we present a classification of typical normalization problems. We then present the DeExcelerator, a framework for extracting relations from partially structured documents such as spreadsheets and HTML tables.