Using apps and rules in contextual workflows to semantically extract data from documents

When used intelligently, Big Data locked in unstructured sources such as PDF documents can yield unprecedented insights for solving tough business problems, optimizing business processes, and improving customer relations. The challenge addressed in this paper is to unlock the value of data buried in unstructured documents. We describe how an approach based on contextual workflows can address, in a semantic and flexible way, the various problems that arise when processing data contained in documents. We present the MANTRA Smart Data Platform, which makes it possible to turn Big Data into Smart Data by means of contextual workflows composed of smart-cloud applications (APPs for short). Among these, the MANTRA Language APP executes MANTRA rules that extract and annotate information contained in heterogeneous sources (raw text, PDF, HTML, or other presentation-oriented document formats). Such rules exploit syntactic and semantic expressions, visual and spatial features, and natural language capabilities. Real-world applications show that the proposed approach can process large numbers of heterogeneous input documents, and can extract and consolidate the information of interest.
