Abstract argumentation for reading order detection

Detecting the reading order among the layout components of a document page is fundamental to ensuring the effectiveness, or even the applicability, of subsequent content extraction steps. While the reading flow of single-column documents can be determined straightforwardly, the task may become very hard in more complex documents. This paper proposes an automatic strategy, based on abstract argumentation, for identifying the correct reading order of the components of a document page. The technique is unsupervised and works on any kind of document, relying only on general assumptions about how humans behave when reading documents. Experimental results show that it is effective in more complex cases than previous solutions proposed in the literature, and that it requires less background knowledge.
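To make the notion of abstract argumentation concrete, the following is a minimal sketch of a generic argumentation framework (a set of arguments plus an attack relation) and the computation of its grounded extension, i.e. the most cautious set of acceptable arguments. It is not the paper's algorithm: the function name `grounded_extension`, the argument labels, and the toy attack relation between two hypothetical blocks `b1` and `b2` are assumptions made purely for illustration.

```python
from typing import Hashable, Iterable, Set, Tuple


def grounded_extension(
    arguments: Iterable[Hashable],
    attacks: Iterable[Tuple[Hashable, Hashable]],
) -> Set[Hashable]:
    """Return the grounded extension of an abstract argumentation framework.

    The grounded extension is the least fixed point of the characteristic
    function F(S) = {a | every attacker of a is attacked by some member of S}.
    """
    args = set(arguments)
    attack_list = list(attacks)

    # Map each argument to the set of arguments attacking it.
    attackers = {a: set() for a in args}
    for src, dst in attack_list:
        attackers[dst].add(src)

    accepted: Set[Hashable] = set()
    while True:
        # Arguments currently attacked by at least one accepted argument.
        defeated = {dst for src, dst in attack_list if src in accepted}
        # An argument is acceptable if all of its attackers are defeated.
        new_accepted = {a for a in args if attackers[a] <= defeated}
        if new_accepted == accepted:
            return accepted
        accepted = new_accepted


# Hypothetical toy example: two candidate orderings of blocks b1 and b2
# attack each other, and a layout observation attacks one of them.
toy_arguments = ["above(b1,b2)", "b1<b2", "b2<b1"]
toy_attacks = [
    ("b1<b2", "b2<b1"),
    ("b2<b1", "b1<b2"),
    ("above(b1,b2)", "b2<b1"),
]
toy_result = grounded_extension(toy_arguments, toy_attacks)
```

In this toy framework the unattacked observation defeats the ordering `b2<b1`, which in turn reinstates `b1<b2`. With only the two mutually attacking orderings and no observation, the grounded extension would be empty, reflecting that nothing can be cautiously concluded.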
