Article Segmentation in Digitised Newspapers with a 2D Markov Model

Document analysis and recognition is increasingly used to digitise collections of historical books, newspapers and other periodicals. In the digital humanities, the goal is often to apply information retrieval (IR) and natural language processing (NLP) techniques to help researchers analyse and navigate these digitised archives. The lack of article segmentation impairs many IR and NLP systems, which assume that text is split into ordered, error-free documents. We define a document analysis and image processing task for segmenting digitised newspapers into articles and other content, e.g. adverts, and we automatically create a dataset of 11602 articles. Using this dataset, we develop and evaluate an innovative 2D Markov model that encodes reading order and substantially outperforms the current state of the art, reaching accuracy similar to that of human annotators.
