Transforming Japanese archives into accessible digital books

Digitized physical books offer access to tremendous amounts of knowledge, including for people with print-related disabilities. Various projects and standardization activities are underway to make all of our past and present books accessible. However, digitizing books requires extensive human effort, such as correcting OCR (optical character recognition) errors and adding structural information such as headings. Some Asian languages require extra effort to correct OCR errors because of their large and varied character sets: Japanese uses more than 10,000 characters, compared with a few hundred in English. This heavy workload inhibits the creation of accessible digital books. To facilitate digitization, we are developing a new system for processing physical books that reduces and disperses the human effort and accelerates conversion by combining automatic inference with human capabilities. Our system preserves the original page images throughout the digitization process to support gradual refinement, and it distributes the work as micro-tasks. We conducted trials with the Japanese National Diet Library (NDL) to evaluate the effort required to digitize books with a variety of layouts and publication years. The results showed that old Japanese books pose specific problems when correcting OCR errors and adding structure. Drawing on our results, we discuss further workload reductions and future directions for international digitization systems.
