A Novel Procedure to Speed up the Transcription of Historical Handwritten Documents by Interleaving Keyword Spotting and User Validation

We propose a novel procedure to speed up the transcription of handwritten documents in digital historical archives when a keyword spotting system is used for this purpose. Instead of validating the system outputs in a single step, as is customary, the proposed methodology envisages a multi-step validation process embedded in a human-in-the-loop approach. At each step, the system outputs are validated and, whenever the system mistakenly returns a word image that does not correspond to any entry of the keyword list, its correct transcription is entered and used to query the system in the next step. We evaluated the performance of our approach experimentally in terms of the total time needed to achieve the complete transcription of a subset of documents from the Bentham dataset. The results confirm that interleaving keyword spotting by the system and validation by the user significantly reduces the time required to transcribe the document content, compared with both manual transcription and the traditional end-of-the-loop validation process.
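The multi-step validation loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function `interleaved_transcription`, the `spot` callback, and the toy spotter `toy_spot` are hypothetical names introduced here, the user is simulated by a ground-truth lookup, and the sketch assumes that a returned word matching a different known keyword is simply corrected during validation. It does not model the transcription time that the paper measures.

```python
from typing import Callable, Dict, List, Tuple


def interleaved_transcription(
    keywords: List[str],
    spot: Callable[[str], List[int]],
    ground_truth: List[str],
) -> Tuple[Dict[int, str], List[int], int]:
    """Interleave keyword spotting and (simulated) user validation.

    `spot(query)` is a hypothetical spotter returning indices of word images.
    `ground_truth[idx]` stands in for what the user reads when validating.
    """
    known = set(keywords)              # keyword list, grows as the user types
    transcribed: Dict[int, str] = {}   # word-image index -> validated text
    queries = list(keywords)
    steps = 0

    while queries:
        steps += 1
        next_queries: List[str] = []
        for query in queries:
            for idx in spot(query):
                if idx in transcribed:
                    continue  # already validated in an earlier step
                truth = ground_truth[idx]  # simulated user validation
                transcribed[idx] = truth
                if truth not in known:
                    # Word image outside the keyword list was returned by
                    # mistake: the user enters its transcription, which
                    # becomes a query in the next step.
                    known.add(truth)
                    next_queries.append(truth)
        queries = next_queries

    # Whatever the spotter never returned is left for manual transcription.
    manual = [i for i in range(len(ground_truth)) if i not in transcribed]
    return transcribed, manual, steps


# Toy data: each word image is represented by its true transcription.
words = ["the", "law", "the", "this", "legal", "law", "that", "paws"]


def toy_spot(query: str) -> List[int]:
    """Toy spotter (an assumption): returns words sharing the query's first
    or last letter, which yields both correct hits and false positives."""
    return [
        i for i, w in enumerate(words)
        if w[0] == query[0] or w[-1] == query[-1]
    ]


transcribed, manual, steps = interleaved_transcription(["the"], toy_spot, words)
```

In this toy run the single keyword "the" leads, through false positives, to the new queries "this", "that", and "paws" over three steps, while the word images the spotter never returns remain for manual transcription.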
