Books of Hours: The First Liturgical Data Set for Text Segmentation

The Book of Hours was the bestseller of the late Middle Ages and Renaissance. It is an invaluable historical treasure, documenting the devotional practices of Christians in the late Middle Ages. Up to now, its textual content has scarcely been studied because of its manuscript nature, its length, and its complex content. At first glance, it looks too standardized. However, the study of books of hours raises important challenges: (i) in image analysis, its often lavish ornamentation (illegible painted initials, line-fillers, etc.), abbreviated words, and multilingualism are difficult to address in Handwritten Text Recognition (HTR); (ii) in text segmentation, its hierarchical, entangled structure offers a new field of investigation; (iii) in digital humanities, its textual content gives opportunities for historical analysis. In this paper, we provide the first corpus of books of hours, consisting of Latin transcriptions of 300 books of hours generated by HTR, the counterpart of Optical Character Recognition (OCR) for handwritten rather than printed texts. We designed a structural scheme of the book of hours and manually annotated two books of hours according to this scheme. Lastly, we performed a systematic evaluation of the main state-of-the-art text segmentation approaches.
