Paper information - Preparing children's writing database for automated processing

This paper describes the anonymization of a publicly available German corpus of children's spontaneously written texts from Grades 1-8, which have been scanned and digitized. After reviewing the previously published data collection process, it describes the method used to anonymize the texts and metadata. It then defines a revised annotation set that was added to the existing transcription. This annotation supports spelling error analysis and adds further annotation at the syntax level, so that spelling and syntax issues can be processed separately. Updated statistics for the new version of the data are reported to give the reader an idea of the research potential this version may offer.

Sebastian Stüker | Rémi Lavalley | Kay M. Berkling
