Commas Recovery with Syntactic Features in French and in Czech

Automatic speech transcripts can be made more readable and more useful for further processing by enriching them with punctuation marks and other meta-linguistic information. In this work, we study how to improve the automatic recovery of one of the most difficult punctuation marks, the comma, in French and in Czech. We show that comma detection performance improves substantially in both languages when syntactic features derived from dependency structures are integrated into our baseline Conditional Random Field model. We further study the relative impact of language-independent vs. language-specific features, and show that combining the two gives the largest improvement. Finally, we discuss the robustness of these features to speech recognition errors.
