Adding sentence boundaries to conversational speech transcriptions using noisily labelled examples

This paper presents a technique for adding sentence boundaries to text obtained by Automatic Speech Recognition (ASR) of conversational speech audio. We show that, starting from imprecise boundaries derived only from the silence information provided by an ASR system, boundary detection can be improved using head and tail phrases. The main purpose of inserting sentence boundaries into conversational ASR text is to improve linguistic analysis, namely Information Extraction, for text mining systems that handle huge volumes of textual data and analyze trends and specific features of concepts described in document sets. Hence, we also show how the addition of boundaries improves two basic natural language processing tasks: part-of-speech (POS) label assignment and noun phrase (NP) extraction.
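
The silence-based starting point described above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that the ASR output is available as (token, start time, end time) triples and that a pause of at least 0.5 seconds marks a candidate boundary. The function name `silence_boundaries`, the input format, and the threshold value are all hypothetical choices made for illustration. The refinement with head and tail phrases is not shown.

```python
from typing import List, Tuple

# Assumed input format: (token, start_sec, end_sec) for each recognized word.
Word = Tuple[str, float, float]


def silence_boundaries(words: List[Word], min_pause: float = 0.5) -> List[List[str]]:
    """Split a word sequence wherever the pause between words is long enough.

    min_pause is an assumed threshold in seconds, not a value from the paper.
    """
    sentences: List[List[str]] = []
    current: List[str] = []
    prev_end = None
    for token, start, end in words:
        # A long gap since the previous word closes the current segment.
        if prev_end is not None and current and start - prev_end >= min_pause:
            sentences.append(current)
            current = []
        current.append(token)
        prev_end = end
    if current:
        sentences.append(current)
    return sentences


example = [
    ("hello", 0.0, 0.4),
    ("there", 0.45, 0.8),
    ("how", 1.6, 1.8),
    ("are", 1.85, 2.0),
    ("you", 2.05, 2.3),
]
segments = silence_boundaries(example)
```

Boundaries produced this way are noisy, since speakers pause within sentences and run sentences together, which is why they serve only as imprecise labels for a later refinement step.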