A Novel Feature Combination Approach for Spoken Document Classification with Support Vector Machines

In most approaches to text classification, the basic units (terms) used to represent a document are words (with or without stemming), n-gram characters, phonemes, syllables, multi-words, etc. However, only one type of unit is normally used at a time. In this paper, a novel approach is presented that combines two types of such units to represent a document for text classification. Our experiments show that, if the combination is chosen appropriately, the combined terms yield better recognition rates than either type of term alone. The new approach is closely related to the property of high redundancy, which is desirable for text classification tasks and which is directly connected to the margin theory of Support Vector Machines [14]. The topic classification approach presented here is designed to be part of the ALERT system, which automatically scans multimedia data such as TV or radio broadcasts for the presence of pre-specified topics. Therefore, tests are mainly conducted on the (German) transcription output of an automatic speech recognizer (ASR). Since, for German document classification, n-gram character features are more robust against errors in the text (e.g. from ASR) and yield better results than word-level features [7], we combined these two types of terms to represent a document. In most cases, this approach gives better results than n-gram character features or word features alone. We also applied our feature combination approach to text classification on the well-known English Reuters corpus (which consists of plain text rather than ASR output). The results show that an appropriate combination of different feature types also gives slightly better results there. Additionally, we introduce a Soundex text representation scheme which, when used in combination with other feature types, can further help the text classification task.
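To illustrate the kind of feature combination described above (not the exact configuration or data used in this work), the following minimal sketch concatenates word features and character n-gram features into a single document representation and feeds it to a linear SVM using scikit-learn. The toy corpus, the n-gram range, and the TF-IDF weighting are placeholder assumptions for illustration only.

```python
# Minimal sketch: combining word features with character n-gram features
# for SVM text classification. Corpus, n-gram range, and weighting are
# illustrative assumptions, not the setup used in the paper.
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy documents and topic labels (placeholders for ASR transcriptions).
docs = [
    "der bundestag debattiert ueber die steuerreform",
    "das fussballspiel endete mit einem unentschieden",
    "die regierung plant neue gesetze zur steuer",
    "der verein gewann das spiel in der verlaengerung",
]
labels = ["politics", "sports", "politics", "sports"]

# Word features and character n-gram features are extracted separately
# and concatenated into one sparse feature vector per document.
combined_features = FeatureUnion([
    ("words", TfidfVectorizer(analyzer="word")),
    ("char_ngrams", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))),
])

classifier = Pipeline([
    ("features", combined_features),
    ("svm", LinearSVC()),
])

classifier.fit(docs, labels)
print(classifier.predict(["neue steuern im bundestag beschlossen"]))
```

Concatenating the two sparse feature spaces simply enlarges the document representation; a margin-based classifier such as a linear SVM can exploit the added redundancy without requiring any change to the learning algorithm itself.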