In most approaches to text classification, the basic units (terms) used to represent a document are words (with or without stemming), character n-grams, phonemes, syllables, multi-words, etc. However, only one such unit type is normally used at a time. In this paper, a novel approach is presented that combines two types of such units to represent a document for text classification. Our experiments show that, if appropriately chosen, the combined terms yield better recognition rates than any single term type alone. The new approach is closely related to the high-redundancy property, which is desirable for text classification tasks and which is directly connected to the margin theory of Support Vector Machines [14]. The topic classification approach presented here is designed to be part of the ALERT system, which automatically scans multimedia data such as TV or radio broadcasts for the presence of pre-specified topics. Therefore, tests are mainly conducted on the (German) transcription output of an automatic speech recognizer (ASR). Since, for German document classification, character n-gram features are more robust against errors in the text (e.g. from ASR) and yield better results than word-level features [7], we combined these two term types to represent a document. In most cases, this approach gives better results than character n-gram features or word features alone. We also applied our feature-combination approach to text classification on the well-known English Reuters corpus (which consists of plain text, not ASR output); there, too, an appropriate combination of different feature types gives a slightly better result. Additionally, we introduce a soundex text representation scheme which, when used in combination with other feature types, can help the text classification task.
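The feature-combination idea described above can be sketched in scikit-learn, which is not used in the paper itself: word-level and character n-gram tf-idf features are extracted in parallel, concatenated into one document representation, and fed to a linear SVM. The toy corpus, labels, and the choice of trigrams are illustrative assumptions, not the paper's experimental setup.

```python
# Hedged sketch of combining word features with character n-gram features
# for SVM-based text classification. Corpus and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# Word features and character trigram features are extracted in parallel
# and concatenated into a single combined feature vector per document.
combined_features = FeatureUnion([
    ("words", TfidfVectorizer(analyzer="word")),
    ("char_3grams", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))),
])

clf = make_pipeline(combined_features, LinearSVC())

# Tiny topic-labelled corpus (illustrative only).
docs = [
    "the football match ended in a draw",
    "the striker scored a late goal",
    "the election results were announced today",
    "voters went to the polls this morning",
]
labels = ["sport", "sport", "politics", "politics"]

clf.fit(docs, labels)
prediction = clf.predict(["a goal was scored in the match"])[0]
```

Character n-grams make the representation tolerant of ASR misrecognitions (a corrupted word still shares most of its trigrams with the correct form), while word features preserve exact topical vocabulary; the union keeps both signals.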
[1] James Allan, et al. "Automatic Query Expansion Using SMART: TREC 3," TREC, 1994.
[2] Jörg Kindermann, et al. "Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?" Machine Learning, 2002.
[3] Thorsten Joachims, et al. "Learning to Classify Text Using Support Vector Machines: Methods, Theory and Algorithms," The Kluwer International Series in Engineering and Computer Science, 2002.
[4] Vladimir N. Vapnik, et al. "The Nature of Statistical Learning Theory," Statistics for Engineering and Information Science, 2000.
[5] Vladimir Vapnik, et al. "The Nature of Statistical Learning," 1995.
[6] Gerhard Rigoll, et al. "Automatic Topic Identification in Multimedia Broadcast Data," Proceedings, IEEE International Conference on Multimedia and Expo, 2002.
[7] Justin Zobel, et al. "Phonetic String Matching: Lessons from Information Retrieval," SIGIR '96, 1996.
[8] A. Kosmala, et al. "Audio-Visual Analysis of Multimedia Documents for Automatic Topic Identification," 2002.
[9] Nello Cristianini, et al. "An Introduction to Support Vector Machines and Other Kernel-based Learning Methods," 2000.
[10] Thorsten Joachims, et al. "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization," ICML, 1997.
[11] Thorsten Joachims, et al. "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," ECML, 1998.