This paper describes the introduction of Arabic speech and text into the TIDES OnTAP system. The work comprises the development of the BBN Audio Indexing System for Arabic broadcast news, and the development and integration of an Arabic event tracker and Arabic querying into TIDES OnTAP. Key issues addressed in this work revolve around the three major components of the audio indexing system (automatic speech recognition, speaker identification, and named entity identification) and around Arabic document tracking. The system deals with several challenges introduced by the Arabic language, including the absence of short vowels in written text and the presence of compound words formed by attaching certain conjunctions, prepositions, articles, and pronouns to the word stem as prefixes and suffixes. The absence of short vowels in the transcripts was addressed with a novel solution that leverages the strengths of hidden Markov models. Another challenge was the acquisition of appropriate language modeling data, given the absence of broadcast news data for that purpose. We present performance results for all three components of the Audio Indexing System.