Turkish LVCSR: Database Preparation and Language Modeling for an Agglutinative Language

Turkish is an agglutinative language: a very large number of word forms can be produced from a single root by adding suffixes [1]. Language modeling for agglutinative languages therefore has to differ from modeling for languages such as English, which also have inflection but far fewer inflected forms. This work presents techniques that can be used for modeling agglutinative languages. Turkish is one of the least studied languages in speech recognition, so the first step toward Turkish speech recognition is to prepare a database. The texts to be recorded for the database were selected from television programs and newspaper articles. The selection criteria were to cover a variety of subjects and to obtain a phonetically balanced corpus; it is also important to include as many different words as possible. The Speech Training and Recognition Unified Tool (STRUT) was used to train and test the systems in preliminary recognition experiments.
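
The text selection criteria described above (broad coverage, phonetic balance, many distinct words) can be illustrated with a small greedy selection sketch. This is not the procedure used in this work, which is not specified here; it is one possible approach under stated assumptions. The function names `letter_bigrams` and `select_sentences` are hypothetical, letter bigrams stand in for phone-level units on the assumption that Turkish spelling is close to phonemic, and the input is assumed to be lowercase text already split into sentences.

```python
def letter_bigrams(sentence):
    """Return the set of adjacent-letter pairs inside each word.

    Assumption: letters are used as a rough proxy for phones.
    """
    units = set()
    for word in sentence.split():
        for i in range(len(word) - 1):
            units.add(word[i:i + 2])
    return units


def select_sentences(candidates, budget):
    """Greedily pick up to `budget` sentences.

    At each step, choose the sentence that adds the most words and
    letter bigrams not yet covered by the sentences chosen so far.
    Stop early when no remaining sentence adds anything new.
    """
    covered_words = set()
    covered_units = set()
    remaining = list(candidates)
    chosen = []

    def gain(sentence):
        new_words = set(sentence.split()) - covered_words
        new_units = letter_bigrams(sentence) - covered_units
        return len(new_words) + len(new_units)

    while remaining and len(chosen) < budget:
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break  # nothing new left to cover
        chosen.append(best)
        covered_words |= set(best.split())
        covered_units |= letter_bigrams(best)
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    # Toy candidate pool (illustrative sentences, not from the actual database).
    pool = ["ev büyük", "ev büyük", "evlerimde kitap var", "kitap var"]
    print(select_sentences(pool, budget=2))
```

Weighting words and bigrams equally is an arbitrary choice made for brevity; a real selection would likely use a pronunciation lexicon for the phonetic units and weight rare units more heavily.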