Developing an automatic transcription and retrieval system for spoken lectures in Turkish

With the growth of online video lectures, applying speech and language processing technologies to education has become increasingly important. This paper presents an automatic transcription and retrieval system for spoken lectures in Turkish. The system comprises two main steps: automatic transcription of Turkish video lectures with a large-vocabulary continuous speech recognition (LVCSR) system, and keyword search over the lattices produced by the LVCSR system using a speech retrieval component. In developing the system, a state-of-the-art Turkish LVCSR system was first built using advanced acoustic modeling methods; keywords were then extracted automatically from the word sequences in the reference transcriptions of the video lectures; and a speech retrieval system was developed to search for these keywords in the lattice output of the LVCSR system. On the test data, the spoken lecture processing system achieves a 14.2% word error rate and a maximum term-weighted value (MTWV) of 0.86.
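The 0.86 figure refers to the term-weighted value (TWV), the standard keyword-search metric from the NIST Spoken Term Detection evaluations; MTWV is its maximum over detection thresholds. As a minimal sketch of how the metric is computed (the per-term counts and audio duration below are illustrative, not from the paper):

```python
def term_weighted_value(terms, beta=999.9, trial_seconds=36000.0):
    """Compute TWV at one fixed detection threshold.

    TWV = 1 - mean over terms of (P_miss + beta * P_FA),
    following the NIST STD 2006 definition; beta = 999.9 is the
    conventional false-alarm weight.  `terms` is a list of dicts
    with per-term hit/miss/false-alarm counts (hypothetical data).
    """
    penalties = []
    for t in terms:
        n_true = t["hits"] + t["misses"]  # reference occurrences of the term
        p_miss = t["misses"] / n_true if n_true else 0.0
        # Non-target trials are approximated as one trial per second
        # of audio, minus the true occurrences.
        n_nontarget = trial_seconds - n_true
        p_fa = t["false_alarms"] / n_nontarget
        penalties.append(p_miss + beta * p_fa)
    return 1.0 - sum(penalties) / len(penalties)

# Toy example: detection counts for two keywords at one threshold.
terms = [
    {"hits": 9, "misses": 1, "false_alarms": 2},
    {"hits": 4, "misses": 1, "false_alarms": 0},
]
twv = term_weighted_value(terms)
```

MTWV would be obtained by sweeping the detection threshold over the posterior scores read off the LVCSR lattices and taking the maximum TWV.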
