Assisted keyword indexing for lecture videos using unsupervised keyword spotting

We created a completely unsupervised within-speaker keyword spotting system for building an accessible lecture index.
For RIT lectures, the average Precision at 10 is 71.5% for laptop-recorded queries and 79.5% for in-lecture queries.
Whitening is used to reduce variance in MFCC feature vectors (performance increase of 58%).
Table 1 and the accompanying text explicitly define our criteria for 'valid' search hits.
MIT lectures recorded on a lapel microphone achieve an average Precision at 10 of 89.5%.

Many students use videos to supplement learning outside the classroom. This is particularly important for students with visual impairments, for whom seeing the board during lecture is difficult. For these students, we believe that recording the lectures they attend and providing effective video indexing and search tools will make it easier for them to learn course subject matter at their own pace. As a first step in this direction, we seek to help instructors create an index for their lecture videos using audio keyword search, with queries recorded by the instructor on their laptop and/or created from video excerpts. To this end, we have created an unsupervised within-speaker keyword spotting system. We represent audio data using de-noised, whitened and scale-normalized Mel Frequency Cepstral Coefficient (MFCC) features, and locate queries using Segmental Dynamic Time Warping (SDTW) of feature sequences. Our system is evaluated using introductory Linear Algebra lectures from instructors with different accents at two U.S. universities. For lectures recorded with a video camera at RIT, laptop-recorded queries obtain an average Precision at 10 of 71.5%, while within-lecture queries obtain 79.5%. For lectures recorded with a lapel microphone at MIT, a similar keyword set yields a much higher average Precision at 10 of 89.5%. Our results suggest that our system is robust to changes in environment, speaker and recording setup.
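To make the query-location step and the evaluation metric more concrete, the following is a minimal Python sketch, not the authors' implementation. It uses plain subsequence DTW as a simplified stand-in for SDTW (no band constraints or multiple start points), Euclidean frame distance, and toy one-dimensional feature vectors in place of de-noised, whitened MFCCs. The function names `frame_distance`, `subsequence_dtw` and `precision_at_k` are hypothetical, and Precision at 10 is assumed to mean the fraction of the top 10 ranked hits that are valid.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsequence_dtw(query, lecture):
    """Find the best match of `query` anywhere inside `lecture`.

    Both arguments are sequences of feature vectors. The match may start
    and end at any lecture frame. Returns (cost, end_index), where cost is
    the accumulated frame distance along the best warping path divided by
    the query length (an assumed normalization).
    """
    n, m = len(query), len(lecture)
    # First query frame: no accumulation along the lecture axis,
    # which is what lets a match start at any lecture frame.
    prev = [frame_distance(query[0], lecture[j]) for j in range(m)]
    for i in range(1, n):
        cur = [0.0] * m
        cur[0] = prev[0] + frame_distance(query[i], lecture[0])
        for j in range(1, m):
            # Allowed steps: vertical, diagonal, horizontal.
            best = min(prev[j], prev[j - 1], cur[j - 1])
            cur[j] = best + frame_distance(query[i], lecture[j])
        prev = cur
    end = min(range(m), key=prev.__getitem__)
    return prev[end] / n, end


def precision_at_k(ranked_hit_is_valid, k=10):
    """Fraction of the top-k ranked hits that are valid."""
    return sum(ranked_hit_is_valid[:k]) / k


# Toy example: the query pattern 1, 2, 3 occurs at lecture frames 2..4.
lecture = [[0.0], [0.0], [1.0], [2.0], [3.0], [0.0], [0.0]]
query = [[1.0], [2.0], [3.0]]
cost, end = subsequence_dtw(query, lecture)

# Toy relevance judgments for one keyword: 7 of the top 10 hits are valid.
p10 = precision_at_k([True] * 7 + [False] * 3, k=10)
```

In a full system, hits for each keyword would be ranked by increasing match cost, judged against the validity criteria, and the resulting Precision at 10 values averaged over keywords.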
