The NNI Query-by-Example System for MediaEval 2015

In this paper we describe the system proposed by the NNI (NWPU-NTU-I2R) team for the QUESST task within the MediaEval 2014 evaluation. To solve the problem, we used both dynamic time warping (DTW) and symbolic search (SS) based approaches. The DTW system performs template matching using the subsequence DTW algorithm on posterior representations. The symbolic search is performed on phone sequences generated by phone recognizers. For both symbolic and DTW search, partial sequence matching is performed to reduce the miss rate, especially for query types 2 and 3. After fusing 9 DTW systems, 7 symbolic systems, and query length side information, we obtained an actual normalized cross entropy (actCnxe) of 0.6023 for all queries combined. For type 3 complex queries, we achieved an actCnxe of 0.7252.
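
To make the DTW side of the system concrete, the following is a minimal sketch of subsequence DTW over posterior features. It is an illustration under stated assumptions, not the NNI implementation: the function name `subsequence_dtw`, the local distance (negative log of the inner product between posterior vectors), the three-step path constraint, and the normalization by query length are all assumed here and are not taken from the paper.

```python
import numpy as np


def subsequence_dtw(query, utterance, eps=1e-10):
    """Find the best match of `query` inside `utterance`.

    Both inputs are (frames x classes) posterior matrices.
    Returns (normalized_cost, start_frame, end_frame).
    """
    # Assumed local distance: -log of the inner product of posterior vectors.
    dist = -np.log(np.clip(query @ utterance.T, eps, None))  # shape (n, m)
    n, m = dist.shape

    acc = np.empty((n, m))               # accumulated cost
    start = np.zeros((n, m), dtype=int)  # start frame of the best path to (i, j)

    # Subsequence DTW: the match may begin at any utterance frame at no cost.
    acc[0, :] = dist[0, :]
    start[0, :] = np.arange(m)

    for i in range(1, n):
        # First column can only be reached from the cell above.
        acc[i, 0] = acc[i - 1, 0] + dist[i, 0]
        start[i, 0] = 0
        for j in range(1, m):
            preds = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
            costs = [acc[p] for p in preds]
            k = int(np.argmin(costs))
            acc[i, j] = costs[k] + dist[i, j]
            start[i, j] = start[preds[k]]

    # The match may also end at any utterance frame.
    end = int(np.argmin(acc[-1, :]))
    # Simplification: normalize by query length rather than path length.
    return acc[-1, end] / n, int(start[-1, end]), end


if __name__ == "__main__":
    # Toy one-hot "posteriors" over 4 classes (hypothetical data).
    query = np.eye(4)[[1, 2, 3]]
    utterance = np.eye(4)[[0, 0, 1, 2, 3, 0, 0]]
    print(subsequence_dtw(query, utterance))
```

A lower normalized cost indicates a better match. Partial sequence matching, as mentioned above, could be approximated in such a sketch by running the same search on sub-spans of the query, although the abstract does not specify how the authors do this.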
