Effective combination of heterogeneous subword-based spoken term detection systems

Combining heterogeneous systems has been shown to significantly improve performance in the spoken term detection (STD) task. However, there has been little research into why system combination improves STD performance. In this paper, we analyze the heterogeneity of the systems by calculating the correlation between their scores and evaluating the effectiveness of the combined subword-based systems. We investigate both heterogeneous detection schemes and heterogeneous subword units, using the NTCIR-10 task as a test bed. Experimental analysis shows that higher improvement rates are achieved by combining more heterogeneous systems, that is, systems whose scores are less correlated with each other and which therefore carry a larger amount of complementary information. Compared with the best-performing individual system among those being combined, a parallel combination of heterogeneous subword units improves STD performance by 13.59%, and a system with an efficient cascaded combination of heterogeneous subword units and heterogeneous detection schemes improves it by 12.79%. Finally, state-of-the-art performance of 74.07 average maximum F-measure on the NTCIR-10 task is achieved by combining heterogeneous subword units and heterogeneous detection schemes.
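
As an illustration of the score-correlation analysis described above, the following is a minimal sketch, not the method used in the paper. It assumes that two systems have scored the same list of candidate detections, that heterogeneity is measured with the Pearson correlation coefficient, and that a parallel combination is a weighted sum of scores. The function names `pearson` and `fuse_scores`, the `weight` parameter, and the example scores are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def fuse_scores(xs, ys, weight=0.5):
    """Assumed parallel combination: weighted sum of the two systems' scores."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    return [weight * x + (1.0 - weight) * y for x, y in zip(xs, ys)]


if __name__ == "__main__":
    # Hypothetical detection scores for the same five candidates.
    system_a = [0.91, 0.40, 0.75, 0.10, 0.55]
    system_b = [0.60, 0.45, 0.20, 0.30, 0.80]
    print("correlation:", pearson(system_a, system_b))
    print("fused scores:", fuse_scores(system_a, system_b))
```

Under these assumptions, a correlation close to 1 would suggest the two systems carry largely redundant information, while a lower value would suggest more complementary information to exploit in combination.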
