Combination of diverse subword units in spoken term detection

This paper addresses two points. First, we clarify the effect of system combination from two aspects: accuracy and heterogeneity. Second, we evaluate our own subword unit, the Sub-Phonetic Segment (SPS), as a means of maximizing the performance improvement obtained by combination. Combined systems usually outperform any of their individual component systems, and the improvement can be maximized when the component systems are individually accurate yet mutually heterogeneous. Based on this consideration, we estimate heterogeneity from the correlation of the false alarm errors of the combined systems, and confirm that a lower correlation between two systems yields a larger improvement from combination. Comparative tests of several combination approaches are carried out on subword-based spoken term detection. Because subword-based systems use only constrained linguistic knowledge, it is fairly straightforward to verify the heterogeneity of the combined systems. Experimental results show that the most significant improvement is achieved by combining two different subword units, triphone and SPS, which are highly heterogeneous and show low correlation of false alarm detections.
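To make the heterogeneity estimate concrete, here is a minimal sketch of one way the correlation of false alarm errors between two systems could be computed. It assumes both systems are scored on the same list of candidate detections, that each candidate is marked 1 if the system produced a false alarm there and 0 otherwise, and that correlation means the Pearson coefficient on these binary indicators. The abstract does not state the exact measure, so the function name `false_alarm_correlation` and the input layout are illustrative assumptions, not the authors' procedure.

```python
from math import sqrt


def false_alarm_correlation(fa_a, fa_b):
    """Pearson correlation between two binary false-alarm indicator lists.

    fa_a[i] and fa_b[i] are 1 if the respective system raised a false
    alarm on shared candidate i, and 0 otherwise (assumed layout).
    """
    if len(fa_a) != len(fa_b) or not fa_a:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(fa_a)
    mean_a = sum(fa_a) / n
    mean_b = sum(fa_b) / n

    # Unnormalized covariance and variances; the 1/n factors cancel.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(fa_a, fa_b))
    var_a = sum((a - mean_a) ** 2 for a in fa_a)
    var_b = sum((b - mean_b) ** 2 for b in fa_b)

    # A system with a constant indicator has undefined correlation;
    # returning 0.0 here is a convention chosen for this sketch.
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


# Hypothetical indicator vectors for two systems on eight shared candidates.
fa_system_1 = [1, 0, 1, 0, 0, 1, 0, 0]
fa_system_2 = [0, 1, 0, 0, 1, 0, 0, 1]
print(false_alarm_correlation(fa_system_1, fa_system_2))
```

Under this reading, a value near 1 means the two systems tend to raise false alarms on the same candidates, while a value near or below 0 means their errors rarely coincide, which is the situation the abstract associates with larger gains from combination.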
