Improved structural similarity measurement for vocal signals

In recent years, the structural similarity (SSIM) index has been proposed for assessing both images and vocal signals in a way that matches human perception. Existing SSIMs for vocal signals closely follow those designed for images. However, humans perceive voices differently from images: two vocal signals that differ only by phase, delay, or a logistic frequency shift sound similar. In this paper, we propose the non-uniform sampling frequency mean SSIM (NUS-FMSSIM), which matches human perception of voices more closely. Simulations show that it is more robust to phase changes, time shifts, and logistic frequency shifts than existing SSIMs for vocal signals.
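The abstract rests on the observation that signals differing only in phase or delay sound alike, while an image-style SSIM computed on raw samples penalizes them. The Python below is a minimal sketch of that observation only. It does not implement NUS-FMSSIM, whose definition is not given here. It uses the standard global SSIM index of Wang, Bovik, Sheikh and Simoncelli (2004) and compares its value on raw samples with its value on DFT magnitude spectra. The helper names (`ssim`, `magnitude_spectrum`, `spectral_ssim`), the constants `c1`/`c2`, the two-tone test signal, and the use of a circular delay are all illustrative assumptions.

```python
import cmath
import math


def ssim(x, y, c1=1e-4, c2=9e-4):
    """Global SSIM index (Wang et al., 2004) over one window spanning the sequence.

    Uses population statistics. c1 and c2 are assumed stabilizing constants
    (K1=0.01, K2=0.03 with a dynamic range of 1), chosen for illustration.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2)
    )


def magnitude_spectrum(x):
    """|DFT| of x via a direct O(N^2) DFT (adequate for a short demo)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def spectral_ssim(x, y, **kw):
    """Hypothetical helper: SSIM between magnitude spectra, so phase is discarded.

    Both spectra are scaled by the peak of x's spectrum. This is an
    illustration only, NOT the paper's NUS-FMSSIM.
    """
    X, Y = magnitude_spectrum(x), magnitude_spectrum(y)
    peak = max(X) or 1.0
    return ssim([v / peak for v in X], [v / peak for v in Y], **kw)


# Assumed test signal: two tones with whole numbers of cycles in N samples.
N = 64
x = [math.sin(2 * math.pi * 4 * t / N) + 0.5 * math.sin(2 * math.pi * 9 * t / N)
     for t in range(N)]
# Same tones with shifted phases.
x_phase = [math.cos(2 * math.pi * 4 * t / N) + 0.5 * math.sin(2 * math.pi * 9 * t / N + 1.0)
           for t in range(N)]
# Circular delay by 5 samples: x_delay[n] = x[(n - 5) mod N].
x_delay = x[-5:] + x[:-5]

results = {
    "phase": (ssim(x, x_phase), spectral_ssim(x, x_phase)),
    "delay": (ssim(x, x_delay), spectral_ssim(x, x_delay)),
}
for name, (td, sp) in results.items():
    print(f"{name}: time-domain SSIM = {td:.3f}, spectral SSIM = {sp:.3f}")
```

The DFT magnitude ignores phase and is unchanged by a circular shift. The spectral score therefore stays near 1 for both distorted versions, while the sample-domain score falls well below it. Real recordings are analyzed in windowed frames, where a delay only approximately preserves the magnitudes. A frequency shift still moves energy between bins, so this sketch does not cover the logistic frequency shift mentioned in the abstract.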
