Digital Architecture for Instantaneous V/UV/S Classification of Noise Free Speech Segments

Speech segments are broadly categorized as voiced (V), unvoiced (UV), and silence (S). The V/UV/S classification of speech segments plays an important role in many speech-based applications. In this paper, we propose a digital architecture for instantaneous V/UV/S classification of noise-free speech segments. The proposed architecture uses the incoming samples of a speech segment to compute two widely used time-domain speech parameters, namely the short-time energy (STE) and the short-time average zero-crossing rate (STAZCR). The decision logic compares these computed parameters against pre-determined STE and STAZCR thresholds to classify the segment. The hardware needed to compute the two parameters on the fly is realized as an algorithmic state machine with datapath (ASMD). The decision logic is realized as a standalone unit integrated with the ASMD. Further, the proposed architecture can be reconfigured to work with speech segments whose lengths are powers of 2, up to 1024 samples. The proposed architecture is prototyped on a field-programmable gate array (FPGA) using the Xilinx ZedBoard Zynq Evaluation and Development Kit (XC7Z020CLG484-1). The implementation results show that the proposed architecture uses minimal resources on the FPGA fabric and achieves a maximum operating clock frequency of up to 185 MHz.
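As a rough software illustration of the classification flow described above, the following Python sketch computes STE and STAZCR for one segment and applies a two-threshold decision. It is not the proposed hardware: the function names, the threshold values (`ste_threshold`, `zcr_threshold`), the STAZCR normalization by segment length, and the exact decision rule are all assumptions made for illustration, since the abstract does not specify them.

```python
def short_time_energy(frame):
    """Sum of squared samples over the segment (one common STE definition)."""
    return sum(x * x for x in frame)


def average_zero_crossing_rate(frame):
    """Sign changes between consecutive samples, divided by the segment length.

    A sample equal to zero is treated as non-negative (an assumption).
    """
    crossings = sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev >= 0) != (cur >= 0)
    )
    return crossings / len(frame)


def classify_frame(frame, ste_threshold=1.0, zcr_threshold=0.25):
    """Return "V", "UV", or "S" for one segment.

    The thresholds are hypothetical placeholders, not values from the paper.
    Assumed rule: high energy -> voiced; otherwise a high zero-crossing
    rate -> unvoiced; otherwise silence.
    """
    n = len(frame)
    # Segment lengths are restricted to powers of 2 up to 1024, as in the text;
    # the lower bound of 2 is an assumption of this sketch.
    if n < 2 or n > 1024 or (n & (n - 1)) != 0:
        raise ValueError("segment length must be a power of 2 between 2 and 1024")

    ste = short_time_energy(frame)
    zcr = average_zero_crossing_rate(frame)

    if ste >= ste_threshold:
        return "V"
    if zcr >= zcr_threshold:
        return "UV"
    return "S"
```

With these placeholder thresholds, an all-zero segment falls through to "S", a low-amplitude segment that changes sign at every sample is labeled "UV", and a high-amplitude, slowly varying segment is labeled "V". In practice the thresholds would have to be determined from representative speech data.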
