Machine learning strategies for recovering articulatory trajectories and gestures from speech.

Articulatory information can improve the performance of automatic speech recognition systems. Unfortunately, since such information is not directly observable, it must be estimated from the acoustic signal using speech-inversion techniques. Here, we first compare five different machine learning techniques for inverting speech acoustics generated using the Haskins Laboratories speech production model in combination with HLsyn. In particular, we compare the accuracies of estimating two forms of articulatory information: (a) vocal tract constriction trajectories and (b) articulatory flesh-point pellet trajectories. We show that tract variable estimation can be performed more accurately than pellet estimation. Second, we show that estimated tract variables can improve the performance of an autoregressive neural network model for recognizing speech gestures. We compare gesture recognition accuracy for three different input conditions: (1) generated acoustic signal and estimated tract variables, (2) aco...
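At its core, speech inversion of the kind described above can be framed as supervised regression from acoustic feature frames to articulatory trajectories. The following is a minimal illustrative sketch of that framing, not the paper's method: it uses synthetic stand-in data and an ordinary least-squares linear map (the actual work compares five machine learning techniques on data generated by the Haskins model and HLsyn; the feature dimensions and noise level here are assumptions for illustration only).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption: real experiments would pair
# acoustic features, e.g. MFCC frames, with tract-variable trajectories
# produced by the Haskins speech production model).
n_frames, n_acoustic, n_tract_vars = 500, 13, 8

# A hidden linear "forward" map from tract variables to acoustics,
# plus observation noise, stands in for the synthesis process.
W_forward = rng.normal(size=(n_tract_vars, n_acoustic))
tract = rng.normal(size=(n_frames, n_tract_vars))
acoustic = tract @ W_forward + 0.05 * rng.normal(size=(n_frames, n_acoustic))

# Speech inversion as regression: fit a linear map from acoustic frames
# back to tract variables by least squares, then predict.
W_inverse, *_ = np.linalg.lstsq(acoustic, tract, rcond=None)
tract_hat = acoustic @ W_inverse

# Evaluate with root-mean-square error per tract variable.
rmse = np.sqrt(np.mean((tract_hat - tract) ** 2, axis=0))
print(rmse.mean())
```

A real system would replace the linear map with one of the nonlinear learners compared in the paper (e.g. a neural network) and evaluate on held-out utterances rather than the training frames.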