Observing Pianist Accuracy and Form with Computer Vision

We present a first step towards an interactive piano tutoring system that observes a student playing the piano and gives feedback on hand movements and musical accuracy. We have two primary aims: (1) to determine which notes on the piano are being played at any moment in time, and (2) to identify which finger is pressing each note. We introduce a novel two-stream convolutional neural network that takes video and audio inputs together to detect pressed notes and finger presses. We formulate the two problems as multi-task learning and extend a state-of-the-art object detection model to incorporate both audio and visual features. In addition, we introduce a novel finger identification solution based on pressed piano note information. We experimentally confirm that our approach detects pressed piano keys and the pianist's fingers with high accuracy.
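To make the idea of identifying fingers from pressed-note information concrete, the following is a minimal sketch and not the method of the paper, whose details are not given in this abstract. It assumes that an upstream detector has already produced the horizontal image position of each pressed key and of each visible fingertip, and it matches them greedily by distance. The names `Fingertip`, `assign_fingers`, and `max_dist`, as well as the pixel values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Fingertip:
    hand: str    # "left" or "right"
    finger: str  # e.g. "thumb", "index", "middle", "ring", "little"
    x: float     # horizontal image position in pixels (assumed detector output)


def assign_fingers(
    pressed_keys: Dict[int, float],
    fingertips: List[Fingertip],
    max_dist: float = 15.0,
) -> Dict[int, Optional[Tuple[str, str]]]:
    """Greedily match each pressed key to the nearest unused fingertip.

    pressed_keys maps a MIDI note number to the key's horizontal centre
    in pixels. A key with no fingertip within max_dist maps to None.
    """
    # Collect every key/fingertip pair that is close enough to be plausible.
    candidates = []
    for note, key_x in pressed_keys.items():
        for idx, tip in enumerate(fingertips):
            dist = abs(tip.x - key_x)
            if dist <= max_dist:
                candidates.append((dist, note, idx))

    # Closest pairs first; each key and each fingertip is used at most once.
    candidates.sort()
    result: Dict[int, Optional[Tuple[str, str]]] = {n: None for n in pressed_keys}
    used_tips = set()
    for dist, note, idx in candidates:
        if result[note] is None and idx not in used_tips:
            tip = fingertips[idx]
            result[note] = (tip.hand, tip.finger)
            used_tips.add(idx)
    return result


# Hypothetical example: two keys under the right hand, one key with no finger nearby.
tips = [
    Fingertip("right", "thumb", 102.0),
    Fingertip("right", "middle", 138.0),
    Fingertip("right", "little", 180.0),
]
keys = {60: 100.0, 64: 140.0, 72: 400.0}
assignment = assign_fingers(keys, tips)
```

Matching in one dimension keeps the sketch short. A real system would likely also need the vertical position of each fingertip and some handling of occlusion, which this example ignores.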
