Analyzing Liquid Pouring Sequences via Audio-Visual Neural Networks

Existing approaches to estimating the weight of a liquid poured into a target container often require predefined source weights or visual data. We present novel audio-based and audio-augmented techniques, in the form of multimodal convolutional neural networks (CNNs), that estimate poured weight, detect overflow, and classify both the liquid and the target container. Our audio-based network uses the sound of a pouring sequence, i.e., a liquid being poured into a target container; the raw audio is converted into mel-scaled spectrograms before being fed to the network. Our audio-augmented network fuses this audio with the corresponding visual data from video frames. Only a microphone and a camera are required, both of which are found in any modern smartphone or Microsoft Kinect. Our approach improves classification accuracy across different environments, containers, and contents of the robot pouring task. Our Pouring Sequence Neural Networks (PSNN) are trained and tested using the Rethink Robotics Baxter Research Robot. To the best of our knowledge, this is the first use of audio-visual neural networks to analyze liquid pouring sequences by classifying their weight, liquid, and receiving container.
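The audio preprocessing described above can be illustrated with a minimal numpy-only sketch of a log mel-scaled spectrogram. The parameter choices here (16 kHz sample rate, 512-point FFT, 256-sample hop, 40 mel bands) are illustrative assumptions, not the paper's actual settings:

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel-scale mapping from frequency in Hz.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(audio, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Log mel-scaled spectrogram of a mono signal; shape (n_mels, n_frames)."""
    # Frame the signal and apply a Hann window.
    n_frames = 1 + (len(audio) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Triangular mel filterbank: filter edges equally spaced on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    # Mel-band energies, log-compressed for dynamic-range reduction.
    return np.log(power @ fb.T + 1e-10).T

# Example: one second of a 440 Hz tone in place of a recorded pouring clip.
sr = 16000
t = np.arange(sr) / sr
spec = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t), sr=sr)
```

The resulting 2-D array (mel bands by time frames) can be treated as a single-channel image, which is what makes CNN architectures a natural fit for this input.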
