Listen to the Image

Visual-to-auditory sensory substitution devices can help blind people sense the visual environment by translating visual information into sound patterns. To improve translation quality, the task performance of blind users is usually employed to evaluate different encoding schemes. In contrast to such laborious human-based assessment, we argue that a machine model can also be developed for evaluation, and more efficiently. To this end, we first propose two distinct cross-modal perception models for the late-blind and congenitally blind cases, which aim to generate concrete visual content from the translated sound. To validate the proposed models, we present two novel optimization strategies for the primary encoding scheme. Further, we conduct a set of human-based experiments and compare their outcomes with the machine-based assessments on the cross-modal generation task. The highly consistent results across different encoding schemes indicate that using a machine model to accelerate evaluation and reduce experimental cost is feasible to some extent, which could substantially speed up the iteration of encoding schemes and thus help blind people improve their visual perception ability.
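
To make the cross-modal generation idea concrete, below is a minimal sketch of an audio-conditioned image generator in PyTorch: an encoder maps the translated sound (represented here as a log-mel spectrogram) to an embedding, and a conditional generator decodes that embedding, together with noise, into an image. All module names, dimensions, and the spectrogram input are illustrative assumptions, not the authors' exact architecture; a full model would add a discriminator and adversarial training in the style of conditional GANs.

```python
# Sketch of sound-to-image generation for a sensory-substitution encoding.
# Architecture details (layer sizes, spectrogram shape) are assumptions.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Encodes a log-mel spectrogram (B, 1, 128, 128) into an embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),    # -> 64x64
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # -> 32x32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # -> 16x16
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, spec):
        return self.net(spec)

class ImageGenerator(nn.Module):
    """Decodes noise + audio embedding into a 64x64 grayscale image."""
    def __init__(self, noise_dim=100, embed_dim=128):
        super().__init__()
        self.fc = nn.Linear(noise_dim + embed_dim, 256 * 4 * 4)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # -> 8x8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # -> 16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),    # -> 32x32
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Tanh(),     # -> 64x64
        )

    def forward(self, noise, audio_embed):
        x = self.fc(torch.cat([noise, audio_embed], dim=1))
        return self.net(x.view(-1, 256, 4, 4))

# Usage: generate an image from a stand-in translated-sound spectrogram.
encoder, generator = AudioEncoder(), ImageGenerator()
spec = torch.randn(1, 1, 128, 128)       # placeholder log-mel spectrogram
noise = torch.randn(1, 100)
image = generator(noise, encoder(spec))  # -> (1, 1, 64, 64)
```

Under this framing, an encoding scheme is judged by how faithfully images generated from its sounds match the original visual input, replacing recognition tests with human participants.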
