Towards human-like and transhuman perception in AI 2.0: a review

Perception is the interface through which an intelligent system interacts with the real world. Without sophisticated and flexible perceptual capabilities, it is impossible to create advanced artificial intelligence (AI) systems. One of the most significant features of next-generation AI, called ‘AI 2.0’, will be intelligent perceptual capabilities that simulate the mechanisms of the human brain and are likely to surpass it in performance. In this paper, we briefly review state-of-the-art advances across different areas of perception, including visual perception, auditory perception, speech perception, and perceptual information processing and learning engines. On this basis, we envision several R&D trends in intelligent perception for the forthcoming era of AI 2.0: (1) human-like and transhuman active vision; (2) auditory perception and computation in an actual auditory setting; (3) speech perception and computation in a natural interaction setting; (4) autonomous learning of perceptual information; (5) large-scale perceptual information processing and learning platforms; and (6) urban omnidirectional intelligent perception and reasoning engines. We believe these research directions should be highlighted in future plans for AI 2.0.
