Statistical Language and Speech Processing

Deep learning research aims at discovering learning algorithms that discover multiple levels of distributed representations, with higher levels representing more abstract concepts. Although the study of deep learning has already led to impressive theoretical results, learning algorithms, and breakthrough experiments, several challenges lie ahead. This paper examines some of these challenges, centering on the questions of scaling deep learning algorithms to much larger models and datasets, reducing the optimization difficulties caused by ill-conditioning or local minima, designing more efficient and powerful inference and sampling procedures, and learning to disentangle the factors of variation underlying the observed data. It also proposes a few forward-looking research directions aimed at overcoming these challenges.

1 Background on Deep Learning

Deep learning is an emerging approach within the machine learning research community. Deep learning algorithms have been proposed in recent years to move machine learning systems towards the discovery of multiple levels of representation. They have had important empirical successes in a number of traditional AI applications such as computer vision and natural language processing. See [10,17] for reviews, and [14] and the other chapters of the book [95] for practical guidelines. Deep learning is attracting much attention from both the academic and industrial communities. Companies such as Google, Microsoft, Apple, IBM and Baidu are investing in deep learning, with the first widely distributed consumer products aimed at speech recognition. Deep learning is also used for object recognition (Google Goggles), image and music information retrieval (Google Image Search, Google Music), as well as computational advertising [36].
A deep learning building block (the restricted Boltzmann machine, or RBM) was used as a crucial part of the winning entry of a million-dollar machine learning competition (the Netflix competition) [115,134]. The New York Times covered the subject twice in 2012, with front-page articles.¹ Another series of articles (including a third New York Times article) covered a more recent event showing off the application of deep learning in a major Kaggle competition.

¹ http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html

A.-H. Dediu et al. (Eds.): SLSP 2013, LNAI 7978, pp. 1–37, 2013.
© Springer-Verlag Berlin Heidelberg 2013
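To make the building block mentioned above concrete, the following is a minimal sketch of a binary RBM trained with one-step contrastive divergence (CD-1). It is an illustration under simplifying assumptions (binary units, a toy dataset, NumPy only), not the implementation used in any of the systems cited; all names here are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal binary restricted Boltzmann machine trained with CD-1."""

    def __init__(self, n_visible, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)  # visible biases
        self.c = np.zeros(n_hidden)   # hidden biases
        self.rng = rng

    def hidden_probs(self, v):
        # P(h=1 | v) for each hidden unit.
        return sigmoid(v @ self.W + self.c)

    def visible_probs(self, h):
        # P(v=1 | h) for each visible unit.
        return sigmoid(h @ self.W.T + self.b)

    def cd1_step(self, v0, lr=0.1):
        # Positive phase: hidden activations given the data.
        h0 = self.hidden_probs(v0)
        # One Gibbs step: sample hidden states, then reconstruct visibles.
        h_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h_sample)
        h1 = self.hidden_probs(v1)
        # Contrastive-divergence gradient estimate (data term minus model term).
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b += lr * (v0 - v1).mean(axis=0)
        self.c += lr * (h0 - h1).mean(axis=0)
        # Reconstruction error, a common (if rough) progress monitor.
        return float(np.mean((v0 - v1) ** 2))

# Toy usage: learn a tiny repeated binary pattern.
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 8, dtype=float)
rbm = RBM(n_visible=4, n_hidden=2)
errors = [rbm.cd1_step(data) for _ in range(200)]
```

Stacking such layers, each trained on the representation learned by the one below, is what yields the "multiple levels of representation" the text describes.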

[1] Ben Taskar, et al. Alignment by Agreement, 2006, NAACL.

[2] Joan-Andreu Sánchez, et al. Part-of-Speech Tagging Based on Machine Translation Techniques, 2007, IbPRIA.

[3] Fabrice Lefèvre, et al. Investigating multiple approaches for SLU portability to a new language, 2010, INTERSPEECH.

[4] Gökhan Tür, et al. Beyond ASR 1-best: Using word confusion networks in spoken language understanding, 2006, Comput. Speech Lang.

[5] Fabrice Lefèvre, et al. Combination of stochastic understanding and machine translation systems for language portability of dialogue systems, 2011, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[6] Hermann Ney, et al. Applications of Statistical Machine Translation Approaches to Spoken Language Understanding, 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[7] Julien Mairal, et al. Structured sparsity through convex optimization, 2011, arXiv.

[8] Hermann Ney, et al. Discriminative Training and Maximum Entropy Models for Statistical Machine Translation, 2002, ACL.

[9] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[10] Hermann Ney, et al. Comparing Stochastic Approaches to Spoken Language Understanding in Multiple Languages, 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[11] Philipp Koehn, et al. Moses: Open Source Toolkit for Statistical Machine Translation, 2007, ACL.

[12] Sophie Rosset, et al. Semantic annotation of the French media dialog corpus, 2005, INTERSPEECH.

[13] Daniel Jurafsky, et al. Knowledge-Free Induction of Inflectional Morphologies, 2001, NAACL.

[14] Johan Schalkwyk, et al. OpenFst: A General and Efficient Weighted Finite-State Transducer Library, 2007, CIAA.

[15] Anil Kumar Singh, et al. Modeling Letter-to-Phoneme Conversion as a Phrase Based Statistical Machine Translation Problem with Minimum Error Rate Training, 2009, HLT-NAACL.

[16] Yoshua Bengio, et al. Learning long-term dependencies with gradient descent is difficult, 1994, IEEE Trans. Neural Networks.

[17] José B. Mariño, et al. Ncode: an Open Source Bilingual N-gram SMT Toolkit, 2011, Prague Bull. Math. Linguistics.

[18] José B. Mariño, et al. N-gram-based Machine Translation, 2006, CL.

[19] Alexandre Allauzen, et al. From n-gram-based to CRF-based Translation Models, 2011, WMT@EMNLP.

[20] Gökhan Tür, et al. Improving spoken language understanding using word confusion networks, 2002, INTERSPEECH.

[21] Mitchell P. Marcus, et al. Text Chunking using Transformation-Based Learning, 1995, VLC@ACL.

[22] Gökhan Tür, et al. Joint Decoding for Speech Recognition and Semantic Tagging, 2012, INTERSPEECH.

[23] Geoffrey E. Hinton, et al. Self-organizing neural network that discovers surfaces in random-dot stereograms, 1992, Nature.

[24] Daniel Marcu, et al. Statistical Phrase-Based Translation, 2003, NAACL.

[25] Helmut Schmid, et al. Probabilistic part-of-speech tagging using decision trees, 1994.

[26] Robert L. Mercer, et al. The Mathematics of Statistical Machine Translation: Parameter Estimation, 1993, CL.

[27] Frédéric Béchet, et al. Conceptual decoding from word lattices: application to the spoken dialogue corpus MEDIA, 2006, INTERSPEECH.

[28] David M. Bradley, et al. Differentiable Sparse Coding, 2008, NIPS.

[29] Franz Josef Och, et al. Minimum Error Rate Training in Statistical Machine Translation, 2003, ACL.

[30] Martin A. Riedmiller, et al. A direct adaptive method for faster backpropagation learning: the RPROP algorithm, 1993, IEEE International Conference on Neural Networks.

[31] Yoshua Bengio, et al. What regularized auto-encoders learn from the data-generating distribution, 2012, J. Mach. Learn. Res.

[32] José B. Mariño, et al. Improving statistical MT by coupling reordering and decoding, 2006, Machine Translation.

[33] François Yvon, et al. Practical Very Large Scale CRFs, 2010, ACL.

[34] Andreas Stolcke, et al. SRILM - an extensible language modeling toolkit, 2002, INTERSPEECH.

[35] Regina Barzilay, et al. Unsupervised Multilingual Learning for Morphological Segmentation, 2008, ACL.

[36] Hermann Ney, et al. Natural language understanding using statistical machine translation, 2001, INTERSPEECH.