Machine Learning for Audio, Image and Video Analysis

The past decade has witnessed an explosion of digital information and phenomenal growth in the popularity of audio, image, and video multimedia. Thanks to advances in processing power and the proliferation of the Internet, people can easily capture, store, transmit, and share audio, image, and video content. However, efficient and effective indexing and retrieval of the data that accumulate over time remain a formidable challenge. Researchers in industry and academia have devoted tremendous effort to developing sophisticated systems for processing and understanding this new information, and at the core of these systems always lie machine-learning techniques. Machine learning is a very broad research field, covering areas that range from uncertainty analysis to kernel methods. Although it interacts closely with other fields such as statistics, signal processing, and pattern recognition, it is nearly impossible for researchers in those areas to keep up with every detail of state-of-the-art machine-learning techniques. Most books on machine learning cover classic techniques such as neural networks or support vector machines, but few emphasize the recent advances and their applications to audio, image, and video analysis. The book Machine Learning for Audio, Image and Video Analysis intends to fill this gap by bringing its readers the latest developments in this fast-growing field.

The book consists of an introduction, three main parts comprising 13 chapters in total, and four appendices. The introduction explains how readers with various backgrounds can benefit from studying the book and provides beginners with the ABCs of acquiring and processing audio and visual information. The appendices supply supplemental material for readers who lack the relevant statistical and signal-processing background. Experienced researchers will find the recent advances in Part II, where classic machine-learning techniques are discussed alongside their state-of-the-art developments. For practitioners, the authors analyze three typical applications to give a sense of how machine-learning techniques can be applied to understand audio and video data.

Part I briefly describes how the human biological system perceives audio and video signals and how these signals are captured and digitized into a format amenable to computer processing. Chapters 2 and 3 further introduce audio and video representations and coding standards without going into too much detail. Besides color, texture, and shape, it is my opinion that local descriptors such as the scale-invariant feature transform (SIFT) should be introduced at this point, because such “bag of words” representations have demonstrated their strength in many applications.
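To illustrate what such a bag-of-visual-words representation looks like in practice, here is a minimal sketch (my own illustration, not material from the book) that clusters local descriptors into a visual vocabulary and turns each image into a histogram over visual words. It assumes scikit-learn and NumPy are available, and it uses random arrays as hypothetical stand-ins for real SIFT descriptors, which would normally come from a local-feature extractor.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-in for real SIFT output: one (n_keypoints, 128)
# descriptor array per image.
descriptors_per_image = [rng.normal(size=(rng.integers(50, 200), 128))
                         for _ in range(10)]

# 1. Build a visual vocabulary by clustering all descriptors.
vocab_size = 32
kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
kmeans.fit(np.vstack(descriptors_per_image))

# 2. Represent each image as a normalized histogram of visual-word counts.
def bag_of_words(descriptors):
    words = kmeans.predict(descriptors)
    hist = np.bincount(words, minlength=vocab_size).astype(float)
    return hist / hist.sum()

image_vectors = np.array([bag_of_words(d) for d in descriptors_per_image])
print(image_vectors.shape)  # (10, 32): one fixed-length vector per image
```

The resulting fixed-length vectors can then be fed to any of the classifiers discussed later in the book, regardless of how many keypoints each image originally contained.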
Widely used machine-learning techniques are the focus of Chapters 4 to 11 in Part II. Chapters 4, 5, and 7 introduce the general objectives and approaches of machine learning and the means of evaluating its performance from a statistical point of view. Chapter 5 introduces Bayesian decision theory, which leads to a further investigation of Markovian models in Chapter 10. Kernel methods are discussed in detail in Chapter 9. It is worth mentioning that this book, unlike most others in the field, not only introduces a few techniques widely used in audio and image analysis but also discusses the latest advances. For example, most books only touch the surface of support vector machines (SVMs) by introducing the original two-class SVM, yet Chapter 9 goes a step further to discuss sequential minimal optimization, an efficient training algorithm that makes SVMs practical on large problems, as well as multiclass extensions of the SVM, which are more appealing in practice. Chapters 6 and 11 are concerned with clustering and dimension reduction via unsupervised learning. Specifically, Chapter 11 introduces several manifold-learning techniques, such as locally linear embedding (LLE) and ISOMAP, which are particularly useful for handling nonlinear data in audio and video processing. Chapter 8 combines the discussion of classic neural networks and ensemble methods. My personal view is that ensemble methods, such as AdaBoost and random forests, deserve an independent chapter because of their outstanding reported performance on many applications and benchmark data sets. Similarly, “topic model” methods, such as probabilistic latent semantic indexing (pLSI) and latent Dirichlet allocation (LDA), should be added to the book to reflect the current research trend in analyzing text and visual data.

Part III showcases three applications of machine-learning techniques, namely speech and handwriting recognition, automatic face recognition, and video segmentation and keyframe extraction. In Chapter 13, the authors discuss an automatic face-recognition system. However, face-image localization, one of the core problems in this application, does not seem to receive enough attention. It would be desirable to add a section on the boosted cascade detector proposed by Viola and Jones, which is one of the most successful machine-learning applications in image analysis. In addition, it would be helpful to point out that the eigenface approach and its variants can suppress much of the luminance variation in face images by removing the eigenvectors corresponding to the three largest eigenvalues.

Several things make this book unique. Some chapters include problem sections that challenge readers to check their understanding of the methods discussed or to apply them to sample problems. Unlike other books, it also points to several public software packages and benchmark data sets, encouraging readers to gain hands-on experience with how machine-learning techniques analyze audio and visual content. Its comprehensive coverage of recent developments in this research area makes it easy for experienced researchers to explore the latest techniques further.
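To make the eigenface remark above concrete, the following is a minimal sketch (again my own illustration, not material from the book) of the standard trick: compute the principal components of a set of face images and discard the leading components, which in practice tend to capture global illumination rather than identity. It assumes scikit-learn and NumPy; the `faces` matrix is a hypothetical placeholder for real images flattened into row vectors.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for a face data set: 200 images of 32x32 pixels,
# each flattened into a 1024-dimensional row vector.
faces = rng.normal(size=(200, 32 * 32))

# Fit PCA (the eigenface decomposition) on the training faces.
pca = PCA(n_components=50)
coeffs = pca.fit_transform(faces)

# Drop the three leading eigenfaces, which on real face data mostly encode
# luminance variation, and keep the remaining coefficients as the
# illumination-suppressed face representation.
n_drop = 3
robust_coeffs = coeffs[:, n_drop:]
print(robust_coeffs.shape)  # (200, 47)
```

A nearest-neighbor or SVM classifier trained on `robust_coeffs` is the kind of baseline face-recognition pipeline the chapter could then build on.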
