Unsupervised pattern discovery for multimedia sequences

This thesis investigates the problem of discovering patterns in multimedia sequences. The problem is of interest because capturing and storing large amounts of multimedia data has become commonplace, yet our capability to process, interpret, and use these rich corpora has notably lagged behind. Patterns are the recurrent and statistically consistent units in a data collection; their recurrence and consistency provide a useful basis for organizing large corpora. Unsupervised pattern discovery is important because it is desirable to adapt to diverse media collections without extensive annotation. Moreover, the patterns should be meaningful, since meaning is what humans perceive in multimedia. The goal of this thesis is to devise a general framework for finding multi-modal temporal patterns in a collection of multimedia sequences, using the self-similarity in both the appearance and the temporal progression of the content. To this end, we address three sub-problems: learning temporal pattern models, associating meanings with patterns, and finding patterns across multiple modalities. We propose novel models for the discovery of multimedia temporal patterns. We construct dynamic graphical models that capture the multi-level dependency between the audio-visual observations and the events. We propose a stochastic search scheme for finding the optimal model size and topology, as well as unsupervised feature grouping for selecting relevant descriptors for temporal streams. We present novel approaches for automatically explaining and evaluating the patterns in multimedia streams. These approaches link the computational representations of the patterns with words in the video stream. The link between the representations of audio-visual patterns, such as those acquired by a dynamic graphical model, and the metadata is achieved by statistical association. We develop solutions for finding patterns that reside across multiple modalities.
This is realized with a layered dynamic mixture model, and we address the modeling problems of intra-modality temporal dependency and inter-modality asynchrony in different parts of the model structure. With unsupervised pattern discovery, we are able to discover the common semantic states, play and break, in baseball and soccer programs, with accuracies comparable to those of their supervised counterparts. On a large broadcast news corpus, we find that multimedia patterns correspond well with news topics that have salient audio-visual cues. These findings demonstrate the potential of our framework for mining multi-level temporal patterns from multimodal streams; the framework has a broad outlook for adapting to new content domains and extending to other applications such as event detection and information retrieval.
