Audio-visual football video analysis, from structure detection to attention analysis

Sports video is an important video genre, and content-based sports video analysis attracts great interest from both industry and academia. A sports video is characterised by repetitive temporal structures, relatively plain contents, and strong spatio-temporal variations, such as quick camera switches and swift local motions. Specific techniques are needed for content-based sports video analysis to exploit these characteristics. For an efficient and effective sports video analysis system, there are three fundamental questions: (1) what are the key stories in sports videos; (2) what arouses viewers’ interest; and (3) how to identify game highlights. This thesis is developed around these questions. We approach them from two different perspectives and present three research contributions, namely, replay detection, attack temporal structure decomposition, and attention-based highlight identification.

Replay segments convey the most important contents in sports videos, so detecting replay segments is an efficient way to collect game highlights. However, replay is an artefact of editing, which improves with advances in video editing tools. The composition of a replay is complex: it includes logo transitions, slow motion, viewpoint switches and normal-speed video clips. Since logo transition clips are pervasive in the game collections of FIFA World Cup 2002, FIFA World Cup 2006 and UEFA Championship 2006, we take logo transition detection as an effective substitute for replay detection. A two-pass system was developed, comprising a five-layer AdaBoost classifier and logo template matching across the entire video. The five-layer AdaBoost classifier uses shot duration, average game pitch ratio, average motion, sequential colour histogram and shot frequency between two neighbouring logo transitions to filter out logo transition candidates.
Subsequently, a logo template is constructed and employed to find all transition logo sequences. The precision and recall of this system in replay detection are both 100% on a five-game evaluation collection.

An attack structure is a team competition for a score. Hence, this structure is a conceptually fundamental unit of a football video as well as of other sports videos. We review the literature on content-based temporal structures, such as the play-break structure, and develop a three-step system for automatic attack structure decomposition. Four content-based shot classes, namely, play, focus, replay and break, were identified using low-level visual features. A four-state hidden Markov model was trained to simulate the transition processes among these shot classes. Since attack structures are the longest repetitive temporal unit in a sports video, a suffix tree is proposed to find the longest repetitive substring in the label sequence of shot class transitions. The occurrences of this substring are regarded as the kernel of an attack hidden Markov process. The decomposition of attack structures therefore becomes a boundary likelihood comparison between two Markov chains.

Highlights are what attract notice, and attention is a psychological measurement of “notice”. A brief survey is presented of the psychological background of attention, attention estimation from visual and auditory signals, and multi-modality attention fusion. We propose two attention models for sports video analysis, namely, the role-based attention model and the multiresolution autoregressive framework. The role-based attention model is based on the perception structure of a viewer watching video. This model removes reflection bias among modality salient signals and combines these signals by reflectors. The multiresolution autoregressive framework (MAR) treats salient signals as a group of smooth random processes, which follow a similar trend but are filled with noise.
This framework tries to estimate a noise-free signal from these coarse noisy observations by a multiresolution analysis. Related algorithms are developed, such as event segmentation on a MAR tree and real-time event detection. The experiments show that these attention-based approaches can find goal events with high precision. Moreover, the results of MAR-based highlight detection on the final games of FIFA 2002 and 2006 are highly similar to the highlights professionally labelled by the BBC and FIFA.
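
The repeated-substring step in the attack structure decomposition can be illustrated with a small sketch. It is not the thesis implementation: it assumes each shot is encoded as one letter (P for play, F for focus, R for replay, B for break, a hypothetical encoding), and it scans sorted suffixes instead of building a suffix tree, which yields the same substring on short sequences at a higher computational cost. The function names and the sample sequence are illustrative only.

```python
def longest_repeated_substring(labels: str) -> str:
    """Return the longest substring occurring at least twice (overlaps allowed)."""
    # Any repeated substring is a common prefix of two suffixes that are
    # neighbours once all suffixes are sorted.
    suffixes = sorted(labels[i:] for i in range(len(labels)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of two adjacent suffixes.
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        if n > len(best):
            best = a[:n]
    return best


def occurrences(labels: str, pattern: str) -> list:
    """Start indices of every (possibly overlapping) occurrence of pattern."""
    if not pattern:
        return []
    return [i for i in range(len(labels) - len(pattern) + 1)
            if labels.startswith(pattern, i)]


# Hypothetical shot-class sequence: P=play, F=focus, R=replay, B=break.
shots = "PFRBPPFRBPB"
kernel = longest_repeated_substring(shots)
print(kernel, occurrences(shots, kernel))
```

In this toy sequence the occurrences of the repeated substring play the role of the kernel described above; a suffix tree would be the natural choice for full-length games, where sorting all suffixes becomes too costly.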
