Understanding the Semantics of Media

It is difficult to understand a multimedia signal without being able to say something about its semantic content, that is, its meaning. This chapter describes two algorithms that help bridge the gap in our semantic understanding of multimedia. In both cases we represent the semantic content of a multimedia signal as a point in a high-dimensional space. In the first case, we represent the sentences of a video as a time-varying semantic signal. We look for discontinuities of different sizes in this signal, using a one-dimensional scale space, and treat each one as an indication of a topic change. By sorting these changes, we can create a hierarchical segmentation of the video based on its semantic content. The same formalism can be applied to color information, and we consider the temporal correlation properties of the different media. In the second half of this chapter we describe an approach that connects sounds to semantics. We call this semantic-audio retrieval; the goal is to find a (non-speech) audio signal that fits a query, or to describe a (non-speech) audio signal using the appropriate words. We make this connection by building and clustering high-dimensional vector descriptions of the audio signal and its corresponding semantic description. We then build models that link the two spaces, so that a query in one space can be mapped into a model that describes the probability of correspondence for points in the other space.
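The scale-space segmentation idea from the first half of the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation: the function names, the Gaussian kernel truncated at three standard deviations, the Euclidean change measure, and the toy two-topic signal are all assumptions made for illustration. Each sentence is assumed to be already mapped to a vector in the semantic space; the signal is smoothed at several scales, and local maxima of the frame-to-frame change are reported as candidate topic boundaries.

```python
import math

def gaussian_smooth(vectors, sigma):
    """Smooth a sequence of vectors along time with a truncated Gaussian kernel."""
    radius = max(1, int(3 * sigma))  # assumption: truncate at 3 sigma
    n, dims = len(vectors), len(vectors[0])
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        weights = [math.exp(-0.5 * ((j - i) / sigma) ** 2) for j in range(lo, hi)]
        total = sum(weights)  # renormalize so edges are not biased toward zero
        smoothed.append([
            sum(w * vectors[j][d] for w, j in zip(weights, range(lo, hi))) / total
            for d in range(dims)
        ])
    return smoothed

def change_strength(vectors):
    """Euclidean distance between consecutive vectors (a discrete derivative)."""
    return [
        math.sqrt(sum((b_k - a_k) ** 2 for a_k, b_k in zip(a, b)))
        for a, b in zip(vectors, vectors[1:])
    ]

def find_boundaries(strength, min_strength=1e-6):
    """Indices i where the change between items i and i+1 is a local maximum."""
    peaks = []
    for i, s in enumerate(strength):
        left = strength[i - 1] if i > 0 else float("-inf")
        right = strength[i + 1] if i + 1 < len(strength) else float("-inf")
        if s > min_strength and s > left and s >= right:
            peaks.append(i)
    return peaks

def scale_space_boundaries(vectors, sigmas):
    """Map each scale (sigma) to the boundaries detected at that scale."""
    return {
        sigma: find_boundaries(change_strength(gaussian_smooth(vectors, sigma)))
        for sigma in sigmas
    }

# Toy signal (hypothetical): ten "sentences" on one topic, then ten on another.
signal = [[1.0, 0.0]] * 10 + [[0.0, 1.0]] * 10
boundaries = scale_space_boundaries(signal, sigmas=[1.0, 2.0])
print(boundaries)
```

Boundaries that persist as sigma grows correspond to larger topic changes, so ordering boundaries by the coarsest scale at which they survive gives one way to build a hierarchy. A fuller treatment would also track how each boundary drifts in position from scale to scale, which this sketch omits.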
