We present a three-step post-processing method for increasing the precision of video shot labels in the domain of television news. First, we demonstrate that news shot sequences can be characterized by rhythms of alternation (due to dialogue), repetition (due to persistent background settings), or both. A temporal model of these rhythms is therefore necessarily third-order Markov. Second, we demonstrate that the outputs of feature detectors derived from machine learning methods (in particular, from support vector machines, SVMs) can be converted into probabilities more effectively than by two previously proposed methods. This is particularly true when detectors are error-prone due to sparse training sets, as is common in this domain. Third, we demonstrate that a straightforward application of the Viterbi algorithm to a third-order finite-state machine (FSM), constructed from observed transition probabilities and converted feature detector outputs, can refine feature label precision at little cost. We show that on a test corpus of TRECVID 2005 news videos annotated with 39 LSCOM-lite features, the mean increase in average precision (AP) was 4%, with some of the rarer and more difficult features showing relative increases in AP of as much as 67%.
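The third step can be illustrated with a minimal sketch of Viterbi decoding over a third-order binary label model. This is not the authors' implementation. The function name `viterbi_third_order`, the toy transition table, and the detector probabilities are all hypothetical. Using the converted detector probability directly as a per-shot emission score, and scoring the first three shots from the detector alone, are simplifying assumptions.

```python
import math
from itertools import product


def viterbi_third_order(probs, trans, eps=1e-12):
    """Return the most likely 0/1 label sequence for one feature.

    probs[t]         -- detector probability that shot t shows the feature
                        (assumed already converted from a raw SVM score).
    trans[(a, b, c)] -- P(next label = 1 | previous three labels a, b, c).
    """
    n = len(probs)
    if n < 3:
        # Too short for a third-order history: fall back to thresholding.
        return [int(p >= 0.5) for p in probs]

    def emit(t, label):
        p = probs[t] if label == 1 else 1.0 - probs[t]
        return math.log(max(p, eps))

    def step(state, label):
        p = trans[state] if label == 1 else 1.0 - trans[state]
        return math.log(max(p, eps))

    # Each state is the triple of the three most recent labels.
    states = list(product((0, 1), repeat=3))
    # Assumption: the first three shots are scored by the detector only.
    score = {s: sum(emit(t, s[t]) for t in range(3)) for s in states}
    back = []
    for t in range(3, n):
        new_score, pointers = {}, {}
        for b, c, d in states:
            best_prev, best_val = None, -math.inf
            for a in (0, 1):  # the only predecessors of (b, c, d) are (a, b, c)
                prev = (a, b, c)
                val = score[prev] + step(prev, d)
                if val > best_val:
                    best_prev, best_val = prev, val
            new_score[(b, c, d)] = best_val + emit(t, d)
            pointers[(b, c, d)] = best_prev
        score = new_score
        back.append(pointers)

    # Backtrack from the best final state.
    state = max(score, key=score.get)
    labels = list(state)
    for pointers in reversed(back):
        state = pointers[state]
        labels.insert(0, state[0])
    return labels


# Toy transition table (hypothetical): the next label tends to repeat the
# label two shots back, mimicking the alternation rhythm of dialogue.
toy_trans = {s: (0.9 if s[1] == 1 else 0.1) for s in product((0, 1), repeat=3)}
toy_probs = [0.9, 0.1, 0.9, 0.1, 0.45, 0.1, 0.9]
print(viterbi_third_order(toy_probs, toy_trans))  # [1, 0, 1, 0, 1, 0, 1]
```

In this toy run, thresholding the detector output at 0.5 would label the fifth shot 0, whereas the decoded sequence labels it 1 because the alternation rhythm outweighs the weak detector score. In practice the transition table would be estimated from observed label sequences, as the abstract describes.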