Confounded Expectations: Informedia at TRECVID 2004

For TRECVID 2004, CMU participated in the semantic feature extraction task and in the manual, interactive and automatic search tasks. For the semantic feature classifiers, we tried unimodal, multi-modal and multi-concept classifiers. In interactive search, we compared a visual-only system against a complete video retrieval system using both visual and text data, and also contrasted expert and novice users. The manual runs were similar to 2003, but they did not perform comparably to last year. Additionally, we shared our low-level features with the TRECVID community.

Overview

We first describe the low-level features, which formed the input for all our analysis and which were distributed to other participants. We then sketch the experiments done for the semantic features, followed by the search task experiments (manual, automatic and interactive).

Low-level 'raw' features

Low-level features are extracted for each shot. "Low-level" means features that are directly extracted from the source videos; we use the term 'low-level' to distinguish them from the TRECVID high-level semantic feature extraction task. The low-level features are derived from several different sources: visual, audio, text and 'semantic' detectors, such as face detection and video OCR. In TRECVID 2004, we extracted 16 low-level raw features for the whole data set. This data was provided to all participating groups to encourage other researchers to use, or compare their approaches with, a standardized feature set.

Image features

A shot is the basic unit in our system; we therefore extract one key-frame within each shot as a representative image. Image features are then extracted from that representative image. There are three different types of image features: color histograms, textures and edges. For all image features, we split the image into a 5 by 5 grid to capture some spatial locality of information. The distributed data lists the features for each grid cell by rows, starting at the top.

HSV, RGB, HVC and HCSqr Color Histograms

Three different color spaces are used to construct color histogram features: HSV, HVC and RGB. Each grid cell is represented by a 125-dimensional color histogram: each channel is quantized into 5 bins and the three channels are combined into a 3D (5x5x5) histogram. For each image, the dimensionality of a color histogram feature is therefore 3125 (5*5*125). Due to this high dimensionality, we also provide the mean and variance for each grid cell, reducing the dimensionality to 50 (5*5*2). We also add an alternative feature called hcsqr, which is derived from the HVC color histogram but removes variance and linearizes Hue and Chroma into a 2D histogram.

Texture

Images are first converted to gray-scale. Each image is convolved with six oriented Gabor filters. For each filter, the image is divided into a 5 by 5 grid; the filtered grid cells are then thresholded and reduced to a 16-bin histogram. The dimensionality is 2400 (6*16*5*5).

Edge

Edge detection is done with the Canny edge detector. The Canny output is convolved with 8 orientation filters. Each grid cell contributes 8 dimensions giving the mean magnitude for the 8 orientations, so the dimensionality is 200 (5*5*8).
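To make the 5 by 5 grid layout concrete, the following is a minimal sketch of the per-cell color histogram computation, assuming a key-frame already loaded as an RGB numpy array. The function name and the per-cell normalization are illustrative; the exact quantization and normalization used in the distributed features may differ.

import numpy as np

def grid_color_histogram(image, grid=5, bins_per_channel=5):
    """Per-cell 3D color histogram over a grid x grid layout.

    image: H x W x 3 uint8 array (e.g., an RGB key-frame).
    Returns a (grid*grid, bins_per_channel**3) array, with cells listed
    row by row starting at the top, matching the distributed format.
    """
    h, w, _ = image.shape
    features = []
    for r in range(grid):
        for c in range(grid):
            cell = image[r * h // grid:(r + 1) * h // grid,
                         c * w // grid:(c + 1) * w // grid]
            hist, _ = np.histogramdd(
                cell.reshape(-1, 3),
                bins=bins_per_channel,
                range=[(0, 256)] * 3)
            # Normalize by the number of pixels in the cell (assumption).
            features.append(hist.ravel() / max(cell.size // 3, 1))
    return np.stack(features)

# Example: a random 240x320 "key-frame" yields a 25 x 125 feature matrix.
frame = np.random.randint(0, 256, (240, 320, 3), dtype=np.uint8)
print(grid_color_histogram(frame).shape)  # (25, 125)

The same grid logic applies to the HSV and HVC variants after a color-space conversion of the key-frame.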
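In the same spirit, the edge feature can be sketched as below, assuming OpenCV is available. Quantizing Sobel gradient orientations into 8 bins and averaging the gradient magnitude over Canny edge pixels is one plausible reading of "convolved with 8 orientations"; the actual orientation filters are not specified here.

import cv2
import numpy as np

def grid_edge_feature(gray, grid=5, n_orient=8):
    """Mean edge magnitude per orientation bin for each grid cell.

    gray: H x W uint8 gray-scale image.
    Returns a (grid*grid, n_orient) array (200 dims for a 5x5 grid).
    """
    edges = cv2.Canny(gray, 100, 200) > 0
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag = np.hypot(gx, gy)
    orient = (np.arctan2(gy, gx) + np.pi) / (2 * np.pi)        # map to 0..1
    bins = np.minimum((orient * n_orient).astype(int), n_orient - 1)

    h, w = gray.shape
    feats = np.zeros((grid * grid, n_orient), dtype=np.float32)
    for r in range(grid):
        for c in range(grid):
            ys = slice(r * h // grid, (r + 1) * h // grid)
            xs = slice(c * w // grid, (c + 1) * w // grid)
            cell_edges = edges[ys, xs]
            for b in range(n_orient):
                sel = cell_edges & (bins[ys, xs] == b)
                if sel.any():
                    feats[r * grid + c, b] = mag[ys, xs][sel].mean()
    return feats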
Audio features

We extract audio features every 20 msec (windows of 512 samples at a 44100 Hz sampling rate). However, the basic unit of analysis is a shot, which has variable length; we therefore calculate the mean and variance of the frame-level measurements for each shot.

FFT

The FFT feature is based on the Short Time Fourier Transform (STFT). The features are the means and variances of the spectral centroid, rolloff, flux and zero crossings. Another feature, low energy, is added, for a total of 9 (4*2+1) dimensions.

MFCC

The MFCC feature is based on 10 Mel-Frequency Cepstral Coefficients.

SFFT

SFFT is a simplified FFT feature; it lists only the means of the spectral centroid, rolloff, flux and zero crossings.

Motion features

Motion features try to capture the movement within a shot. Although they are very noisy, motion features potentially allow us to move from still image analysis to analysis of the moving video. Since the videos were encoded with different MPEG encoders using different motion block numbers, we did not use the MPEG P-frame motion blocks.

Kinetic Energy

Kinetic energy measures the pixel variation within the shot. We convert the frames to gray level and calculate frame-by-frame differences throughout the shot; the feature value is the mean of these differences. Although we could also split the image into a 5 by 5 grid here, we use this feature as a measure of the stability of a shot and did not split the image.

Optical Flow

Optical flow is calculated every 5 frames on a 5 by 5 grid, and each grid cell contributes 3 dimensions: the mean x direction, the mean y direction and the variance of the magnitude.

Text features

The text feature is derived from the audio transcript.

Semantic Detector features

Two of our features are not quite "low"-level features. However, they are very basic measurements that capture distinctive characteristics of the video. Since people play important roles in news video, face detection gives us useful information, and VOCR (video optical character recognition) often shows people's names and locations.

Faces

The face feature collects information about the most confident face detection within the shot. The 5 dimensions are confidence, face size, face pose (1 is front, 2 is left, 3 is right), and the x and y coordinates of the center point of the face detection box.

VOCR

The VOCR feature summarizes the VOCR detection boxes, i.e., the regions detected as likely to contain text. The four dimensions are the number of boxes, the average box size, and the mean x and y coordinates. The timestamped contents of the recognized text are listed in a separate file.

High-level semantic features

To classify the 10 high-level semantic features (Boat/Ship, Madeleine Albright, Bill Clinton, Train, Beach, Basketball scored, Airplane takeoff, People walking/running, Physical violence and Road), our baseline was a single-modality classification approach: for each feature, we chose one of our standard low-level feature classes and built a classifier on it. Separate runs used a multi-modality classification strategy, in which we built classifiers for each low-level feature and then combined them with a meta-classifier (stacking).
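The multi-modality runs follow the standard stacking pattern: per-feature base classifiers produce confidence scores, and a meta-classifier is trained on those scores using a held-out split. The sketch below uses scikit-learn with an SVM base learner and a logistic-regression combiner purely as stand-ins; the paper does not spell out the exact learners, and the function names are illustrative.

import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def train_stacked_detector(feature_sets, labels, meta_sets, meta_labels):
    """Stacking over several low-level feature sets for one concept.

    feature_sets / meta_sets: dicts mapping a feature name (e.g. 'color',
    'texture', 'mfcc') to an (n_shots, dim) array for the base-training
    and meta-training splits respectively. Learners are illustrative.
    """
    base = {}
    for name, X in feature_sets.items():
        base[name] = SVC(probability=True).fit(X, labels)

    # Meta-level inputs: one confidence score per base classifier.
    meta_X = np.column_stack(
        [base[name].predict_proba(meta_sets[name])[:, 1]
         for name in feature_sets])
    meta = LogisticRegression().fit(meta_X, meta_labels)
    return base, meta

def score_shot(base, meta, shot_features):
    """Combine base-classifier confidences for an unseen shot."""
    x = np.array([[base[name].predict_proba(
        shot_features[name].reshape(1, -1))[0, 1] for name in base]])
    return meta.predict_proba(x)[0, 1]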
The main challenge for semantic feature extraction from video is the large diversity: low-level features tend not to capture a complete semantic class well. For example, 'outdoors' contains many different concepts, as well as colors, textures and shapes; it may be an urban scene, a rural scene, a beach scene or another natural scene, each with different colors, textures, etc. We therefore attempted to utilize other semantic features to boost the performance of a specific classifier (or detector). For example, to build a classifier for outdoor scenes, we can build (hopefully easier) classifiers for sky, ocean, tree, grassland, road, building and other outdoor-related concepts. Ideally, the outdoor classifier gains power from these easier and stronger detectors, and will thus be able to correctly classify many different scenes. Our approach in TRECVID 2004 was to find other concepts related to each classification task and then, using the principles of causation and inference, to develop two classification strategies.

Table 1. Causal relationships for the TRECVID 2004 concepts.

ID  Target concept           Causally related concepts
28  Boat/Ship                Boat, Water_Body, Sky, Cloud
31  Train                    Car_Crash, Man_Made_scene, Smoke, Road
32  Beach                    Sky, Water_Body, Nature_Non-Vegetation, Cloud
33  Basket Scored            Crowd, People, Running, Non-Studio_Setting
34  Airplane Takeoff         Airplane, Sky, Smoke, Space_Vehicle_Launch
35  People Walking/running   Walking, Running, People, Person
36  Physical violence        Gun_Shot, Building, Gun, Explosion
37  Road                     Car, Road_Traffic, Truck, Vehicle_Noise

Causation

The common annotation set, distributed in TRECVID 2003, labeled several hundred semantic concepts on the TRECVID 2003 development set of 47,322 shots. Among those concepts, 190 have a frequency higher than 10. We analyzed the causal relationship of these 190 concepts to 8 of the high-level semantic feature concepts listed above; we excluded Madeleine Albright and Bill Clinton, because those two were more suitable for the specific person-X search strategies described below. We selected the top 4 causal origins for each concept and grouped them together. Table 1 shows the 4 concepts determined to cause each evaluated target concept.

Inference

Using each group of 5 concepts (the 4 causal ones plus the target concept), we built a multi-modality classifier for each concept. To train the combination parameters, we split our training data into two sets: the first set is used to build the classifiers for each individual concept, and the second set is used to validate the combination. The next step is to infer the target concept from the causal concept classifier results. We experimented with two approaches (A and B) to combine the classifier results.

A. We use the confidence of the causal relationship (from 0 to 1) and the error rate obtained on the training set to combine the results.
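Approach A is only outlined above. The following is a minimal sketch of one plausible reading, in which each causal concept classifier's output is weighted by the strength of its causal link and discounted by its training-set error rate; the weighting scheme, the equal split with the target's own classifier, and the function name are assumptions, not the paper's exact formulation.

import numpy as np

def combine_causal_scores(target_score, causal_scores,
                          causal_confidence, error_rate):
    """Fuse a target-concept score with its causal concepts' scores.

    target_score:      classifier output for the target concept (0..1).
    causal_scores:     outputs of the causal-concept classifiers.
    causal_confidence: strength of each causal link, in [0, 1].
    error_rate:        training-set error rate of each causal classifier.

    The weighting below (confidence discounted by error rate) is an
    assumed reading of approach A, not the exact formulation.
    """
    causal_scores = np.asarray(causal_scores, dtype=float)
    weights = np.asarray(causal_confidence) * (1.0 - np.asarray(error_rate))
    if weights.sum() == 0:
        return target_score
    causal_part = np.dot(weights, causal_scores) / weights.sum()
    # Give the target concept's own classifier and the causal evidence
    # equal say; this 50/50 split is also an assumption.
    return 0.5 * target_score + 0.5 * causal_part

# Example: a Beach shot with strong Sky and Water_Body evidence.
print(combine_causal_scores(
    target_score=0.4,
    causal_scores=[0.9, 0.8, 0.2, 0.6],
    causal_confidence=[0.7, 0.8, 0.3, 0.5],
    error_rate=[0.1, 0.15, 0.4, 0.2]))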