We participated in the high-level feature extraction and search tasks for TRECVID 2005. For the high-level feature extraction task, we make use of the available collaborative annotation results for training and develop two methods to perform automated concept annotation: (a) a ranked Maximal Figure-of-Merit (MFoM) method; and (b) a multimodal RankBoost fusion method. We submitted a total of 7 runs based on these two methods. For the search task, we focus on improving our previous retrieval system by utilizing an event entity model derived from relevant external resources. In addition, we also make use of the high-level feature extraction results contributed by the various participating groups to help in the re-ranking step. We submitted a total of 6 runs in the automated search category. The evaluation results show that our event-based approach is effective for human/event queries and that the high-level features are useful for general queries.

1. HIGH LEVEL FEATURE EXTRACTION TASK

We explore two methods to perform high-level feature extraction. The first is based on a ranked Maximal Figure-of-Merit (MFoM) method that has been successfully employed in text categorization. The second employs HMMs for high-level feature extraction, followed by RankBoost fusion to combine them with features from other modalities.

1.1 Ranked Maximal Figure-of-Merit (MFoM)

The conventional approach to semantic concept detection is to train a binary classifier (e.g. SVM, boosting) on the training set by optimizing a generalized classification error or maximizing the likelihood. In the high-level feature extraction task, however, our concern is to rank the relevant shots as high as possible. Therefore, the basis of this approach lies in learning a ranking function that is optimal in terms of mean average precision (MAP) on the development dataset, given any type of multimedia content representation (text from ASR, visual features, etc.). Here we develop an algorithm for training the ranking function with the goal of optimizing the MAP. This algorithm is similar to our work in (Gao et al., 2003 & 2004) and to ROC optimization for classifiers (Cortes & Mohri, 2003; Yan et al., 2003), where the objective function to be optimized is derived from an approximation of the evaluation metric of interest. A good measure for ranking is the Wilcoxon-Mann-Whitney statistic, which is equal to the area under the ROC curve (AUC), defined over the training set as:

U = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} I(x_i, y_j),   (1)

where I(x_i, y_j) = 1 if x_i > y_j and 0 otherwise, with x_i and y_j being the classifier scores for the i-th of the M positive samples and the j-th of the N negative samples, respectively. A classifier is then trained by maximizing Eq. (1) using a gradient algorithm. A sigmoid function (see Eq. (2)) is used to approximate the correct-ranking count I(x_i, y_j):

S(x_i, y_j) = \frac{1}{1 + e^{-\beta (x_i - y_j)}},   (2)

where \beta is a constant. After smoothing, Eq. (1) becomes a differentiable function of the classifier parameters (through x_i and y_j) and serves as the objective for optimisation. Because it is highly non-linear, a gradient descent algorithm is applied to find its solution, as in (Gao et al., 2003 & 2004). The ranking optimisation algorithm, named Rank-MFoM, is derived from the MFoM learning in (Gao et al., 2003 & 2004). We submitted 4 runs using the Rank-MFoM algorithm.
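To make the smoothed objective concrete, the following is a minimal numpy sketch of gradient ascent on the sigmoid-smoothed AUC of Eqs. (1)-(2) for a simple linear scoring function. The linear form, the parameter values (beta, learning rate, iteration count) and the function names are illustrative assumptions, not the actual Rank-MFoM implementation used for the submitted runs.

```python
import numpy as np

def smoothed_auc(w, pos, neg, beta=2.0):
    """Sigmoid-smoothed Wilcoxon-Mann-Whitney statistic (Eqs. 1-2) for a
    linear scoring function s(v) = w . v."""
    x = pos @ w                              # scores of the M positive samples
    y = neg @ w                              # scores of the N negative samples
    diff = x[:, None] - y[None, :]           # all pairwise differences x_i - y_j
    return (1.0 / (1.0 + np.exp(-beta * diff))).mean()

def rank_train(pos, neg, beta=2.0, lr=0.1, iters=200):
    """Gradient ascent on the smoothed AUC; a toy stand-in for Rank-MFoM."""
    w = np.zeros(pos.shape[1])
    for _ in range(iters):
        diff = (pos @ w)[:, None] - (neg @ w)[None, :]
        S = 1.0 / (1.0 + np.exp(-beta * diff))
        G = beta * S * (1.0 - S)             # dS/d(x_i - y_j)
        # d(x_i - y_j)/dw = pos_i - neg_j, accumulated over all pairs
        grad = (G.sum(axis=1) @ pos - G.sum(axis=0) @ neg) / G.size
        w += lr * grad
    return w
```

A call such as rank_train(pos_features, neg_features) returns a weight vector whose scores can be used directly to rank shots by relevance.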
They are based on: (a) the text feature only; (b) the texture feature only; and (c-d) fusions of both features using two different settings (e.g. \beta, the learning rate, and the number of iteration cycles) for the Rank-MFoM algorithm. The results are presented below.

Run A (TRECVID Run 6): The text-only run. A linear classifier is trained by Rank-MFoM on shot-level text documents comprising the ASR outputs within a window of 3 shots. A lexicon of 3,464 terms is extracted from the ASR outputs in the development set after the removal of stop words and of both very rare and very frequent words. Each text document is represented using tf-idf features. In this run, shots without ASR outputs are ignored.

Run B (TRECVID Run 7): The texture-only run. Each image is uniformly segmented into 77 grids of size 32x32. We extract a 12-dimensional texture feature (energies of log-Gabor filters) from each grid. We then apply a Gaussian Mixture Model (GMM) to model the texture distributions of the positive and negative shots. The ranking function is the log-likelihood ratio between the positive and the negative class, defined for an image as

L(X) = \frac{1}{77} \sum_{t=1}^{77} [ \log P(x_t \mid C_0) - \log P(x_t \mid C_1) ],   (3)

with

P(x \mid C_i) = \sum_{j=1}^{M} w_i^j \, N(x; \mu_i^j, \Sigma_i^j), \quad i = 0, 1,   (4)

where M is the number of Gaussian mixture components (here we use M = 4); N(\mu_i^j, \Sigma_i^j) is the j-th Gaussian distribution with mean \mu_i^j and covariance \Sigma_i^j for the i-th class; and w_i^j is the corresponding weight of each Gaussian component. Here C_0 is the positive class and C_1 is the negative class. For each concept, the parameters w_i^j, \mu_i^j and \Sigma_i^j are estimated using Rank-MFoM from the images available in the development set.

Run C (TRECVID Run 5): Fusion run 1. Since the output of the texture-based classifier is at the key-frame level, we simply take the maximal likelihood-ratio score among all key-frames of a shot as its shot-level output. We then combine it with the output of the text-based classifier to form a 2-dimensional vector. For shots without ASR outputs, we treat the text score as a missing feature rather than removing those shots from the training set for fusion. The 2-dimensional vector is fused by training a single-mixture GMM using the Rank-MFoM algorithm.

Run D (TRECVID Run 3): Fusion run 2. This run differs from Run C in that the shot-level score is taken to be the average over all key-frames of the shot.

To illustrate the effectiveness of Rank-MFoM compared to a standard SVM using the text feature, we performed a preliminary run on a subset of videos in the development set. Table 1 tabulates the MAP scores for the 10 concepts over the top 2000 shots. The results indicate that Rank-MFoM outperforms SVM on the automatic concept annotation task.
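Before turning to the results, the following is a minimal sketch of the likelihood-ratio scoring of Eqs. (3)-(4) and of the two shot-level pooling rules used in Runs C and D. The scipy-based density evaluation, the function names and the parameter layout are illustrative assumptions; in the submitted runs the GMM parameters are estimated with Rank-MFoM rather than supplied externally.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_loglik(grids, weights, means, covs):
    """log P(x_t | C_i) under an M-component GMM (Eq. 4) for every grid.
    grids: (77, 12) texture vectors; weights: (M,); means: (M, 12); covs: (M, 12, 12)."""
    per_comp = np.stack([w * multivariate_normal.pdf(grids, mean=m, cov=c)
                         for w, m, c in zip(weights, means, covs)])
    return np.log(per_comp.sum(axis=0))      # shape (77,)

def keyframe_score(grids, gmm_pos, gmm_neg):
    """Log-likelihood ratio ranking function of Eq. (3) for one key-frame."""
    return float((gmm_loglik(grids, *gmm_pos) - gmm_loglik(grids, *gmm_neg)).mean())

def shot_score(keyframes, gmm_pos, gmm_neg, pool="max"):
    """Shot-level texture score: max over key-frames (Run C) or average (Run D)."""
    scores = [keyframe_score(g, gmm_pos, gmm_neg) for g in keyframes]
    return max(scores) if pool == "max" else float(np.mean(scores))
```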
Table 1: Preliminary results for Rank-MFoM vs. SVM on a subset of the TRECVID 2005 Development Set

Concept     38      39      40      41      42      43      44      45      46      47      All
Rank-MFoM   0.0091  0.0158  0.0759  0.0291  0.0076  0.0083  0.0175  0.0004  0.0997  0.0399  0.0303
SVM         0.0060  0.0102  0.0045  0.0003  0.0037  0.0091  0.0055  0.0000  0.0405  0.0133  0.0093

Table 2: Performance of the submitted runs in terms of MAP from the TRECVID evaluation

Concept  38      39      40      41      42      43      44      45      46      47      All
Run A    0.0449  0.0169  0.0907  0.0498  0.0703  0.0231  0.0287  0.0054  0.0735  0.0638  0.0467
Run B    0.0727  0.0198  0.0527  0.0155  0.0990  0.1143  0.0693  0.0033  0.0769  0.0412  0.0565
Run C    0.0999  0.0426  0.1540  0.0809  0.1326  0.0989  0.0962  0.0102  0.1906  0.1189  0.1025
Run D    0.0872  0.0323  0.1575  0.0795  0.1332  0.0887  0.0939  0.0072  0.2249  0.1070  0.1011

Table 2 lists the results of the 4 runs on the TRECVID 2005 test set. The results clearly show that the fusion runs (Run C and Run D) perform significantly better than the runs employing only an individual feature (Run A and Run B). The overall MAP of the texture-feature-only run is also slightly better than that of the text-feature-only run. We observe from the results that fusion has a positive effect on the overall performance, except for concept 43, i.e. Waterscape/Waterfront. The results show that Rank-MFoM is effective in fusing features from different modalities for the detection problem. For these submissions, we focused only on the extracted text and texture features. In future work, more features will be introduced and fused. In addition, we will look at the associations between different modalities and features to further improve the detection performance.

1.2 Multimodal RankBoost Fusion

This approach aims to fuse a combination of both low-level and high-level features using the RankBoost algorithm. The list of features to be fused includes: text features from ASR, audio genre, face information, shot genre, image matching, and visual concepts. The detection of the first five features follows standard techniques as described in Section 2.1. Here we elaborate on the extraction of visual concepts using HMMs. To detect visual concepts within each key-frame image, we employ an HMM to model the association between a concept and its positive images. This is done by segmenting each key-frame into fixed 4x4 blocks and then clustering them. The visual features used to represent each image block are the Luv color histogram, adaptive Matching Pursuit texture features, and the edge histogram (Shi et al., 2004). The dimension of the resulting visual feature vector is 90. Our method treats each block in a key-frame separately, so a data point in the low-level visual space is simply a 90-dimensional block vector. We then perform k-means clustering (with k = 500) on the block vectors of the training images. Each training image is tokenized by dividing the image into a set of regular-sized blocks, each represented by a feature vector and quantized into a visual cluster. As shown in Figure 1, the content representation of each image is modeled as having been stochastically generated by an HMM, where each state generates blocks with similar features. In addition, the spatial transitions and co-occurrences between the fixed-size blocks within an image can be utilized to help detect the feature concepts. A sketch of the tokenization and HMM scoring steps is given below.

Figure 1: HMM for Concept i.
Figure 2: Training and Testing Process for Visual Feature.
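As a sketch of the block tokenization and HMM scoring just described, the snippet below quantizes 90-dimensional block vectors into k = 500 visual clusters and evaluates the log-likelihood of the resulting token sequence under a discrete HMM with the scaled forward algorithm. The scikit-learn clustering call, the function names, and the hand-rolled forward recursion are illustrative assumptions; the HMM parameters (pi, A, B) would come from the per-concept training process of Figure 2.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(block_features, k=500, seed=0):
    """Cluster 90-dim block vectors into k visual clusters (the image 'tokens').
    block_features: (num_blocks_total, 90) array pooled over the training key-frames."""
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(block_features)

def tokenize(codebook, image_blocks):
    """Map the blocks of one key-frame to their cluster indices (a token sequence)."""
    return codebook.predict(image_blocks)

def hmm_loglik(tokens, pi, A, B):
    """Scaled forward algorithm: log-likelihood of a token sequence under a
    discrete HMM with S states and k output symbols.
    pi: (S,) initial probs; A: (S, S) transition probs; B: (S, k) emission probs."""
    alpha = pi * B[:, tokens[0]]
    c = alpha.sum()
    alpha /= c
    loglik = np.log(c)
    for t in tokens[1:]:
        alpha = (alpha @ A) * B[:, t]
        c = alpha.sum()
        alpha /= c
        loglik += np.log(c)
    return loglik
```

A key-frame can then be scored for a concept by its log-likelihood under that concept's HMM; how these scores are combined with the other features is handled by the RankBoost fusion.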