Early versus late fusion in semantic video analysis