Semantic-level integration of video and speech data streams