Paper information - Grounding spatial language for video search

Grounding spatial language for video search

The ability to find a video clip that matches a natural language description of an event would enable intuitive search of large databases of surveillance video. We present a mechanism for connecting a spatial language query to a video clip corresponding to the query. The system can retrieve video clips matching millions of potential queries that describe complex events in video such as "people walking from the hallway door, around the island, to the kitchen sink." By breaking down the query into a sequence of independent structured clauses and modeling the meaning of each component of the structure separately, we are able to improve on previous approaches to video retrieval by finding clips that match much longer and more complex queries using a rich set of spatial relations such as "down" and "past." We present a rigorous analysis of the system's performance, based on a large corpus of task-constrained language collected from fourteen subjects. Using this corpus, we show that the system effectively retrieves clips that match natural language descriptions: 58.3% were ranked in the top two of ten in a retrieval task. Furthermore, we show that spatial relations play an important role in the system's performance.
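The "top two of ten" figure in the abstract is a top-k retrieval rate: the share of queries for which the matching clip is ranked within the first k of the candidate clips. As a minimal sketch of how such a rate could be computed, assuming each query yields a 1-based rank for its correct clip (the function name `top_k_rate` and the example ranks are hypothetical, not taken from the paper):

```python
from typing import Sequence


def top_k_rate(ranks: Sequence[int], k: int = 2) -> float:
    """Return the fraction of queries whose correct clip is ranked in the top k.

    `ranks` holds one 1-based rank per query (1 = best match).
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Made-up ranks of the correct clip among ten candidates, one entry per query.
# These values are illustrative only and are not results from the paper.
example_ranks = [1, 2, 5, 1, 3, 2, 7, 1, 4, 10]
rate = top_k_rate(example_ranks, k=2)
print(f"{rate:.1%} of queries ranked the correct clip in the top 2")
```

If the system ranked ten candidate clips uniformly at random, the correct clip would land in the top two 20% of the time, which gives a baseline for reading the reported rate.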
