VIsual TRAnslator: Linking perceptions and natural language descriptions

Despite the fact that image understanding and natural language processing constitute two major areas of AI, there have only been a few attempts toward the integration of computer vision and the generation of natural language expressions for the description of image sequences. In this contribution we will report on practical experience gained in the project Vitra (VIsual TRAnslator) concerning the design and construction of integrated knowledge-based systems capable of translating visual information into natural language descriptions. In Vitra different domains, like traffic scenes and short sequences from soccer matches, have been investigated. Our approach towards simultaneous scene description emphasizes concurrent image sequence evaluation and natural language processing, carried out on an incremental basis, an important prerequisite for real-time performance. One major achievement of our cooperation with the vision group at the Fraunhofer Institute (IITB, Karlsruhe) is the automatic generation of natural language descriptions for recognized trajectories of objects in real world image sequences. In this survey, the different processes pertaining to high-level scene analysis and natural language generation will be discussed.
