A text-based fully automated architecture for the semantic annotation and retrieval of Turkish news videos

Video texts are known to constitute an important source of information for semantic summaries of video archives. In this study, we propose a fully automated architecture for the semantic annotation and subsequent retrieval of Turkish news videos based on their video texts. At the core of the architecture is a named entity recognizer whose output on the video texts serves as semantic annotations for the corresponding videos. In addition to a news video database, the architecture comprises components for news story segmentation, sliding text recognition, and video retrieval. The news story segmentation module uses the audio waveforms of the raw video files to detect the boundaries of individual news stories. The sliding text recognizer is then run on the video segments corresponding to these news stories to extract their texts. These texts are fed into a named entity recognizer for Turkish news texts, and the extracted named entities are used as semantic annotations, or index terms, for retrieving the news videos. Finally, the retrieval interface provides access to the annotated videos and video segments through Boolean queries formed from the previously extracted named entities. The study is significant in that it proposes the first fully automated architecture for the semantic annotation and retrieval of Turkish news video archives.
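
The final retrieval step, Boolean queries over previously extracted named entities, can be illustrated with a small sketch. This is not the authors' implementation: the class `NamedEntityIndex`, its method names, the segment identifiers, and the sample entities are all hypothetical, and the sketch assumes that story segmentation, sliding text recognition, and named entity recognition have already produced a list of entities per news story segment.

```python
from collections import defaultdict


class NamedEntityIndex:
    """Hypothetical inverted index: named entity -> ids of annotated news story segments."""

    def __init__(self):
        self._postings = defaultdict(set)  # normalized entity -> segment ids
        self._segments = set()             # every annotated segment id

    @staticmethod
    def _key(entity):
        # Assumption: simple Unicode case folding. This does not handle the
        # Turkish dotted/dotless i distinction, which a real system would need.
        return entity.strip().casefold()

    def annotate(self, segment_id, entities):
        """Store the named entities extracted for one news story segment."""
        self._segments.add(segment_id)
        for entity in entities:
            self._postings[self._key(entity)].add(segment_id)

    def having(self, entity):
        """Segments annotated with the given entity."""
        return set(self._postings.get(self._key(entity), ()))

    def all_of(self, *entities):
        """Boolean AND: segments annotated with every listed entity."""
        result = set(self._segments)
        for entity in entities:
            result &= self.having(entity)
        return result

    def any_of(self, *entities):
        """Boolean OR: segments annotated with at least one listed entity."""
        result = set()
        for entity in entities:
            result |= self.having(entity)
        return result

    def without(self, entity):
        """Boolean NOT: segments not annotated with the given entity."""
        return self._segments - self.having(entity)


# Made-up annotations, standing in for the output of the named entity recognizer.
index = NamedEntityIndex()
index.annotate("story-1", ["Ankara", "Meclis"])
index.annotate("story-2", ["Ankara", "Galatasaray"])
index.annotate("story-3", ["Galatasaray"])

# Query: Ankara AND NOT Galatasaray
hits = index.all_of("Ankara") & index.without("Galatasaray")
print(sorted(hits))  # ['story-1']
```

Because each query operator returns a plain set of segment ids, compound Boolean queries can be composed with Python's own set operators, as in the last query above.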
