论文信息 - Words and Pictures in the News

Words and Pictures in the News

We discuss the properties of a collection of news photos and captions, collected from the Associated Press and Reuters. Captions have a vocabulary dominated by proper names. We have implemented various text clustering algorithms to organize these items by topic, as well as an iconic matcher that identifies articles that share a picture. We have found that the special structure of captions allows us to extract some names of people actually portrayed in the image quite reliably, using a simple syntactic analysis. We have been able to build a directory of face images of individuals from this collection.

[1] David A. Forsyth,et al. Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary , 2002, ECCV.

[2] Hinrich Schütze,et al. Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[3] David Salesin,et al. Fast multiresolution image querying , 1995, SIGGRAPH.

[4] Yoshua Bengio,et al. Pattern Recognition and Neural Networks , 1995 .

[5] Robert L. Mercer,et al. The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[6] George A. Miller,et al. WordNet: A Lexical Database for the English Language , 2002 .

[7] Michael I. Jordan,et al. Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..