Measuring historical word sense variation

We describe a method for automatically identifying word sense variation in a dated collection of historical books in a large digital library. By leveraging a small set of known translation book pairs to induce a bilingual sense inventory and labeled training data for a word sense disambiguation (WSD) classifier, we automatically classify the Latin word senses in a 389-million-word corpus and track the rise and fall of those senses over a span of two thousand years. We evaluate seven different classifiers both in a tenfold test on 83,892 words from the aligned parallel corpus and on a smaller, manually annotated sample of 525 words, measuring both the overall accuracy of each system and how well that accuracy correlates (via mean squared error) with the observed historical variation.
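The two evaluation measures named above can be illustrated with a minimal sketch. It assumes that each annotated token carries a period label, a gold sense, and a predicted sense, and that the mean squared error is taken between per-period sense proportions computed from the gold and predicted labels; the abstract does not spell out this formulation, so it is an assumption. The function names, period labels, and sense labels are hypothetical, and the eight toy tokens are not data from the study.

```python
from collections import Counter, defaultdict


def accuracy(gold, predicted):
    """Fraction of tokens whose predicted sense equals the gold sense."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def sense_proportions(periods, labels, senses):
    """Return {period: {sense: share of that period's tokens}}."""
    counts = defaultdict(Counter)
    for period, label in zip(periods, labels):
        counts[period][label] += 1
    return {
        period: {s: c[s] / sum(c.values()) for s in senses}
        for period, c in counts.items()
    }


def mean_squared_error(observed, estimated, senses):
    """Average squared difference over every (period, sense) cell."""
    errors = [
        (observed[p][s] - estimated[p][s]) ** 2
        for p in observed
        for s in senses
    ]
    return sum(errors) / len(errors)


# Hypothetical toy sample: eight tokens of one word across two periods.
senses = ["sense_a", "sense_b"]
periods = ["period_1"] * 4 + ["period_2"] * 4
gold = ["sense_a", "sense_a", "sense_a", "sense_b",
        "sense_b", "sense_b", "sense_b", "sense_a"]
pred = ["sense_a", "sense_a", "sense_b", "sense_b",
        "sense_b", "sense_b", "sense_a", "sense_a"]

acc = accuracy(gold, pred)
observed = sense_proportions(periods, gold, senses)
estimated = sense_proportions(periods, pred, senses)
mse = mean_squared_error(observed, estimated, senses)

print(f"accuracy = {acc:.4f}")
print(f"MSE of per-period sense proportions = {mse:.4f}")
```

Reporting both numbers is useful because a classifier can label most tokens correctly while its errors still cluster in one period and distort the estimated trend, or it can make offsetting errors that leave the per-period proportions nearly intact.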
