ODIN: A Model for Adapting and Enriching Legacy Infrastructure

The Online Database of Interlinear Text (ODIN) is a database of interlinear text "snippets", harvested mostly from scholarly documents posted to the Web. Large amounts of language data are posted to the Web as part of scholarly discourse, which makes the existing "e-Linguistic infrastructure" surprisingly rich. However, most linguistic data available on the Web exists in legacy formats, is highly display-centric, and is often difficult to locate or to use interoperably. ODIN seeks to turn this existing infrastructure into a rich, searchable, and interoperable resource by converting readily available semi-structured data to content-centric, searchable formats. To do this, ODIN mines scholarly papers and webpages for instances of linguistic data, focusing mostly on interlinear texts; it extracts these instances, identifies their source languages, and makes them available to search. Through ODIN's standard search feature, users can locate data by language name or Ethnologue code, and display lists of data by document for languages of interest. The newer Advanced Search feature allows users to locate instances by the grammatical markup they contain (e.g., NOM, ACC, ERG, PST, 3SG) and by linguistic construction (e.g., passives, conditionals, possessives, raising constructions). Construction-based search is made possible by additional enrichment of the discovered data using automated statistical taggers and parsers.
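To make the two kinds of search concrete, the following is a minimal sketch of filtering interlinear instances by language code and by grammatical markup. It is not ODIN's implementation: the `IGTInstance` record, its field names, the `gloss_tags` and `search` helpers, the tag-matching heuristic, and the sample entries (invented placeholder strings with made-up language codes, not real language data) are all assumptions introduced here for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class IGTInstance:
    """One interlinear snippet (hypothetical record layout, not ODIN's schema)."""
    language_code: str  # language identifier, e.g. an Ethnologue-style code
    source_doc: str     # document the snippet was harvested from
    text: str           # language line
    gloss: str          # morpheme-by-morpheme gloss line
    translation: str    # free translation


def gloss_tags(instance):
    """Collect upper-case grammatical tags (e.g. NOM, 3SG) from the gloss line.

    Assumed heuristic: split on whitespace and the separators '-', '.', '=',
    then keep tokens made of optional digits followed by 2+ capital letters.
    """
    tokens = re.split(r"[\s\-.=]+", instance.gloss)
    return {t for t in tokens if re.fullmatch(r"[0-9]*[A-Z]{2,}", t)}


def search(instances, language_code=None, tags=()):
    """Return instances matching the language code (if given) and all tags."""
    wanted = set(tags)
    return [
        inst for inst in instances
        if (language_code is None or inst.language_code == language_code)
        and wanted <= gloss_tags(inst)
    ]


# Invented placeholder data: the codes and word forms are not real.
SAMPLE = [
    IGTInstance("xxa", "doc1.pdf", "kala-k miso-t nak-i",
                "dog-ERG cat-ABS see-PST.3SG", "The dog saw the cat."),
    IGTInstance("xxb", "doc2.html", "ren-a sol-u mir-e",
                "man-NOM book-ACC read-PST", "The man read the book."),
]

if __name__ == "__main__":
    for hit in search(SAMPLE, tags=["PST"]):
        print(hit.language_code, hit.source_doc, hit.gloss)
```

Search by linguistic construction would need more than this kind of tag matching, since the abstract attributes it to enrichment by statistical taggers and parsers; the sketch covers only the markup-based and language-based lookups.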
