Facilitating the compilation and dissemination of ad-hoc web corpora

Since the World Wide Web gained prominence in the mid-1990s, it has tantalized language investigators and instructors as a virtually unlimited source of machine-readable texts for compiling corpora and developing teaching materials. The broad range of languages and content domains found online also offers translators enormous promise, both for translation by example and as a comprehensive supplement to published reference works. This paper surveys the impediments which still prevent the Web from realizing its full potential as a linguistic resource and discusses tools to overcome the remaining hurdles. Identifying online documents which are both relevant and reliable presents a major challenge. As a partial solution, the author's Web concordancer KWiCFinder automates the process of seeking and retrieving webpages. Enhancements which permit more focused queries than existing search engines and provide search results in an interactive exploratory environment are described in detail. Despite the efficiency of automated downloading and excerpting, selecting Web documents still entails significant time and effort. To multiply the benefits of a search, an online forum for sharing annotated search reports and linguistically interesting texts with other users is outlined. Furthermore, the orientation of commercial search engines toward the general public makes them less beneficial for linguistic research. The author sketches plans for a specialized Search Engine for Applied Linguists and a selective Web Corpus Archive which build on his experience with KWiCFinder. He compares his available and proposed solutions to existing resources, and surveys ways to exploit them in language teaching. Together these proposed services will enable language learners and professionals to tap into the Web effectively and efficiently for instruction, research and translation.
