A survey of methods to ease the development of highly multilingual text mining applications

Multilingual text processing is useful because the information content found in different languages is complementary, both regarding facts and opinions. While Information Extraction and other text mining software can, in principle, be developed for many languages, most text analysis tools have only been applied to small sets of languages because the development effort per language is large. Self-training tools alleviate the problem, but even the effort of providing training data and of manually tuning the results is usually considerable. In this paper, we gather insights from various multilingual system developers on how to minimise the effort of developing natural language processing applications for many languages. We also explain the main guidelines underlying our own effort to develop complex text mining software for tens of languages. While these guidelines, above all extreme simplicity, can be very restrictive, we believe we have shown the feasibility of the approach through the development of the Europe Media Monitor (EMM) family of applications (http://emm.newsbrief.eu/overview.html). EMM is a set of complex media monitoring tools that process and analyse up to 100,000 online news articles per day in twenty to fifty languages. We also touch upon the kinds of language resources that would make it easier for everyone to develop highly multilingual text mining applications. We argue that the resources most needed to achieve this are freely available, simple, parallel and uniform multilingual dictionaries, corpora and software tools.
