Paper information - Introducing the La Repubblica Corpus: A Large, Annotated, TEI(XML)-compliant Corpus of Newspaper Italian

This paper describes the La Repubblica corpus, under development at the SSLMIT of the University of Bologna. The corpus is a very large collection of newspaper text, currently amounting to 175 million words and expected to grow to 400 million before the end of 2004. When completed, it will contain all the articles published between 1985 and 2000 by the national daily La Repubblica. The paper discusses the techniques used to extract the text, tokenize it, and annotate it (basic TEI annotation, POS tagging, genre/topic categorization); presents examples of how the corpus can be used; and gives details of the ways in which interested users can access it. It concludes with a discussion of current and future developments and of the strengths and weaknesses of this resource.
