Text Categorization for Internet Content Filtering

Text Filtering is one of the most challenging and useful tasks in the Multilingual Information Access field. In a number of filtering applications, Automated Text Categorization of documents plays a key role. In this paper, we present two such applications (Hermes and POESIA), focused on personalized news delivery and the blocking of inappropriate Internet content, respectively. We are specifically concerned with the role of Automated Text Categorization in these applications, and with how the task is approached in a multilingual environment. Beyond the details of the methods employed in our work, we envisage new solutions for a more complex task we have called Cross-Lingual Text Categorization.
