Classifying Web corpora into domain and genre using automatic feature identification

Texts in representative corpora are typically classified by domain and genre. However, it is not clear whether existing domain and genre typologies can be applied to unlabeled data collected from the Web, for instance to the output of a crawl. This study attempts to establish the most suitable categories for describing the domains and genres of arbitrary Web texts, and to estimate how accurately such texts can be classified automatically using machine learning methods such as Support Vector Machines (SVM) and clustering (repeated bisections and graph clustering). We also discuss methods for inducing the most discriminative features for this classification. The method is designed to work with few or no linguistic resources and has been validated on four languages: English, German, Chinese and Russian.
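The abstract does not say how the discriminative features are induced. As a minimal sketch of one resource-free possibility, the example below scores each word by the log-likelihood statistic used in frequency profiling, comparing its frequency in two small sets of texts. The function names, the whitespace tokenization and the toy documents are assumptions made for illustration only, not the procedure of the study.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) score for a word seen a times in corpus 1
    (c tokens in total) and b times in corpus 2 (d tokens in total)."""
    # Expected counts if the word were equally frequent in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def discriminative_words(docs_a, docs_b, top_n=5):
    """Rank words by how strongly their frequency differs between two
    lists of documents. Tokenization is naive whitespace splitting
    (an assumption; no linguistic resources are used)."""
    freq_a = Counter(w for doc in docs_a for w in doc.lower().split())
    freq_b = Counter(w for doc in docs_b for w in doc.lower().split())
    size_a = sum(freq_a.values())
    size_b = sum(freq_b.values())
    scores = {
        w: log_likelihood(freq_a[w], freq_b[w], size_a, size_b)
        for w in set(freq_a) | set(freq_b)
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

# Hypothetical toy data standing in for two candidate text categories.
legal = ["the court ruled the appeal", "the court heard the case"]
cooking = ["the recipe needs the flour", "the oven bakes the bread"]
top = discriminative_words(legal, cooking, top_n=3)
```

Words that are equally frequent in both sets, such as "the" here, score zero, while words concentrated in one set rise to the top of the ranking. The highest-scoring words could then serve as features for a classifier or a clustering step.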
