Genre as noise: noise in genre

Given a specific information need, documents of the wrong genre can be regarded as noise. From this perspective, genre classification helps to separate relevant documents from noise. Orthographic errors represent a second, finer notion of noise. Since specific genres often include documents with many errors, an interesting question is whether this “micro-noise” can help to classify genre. In this paper we consider both problems. After introducing a comprehensive hierarchy of genres, we present an intuitive method for building specialized and distinctive classifiers that also work with very small training corpora. Special emphasis is given to the selection of intelligent high-level features. We then investigate the correlation between genre and micro-noise. Using special error dictionaries, we estimate the typical error rates for each genre. Finally, we test whether the error rate of a document is a useful feature for genre classification.
