Automatic Extraction of Domain-Specific Stopwords from Labeled Documents

This paper discusses the automatic extraction of a domain-specific stopword list from a large labeled corpus. Most studies remove stopwords using a standard stopword list together with high and low document-frequency thresholds. We propose a new approach to stopword extraction based on the notion of backward filter-level performance and a sparsity measure of the training data. First, we discuss the motivation for updating existing lists or building new ones. Second, using the proposed backward filter-level performance, we examine the effectiveness of high document frequency filtering for stopword reduction. Finally, we propose a new method for building general and domain-specific stopword lists. The method assumes that a set of candidate stopwords must have minimum information content and prediction capacity, both of which can be estimated from classifier performance. The proposed approach is compared extensively with other methods, including inverse document frequency and information gain. According to the comparative study, the proposed approach offers more promising results, ensuring minimum information loss while filtering out most stopwords.
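
For illustration, the document-frequency baselines mentioned above can be sketched in a few lines of Python. This is not the proposed backward filter-level method. It only shows high document frequency filtering and inverse document frequency as they are commonly defined. The function names, the toy corpus, and the 0.8 threshold are assumptions made for this example.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        # set() ensures a term is counted once per document
        df.update(set(tokens))
    return df


def high_df_candidates(docs, max_df_ratio=0.8):
    """Return terms occurring in more than max_df_ratio of the documents.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    n_docs = len(docs)
    df = document_frequencies(docs)
    return sorted(t for t, count in df.items() if count / n_docs > max_df_ratio)


def idf(docs):
    """Inverse document frequency, log(N / df), for every term."""
    n_docs = len(docs)
    return {t: math.log(n_docs / c) for t, c in document_frequencies(docs).items()}


# Toy tokenized corpus (hypothetical data, two rough "domains")
docs = [
    "the patient was given the drug".split(),
    "the drug reduced the fever".split(),
    "the court ruled on the appeal".split(),
    "the appeal was denied by the court".split(),
]

candidates = high_df_candidates(docs)
scores = idf(docs)
```

On this toy corpus only "the" appears in every document, so it is the sole candidate at the 0.8 threshold and its IDF is zero. Terms such as "drug" or "court" occur in half the documents and would be flagged only if the threshold were lowered, which is the kind of information loss that a classifier-based check is meant to detect.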
