Mining Specific Features for Acquiring User Information Needs

Term-based approaches can extract many features from text documents, but most of these features include noise. Many popular text-mining strategies have been adapted to reduce the noise in extracted features; however, the patterns these techniques discover suffer from the low-frequency problem. The key issue is how to discover relevant features in text documents that fulfil user information needs. To address this issue, we propose a new method for extracting specific features from user relevance feedback. The proposed approach has two stages. The first stage extracts topics (or patterns) from text documents in order to focus on interesting topics. In the second stage, the topics are deployed to lower-level terms to address the low-frequency problem and to find specific terms. The specific terms are determined by their appearances in relevance feedback and their distribution in topics or high-level patterns. We test the proposed method with extensive experiments on the Reuters Corpus Volume 1 dataset and TREC topics. The results show that the proposed approach significantly outperforms state-of-the-art models.
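The two stages described above can be illustrated with a minimal Python sketch. It is not the method evaluated in the paper: the abstract gives no formulas, so the pattern definition (term sets that co-occur in at least `min_support` documents), the even split of a pattern's support across its terms, and the positive-minus-negative specificity score are all assumptions made for illustration, as are the function names and the toy feedback data.

```python
from collections import Counter
from itertools import combinations


def mine_patterns(docs, min_support=2, max_size=3):
    """Stage 1 (assumed form): keep term sets found in >= min_support documents."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc))
        for size in range(1, max_size + 1):
            for combo in combinations(terms, size):
                counts[combo] += 1
    return {p: c for p, c in counts.items() if c >= min_support}


def deploy_to_terms(patterns):
    """Stage 2 (assumed form): spread each pattern's support evenly over its terms."""
    weights = Counter()
    for pattern, support in patterns.items():
        for term in pattern:
            weights[term] += support / len(pattern)
    return dict(weights)


def specific_terms(weights, positive_docs, negative_docs):
    """Hypothetical specificity score: favour terms seen in more positive
    than negative feedback documents, scaled by the deployed weight."""
    result = {}
    for term, weight in weights.items():
        pos = sum(term in doc for doc in positive_docs)
        neg = sum(term in doc for doc in negative_docs)
        if pos > neg:
            result[term] = weight * (pos - neg) / len(positive_docs)
    return result


# Toy relevance feedback (invented data, for illustration only).
positive = [
    ["mining", "pattern", "text"],
    ["mining", "pattern", "feature"],
    ["mining", "text"],
]
negative = [
    ["text", "sports"],
    ["feature", "sports"],
]

patterns = mine_patterns(positive)
weights = deploy_to_terms(patterns)
specific = specific_terms(weights, positive, negative)
for term, score in sorted(specific.items(), key=lambda kv: -kv[1]):
    print(term, score)
```

In this sketch a term that occurs in many frequent patterns accumulates weight even when each long pattern is rare, which is one way to read the claim that deploying topics to lower-level terms addresses the low-frequency problem. Terms that also occur in non-relevant feedback are discounted.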
