Pattern document weight discovery for text classification mining

Discovering relevant features in text documents that accurately describe user preferences remains difficult because of the large scale of terms and data patterns involved. Most existing text mining and classification methods adopt term-based approaches, which suffer from the problems of polysemy and synonymy. Over the years it has been repeatedly hypothesized that pattern-based methods should outperform term-based ones, yet how to use large-scale patterns effectively remains a hard problem in text mining. In this paper, robustness is used to characterize a model whose training sets are distorted or whose application environment is altered: a model is robust if it still provides satisfactory performance even when its training sets are changed. To make a breakthrough on this challenging issue, this paper presents a model for weighted feature discovery. It discovers both positive and negative patterns in text documents as higher-level features and deploys them over low-level term features. Terms are also classified into categories, and term weights are updated according to their specificity and their distributions in the discovered patterns. Substantial experiments with this model on RCV1, TREC topics, and Reuters-21578 demonstrate that the proposed model significantly outperforms both state-of-the-art term-based methods and pattern-based methods.
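The abstract describes the weighting scheme only at a high level. Below is a minimal sketch of the general idea, not the authors' exact algorithm: frequent termsets are mined from positive and negative training documents, and each term's weight is raised by its support in positive patterns and discounted by its presence in negative patterns. The support thresholds, the even split of pattern support across terms, and the 0.5 penalty factor are illustrative assumptions.

```python
# Hedged sketch of pattern-based term weighting (assumed formulas, not the paper's).
from collections import Counter
from itertools import combinations


def mine_patterns(docs, min_support=2, max_len=2):
    """Return frequent termsets (up to max_len terms) with their support counts."""
    counts = Counter()
    for doc in docs:
        terms = set(doc.lower().split())
        for k in range(1, max_len + 1):
            for pattern in combinations(sorted(terms), k):
                counts[pattern] += 1
    return {p: c for p, c in counts.items() if c >= min_support}


def term_weights(pos_docs, neg_docs, min_support=2):
    """Deploy pattern-level support onto individual terms (low-level features)."""
    pos_patterns = mine_patterns(pos_docs, min_support)
    neg_patterns = mine_patterns(neg_docs, min_support)
    weights = Counter()
    for pattern, support in pos_patterns.items():
        for term in pattern:
            # Spread each positive pattern's support evenly over its terms.
            weights[term] += support / len(pattern)
    for pattern, support in neg_patterns.items():
        for term in pattern:
            # Penalize terms that also occur in frequent negative patterns.
            weights[term] -= 0.5 * support / len(pattern)
    return weights


if __name__ == "__main__":
    positive = ["mining text patterns", "pattern mining for text", "text mining"]
    negative = ["image mining", "mining images for patterns"]
    for term, w in term_weights(positive, negative).most_common(5):
        print(f"{term}: {w:.2f}")
```

In this toy run, terms that recur in positive patterns (e.g. "text") end up with higher weights than terms shared with negative documents (e.g. "mining"), which is the intuition behind deploying high-level patterns onto low-level term features.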
