FPST: a new term weighting algorithm for long running and short lived events

Term weighting extracts important features from textual documents and so provides a basis for different text mining approaches. Several term weighting algorithms based on term frequency and other statistical measures have been proposed in the past, but they extract hot terms from Internet-based digitised news documents inaccurately. To overcome this problem, this paper presents a term weighting algorithm that considers position, scattering and topicality along with frequency. Frequency is the number of occurrences of a term; position captures where in the document the term appears; scattering captures how the term is distributed across the entire document. Topicality is calculated for both short lived events and long running events. Experimental evaluation shows that the proposed algorithm outperforms the existing term weighting algorithms.
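To make the three document-level factors concrete, the following is a minimal Python sketch and not the paper's formula, which the abstract does not give. It assumes a document is a list of tokenised paragraphs, that a term in the first paragraph gets a position factor of 1.0 and any other term 0.5, that scattering is the fraction of paragraphs containing the term, and that the factors are simply multiplied. The function name `toy_term_scores` and all constants are hypothetical. Topicality is left out because it needs time-sliced data that the abstract does not describe.

```python
from collections import Counter


def toy_term_scores(paragraphs):
    """Score terms in one document.

    paragraphs: list of token lists; the first list is treated as the lead.
    All factor definitions here are illustrative assumptions.
    """
    total_tokens = sum(len(p) for p in paragraphs)
    counts = Counter(tok for p in paragraphs for tok in p)
    scores = {}
    for term, n in counts.items():
        # Frequency: share of all tokens that are this term.
        frequency = n / total_tokens
        # Position (assumed): terms in the lead paragraph count fully.
        position = 1.0 if term in paragraphs[0] else 0.5
        # Scattering (assumed): fraction of paragraphs containing the term.
        scattering = sum(term in p for p in paragraphs) / len(paragraphs)
        # Combination by product is also an assumption.
        scores[term] = frequency * position * scattering
    return scores


doc = [
    ["flood", "hits", "city"],
    ["rescue", "teams", "reach", "flood", "zone"],
    ["city", "officials", "respond"],
]
scores = toy_term_scores(doc)
```

In this toy document, "flood" appears in the lead and in two of the three paragraphs, so it scores higher than "rescue", which appears once and only in the second paragraph.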
