Multi-Step Iterative Algorithm for Feature Selection on Dynamic Documents

The authors propose a clustering-based, multi-step iterative algorithm for feature selection. Its key step groups terms by synonymy, using a semantic relativity measure between terms. Term frequency is computed for each synonym group rather than for each individual term: every term appearing in the document contributes according to its relativity to the group's parent term. This raises the importance of terms that individually appear infrequently but together show a strong presence. The authors ran experiments on real and artificial datasets, including NEWS 20, Reuters, emails, and research papers on different topics. The resulting entropy shows that the algorithm improves results on well-articulated documents such as research papers. The improvement is marginal on documents in which the message is emphasized by repetition of terms, specifically rapidly generated documents such as emails. The authors also observed that newly arrived documents are mapped appropriately based on their proximity to the semantic groups.
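The synonym-group term frequency can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact formula, so the sketch assumes a group's frequency is the sum of each member term's count weighted by its relativity to the parent term. The groups, the relativity values, and the name `group_term_frequency` are all hypothetical.

```python
from collections import Counter

# Hypothetical synonym groups: parent term -> {member term: relativity to parent}.
# The relativity values are invented for illustration; the parent itself gets 1.0.
SYNONYM_GROUPS = {
    "car": {"car": 1.0, "automobile": 0.9, "vehicle": 0.6},
    "fast": {"fast": 1.0, "quick": 0.8, "rapid": 0.8},
}


def group_term_frequency(tokens, groups):
    """Return a relativity-weighted frequency per synonym group, keyed by parent term.

    Assumption: each occurrence of a member term adds that term's relativity
    to the score of its group.
    """
    counts = Counter(tokens)
    scores = {}
    for parent, members in groups.items():
        # Counter returns 0 for terms that do not occur in the document.
        scores[parent] = sum(counts[term] * rel for term, rel in members.items())
    return scores


doc = "the automobile is quick and the vehicle is rapid and the car is fast".split()
scores = group_term_frequency(doc, SYNONYM_GROUPS)
for parent, score in scores.items():
    print(f"{parent}: {score:.2f}")
```

In this toy document every member term occurs exactly once, so no single term stands out by raw count, yet each group accumulates a score well above 1. This is the effect the abstract describes: terms that are individually infrequent gain importance when counted together.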
