A Similarity Function for Feature Pattern Clustering and High Dimensional Text Document Classification

Text document classification and clustering are important learning tasks that belong to both data mining and machine learning. These tasks pose several challenges when the text documents to be processed are high dimensional. The distribution of words in text documents plays a key role in the learning process. Research on high dimensional text document classification and clustering is usually limited to applying traditional distance functions, and most contributions in the existing literature do not consider the word distribution in documents. In this research, we propose a novel similarity function for feature pattern clustering and high dimensional text classification. The proposed similarity function is used to carry out dimensionality reduction based on supervised learning. An important feature of this work is that the word distribution is the same before and after dimensionality reduction. Experimental results show that the proposed approach achieves dimensionality reduction, retains the word distribution, and obtains better classification accuracy than other measures.
