Feature Extraction and Duplicate Detection for Text Mining: A Survey

Text mining, also known as Intelligent Text Analysis, is an important research area. The high dimensionality of text data makes it difficult to focus on the most relevant information. Feature extraction is an important data reduction technique for discovering the most important features. Processing massive amounts of data stored in an unstructured form is a challenging task. Several pre-processing methods and algorithms are needed to extract useful features from huge amounts of data. The survey covers text summarization, classification, and clustering methods for discovering useful features, as well as the discovery of query facets, which are multiple groups of words or phrases that explain and summarize the content covered by a query, thereby reducing time taken by
