Effective Pattern Discovery for Text Mining

Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to use and update discovered patterns effectively is still an open research issue, especially in the domain of text mining. Because most existing text mining methods adopt term-based approaches, they suffer from the problems of polysemy and synonymy. Over the years, it has often been hypothesized that pattern-based (or phrase-based) approaches should perform better than term-based ones, but many experiments do not support this hypothesis. This paper presents an innovative and effective pattern discovery technique that includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on the RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance.
