Text data mining: a proposed framework and future perspectives

Advances in technology and the emergence of many kinds of applications have made the amount of available data enormous, and its rapid proliferation is evident. There is therefore an essential need for techniques and methods that interact with such data and extract useful information and patterns from it. Text data mining (TDM) is the process of extracting desired information from mountains of inherently unstructured textual data without having to read it all. In this paper, we shed light on the state of the art in text mining, an interdisciplinary field that draws on several related areas. To facilitate the understanding of text data mining, the paper proposes a framework that visualises the field in a step-wise manner, taking into consideration the semantics of the extracted text. In addition, the paper surveys a number of useful applications and proposes a new approach to spam detection based on the proposed TDM framework.
