Text Data Mining: Theory and Methods

This paper gives a brief introduction to the theory and methods of text data mining. It introduces some of the methodologies currently employed in the discipline and makes the reader aware of some of the interesting challenges that remain unsolved. Finally, the paper serves as a rudimentary tutorial on some of the techniques and provides a list of references for additional study.
