A software infrastructure for research in textual data mining

Few tools address the challenges facing researchers in the textual data mining (TDM) field. Some are too specific to their application, and others are prototypes unsuitable for general use. More general tools are often incapable of processing large volumes of data. We have created a textual data mining infrastructure (TMI) that incorporates both existing and new capabilities in a reusable framework conducive to developing new tools and components. TMI adheres to strict guidelines that allow it to run in a wide range of processing environments; as a result, it accommodates the volume of computing and diversity of research occurring in TDM. A unique capability of TMI is support for optimization, which facilitates text mining research by automating the search for optimal parameters in text mining algorithms. In this article we describe a number of applications that use TMI and present several novel results that have not been published elsewhere. We also discuss how TMI utilizes existing machine-learning libraries, thereby enabling researchers to continue and extend their endeavors with minimal effort. To that end, TMI is available on the web at hddi.cse.lehigh.edu.
