Integrating data and text mining processes for digital library applications

This paper explores the integration of text mining and data mining techniques, digital library systems, and computational and data grid technologies, with the objective of developing an exemplar online classification service. We discuss current research issues in applying data mining algorithms and toolkits to textual data; the changes required within the Cheshire3 Information Framework to accommodate analysis workflows; the outcomes of a demonstrator based on the National Library of Medicine's Medline dataset; and the provision of comparable metrics for evaluation purposes. The prototype has produced highly accurate online classification services and offers a novel method of supporting text mining and data mining within a highly scaled computational environment, integrated seamlessly into the digital library architecture.
