Accessing information efficiently, integrating it from various sources, and turning it into knowledge are currently major challenges in life science research. A large fraction of this information is available only from scientific articles stored as free text in huge document databases, or from the Web, where it is available in semi-structured format.
Text mining provides methods, such as classification and clustering, that can automatically extract relevant knowledge patterns from free-text data. Including Grid text-mining services in a Grid-based knowledge discovery system can significantly support the problem-solving processes built on such a system.
The motivation for the research presented in this paper is to use the computational, storage, and data access capabilities of the Grid for text mining tasks, and for text classification in particular. Text classification methods are time-consuming, so they can benefit significantly from the Grid infrastructure. Implementing text mining techniques in a distributed environment allows us to access different geographically distributed data collections and to perform text mining tasks in a parallel or distributed fashion.