This paper presents a knowledge discovery in text (KDT) system for the Request for Comments (RFC) document series. We propose a versatile system architecture for text mining in RFCs that maintains both the structured and the unstructured components of each document. Documents are represented by keywords, and knowledge discovery is performed by analysing the co-occurrence frequencies of the keywords that represent each document. Documents are clustered using the extracted knowledge, which can reduce the search space. The documents retrieved for a query are ranked by their relevance to the query topic. We also describe RFC Viewer, our tool for viewing an RFC document in rich text format rather than plain text; it also presents the knowledge extracted from the document and supports various KDD operations on it.
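As a rough illustration of the keyword co-occurrence analysis mentioned above, the following minimal sketch counts how many documents each pair of keywords shares. It is not the system's implementation: the function name `cooccurrence_counts`, the document labels, and the keyword sets are all hypothetical, and document-level co-occurrence is assumed as the unit of counting.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(doc_keywords):
    """Count, for each unordered keyword pair, the documents containing both."""
    counts = Counter()
    for keywords in doc_keywords.values():
        # De-duplicate and sort so every pair has a single canonical order.
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return counts


# Hypothetical keyword sets; labels and keywords are illustrative only.
docs = {
    "rfc-a": ["tcp", "congestion", "routing"],
    "rfc-b": ["tcp", "congestion"],
    "rfc-c": ["routing", "bgp"],
}
counts = cooccurrence_counts(docs)
```

Pairs with high counts suggest related topics, and such counts could feed a clustering or ranking step; how the actual system weights them is not specified here.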