Paper Information - Knowledge Management: A Text Mining Approach

Knowledge Management: A Text Mining Approach

Knowledge Discovery in Databases (KDD), also known as data mining, focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. Given a collection of text documents, most approaches to text mining perform knowledge-discovery operations on labels associated with each document. At one extreme, these labels are keywords that represent the results of non-trivial keyword-labeling processes, and, at the other extreme, these labels are nothing more than a list of the words within the documents of interest. This paper presents an intermediate approach, one that we call text mining at the term level, in which knowledge discovery takes place on a more focused collection of words and phrases that are extracted from and label each document. These terms, plus additional higher-level entities, are then organized in a hierarchical taxonomy and are used in the knowledge-discovery process. This paper describes Document Explorer, our tool that implements text mining at the term level. It consists of a document retrieval module, which converts retrieved documents from their native formats into documents represented using the SGML mark-up language used by Document Explorer; a two-stage term-extraction approach, in which terms are first proposed in a term-generation stage, and from which a smaller set is then selected in a term-filtering stage in light of their frequencies of occurrence elsewhere in the collection; our taxonomy-creation tool, by which the user can help specify higher-level entities that inform the knowledge-discovery process; and our knowledge-discovery tools for the resulting term-labeled documents. Finally, we evaluate our approach on a collection of patent records as well as Reuters newswire stories. Our results confirm that text mining serves as a powerful technique to manage knowledge encapsulated in large document collections.
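The two-stage term extraction described in the abstract can be illustrated with a minimal sketch. It is not Document Explorer's implementation: the abstract does not specify how candidates are generated or scored, so the function names (`generate_terms`, `filter_terms`), the small stopword list, the restriction to one- and two-word candidates, and the TF-IDF-style score are all assumptions chosen for illustration. The only idea taken from the text is that candidates are proposed first and then filtered in light of how often they occur elsewhere in the collection.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "are", "by"}


def generate_terms(text, max_len=2):
    """Stage 1 (term generation): propose candidate words and short phrases."""
    tokens = re.findall(r"[a-z]+", text.lower())
    candidates = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Assumption: drop candidates that start or end with a stopword.
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            candidates.append(" ".join(gram))
    return candidates


def filter_terms(docs, top_k=3):
    """Stage 2 (term filtering): keep terms that are frequent in a document
    but rare in the rest of the collection (a TF-IDF-style score, assumed)."""
    per_doc = [Counter(generate_terms(d)) for d in docs]

    # Number of documents in which each candidate term appears.
    doc_freq = Counter()
    for counts in per_doc:
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    labels = []
    for counts in per_doc:
        scored = {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scored, key=lambda t: (-scored[t], t))
        labels.append([t for t in ranked[:top_k] if scored[t] > 0])
    return labels
```

Calling `filter_terms` on a list of document strings returns one short list of term labels per document. In this sketch, a term that appears in every document scores zero and is never selected, which mirrors the idea of discounting terms that are common across the collection.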