Knowledge Management: A Text Mining Approach

Knowledge Discovery in Databases (KDD), also known as data mining, focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. Given a collection of text documents, most approaches to text mining perform knowledge-discovery operations on labels associated with each document. At one extreme, these labels are keywords that represent the results of non-trivial keyword-labeling processes, and, at the other extreme, these labels are nothing more than a list of the words within the documents of interest. This paper presents an intermediate approach, one that we call text mining at the term level, in which knowledge discovery takes place on a more focused collection of words and phrases that are extracted from and label each document. These terms, plus additional higher-level entities, are then organized in a hierarchical taxonomy and are used in the knowledge-discovery process. This paper describes Document Explorer, our tool that implements text mining at the term level. It consists of a document retrieval module, which converts retrieved documents from their native formats into the SGML mark-up representation used by Document Explorer; a two-stage term-extraction approach, in which terms are first proposed in a term-generation stage and a smaller set is then selected in a term-filtering stage in light of their frequencies of occurrence elsewhere in the collection; our taxonomy-creation tool, by which the user can help specify higher-level entities that inform the knowledge-discovery process; and our knowledge-discovery tools for the resulting term-labeled documents. Finally, we evaluate our approach on a collection of patent records as well as Reuters newswire stories. Our results confirm that text mining serves as a powerful technique to manage knowledge encapsulated in large document collections.
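
As a rough illustration of the two-stage term extraction described above, the following minimal Python sketch proposes candidate terms and then filters them by how often they occur elsewhere in the collection. It is not the Document Explorer implementation: the function names `generate_terms` and `filter_terms` are hypothetical, candidate generation is reduced to plain word n-grams, and the filter uses a generic TF-IDF-style score as a stand-in for whatever frequency criterion the tool actually applies.

```python
import math
import re
from collections import Counter


def generate_terms(text, max_len=2):
    """Stage 1 (term generation): propose every word n-gram up to max_len.

    Assumption: plain n-grams stand in for a real candidate generator.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    ]


def filter_terms(docs, top_k=3):
    """Stage 2 (term filtering): keep at most top_k candidates per document.

    Assumption: the score is count-in-document times log inverse document
    frequency, so terms that occur in every document score zero and are dropped.
    """
    candidates = [Counter(generate_terms(d)) for d in docs]
    # Number of documents in which each candidate term appears.
    doc_freq = Counter(term for counts in candidates for term in counts)
    n_docs = len(docs)
    labels = []
    for counts in candidates:
        scored = {
            term: tf * math.log(n_docs / doc_freq[term])
            for term, tf in counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scored, key=lambda term: (-scored[term], term))
        labels.append([term for term in ranked[:top_k] if scored[term] > 0])
    return labels


if __name__ == "__main__":
    docs = [
        "the merger of the oil company",
        "the patent for the oil pump",
        "the merger talks collapsed",
    ]
    for doc, terms in zip(docs, filter_terms(docs)):
        print(doc, "->", terms)
```

Because "the" appears in every toy document, its score is zero and it never becomes a label. A realistic filter would also need linguistic constraints (for example, part-of-speech patterns) to avoid function words that happen to be rare in a small collection.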
