Paper Information - Knowledge Discovery in Textual Databases (KDT)

Knowledge Discovery in Textual Databases (KDT)

The information age is characterized by a rapid growth in the amount of information available in electronic media. Traditional data handling methods are not adequate to cope with this information flood. Knowledge Discovery in Databases (KDD) is a new paradigm that focuses on computerized exploration of large amounts of data and on discovery of relevant and interesting patterns within them. While most work on KDD is concerned with structured databases, it is clear that this paradigm is required for handling the huge amount of information that is available only in unstructured textual form. To apply traditional KDD to texts, it is necessary to impose some structure on the data that is rich enough to allow for interesting KDD operations. On the other hand, we have to consider the severe limitations of current text processing technology and define rather simple structures that can be extracted from texts fairly automatically and at a reasonable cost. We propose using a text categorization paradigm to annotate text articles with meaningful concepts that are organized in a hierarchical structure. We suggest that this relatively simple annotation is rich enough to provide the basis for a KDD framework, enabling data summarization, exploration of interesting patterns, and trend analysis. This research combines the KDD and text categorization paradigms and suggests advances to the state of the art in both areas.

Ido Dagan | Ronen Feldman

[1] Willi Klösgen, et al. Problems for knowledge discovery in databases and their treatment in the statistics interpreter Explora, 1992, Int. J. Intell. Syst.

[2] David D. Lewis, et al. An evaluation of phrasal and clustered representations on a text categorization task, 1992, SIGIR '92.

[3] Gregory Piatetsky-Shapiro, et al. Knowledge Discovery in Databases: An Overview, 1992, AI Mag.

[4] Paul S. Jacobs, et al. Joining Statistics with NLP for Text Categorization, 1992, ANLP.

[5] David R. Karger, et al. Constant interaction-time scatter/gather browsing of very large document collections, 1993, SIGIR.

[6] Deborah L. McGuinness, et al. Integrated Support for Data Archeology, 1993, Int. J. Cooperative Inf. Syst.

[7] Wesley W. Chu, et al. Pattern-based clustering for database attribute values, 1993.

[8] Sholom M. Weiss, et al. Towards language independent automated learning of text categorization models, 1994, SIGIR '94.

[9] G. Hébrail, et al. Experiments of Textual Data Analysis at Electricité de France, 1994.