Paper Information - Mining Association Rules from Unstructured Documents

Mining Association Rules from Unstructured Documents

[...] association rules from collections of unstructured documents called EART (Extract Association Rules from Text). The EART system handles text only, not images or figures. EART discovers association rules among the keywords labeling a collection of textual documents. Its main characteristic is that it integrates XML technology (to transform unstructured documents into structured documents) with an Information Retrieval scheme (TF-IDF) and a Data Mining technique for association rule extraction. EART relies on word features to extract association rules. It consists of four phases: a structure phase, an index phase, a text mining phase, and a visualization phase. Our work depends on the analysis of the keywords in the extracted association rules through the co-occurrence of the keywords in one sentence in the original text and the existence of the keywords in one sentence without co-occurrence. Experiments were applied on a collection of scientific documents selected from MEDLINE that are related to the outbreak of the H5N1 avian influenza virus.

I. INTRODUCTION

The information age is characterized by a rapid growth of information available in electronic media such as databases, data warehouses, intranet documents, business emails, and the WWW. This growth has created a demanding task called Knowledge Discovery in Databases (KDD) and in Texts (KDT). Therefore, researchers and companies in recent years [7, 13] have focused on this task, and significant progress has been made. Text Mining (TM) and Knowledge Discovery in Text (KDT) are new research areas that try to solve the problem of information overload by using techniques from [...] The main goal of text mining is to enable users to extract information from large textual resources. The final output of the mining process varies and can only be defined with respect to a specific application. Most Text Mining objectives fall under the following categories of operations: Feature [...]
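The combination of TF-IDF keyword weighting and association rule extraction described in the abstract can be illustrated with a minimal Python sketch. This is not the EART implementation: the function names `tfidf_keywords` and `pair_rules`, the `top_k` cutoff, the threshold values, and the restriction to one-to-one rules are all assumptions made for illustration, and the XML structuring and visualization phases are omitted. Support and confidence are the standard association rule measures.

```python
import math
from itertools import combinations


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms of each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    keyword_sets = []
    for doc in docs:
        scores = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            idf = math.log(n_docs / df[term])
            scores[term] = tf * idf
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keyword_sets.append(set(ranked[:top_k]))
    return keyword_sets


def pair_rules(keyword_sets, min_support=0.5, min_confidence=0.7):
    """Mine one-to-one rules lhs -> rhs from keyword sets (hypothetical helper)."""
    n = len(keyword_sets)

    def support(items):
        # Fraction of documents whose keyword set contains all the items.
        return sum(1 for ks in keyword_sets if items <= ks) / n

    vocabulary = sorted(set().union(*keyword_sets))
    rules = []
    for a, b in combinations(vocabulary, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


if __name__ == "__main__":
    # Toy tokenized documents; real input would be the indexed collection.
    docs = [
        ["h5n1", "outbreak", "poultry", "poultry"],
        ["h5n1", "poultry", "vaccine"],
        ["vaccine", "trial", "trial"],
    ]
    for rule in pair_rules(tfidf_keywords(docs, top_k=2), min_support=0.3):
        print(rule)
```

Each rule is reported as a tuple of antecedent, consequent, support, and confidence. Restricting the search to keyword pairs keeps the sketch short; a full miner would enumerate larger frequent itemsets, for example with the Apriori algorithm.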

Hany Mahgoub

[1] Dietmar F. Rösner, et al. The XDOC Document Suite - a Workbench for Document Mining, 2003, Text Mining.

[2] Ramakrishnan Srikant, et al. Discovering Trends in Text Databases, 1997, KDD.

[3] Jan Paralic, et al. Text Mining for Documents Annotation and Ontology Support, 2003.

[4] Rakesh Agrawal, et al. Fast Algorithms for Mining Association Rules, 1994, VLDB.

[5] Ido Dagan, et al. Knowledge Discovery in Textual Databases (KDT), 1995, KDD.

[6] Martin Rajman, et al. Text Mining: Natural Language techniques and Text Mining applications, 1998.

[7] Heikki Mannila, et al. Discovery of Frequent Episodes in Event Sequences, 1997, Data Mining and Knowledge Discovery.

[8] George Buchanan, et al. Scalable browsing for large collections: a case study, 2000, DL '00.

[9] Mika Klemettinen, et al. Mining in the Phrasal Frontier, 1997, PKDD.

[10] Mika Klemettinen, et al. Applying data mining techniques for descriptive phrase extraction in digital document collections, 1998, Proceedings IEEE International Forum on Research and Technology Advances in Digital Libraries (ADL'98).

[11] Haym Hirsh, et al. Mining Associations in Text in the Presence of Background Knowledge, 1996, KDD.

[12] Rajeev Motwani, et al. Beyond Market Baskets: Generalizing Association Rules to Dependence Rules, 1998, Data Mining and Knowledge Discovery.

[13] Hinrich Schütze, et al. Book Reviews: Foundations of Statistical Natural Language Processing, 1999, CL.