Better rules, fewer features: a semantic approach to selecting features from text

The choice of features used to represent a domain has a profound effect on the quality of the model produced, yet few researchers have investigated the relationship between the features used to represent text and the quality of the final model. We explored this relationship for medical texts by comparing association rules based on features at three different semantic levels: (1) words, (2) manually assigned keywords, and (3) automatically selected medical concepts. Our preliminary findings indicate that bi-directional association rules based on concepts or keywords are more plausible and more useful than those based on word features. The concept and keyword representations also required 90% fewer features than the word representation. This drastic dimensionality reduction suggests that the approach is well suited to large corpora of medical text, such as parts of the Web.
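As an illustration of the kind of rule being compared, the following minimal sketch mines bi-directional association rules from documents represented as sets of features. It is not the authors' implementation. The function name `bidirectional_rules`, the thresholds, and the toy concept sets are all hypothetical, and the sketch assumes that a rule between A and B counts as bi-directional when both A → B and B → A meet the confidence threshold.

```python
from collections import Counter
from itertools import combinations


def bidirectional_rules(docs, min_support=0.4, min_confidence=0.6):
    """Return feature pairs whose rules hold in both directions.

    docs: iterable of feature collections, one per document (assumed input format).
    A pair (a, b) is kept when its support and both confidences
    conf(a -> b) and conf(b -> a) reach the given thresholds.
    """
    docs = [sorted(set(d)) for d in docs]
    n = len(docs)
    item_counts = Counter()
    pair_counts = Counter()
    for features in docs:
        item_counts.update(features)
        # Sorted features give each unordered pair one canonical key.
        pair_counts.update(combinations(features, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        conf_ab = count / item_counts[a]  # confidence of a -> b
        conf_ba = count / item_counts[b]  # confidence of b -> a
        if conf_ab >= min_confidence and conf_ba >= min_confidence:
            rules.append((a, b, support, conf_ab, conf_ba))
    return sorted(rules)


# Toy concept-level representation (invented data, for illustration only).
docs = [
    {"smoking", "impotence"},
    {"smoking", "impotence", "hypertension"},
    {"smoking", "hypertension"},
    {"diabetes", "hypertension"},
    {"smoking", "impotence"},
]

rules = bidirectional_rules(docs)
for a, b, support, conf_ab, conf_ba in rules:
    print(f"{a} <-> {b}: support={support:.2f}, "
          f"conf({a}->{b})={conf_ab:.2f}, conf({b}->{a})={conf_ba:.2f}")
```

The same function can be run on word-level, keyword-level, or concept-level feature sets. With a word representation the number of distinct features, and therefore of candidate pairs, grows much faster, which is where a smaller concept or keyword vocabulary would be expected to help.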
