Multidimensional text classification for drug information

This paper proposes a multidimensional model for classifying drug information text documents. The concept of a multidimensional category model is introduced for representing classes. In contrast with traditional flat and hierarchical category models, the multidimensional category model classifies each document using multiple predefined sets of categories, where each set corresponds to a dimension. Since a multidimensional model can be converted to flat and hierarchical models, three classification approaches are possible: classifying directly with the multidimensional model, or classifying with the equivalent flat or hierarchical model. The efficiency of these three approaches is investigated using a drug information collection with two dimensions: 1) drug topics and 2) primary therapeutic classes. In the experiments, k-nearest neighbor, naïve Bayes, and two centroid-based methods are selected as classifiers. The three approaches are compared using two-way analysis of variance, followed by Scheffé's test for post hoc comparison. The experimental results show that multidimensional-based classification performs better than the other two approaches, especially when the training set is relatively small. As one application, a category-based search engine using the multidimensional category concept was developed to help users retrieve drug information.
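To make the relationship among the three category models concrete, the following minimal sketch shows one way a two-dimensional category model could be converted to flat and hierarchical equivalents. It is an illustration under stated assumptions, not the paper's implementation: the category names, the function names (`to_flat`, `to_hierarchy`, `flat_label`), the use of a Cartesian product for the flat model, and the choice of nesting the second dimension under the first are all hypothetical.

```python
from itertools import product

# Hypothetical category sets for the two dimensions named in the abstract.
# The actual categories used in the paper are not given here.
DIMENSIONS = {
    "topic": ["adverse_reaction", "dosage", "indication"],
    "therapeutic_class": ["analgesic", "antibiotic"],
}


def to_flat(dimensions):
    """Assumed flat model: one class per combination of categories."""
    names = list(dimensions)
    combos = product(*(dimensions[n] for n in names))
    return ["|".join(combo) for combo in combos]


def to_hierarchy(dimensions):
    """Assumed hierarchical model: first dimension on top, second below it."""
    first, second = list(dimensions)
    return {top: list(dimensions[second]) for top in dimensions[first]}


def flat_label(assignment, dimensions):
    """Map a per-dimension assignment to its equivalent flat class name."""
    return "|".join(assignment[n] for n in dimensions)


flat = to_flat(DIMENSIONS)
hier = to_hierarchy(DIMENSIONS)

# A document labelled once per dimension in the multidimensional model.
doc = {"topic": "dosage", "therapeutic_class": "antibiotic"}
print(len(flat), flat_label(doc, DIMENSIONS), hier["dosage"])
```

Under these assumptions, the flat model has as many classes as the product of the dimension sizes, so each flat class sees fewer training documents than each per-dimension category does. This is consistent with the reported advantage of the multidimensional approach on small training sets, although the sketch itself does not demonstrate that result.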
