Using Boolean Rule Extraction for Taxonomic Text Categorization for Big Data

Categorization hierarchies are ubiquitous in big data. Examples include MEDLINE’s Medical Subject Headings (MeSH) taxonomy, United Nations Standard Products and Services Code (UNSPSC) product codes, and the Medical Dictionary for Regulatory Activities (MedDRA) hierarchy for adverse reaction coding. A key issue is that in most taxonomies, the probability that any particular example belongs to a given category is very small at the lower levels of the hierarchy. A standard categorization model that is applied without taking this fact into consideration is likely to perform poorly. This paper introduces a novel technique for text categorization, called Boolean rule extraction, which enables you to address this situation effectively. In addition, models that are generated by this rule-based technique can be easily interpreted and modified by a human expert, enabling better human-machine interaction. The Text Rule Builder node and the newly developed HPBOOLRULE procedure in SAS® Text Miner implement this technique. The paper demonstrates how to use the HPBOOLRULE procedure to obtain effective predictive models at various hierarchy levels in a taxonomy.