An analysis of the use of unstructured non-expert text in the illicit drug domain

The Pillreports.com database was mined to determine whether its free-text fields could help differentiate regular pills from adulterated ones, i.e. pills that contain ingredients not comparable to MDMA. The data was downloaded and extracted using RapidMiner and XPath queries. Naive Bayes and SVM binary classification models were created. The pre-processing techniques of tokenisation, n-gram creation, stop-word removal and stemming, together with feature selection by weights, were applied to the data, resulting in a 15-point improvement in model performance. In addition, we report on a comprehensive cluster analysis. Frequent terms and differences between clusters were visualised using word clouds, and clusters were compared with the values contained in nominal fields. Model results and their interpretation are provided at various pre-processing stages. Key phrase extraction is identified as an area of possible future work.
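
The pre-processing and classification steps named above can be illustrated with a minimal sketch. This is not the RapidMiner process used in the study: it is a standard-library Python approximation, and the stop-word list, the crude suffix stripper standing in for a real stemmer, the toy reports, and all function and class names are assumptions made for illustration only. The SVM model and feature selection by weights are omitted for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"the", "a", "an", "and", "of", "it", "was", "is", "i", "to", "in", "no"}


def crude_stem(token):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, n=2):
    """Tokenise, drop stop words, stem, then append n-grams up to length n."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    features = list(stems)
    for size in range(2, n + 1):
        features += ["_".join(stems[i:i + size]) for i in range(len(stems) - size + 1)]
    return features


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = {label: Counter() for label in self.class_counts}
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(preprocess(doc))
        self.vocab = set()
        for counts in self.term_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for feat in preprocess(doc):
                if feat in self.vocab:  # features unseen in training are ignored
                    score += math.log((counts[feat] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Invented toy reports, purely for illustration (not Pillreports.com data).
train_docs = [
    "clean rush, euphoric and loved up all night",
    "euphoric feeling, smooth come up, very happy",
    "felt sick and jittery, no euphoria, bad headache",
    "jittery speedy buzz, could not sleep, headache",
]
train_labels = ["regular", "regular", "adulterated", "adulterated"]

model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("speedy and jittery with a headache")
```

In this sketch each report is reduced to stemmed unigrams and bigrams before counting, so the classifier sees the same kind of features that tokenisation, stop-word removal, stemming and n-gram creation would produce in a text-mining tool. A real analysis would use a proper stemmer and a far larger labelled corpus.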