Discovering Disease Patterns Using the Supervised Topic Model

In this paper, we explore methods for medical data mining. Medical data typically have distinctive characteristics such as sparseness, highly correlated features, and imbalanced sample categories. After reviewing the models commonly used in medical data mining, we adopt a topic-based approach. We build a supervised topic model (the SLDA model) and estimate its parameters with Gibbs sampling. The model's results reveal important relationships among features in our medical data. Finally, we combine the SLDA model with a Random Forest classifier, which achieves good predictive performance in disease prediction.
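To make the Gibbs sampling step concrete, the following is a minimal sketch of collapsed Gibbs sampling for plain, unsupervised LDA. It is not the paper's SLDA model: a supervised variant would also include the response (disease label) likelihood in the sampling step, and that part is omitted here. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy records are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for plain (unsupervised) LDA.

    docs: list of lists of integer feature ids (e.g. coded findings per record).
    Returns (theta, phi): per-record topic proportions and
    per-topic feature distributions.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # counts n_dk
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # counts n_kw
    topic_total = [0] * n_topics                              # counts n_k
    assignments = []

    # Randomly initialise one topic assignment per token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised full conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed point estimates from the final counts.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(n_topics)
    ]
    return theta, phi


# Hypothetical toy records: each list holds integer-coded features.
docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 5, 3, 4]]
theta, phi = lda_gibbs(docs, n_topics=2, vocab_size=6)
```

Each row of `theta` is a low-dimensional topic representation of one record and could in principle serve as input features for a classifier such as a Random Forest. The abstract does not say how the two models are actually combined, so that pairing is only one possible reading.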
