Topical Cluster Discovery in Semistructured Healthcare Data

We propose an approach to clustering XML-based corpora of healthcare documents by latent topic similarity. The approach has two steps. First, the latent topic distributions of the input healthcare documents are inferred by collapsed Gibbs sampling and parameter estimation under an XML topic model. Second, the inferred distributions are grouped using established clustering techniques.
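The two-step pipeline can be illustrated with a minimal sketch. The XML topic model itself is not specified here, so the sketch substitutes plain latent Dirichlet allocation over bag-of-words documents as a stand-in, and uses k-means as one possible "established clustering technique". The function names (`gibbs_lda`, `kmeans`), the hyperparameter values, and the toy corpus are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for plain LDA (a stand-in, NOT the XML topic model).

    docs is a list of lists of word ids. Returns the estimated
    document-topic distributions, one row per document.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # topic counts per document
    n_kw = np.zeros((n_topics, vocab_size))  # word counts per topic
    n_k = np.zeros(n_topics)                 # total token count per topic
    z = []                                   # topic assignment of every token
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topics, size=len(doc))  # random initialization
        z.append(z_d)
        for w, k in zip(doc, z_d):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current token from the counts.
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Full conditional of the token's topic given all other assignments.
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1
    # Parameter estimation: smoothed, normalized topic counts per document.
    return (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + n_topics * alpha)

def kmeans(points, n_clusters, n_iter=50, seed=0):
    """Basic k-means over the rows of points; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):  # leave an empty cluster's center unchanged
                centers[c] = points[labels == c].mean(axis=0)
    return labels

# Hypothetical toy corpus: word ids 0-2 and 3-5 stand for two vocabularies.
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [0, 0, 1, 2],
        [3, 4, 5, 3], [4, 5, 3, 5, 4], [5, 3, 4, 4]]
theta = gibbs_lda(docs, n_topics=2, vocab_size=6)  # step 1: topic inference
labels = kmeans(theta, n_clusters=2)               # step 2: clustering
```

Clustering on the rows of `theta` rather than on raw term counts means documents are compared in the low-dimensional topic space, which is the idea behind grouping by latent topic similarity.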
