Identifying Similar Cases in Document Networks Using Cross-Reference Structures

Our objective was to explore how document networks built with different thresholds of shared information, combined with different clustering algorithms, can identify clusters of documents describing similar clinical cases. We created networks from sets of vaccine adverse event reports using seven approaches for linking reports. We then applied three clustering algorithms [visualization of similarities (VOS), Louvain, and k-means] to these networks and evaluated their ability to identify known clusters. The report sets comprised one simulated set and three sets from the Vaccine Adverse Event Reporting System; each was split into training and testing subsets. Training subsets were used to estimate parameter values for the clustering algorithms, and testing subsets were used to evaluate the resulting clusters. We created the networks by linking reports based on shared information in the form of either individual Medical Dictionary for Regulatory Activities Preferred Terms (PTs) or dyads, triplets, quadruplets, quintuplets, and sextuplets of PTs; we created another network by weighting the connections of the single-PT network using Lin's information-theoretic approach to similarity. We then repeated the entire process using networks based on text mining output rather than structured data. We evaluated report clustering using recall, precision, and F-measure. In general, the VOS algorithm outperformed Louvain and k-means. The best weighting scheme appeared to depend on the complexity of the known cluster; for example, singleton weighting performed best for an intussusception cluster driven by a single PT. We observed only marginal differences between code-based and text-based clustering. In conclusion, our approach supported the identification of similar nodes in a document network.
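The linking and evaluation steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_network`, `precision_recall_f`), the toy reports, and the choice to weight an edge by the number of shared PT tuples are all assumptions made for illustration. The clustering step (VOS, Louvain, or k-means) and Lin's similarity weighting are deliberately left out.

```python
from collections import defaultdict
from itertools import combinations


def build_network(reports, k=1):
    """Link reports that share at least one k-tuple of PTs.

    reports: dict mapping a report id to a set of PT strings.
    k: tuple size (1 = single PTs, 2 = dyads, 3 = triplets, ...).
    Returns a dict {(id_a, id_b): count of shared k-tuples}.
    Using the count as the edge weight is an assumption of this sketch.
    """
    # Map every k-tuple of PTs to the reports that contain it.
    index = defaultdict(set)
    for report_id, pts in reports.items():
        for combo in combinations(sorted(pts), k):
            index[combo].add(report_id)

    # Every pair of reports sharing a k-tuple gets an edge.
    edges = defaultdict(int)
    for report_ids in index.values():
        for a, b in combinations(sorted(report_ids), 2):
            edges[(a, b)] += 1
    return dict(edges)


def precision_recall_f(predicted, known):
    """Compare one predicted cluster with one known cluster (sets of ids)."""
    true_positives = len(predicted & known)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical toy reports; real VAERS reports carry many more PTs.
reports = {
    "r1": {"Intussusception", "Vomiting", "Abdominal pain"},
    "r2": {"Intussusception", "Vomiting"},
    "r3": {"Pyrexia", "Vomiting"},
    "r4": {"Pyrexia", "Injection site erythema"},
}

single_pt_network = build_network(reports, k=1)  # links on any shared PT
dyad_network = build_network(reports, k=2)       # links on shared PT pairs

# Score a hypothetical predicted cluster against a known cluster.
scores = precision_recall_f({"r1", "r2", "r3"}, {"r1", "r2"})
```

In this toy data, raising `k` from 1 to 2 removes the links that rest only on the common term "Vomiting", which shows how a stricter threshold of shared information yields a sparser network.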
