Improving rare disease classification using imperfect knowledge graph

Accurately recognizing rare diseases from symptom descriptions is a critical task. The lack of historical data for rare diseases poses a great challenge to machine learning-based approaches. In this study, we develop a text classification algorithm that represents a document as a combination of a “bag of words” and a “bag of knowledge terms,” where a “knowledge term” is a term shared between the document and the knowledge graph relevant to the disease classification task.
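As a rough illustration of the combined representation described above, the following minimal sketch builds a "bag of words" and a "bag of knowledge terms" for one document and merges them into a single feature dictionary. The whitespace tokenizer, the `w:`/`k:` feature prefixes, the function names, and the toy knowledge graph vocabulary are all assumptions made for this example; the abstract does not specify how the two bags are extracted or combined.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real system would
    # likely use a proper tokenizer or word segmenter.
    return text.lower().split()


def combined_representation(text, kg_terms):
    """Return one feature dict holding both bags.

    kg_terms is a hypothetical set of terms taken from the knowledge graph.
    """
    tokens = tokenize(text)
    # Bag of words: counts of every token in the document.
    bag_of_words = Counter(tokens)
    # Bag of knowledge terms: counts of tokens that also occur in the graph.
    bag_of_knowledge = Counter(t for t in tokens if t in kg_terms)
    # Prefixes keep the two feature spaces separate (an illustrative choice).
    features = {f"w:{term}": count for term, count in bag_of_words.items()}
    features.update({f"k:{term}": count for term, count in bag_of_knowledge.items()})
    return features


# Toy data, invented for illustration only.
kg_terms = {"tremor", "ataxia", "seizure"}
doc = "Patient reports tremor and mild ataxia with tremor at rest"
features = combined_representation(doc, kg_terms)
```

Keeping the knowledge terms as a separate block of features lets a downstream classifier weight them differently from ordinary words, which is one plausible reading of why the two bags are combined rather than merged.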
