Dynamic Text Categorization of Search Results for Medical Class Recognition in Real World Evidence Studies in the Chinese Language

Classifying clinical terms from electronic medical record (EMR) systems is critical for real world evidence (RWE) research. Yet the task is challenging, especially in languages other than English, and clinical research institutes need a cost-effective way to address it. We propose a software pipeline with two components: a feature generator that gathers words describing each term by segmenting the text of search results returned by two search engines, and a learning mechanism that applies machine learning algorithms for classification. Models were trained with training sets of different sizes to determine how training set size affects effectiveness, and were compared using either 10-fold cross validation or a separately supplied test set. We applied the pipeline to a Chinese medication term set extracted from a clinical system and to a data set of standard medication names. A term-by-word frequency matrix was generated from the Google search results for the term sets. Most models tasked with classifying a medication as Western or Chinese medicine achieved high accuracy, especially the radial basis function (RBF) network. Performance did not differ significantly among models trained with training sets of different sizes. When the same approach was applied to information gathered from a Chinese-language search engine (Baidu), better performance was achieved. The results of the other experiments, conducted on the medication name set, also demonstrate a significant improvement over baseline. Dynamic text categorization with machine learning can be applied to classify clinical terms based on information retrieved from search engines in RWE studies.
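The two pipeline components described above can be illustrated with a minimal sketch. It assumes the search-result snippets for each term have already been fetched and segmented into words, so no search engine or Chinese segmenter is called here. The toy terms, words, labels, and all function names are hypothetical, and the multinomial naive Bayes learner is a simple stand-in, not the RBF network or any other model evaluated in the study.

```python
from collections import Counter
import math

def vectorize(words, vocab_index):
    """Count how often each vocabulary word occurs; unseen words are ignored."""
    row = [0] * len(vocab_index)
    for word, count in Counter(words).items():
        if word in vocab_index:
            row[vocab_index[word]] = count
    return row

def build_term_word_matrix(segmented_results):
    """Build a term-by-word frequency matrix from pre-segmented search results."""
    vocab = sorted({w for words in segmented_results.values() for w in words})
    vocab_index = {w: i for i, w in enumerate(vocab)}
    matrix = {t: vectorize(ws, vocab_index) for t, ws in segmented_results.items()}
    return vocab, vocab_index, matrix

def train_naive_bayes(matrix, labels):
    """Multinomial naive Bayes with add-one smoothing (a stand-in learner)."""
    log_prior, log_like = {}, {}
    for cls in sorted(set(labels.values())):
        rows = [row for term, row in matrix.items() if labels[term] == cls]
        log_prior[cls] = math.log(len(rows) / len(matrix))
        totals = [sum(col) + 1 for col in zip(*rows)]  # per-word counts, smoothed
        denom = sum(totals)
        log_like[cls] = [math.log(t / denom) for t in totals]
    return log_prior, log_like

def predict(model, row):
    """Return the class with the highest posterior log score for one term row."""
    log_prior, log_like = model
    def score(cls):
        return log_prior[cls] + sum(n * l for n, l in zip(row, log_like[cls]))
    return max(log_prior, key=score)

# Hypothetical, already-segmented search-result words for four medication terms.
segmented = {
    "aspirin": ["tablet", "pain", "tablet", "dose"],
    "ibuprofen": ["tablet", "dose", "inflammation"],
    "ginseng": ["herb", "decoction", "qi"],
    "danggui": ["herb", "decoction", "blood", "herb"],
}
labels = {"aspirin": "western", "ibuprofen": "western",
          "ginseng": "chinese", "danggui": "chinese"}

vocab, vocab_index, matrix = build_term_word_matrix(segmented)
model = train_naive_bayes(matrix, labels)

# Classify a new term from the words found in its (hypothetical) search results.
new_row = vectorize(["herb", "decoction", "capsule"], vocab_index)
prediction = predict(model, new_row)
```

In a fuller version, the `segmented` dictionary would be filled by querying a search engine for each term and running a Chinese word segmenter over the returned snippets, and the learner would be replaced by whichever classifiers are being compared.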
