A Machine Learning Approach to Identify Clinical Trials Involving Nanodrugs and Nanodevices from ClinicalTrials.gov

Background: Clinical trials (CTs) are essential for bridging the gap between experimental research on new drugs and their clinical application. Just as CTs for traditional drugs and biologics have helped accelerate the translation of biomedical findings into medical practice, CTs for nanodrugs and nanodevices could advance novel nanomaterials as agents for diagnosis and therapy. Although information about nanomedicine-related CTs is publicly available, it is archived online without any criteria that discriminate between studies involving nanomaterials or nanotechnology-based processes (nano) and CTs that do not involve nanotechnology (non-nano). Determining from a CT summary alone whether nanodrugs or nanodevices were involved in a study is a challenging task. At the time of writing, CTs archived in the well-known online registry ClinicalTrials.gov cannot easily be classified as nano or non-nano, even by domain experts, because there is neither a common definition of nanotechnology nor a set of standards for reporting nanomedical experiments and results.

Methods: We propose a supervised learning approach for classifying CT summaries from ClinicalTrials.gov according to whether they fall into the nano or the non-nano category. Our method involves several stages: i) extraction and manual annotation of CTs as nano vs. non-nano, ii) pre-processing and automatic classification, and iii) performance evaluation using several state-of-the-art classifiers under different transformations of the original dataset.

Results and Conclusions: The performance of the best automated classifier closely matches that of experts (AUC over 0.95), suggesting that it is feasible to automatically detect the presence of nanotechnology products in CT summaries with a high degree of accuracy. This can significantly speed up the process of determining whether reports on ClinicalTrials.gov are relevant to a particular nanoparticle or nanodevice, which is essential for discovering precedents for nanotoxicity events or advantages for targeted drug therapy.
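The three stages described in the Methods (annotated summaries, pre-processing plus classification, and AUC-based evaluation) can be illustrated with a minimal sketch. This is not the authors' pipeline: the multinomial naive Bayes classifier is chosen here only because it fits in a few lines, the regular-expression tokenizer is a stand-in for real pre-processing, the example summaries and labels are invented for illustration, and all function names (`tokenize`, `train_nb`, `nano_score`, `auc`) are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic tokens (a stand-in for real pre-processing)."""
    return re.findall(r"[a-z]+", text.lower())


def train_nb(docs, labels):
    """Collect per-class token counts for a multinomial naive Bayes model."""
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)
    for doc, y in zip(docs, labels):
        counts[y].update(tokenize(doc))
    vocab = set(counts[0]) | set(counts[1])
    return counts, n_docs, vocab


def nano_score(model, doc):
    """Log-odds of nano (label 1) vs. non-nano (label 0), with add-one smoothing."""
    counts, n_docs, vocab = model
    score = math.log(n_docs[1] / n_docs[0])
    totals = {y: sum(counts[y].values()) + len(vocab) for y in (0, 1)}
    for tok in tokenize(doc):
        if tok in vocab:  # tokens never seen in training are ignored
            score += math.log((counts[1][tok] + 1) / totals[1])
            score -= math.log((counts[0][tok] + 1) / totals[0])
    return score


def auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy summaries; 1 = nano, 0 = non-nano (illustrative labels only).
train_docs = [
    "liposomal doxorubicin in ovarian cancer",
    "albumin bound paclitaxel nanoparticle for breast cancer",
    "oral metformin in type two diabetes",
    "aspirin for prevention of stroke",
]
train_labels = [1, 1, 0, 0]
test_docs = [
    "nanoparticle paclitaxel in lung cancer",
    "aspirin and metformin in diabetes",
]
test_labels = [1, 0]

model = train_nb(train_docs, train_labels)
scores = [nano_score(model, doc) for doc in test_docs]
print("AUC on toy test set:", auc(scores, test_labels))
```

A positive score means the summary looks more like the nano training examples than the non-nano ones. On real data the score would be computed on held-out summaries only, and a figure such as the reported AUC would depend on the annotated corpus rather than on a toy set like this one.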
