An automated approach to identify scientific publications reporting pharmacokinetic parameters

Pharmacokinetic (PK) predictions for new chemical entities are aided by prior knowledge from other compounds. The development of robust algorithms that improve the preclinical and clinical phases of drug development remains constrained by the need to search, curate and standardise PK information across the constantly growing scientific literature. The lack of centralised, up-to-date and comprehensive repositories of PK data represents a significant limitation in the drug development pipeline. In this work, we propose a machine learning approach to automatically identify and characterise scientific publications reporting PK parameters from in vivo data, providing a centralised repository of PK literature. A dataset of 4,792 PubMed publications was labelled by field experts according to whether in vivo PK parameters were estimated in the study. Different classification pipelines were compared using a bootstrap approach, and the best-performing architecture was used to develop a comprehensive and automatically updated repository of PK publications. The best-performing architecture encoded documents using unigram features and mean pooling of BioBERT embeddings, obtaining an F1 score of 83.8% on the test set. The pipeline retrieved over 121K PubMed publications in which in vivo PK parameters were estimated, and it was scheduled to perform weekly updates on newly published articles. All relevant documents were released through a publicly available web interface (https://app.pkpdai.com) and characterised by the drugs, species and conditions mentioned in the abstract, to facilitate the subsequent search for relevant PK data. This automated, open-access repository can be used to accelerate the search and comparison of PK results, curate ADME datasets, and facilitate subsequent text mining tasks in the PK domain.
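As an illustration of the document encoding described above, the following minimal sketch combines unigram counts with a mean-pooled embedding vector. It is not the authors' implementation: the function names, the toy vocabulary, and the static two-dimensional embedding table are hypothetical stand-ins for BioBERT token embeddings, and joining the two feature sets by concatenation is an assumption, since the abstract does not state how they are combined or which classifier consumes them.

```python
from collections import Counter
from typing import Dict, List, Sequence


def unigram_counts(tokens: Sequence[str], vocabulary: Sequence[str]) -> List[float]:
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocabulary]


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the token vectors dimension by dimension."""
    if not token_vectors:
        raise ValueError("cannot pool an empty document")
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]


def encode_document(
    tokens: Sequence[str],
    vocabulary: Sequence[str],
    embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Concatenate unigram counts with the mean-pooled embedding (assumed design)."""
    # Unknown tokens map to a zero vector; a real encoder would embed every token.
    vectors = [embeddings.get(tok, [0.0] * dim) for tok in tokens]
    return unigram_counts(tokens, vocabulary) + mean_pool(vectors)


# Toy data, purely illustrative: the embedding table stands in for BioBERT output.
tokens = ["clearance", "was", "estimated", "clearance"]
vocabulary = ["clearance", "estimated", "half-life"]
embeddings = {
    "clearance": [1.0, 0.0],
    "was": [0.0, 0.0],
    "estimated": [0.0, 2.0],
}

features = encode_document(tokens, vocabulary, embeddings, dim=2)
print(features)  # [2.0, 1.0, 0.0, 0.5, 0.5]
```

The resulting vector could then be passed to any standard binary classifier to decide whether a publication reports in vivo PK parameters.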
