Elucidating high-dimensional cancer hallmark annotation via enriched ontology

MOTIVATION: Cancer hallmark annotation is a promising technique for discovering novel knowledge about cancer from the biomedical literature. Automated annotation of cancer hallmarks can reveal relevant cancer transformation processes described in the literature, or retrieve the articles that correspond to a cancer hallmark of interest. It is a complementary approach that retrieves knowledge from massive amounts of text, supporting numerous focused studies in cancer research. Nonetheless, the high-dimensional nature of cancer hallmark annotation poses a unique challenge.

RESULTS: To address the curse of dimensionality, we compared multiple cancer hallmark annotation methods on 1580 PubMed abstracts. Based on the insights gained, we propose UDT-RF, a novel approach that makes use of ontological features. It expands the feature space via the Medical Subject Headings (MeSH) ontology graph and applies novel feature selection methods to elucidate the high-dimensional cancer hallmark annotation space. To demonstrate its effectiveness, we compare it with state-of-the-art methods and evaluate all of them with a multitude of performance metrics, revealing the full performance spectrum across the full set of cancer hallmarks. Several case studies demonstrate how the proposed approach could reveal novel insights into cancers.

AVAILABILITY: https://github.com/cskyan/chmannot
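
The general idea of expanding a feature space through an ontology graph can be illustrated with a minimal sketch. This is not the UDT-RF implementation: the hierarchy `TOY_PARENTS`, the helper names `expand_with_ancestors` and `to_feature_vector`, and the choice to add every ancestor term are all assumptions made for illustration, and the toy hierarchy is not actual MeSH data.

```python
from collections import deque

# Hypothetical toy hierarchy (child -> parents); NOT real MeSH data.
TOY_PARENTS = {
    "Apoptosis": ["Cell Death"],
    "Cell Death": ["Cell Physiological Phenomena"],
    "Angiogenesis": ["Physiological Processes"],
}


def expand_with_ancestors(terms, parents):
    """Return the input terms plus every ancestor reachable in `parents`."""
    expanded = set(terms)
    queue = deque(terms)
    while queue:
        term = queue.popleft()
        for parent in parents.get(term, []):
            if parent not in expanded:
                expanded.add(parent)
                queue.append(parent)
    return expanded


def to_feature_vector(terms, vocabulary):
    """Encode a set of terms as a binary vector over a fixed vocabulary."""
    return [1 if term in terms else 0 for term in vocabulary]


# Usage: one abstract annotated with a single term (assumed input).
vocabulary = sorted(
    set(TOY_PARENTS) | {p for ps in TOY_PARENTS.values() for p in ps}
)
document_terms = expand_with_ancestors(["Apoptosis"], TOY_PARENTS)
features = to_feature_vector(document_terms, vocabulary)
```

Each added ancestor becomes an extra column, which is one way such expansion can enlarge an already high-dimensional feature space and motivate a subsequent feature selection step before a classifier is trained.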
