Exploration of Known and Unknown Early Symptoms of Cervical Cancer and Development of a Symptom Spectrum - Outline of a Data and Text Mining Based Approach

This position paper outlines a set of experiments for detecting early symptoms of cervical cancer. We use a large corpus of Swedish electronic patient record texts from Karolinska University Hospital covering the years 2009-2010, from which we extracted a total of 1,660 patient records with the ICD-10 diagnosis code C53 for cervical cancer. We used a named entity recogniser called Clinical Entity Finder to detect the diagnoses and symptoms expressed in these clinical texts, which contain 2,988,118 words in total. We found 28,218 symptoms and diagnoses for these 1,660 patients. We present and discuss some initial findings and propose a set of experiments to find possible early symptoms and/or a spectrum of early symptoms of cervical cancer.
