Modeling words for online sexual behavior surveillance and clinical text information extraction

How do we model the meaning of words? In domains like information retrieval, words have classically been modeled as discrete entities using 1-of-n encoding, a representation that elides most of a word’s syntactic and semantic structure. Recent research, however, has begun exploring more robust representations called word embeddings. Embeddings model each word through a parameterized function that maps it into an n-dimensional continuous space, and they implicitly encode a number of interesting semantic and syntactic properties. This dissertation examines two application areas where existing, state-of-the-art terminology modeling improves the task of information extraction (IE), the process of transforming unstructured data into structured form. We show that a large amount of word meaning can be learned directly from very large document collections.

First, we explore the feasibility of mining sexual health behavior data directly from the unstructured text of online “hookup” requests. The Internet has fundamentally changed how individuals locate sexual partners. The rise of dating websites and of location-aware smartphone apps like Grindr and Tinder that facilitate casual sexual encounters (“hookups”), together with changing trends in sexual health practices, all speak to the shifting cultural dynamics surrounding sex in the digital age. These shifts also coincide with an increase in the incidence rate of sexually transmitted infections (STIs) in subpopulations such as young adults, racial and ethnic minorities, and men who have sex with men (MSM). The reasons for these increases and their possible connections to Internet cultural dynamics are not completely understood. What is apparent, however, is that sexual encounters negotiated online complicate
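The contrast drawn above between 1-of-n encoding and word embeddings can be illustrated with a minimal sketch. The three-word vocabulary, the two-dimensional vectors, and the helper names `one_hot` and `cosine` are all hypothetical, chosen by hand for illustration; they are not taken from the dissertation, and real embeddings are learned from large corpora rather than assigned.

```python
import math

# Toy vocabulary (an assumption for illustration only).
vocab = ["cat", "dog", "car"]

def one_hot(word, vocab):
    """1-of-n encoding: a vector with a single 1 at the word's index."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical hand-picked dense vectors standing in for learned embeddings.
embeddings = {
    "cat": [0.9, 0.1],
    "dog": [0.8, 0.2],
    "car": [0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Under 1-of-n encoding every pair of distinct words is orthogonal,
# so the representation carries no notion of relatedness.
one_hot_sim = cosine(one_hot("cat", vocab), one_hot("dog", vocab))

# In a continuous space, related words can sit closer together
# than unrelated ones.
cat_dog = cosine(embeddings["cat"], embeddings["dog"])
cat_car = cosine(embeddings["cat"], embeddings["car"])

print(one_hot_sim, cat_dog > cat_car)
```

In this toy setup the one-hot similarity between any two distinct words is zero, while the dense vectors place "cat" nearer to "dog" than to "car", which is the kind of structure a discrete encoding cannot express.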
