Visions and open challenges for a knowledge-based culturomics

The concept of culturomics was born out of the availability of massive amounts of textual data and the interest in making sense of cultural and language phenomena over time. Thus far, however, culturomics has relied solely on statistical methods, and has shown their great potential. In this paper, we present a vision for a knowledge-based culturomics that complements traditional culturomics. We discuss the possibilities and challenges of combining knowledge-based methods with statistical methods, and address major challenges that arise from the nature of the data: diversity of sources, changes in language over time, and the temporal dynamics of information in general. We address all layers needed for knowledge-based culturomics, from natural language processing and relations to summaries and opinions.
