Mining semantics for culturomics: towards a knowledge-based approach

The massive amounts of text data made available through the Google Books digitization project have inspired a new field of big-data textual research. Named culturomics, this field has attracted the attention of a growing number of scholars in recent years. However, initial studies based on these data have been criticized for not referring to relevant work in linguistics and language technology. This paper presents some ideas and first steps towards a new culturomics initiative, this time based on Swedish data, which pursues a more knowledge-based approach than previous work in this emerging field. The amount of new Swedish text produced daily, together with older texts being digitized in cultural heritage projects, is growing at an accelerating rate. The volume of text available in digital form has grown far beyond the capacity of human readers, leaving automated semantic processing as the only realistic option for accessing and using the information it contains. The aim of our recently initiated research program is to advance the state of the art in language technology resources and methods for semantic processing of big Swedish text, with a focus on theoretical and methodological advances in extracting and correlating information from large volumes of Swedish text using a combination of knowledge-based and statistical methods.
