EMBEDDIA Tools, Datasets and Challenges: Resources and Hackathon Contributions

This paper presents tools and data sources collected and released by the EMBEDDIA project, supported by the European Union’s Horizon 2020 research and innovation program. The collected resources were offered to participants of a hackathon organized as part of the EACL Hackashop on News Media Content Analysis and Automated Report Generation in February 2021. Six teams participated in the hackathon, addressing challenges either from the list of proposed challenges or from their own news-industry-related tasks. This paper goes beyond the scope of the hackathon: it brings together, in a coherent and compact form, most of the resources developed, collected and released by the EMBEDDIA project. It also serves as a handy reference for the news media industry and for researchers in the fields of Natural Language Processing and Social Science.
