State of the Union: A Data Consumer's Perspective on Wikidata and Its Properties for the Classification and Resolution of Entities

Wikipedia is one of the most popular sources of free data on the Internet and is used extensively in numerous areas of research. Wikidata, on the other hand, the knowledge base behind Wikipedia, is less popular as a source of data, despite having “data” in its name and despite the fact that many applications in Natural Language Processing in general, and Information Extraction in particular, benefit immensely from the integration of knowledge bases. In part, this imbalance is due to the younger age of Wikidata, which launched over a decade after Wikipedia. However, it is also due to challenges posed by the still-evolving properties of Wikidata, which make its content more difficult for third parties to consume than is desirable. In this article, we analyze the causes of these challenges from the viewpoint of a data consumer and discuss possible avenues of research and advancement on which the scientific and Wikidata communities can collaborate to turn the knowledge base into the invaluable asset that it is uniquely positioned to become.
