Paper information - Implicit entity networks: a versatile document model

Implicit entity networks: a versatile document model

The time in which we live is often referred to as the Information Age. However, it can also aptly be characterized as an age of constant information overload. Nowhere is this more present than on the Web, which serves as an endless source of news articles, blog posts, and social media messages. Of course, this overload is even greater in professions that handle the creation or extraction of information and knowledge, such as those of journalists, lawyers, researchers, clerks, or medical professionals. The volume of available documents and the interconnectedness of their contents are both a blessing and a curse for the contemporary information consumer. On the one hand, they provide near limitless information, but on the other hand, their consumption and comprehension require an amount of time that many of us cannot spare. As a result, automated extraction, aggregation, and summarization techniques have risen in popularity, even though they are a long way from being comprehensive. When we, as humans, are faced with an overload of information, we tend to look for patterns that bring order into the chaos. In news, we might identify familiar political figures or celebrities, whereas we might look for expressive symptoms in medicine, or precedential cases in law. In other words, we look for known entities as reference points, and then explore the content along the lines of their relations to other entities. Unfortunately, this approach is not reflected in current document models, which do not provide a similar focus on entities. As a direct result, the retrieval of entity-centric knowledge and relations from a flood of textual information becomes more difficult than it has to be, and the inclusion of external knowledge sources is impeded. In this thesis, we introduce implicit entity networks as a comprehensive document model that addresses this shortcoming and provides a holistic representation of document collections and document streams.
Based on the premise of modelling the cooccurrence relations between terms and entities as first-class citizens, we investigate how the resulting network structure facilitates efficient and effective entity-centric search, and demonstrate the extraction of complex entity relations, as well as their summarization. We show that the implicit network model is fully compatible with dynamic streams of documents. Furthermore, we introduce document aggregation methods that are sensitive to the context of entity mentions, and can be used to distinguish between different entity relations. Beyond the relations of individual entities, we introduce network topics as a novel and scalable method for the extraction of topics from collections and streams of documents. Finally, we combine the insights gained from these applications in a versatile hypergraph document model that bridges the gap between unstructured text and structured knowledge sources.
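To make the idea of a cooccurrence network over terms and entities more concrete, the following is a minimal sketch and not the construction used in the thesis. It assumes that sentences arrive already annotated as lists of `(label, kind)` pairs, that cooccurrence means appearing in the same sentence, and that edge weights are plain counts; the function names `build_cooccurrence_network` and `neighbours` and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_network(sentences):
    """Count sentence-level cooccurrences between annotated nodes.

    `sentences` is a list of sentences, each a list of (label, kind)
    tuples, where kind is e.g. "entity" or "term" (assumed annotation
    format). Returns a dict mapping a sorted node pair to its count.
    """
    weights = defaultdict(int)
    for sentence in sentences:
        # Deduplicate within a sentence and sort so that each
        # undirected edge has exactly one canonical key.
        nodes = sorted(set(sentence))
        for a, b in combinations(nodes, 2):
            weights[(a, b)] += 1
    return dict(weights)


def neighbours(network, node):
    """Rank the neighbours of `node` by descending edge weight."""
    ranked = []
    for (a, b), weight in network.items():
        if a == node:
            ranked.append((b, weight))
        elif b == node:
            ranked.append((a, weight))
    # Ties are broken by node label to keep the ranking deterministic.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical, hand-annotated sample sentences.
sample = [
    [("Angela Merkel", "entity"), ("Berlin", "entity"), ("summit", "term")],
    [("Angela Merkel", "entity"), ("Berlin", "entity")],
    [("Berlin", "entity"), ("airport", "term")],
]

network = build_cooccurrence_network(sample)
for neighbour, weight in neighbours(network, ("Angela Merkel", "entity")):
    print(neighbour, weight)
```

An entity-centric query in this sketch is simply a ranked neighbourhood lookup. A fuller model would presumably weight edges by more than raw counts, for example by the distance between mentions, but that choice is left open here.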

Andreas Spitz
