Entity Extraction and Consolidation for Social Web Content Preservation

With the rapidly increasing pace at which Web content, particularly social media, is evolving, preserving the Web and its evolution over time becomes an important challenge. Meaningful analysis of Web content lends itself to an entity-centric view, which organises Web resources according to the information objects related to them. The crucial challenge is therefore to extract, detect and correlate entities from a vast number of heterogeneous Web resources whose nature and quality may vary heavily. While a wealth of information extraction tools aid this process, we believe that the consolidation of automatically extracted data has to be treated as an equally important step in order to ensure high quality and non-ambiguity of the generated data. In this paper we present an approach based on an iterative cycle that exploits Web data for (1) targeted archiving/crawling of Web objects, (2) entity extraction and detection, and (3) entity correlation. The long-term goal is to preserve Web content over time and to allow its navigation and analysis based on well-formed, structured RDF data about entities.
