Quality-aware similarity assessment for entity matching in Web data

One of the key challenges in realizing automated processing of information on the Web, which is the central goal of the Semantic Web, is the entity matching problem. A number of tools reliably recognize named entities, such as persons, companies, and geographic locations, in Web documents. The names of these extracted entities are, however, not unique: the same name on different Web pages might or might not refer to the same entity. The entity matching problem consists of identifying the extracted entities that refer to the same real-world entity. This problem is very similar to the entity resolution problem studied in relational databases, but there are also several differences. Most importantly, Web pages often contain only partial or incomplete information about the entities. Similarity functions try to capture the degree of belief in the equivalence of two entities, and thus play a crucial role in entity matching. The accuracy of a similarity function depends heavily on the assessment technique applied, but also on specific features of the entities. We propose systematic design strategies for combined similarity functions in this context. Our method combines multiple pieces of evidence, using the estimated quality of the individual similarity values and paying particular attention to missing information, which is common in the Web context. We study the effectiveness of our method on two specific instances of the general entity matching problem, namely person name disambiguation and Twitter message classification. In both cases, using our techniques in a very simple algorithmic framework, we obtained better results than the state-of-the-art methods.
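The abstract does not state the combination formula, so the following is only an illustrative sketch of what a quality-aware combined similarity function could look like. It assumes a quality-weighted average, treats a missing similarity value as "no evidence" rather than as a score of zero, and falls back to a neutral default when no evidence is available at all. The function name `combine_similarities`, the `default` parameter, and the weighting scheme are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def combine_similarities(
    values: Sequence[Optional[float]],
    qualities: Sequence[float],
    default: float = 0.5,
) -> float:
    """Quality-weighted average of similarity values (illustrative only).

    values    -- one similarity score in [0, 1] per feature, or None when the
                 feature is missing for at least one of the two entities
    qualities -- estimated quality (reliability) of each similarity function,
                 used here as a non-negative weight (an assumption)
    default   -- hypothetical neutral score returned when no evidence exists
    """
    if len(values) != len(qualities):
        raise ValueError("values and qualities must have the same length")

    weighted_sum = 0.0
    total_weight = 0.0
    for value, quality in zip(values, qualities):
        if value is None:
            # Missing information: skip it instead of counting it as dissimilar.
            continue
        weighted_sum += quality * value
        total_weight += quality

    if total_weight == 0.0:
        return default
    return weighted_sum / total_weight


# Example: the second feature is missing on one of the pages, so only the
# first and third similarity values contribute to the combined score.
score = combine_similarities([0.9, None, 0.3], [0.8, 0.9, 0.2])
```

Skipping missing values keeps an absent attribute from being read as evidence against a match, which is the kind of behaviour the abstract's remark about incomplete Web data suggests; other choices, such as learned weights or an evidential combination rule, would be equally consistent with the text.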
