Identification of time-varying objects on the web

We have developed a method for determining whether data found on the Web refer to the same object or to different objects, taking into account that an object's attribute values may change over time. Specifically, we estimate two probabilities: that the observed data were generated by a single object whose attribute values changed over time, and that the data were generated by different objects. We define the similarity between observed data in terms of these two probabilities. Given a specific form for the distributions of the time-varying attributes, we can compute the similarity between any given data and identify objects by agglomerative clustering based on that similarity. In experiments comparing the proposed method with a baseline that treats all attribute values as constant, the proposed method improved both the precision and the recall of object identification.
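The pipeline described above (a same-object probability, a different-object probability, a similarity built from the two, then agglomerative clustering) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: each record is a hypothetical `(time, value)` pair with a single positive numeric attribute, the same-object model is a Gaussian random walk on the log of the value, the different-object model draws log values independently from a broad Gaussian, the similarity is the log-likelihood ratio of the two models, and clusters are merged by average linkage while the best average similarity stays above a threshold. The names `similarity`, `cluster`, `drift_std`, and `background_std` are likewise hypothetical.

```python
import math
from itertools import combinations


def normal_logpdf(x, mean, std):
    """Log density of a normal distribution N(mean, std^2) at x."""
    var = std * std
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def similarity(a, b, drift_std=0.05, background_std=1.0):
    """Log-likelihood ratio: same object with drift vs. two unrelated objects.

    a and b are (time, value) pairs with value > 0. Both distribution
    choices below are illustrative assumptions, not the paper's model.
    """
    (t1, v1), (t2, v2) = a, b
    diff = math.log(v1) - math.log(v2)
    elapsed = abs(t1 - t2)
    # Same object: the tolerated change grows with elapsed time;
    # the +1 keeps a small noise floor when both times are equal.
    same_std = drift_std * math.sqrt(elapsed + 1.0)
    # Different objects: difference of two independent broad draws.
    diff_std = background_std * math.sqrt(2.0)
    return normal_logpdf(diff, 0.0, same_std) - normal_logpdf(diff, 0.0, diff_std)


def cluster(records, threshold=0.0, **model_params):
    """Average-linkage agglomerative clustering on the similarity above.

    Returns a list of clusters, each a sorted list of record indices.
    """
    clusters = [[i] for i in range(len(records))]

    def avg_sim(c1, c2):
        total = sum(
            similarity(records[i], records[j], **model_params)
            for i in c1
            for j in c2
        )
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]),
        )
        if avg_sim(clusters[i], clusters[j]) <= threshold:
            break  # no remaining pair is more likely "same" than "different"
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return [sorted(c) for c in clusters]


# Hypothetical records: (time, observed value of one attribute).
records = [(0, 100.0), (1, 104.0), (5, 110.0), (0, 500.0), (2, 520.0)]
groups = cluster(records)
```

In this toy model, a given difference in values counts less against the same-object hypothesis when more time has elapsed between the two observations. A baseline that treats attribute values as constant corresponds roughly to removing the dependence of `same_std` on the elapsed time.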
