Probabilistic correlation-based similarity measure on text records

Abstract Large scale unstructured text records are stored in text attributes in databases and information systems, such as scientific citation records or news highlights. Approximate string matching techniques for full text retrieval, e.g., edit distance and cosine similarity , can be adopted for unstructured text record similarity evaluation. However, these techniques do not show the best performance when applied directly, owing to the difference between unstructured text records and full text. In particular, the information are limited in text records of short length, and various information formats such as abbreviation and data missing greatly affect the record similarity evaluation. In this paper, we propose a novel probabilistic correlation-based similarity measure. Rather than simply conducting the matching of tokens between two records, our similarity evaluation enriches the information of records by considering correlations of tokens. The probabilistic correlation between tokens is defined as the probability of them appearing together in the same records. Then we compute weights of tokens and discover correlations of records based on the probabilistic correlations of tokens. The extensive experimental results demonstrate the effectiveness of our proposed approach.

[1]  W. Bruce Croft,et al.  A Translation Model for Sentence Retrieval , 2005, HLT.

[2]  Lei Chen,et al.  Probabilistic correlation-based similarity measure of unstructured records , 2007, CIKM '07.

[3]  Anuradha Bhamidipaty,et al.  Interactive deduplication using active learning , 2002, KDD.

[4]  Karen Spärck Jones Index term weighting , 1973, Inf. Storage Retr..

[5]  Luis Gravano,et al.  Text joins in an RDBMS for web data integration , 2003, WWW '03.

[6]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[7]  Dekang Lin,et al.  An Information-Theoretic Definition of Similarity , 1998, ICML.

[8]  Luo Si,et al.  Learn to weight terms in information retrieval using category information , 2005, ICML.

[9]  W. Bruce Croft,et al.  Novelty detection based on sentence level patterns , 2005, CIKM '05.

[10]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[11]  Edward A. Fox,et al.  SimFusion: measuring similarity using unified relationship matrix , 2005, SIGIR '05.

[12]  Rahul Gupta,et al.  Creating probabilistic databases from information extraction models , 2006, VLDB.

[13]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[14]  W. Bruce Croft,et al.  Improving novelty detection for general topics using sentence level information patterns , 2006, CIKM '06.

[15]  Stephen E. Robertson,et al.  Understanding inverse document frequency: on theoretical arguments for IDF , 2004, J. Documentation.

[16]  Philip Resnik,et al.  Using Information Content to Evaluate Semantic Similarity in a Taxonomy , 1995, IJCAI.

[17]  Craig A. Knoblock,et al.  Learning domain-independent string transformation weights for high accuracy object identification , 2002, KDD.

[18]  Ahmed K. Elmagarmid,et al.  Duplicate Record Detection: A Survey , 2007, IEEE Transactions on Knowledge and Data Engineering.

[19]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[20]  Sunita Sarawagi,et al.  Automatic segmentation of text into structured records , 2001, SIGMOD '01.

[21]  Gonzalo Navarro,et al.  A guided tour to approximate string matching , 2001, CSUR.

[22]  Hanan Samet,et al.  Index-driven similarity search in metric spaces (Survey Article) , 2003, TODS.

[23]  Jianmin Wang,et al.  Efficient Recovery of Missing Events , 2013, IEEE Transactions on Knowledge and Data Engineering.

[24]  B. C. Brookes,et al.  Information Sciences , 2020, Cognitive Skills You Need for the 21st Century.

[25]  Sudipto Guha,et al.  Merging the Results of Approximate Match Operations , 2004, VLDB.

[26]  Gerard Salton,et al.  Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer , 1989 .

[27]  Pradeep Ravikumar,et al.  A Comparison of String Distance Metrics for Name-Matching Tasks , 2003, IIWeb.

[28]  Alexander Dekhtyar,et al.  Information Retrieval , 2018, Lecture Notes in Computer Science.

[29]  Sunita Sarawagi,et al.  Scalable Information Extraction and Integration. , 2006 .

[30]  William W. Cohen,et al.  Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods , 2004, KDD.

[31]  William W. Cohen Integration of heterogeneous databases without common domains using queries based on textual similarity , 1998, SIGMOD '98.