Searching with numbers

A large fraction of the useful web consists of specification documents that are largely made up of ⟨attribute name, numeric value⟩ pairs embedded in text. Examples include product information, classified advertisements, and resumes. Past approaches search these documents by first establishing correspondences between values and their attribute names, and they have had limited success because this information is difficult to extract from free text. We propose a new approach that does not require this correspondence to be accurately established. Provided the data has "low reflectivity", we can search effectively even if the values in the data have not been assigned attribute names and the user has omitted attribute names in the query. We give algorithms and indexing structures for implementing the search. We also show how hints (i.e., imprecise, partial correspondences) from automatic data extraction techniques can be incorporated into our approach for better accuracy on high-reflectivity datasets. Finally, we validate our approach by showing that we get high precision in our answers on real datasets from a variety of domains.
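
The idea of searching without attribute names can be illustrated with a minimal sketch. It is not the algorithm or index structure of the paper, only a brute-force illustration under stated assumptions: each document is reduced to a bag of numbers, a query is a bag of numbers, and a document's score is the smallest total relative distance over all ways of pairing each query number with a distinct document number. The function names, the relative-distance measure, and the sample data are all hypothetical. The sample data is chosen so that different attributes occupy clearly different numeric ranges, which is the kind of situation in which ignoring attribute names is assumed to be harmless.

```python
from itertools import permutations


def relative_distance(q, v):
    """Relative gap between a query number q and a document number v (assumed measure)."""
    return abs(q - v) / max(abs(q), abs(v), 1e-9)


def match_cost(query, doc_values):
    """Smallest total distance over all ways of pairing each query number
    with a distinct document number. Brute force, so only for small inputs."""
    if len(doc_values) < len(query):
        return float("inf")
    return min(
        sum(relative_distance(q, v) for q, v in zip(query, chosen))
        for chosen in permutations(doc_values, len(query))
    )


def search(query, docs, k=1):
    """Return the names of the k documents with the lowest match cost."""
    ranked = sorted(docs, key=lambda name: match_cost(query, docs[name]))
    return ranked[:k]


# Hypothetical documents: numbers pulled from text with no attribute names attached.
docs = {
    "a": [2000, 512, 80],
    "b": [1200, 256, 40],
    "c": [2400, 1024, 120],
}

# The query also omits attribute names and lists its numbers in arbitrary order.
best = search([256, 1200], docs)
print(best)
```

Here the query numbers 256 and 1200 both occur in document "b", so it is ranked first even though neither the query nor the data says which attribute each number belongs to.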
