Using clustering strategies for creating authority files

As more online databases are integrated into digital libraries, quality control of the data becomes increasingly important, especially as it affects the effective retrieval of information. Authority work, the discovery and reconciliation of variant forms of strings in bibliographic entries, will become more critical in the future. Spelling variants, misspellings, and transliteration differences all increase the difficulty of retrieving information. We investigate a number of approximate string matching techniques that have traditionally been used to help with this problem. We then introduce the notion of approximate word matching and show how it can be used to improve the detection and categorization of variant forms. We demonstrate the utility of these approaches using data from the Astrophysics Data System and show how they reduce the human effort involved in creating authority files.
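To make the two ideas concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes that approximate string matching is done with plain Levenshtein edit distance, that approximate word matching means comparing two entries word by word and treating words within `max_edits` character edits as equal, and that variants are grouped by simple single-link clustering. The function names, the thresholds, and the sample strings are all hypothetical choices made for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def word_level_distance(s1: str, s2: str, max_edits: int = 1) -> int:
    """Edit distance over word sequences, where two words count as equal
    if they are within max_edits character edits (an assumed reading of
    'approximate word matching')."""
    t1, t2 = s1.lower().split(), s2.lower().split()
    prev = list(range(len(t2) + 1))
    for i, w1 in enumerate(t1, start=1):
        curr = [i]
        for j, w2 in enumerate(t2, start=1):
            cost = 0 if edit_distance(w1, w2) <= max_edits else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cluster_variants(strings, threshold: int = 0):
    """Single-link clustering: strings whose word-level distance is at most
    threshold end up in the same cluster (union-find bookkeeping)."""
    parent = list(range(len(strings)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in combinations(range(len(strings)), 2):
        if word_level_distance(strings[i], strings[j]) <= threshold:
            parent[find(j)] = find(i)

    clusters = {}
    for i, s in enumerate(strings):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values())


# Illustrative entries with made-up misspellings (not data from the paper).
entries = [
    "Harvard Smithsonian Center for Astrophysics",
    "Harvard Smithsonion Center for Astrophysics",
    "University of Virginia",
    "Universty of Virginia",
]
for group in cluster_variants(entries):
    print(group)
```

Each resulting cluster is a candidate set of variant forms that a human reviewer could confirm and map to a single authoritative form. The pairwise comparison is quadratic in the number of entries, so a real collection would need a more scalable clustering strategy.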
