Distributed error correction

We propose distributed error correction for digital libraries, where individual users can correct information in a database in real-time. Distributed error correction is used in the ResearchIndex (formerly CiteSeer) scientific literature digital library developed at NEC Research Institute. We discuss issues including motivation to contribute corrections, barriers to participation, trust, recovery, detecting malicious changes, and the use of correction information to improve automated algorithms or predict the probability of errors. We also detail our implementation of distributed error correction in ResearchIndex.

Introduction

Many online databases contain errors. In many cases, it is impractical for database maintainers to correct all of the errors in their databases. Many of these databases are created using automated or partially automated means: for example, some search engines classify pages into predefined categories, while other services maintain databases of automatically extracted information, such as the HPSearch [7] service, which maintains a database of researcher homepages, and the WebKB project at CMU, which automatically extracts information from Web pages [3].

We propose the use of distributed error correction to increase the accuracy of online databases, by harnessing the knowledge of all users and allowing individual users to correct errors. Examples might include users reporting incorrectly classified pages to search engines, or correcting responses from services such as homepage location services. In this work we focus on the correction of automatically extracted information in ResearchIndex [10, 11], a scientific literature digital library developed at the NEC Research Institute. The next section provides brief background information on the ResearchIndex system and the Autonomous Citation Indexing (ACI) performed by ResearchIndex.

ResearchIndex

ResearchIndex is a scientific literature digital library project at NEC Research Institute. Areas of focus include the effective use of the capabilities of the web, and the use of machine learning. The ResearchIndex project encompasses many areas including the efficient location of articles, full-text indexing, autonomous citation indexing, information extraction, computation of related documents, and user profiling.

ResearchIndex operates completely autonomously and performs a number of tasks including: location of research articles on the web, conversion of Postscript and PDF files to text, extraction of title and author information from article headers, extraction of the list of citations made in an article, autonomous citation indexing, and the extraction of citation context within articles. This paper is primarily concerned with the extraction of author and title information and the autonomous citation indexing components of ResearchIndex.

Citation indexing is the indexing of the citations made in research articles, linking the citing papers with the cited works [4]. Citation indices allow, for example, the location of subsequent papers that cite a given paper. The most well-known citation indices are the commercial indices created by the Institute for Scientific Information (ISI) (http://www.isinet.com/), for example the Science Citation Index (SCI)®. The ISI citation databases are created using manual effort, and are known to contain errors. Autonomous Citation Indexing (ACI) automates the task of creating citation indices. Details of the autonomous citation indexing performed by ResearchIndex can be found in [11].
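To make the citation indexing task concrete, the following is a minimal sketch, under simplifying assumptions, of how free-form citation strings might be grouped and indexed by the papers that cite them. The citation_key heuristic (a bag-of-words key) and the example data are hypothetical and are not the algorithm used by ResearchIndex.

```python
import re
from collections import defaultdict

def citation_key(citation: str) -> str:
    """Crude bag-of-words key: lowercase, strip punctuation, sort unique tokens.

    Real citation matching (including in ResearchIndex) is far more involved;
    this only illustrates grouping differently formatted citations to one work.
    """
    text = re.sub(r"[^a-z0-9 ]", " ", citation.lower())
    return " ".join(sorted(set(text.split())))

def build_citation_index(citing_papers: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each normalized cited work to the set of papers that cite it."""
    index: dict[str, set[str]] = defaultdict(set)
    for paper_id, citations in citing_papers.items():
        for citation in citations:
            index[citation_key(citation)].add(paper_id)
    return index

# Hypothetical data: two papers citing the same (made-up) work in different formats.
papers = {
    "paper-A": ["Doe, J. An Example Paper. Example Press, 1999."],
    "paper-B": ["J. Doe, An example paper, Example Press (1999)"],
}

for key, citers in build_citation_index(papers).items():
    print(key, "->", sorted(citers))   # both papers group under one key
```

Even this toy example suggests why matching is error-prone: citations that differ in abbreviations, wording, or added page numbers would produce different keys, which is one of the error sources discussed next.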
There are several sources of potential errors in ResearchIndex. For example, errors can be made when extracting the title and author information from citations and documents, and errors can be made when matching citations to the same article (citations can be written in many different formats, and ResearchIndex attempts to group together citations to the same paper). This paper focuses on the use of distributed error correction to correct these errors.

We note that there are techniques that could potentially reduce the error rate of autonomous systems such as ResearchIndex. For example, Cameron [2] proposed a universal bibliographic and citation database that would link every scholarly work ever written. Cameron's proposal includes the requirement that authors or institutions provide citation information in a standardized format, which removes the difficulty involved in parsing free-form citations. However, such a method imposes a substantial overhead on the authors or institutions, and has not gained widespread acceptance. Another possibility is the use of universal identifiers [1], such as those used in the Los Alamos e-Print Archive (http://xxx.lanl.gov). However, this also requires effort for the authors to look up the identifiers, and the use of identifiers for citations in the Los Alamos archive varies significantly by discipline [6].

Even with improved algorithms for the tasks performed by ResearchIndex, it is very unlikely that perfect algorithms could be created for most tasks. For example, perfect algorithms for title/author extraction in citations would have to correct for errors made by the article authors and errors made in the conversion from Postscript/PDF to text (Postscript programs can be written in many different ways – the conversion task is relatively simple to do with high accuracy but very difficult to do perfectly [14]).

Distributed Error Correction

In distributed error correction, individual users are able to correct errors that they find while using an online system. The following sections discuss issues involved in using distributed error correction for online databases, with specific focus on the application to ResearchIndex.

Trust

In distributed error correction, individual users can correct errors in a database. An important and immediate question is: how do we prevent malicious users from corrupting the database? Various schemes could be used to validate and assign degrees of trust to users (for example, techniques similar to those used with PGP [15]). However, they all involve some overhead, which would limit the fraction of users providing corrections. We therefore focus on detection rather than prevention of malicious users. However, as is common with many web sites, we can optionally require a validated email address, i.e. we request an email address and immediately send a message to that address asking for confirmation before allowing a user to make any changes. If malicious changes were consistently made from free email addresses (Hotmail, Yahoo!, etc.), these could be disallowed, since most legitimate users are likely to have email addresses at universities or research labs.

Recovering

The first observation is that no matter what methods are used, there is always the possibility of malicious or accidentally incorrect changes being made to the database. Therefore, we keep a transaction log of all changes, which allows changes to be rolled back. This also allows easy application of corrections to new databases that may contain the same documents.
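As an illustration of the transaction log just described, here is a minimal sketch assuming a simple in-memory record/field data model; the Correction and CorrectionLog names and fields are hypothetical and do not reflect the ResearchIndex schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class Correction:
    """One logged user correction: enough information to re-apply or roll back."""
    record_id: str      # e.g. a document or citation identifier
    field_name: str     # e.g. "title" or "authors"
    old_value: Any
    new_value: Any
    user: str           # validated email address of the contributor
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class CorrectionLog:
    def __init__(self) -> None:
        self.entries: list[Correction] = []

    def apply(self, db: dict, correction: Correction) -> None:
        """Apply a correction to the database and record it in the log."""
        db[correction.record_id][correction.field_name] = correction.new_value
        self.entries.append(correction)

    def rollback_user(self, db: dict, user: str) -> None:
        """Undo all corrections by one user, newest first.

        For simplicity this ignores interleaved corrections from other users.
        """
        for c in reversed([c for c in self.entries if c.user == user]):
            db[c.record_id][c.field_name] = c.old_value
        self.entries = [c for c in self.entries if c.user != user]

    def replay(self, db: dict) -> None:
        """Re-apply the log to a new database containing the same records."""
        for c in self.entries:
            if c.record_id in db:
                db[c.record_id][c.field_name] = c.new_value
```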
Since ResearchIndex is freely available, multiple organizations may be obtaining correction information from users, and this information may be distributed to each organization for the correction of identical documents or citations.

Detecting Malicious Users

Malicious changes to the ResearchIndex database may not be a significant problem, due to the target audience of scientific researchers. The Los Alamos e-Print archive (http://xxx.lanl.gov) has not had any difficulty with malicious users [5]. However, various methods for detecting malicious changes are possible. For example, consider changes to the title and author information for indexed articles and citations. The new title and author information should exist in the article header or citation, although there may be errors. Edit distance [12, 8] or similar algorithms could be used to analyze changes from each user; a simple sketch of such a check is shown below. If no strings similar to the new information are contained in the original citation or article header, this may indicate a malicious (or accidentally incorrect) change.

Motivation

Users are known for not wanting to spend time providing explicit feedback. On the web, most attempts to use relevance feedback have resulted in very small amounts of participation. Therefore, an important question is: how do we motivate users to correct the database?

For scientific literature, one strong motivation is for authors to correct information relating to their own publications, which improves the accessibility of their research. Another possibility is to provide users with alternative incentives to correct errors, for example payments, displaying credits for corrections, or increased status within the system. One technique used in ResearchIndex is to highlight the advantages of making corrections immediately when they are made. In particular, correcting the title and author information on a document in ResearchIndex can enable the system to link the document with corresponding citations in other articles. When this is possible, we immediately perform the linking, notify the user, and provide a link to the respective citations on the correction response page.

A related issue is the complexity of the correction process. In general, overhead limits usage, and excessive overhead can effectively prevent usage. An interesting analogy is the web itself, arguably a large, ad hoc, poorly organized information resource, full of dead links, and lacking built-in support for features such as content indexing and access payments. These deficiencies are in principle solvable, and indeed proposals for hypertext systems without these deficiencies existed long before the web (e.g., Xanadu [13]). However, the reality of designing, implementing, and participating in more idealized hypertext systems, namely greater overhead for designers and participants, has prevented the widespread success of such systems. On the other hand, a
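Returning to the check mentioned in the Detecting Malicious Users section, the following is a minimal, hypothetical sketch of an edit-distance sanity check on submitted corrections. The looks_suspicious function, the sliding-window comparison, and the 0.3 threshold are illustrative assumptions, not part of the ResearchIndex implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming (two rolling rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def looks_suspicious(new_value: str, source_text: str, threshold: float = 0.3) -> bool:
    """Return True if no substring of source_text is close to the submitted value.

    Slides a window of the same length as the submitted value over the source
    text and takes the best normalized edit distance; a high best distance
    suggests a malicious (or accidentally incorrect) change.
    """
    new_value, source_text = new_value.lower(), source_text.lower()
    n = len(new_value)
    if n == 0 or len(source_text) < n:
        return True
    best = min(edit_distance(new_value, source_text[i:i + n])
               for i in range(len(source_text) - n + 1))
    return best / n > threshold

# Hypothetical use: a title correction checked against extracted header text.
header = "A Survey of Digital Libraries   John Smith and Jane Doe   NEC Research"
print(looks_suspicious("A Survey of Digital Libraries", header))  # False: found in header
print(looks_suspicious("Completely Unrelated Title", header))     # True: nothing similar
```

Since the extracted header or citation text may itself contain errors, a flagged change would more plausibly be queued for review than rejected outright.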