Are Your Citations Clean? New Scenarios and Challenges in Maintaining Digital Libraries

In many scientific-publication digital libraries (DLs) such as CiteSeer, arXiv e-Print, DBLP, or Google Scholar, “citations” play an important role. (The term “citation” refers to the collection of bibliographic information, such as author name, title, publication venue, or year, that is pertinent to a particular article.) Users often rely on citations to find information of interest in DLs, and researchers depend on citations to determine the impact of an article in DLs. In addition, when DLs are integrated, citations act as unique identifiers of the associated documents. Therefore, it is important for DLs to keep the citations of stored documents consistent and up-to-date. In general, however, keeping citations clean and consistent is a non-trivial task. Some of the challenges include: (1) data entry errors, (2) various citation formats, (3) lack of (the enforcement of) a standard, (4) imperfect citation gathering software, (5) common author names or abbreviations of publication venues, and (6) large-scale citation data.