ADS's Dexter Data Extraction Applet

Demleitner et al.

The NASA Astrophysics Data System (ADS) now holds 1.3 million scanned pages, containing numerous plots and figures for which the original data sets are lost or inaccessible. The availability of scans of the figures can significantly ease the regeneration of the data sets. For this purpose, the ADS has developed Dexter, a Java applet that supports the user in this process. Dexter’s basic functionality is to let the user manually digitize a plot by marking points and defining the coordinate transformation from the logical to the physical coordinate system. Advanced features include automatic identification of axes, tracing lines, and finding points matching a template. This contribution describes the operation of Dexter from a user’s point of view and discusses some of the architectural issues we faced during implementation.

© Copyright 2001 Astronomical Society of the Pacific. All rights reserved.

1. Dexter’s Operation

The ADS provides access to the full text of over 200,000 scientific papers published in astronomical journals, conference proceedings, newsletters, bulletins and books, for a total of over 1.3 million pages. The ADS article service allows users to view individual pages using any web browser with graphical capabilities. When viewing a scanned page, the user can start the Dexter applet by following the link available below the image.

As described in its help page (http://adsabs.harvard.edu/Dexter/Dexterhelp.html), Dexter can easily be used to extract data; a relatively simple case is depicted in Figure 1. After starting Dexter, the user selects the portion of the page containing the plot or figure to be analyzed. (It is advisable to keep this portion as small as possible, both to reduce the memory requirements of the Java virtual machine and to facilitate the automatic feature extraction algorithms.) When Dexter’s main window has popped up, it is generally worthwhile to attempt an automatic detection of the axes (in the Recognize menu). In the example of Fig. 1, this has worked well; in other cases it may be necessary to mark the axes manually or to correct Dexter’s axes by clicking and dragging their ends. Next one fills in the text fields for the start and end values of the axes (marked with a large “1” in Fig. 1), which completes the information needed by the applet to display physical (graph) instead of logical (pixel) coordinates in the status line (“2” in Fig. 1).

Figure 1. An illustration of Dexter’s main window after the data from a graph was extracted. Marked points and axis markings have been emphasized to enhance their visibility in this grey-scale rendering.

To actually mark points (the crosses in the figure), one uses the “Find Points” function from the Recognize menu and clicks on a sample point. Dexter then marks all similar points, where the similarity threshold can be adjusted in “Recognizer Settings”. Occasionally Dexter will miss overlapping points or encounter similar mishaps, or it may have a hard time following the correct path when tracing lines. In these cases, it is necessary to mark the points manually, which is done by clicking on their positions in the graph. Error bars can be added by clicking on a point and dragging the bar (error bars are only available after the coordinate system has been set up). The magnifying glass (“3” in Fig. 1) may help here and can be activated by clicking on it. To work around bugs in some Java virtual machines, it is inactive by default.

When all the points are marked, any method listed in the File menu can be used to obtain the resulting data set: “Show Data” outputs it in the text field labeled “4” in Fig. 1, “Send Data” usually opens a browser window, and “Save Data” will save it in a text file on the user’s machine (provided the browser is correctly configured).
The name of the text file can be set in the “File name:” field (“5” in Fig. 1) and defaults to “bibcode.page” (bibcode being the article’s bibliographic code and page the page’s sequential number within the article).
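
The conversion from logical (pixel) to physical (graph) coordinates that the axis setup makes possible can be illustrated with a minimal Java sketch. It is not taken from Dexter’s source: the class name Axis and the method valueAt are hypothetical, and the sketch assumes linear axes that are perpendicular to each other in the scan. Each pixel position is projected onto an axis, and the resulting fraction along the axis is interpolated between the start and end values entered by the user.

```java
/**
 * Hypothetical helper, not Dexter's actual code: one plot axis, defined by
 * the pixel positions of its two ends and the graph values at those ends.
 * Assumes a linear scale along the axis.
 */
final class Axis {
    private final double px0, py0;      // pixel position of the axis start
    private final double px1, py1;      // pixel position of the axis end
    private final double startValue;    // graph value at the axis start
    private final double endValue;      // graph value at the axis end

    Axis(double px0, double py0, double px1, double py1,
         double startValue, double endValue) {
        if (px0 == px1 && py0 == py1) {
            throw new IllegalArgumentException("axis ends must differ");
        }
        this.px0 = px0;
        this.py0 = py0;
        this.px1 = px1;
        this.py1 = py1;
        this.startValue = startValue;
        this.endValue = endValue;
    }

    /** Graph value of pixel (px, py): project onto the axis, then interpolate. */
    double valueAt(double px, double py) {
        double dx = px1 - px0;
        double dy = py1 - py0;
        // Fraction along the axis: 0 at the start, 1 at the end.
        double t = ((px - px0) * dx + (py - py0) * dy) / (dx * dx + dy * dy);
        return startValue + t * (endValue - startValue);
    }

    public static void main(String[] args) {
        // Pixel y grows downward, so the y axis runs from y=400 up to y=100.
        Axis xAxis = new Axis(100, 400, 500, 400, 0.0, 10.0);
        Axis yAxis = new Axis(100, 400, 100, 100, 0.0, 3.0);
        System.out.println(xAxis.valueAt(300, 250)); // prints 5.0
        System.out.println(yAxis.valueAt(300, 250)); // prints 1.5
    }
}
```

Because each axis is described by both of its end points rather than by a single scale factor, the same projection also works for a slightly rotated scan, as long as the two axes remain perpendicular.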