Grammatical Induction and Recognition of the Documentary Form of Records

This paper gives digital curators a more precise understanding of the concept of documentary form and shows how documentary form can be learned automatically from a sample of records of a particular document type. The ability to recognize documentary form automatically enables item description; item description enables file unit description, which in turn enables automatic series description. This technology can reduce the effort required of an appraisal archivist to assess the value of record series containing large numbers of e-records of different documentary forms. It can also give archivists earlier intellectual control of accessioned e-record series by providing preliminary scope and content notes for these series. Item descriptions also provide additional ways of indexing and searching collections of records.

Introduction

Among the challenges archivists face in appraising e-records and gaining intellectual control of accessioned e-records are the enormous volume of records and the time required to read and understand their content. According to one source, "the Clinton White House generated 38 million e-mail messages (and the current Bush White House is expected to generate triple that number)." [3] Archivists must review presidential records page by page before the records can be disclosed to the public or it is determined that there are restrictions on disclosure. Data collected on declassification review indicate that a reviewer can review, on average, one page per minute, or 60 pages per hour. Given 1920 work hours per year, an archivist doing nothing other than review could be expected to review about 115,000 pages per year. NARA provides eight archivists to each Presidential Library, one of whom is a Supervisory Archivist. Assuming seven archivists reviewing records, and an email with attachments averaging one page in length, they could review about 800,000 email messages per year.
At this rate, it will take 125 years for Presidential Library archivists to review and describe the Bush Administration's email for the first time.

The next section describes a method for recognizing the documentary form of records created by office applications such as word processors, spreadsheets, and database management systems. It is then shown how the ability to recognize document type automatically enables the automatic description of items, file units, and record series. Finally, the paper discusses how these technologies can aid archivists in appraising e-records and gaining intellectual control of accessioned e-records.
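
The review-time estimate in the Introduction can be checked with a short calculation. The following Python sketch uses only the figures quoted there (60 pages per hour, 1920 work hours per year, seven reviewing archivists, one page per email); the function and constant names are illustrative and not part of any system described in this paper. The text does not state the message count behind the 125-year figure, so the 100 million used here is an assumption: it reproduces 125 years at the rounded rate of 800,000 messages per year, whereas literally tripling the 38 million Clinton-era messages gives 114 million and a somewhat longer estimate.

```python
# Back-of-the-envelope check of the review-time estimate.
# All rates come from the Introduction; the message volumes passed to
# years_to_review() below are assumptions, not figures given in the text.

PAGES_PER_HOUR = 60          # one page per minute
WORK_HOURS_PER_YEAR = 1920
REVIEWING_ARCHIVISTS = 7     # eight per library, minus the Supervisory Archivist
PAGES_PER_EMAIL = 1          # email plus attachments averages one page
CLINTON_EMAILS = 38_000_000  # from the quoted source [3]


def pages_per_archivist_per_year() -> int:
    """Pages one full-time reviewer can cover in a year."""
    return PAGES_PER_HOUR * WORK_HOURS_PER_YEAR


def emails_per_team_per_year() -> int:
    """Messages the seven reviewing archivists can cover in a year."""
    return REVIEWING_ARCHIVISTS * pages_per_archivist_per_year() // PAGES_PER_EMAIL


def years_to_review(total_emails: int, emails_per_year: int) -> float:
    """Years needed to review total_emails at the given yearly rate."""
    return total_emails / emails_per_year


if __name__ == "__main__":
    print("Pages per archivist per year:", pages_per_archivist_per_year())
    print("Emails per team per year:", emails_per_team_per_year())

    # Assumption: about 100 million messages, reviewed at the rounded
    # rate of 800,000 per year used in the text.
    print("Years (100M at rounded rate):", years_to_review(100_000_000, 800_000))

    # Assumption: exactly triple the Clinton-era volume, at the unrounded rate.
    tripled = 3 * CLINTON_EMAILS
    print("Years (tripled volume):",
          round(years_to_review(tripled, emails_per_team_per_year()), 1))
```

Using the unrounded team rate of 806,400 messages per year instead of 800,000 changes the result only slightly; the estimate is far more sensitive to the assumed message volume.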