Advanced Decision Support for Archival Processing of Presidential Electronic Records: Final Scientific and Technical Report

The overall objective of this project is to develop and apply advanced information technology to decision problems that archivists at the Presidential Libraries encounter when processing electronic records. The issues addressed include areas relevant to national security: automated content analysis, automatic summarization, advanced information retrieval, advanced decision support for access restrictions and declassification, information security, and Global Information Grid technology. These are also important research areas for the U.S. Army.
The performance of the previously developed Information Extraction tool has been improved by the inclusion of additional wordlists and JAPE rules. Additional semantic categories such as facilities, legislative bills and statutes, governments, and relative temporal expressions are now annotated. An experiment with actual presidential e-records indicates recall, precision, and F-measure all greater than 0.90.
A method for automatic document type recognition and metadata extraction has been implemented and successfully tested. It builds on the method for automatically annotating semantic categories such as persons’ names, dates, and postal addresses, and extends it by: (1) identifying about 100 types of intellectual elements of documents, (2) parsing these elements using context-free grammars that define the documentary form of document types, and (3) interpreting the pragmatics of the documentary form to identify some or all of the following metadata: the chronological date, author(s), addressee(s), and topic. This metadata can be used for indexing and searching collections of records by person, organization, and location names, topics, dates, authors’ and addressees’ names, and document types. It can also be used for automatically describing items, file units, and record series.
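The three steps of the document type recognition method can be illustrated with a minimal Python sketch. It is not the project's implementation: plain regular expressions and a small backtracking recognizer stand in for the project's element recognizers and grammar parser, and the element names, the memorandum grammar, and the function names are all hypothetical.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Step 1 (assumed recognizers): map each line to an element type.
# The report identifies about 100 element types; only four plus a
# catch-all BODY are sketched here.
ELEMENT_PATTERNS = [
    ("DATE", re.compile(rf"^(?P<value>(?:{MONTHS}) \d{{1,2}}, \d{{4}})$")),
    ("TO", re.compile(r"^(?:MEMORANDUM FOR|TO):?\s+(?P<value>.+)$", re.I)),
    ("FROM", re.compile(r"^FROM:?\s+(?P<value>.+)$", re.I)),
    ("SUBJECT", re.compile(r"^(?:SUBJECT|RE):?\s+(?P<value>.+)$", re.I)),
]

# Step 2 (assumed grammar): a context-free grammar over element types.
# Symbols that are not keys of GRAMMAR are terminals.
GRAMMAR = {
    "Memorandum": [["Header", "Body"]],
    "Header": [["DATE", "TO", "FROM", "SUBJECT"],
               ["TO", "FROM", "DATE", "SUBJECT"]],
    "Body": [["BODY", "Body"], ["BODY"]],
}
DOCUMENT_TYPES = ["Memorandum"]

# Step 3 (assumed interpretation): which element supplies which metadata field.
FIELD_FOR_ELEMENT = {"DATE": "date", "FROM": "author",
                     "TO": "addressee", "SUBJECT": "topic"}


def label(line):
    """Return (element_type, value) for one non-empty line."""
    text = line.strip()
    for name, pattern in ELEMENT_PATTERNS:
        match = pattern.match(text)
        if match:
            return name, match.group("value")
    return "BODY", text


def derive(symbol, labels, pos):
    """Yield every position at which `symbol` can end when started at `pos`."""
    if symbol not in GRAMMAR:  # terminal: must equal the next element label
        if pos < len(labels) and labels[pos] == symbol:
            yield pos + 1
        return
    for production in GRAMMAR[symbol]:
        ends = [pos]
        for part in production:
            ends = [e2 for e in ends for e2 in derive(part, labels, e)]
        yield from ends


def recognize(text):
    """Return (document_type, metadata), or (None, {}) if no grammar matches."""
    elements = [label(line) for line in text.splitlines() if line.strip()]
    labels = [name for name, _ in elements]
    for doc_type in DOCUMENT_TYPES:
        # The document matches only if the grammar consumes every element.
        if len(labels) in derive(doc_type, labels, 0):
            metadata = {FIELD_FOR_ELEMENT[name]: value
                        for name, value in elements
                        if name in FIELD_FOR_ELEMENT}
            return doc_type, metadata
    return None, {}


SAMPLE = """June 5, 1990
MEMORANDUM FOR: John Smith
FROM: Jane Doe
SUBJECT: Budget review
Please see the attached figures.
Thank you."""

doc_type, metadata = recognize(SAMPLE)
print(doc_type, metadata)
```

Keeping the grammar as data means a further document type could be added by extending `GRAMMAR` and `DOCUMENT_TYPES` without changing the recognizer, which is one plausible reason to define documentary forms as grammars rather than as hand-written rules.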
Speech acts are acts of speech or writing in which one does something just by saying something, for example, “I appoint you...” or “I hereby proclaim...”. One hundred twenty Presidential records were analyzed for the expression of speech acts with performative verbs and of speech acts about the author’s past or future speech acts or others’ speech acts. More than 60 kinds of speech acts were discovered in the corpus. The analysis confirms that performative verbs are used to express the actions conveyed by records. A method has been formulated for identifying the speech acts occurring in e-records. It will be implemented, tested using records from the analyzed corpus, and then experimentally evaluated.
A method for automatically identifying the topics of e-records would facilitate automatic description of the records and subsequent access to record collections. A corpus of fifty presidential records of various documentary forms was analyzed to determine the topic(s) of the records and possible techniques for automatically identifying the topics. The linguistics literature addressing discourse topic was reviewed, as were technologies for domain-independent document summarization. An approach is proposed for identifying topics in presidential e-records that combines domain-dependent and domain-independent methods.
A tool called the Access Restriction Checker is being developed to support archivists in archival review. Progress to date includes interfacing the prototype to the results of document type recognition and metadata extraction. The results of speech act and topic recognition still need to be provided to the Access Restriction Checker.
The Presidential Electronic Records PilOt System (PERPOS) has been tested by archivists at the Bush Presidential Library in processing Presidential records in response to FOIA requests. The results of the pilot test include: (1) the conclusion that the tool substantially supports FOIA processing, (2) the identification of additional features that would better meet the needs of archivists in FOIA processing of e-records, and (3) the adaptation of PERPOS to include some of these features.
Due to rapid changes in computer technology, archivists must be concerned not only with the obsolescence of e-record file formats, but also with the obsolescence of the operating systems, database management systems, and integrated development environments of their archival systems. PERPOS is a case in point. Two exercises were conducted in using conversion tools to migrate Visual Basic 6 (VB6) modules of PERPOS to Visual Basic 8 and to Java. Both exercises produced components that were functionally the same as the components written in VB6. The migration tools were judged useful, though substantial manual recoding was necessary. It was also concluded that, to improve the maintainability of PERPOS, the more complex projects of the PERPOS architecture should be refactored into smaller, simpler classes.
File format identification is a core requirement for digital archives. The UNIX file command is among the most promising technologies for file type identification, but its reliability needs to be demonstrated. A database system for managing file format information and creating the magic file used by the file command is described. A graphical user interface has been developed for the file command. File signature tests have been created for more than 800 file formats. The performance of the file command and the file signature tests is being evaluated on examples of the file formats that they purportedly identify.
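The idea behind a file signature test can be illustrated with a minimal Python sketch. It is not the file command, its magic file syntax, or the project's database: the small signature table and the function names are assumptions for illustration, and only a few widely documented signatures are listed.

```python
from typing import Optional

# Illustrative signature table: (byte offset, magic bytes, format name).
# This is an assumed stand-in for the 800+ signature tests mentioned above.
SIGNATURES = [
    (0, b"%PDF-", "PDF document"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"GIF87a", "GIF image"),
    (0, b"GIF89a", "GIF image"),
    (0, b"\xff\xd8\xff", "JPEG image"),
    (0, b"PK\x03\x04", "ZIP archive"),
]


def identify(data: bytes) -> Optional[str]:
    """Return the first format whose magic bytes occur at the stated offset."""
    for offset, magic, name in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return name
    return None


def identify_file(path: str, probe_size: int = 64) -> Optional[str]:
    """Read only the leading bytes of a file and test them against the table."""
    with open(path, "rb") as handle:
        return identify(handle.read(probe_size))


print(identify(b"%PDF-1.4\n"))
print(identify(b"plain text with no signature"))
```

A match on a short signature establishes less than it may appear to. The `PK\x03\x04` bytes, for instance, identify only a ZIP container, and formats packaged as ZIP files would need further tests to be told apart, which is one reason the reliability of signature-based identification has to be evaluated against real examples.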
