Data Integration and Access - The Digital Government Research Center's Energy Data Collection (EDC) Project

This chapter describes the progress of the Digital Government Research Center (DGRC) in tackling the challenges of integrating and accessing the massive amount of statistical and text data available from government agencies. In particular, we address database heterogeneity, size, distribution, and control of terminology. We provide an overview of our results on (1) ontological mappings for terminology standardization, (2) data integration across databases with high-speed query processing, and (3) interfaces for query input and presentation of results. The DGRC is a collaboration between researchers from Columbia University and the Information Sciences Institute of the University of Southern California. It employs technology developed at both locations, in particular the SENSUS ontology, the SIMS multi-database access planner, the LEXING automated dictionary and terminology analysis system, and a main-memory query processing component, among others. The pilot application targets gasoline data from the Bureau of Labor Statistics, the Energy Information Administration of the Department of Energy, the Census Bureau, and other government agencies.
