Rapid Integration of Online and Geospatial Data Sources for Knowledge Discovery

Much of the work on information integration has focused on the dynamic integration of structured data sources, such as databases or XML data. For more complex geospatial data types, such as imagery, maps, and vector data, researchers have focused on integrating specific types of information, such as placing points or vectors on maps, but much of this integration is only partially automated. Given the huge amount of geospatial data now available and the enormous amount of data on the Web, there is a tremendous opportunity to exploit the integration of online sources with geospatial sources for knowledge discovery. The challenge is that the dynamic integration of online data and geospatial data is beyond the state of the art of existing integration systems.
There are two general challenges that must be addressed in order to fully exploit the combination of these different types of sources. First, automated techniques are needed to integrate the diverse source types. For example, techniques for integrating maps with imagery, or online schedules with road or rail vectors, are needed in order to mine the information that becomes available when these source types are combined. Second, given the ability to integrate these diverse types of sources, general integration and visualization frameworks are needed to rapidly assemble these sources to support knowledge discovery. For example, one might want a mediator that supports ad hoc queries that require dynamically integrating geospatial and online data sources. Alternatively, one might want a more specialized framework that integrates specific types of sources to support a given knowledge discovery task.
To illustrate the importance of integrating online and geospatial data sources, consider the inadvertent bombing of the Chinese Embassy in Belgrade.
On 7 May 1999, B-2 bombers dropped five GPS-guided bombs on what had been incorrectly identified as the headquarters of the Yugoslav Federal Directorate for Supply and Procurement (FDSP). An intelligence analyst had correctly determined that the address of the FDSP headquarters was Bulevar Umetnosti 2, but the analyst then used a flawed procedure to identify the geographic coordinates of that address. The results were tragic, especially given that the telephone book contained the data needed to determine that the target was in fact the Chinese Embassy and not the FDSP headquarters (Pickering 1999). Using sources available today, the telephone book for Belgrade, which is available online, could be superimposed on an image of Belgrade to determine the likely identity of the buildings in the image. Unfortunately, no system exists today to automate this task.
Consider some other examples of how the integration of these various types of sources can be exploited for knowledge discovery. Online news reports of terrorist events could be superimposed on a map and organized by time to reveal patterns of activity. Online schedules can be integrated with transportation vector data to make predictions about the locations of trains, buses, or ferries. Detailed maps can be integrated with high-resolution satellite imagery to automatically determine the names of the roads in an image; these names are typically not included in the road vector data available from NIMA. There are many other ways in which the integration of these different types of sources could be exploited, and we cannot anticipate all of the ways in which this information could be combined. Thus, what is needed are tools that support an analyst in the rapid, dynamic, and accurate integration of these various types of sources in order to mine the available data.
In the remainder of this paper, we describe some of our initial efforts on geospatial data integration, which illustrate the types of integration that are possible.