R-Cubes: OLAP Cubes Contextualized with Documents

Current data warehouse and OLAP technologies (Kimball and Ross, 2002) can be applied efficiently to analyze the huge amounts of structured data that companies produce. These organizations also produce many text documents and use the Web as their largest source of external information. Although these documents contain highly valuable information that companies should also exploit, current OLAP technologies cannot analyze them because they are unstructured and consist mainly of text. Increasingly, such documents are available in XML-like formats. We propose building XML document warehouses that companies can use to store unstructured information from their internal and external sources. In (Perez et al., 2005) we proposed an architecture for integrating a corporate warehouse of structured data with a warehouse of text-rich XML documents. We call the resulting warehouse a contextualized warehouse. Since the XML document warehouse may contain documents on many different topics, we apply well-known information retrieval (IR) techniques (Baeza-Yates and Ribeiro-Neto, 1999) to select the context of analysis from the document warehouse. First, the user specifies an analysis context by supplying a sequence of keywords (e.g., an IR condition like "financial crisis"). Then, the analysis is performed on a so-called R-cube (Relevance cube), which is materialized by retrieving the documents and facts related to the selected context. Each fact in the R-cube is linked to the set of documents that describe its context and is assigned a numerical value representing its relevance with respect to the specified context (e.g., how important the fact is for a "financial crisis"). In (Perez et al., 2005) we provided R-cubes with a data model and an algebra. This paper presents a prototype R-cube system and explains how to use it.
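
As a rough illustration of the workflow described above (keywords in, facts with linked documents and relevance values out), the following Python sketch materializes a toy R-cube in memory. All names (Fact, RCubeFact, doc_score, build_r_cube), the fact-to-document link table, and the term-frequency scoring are assumptions made for illustration; they are not the data model, algebra, or relevance measure of (Perez et al., 2005).

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    """A fact from the corporate warehouse (hypothetical shape)."""
    dims: dict
    measure: float


@dataclass
class RCubeFact:
    """A fact annotated with its relevance and the documents describing it."""
    fact: Fact
    relevance: float
    doc_ids: list = field(default_factory=list)


def doc_score(text, keywords):
    """Toy IR score (assumption): share of document tokens matching a keyword."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in keywords)
    return hits / len(tokens)


def build_r_cube(facts, documents, links, ir_condition):
    """Materialize a toy R-cube for an IR condition such as "financial crisis".

    facts:     dict fact_id -> Fact
    documents: dict doc_id -> document text
    links:     dict fact_id -> list of doc_ids that describe the fact
    """
    keywords = set(ir_condition.lower().split())
    scores = {d: doc_score(text, keywords) for d, text in documents.items()}

    cube = []
    for fact_id, fact in facts.items():
        # Keep only the linked documents that match the analysis context.
        relevant = [d for d in links.get(fact_id, []) if scores.get(d, 0.0) > 0]
        if not relevant:
            continue  # the fact is not described in the selected context
        relevance = sum(scores[d] for d in relevant)
        cube.append(RCubeFact(fact, relevance, relevant))

    # Normalize so that the relevance values in the cube sum to 1.
    total = sum(f.relevance for f in cube)
    for f in cube:
        f.relevance /= total
    return cube
```

In this sketch, a fact whose linked documents never mention the keywords is simply left out of the cube, and the remaining relevance values are normalized so they can be compared across facts. A real system would replace doc_score with a proper IR model.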