Multidimensional content eXploration

Content Management Systems (CMS) store enterprise data such as insurance claims, insurance policies, legal documents, patent applications, or archival data like in the case of digital libraries. Search over content allows for information retrieval, but does not provide users with great insight into the data. A more analytical view is needed through analysis, aggregations, groupings, trends, pivot tables or charts, and so on. Multidimensional Content eXploration (MCX) is about effectively analyzing and exploring large amounts of content by combining keyword search with OLAP-style aggregation, navigation, and reporting. We focus on unstructured data or generally speaking documents or content with limited metadata, as it is typically encountered in CMS. We formally present how CMS content and metadata should be organized in a well-defined multidimensional structure, so that sophisticated queries can be expressed and evaluated. The CMS metadata provide traditional OLAP static dimensions that are combined with dynamic dimensions discovered from the analyzed keyword search result, as well as measures for document scores based on the link structure between the documents. In addition, we provide means for multidimensional content exploration through traditional OLAP rollupdrilldown operations on the static and dynamic dimensions, solutions for multi-cube analysis and dynamic navigation of the content. We present our prototype, called DBPubs, which stores research publications as documents that can be searched and -most importantly-- analyzed, and explored. Finally, we present experimental results of the efficiency and effectiveness of our approach.

[1]  Christian S. Jensen,et al.  A foundation for capturing and querying complex multidimensional data , 2001, Inf. Syst..

[2]  Berthold Reinwald,et al.  DBPubs: multidimensional exploration of database publications , 2008, Proc. VLDB Endow..

[3]  Alberto O. Mendelzon,et al.  Capturing summarizability with integrity constraints in OLAP , 2005, TODS.

[4]  Ralph Kimball,et al.  The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses with CD Rom , 1998 .

[5]  Gottfried Vossen,et al.  Multidimensional normal forms for data warehouse design , 2003, Inf. Syst..

[6]  John D. Lafferty,et al.  Dynamic topic models , 2006, ICML.

[7]  Vagelis Hristidis,et al.  ObjectRank: Authority-Based Keyword Search in Databases , 2004, VLDB.

[8]  Arie Shoshani,et al.  Summarizability in OLAP and statistical data bases , 1997, Proceedings. Ninth International Conference on Scientific and Statistical Database Management (Cat. No.97TB100150).

[9]  Hans-Jörg Schek,et al.  Digital library information-technology infrastructures , 2005, International Journal on Digital Libraries.

[10]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[11]  Daniel S. Whelan FileNet integrated document management database usage and issues , 1998, SIGMOD '98.

[12]  Hamid Pirahesh,et al.  Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals , 1996, Data Mining and Knowledge Discovery.

[13]  Alan L. Mackay,et al.  Publish or perish , 1974, Nature.

[14]  Hamid Pirahesh,et al.  Extending XQuery for analytics , 2005, SIGMOD '05.

[15]  Raghu Ramakrishnan,et al.  Building Structured Web Community Portals: A Top-Down, Compositional, and Incremental Approach , 2007, VLDB.

[16]  Arie Shoshani,et al.  STORM: A Statistical Object Representation Model , 1990, IEEE Data Eng. Bull..

[17]  Andrew McCallum,et al.  Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[18]  Sergey Brin,et al.  The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[19]  Andrei Z. Broder,et al.  Sampling Search-Engine Results , 2005, WWW '05.

[20]  Wolfgang Lehner,et al.  Normal forms for multidimensional databases , 1998, Proceedings. Tenth International Conference on Scientific and Statistical Database Management (Cat. No.98TB100243).

[21]  Surajit Chaudhuri,et al.  An overview of data warehousing and OLAP technology , 1997, SGMD.

[22]  Berthold Reinwald,et al.  Towards keyword-driven analytical processing , 2007, SIGMOD '07.

[23]  Jeffrey F. Naughton,et al.  Materialized View Selection for Multi-Cube Data Models , 2000, EDBT.

[24]  Nick Koudas,et al.  BlogScope: A System for Online Analysis of High Volume Text Streams , 2007, VLDB.

[25]  Ralph Kimball,et al.  The Data Warehouse Lifecycle Toolkit , 2009 .

[26]  Gene H. Golub,et al.  Singular value decomposition and least squares solutions , 1970, Milestones in Matrix Computation.

[27]  Peter J. Haas,et al.  On synopses for distinct-value estimation under multiset operations , 2007, SIGMOD '07.