The concept of document warehousing for multi-dimensional modeling of textual-based business intelligence

Over the past decade, data warehousing has been widely adopted in the business community. It provides multi-dimensional analysis of accumulated historical business data to support administrative decision-making. Nevertheless, it is believed that only about 20% of the information can be extracted from data warehouses, which cover numeric data only; the other 80% is hidden in non-numeric data or even in documents. Therefore, many researchers now advocate research on document warehousing to capture complete business intelligence. Document warehouses, unlike traditional document management systems, include extensive semantic information about documents, cross-document feature relations, and document grouping or clustering to provide more accurate and more efficient access to text-oriented business intelligence. In this paper, we discuss the basic concept of document warehousing and present its formal definitions. We then propose a general system framework and describe some useful applications to illustrate the importance of document warehousing. This work is essential for establishing an infrastructure that helps combine text processing with numeric OLAP processing technologies. The combination of data warehousing and document warehousing will be one of the most important kernels of knowledge management and customer relationship management applications.
