A computational framework for retrieval of document fragments based on decomposition schemes in engineering information management

Abstract: Retrieval of document fragments has great potential for application in engineering information management. Engineers frequently have neither the time nor the inclination to sift through long documents for small pieces of useful information, yet the information they seek is often presented in exactly that form: one or more long documents. The need to deliver the right information, in the right format and in the right quantity, motivates the search for better ways of handling document sub-components, or fragments. Modern computational technologies can facilitate document fragment retrieval. This paper proposes a novel framework for information access that uses state-of-the-art computational technologies and introduces multiple views of document structure through decomposition schemes. The framework integrates document structure study, mark-up technologies, automated fragment extraction, faceted classification and a document navigation mechanism so that specific document fragments can be retrieved using precise, complex queries. These disparate elements have been brought together in an exploratory Engineering Document Content Management System (EDCMS). Investigations with this system, using representative engineering documents, have shown that information users can access and retrieve document content at fragment level rather than at document level. Retrieval works through both the data in a document and document metadata, from different perspectives and at different granularities, and across multiple documents simultaneously as well as within a single document.
