Using the Structural Content of Documents to Automatically Generate Quality Metadata

Over the last decades, document sharing has become vastly more accessible to the general public, with large document collections made available on the internet and, inside organizations, on intranets. In addition, each of us has an ever-increasing archive of private digital documents. At the same time, efforts to enable more efficient document retrieval have succeeded only marginally. Finding the right document is therefore still like looking for a needle in a haystack, only now the haystack is bigger. Because there is no overview of existing document resources, scarce human resources are still spent on creating resources similar to ones that already exist.

A key reason for this challenge is that few documents receive a metadata description sufficient to enable efficient retrieval. Too often the document metadata is insufficient or even incorrect. Few document creators are mindful of describing their documents with metadata. Trained librarians and archivists can assist authors in creating and publishing metadata, but this is a costly and time-consuming process. Advanced metadata formats, such as the IEEE LOM, enable detailed and precise metadata descriptions, but such formats are challenging to use and their potential is often not leveraged. Document formats that require such metadata, e.g. SCORM Learning Objects (LOs), are not being used to their potential because of the challenges of creating metadata.

This thesis shows how Automatic Metadata Generation (AMG) can serve as a foundation for the creation, publishing and discovery of document resources with rich and correct metadata descriptions. It shows how high-quality metadata can be created automatically using the documents themselves and contextual data sources.
Finally, this thesis shows how metadata descriptions can be used alongside the original document to create SCORM LOs, enabling the sharing of educational resources with educational metadata descriptions.

The main contributions of this thesis are:

C1: Establishing an overview of research literature, projects and products using AMG, and of the quality of their generated metadata.

C2: Establishing that AMG efforts can be combined not only to expand the range of elements and entities that can be generated, but also to increase the quality of generated entities.

C3: Establishing that AMG efforts can generate high-quality metadata from non-homogeneous document collections, vastly expanding the practical usefulness of AMG.

C4: Establishing that AMG efforts can contribute extensively to promoting the sharing of knowledge through the creation of sharable SCORM LOs containing the educational resources themselves and extensive metadata descriptions that enable efficient location and use.
