Using a web-based categorization approach to generate thematic metadata from texts

Conventional tools for automatic metadata creation mostly extract named entities or text segments from texts and annotate them with entity types such as person, location, or date. Such entity-type information is often insufficient for machines to understand the facts a text contains, which rules out more advanced, intelligent applications such as concept-based search. In this work, we aim to generate more refined thematic metadata inherent in texts. Our approach mines Web resources to acquire the training corpora needed to describe both the thematic categories and the metadata extracted from the texts. It then uses categorization to find the correspondences between the two, and thereby generates thematic metadata for the textual data. Experimental results confirm the potential and wide adaptability of the approach.
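The categorization step described above can be illustrated with a minimal sketch. The abstract does not specify the features, classifier, or Web search service used, so everything here is an assumption: the Web-mined corpora are stood in for by hard-coded, hypothetical snippet lists, terms are weighted with a smoothed TF-IDF scheme, and an extracted term is assigned to the thematic category whose corpus is most similar under cosine similarity. The names `tfidf_vectors`, `cosine`, and `rank_categories` are illustrative and do not come from the original work.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and keep purely alphabetic whitespace-separated tokens.
    return [tok for tok in text.lower().split() if tok.isalpha()]


def tfidf_vectors(corpora):
    """Map each corpus name to a sparse {term: weight} TF-IDF vector.

    corpora: dict of name -> list of snippet strings (assumed to have been
    retrieved from the Web beforehand; retrieval is not shown here).
    """
    term_counts = {
        name: Counter(tok for snippet in snippets for tok in tokenize(snippet))
        for name, snippets in corpora.items()
    }
    n_docs = len(term_counts)
    doc_freq = Counter(term for counts in term_counts.values() for term in counts)
    return {
        name: {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
        for name, counts in term_counts.items()
    }


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_categories(term_snippets, category_snippets):
    """Rank thematic categories for one extracted term, best match first."""
    corpora = dict(category_snippets)
    corpora["__term__"] = term_snippets  # reserved key for the extracted term
    vectors = tfidf_vectors(corpora)
    term_vector = vectors.pop("__term__")
    scores = [(name, cosine(term_vector, vec)) for name, vec in vectors.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical snippets standing in for Web search results.
CATEGORY_SNIPPETS = {
    "finance": ["stock market shares trading investors", "bank interest rates market"],
    "sports": ["football match team players score", "team wins championship match"],
}
TERM_SNIPPETS = ["nasdaq stock index market trading"]

ranking = rank_categories(TERM_SNIPPETS, CATEGORY_SNIPPETS)
print(ranking)
```

In this toy setup the extracted term shares vocabulary only with the "finance" corpus, so that category ranks first. A real system would need actual Web retrieval, a larger category taxonomy, and a thresholding strategy for deciding when no category fits.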
