Automatic Semantic Subject Indexing of Web Documents in Highly Inflected Languages

Structured semantic metadata about unstructured web documents can be created using automatic subject indexing methods, avoiding laborious manual indexing. A successful automatic subject indexing tool for the web should work with texts in multiple languages and be independent of both the documents' domain of discourse and the controlled vocabularies used. However, analyzing text written in a highly inflected language requires word form normalization that goes beyond rule-based stemming algorithms. We tested the state-of-the-art automatic indexing tool Maui on Finnish texts using three word form normalization algorithms (one rule-based stemmer and two lemmatizers), with documents and vocabularies from different domains. Both lemmatization algorithms performed significantly better than the rule-based stemmer, and the subject indexing quality was found to be comparable to that of human indexers.
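To illustrate why suffix stripping alone can fall short for Finnish, here is a minimal sketch. It is a toy, not the stemmer or lemmatizers evaluated in this work: the suffix list, the lexicon, and the function names `toy_stem`, `toy_lemmatize`, and `match_vocabulary` are all hypothetical. It relies on consonant gradation, where the stem of a word changes under inflection (katu, "street", becomes kadun and kadulla), so a stripped stem no longer equals the vocabulary term's base form.

```python
# Toy comparison of suffix stripping vs. dictionary lemmatization for Finnish.
# Everything here is an illustrative assumption, not the evaluated algorithms.

# Hypothetical, tiny list of case endings to strip.
SUFFIXES = ("lla", "ssa", "ssä", "n", "a", "ä")

# Hypothetical lexicon mapping inflected forms to base forms.
TOY_LEXICON = {
    "katu": "katu", "kadun": "katu", "kadulla": "katu", "katua": "katu",
    "käsi": "käsi", "käden": "käsi", "kädessä": "käsi", "kättä": "käsi",
}


def toy_stem(word):
    """Strip the longest matching suffix, keeping at least three characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the toy lexicon; unknown words pass through."""
    return TOY_LEXICON.get(word, word)


def match_vocabulary(tokens, normalize, vocabulary):
    """Return vocabulary terms whose normalized form matches a normalized token."""
    normalized_vocab = {normalize(term): term for term in vocabulary}
    return {
        normalized_vocab[normalize(token)]
        for token in tokens
        if normalize(token) in normalized_vocab
    }


vocabulary = ["katu", "käsi"]          # controlled vocabulary terms (base forms)
tokens = ["kadulla", "kädessä"]        # inflected forms found in a document

print(match_vocabulary(tokens, toy_stem, vocabulary))
print(match_vocabulary(tokens, toy_lemmatize, vocabulary))
```

With the stemmer, "kadulla" is reduced to "kadu" and "kädessä" to "käde", neither of which equals the stemmed vocabulary terms "katu" and "käsi", so no subjects are matched. The lexicon lookup maps both tokens to their base forms and both terms are found. A real lemmatizer would use a morphological analyzer rather than a hand-written table.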
