The Influence of Collocation Segmentation and Top 10 Items to Keyword Assignment Performance

Automatic document annotation from a controlled conceptual thesaurus is useful for establishing precise links between similar documents. This study presents a language-independent document annotation system based on features derived from a novel collocation segmentation method. Using the multilingual conceptual thesaurus EuroVoc, we evaluate filtered and unfiltered versions of the method and compare them against other language-independent methods based on single words and bigrams. Testing our new method on the manually tagged multilingual corpus Acquis Communautaire 3.0 (AC), using all descriptors found there, we attain improvements in keyword assignment precision from 18 to 29 percent and in F-measure from 17.2 to 27.6 when 5 keywords are assigned to each document. Averaged over the 21 languages tested, filtering out the 10 most frequent items improves precision by 4 percent, and collocation segmentation improves precision by 9 percent.
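To make the two quantities in the abstract concrete, the following is a minimal sketch of how the top-items filter and the per-document evaluation might look. It is not the authors' implementation. It assumes that "top 10 frequent items" means the 10 items with the highest raw corpus frequency, and that F-measure is the usual harmonic mean of precision and recall. The function names `filter_top_items` and `precision_recall_f1` are hypothetical.

```python
from collections import Counter


def filter_top_items(docs, n=10):
    """Remove the n most frequent items from every tokenized document.

    Assumption: frequency is raw corpus frequency over all documents;
    the abstract does not say how the top items are ranked.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    top = {tok for tok, _ in counts.most_common(n)}
    return [[tok for tok in doc if tok not in top] for doc in docs]


def precision_recall_f1(assigned, gold):
    """Score one document's assigned keywords against its manual descriptors.

    Assumption: F-measure is the balanced harmonic mean of precision and recall.
    """
    assigned, gold = set(assigned), set(gold)
    hits = len(assigned & gold)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
docs = [["the", "of", "council"], ["the", "of", "regulation"], ["the", "fish"]]
filtered = filter_top_items(docs, n=2)

# Five assigned keywords scored against three manual descriptors.
p, r, f = precision_recall_f1(["a", "b", "c", "d", "e"], ["a", "b", "x"])
```

With 5 keywords assigned per document, as in the abstract, precision is the share of those 5 that match the manually assigned EuroVoc descriptors. The corpus-level figures would then be averages of these per-document scores, although the averaging scheme is not stated in the abstract.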
