Information retrieval from historical newspaper collections in highly inflectional languages: A query expansion approach

The aim of the study was to test whether query expansion by approximate string matching methods is beneficial in retrieval from historical newspaper collections in a language rich in compounds and inflectional forms (Finnish). First, approximate string matching methods were used to generate lists of the index words most similar to contemporary query terms in a digitized newspaper collection from the 1800s. The top index word variants were categorized to estimate appropriate query expansion ranges for the retrieval test. Second, the effectiveness of approximate string matching methods, automatically generated inflectional forms, and their combinations was measured in a Cranfield-style test. Finally, a detailed topic-level analysis of the test results was conducted. In the index of the historical newspaper collection, the occurrences of a word were typically spread across many linguistic and historical variants as well as forms corrupted by optical character recognition (OCR) errors. All query expansion methods improved on the baseline results. Extensive expansion, of around 30 variants for each query word, was required to achieve the highest performance improvement. Query expansion based on approximate string matching was superior to using the inflectional forms of the query words, showing that coverage of the different types of variation is more important than precision in handling one type of variation.
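
The expansion step described above can be illustrated with a minimal sketch. The abstract does not specify which approximate string matching methods were used, so the code below substitutes a generic technique, Dice similarity over padded character bigrams, and should not be read as the study's actual method. The function names, the toy index words, and the default cutoff of 30 (taken from the "around 30 variants" figure in the abstract) are illustrative assumptions.

```python
from collections import Counter


def char_ngrams(word, n=2):
    """Return a multiset of character n-grams of a lowercased, padded word."""
    pad = "_" * (n - 1)
    padded = f"{pad}{word.lower()}{pad}"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def dice_similarity(a, b, n=2):
    """Dice coefficient between the n-gram multisets of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    overlap = sum((grams_a & grams_b).values())
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 2 * overlap / total if total else 0.0


def expand_query_term(term, index_words, k=30, n=2):
    """Return the k index words most similar to the query term.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    ranked = sorted(index_words, key=lambda w: (-dice_similarity(term, w, n), w))
    return ranked[:k]


# Hypothetical index: an exact match, an OCR-style corruption,
# an inflected form, and an unrelated word.
index_words = ["sanomalehti", "sanomalehtl", "sanomalehden", "koira"]
variants = expand_query_term("sanomalehti", index_words, k=3)
```

In this toy index, the corrupted and inflected forms rank above the unrelated word, so a single similarity ranking covers several kinds of variation at once, which is the property the abstract credits for the method's advantage over inflectional form generation alone.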
