Hamshahri: A standard Persian text collection

Persian is one of the dominant languages of the Middle East, and a significant number of Persian documents are available on the Web. Because Persian differs in nature from languages such as English, designing information retrieval systems for Persian requires special consideration. However, there are relatively few studies on the retrieval of Persian documents in the literature, and one of the main reasons is the lack of a standard test collection. In this paper, we introduce a standard Persian text collection, named Hamshahri, which is built from a large number of newspaper articles according to TREC specifications. We also present statistical information about the documents, the queries, and their relevance judgments. We believe this is the largest Persian text collection to date.
