Paper information - An Unsupervised Approach to develop IR System : The case of Urdu

An Unsupervised Approach to develop IR System : The case of Urdu

Web search engines are among the greatest gifts of information and communication technologies to mankind. Without them, efficient access to the information available on the web today would be almost impossible. They play a vital role in the accessibility and usability of internet-based information systems. As the number of internet users grows day by day, so does the amount of information available on the web. Access to that information, however, is not uniform across language communities. English and European languages account for 60% of the information available on the web, yet a wide range of information is also available in other languages, and in the past few years the amount of information available in Indian languages has increased as well. For languages other than English and a few European languages, tools and techniques for efficiently retrieving this information are lacking. For Indian languages in particular, research is still at a preliminary stage, and the available tools and techniques are insufficient. Indian languages are also very resource-poor in terms of IR test data collections. My main focus was therefore on developing a data set for Urdu IR, together with training and testing data for a stemmer. I have developed a language-independent system that facilitates efficient retrieval of information available in Urdu and that can be used for other languages as well. The system achieves a precision of 0.63 and a recall of 0.8. As a first step, I developed an unsupervised stemmer for the Urdu language [1], since stemming is very important in information retrieval.
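The precision and recall figures reported above follow the standard set-based definitions used in IR evaluation. The following is a minimal sketch of those definitions, not the author's evaluation code: the function names and document IDs are hypothetical, and the F1 combination is a common summary measure that the abstract itself does not report.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical document IDs, for illustration only.
retrieved = ["d1", "d2", "d3", "d4", "d5"]  # what the system returned
relevant = ["d1", "d3", "d5", "d9"]         # what the judges marked relevant

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f} f1={f1_score(p, r):.2f}")
```

Under these definitions, a system with higher recall than precision, as reported here, returns most of the relevant documents but also a fair share of irrelevant ones.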

Mohd. Shahid Husain

[1] Tanveer J. Siddiqui, et al. An unsupervised Hindi stemmer with heuristic improvements, 2008, AND '08.

[2] Mohd. Shahid Husain, et al. A Language Independent Approach to Develop Urdu Stemmer, 2012, ACITY.

[3] Swapan K. Parui, et al. A Simple Stemmer for Inflectional Languages, 2008.

[4] Miriam Butt, et al. Non-Nominative Subjects in Urdu: A Computational Analysis, 2001.

[5] Fredric C. Gey, et al. Building an Arabic Stemmer for Information Retrieval, 2002, TREC.

[6] Richard Wicentowski. Multilingual Noise-Robust Supervised Morphological Analysis using the WordFrame Model, 2004, SIGMORPHON@ACL.

[7] S.M.J. Rizvi, et al. Modeling case marking system of Urdu-Hindi languages by using semantic information, 2005, 2005 International Conference on Natural Language Processing and Knowledge Engineering.

[8] Mehrnoush Shamsfard, et al. A Bottom Up approach to Persian Stemming, 2008, IJCNLP.

[9] S.M.J. Rizvi, et al. Analysis, Design and Implementation of Urdu Morphological Analyzer, 2005, 2005 Student Conference on Engineering Sciences and Technology.

[10] Robert Krovetz, et al. Viewing morphology as an inference process, 1993, Artif. Intell.

[11] Naglaa Thabet. Stemming the Qur’an, 2004.

[12] Sarmad Hussain, et al. Assas-band, an Affix-Exception-List Based Urdu Stemmer, 2009, ALR7@IJCNLP.