Paper Information - Automatic Stemming of Words for Punjabi Language

The main task of a stemmer is to reduce words that are not in their original form, and are therefore absent from the dictionary, to their root words. After stemming, the stemmer looks the resulting word up in the dictionary. If no match is found, the word may be incorrect or may be a name; otherwise the word is correct. For any language, a stemmer is a basic linguistic resource needed to develop accurate Natural Language Processing (NLP) applications such as machine translation, document classification, document clustering, text question answering, topic tracking, text summarization, and keyword extraction. This paper addresses fully automatic stemming of Punjabi words, covering Punjabi nouns, verbs, adjectives, adverbs, pronouns, and proper names. A list of 18 suffixes for Punjabi nouns and proper names, a number of other suffixes for Punjabi verbs, adjectives, and adverbs, and separate stemming rules for Punjabi nouns, verbs, adjectives, adverbs, pronouns, and proper names were generated from an analysis of a Punjabi corpus. This is the first time that a complete Punjabi stemmer covering nouns, verbs, adjectives, adverbs, pronouns, and proper names has been proposed, and it should be useful for developing other Punjabi NLP applications with high accuracy. The part of the stemmer that handles proper names and nouns has been implemented as part of a Punjabi text summarizer, with MS Access as the back end and ASP.NET as the front end, with 87.37% efficiency.
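The strip-then-look-up flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the suffix list and dictionary below are hypothetical romanized toy data (the paper's 18 suffixes and its class-specific rules are not reproduced on this page), and the function name `stem` is an assumption made for illustration.

```python
# Hypothetical toy data in romanized form; NOT the paper's suffix list
# or dictionary, which are not given in the abstract.
SUFFIXES = ("ian", "an", "e")
DICTIONARY = {"kuri", "munda", "kitab"}


def stem(word, suffixes=SUFFIXES, dictionary=DICTIONARY):
    """Return (stem, found_in_dictionary) for a single word.

    A word already in the dictionary is returned unchanged. Otherwise,
    candidate suffixes are stripped longest-first and the first candidate
    that appears in the dictionary is accepted. If nothing matches, the
    word is returned as-is and flagged as not found (it may be a name or
    a misspelling).
    """
    if word in dictionary:
        return word, True
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Require a non-empty remainder after removing the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = word[: -len(suffix)]
            if candidate in dictionary:
                return candidate, True
    return word, False


if __name__ == "__main__":
    for w in ("kurian", "kitaban", "munda", "gupta"):
        print(w, stem(w))
```

Checking each candidate against the dictionary before accepting it means an overly long suffix match (here `ian` on `kurian`, which would leave `kur`) is rejected in favor of a shorter suffix that yields a known root. A real stemmer would also need the per-class rules the paper describes.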

Vishal Gupta
