Words Stemming Based on Structural and Semantic Similarity

Word stemming is an important problem in natural language processing and information retrieval. Most existing stemming methods are language-dependent, so each stemmer is applicable only to a particular language. This paper therefore proposes a stemming method that aims to be language-independent. The proposed stemmer uses a bilingual dictionary: all words in the dictionary are first clustered according to their structural and semantic similarity, and the stem of a new word is then found by means of the previously formed clusters. To evaluate the proposed scheme, stemming is performed on both Persian and English. The results are encouraging and indicate that the proposed method performs well compared with its counterparts.
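
The two-stage idea described above (cluster the dictionary words, then stem new words against the clusters) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: structural similarity is approximated with the `difflib.SequenceMatcher` ratio, semantic similarity with the Jaccard overlap of a word's translations in a toy bilingual dictionary, clustering is a simple greedy pass, the thresholds are arbitrary, and the stem is taken to be the longest common prefix of the matched cluster. The dictionary entries and transliterations are hypothetical.

```python
from difflib import SequenceMatcher
from os.path import commonprefix

# Hypothetical toy bilingual dictionary: English word -> set of Persian glosses
# (transliterated). Not data from the paper.
TOY_DICT = {
    "write": {"neveshtan"},
    "writes": {"neveshtan"},
    "writing": {"neveshtan", "neveshte"},
    "ring": {"halghe"},
    "rings": {"halghe"},
    "bring": {"avardan"},
}


def structural_sim(a, b):
    """Assumed structural measure: character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def semantic_sim(a, b, dictionary):
    """Assumed semantic measure: Jaccard overlap of the two translation sets."""
    ta, tb = dictionary[a], dictionary[b]
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def cluster_words(dictionary, s_min=0.6, m_min=0.3):
    """Greedy clustering: a word joins the first cluster whose representative
    (first member) is both structurally and semantically similar enough."""
    clusters = []
    for word in sorted(dictionary):
        for cluster in clusters:
            rep = cluster[0]
            if (structural_sim(word, rep) >= s_min
                    and semantic_sim(word, rep, dictionary) >= m_min):
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters


def stem_new_word(word, clusters, s_min=0.6):
    """Stem an unseen word using structure only, since it has no dictionary
    entry: pick the closest cluster and return the common prefix."""
    def closeness(cluster):
        return max(structural_sim(word, member) for member in cluster)

    best = max(clusters, key=closeness)
    if closeness(best) < s_min:
        return word  # no cluster is close enough; leave the word unchanged
    return commonprefix(best + [word])


clusters = cluster_words(TOY_DICT)
print(clusters)
print(stem_new_word("ringed", clusters))
print(stem_new_word("writings", clusters))
```

In this toy setting, "bring" is structurally close to "ring" but shares no translation with it, so the semantic check keeps the two in separate clusters; this is the kind of error that a purely structural stemmer would be prone to.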
