A review of recent advances in text mining of Indian languages

Text mining in the English language has been researched extensively in the past, and a significant number of resources, tools and techniques have been developed. India is a country of high linguistic diversity, and a large amount of textual data is available in Indian languages. Knowledge can be discovered from this text by applying text-mining techniques. Owing to the characteristics of Indian languages, however, the tools, techniques and resources available for mining English text cannot be applied directly to text in Indian languages. We could not find any comprehensive literature describing the research work related to mining of text written in Indian languages. In this paper, we review the research work done so far, the availability of language resources, and the various challenges of text mining tasks in Indian languages.
