Paper information - AMRITA_CEN@FIRE 2016: Code-Mix Entity Extraction for Hindi-English and Tamil-English Tweets

Social media text carries information on many important topics. Extracting that information is the goal of entity extraction, one of the most basic tasks in Natural Language Processing. This work was submitted to the Shared Task on Code Mix Entity Extraction for Indian Languages (CMEE-IL) at the Forum for Information Retrieval Evaluation (FIRE) 2016. Three different methodologies are proposed for entity extraction from code-mixed data: approaches based on embedding models and a feature-based model. Feature extraction included the creation of trigram embeddings and BIO tag formatting. The systems are evaluated with a machine learning based classifier, SVM-Light. Overall cross-validation accuracy indicates that the proposed system is also effective at classifying unknown tokens.

K. P. Soman | M. Anand Kumar | G. Remmiya Devi | P. V. Veena
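
The abstract mentions two feature-extraction steps, BIO tag formatting and trigram embeddings, without spelling them out. The following is a minimal sketch of what such steps could look like, not the authors' implementation. It assumes that entity annotations arrive as token spans and that a "trigram embedding" means concatenating the vectors of the previous, current, and next token; the function names `to_bio` and `trigram_features`, the span format, and the zero padding at sentence edges are all hypothetical choices made for illustration.

```python
from typing import List, Tuple


def to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn entity spans (start, end_exclusive, label) into per-token BIO tags.

    Assumption: spans are non-overlapping token index ranges.
    """
    tags = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the entity
    return tags


def trigram_features(vectors: List[List[float]]) -> List[List[float]]:
    """Concatenate previous, current, and next token vectors.

    Assumption: sentence edges are padded with zero vectors.
    """
    if not vectors:
        return []
    pad = [0.0] * len(vectors[0])
    padded = [pad] + vectors + [pad]
    return [
        padded[i - 1] + padded[i] + padded[i + 1]
        for i in range(1, len(padded) - 1)
    ]


# Illustrative code-mixed sentence; the tokens and the label are made up.
tokens = ["Virat", "Kohli", "ne", "century", "maari"]
bio_tags = to_bio(tokens, [(0, 2, "PERSON")])

# Toy one-dimensional "embeddings" to keep the example readable.
features = trigram_features([[1.0], [2.0], [3.0]])
```

Each concatenated vector, paired with its BIO tag, could then be written out as one training instance for a classifier such as SVM-Light.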
