Embedding Framework for Identifying Ambiguous Words in Code-Mixed Social Media Text

Text on social media today often contains code-switched and code-mixed content. People use such content to express their opinions on any topic in the languages they know. Here, code-mixing is analyzed to find words that can be used in both Hindi and English with different meanings depending on context. This leads to a word sense ambiguity problem, since one word can have a different meaning when it is used in the context of other words in a sentence. Romanized Hindi and English both exhibit word sense ambiguity, and resolving this ambiguity with machine learning models is a current research issue. Character embedding features are used to represent each word written in code-mixed content. The proposed method identifies context words by classifying the intent behind the use of an ambiguous word in a code-mixed sentence. A well-known hierarchical LSTM model is used for context-based, sub-word-level ambiguity detection to identify the language of the word. Using character-based embeddings to process ambiguous words for language identification in code-mixed text is a novel approach and shows promising results.
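The character-level word representation mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the padding length, and the two example sentences are assumptions made for illustration. It only shows how each word in a code-mixed sentence can be turned into a fixed-length sequence of character indices, the usual input to a character embedding layer, and why an ambiguous token such as "to" gets an identical representation in a Romanized Hindi sentence and in an English one, so that the surrounding words must decide its language.

```python
from typing import Dict, List

PAD, UNK = 0, 1  # reserved indices (illustrative choice)


def build_char_vocab(sentences: List[str]) -> Dict[str, int]:
    """Map every non-space character seen in the corpus to an integer index."""
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for sentence in sentences:
        for ch in sentence.lower():
            if not ch.isspace() and ch not in vocab:
                vocab[ch] = len(vocab)
    return vocab


def encode_word(word: str, vocab: Dict[str, int], max_len: int = 8) -> List[int]:
    """Turn one word into a fixed-length list of character indices."""
    ids = [vocab.get(ch, UNK) for ch in word.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def encode_sentence(sentence: str, vocab: Dict[str, int], max_len: int = 8) -> List[List[int]]:
    """Encode each whitespace-separated word of a sentence."""
    return [encode_word(w, vocab, max_len) for w in sentence.split()]


# Hypothetical examples: "to" is a particle in Romanized Hindi
# and a preposition in English.
sentences = ["main to ghar ja raha hoon", "i want to go home"]
vocab = build_char_vocab(sentences)
hindi_ids = encode_sentence(sentences[0], vocab)
english_ids = encode_sentence(sentences[1], vocab)

# The ambiguous word has the same character-level encoding in both sentences.
same_encoding = hindi_ids[1] == english_ids[2]
```

In a full model, these index sequences would be passed through an embedding layer and a sequence model that also sees the neighbouring words. That context is what allows the two uses of "to" to be labelled differently even though their character-level encodings are identical.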
