Building Dictionaries for Low Resource Languages: Challenges of Unsupervised Learning

The development of natural language processing resources for Albanian has grown steadily in recent years. This paper presents research on unsupervised learning and the challenges associated with building a dictionary for the Albanian language and creating part-of-speech tagging models. Most languages have their own dictionary, but low-resource languages lack such resources. A dictionary facilitates the sharing of information and services for users and whole communities through natural language processing. The experimental corpus for the Albanian language comprises 250K sentences from different disciplines, and the paper proposes a part-of-speech tag set that can adequately represent the underlying linguistic phenomena. The purpose of this paper is to contribute to the development of resources for Albanian. Experiments on the Albanian corpus revealed that the language's use of articles and pronouns resembles that of higher-resource languages. According to this study, using the total expected frequency to tag words correctly proved effective for populating the Albanian language dictionary.
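The abstract does not define how the total expected frequency is computed, so the following is only a minimal sketch of one plausible reading: each corpus token carries a soft distribution over tags (for example, from an unsupervised tagger), the probabilities are summed per word and tag, and the tag with the highest total becomes the dictionary entry. The function name `build_dictionary`, the `min_total` threshold, the sample tokens, and the Universal Dependencies style tag labels are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def build_dictionary(tagged_tokens, min_total=1.0):
    """Pick one tag per word from summed (expected) tag frequencies.

    tagged_tokens: iterable of (word, {tag: probability}) pairs.
    min_total: hypothetical threshold; words whose best tag has a lower
               total expected frequency are left out of the dictionary.
    """
    # expected[word][tag] accumulates the probability mass for that tag.
    expected = defaultdict(lambda: defaultdict(float))
    for word, tag_probs in tagged_tokens:
        for tag, prob in tag_probs.items():
            expected[word.lower()][tag] += prob

    dictionary = {}
    for word, totals in expected.items():
        best_tag = max(totals, key=totals.get)
        if totals[best_tag] >= min_total:
            dictionary[word] = best_tag
    return dictionary


# Illustrative input only; the probabilities are made up.
sample = [
    ("një", {"DET": 0.9, "NUM": 0.1}),
    ("një", {"DET": 0.6, "NUM": 0.4}),
    ("libër", {"NOUN": 0.8, "VERB": 0.2}),
    ("libër", {"NOUN": 0.7, "ADJ": 0.3}),
]

if __name__ == "__main__":
    print(build_dictionary(sample))
```

Summing probabilities rather than counting hard tag assignments lets uncertain taggings contribute in proportion to their confidence, and the threshold keeps rarely seen words out of the dictionary until more evidence accumulates.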
