Review and Visualization of Facebook's FastText Pretrained Word Vector Model

One of the most popular machine learning methods for processing natural language is Word2Vec. As with several other machine learning methods, there are concerns about the interpretability of the resulting model. This paper reviews and analyzes FastText, the pretrained word vector model for Bahasa Indonesia released by Facebook. The analysis starts by comparing the words in the pretrained model with those in the official dictionary of the Indonesian language (KBBI); the words in the model are then visualized for further analysis and review. A combination of Principal Component Analysis (PCA) and the t-SNE algorithm is used as the dimensionality reduction technique to visualize the word set. Based on the analysis and visualization results, this paper proposes several considerations for using the FastText pretrained word vector model to process natural language in Indonesian, such as whether common natural language text preprocessing techniques are needed.
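The vocabulary comparison described above can be sketched as a set comparison. This is a minimal illustration only, not the authors' procedure: the function name `vocabulary_coverage`, the `lowercase` option, and the toy word lists are all assumptions made for this example, and the real model vocabulary and KBBI word list would have to be loaded separately.

```python
def vocabulary_coverage(model_words, dictionary_words, lowercase=False):
    """Compare a model vocabulary with a dictionary word list.

    Returns (shared, model_only, coverage), where coverage is the
    fraction of dictionary words that also occur in the model.
    Hypothetical helper, not part of FastText or any KBBI tooling.
    """
    model_vocab = set(model_words)
    dictionary = set(dictionary_words)
    if lowercase:
        # Optional case folding, one common preprocessing step.
        model_vocab = {w.lower() for w in model_vocab}
        dictionary = {w.lower() for w in dictionary}
    shared = model_vocab & dictionary
    model_only = model_vocab - dictionary
    coverage = len(shared) / len(dictionary) if dictionary else 0.0
    return shared, model_only, coverage


# Toy data, invented for illustration only.
model_words = ["makan", "Makan", "minum", "jkt48", "rumah", "</s>"]
dictionary_words = ["makan", "minum", "rumah", "tidur"]

shared, model_only, coverage = vocabulary_coverage(model_words, dictionary_words)
print(sorted(shared), sorted(model_only), coverage)

# Same comparison after case folding.
shared_lc, model_only_lc, coverage_lc = vocabulary_coverage(
    model_words, dictionary_words, lowercase=True
)
print(sorted(model_only_lc), coverage_lc)
```

In this toy case, case folding removes the capitalized duplicate from the model-only set but leaves dictionary coverage unchanged, which is the kind of effect one would check before deciding whether a preprocessing step is worthwhile.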
