Document Similarity Detection Using Indonesian Language Word2vec Model

Most research on text duplication in Bahasa Indonesia uses the TF-IDF method. In this method, each word receives a different weight: the more frequently a word appears in a document, the greater its weight. This study instead aims to detect document similarity by calculating the cosine similarity of word vectors. The corpus was built from a collection of Indonesian Wikipedia articles. This study proposes two techniques for calculating similarity: simultaneous comparison and partial comparison. Simultaneous comparison compares documents directly, without dividing them into chapters, while partial comparison divides each document into several chapters before calculating similarity. Partial comparison yields more accurate similarity results than simultaneous comparison. The Unicheck application, which uses the TF-IDF method, serves as a benchmark. The similarity results from Unicheck and from this study differ because the two apply different methods. The similarity obtained with TF-IDF is lower than that obtained with Word2vec because TF-IDF cannot detect paraphrasing. This study has two limitations: the Unicheck application used as a benchmark does not use the same method as this study, and the determination of the expected value is still subjective.
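
The two comparison techniques described above can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: each document (or chapter) is represented by the average of its word vectors, chapters are paired by position, and the partial score is the mean of the per-chapter cosine similarities. The names TOY_VECTORS, mean_vector, simultaneous_similarity, and partial_similarity are hypothetical, and the two-dimensional vectors are toy values standing in for a trained Word2vec model.

```python
import math

# Toy 2-dimensional "embeddings" (assumption: stand-ins for vectors that a
# Word2vec model trained on Indonesian Wikipedia would provide).
TOY_VECTORS = {
    "kucing": [1.0, 0.0],
    "anjing": [0.0, 1.0],
    "hewan": [1.0, 1.0],
}


def mean_vector(tokens, vectors):
    """Average the vectors of all in-vocabulary tokens; None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def simultaneous_similarity(doc_a, doc_b, vectors):
    """Compare two whole documents (token lists) without splitting them."""
    vec_a = mean_vector(doc_a, vectors)
    vec_b = mean_vector(doc_b, vectors)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


def partial_similarity(chapters_a, chapters_b, vectors):
    """Compare documents chapter by chapter and average the scores.

    Assumption: chapters are aligned by position, and unmatched trailing
    chapters are ignored.
    """
    pairs = list(zip(chapters_a, chapters_b))
    if not pairs:
        return 0.0
    scores = [simultaneous_similarity(a, b, vectors) for a, b in pairs]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    doc_a = [["kucing", "hewan"], ["anjing"]]
    doc_b = [["kucing"], ["hewan", "anjing"]]
    # Flatten the chapters to obtain the whole-document token lists.
    whole_a = [t for chapter in doc_a for t in chapter]
    whole_b = [t for chapter in doc_b for t in chapter]
    print("simultaneous:", simultaneous_similarity(whole_a, whole_b, TOY_VECTORS))
    print("partial:", partial_similarity(doc_a, doc_b, TOY_VECTORS))
```

In this sketch, words with nearby vectors raise the score even when the surface forms differ, which is the property that lets a vector-based comparison register paraphrases that exact term matching misses. Other ways of aligning chapters or aggregating their scores are equally plausible.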