Focused Crawling Using Dictionary Algorithm with Breadth First and By Page Length Methods for Javanese and Sundanese Corpus Construction

Abstract The need for a complete corpus is now crucial, especially for linguists. To assist linguists in constructing a corpus, a tool for collecting text in a specific language from the Internet is needed. This paper describes an approach to collecting Javanese and Sundanese text from the Internet. We modified a focused crawler named WebSPHINX so that it can be used to crawl such text. To determine which pages are crawled, the focused crawler needs a language classifier; in this research, we used the dictionary algorithm to classify the text. To determine which links to visit next, we employed two crawling methods, Breadth First and By Page Length. The purpose of our research is to observe how the algorithm and the crawling methods perform in collecting Javanese and Sundanese text from the Internet. Our experiments showed that the dictionary algorithm classifies the text by language with an average accuracy of 88.64%, depending on the size of the documents being classified. The experiments also showed that, in general, the Breadth First method outperforms the By Page Length method. We also compared the dictionary algorithm with the N-Gram algorithm under the different crawling methods. The experiments showed that the combination of the Breadth First method and the dictionary algorithm generally outperforms the other combinations. Therefore, we used this combination to crawl the text and then construct the Javanese and Sundanese corpora.
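
To make the idea concrete, the following is a minimal sketch of a dictionary-based language classifier driving a breadth-first focused crawl. It is not the WebSPHINX modification described in this paper. The word lists are tiny illustrative samples, and the names `DICTIONARIES`, `classify`, `crawl_breadth_first`, the 0.5 threshold, and the in-memory `pages` and `links` maps are all assumptions made for the example. The sketch also assumes that links are followed only from pages classified as the target language, which is one plausible reading of focused crawling rather than a detail stated above.

```python
import re
from collections import deque

# Hypothetical toy word lists; a real dictionary would hold thousands of entries.
DICTIONARIES = {
    "javanese": {"aku", "kowe", "ora", "lan", "iki", "sing", "ing", "wis"},
    "sundanese": {"abdi", "anjeun", "henteu", "jeung", "ieu", "nu", "dina", "geus"},
}


def classify(text, dictionaries=DICTIONARIES, threshold=0.5):
    """Return the language whose dictionary covers the largest share of
    tokens, or None if no language reaches the (assumed) threshold."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return None
    best_lang, best_score = None, 0.0
    for lang, words in dictionaries.items():
        # Fraction of tokens in the text that appear in this dictionary.
        score = sum(1 for t in tokens if t in words) / len(tokens)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang if best_score >= threshold else None


def crawl_breadth_first(seed, pages, links, target):
    """Visit pages in first-in, first-out order, starting from `seed`.

    `pages` maps a URL to its text and `links` maps a URL to its outgoing
    URLs; both stand in for real HTTP fetching and HTML parsing.
    """
    queue, seen, collected = deque([seed]), {seed}, []
    while queue:
        url = queue.popleft()
        if classify(pages.get(url, "")) != target:
            continue  # off-language page: neither keep it nor follow its links
        collected.append(url)
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return collected


# Toy link graph: page "b" is Sundanese, so a Javanese crawl skips it.
pages = {
    "a": "aku ora ngerti iki",
    "b": "abdi henteu terang ieu",
    "c": "kowe lan aku wis teka",
}
links = {"a": ["b", "c"], "b": ["c"], "c": []}
javanese_pages = crawl_breadth_first("a", pages, links, "javanese")
```

In this sketch the frontier is a plain queue, so pages are visited in the order their links were discovered. A different crawling method would replace only the queue with another ordering policy, leaving the classifier unchanged.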