Crowdsourcing in developing repository of phrase definition in Bahasa Indonesia

A language repository is valuable as a reference for using a language, for preserving it, and for developing and implementing natural language processing algorithms. Bahasa Indonesia is a natural language that has hardly any such repository, despite its large number of speakers and previous attempts to build one. We devised a way to develop a repository of phrase definitions in Bahasa Indonesia using a form of crowdsourcing, and we investigated its implementation. An add-on was integrated into an information system that manages the final-year projects of undergraduate students. The add-on invites students to write definitions of keywords and to validate existing definitions. Observation over a period of six months shows that about 25% of the application's users took part in these voluntary activities as definition writers, validators, or both. During that period, about 1200 phrase definitions were added to the repository, and on average each definition was validated by two participants. The activity is supported by users who are well aware of the tasks and have a positive perception of the work, although the reasons that motivate their contributions differ.
