Data science in light of natural language processing

The work of data scientists falls into three main areas: collecting data, analyzing data, and inferring information from data. Each of these tasks requires specialized personnel, takes time, and costs money. Yet the next, and more demanding, step is turning data into products. This field has therefore attracted the attention of many research groups in both academia and industry. In recent decades, data-driven approaches have emerged and gained popularity because they require much less human effort. Natural Language Processing (NLP) is among the fields most strongly influenced by data. The growth of data is behind the performance improvements of most NLP applications, such as machine translation and automatic speech recognition. Consequently, many NLP applications are increasingly moving from rule-based systems and knowledge-based methods to data-driven approaches. However, data collected without defined design criteria, or stored in technically unsuitable forms, are useless. They will also be neglected if they are too small to support the required analysis and to infer accurate information. The chief purpose of this overview is to shed some light on the vital role of data in various fields and to give a better understanding of data in light of NLP. Specifically, it describes what happens to data during its life cycle: the building, processing, analyzing, and exploring phases.
