Morphosyntactic Preprocessing Impact on Document Embedding: An Empirical Study on Semantic Similarity

Word embeddings are among the most widely known and used representations of a document's vocabulary. They capture the context of individual words, but many applications need to represent text longer than a single word, which is the goal of document embedding. This paper presents an empirical study of the impact of morphosyntactic preprocessing on document embedding techniques, evaluated on a textual semantic similarity task. It compares the most widely used text preprocessing techniques: (1) cleaning, which covers stop-word removal, lowercase conversion, and punctuation and number elimination; (2) stemming, using the best-known algorithms in the literature, namely the Porter, Snowball, and Lancaster stemmers; and (3) lemmatization, using the WordNet lemmatizer. Experimental analysis on the MSRP (Microsoft Research Paraphrase) dataset reveals that preprocessing improves classifier accuracy, with stemming outperforming the other techniques. The preprocessing variants compared in the study can be sketched as in the example that follows.
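The following is a minimal sketch of the three compared preprocessing variants, assuming NLTK as the tooling (the paper names the algorithms but not a specific library); it is an illustration, not the authors' implementation.

# Sketch of the compared preprocessing variants, assuming NLTK.
# Requires NLTK data: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')
import string

from nltk.corpus import stopwords
from nltk.stem import (LancasterStemmer, PorterStemmer, SnowballStemmer,
                       WordNetLemmatizer)
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))


def clean(text):
    """(1) Cleaning: lowercase, drop punctuation and numbers, remove stop-words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation + string.digits))
    return [tok for tok in word_tokenize(text) if tok not in STOP_WORDS]


def stem(tokens, algorithm="porter"):
    """(2) Stemming with one of the three algorithms compared in the study."""
    stemmers = {
        "porter": PorterStemmer(),
        "snowball": SnowballStemmer("english"),
        "lancaster": LancasterStemmer(),
    }
    return [stemmers[algorithm].stem(tok) for tok in tokens]


def lemmatize(tokens):
    """(3) Lemmatization with the WordNet lemmatizer."""
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok) for tok in tokens]


if __name__ == "__main__":
    sentence = "The judges were comparing the stemmed and lemmatized sentences."
    tokens = clean(sentence)
    print("cleaned   :", tokens)
    print("porter    :", stem(tokens, "porter"))
    print("snowball  :", stem(tokens, "snowball"))
    print("lancaster :", stem(tokens, "lancaster"))
    print("lemmatized:", lemmatize(tokens))

Each variant yields a token list that is then fed to the document-embedding model before computing sentence-pair similarity on MSRP.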
