Web scraping tools simplify the task of building large databases for a variety of applications, such as constructing corpora for the development of natural language processing (NLP) applications. Many of these applications require large amounts of data, and the Web is an important source of it. Among the tasks within NLP, one of the most challenging is automatic text generation, where the objective is to generate syntactically and semantically correct text after training on a particular corpus. This article presents the construction of a corpus of English song lyrics, extracted from the Web, that can be used to train applications for the automatic generation of lyrics or poems, or for other NLP-related tasks. An analysis of the normalized corpus is presented, along with analyses of its vectorization (embeddings) produced with two current techniques.