Language Detection on Twitter (Detección de Idioma en Twitter)

Abstract This paper presents an approach to identifying the language of tweets without relying on training sets or aggregated information. The approach builds on two techniques: trigram recognition and small-word recognition. Each algorithm is evaluated on its own and as part of a combined model. The effect of tweet pre-processing on identification accuracy is also analyzed. Finally, based on experimental results, the best-performing of the studied alternatives is determined.
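To make the combination of pre-processing, trigram matching, and small-word matching concrete, the following is a minimal sketch and not the authors' implementation. The reference lists `SMALL_WORDS` and `TRIGRAMS`, the function names, the two-language scope, and the equal weighting are all assumptions chosen for illustration; the abstract does not specify the actual resources, pre-processing steps, or composition rule.

```python
import re

# Hypothetical, hand-picked reference data (illustrative only, not from the paper).
SMALL_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "for", "with"},
    "es": {"el", "la", "de", "que", "y", "en", "los", "para"},
}
TRIGRAMS = {
    "en": {"the", "and", "ing", "ion", "ent", " th", "he ", "ed "},
    "es": {"que", "ent", "ció", "los", " de", "de ", "la ", "es "},
}


def preprocess(tweet):
    """Remove URLs and @mentions, drop '#' and punctuation, lowercase, collapse spaces."""
    text = re.sub(r"https?://\S+", " ", tweet)
    text = re.sub(r"@\w+", " ", text)
    text = text.replace("#", " ")
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def small_word_score(text, lang):
    """Fraction of tokens that appear in the language's small-word list."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w in SMALL_WORDS[lang] for w in words) / len(words)


def trigram_score(text, lang):
    """Fraction of character trigrams that appear in the language's trigram list."""
    padded = f" {text} "
    grams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    if not grams:
        return 0.0
    return sum(g in TRIGRAMS[lang] for g in grams) / len(grams)


def detect(tweet, weight=0.5):
    """Combine both scores linearly (assumed rule); return None if nothing matches."""
    text = preprocess(tweet)
    scores = {
        lang: weight * trigram_score(text, lang)
        + (1 - weight) * small_word_score(text, lang)
        for lang in SMALL_WORDS
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Setting `weight=1.0` or `weight=0.0` in this sketch isolates the trigram or small-word scorer respectively, which mirrors the idea of evaluating each algorithm alone as well as in combination.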