Paper information - Enhancing Social News Media in Bulgarian with Natural Language Processing

In this work we introduce a system based on natural language processing techniques whose aim is to enhance social news media in Bulgarian. It solves the task of multi-class, multi-label classification of documents. We apply the algorithms to a collection of media articles from Svejo.net, a popular Bulgarian web resource comprising user-generated content. Our algorithms are one-versus-all classification methods widely used in the computational linguistics community. We describe the algorithms and the features employed, and we evaluate the impact of the features on the performance of the models. We show that knowledge about the user and user behavior can greatly improve performance. Moreover, although our document collection is generated entirely by social media users, the quality of the classification results is comparable to that of previously reported studies. We also address the task of automatic keyword and keyphrase extraction from unstructured text, and adapt it to the needs of Svejo.net for the induction of 'themes'. Themes are defined as text snippets that summarize the essence of an article. We evaluate the performance of several generic methods for keyword and keyphrase extraction on a corpus of articles in Bulgarian. The methods that we discuss rely on widely accepted information retrieval and machine learning techniques and are language-independent. We also consider the effect of a stemmer component on keyphrase extraction accuracy. The satisfactory performance of our models, despite the limited linguistic knowledge incorporated in them, recommends them as a baseline for keyword and keyphrase extraction for the Bulgarian language.
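
The one-versus-all scheme mentioned in the abstract can be illustrated with a minimal sketch: one binary classifier is trained per label, and a document receives every label whose classifier scores it positively, which is what allows multi-label output. The perceptron learner, the toy English documents, the bag-of-words features, and all names below (`BinaryPerceptron`, `train_one_vs_all`, `predict_labels`) are assumptions made for illustration; they are not the authors' algorithms, features, or data.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer (assumption: no stemming or stop-word removal)."""
    return text.lower().split()


class BinaryPerceptron:
    """Tiny perceptron over bag-of-words features; a stand-in for any binary learner."""

    def __init__(self):
        self.weights = {}
        self.bias = 0.0

    def score(self, tokens):
        return self.bias + sum(self.weights.get(t, 0.0) for t in tokens)

    def fit(self, docs, targets, epochs=10):
        # Standard perceptron rule: update only on misclassified examples.
        for _ in range(epochs):
            for tokens, y in zip(docs, targets):
                predicted = 1 if self.score(tokens) > 0 else -1
                if predicted != y:
                    for t in tokens:
                        self.weights[t] = self.weights.get(t, 0.0) + y
                    self.bias += y


def train_one_vs_all(docs, label_sets, epochs=10):
    """Train one binary model per label: documents with the label vs. all others."""
    labels = sorted(set().union(*label_sets))
    tokenized = [tokenize(d) for d in docs]
    models = {}
    for label in labels:
        targets = [1 if label in ls else -1 for ls in label_sets]
        model = BinaryPerceptron()
        model.fit(tokenized, targets, epochs)
        models[label] = model
    return models


def predict_labels(models, text):
    """Return every label whose binary model gives a positive score."""
    tokens = tokenize(text)
    return {label for label, m in models.items() if m.score(tokens) > 0}


# Hypothetical toy corpus; each document may carry several labels.
DOCS = [
    "football match goal team",
    "election parliament vote",
    "minister football stadium vote",
    "new phone software release",
]
LABEL_SETS = [{"sport"}, {"politics"}, {"sport", "politics"}, {"technology"}]

models = train_one_vs_all(DOCS, LABEL_SETS)
print(sorted(predict_labels(models, "minister football stadium vote")))
```

Because each label is decided independently, additional feature groups (for example, features describing the submitting user) could be appended to the token list without changing the training loop.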

Valentin Zhikov | Ivelina Nikolova
