ENHANCING SOCIAL NEWS MEDIA IN BULGARIAN WITH NATURAL LANGUAGE PROCESSING

In this work we introduce a system based on natural language processing techniques whose aim is to enhance social news media in Bulgarian. The system addresses the task of multi-class, multi-label classification of documents. We apply it to a collection of media articles from Svejo.net, a popular Bulgarian web resource comprising user-generated content. Our algorithms are one-versus-all classification methods widely used in the computational linguistics community. We describe the algorithms and the features employed, and we evaluate the impact of the features on the performance of the models. In doing so, we show that knowledge about the user and user behavior can greatly improve performance. Moreover, although our document collection is generated entirely by social media users, the quality of the classification results is comparable to that of previously reported studies. We also address the task of automatic keyword and keyphrase extraction from unstructured text, and adapt it to the needs of Svejo.net for the induction of 'themes'. Themes are defined as text snippets that summarize the essence of an article. We evaluate the performance of several generic methods for keyword and keyphrase extraction on a corpus of articles in Bulgarian. The methods that we discuss rely on widely accepted information retrieval and machine learning techniques and are language-independent. We also consider the effect of a stemmer component on keyphrase extraction accuracy. The satisfactory performance of our models, in spite of the limited linguistic knowledge incorporated in them, recommends them as a baseline for keyword and keyphrase extraction for Bulgarian.
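As an illustration of the one-versus-all scheme for multi-label classification mentioned above, the following is a minimal sketch only. It assumes a bag-of-words representation and a simple perceptron as the binary learner; neither choice is taken from the system described here, and the function names, toy documents and labels are hypothetical.

```python
from collections import defaultdict


def train_one_vs_all(docs, label_sets, epochs=10):
    """Train one binary perceptron per label (one-versus-all).

    docs: list of token lists; label_sets: list of sets of gold labels.
    The perceptron is an assumed stand-in for the binary learner.
    """
    labels = sorted({label for gold in label_sets for label in gold})
    weights = {label: defaultdict(float) for label in labels}
    for label in labels:
        w = weights[label]
        for _ in range(epochs):
            for tokens, gold in zip(docs, label_sets):
                score = sum(w[t] for t in tokens) + w["__bias__"]
                # This label is the positive class; every other label is negative.
                target = 1 if label in gold else -1
                if target * score <= 0:  # misclassified or on the boundary
                    for t in tokens:
                        w[t] += target
                    w["__bias__"] += target
    return weights


def predict(weights, tokens):
    """Return every label whose binary classifier fires (multi-label output)."""
    return {
        label
        for label, w in weights.items()
        if sum(w.get(t, 0.0) for t in tokens) + w.get("__bias__", 0.0) > 0
    }


# Hypothetical toy data, in English for readability.
docs = [
    ["football", "match", "goal"],
    ["election", "vote", "parliament"],
    ["football", "election", "stadium"],
    ["recipe", "soup"],
]
label_sets = [{"sports"}, {"politics"}, {"sports", "politics"}, {"food"}]

weights = train_one_vs_all(docs, label_sets)
print(predict(weights, ["football", "vote"]))
```

Because each label has its own independent binary decision, a document can receive zero, one or several labels, which is what makes the one-versus-all decomposition suitable for the multi-label setting.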
