Toward Selectivity-Based Keyword Extraction for Croatian News

This preliminary report presents an unsupervised, network-based method for keyword extraction from Croatian news, in which keywords are extracted from a complex network built from the text. The approach is built on a new network measure, node selectivity, and is motivated by research on graph-based centrality approaches. Node selectivity is defined as the average weight distribution on the links of a single node. We extract nodes (keyword candidates) based on their selectivity value. Furthermore, we expand the extracted nodes to word-tuples, ranked by the highest in/out selectivity values. Selectivity-based extraction does not require linguistic knowledge, since it is derived purely from the statistical and structural information encompassed in the source text, which is reflected in the structure of the network. The obtained sets are evaluated against manually annotated keywords: for the set of extracted keyword candidates, the average F1 score is 24.63% and the average F2 score is 21.19%; for the extracted word-tuple candidates, the average F1 score is 25.9% and the average F2 score is 24.47%.
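
To make the selectivity measure more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the network is a directed co-occurrence network in which a link from one word to the next is weighted by how often that pair occurs, and that "average weight on the links of a node" can be read as node strength divided by node degree, computed separately for incoming and outgoing links. The function names `build_cooccurrence_network`, `selectivity`, and `top_candidates` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict


def build_cooccurrence_network(tokens):
    """Directed weighted network (assumed construction):
    the weight of link (u, v) counts how often v directly follows u."""
    weights = defaultdict(int)
    for u, v in zip(tokens, tokens[1:]):
        weights[(u, v)] += 1
    return dict(weights)


def selectivity(weights):
    """Return {node: (in_selectivity, out_selectivity)}.

    Assumed reading of the definition: selectivity = strength / degree,
    i.e. the sum of link weights divided by the number of links.
    A node with no links in a given direction gets 0.0 for that direction.
    """
    in_strength, out_strength = defaultdict(int), defaultdict(int)
    in_degree, out_degree = defaultdict(int), defaultdict(int)
    for (u, v), w in weights.items():
        out_strength[u] += w
        out_degree[u] += 1
        in_strength[v] += w
        in_degree[v] += 1

    nodes = set(in_degree) | set(out_degree)
    result = {}
    for n in nodes:
        s_in = in_strength[n] / in_degree[n] if in_degree.get(n) else 0.0
        s_out = out_strength[n] / out_degree[n] if out_degree.get(n) else 0.0
        result[n] = (s_in, s_out)
    return result


def top_candidates(sel, k):
    """Rank nodes by the larger of their in/out selectivity (an assumed ranking rule)."""
    return sorted(sel, key=lambda n: (-max(sel[n]), n))[:k]


tokens = "a b a b a c".split()
network = build_cooccurrence_network(tokens)
sel = selectivity(network)
print(sel["a"], top_candidates(sel, 2))
```

In this toy input the word `a` has one incoming link of weight 2 and two outgoing links of total weight 3, so its in-selectivity is 2.0 and its out-selectivity is 1.5. Real input would be lemmatized or otherwise preprocessed news text, which this sketch does not cover.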
