A Comparison of Graph-Based and Statistical Metrics for Learning Domain Keywords

In this paper, we compare unsupervised and supervised methods for key-phrase extraction from a domain corpus. The unsupervised methods we evaluate use individual statistical measures and graph-based measures, while the supervised methods apply machine learning models that combine these statistical and graph-based measures. The graph-based measures are computed on a graph that connects terms and compound expressions through conceptual relations and represents a whole corpus about a domain, rather than a single document. Using three datasets from different domains, we observed that supervised methods outperform unsupervised ones. We also found that the graph-based measures Degree and Reachability outperform the standard baseline TF-IDF and the other graph-based measures in the majority of cases, while the co-occurrence-based measure Pointwise Mutual Information outperforms all the other metrics, including the graph-based measures, when each is taken individually.
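
To make the measures named above concrete, the following is a minimal sketch of Degree, Reachability, and Pointwise Mutual Information on toy data. It is an illustration under assumptions, not the paper's implementation: the abstract does not define the measures, so Degree is taken here as the number of neighbours of a term, Reachability as the number of other terms reachable from it, and PMI as the usual log-ratio of joint to independent probabilities estimated from raw counts. The function names, the adjacency-map representation, and `toy_graph` are all hypothetical.

```python
import math
from collections import deque


def degree(graph, node):
    """Number of distinct neighbours of `node` in an adjacency map (assumed definition)."""
    return len(graph.get(node, set()))


def reachability(graph, node):
    """Number of other nodes reachable from `node`, found by breadth-first search
    (assumed definition; the paper may normalise or weight this differently)."""
    seen = {node}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbour in graph.get(current, set()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) - 1  # exclude the start node itself


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information in bits, estimated from raw counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical toy term graph: each term maps to the set of terms it is linked to.
toy_graph = {
    "graph": {"node", "edge"},
    "node": {"graph"},
    "edge": {"graph"},
    "corpus": set(),
}
```

In this toy graph, "graph" has two neighbours, "node" can reach two other terms through "graph", and the isolated term "corpus" reaches none. A ranking step would then sort candidate terms by whichever score is chosen.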