Replication of the Keyword Extraction part of the paper "'Without the Clutter of Unimportant Words': Descriptive Keyphrases for Text Visualization"

Keyphrases aid the exploration of text collections by communicating salient aspects of documents and are often used to create effective visualizations of text. While prior work in HCI and visualization has proposed a variety of ways of presenting keyphrases, less attention has been paid to selecting the best descriptive terms. In this article, we investigate the statistical and linguistic properties of keyphrases chosen by human judges and determine which features are most predictive of high-quality descriptive phrases. Based on 5,611 responses from 69 graduate students describing a corpus of dissertation abstracts, we analyze characteristics of human-generated keyphrases, including phrase length, commonness, position, and part of speech. Next, we systematically assess the contribution of each feature within statistical models of keyphrase quality. We then introduce a method for grouping similar terms and varying the specificity of displayed phrases so that applications can select phrases dynamically based on the available screen space and current context of interaction. Precision-recall measures find that our technique generates keyphrases that match those selected by human judges. Crowdsourced ratings of tag cloud visualizations rank our approach above other automatic techniques. Finally, we discuss the role of HCI methods in developing new algorithmic techniques suitable for user-facing applications.
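As a rough illustration of the precision-recall comparison mentioned above, the following sketch scores a list of automatically extracted keyphrases against the phrases chosen by human judges. It is a minimal example under stated assumptions, not the paper's evaluation procedure: the function name `precision_recall`, the exact-match criterion after lowercasing, and the sample phrases are all hypothetical.

```python
def precision_recall(predicted, reference):
    """Return (precision, recall) of predicted keyphrases against a reference set.

    Assumption: two phrases match only if they are identical after
    lowercasing and trimming whitespace. The paper may use a different
    matching rule.
    """
    pred = {p.strip().lower() for p in predicted}
    ref = {r.strip().lower() for r in reference}
    matched = pred & ref
    # Guard against empty inputs to avoid division by zero.
    precision = len(matched) / len(pred) if pred else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical data, for illustration only.
human = ["keyphrase extraction", "text visualization", "tag clouds"]
system = ["Keyphrase extraction", "tag clouds", "graduate students", "screen space"]

precision, recall = precision_recall(system, human)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four extracted phrases appear in the human-selected set, and two of the three human-selected phrases are recovered. Treating the phrases as sets ignores ranking; a ranked evaluation would instead compute these values at each cutoff of the extracted list.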
