Bag of What? Simple Noun Phrase Extraction for Text Analysis

Social scientists who do not have specialized natural language processing training often use a unigram bag-of-words (BOW) representation when analyzing text corpora. We offer a new phrase-based method, NPFST, for enriching a unigram BOW. NPFST uses a part-of-speech tagger and a finite state transducer to extract multiword phrases to be added to a unigram BOW. We compare NPFST to both n-gram and parsing methods in terms of yield, recall, and efficiency. We then demonstrate how to use NPFST for exploratory analyses; it performs well, without configuration, on many different kinds of English text. Finally, we present a case study using NPFST to analyze a new corpus of U.S. congressional bills.
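To make the idea in the abstract concrete, the following is a minimal sketch of phrase extraction from part-of-speech tags. It is not the authors' NPFST implementation: it replaces the finite state transducer with a regular expression over one-letter tag codes, and the pattern `(A|N)* N (P D* (A|N)* N)*`, the `TAG_CODES` mapping, the function name `extract_phrases`, and the `min_len`/`max_len` limits are all assumptions chosen for illustration. The input is assumed to be already tagged with coarse tags, so no tagger is required to run it.

```python
import re

# Hypothetical mapping from coarse POS tags to one-letter codes.
# Any tag not listed here is treated as "O" (other) and can never be part of a phrase.
TAG_CODES = {"ADJ": "A", "NOUN": "N", "PROPN": "N", "ADP": "P", "DET": "D"}

# Assumed noun-phrase pattern over tag codes: (A|N)* N (P D* (A|N)* N)*
NP_PATTERN = re.compile(r"[AN]*N(?:PD*[AN]*N)*")


def extract_phrases(tagged, min_len=2, max_len=8):
    """Return every multiword span whose tag sequence fully matches NP_PATTERN.

    `tagged` is a list of (word, tag) pairs. All matching spans are returned,
    including nested and overlapping ones, rather than only the longest match.
    """
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged)
    phrases = []
    for start in range(len(tagged)):
        last_end = min(start + max_len, len(tagged))
        for end in range(start + min_len, last_end + 1):
            if NP_PATTERN.fullmatch(codes[start:end]):
                phrases.append(" ".join(word for word, _ in tagged[start:end]))
    return phrases


sentence = [
    ("the", "DET"), ("national", "ADJ"), ("debt", "NOUN"),
    ("of", "ADP"), ("the", "DET"), ("country", "NOUN"), ("grew", "VERB"),
]
print(extract_phrases(sentence))
```

The extracted phrases would then be counted alongside the unigrams, so each document's representation holds both single words and multiword phrases.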

Abram Handler | Brendan T. O'Connor | Matthew J. Denny | Hanna M. Wallach

[1] Ted Dunning,et al. Accurate Methods for the Statistics of Surprise and Coincidence , 1993, CL.

[2] Hideki Mima,et al. Automatic recognition of multi-word terms: the C-value/NC-value method , 2000, International Journal on Digital Libraries.

[3] Justin Grimmer,et al. Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts , 2013, Political Analysis.

[4] Timothy Baldwin,et al. Shared Tasks of the 2015 Workshop on Noisy User-generated Text: Twitter Lexical Normalization and Named Entity Recognition , 2015, NUT@IJCNLP.

[5] Slav Petrov,et al. A Universal Part-of-Speech Tagset , 2011, LREC.

[6] Mans Hulden,et al. Foma: a Finite-State Compiler and Library , 2009, EACL.

[7] Brendan T. O'Connor,et al. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments , 2010, ACL.

[8] Johan Schalkwyk,et al. OpenFst: A General and Efficient Weighted Finite-State Transducer Library , 2007, CIAA.

[9] Amy Beth Warriner,et al. Concreteness ratings for 40 thousand generally known English word lemmas , 2014, Behavior research methods.

[10] Steven Abney,et al. Part-of-Speech Tagging and Partial Parsing , 1997 .

[11] Slava M. Katz,et al. Technical terminology: some linguistic properties and an algorithm for identification in text , 1995, Natural Language Engineering.

[12] Sampo Pyysalo,et al. Overview of BioNLP Shared Task 2013 , 2013, BioNLP@ACL.

[13] Brendan T. O'Connor,et al. MiTextExplorer: Linked brushing and mutual information for exploratory text data analysis , 2014 .

[14] 悠太菊池,et al. The New York Times Annotated Corpus as a Large-Scale Summarization Resource , 2015 .

[15] Timothy Baldwin,et al. Evaluating N-gram based Evaluation Metrics for Automatic Keyphrase Extraction , 2010, COLING.

[16] Burt L. Monroe,et al. Fightin' Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict , 2008, Political Analysis.

[17] Dani Yogatama,et al. Bayesian Optimization of Text Representations , 2015, EMNLP.

[18] Dustin Hillard,et al. Automated classification of congressional legislation , 2006, DG.O.

[19] Emmanuel Roche,et al. Finite-State Language Processing , 1997 .

[20] M. Laver,et al. Extracting Policy Positions from Political Texts Using Words as Data , 2003, American Political Science Review.

[21] Jeffrey Heer,et al. Replication of the Keyword Extraction part of the paper "'Without the Clutter of Unimportant Words': Descriptive Keyphrases for Text Visualization" , 2019, ArXiv.

[22] Dragomir R. Radev,et al. How to Analyze Political Attention with Minimal Assumptions and Costs , 2010 .

[23] Beatrice Santorini,et al. Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[24] Brendan T. O'Connor,et al. Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters , 2013, NAACL.

[25] David Bamman,et al. Unsupervised Discovery of Biographical Structure from Text , 2014, TACL.

[26] Justin Grimmer,et al. A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases , 2010, Political Analysis.

[27] Mihai Surdeanu,et al. The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[28] Brendan T. O'Connor,et al. Posterior calibration and exploratory analysis for natural language processing models , 2015, EMNLP.

[29] James R. Curran,et al. Parsing Noun Phrases in the Penn Treebank , 2011, Computational Linguistics.

[30] Dan Klein,et al. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network , 2003, NAACL.

[31] Percy Liang,et al. Calibrated Structured Prediction , 2015, NIPS.