Unsupervised Concept Categorization and Extraction from Scientific Document Titles

This paper studies the automated categorization and extraction of scientific concepts from the titles of scientific articles, in order to gain a deeper understanding of their key contributions and facilitate the construction of a generic academic knowledge base. Towards this goal, we propose an unsupervised, domain-independent, and scalable two-phase algorithm to type and extract key concept mentions into aspects of interest (e.g., Techniques, Applications). In the first phase of our algorithm, we propose PhraseType, a probabilistic generative model that exploits textual features and limited POS tags to broadly segment text snippets into aspect-typed phrases. We extend this model to simultaneously learn aspect-specific features and identify academic domains in multi-domain corpora, since the two tasks mutually enhance each other. In the second phase, we propose an approach based on adaptor grammars to extract fine-grained concept mentions from the aspect-typed phrases in a purely data-driven manner, without the need for any external resources or human effort. We apply our technique to study literature from diverse scientific domains and show significant gains over state-of-the-art concept extraction techniques. We also present a qualitative analysis of the results obtained.
