An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora

Named Entities (NEs) in natural language text have become increasingly important, especially with the advent of social media, and they play a critical role in the development of many natural language technologies. In this paper, we systematically analyze the patterns of occurrence and co-occurrence of NEs in standard large English news corpora. The analysis provides insight into the structure of these corpora and paves the way for technologies that rely critically on handling NEs. We use two distinct approaches: a conventional statistical analysis that measures and reports the occurrence patterns of NEs in terms of frequency, growth, etc., and a complex-network-based analysis that measures the co-occurrence patterns in terms of connectivity, degree distribution, the small-world phenomenon, etc. Our analysis indicates that: (i) NEs form an open set in corpora and grow linearly, (ii) NEs divide into a kernel and a periphery, with the large periphery occurring rarely, and (iii) there is strong evidence of the small-world phenomenon. Our findings may suggest effective ways of constructing NE lexicons to aid the efficient development of several natural language technologies.
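
The two kinds of measurement described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data, and the choice of the sentence as the co-occurrence unit are all assumptions made for illustration. The sketch computes a growth curve (distinct NEs seen versus NE tokens read) and two standard small-world indicators, the average clustering coefficient and the average shortest-path length. A network is usually called small-world when its clustering is much higher than that of a random graph of the same size while its path lengths remain comparably short.

```python
from collections import deque
from itertools import combinations


def growth_curve(sentences):
    """Number of distinct NEs seen after each NE token is read."""
    seen, curve = set(), []
    for sent in sentences:
        for ne in sent:
            seen.add(ne)
            curve.append(len(seen))
    return curve


def cooccurrence_graph(sentences):
    """Undirected graph linking two NEs that share a sentence.

    Using the sentence as the co-occurrence unit is an assumption.
    """
    graph = {}
    for sent in sentences:
        unique = sorted(set(sent))
        for ne in unique:
            graph.setdefault(ne, set())
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def average_clustering(graph):
    """Mean local clustering coefficient; nodes of degree < 2 count as 0."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph) if graph else 0.0


def average_path_length(graph):
    """Mean shortest-path length over all reachable ordered pairs (BFS)."""
    total, pairs = 0, 0
    for src in graph:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy input: each inner list holds the NEs found in one sentence.
toy_sentences = [
    ["Obama", "Washington", "Congress"],
    ["Obama", "Chicago"],
    ["Congress", "Senate"],
    ["Chicago", "Illinois"],
]
toy_graph = cooccurrence_graph(toy_sentences)
print(growth_curve(toy_sentences))
print(average_clustering(toy_graph), average_path_length(toy_graph))
```

On a real corpus, the input lists would come from an NE tagger, and the clustering and path-length values would be compared against a random graph with the same number of nodes and edges.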
