SpeedRead: A Fast Named Entity Recognition Pipeline

Online content analysis employs algorithmic methods to identify entities in unstructured text. Both machine learning and knowledge-base approaches lie at the foundation of contemporary named entities extraction systems. However, the progress in deploying these approaches on web-scale has been been hampered by the computational cost of NLP over massive text corpora. We present SpeedRead (SR), a named entity recognition pipeline that runs at least 10 times faster than Stanford NLP pipeline. This pipeline consists of a high performance Penn Treebank- compliant tokenizer, close to state-of-art part-of-speech (POS) tagger and knowledge-based named entity recognizer.

[1]  Ronan Collobert,et al.  Deep Learning for Efficient Discriminative Parsing , 2011, AISTATS.

[2]  Chunyu Kit,et al.  Tokenization as the Initial Phase in NLP , 1992, COLING.

[3]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..

[4]  Dan Klein,et al.  Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network , 2003, NAACL.

[5]  Satoshi Sekine,et al.  A survey of named entity recognition and classification , 2007 .

[6]  Christopher D. Manning,et al.  Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger , 2000, EMNLP.

[7]  Dan Klein,et al.  Named Entity Recognition with Character-Level Models , 2003, CoNLL.

[8]  Christopher D. Manning,et al.  Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling , 2005, ACL.

[9]  Christopher D. Manning Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics? , 2011, CICLing.

[10]  Paolo Ferragina,et al.  TAGME: on-the-fly annotation of short text fragments (by wikipedia entities) , 2010, CIKM.

[11]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[12]  Ewan Klein,et al.  Natural Language Processing with Python , 2009 .

[13]  Ian H. Witten,et al.  An effective, low-cost measure of semantic relatedness obtained from Wikipedia links , 2008 .

[14]  Robert Dale,et al.  Handling Conjunctions in Named Entities , 2007, CICLing.

[15]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[16]  Dan Klein,et al.  Faster and Smaller N-Gram Language Models , 2011, ACL.

[17]  Daniel Gillick,et al.  Sentence Boundary Detection and the Problem with the U.S. , 2009, NAACL.

[18]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition , 2003, CoNLL.

[19]  Heeyoung Lee,et al.  Stanford’s Multi-Pass Sieve Coreference Resolution System at the CoNLL-2011 Shared Task , 2011, CoNLL Shared Task.