Paper Information - SpeedRead: A Fast Named Entity Recognition Pipeline

SpeedRead: A Fast Named Entity Recognition Pipeline

Online content analysis employs algorithmic methods to identify entities in unstructured text. Both machine learning and knowledge-base approaches lie at the foundation of contemporary named entity extraction systems. However, progress in deploying these approaches at web scale has been hampered by the computational cost of NLP over massive text corpora. We present SpeedRead (SR), a named entity recognition pipeline that runs at least 10 times faster than the Stanford NLP pipeline. This pipeline consists of a high-performance Penn Treebank-compliant tokenizer, a close to state-of-the-art part-of-speech (POS) tagger, and a knowledge-based named entity recognizer.
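The three stages named in the abstract (tokenizer, POS tagger, knowledge-based recognizer) can be illustrated with a toy sketch. This is not SpeedRead's implementation: the regex tokenizer is not Penn Treebank-compliant, and the names `POS_LEXICON`, `GAZETTEER`, `tokenize`, `pos_tag`, `recognize`, and `run_pipeline` are hypothetical. Restricting dictionary lookups to proper-noun tokens is also an assumption made here for illustration.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical toy resources for illustration; not taken from SpeedRead.
POS_LEXICON: Dict[str, str] = {"visited": "VBD", "in": "IN", ".": "."}
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("Steven", "Skiena"): "PERSON",
    ("New", "York"): "LOCATION",
}
MAX_ENTITY_LEN = max(len(key) for key in GAZETTEER)

# Simplified tokenizer: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> List[str]:
    """Stage 1: split raw text into tokens."""
    return TOKEN_RE.findall(text)


def pos_tag(tokens: List[str]) -> List[str]:
    """Stage 2: lexicon lookup with a capitalization fallback."""
    tags = []
    for tok in tokens:
        if tok in POS_LEXICON:
            tags.append(POS_LEXICON[tok])
        elif tok[0].isupper():
            tags.append("NNP")  # assume capitalized unknowns are proper nouns
        else:
            tags.append("NN")
    return tags


def recognize(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Stage 3: greedy longest-match lookup in the gazetteer."""
    entities = []
    i = 0
    while i < len(tokens):
        matched = False
        if tags[i] == "NNP":
            longest = min(MAX_ENTITY_LEN, len(tokens) - i)
            for n in range(longest, 0, -1):
                key = tuple(tokens[i:i + n])
                if key in GAZETTEER:
                    entities.append((" ".join(key), GAZETTEER[key]))
                    i += n
                    matched = True
                    break
        if not matched:
            i += 1
    return entities


def run_pipeline(text: str) -> List[Tuple[str, str]]:
    """Chain the three stages: tokenize, tag, recognize."""
    tokens = tokenize(text)
    tags = pos_tag(tokens)
    return recognize(tokens, tags)
```

Calling `run_pipeline("Steven Skiena visited New York.")` on this toy data yields the two gazetteer entries as `(text, label)` pairs. Dictionary lookups of this kind avoid per-token model inference, which is one plausible reason a knowledge-based recognizer can be fast.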

Steven Skiena | Rami Al-Rfou'
