Turkish Named Entity Discovery Based on Termsets

Named Entity Recognition (NER) is a subtask of information extraction that aims to discover named entities in unstructured text. Previous studies on NER mostly use statistical machine learning models rather than classifiers, because treating the problem as a classification task requires dealing with very high-dimensional and sparse vector spaces. In this paper, we treat NER as a classical text classification problem and extract nominal features from each token in the unstructured text sequence. We convert each token to a document transaction, use frequent termset mining to extract termset features, and then apply termset weighting to classify named entities. As a result, we work with lower-dimensional feature spaces. Our experimental results on a large Turkish dataset show that frequent termsets and their weighting scheme can be used for the NER task.
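The token-to-transaction and termset mining steps can be illustrated with a minimal Python sketch. It is not the paper's implementation: the feature set in token_to_transaction, the function names, the toy tokens, and the support threshold are all assumptions made for illustration, and the mining step is a brute-force count of item combinations rather than an Apriori-style algorithm. Termset weighting and the classifier are omitted.

```python
from collections import Counter
from itertools import combinations


def token_to_transaction(tokens, i):
    """Turn token i into a set of nominal feature items.

    The four features below are hypothetical; the abstract does not
    list the features actually used.
    """
    tok = tokens[i]
    prev_tok = tokens[i - 1] if i > 0 else "<s>"
    return frozenset({
        f"lower={tok.lower()}",
        f"cap={tok[:1].isupper()}",
        f"suffix3={tok[-3:].lower()}",
        f"prev={prev_tok.lower()}",
    })


def frequent_termsets(transactions, min_support, max_size=2):
    """Count every item combination of size 1..max_size and keep
    those that occur in at least min_support transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(transaction)
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Toy token sequence (assumed, not taken from the paper's dataset).
tokens = ["Ahmet", "Ankara", "ve", "Elif", "Ankara"]
transactions = [token_to_transaction(tokens, i) for i in range(len(tokens))]
termsets = frequent_termsets(transactions, min_support=2)

for termset, support in sorted(termsets.items(), key=lambda kv: -kv[1]):
    print(sorted(termset), support)
```

Each surviving termset would then serve as one feature, so the feature space is bounded by the number of frequent termsets rather than by the full vocabulary of nominal feature values.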
