Generalization of metric classification algorithms for sequences classification and labelling

This article addresses the modification of metric classification algorithms, focusing on adapting the k-Nearest Neighbours algorithm to sequential data. It proposes a method for generalizing metric classification algorithms and, within that method, develops an algorithm for classifying and labelling sequential data. The article discusses the advantages of the developed classification algorithm over the existing one and compares the effectiveness of the proposed algorithm with that of conditional random fields (CRF) on the chunking task of the open CoNLL-2000 data set.
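As a rough illustration of what applying k-Nearest Neighbours to sequential data can look like, the sketch below classifies whole sequences by majority vote among the nearest training sequences. It is not the article's proposed algorithm, which the abstract does not specify. The choice of edit distance as the metric, the function names `edit_distance` and `knn_classify`, and the toy part-of-speech sequences are all assumptions made for this example.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit-cost edits)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def knn_classify(query, train, k=3, dist=edit_distance):
    """Label a sequence by majority vote among its k nearest neighbours.

    train is a list of (sequence, label) pairs; dist is any metric
    defined on sequences (edit distance is an assumed default here).
    """
    neighbours = sorted(train, key=lambda pair: dist(query, pair[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: POS-tag sequences labelled with a chunk type.
train = [
    (("DT", "NN"), "NP"),
    (("DT", "JJ", "NN"), "NP"),
    (("MD", "VB"), "VP"),
    (("VBZ", "VBN"), "VP"),
]
print(knn_classify(("DT", "JJ", "NNS"), train, k=3))  # prints "NP"
```

Because the metric is passed in as `dist`, the same voting scheme works with any other distance defined on sequences, which is the general sense in which a metric classifier can be extended beyond fixed-length feature vectors.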
