Prototype generation on structural data using dissimilarity space representation

Data Reduction techniques are commonly applied in instance-based classification tasks to reduce the amount of data to be processed. Prototype Selection (PS) and Prototype Generation (PG) are the most representative approaches. The two families differ in how they obtain the reduced set from the initial one: the former selects the most representative elements of the set, whereas the latter creates new data from it. Although PG is considered to delimit decision boundaries better, the operations it requires are not well defined in scenarios involving structural data such as strings, trees, or graphs. This work presents a case study that uses the common RandomC algorithm to map the initial structural data to a Dissimilarity Space (DS) representation, thereby allowing the use of PG methods. A comparative experiment over string data is carried out in which our proposal is compared against PS methods applied in the original space. Results show that PG combined with the RandomC mapping achieves very competitive performance, although the obtained accuracy seems to be bounded by the representativity of the DS method.
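
As a rough illustration of the mapping described above, the following Python sketch embeds strings into a dissimilarity space. It assumes that RandomC amounts to drawing a fixed number of random pivots from each class and that the dissimilarity is the unit-cost edit distance; neither detail is stated in the abstract. All function names and parameters (edit_distance, random_c_pivots, to_dissimilarity_space, class_centroids, per_class, seed) are hypothetical. The class-mean step at the end is only a stand-in showing that averaging, which has no direct counterpart for strings, becomes available once the data are vectors; it is not one of the PG methods evaluated in the paper.

```python
import random
from collections import defaultdict


def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def random_c_pivots(samples, labels, per_class, seed=0):
    """Assumed reading of RandomC: draw `per_class` random pivots per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append(s)
    pivots = []
    for y in sorted(by_class):
        members = by_class[y]
        pivots.extend(rng.sample(members, min(per_class, len(members))))
    return pivots


def to_dissimilarity_space(samples, pivots, dist=edit_distance):
    """Represent each sample as its vector of distances to the pivots."""
    return [[dist(s, p) for p in pivots] for s in samples]


def class_centroids(vectors, labels):
    """Toy generation step: one mean vector per class (illustrative only)."""
    grouped = defaultdict(list)
    for v, y in zip(vectors, labels):
        grouped[y].append(v)
    return {y: [sum(col) / len(vs) for col in zip(*vs)]
            for y, vs in grouped.items()}


if __name__ == "__main__":
    samples = ["abc", "abd", "xyz", "xyy"]
    labels = ["a", "a", "x", "x"]
    pivots = random_c_pivots(samples, labels, per_class=1)
    vectors = to_dissimilarity_space(samples, pivots)
    print(class_centroids(vectors, labels))
```

The dimensionality of the resulting space equals the number of pivots, which is one reason the accuracy of any method applied afterwards can be limited by how well the chosen pivots represent the data.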
