DISCOVERING ROBUST EMBEDDINGS IN (DIS)SIMILARITY SPACE FOR HIGH‐DIMENSIONAL LINGUISTIC FEATURES

Recent research has shown the effectiveness of rich feature representations for tasks in natural language processing (NLP). However, an exceedingly large number of features does not always improve classification performance: the features may contain redundant information, lead to noisy feature representations, and render learning algorithms intractable. In this paper, we propose a supervised embedding framework that modifies the relative positions between instances to increase the compatibility between the input features and the output labels, while preserving the local distribution of the original data in the embedded space. The proposed framework is designed to support a flexible balance between preserving the intrinsic geometry and enhancing class separability for both interclass and intraclass instances. It takes into account the characteristics of linguistic features by using an inner product‐based optimization template. (Dis)similarity features, also known as the empirical kernel mapping, are employed to enable computationally tractable processing of extremely high‐dimensional input and to handle nonlinearities in embedding generation when necessary. Evaluated on two NLP tasks with six data sets, the proposed framework provides better classification performance than a support vector machine trained without any dimensionality reduction technique. It also generates embeddings with better class discriminability than many existing embedding algorithms.
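
As a rough illustration of the (dis)similarity feature idea, the sketch below represents each instance by its vector of inner product‐based similarities to a set of reference instances, so that the working dimensionality equals the number of reference instances rather than the number of original features. The function name `empirical_kernel_map`, the use of cosine similarity, the choice of the training instances as reference points, and the random data are all assumptions made for this example; the abstract does not specify these details.

```python
import numpy as np

def empirical_kernel_map(X, prototypes):
    """Map each row of X to its cosine similarities with the prototype rows.

    Hypothetical helper for illustration; cosine similarity is an assumed
    choice of inner product-based similarity, not taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    P = np.asarray(prototypes, dtype=float)
    # Normalize rows to unit length, guarding against all-zero rows.
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    Pn = P / np.maximum(np.linalg.norm(P, axis=1, keepdims=True), 1e-12)
    # Entry (i, j) is the similarity between instance i and prototype j.
    return Xn @ Pn.T

# Toy data (assumed): 5 training and 2 test instances with 1000 features each.
rng = np.random.default_rng(0)
X_train = rng.random((5, 1000))
X_test = rng.random((2, 1000))

# Training instances serve as the prototypes (an assumption of this sketch).
S_train = empirical_kernel_map(X_train, X_train)  # 5 x 5 similarity features
S_test = empirical_kernel_map(X_test, X_train)    # 2 x 5 similarity features
```

In this sketch, any downstream embedding or classifier would operate on `S_train` and `S_test`, whose width depends only on the number of reference instances.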
