Apprentissage à base de Noyaux Sémantiques pour le Traitement de Données Textuelles. (Machine Learning with Semantic Kernels for Textual Data)

Since the early 1980s, statistical methods, and more specifically machine learning methods, applied to textual data processing have attracted growing interest. This trend is mainly due to the ever-increasing size of corpora. As a result, methods that rely on expert work have become costly processes and have gradually lost popularity to learning systems. In this thesis, we focus on two main lines of work. The first addresses problems related to the processing of structured textual data with kernel-based approaches. In this context, we present a semantic kernel for documents structured in sections, notably in XML format. The kernel draws its semantic information from an external knowledge source, namely a thesaurus. Our kernel was tested on a corpus of medical documents with the UMLS medical thesaurus. In an international challenge on medical document categorization, it ranked among the 10 best-performing methods out of 44. The second line of work studies latent concepts extracted by statistical methods such as latent semantic analysis (LSA). We first present kernels that exploit linguistic concepts from an external source together with statistical concepts derived from LSA. We show that a kernel integrating both types of concepts improves performance. We then present a kernel that uses local LSAs to extract latent concepts, yielding a finer-grained representation of documents.
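To make the idea of a thesaurus-backed kernel over sectioned documents more concrete, here is a minimal sketch in Python. It is not the kernel defined in the thesis: the toy thesaurus, the concept identifiers, the section names, and all function names are hypothetical, and the similarity is simply a dot product of concept counts summed over the sections two documents share. Because it corresponds to an explicit feature map, this toy function is a valid (positive semi-definite) kernel.

```python
from collections import Counter

# Hypothetical toy thesaurus: each term maps to the concept identifiers it evokes.
# A real system would query a resource such as UMLS; these entries are made up.
TOY_THESAURUS = {
    "cough": {"C_RESPIRATORY_SYMPTOM"},
    "wheezing": {"C_RESPIRATORY_SYMPTOM"},
    "fever": {"C_FEVER"},
    "pyrexia": {"C_FEVER"},
    "pneumonia": {"C_LUNG_INFECTION"},
}


def concept_vector(text):
    """Map a section's text to a bag of concepts (unknown terms are ignored)."""
    counts = Counter()
    for token in text.lower().split():
        for concept in TOY_THESAURUS.get(token, ()):
            counts[concept] += 1
    return counts


def section_kernel(text_a, text_b):
    """Dot product of the concept vectors of two sections."""
    va, vb = concept_vector(text_a), concept_vector(text_b)
    return sum(va[c] * vb[c] for c in va.keys() & vb.keys())


def document_kernel(doc_a, doc_b):
    """Sum of section kernels over the section names present in both documents."""
    shared_sections = doc_a.keys() & doc_b.keys()
    return sum(section_kernel(doc_a[s], doc_b[s]) for s in shared_sections)


def normalized_kernel(doc_a, doc_b):
    """Cosine-style normalization; returns 0.0 if either document has no known concept."""
    denom = (document_kernel(doc_a, doc_a) * document_kernel(doc_b, doc_b)) ** 0.5
    return document_kernel(doc_a, doc_b) / denom if denom else 0.0


# Documents are represented as dicts from section name to section text (an assumption).
doc1 = {"history": "cough and fever", "impression": "pneumonia"}
doc2 = {"history": "wheezing with pyrexia", "impression": "no pneumonia"}
doc3 = {"history": "ankle pain", "impression": "sprain"}

print(document_kernel(doc1, doc2))
print(normalized_kernel(doc1, doc2))
print(normalized_kernel(doc1, doc3))
```

In this toy setting the "history" sections of `doc1` and `doc2` share no medical term, yet they still match because their terms map to the same concepts, which is the effect a purely word-based kernel would miss. Comparing sections by name is one simple design choice among several; weighting sections differently would be a natural extension.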
