A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning

Ph.D. thesis (with international doctorate mention) in Computer Science written by Marc Franco Salvador under the supervision of Dr. Paolo Rosso at the Universitat Politècnica de València. The author was examined in Valencia in May 2017 by a committee composed of Dr. Nicola Ferro (University of Padua), Dr. Bernardo Magnini (Fondazione Bruno Kessler), and Dr. Simone Paolo Ponzetto (University of Mannheim). The international doctorate mention was granted on the basis of the following research internships: one year at the Sapienza University of Rome (Italy) under the supervision of Dr. Roberto Navigli; two months at IIIT Hyderabad and at Veooz (India) under the supervision of Dr. Vasudeva Varma and Dr. Prasad Pingali; one month at INAOE (Mexico) under the supervision of Dr. Manuel Montes-y-Gómez; and three months at Symanto Group (Germany) under the supervision of Dr. Yassine Benajiba. The thesis was awarded the grade of Excellent with Cum Laude distinction.
