Paper information - A Cross-domain and Cross-language Knowledge-based Representation of Text and its Meaning

Ph.D. thesis (international doctorate mention) in Computer Science written by Marc Franco Salvador under the supervision of Dr. Paolo Rosso at the Universitat Politècnica de València. The author was examined in Valencia in May 2017 by a jury composed of the following doctors: Nicola Ferro (University of Padua), Bernardo Magnini (Fondazione Bruno Kessler), and Simone Paolo Ponzetto (University of Mannheim). The international doctorate mention was granted on the basis of the following research internships: 1 year at the Sapienza University of Rome (Italy) under the supervision of Dr. Roberto Navigli; 2 months at the IIIT of Hyderabad and at Veooz (India) under the supervision of Dr. Vasudeva Varma and Dr. Prasad Pingali; 1 month at the INAOE (Mexico) under the supervision of Dr. Manuel Montes-y-Gómez; and 3 months at Symanto Group (Germany) under the supervision of Dr. Yassine Benajiba. The thesis was graded Excellent with Cum Laude distinction.

Marc Franco Salvador
