Paper Information - From Frequency to Meaning: Vector Space Models of Semantics

From Frequency to Meaning: Vector Space Models of Semantics

Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.

Patrick Pantel | Peter D. Turney

[1] R. Darnell. Translation , 1873, The Indian medical gazette.

[2] Wm. R. Wright. General Intelligence, Objectively Determined and Measured. , 1905 .

[3] C. K. Ogden,et al. Basic English : a general introduction with rules and grammar , 1930 .

[4] W. N. Locke,et al. Machine Translation of Languages: Fourteen Essays , 1955 .

[5] J. R. Firth,et al. A Synopsis of Linguistic Theory, 1930-1955 , 1957 .

[6] L. Tucker,et al. Some mathematical notes on three-mode factor analysis , 1966, Psychometrika.

[7] Philip J. Stone,et al. Extracting Information. (Book Reviews: The General Inquirer. A Computer Approach to Content Analysis) , 1967 .

[8] Marshall S. Smith,et al. The general inquirer: A computer approach to content analysis. , 1967 .

[9] Julie Beth Lovins,et al. Development of a stemming algorithm , 1968, Mech. Transl. Comput. Linguistics.

[10] J. Chang,et al. Analysis of individual differences in multidimensional scaling via an n-way generalization of “Eckart-Young” decomposition , 1970 .

[11] Richard A. Harshman,et al. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-model factor analysis , 1970 .

[12] Gerard Salton,et al. The SMART Retrieval System—Experiments in Automatic Document Processing , 1971 .

[13] Peter Ladefoged,et al. UCLA Working Papers in Phonetics, 23. , 1972 .

[14] Gerard Salton,et al. A vector space model for automatic indexing , 1975, CACM.

[15] W. Bruce Croft. Clustering large files of documents using the single-link method , 1977, J. Am. Soc. Inf. Sci..

[16] E. Rosch,et al. Cognition and Categorization , 1980 .

[17] Martin F. Porter,et al. An algorithm for suffix stripping , 1997, Program.

[18] Zellig S. Harris,et al. Distributional Structure , 1954 .

[19] C. Pollard,et al. Center for the Study of Language and Information , 2022 .

[20] Dedre Gentner,et al. Structure-Mapping: A Theoretical Framework for Analogy , 1983, Cogn. Sci..

[21] Susan T. Dumais,et al. Statistical semantics: analysis of the potential performance of keyword information systems , 1984 .

[22] R. Nosofsky. Attention, similarity, and the identification-categorization relationship. , 1986, Journal of experimental psychology. General.

[23] George W. Furnas,et al. Pictures of relevance: A geometric analysis of similarity measures , 1987, J. Am. Soc. Inf. Sci..

[24] George Lakoff,et al. Women, Fire, and Dangerous Things , 1987 .

[25] Lance J. Rips,et al. Combining Prototypes: A Selective Modification Model , 1988, Cogn. Sci..

[26] Carolyn J. Crouch,et al. A cluster-based approach to thesaurus construction , 1988, SIGIR '88.

[27] Gerard Salton,et al. Term-Weighting Approaches in Automatic Text Retrieval , 1988, Inf. Process. Manag..

[28] Kenneth Ward Church,et al. Word Association Norms, Mutual Information, and Lexicography , 1989, ACL.

[29] Richard A. Harshman,et al. Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[30] C. Burgess,et al. Semantic and associative priming in the cerebral hemispheres: Some words do, some words don't … sometimes, some places , 1990, Brain and Language.

[31] T. Landauer,et al. Indexing by Latent Semantic Analysis , 1990 .

[32] Belur V. Dasarathy,et al. Nearest neighbor (NN) norms: NN pattern classification techniques , 1991 .

[33] David R. Karger,et al. Scatter/Gather: a cluster-based approach to browsing large document collections , 1992, SIGIR '92.

[34] John R. Gilbert,et al. Sparse Matrices in MATLAB: Design and Implementation , 1992, SIAM J. Matrix Anal. Appl..

[35] Naftali Tishby,et al. Distributional Clustering of English Words , 1993, ACL.

[36] Mohamad H. Hassoun,et al. Associative neural memories , 1993 .

[37] Pentti Kanerva,et al. Sparse distributed memory and related models , 1993 .

[38] Ellen M. Voorhees,et al. Corpus-Based Statistical Sense Resolution , 1993, HLT.

[39] Hinrich Schütze,et al. A Vector Model for Syntagmatic and Paradigmatic Relatedness , 1993 .

[40] George A. Miller,et al. A Semantic Concordance , 1993, HLT.

[41] Gregory Grefenstette,et al. Explorations in automatic thesaurus discovery , 1994 .

[42] John Riedl,et al. GroupLens: an open architecture for collaborative filtering of netnews , 1994, CSCW '94.

[43] Yoshihiko Nitta,et al. Co-Occurrence Vectors From Corpora vs. Distance Vectors From Dictionaries , 1994, COLING.

[44] Kenneth Ward Church. One term or two? , 1995, SIGIR '95.

[45] Philip Resnik,et al. Using Information Content to Evaluate Semantic Similarity in a Taxonomy , 1995, IJCAI.

[46] Graeme Hirst,et al. Lexical chains as representations of context for the detection and correction of malapropisms , 1995 .

[47] Gene H. Golub,et al. Matrix Computations, Third Edition , 1996 .

[48] David A. Hull. Stemming Algorithms: A Case Study for Detailed Evaluation , 1996, J. Am. Soc. Inf. Sci..

[49] Thomas G. Dietterich. What is machine learning? , 2020, Archives of Disease in Childhood.

[50] Wessel Kraaij,et al. Viewing stemming as recall enhancement , 1996, SIGIR '96.

[51] Curt Burgess,et al. Producing high-dimensional semantic spaces from lexical co-occurrence , 1996 .

[52] Gerard Salton,et al. Document Length Normalization , 1995, Inf. Process. Manag..

[53] Bernhard Schölkopf,et al. Kernel Principal Component Analysis , 1997, ICANN.

[54] James H. Martin,et al. Contextual Spelling Correction Using Latent Semantic Analysis , 1997, ANLP.

[55] T. Landauer,et al. A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. , 1997 .

[56] David W. Conrath,et al. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy , 1997, ROCLING/IJCLCLP.

[57] Andrei Z. Broder,et al. On the resemblance and containment of documents , 1997, Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No.97TB100171).

[58] Marti A. Hearst. Text Tiling: Segmenting Text into Multi-paragraph Subtopic Passages , 1997, CL.

[59] Gerda Ruge,et al. Automatic Detection of Thesaurus relations for Information Retrieval Applications , 1997, Foundations of Computer Science: Potential - Theory - Cognition.

[60] Sergey Brin,et al. The Anatomy of a Large-Scale Hypertextual Web Search Engine , 1998, Comput. Networks.

[61] David Heckerman,et al. Empirical Analysis of Predictive Algorithms for Collaborative Filtering , 1998, UAI.

[62] Hinrich Schütze,et al. Automatic Word Sense Discrimination , 1998, Comput. Linguistics.

[63] Martin Chodorow,et al. Combining local context and wordnet similarity for word sense identification , 1998 .

[64] Susan T. Dumais,et al. A Bayesian Approach to Filtering Junk E-Mail , 1998, AAAI 1998.

[65] Christiane Fellbaum,et al. Combining Local Context and Wordnet Similarity for Word Sense Identification , 1998 .

[66] Peter W. Foltz,et al. Learning from text: Matching readers and texts by latent semantic analysis , 1998 .

[67] Dekang Lin,et al. Automatic Retrieval and Clustering of Similar Words , 1998, ACL.

[68] Peter W. Foltz,et al. The intelligent essay assessor: Applications to educational technology , 1999 .

[69] H. Sebastian Seung,et al. Learning the parts of objects by non-negative matrix factorization , 1999, Nature.

[70] Lillian Lee,et al. Measures of Distributional Similarity , 1999, ACL.

[71] Anil K. Jain,et al. Data clustering: a review , 1999, CSUR.

[72] Oren Etzioni,et al. Grouper: A Dynamic Clustering Interface to Web Search Results , 1999, Comput. Networks.

[73] Bob Carpenter,et al. Vector-based Natural Language Call Routing , 1999, Comput. Linguistics.

[74] Freddy Y. Y. Choi. Advances in domain independent linear text segmentation , 2000, ANLP.

[75] Christiane Fellbaum,et al. Book Reviews: WordNet: An Electronic Lexical Database , 1999, CL.

[76] Rie Kubota Ando. Latent semantic space: iterative scaling improves precision of inter-document similarity measurement , 2000, SIGIR '00.

[77] W. Lowe,et al. Towards a Theory of Semantic Space , 2001 .

[78] Ji-Rong Wen,et al. Clustering user queries of a search engine , 2001, WWW '01.

[79] Barbara Rosario,et al. Classifying the Semantic Relations in Noun Compounds via a Domain-Specific Lexical Hierarchy , 2001, EMNLP.

[80] Magnus Sahlgren,et al. From Words to Understanding , 2001 .

[81] Dekang Lin,et al. DIRT – Discovery of Inference Rules from Text , 2001 .

[82] Peter D. Turney. Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL , 2001, ECML.

[83] John A. Carroll,et al. Applied morphological processing of English , 2001, Natural Language Engineering.

[84] Patrick Pantel,et al. DIRT – discovery of inference rules from text , 2001, KDD '01.

[85] Hinrich Schütze,et al. Book Reviews: Foundations of Statistical Natural Language Processing , 1999, CL.

[86] Ellen M. Voorhees,et al. Overview of the TREC 2002 Question Answering Track , 2003, TREC.

[87] Bo Pang,et al. Thumbs up? Sentiment Classification using Machine Learning Techniques , 2002, EMNLP.

[88] Ian H. Witten,et al. Data mining: practical machine learning tools and techniques with Java implementations , 2002, SGMD.

[89] George Karypis,et al. Evaluation of hierarchical clustering algorithms for document datasets , 2002, CIKM '02.

[90] Patrick Pantel,et al. Discovering word senses from text , 2002, KDD.

[91] Moses Charikar,et al. Similarity estimation techniques from rounding algorithms , 2002, STOC '02.

[92] Fabrizio Sebastiani,et al. Machine learning in automated text categorization , 2001, CSUR.

[93] Patrick Pantel,et al. Document clustering with committees , 2002, SIGIR '02.

[94] Barbara Rosario,et al. The Descent of Hierarchy, and Selection in Relational Semantics , 2002, ACL.

[95] James R. Curran,et al. Improvements in Automatic Thesaurus Extraction , 2002, ACL 2002.

[96] Thomas K. Landauer,et al. On the computational basis of learning and cognition: Arguments from LSA , 2002 .

[97] Yen-Jen Oyang,et al. Relevant term suggestion in interactive web search based on contextual information in query session logs , 2003, J. Assoc. Inf. Sci. Technol..

[98] Daniel Gatica-Perez,et al. On image auto-annotation with latent space models , 2003, ACM Multimedia.

[99] Michael L. Littman,et al. Measuring praise and criticism: Inference of semantic orientation from association , 2003, TOIS.

[100] Richard Sproat,et al. The First International Chinese Word Segmentation Bakeoff , 2003, SIGHAN.

[101] K. Margaritis,et al. Analysis of Recommender Systems’ Algorithms , 2003 .

[102] Jimmy J. Lin,et al. Quantitative evaluation of passage retrieval algorithms for question answering , 2003, SIGIR.

[103] Stan Szpakowicz,et al. Roget's thesaurus and semantic similarity , 2012, RANLP.

[104] George Forman,et al. An Extensive Empirical Study of Feature Selection Metrics for Text Classification , 2003, J. Mach. Learn. Res..

[105] Mirella Lapata,et al. Constructing Semantic Space Models from Parsed Corpora , 2003, ACL.

[106] R. Rapp. Word sense discovery based on sense descriptor dissimilarity , 2003, MTSUMMIT.

[107] Joel D. Martin,et al. Unsupervised Learning of Morphology for English and Inuktitut , 2003, NAACL.

[108] Tony Veale. The Analogical Thesaurus , 2003, IAAI.

[109] Greg Linden,et al. Amazon.com Recommendations: Item-to-Item Collaborative Filtering , 2001 .

[110] Michael I. Jordan,et al. Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[111] Jeffrey P. Bigham,et al. Combining Independent Modules to Solve Multiple-choice Synonym and Analogy Problems , 2003, ArXiv.

[112] Patrick Pantel,et al. VerbOcean: Mining the Web for Fine-Grained Semantic Verb Relations , 2004, EMNLP.

[113] Otis Gospodnetic,et al. Lucene in Action , 2004 .

[114] Graeme Hirst,et al. Non-Classical Lexical Semantic Relations , 2004, Proceedings of the HLT-NAACL Workshop on Computational Lexical Semantics - CLS '04.

[115] Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval , 2021, J. Documentation.

[116] David J. Weir,et al. Characterising Measures of Lexical Distributional Similarity , 2004, COLING.

[117] Graeme Hirst,et al. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures , 2004 .

[118] David F. Gleich,et al. SVD based term suggestion and ranking system , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[119] Sunita Sarawagi,et al. Efficient set joins on similarity predicates , 2004, SIGMOD '04.

[120] Ido Dagan,et al. Similarity-Based Models of Word Cooccurrence Probabilities , 1998, Machine Learning.

[121] Yiming Yang,et al. An Evaluation of Statistical Approaches to Text Categorization , 1999, Information Retrieval.

[122] C. J. van Rijsbergen,et al. The geometry of information retrieval , 2004 .

[123] Sanjay Ghemawat,et al. MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[124] Tony Veale,et al. WordNet Sits the S.A.T. - A Knowledge-Based Approach to Lexical Analogy , 2004, ECAI.

[125] Patrick Pantel,et al. Inducing Ontological Co-occurrence Vectors , 2005, ACL.

[126] Peter D. Turney. Measuring Semantic Similarity by Latent Relational Analysis , 2005, IJCAI.