Using Generic Corpora to Learn Domain-Specific Terminology

This paper describes a knowledge-weak technique for automatically learning the terminology relevant to a given domain from a corpus of domain-specific documents. A generic corpus serves as a filter for scoring how relevant each term is to the domain. We tested the approach on three corpora from different domains; in each case, the high-scoring terms consistently represented concepts relevant to the domain from which they came.
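To make the filtering idea concrete, the following is a minimal sketch, not the scoring function used in the paper, which the abstract does not specify. It assumes a simple add-one-smoothed log ratio of a term's relative frequency in the domain corpus to its relative frequency in the generic corpus; the function names `term_counts` and `domain_relevance` and the regular-expression tokenizer are illustrative choices only.

```python
import math
import re
from collections import Counter


def term_counts(documents):
    """Count lowercase alphabetic tokens across a list of document strings."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts


def domain_relevance(domain_docs, generic_docs):
    """Score each term seen in the domain corpus.

    ASSUMPTION: the score is the log ratio of add-one-smoothed relative
    frequencies (domain vs. generic). The paper may use a different measure.
    """
    domain = term_counts(domain_docs)
    generic = term_counts(generic_docs)
    vocab_size = len(set(domain) | set(generic))
    domain_total = sum(domain.values())
    generic_total = sum(generic.values())

    scores = {}
    for term, count in domain.items():
        p_domain = (count + 1) / (domain_total + vocab_size)
        # Counter returns 0 for terms absent from the generic corpus.
        p_generic = (generic[term] + 1) / (generic_total + vocab_size)
        scores[term] = math.log(p_domain / p_generic)
    return scores


if __name__ == "__main__":
    domain_docs = ["the patient received the vaccine", "the vaccine dose"]
    generic_docs = ["the cat sat on the mat", "the dog ran"]
    scores = domain_relevance(domain_docs, generic_docs)
    # Print terms from most to least domain-specific.
    for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{term}\t{score:.3f}")
```

Under this assumed measure, words that are common everywhere (such as "the") score near zero because the generic corpus cancels them out, while terms frequent in the domain corpus but rare in the generic one rise to the top.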
