Pattern Mining Across Domain-Specific Text Collections

This paper reports a consistency in patterns of language use across domain-specific collections of text. We present a method for automatically identifying domain-specific keywords (specialist terms) by comparing language use in scientific, domain-specific text collections with language use in texts intended for a more general audience. The method supports the automatic production of collocational networks and of networks of concepts: thesauri, or so-called ontologies. It combines, in a novel way, existing metrics from computational linguistics to enable the extraction, or learning, of these kinds of networks. The creation of ontologies or thesauri is informed by international (ISO) standards in terminology science, and the resulting resources can be used to support a variety of work, including data-mining applications.
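As a rough illustration of comparing a specialist collection with general-audience text, the sketch below ranks words by the ratio of their relative frequency in the specialist text to their relative frequency in the general text. The abstract does not say which metrics the method combines, so the ratio, the add-one style smoothing, the tokenizer, and the function names (`tokenize`, `frequency_ratio`, `top_keywords`) are all assumptions made for this example rather than the authors' method.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_ratio(specialist_tokens, general_tokens, smoothing=1.0):
    """Score each specialist word by relative frequency in the specialist
    corpus divided by its smoothed relative frequency in the general corpus.

    The smoothing constant is an assumption; it keeps words that never
    occur in the general corpus from causing a division by zero.
    """
    spec_counts = Counter(specialist_tokens)
    gen_counts = Counter(general_tokens)
    n_spec = len(specialist_tokens)
    n_gen = len(general_tokens)

    scores = {}
    for word, f_spec in spec_counts.items():
        rel_spec = f_spec / n_spec
        rel_gen = (gen_counts[word] + smoothing) / (n_gen + smoothing)
        scores[word] = rel_spec / rel_gen
    return scores


def top_keywords(specialist_text, general_text, k=5):
    """Return the k highest-scoring candidate keywords, ties broken alphabetically."""
    scores = frequency_ratio(tokenize(specialist_text), tokenize(general_text))
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


if __name__ == "__main__":
    specialist = "the nanotube lattice and the nanotube bond"
    general = "the cat and the dog and the house"
    print(top_keywords(specialist, general, k=3))
```

Words that are frequent in the specialist text but rare or absent in the general text score highest, while function words such as "the" score low because they are common in both. A realistic setup would use a large reference corpus for the general side and would probably add statistical filtering before treating high-scoring words as terms.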
