Estimating term domain relevance through term frequency, disjoint corpora frequency - tf-dcf

This paper proposes a new relevance index for terms extracted from domain corpora. We call it term frequency, disjoint corpora frequency (tf-dcf), and it is based on the absolute frequency of each term tempered by its frequency in other (contrasting) corpora. The conceptual differences and the mathematical computation of the proposed index are discussed with respect to similar approaches that also take contrasting corpora into account. To illustrate the efficiency of our index, this paper evaluates tf-dcf against these approaches. Finally, further experiments analyze the behavior of tf-dcf according to the characteristics of the contrasting corpora.
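
As a rough illustration of the idea of a domain frequency tempered by contrasting-corpora frequencies, the sketch below divides a term's absolute frequency in the domain corpus by a penalty that grows with its frequency in each contrasting corpus. The abstract does not state the formula, so the specific penalty used here (a product of 1 + log2(1 + tf) over the contrasting corpora) is an assumption, as are the function name `tf_dcf`, its parameters, and the toy counts.

```python
import math
from collections import Counter


def tf_dcf(term, domain_counts, contrasting_counts):
    """Score a term by its domain frequency, tempered by contrasting corpora.

    ASSUMPTION: the penalty is the product, over contrasting corpora, of
    1 + log2(1 + tf of the term in that corpus). The abstract only says the
    frequency is "tempered"; this exact form is illustrative.
    """
    tf = domain_counts.get(term, 0)
    penalty = 1.0
    for counts in contrasting_counts:
        # A term absent from a contrasting corpus contributes a factor of 1,
        # so it leaves the domain frequency unchanged.
        penalty *= 1.0 + math.log2(1.0 + counts.get(term, 0))
    return tf / penalty


# Hypothetical toy counts: one domain corpus and two contrasting corpora.
domain = Counter({"ventilation": 8, "patient": 12})
contrasting = [Counter({"patient": 3}), Counter({"patient": 1})]

scores = {t: tf_dcf(t, domain, contrasting) for t in domain}
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy setting, "patient" is more frequent than "ventilation" in the domain corpus, but because it also occurs in both contrasting corpora its score drops below that of "ventilation", which occurs only in the domain corpus.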
