A Measure of Term Representativeness Based on the Number of Co-occurring Salient Words

We propose a novel measure of the representativeness (i.e., indicativeness or topic specificity) of a term in a given corpus. The measure embodies the idea that the distribution of words co-occurring with a representative term should be biased relative to the word distribution in the whole corpus. This bias is quantified as the number of distinct words whose occurrences are saliently biased among the co-occurring words. The saliency of a word is determined by a threshold probability that can be set automatically from the whole corpus. A comparative evaluation showed that the measure is clearly superior to conventional measures at finding topic-specific words in newspaper archives of different sizes.
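The idea of counting saliently biased co-occurring words can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: co-occurrence is taken to mean "appears in the same document as the term", the saliency of a word is tested with a binomial tail probability, and the threshold is passed in as a fixed hypothetical parameter rather than derived automatically from the corpus. The names `binomial_tail` and `salient_word_count` are illustrative only, and the exact tail sum is practical only for toy corpora.

```python
from collections import Counter
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def salient_word_count(term, docs, threshold=0.05):
    """Count distinct words that are saliently over-represented near `term`.

    Assumptions (not from the source): a document is a list of tokens,
    words co-occur with `term` when they share a document with it, and a
    word is salient when its binomial tail probability is below `threshold`.
    """
    # Word frequencies over the whole corpus.
    corpus_counts = Counter(w for doc in docs for w in doc)
    corpus_size = sum(corpus_counts.values())

    # Word frequencies over the documents containing the term.
    co_counts = Counter(
        w for doc in docs if term in doc for w in doc if w != term
    )
    n = sum(co_counts.values())

    salient = 0
    for word, k in co_counts.items():
        # Chance of drawing `word` if co-occurring words followed the corpus.
        p = corpus_counts[word] / corpus_size
        if binomial_tail(k, n, p) < threshold:
            salient += 1
    return salient


# Toy corpus: "market" appears only alongside "stock"; "the" is everywhere.
docs = [["stock", "market", "market", "the"]] + [["rain", "sun", "the"]] * 9
print(salient_word_count("stock", docs))
```

In this toy corpus, "market" is concentrated in the document containing "stock" and falls below the threshold, whereas "the" occurs at roughly its corpus-wide rate and does not. A term whose co-occurring words include many such salient words would receive a higher score under this sketch.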
