Wikipedia Relatedness Measurement Methods and Influential Features

As a corpus for knowledge extraction, Wikipedia has become a promising resource for researchers in domains such as NLP, WWW, IR and AI, because it covers concepts across a wide range of domains, is remarkably accurate, and has a structure that is easy to analyze. Measuring relatedness among concepts is one of the long-standing research topics in Wikipedia analysis. The value of this research is widely recognized because of its wide range of applications, such as query expansion in IR and context recognition in WSD (Word Sense Disambiguation). A number of approaches have been proposed, and they have shown that many features can be used to measure relatedness among concepts in Wikipedia. Previous studies have used features such as categories, co-occurrence of terms (links), inter-page links and Infoboxes for this purpose. What seems to be lacking, however, is an integrated feature selection model for these dispersed features, since it remains unclear which features are influential and how they can be integrated to achieve higher accuracy. This position paper proposes an SVR (Support Vector Regression) based integrated feature selection model to investigate the influence of each feature and to seek a combined model of features that achieves high accuracy and coverage.
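To make the idea of measuring each feature's influence more concrete, the following is a minimal sketch and not the model proposed in this paper. It assumes that every concept pair is described by one relatedness score per feature, that a linear SVR is trained by simple subgradient descent on the epsilon-insensitive loss, and that influence is estimated by ablation (removing one feature and observing how much the held-out error grows). The feature names, the function names, the hyperparameters and the synthetic data are all hypothetical stand-ins chosen for illustration.

```python
import random

# Hypothetical feature names: one relatedness score per Wikipedia feature.
FEATURES = ["category", "cooccurrence", "link", "infobox"]


def train_linear_svr(X, y, epsilon=0.02, C=10.0, lr0=0.01, epochs=100):
    """Fit a linear SVR (weights w, bias b) by subgradient descent.

    Minimizes an L2 penalty on w plus the epsilon-insensitive loss.
    All hyperparameter values are illustrative assumptions.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for epoch in range(epochs):
        lr = lr0 / (1.0 + 0.1 * epoch)  # decaying step size
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            # Subgradient of the epsilon-insensitive loss: zero inside the tube.
            g = 0.0 if abs(err) <= epsilon else (1.0 if err > 0 else -1.0)
            w = [wj - lr * (wj / (C * n) + g * xj) for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b


def mean_abs_error(model, X, y):
    """Mean absolute prediction error of a (w, b) model."""
    w, b = model
    return sum(abs(sum(wj * xj for wj, xj in zip(w, xi)) + b - yi)
               for xi, yi in zip(X, y)) / len(y)


def drop_column(X, j):
    """Return a copy of X without feature column j."""
    return [row[:j] + row[j + 1:] for row in X]


def feature_influence(X_train, y_train, X_test, y_test):
    """Ablation: growth in held-out error when each feature is removed."""
    base = mean_abs_error(train_linear_svr(X_train, y_train), X_test, y_test)
    influence = {}
    for j, name in enumerate(FEATURES):
        model = train_linear_svr(drop_column(X_train, j), y_train)
        err = mean_abs_error(model, drop_column(X_test, j), y_test)
        influence[name] = err - base
    return influence


# Synthetic stand-in data (an assumption, not real Wikipedia measurements):
# the target depends on "link" and "category" only; the rest are noise.
rng = random.Random(0)
X = [[rng.random() for _ in FEATURES] for _ in range(200)]
y = [0.3 * row[0] + 0.6 * row[2] + rng.uniform(-0.01, 0.01) for row in X]

influence = feature_influence(X[:150], y[:150], X[150:], y[150:])
for name, gain in sorted(influence.items(), key=lambda kv: -kv[1]):
    print(f"{name:13s} error increase when removed: {gain:.3f}")
```

On real data, the synthetic targets would be replaced by human relatedness judgments, and a kernel SVR from an established library would likely be preferable to this hand-written linear variant. The ablation loop itself would stay the same.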