AuGEAS: authoritativeness grading, estimation, and sorting

When searching for content in a large heterogeneous document collection such as the World Wide Web, it is not easy to know which documents provide reliable, authoritative information about a subject. The problem is particularly acute for "high-value" informational needs, such as retrieving medical information, where the cost of error may be high. In this paper, a method is described for estimating the authoritativeness of a document based on textual, non-topical cues. This method is complementary to estimates of authoritativeness based on link structure, such as the PageRank and HITS algorithms, and is particularly suited to "high-value" content search where the user is interested in information about a specific topic. A method for combining textual estimates of authoritativeness with link analysis is also presented. The types of textual cues to authoritativeness that are easily computed and used by our method are described, as well as the method used to select a subset of cues to increase computation speed. Methods for applying authoritativeness estimates to re-ranking documents returned from search engines, for combining textual authoritativeness with social authority, and for query expansion are also presented. By combining textual authority with link analysis, a more complete and robust estimate can be made of a document's authoritativeness.
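The re-ranking idea can be illustrated with a minimal sketch. The abstract does not specify how textual and link-based estimates are combined, so the linear interpolation below is an assumption chosen for illustration, not the paper's method. The `Doc` class, its field names, the `weight` parameter, and the example scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    url: str
    text_authority: float  # hypothetical textual authoritativeness estimate
    link_authority: float  # hypothetical link-based score (PageRank-like)


def min_max(values):
    """Rescale values to [0, 1] so the two score types are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rerank(docs, weight=0.5):
    """Re-rank docs by a weighted mix of textual and link authority.

    Assumption: a simple linear interpolation; weight=1.0 uses only the
    textual estimate, weight=0.0 uses only the link-based score.
    """
    if not docs:
        return []
    text = min_max([d.text_authority for d in docs])
    link = min_max([d.link_authority for d in docs])
    combined = [weight * t + (1 - weight) * l for t, l in zip(text, link)]
    order = sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)
    return [docs[i] for i in order]


# Made-up scores for three search results.
results = [
    Doc("a.example", 0.9, 0.001),
    Doc("b.example", 0.2, 0.010),
    Doc("c.example", 0.6, 0.006),
]
```

Calling `rerank(results, weight=1.0)` orders the documents by textual authority alone, while `weight=0.0` falls back to the link-based ordering. Intermediate weights trade one signal off against the other.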
