An Evaluation Model on Information Coverage of Search Engines

Search engine crawlers usually discover web pages by following the links between them. Because the Web is already massive and still growing, they can crawl and index only a portion of all web pages. This paper presents a model for evaluating their information coverage. We analyze the main reasons why crawlers cannot cover all web information and propose three kinds of benchmarks for measuring the coverage of a search engine. The paper then gives an evaluation model for two of the three benchmarks. First, we sample the WWW, by generating random IP addresses or by breadth-first search, to obtain a set of web pages used to check the coverage of quantity. Second, we select high-quality pages, identified by the HITS or PageRank algorithms, as samples of important pages. Finally, we submit the samples to the page databases of search engines and compute the coverage percentages. In our experiments, we obtain data from the WebInfoMall system of Peking University and compute the coverage percentages of quantity and quality. Different sampling approaches and algorithms yield consistent results, which supports the validity of our model and the accuracy of the measurements.
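The quantity benchmark described above can be illustrated with a small sketch. The code below is a minimal, illustrative Python outline, not the paper's actual implementation: it assumes a hypothetical helper is_indexed(url) that queries a search engine's page database and reports whether the URL is present, and it probes randomly generated IPv4 addresses for a responding web server to build the sample.

import random
import socket

def random_ip():
    # Generate a random IPv4 address (no filtering of reserved ranges here).
    return ".".join(str(random.randint(1, 254)) for _ in range(4))

def has_web_server(ip, timeout=2.0):
    # Return True if the host accepts a TCP connection on port 80.
    try:
        with socket.create_connection((ip, 80), timeout=timeout):
            return True
    except OSError:
        return False

def coverage_of_quantity(sample_urls, is_indexed):
    # Fraction of sampled URLs found in the search engine's page database.
    if not sample_urls:
        return 0.0
    hits = sum(1 for url in sample_urls if is_indexed(url))
    return hits / len(sample_urls)

# Example: collect sample hosts by random-IP probing, then estimate coverage.
# is_indexed is an assumed interface to the engine being evaluated.
samples = ["http://%s/" % ip for ip in (random_ip() for _ in range(1000)) if has_web_server(ip)]
# estimate = coverage_of_quantity(samples, is_indexed)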
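For the quality benchmark, important pages can be selected by ranking a crawled link graph. The following is a minimal PageRank sketch under simplified assumptions (a small in-memory graph, fixed iteration count); it is one possible way to pick the "important page" samples, not the exact procedure used in the paper, which may equally use HITS.

def pagerank(links, damping=0.85, iterations=50):
    # links: dict mapping each page to the list of pages it links to.
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outlinks in links.items():
            if outlinks:
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:
                    if q in new_rank:
                        new_rank[q] += share
            else:
                # Dangling page: spread its rank uniformly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank

# Example: take the top-ranked pages as the quality sample,
# then check them against the search engine's page database as above.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
quality_sample = sorted(scores, key=scores.get, reverse=True)[:2]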