Estimating the size of Arabic indexed web content

Several initiatives to increase Arabic Web content have been undertaken in recent years, and search engines now report that the Arabic share of overall Web content has grown. An accurate estimate of the size of Arabic Web content is crucial for those interested in studying and enriching it. In this paper, we propose a statistics-based system to estimate the size of Arabic indexed Web content using three popular search engines: Google, Yahoo and Bing. Our system selects sample words from an Arabic corpus and uses them to estimate both the size of the Arabic Web content indexed by each search engine and the overlap among the engines. We used Arabic Wikipedia as the corpus, as it provides diverse content accessed by a large number of Internet users. Our results show that, as of December 2010, the Arabic indexed Web content comprised an estimated 2 to 2.1 billion pages.

Key words: World Wide Web, the Web, search engine, index size, Arabic content, Internet, corpus.
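
The sampling idea summarized above can be illustrated with a minimal sketch. The abstract does not state the exact estimator, so the code below assumes a common frequency-ratio extrapolation: if a sample word appears in a known fraction of corpus documents, the hit count a search engine reports for that word is scaled up by the inverse of that fraction. The function names, the use of the median to combine per-word estimates, and the overlap parameter are all hypothetical choices made for illustration, not the authors' implementation.

```python
from statistics import median


def estimate_index_size(samples, corpus_size):
    """Extrapolate an index size from sample words.

    samples: iterable of (corpus_doc_freq, reported_hits) pairs, where
        corpus_doc_freq is the number of corpus documents containing the
        word and reported_hits is the hit count a search engine returns
        for the same word (both supplied by the caller; hypothetical inputs).
    corpus_size: total number of documents in the corpus.
    """
    estimates = []
    for doc_freq, hits in samples:
        if doc_freq <= 0:
            continue  # a word absent from the corpus gives no usable ratio
        # Assumption: the word occurs on the Web at the same rate as in the corpus.
        estimates.append(hits * corpus_size / doc_freq)
    if not estimates:
        raise ValueError("no usable sample words")
    # Assumption: the median is used to dampen outlier words.
    return median(estimates)


def estimate_union(size_a, size_b, overlap_fraction_of_a):
    """Combined size of two indexes, |A| + |B| - |A and B|.

    overlap_fraction_of_a is the (separately estimated) fraction of
    engine A's pages that engine B also indexes.
    """
    return size_a + size_b - overlap_fraction_of_a * size_a
```

In this sketch each sample word yields its own size estimate, and combining them reduces the influence of any single word whose Web frequency differs sharply from its corpus frequency. The union step shows why overlap matters: summing per-engine estimates without subtracting shared pages would overstate the total.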
