Mining the Web and the Internet for Accurate IP Address Geolocations

In this paper, we present Structon, a novel approach that uses Web mining together with inference and IP traceroute to geolocate IP addresses with significantly better accuracy than existing automated approaches. Structon is composed of three ideas, which we realize in three corresponding steps. First, we extract geolocation information for Web server IP addresses from Web pages. Second, we devise heuristic algorithms that use these Web server IP addresses and their geolocations as input to improve both the accuracy and the coverage of the IP geolocation database. Third, for those segments that are not covered by the first two steps, we use IP traceroute to identify the access routers of those segments. When the location of an access router is known, we can deduce the location of the associated segment, since the segment is co-located with its access router. By mining 500 million Web pages collected in China in 2006 (11 percent of the total Web pages in China at that time), we are able to identify the geolocations of 103 million IP addresses. This represents nearly 88 percent of the IP addresses allocated to China as of March 2008. Structon is 87.4 percent accurate at city granularity and up to 93.5 percent accurate at province level. We also used a 10-day Windows Live client log to evaluate our coverage of client IP addresses: Structon identified the geolocations of 98.9 percent of client IP addresses.
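The third step can be illustrated with a minimal sketch. It is not Structon's actual implementation: the function names, the /24 segment granularity, the choice of the last hop before the destination as the access router, and the majority vote are all assumptions made for illustration only.

```python
import ipaddress
from collections import Counter


def segment_of(ip, prefix_len=24):
    """Return the network enclosing `ip` (assumed /24 segment granularity)."""
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)


def locate_by_access_router(traces, known_locations, prefix_len=24):
    """Assign a city to each uncovered segment via its access router.

    traces: dict mapping a destination IP to its list of traceroute hop IPs.
    known_locations: dict mapping an IP (e.g. a router) to a known city.
    Assumption: the access router is the last hop before the destination.
    """
    votes = {}
    for dest, hops in traces.items():
        # Ignore the destination itself if the trace reached it.
        path = [hop for hop in hops if hop != dest]
        if not path:
            continue
        city = known_locations.get(path[-1])
        if city is None:
            continue  # access router location unknown, nothing to infer
        votes.setdefault(segment_of(dest, prefix_len), Counter())[city] += 1
    # Each segment takes the city most often seen at its access router.
    return {seg: counts.most_common(1)[0][0] for seg, counts in votes.items()}
```

Given traces to two hosts in the same segment that both pass through a router with a known city, the function maps that segment to the router's city; segments whose access router has no known location are left unassigned.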
