An efficient sampling method for characterizing points of interests on maps

Map services (e.g., Google Maps) and location-based online social networks (e.g., Foursquare) have recently attracted a great deal of attention and business. With the increasing popularity of these location-based services, exploring and characterizing points of interest (PoIs) such as restaurants and hotels on maps provides valuable information for applications such as start-up marketing research. Because direct, full access to PoI databases is unavailable, it is infeasible to exhaustively search and collect all PoIs within a large area using public APIs, which usually impose a limit on the maximum query rate. In this paper, we propose an effective and efficient method to sample PoIs on maps, and give unbiased estimators to calculate PoI statistics such as sum and average aggregates. Experimental results based on real datasets show that our method is efficient, requiring six times fewer queries than state-of-the-art methods to achieve the same accuracy.
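To make the idea of estimating sum and average aggregates from a non-uniform sample concrete, the following is a minimal sketch of a generic Hansen-Hurwitz style estimator. It is not the sampling method proposed in the paper, which the abstract does not describe. It assumes that each PoI is drawn with replacement and that its per-draw selection probability is known. The function names and the toy population are hypothetical. The sum and count estimates are unbiased under these assumptions, while the average, being a ratio of the two, is only approximately unbiased.

```python
import random


def estimate_sum(samples):
    """Estimate the population total of a PoI attribute.

    samples: list of (value, p) pairs drawn with replacement, where p is the
    known per-draw probability of selecting that PoI (an assumption here).
    """
    return sum(value / p for value, p in samples) / len(samples)


def estimate_count(samples):
    """Estimate the number of PoIs in the population (total of the constant 1)."""
    return sum(1.0 / p for _, p in samples) / len(samples)


def estimate_average(samples):
    """Ratio of the two estimates above; approximately, not exactly, unbiased."""
    return estimate_sum(samples) / estimate_count(samples)


if __name__ == "__main__":
    # Hypothetical population: (attribute value, selection probability) per PoI.
    population = [(10.0, 0.5), (4.0, 0.25), (8.0, 0.25)]
    rng = random.Random(0)  # fixed seed so the demo is repeatable
    weights = [p for _, p in population]
    draws = rng.choices(population, weights=weights, k=1000)
    print("estimated sum:", estimate_sum(draws))
    print("estimated count:", estimate_count(draws))
    print("estimated average:", estimate_average(draws))
```

Dividing each sampled value by its selection probability compensates for PoIs that are more likely to be drawn, which is why the probabilities must be known, or at least computable, for every sampled PoI.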
