Mining points of interest via address embeddings: an unsupervised approach

Digital maps are widely used to explore places that users are interested in, commonly referred to as points of interest (PoIs). On online food delivery platforms, a PoI can be any major private compound from which customers order, such as a hospital, residential complex, office complex, educational institute, or hostel. In this work, we propose an end-to-end unsupervised system design for obtaining polygon representations of PoIs (PoI polygons) from address locations and address texts. We preprocess the address texts using locality names and generate embeddings for them using a deep learning-based architecture, viz. RoBERTa, trained on our internal address dataset. PoI candidates are identified by jointly clustering the anonymised customer phone GPS locations (obtained during address onboarding) and the address-text embeddings. The final list of PoI polygons is obtained from these candidates through novel post-processing steps that involve density-based cluster refinement and a graph-based technique for cluster merging. This algorithm identified 74.8 % more PoIs than the Mummidi-Krumm baseline algorithm run on our internal dataset. We evaluate the algorithm using area-based precision and recall metrics. The proposed algorithm achieves a median area precision of 98 %, a median area recall of 8 %, and a median F-score of 0.15. To improve the recall of the algorithmic polygons, we post-process them using building footprint polygons from the OpenStreetMap (OSM) database. This post-processing reshapes each algorithmic polygon using intersecting polygons and closed private roads from the OSM database, and accounts for intersections with public roads in the OSM database. On the post-processed polygons, we achieve a median area recall of 70 %, a median area precision of 69 %, and a median F-score of 0.69.
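As an illustration of the joint clustering step, the following is a minimal sketch and not the pipeline described in this work. It assumes one plausible reading of "joint" clustering: two addresses count as neighbours only if their GPS points lie within a distance threshold and their address embeddings exceed a cosine-similarity threshold, and a plain DBSCAN-style expansion is run on that neighbourhood relation. The function names, the thresholds (`eps_m`, `min_sim`, `min_pts`), and the two-dimensional toy embeddings are all hypothetical; real embeddings would come from the trained RoBERTa model.

```python
import math

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(h))

def cosine_sim(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def joint_dbscan(points, embeddings, eps_m=100.0, min_sim=0.8, min_pts=3):
    """DBSCAN-style clustering where neighbours must be close in space AND
    similar in embedding space (an assumed joint criterion).
    Returns one label per point; -1 marks noise."""
    n = len(points)

    def neighbours(i):
        return [j for j in range(n)
                if haversine_m(points[i], points[j]) <= eps_m
                and cosine_sim(embeddings[i], embeddings[j]) >= min_sim]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue             # already assigned; do not expand again
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:   # j is a core point, so expand through it
                queue.extend(nb)
    return labels

# Toy data: two co-located groups with different address texts, plus one outlier.
points = [(12.9716, 77.5946), (12.9717, 77.5946), (12.9716, 77.5947),
          (12.97165, 77.59465), (12.9717, 77.5947), (12.97155, 77.5946),
          (12.9800, 77.6000)]
embeddings = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1],
              [0.0, 1.0], [0.1, 0.9], [0.1, 1.0],
              [1.0, 0.0]]
labels = joint_dbscan(points, embeddings)
print(labels)
```

In this toy input, the first six points are within a few tens of metres of each other, so a purely spatial clustering would merge them. The embedding condition is what separates them into two PoI candidates, which is the motivation for clustering locations and address texts together.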
The ground-truth polygons used for evaluation were obtained by manually validating the algorithmic polygons produced by the Mummidi-Krumm baseline approach. These polygons are not used to train the proposed algorithm pipeline; hence, the algorithm is unsupervised.
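To make the area-based metrics concrete, here is a minimal sketch under stated assumptions. It takes area precision to be the intersection area divided by the predicted polygon's area, area recall to be the intersection area divided by the ground-truth polygon's area, and the F-score to be their harmonic mean; the abstract does not spell out these definitions, so they are assumed here. The code further assumes planar coordinates (for example, projected metres) and a convex, counter-clockwise ground-truth polygon so that Sutherland-Hodgman clipping applies. All function names are hypothetical.

```python
def polygon_area(poly):
    """Shoelace area of a simple polygon given as [(x, y), ...]."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def clip_convex(subject, clip):
    """Sutherland-Hodgman clipping of `subject` against a convex,
    counter-clockwise `clip` polygon. Returns the intersection polygon."""
    def inside(p, a, b):
        # p is on or to the left of the directed edge a -> b
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def cross_point(p, q, a, b):
        # Intersection of the line through p, q with the line through a, b
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        denom = dx1 * dy2 - dy1 * dx2
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / denom
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for a, b in zip(clip, clip[1:] + clip[:1]):
        if not output:
            break
        current, output = output, []
        for p, q in zip(current, current[1:] + current[:1]):
            if inside(q, a, b):
                if not inside(p, a, b):
                    output.append(cross_point(p, q, a, b))
                output.append(q)
            elif inside(p, a, b):
                output.append(cross_point(p, q, a, b))
    return output

def area_metrics(pred, truth):
    """Return (area precision, area recall, F-score) for one polygon pair."""
    inter = polygon_area(clip_convex(pred, truth))
    precision = inter / polygon_area(pred)
    recall = inter / polygon_area(truth)
    f = 2 * precision * recall / (precision + recall) if inter else 0.0
    return precision, recall, f

# Toy case: a small predicted polygon lying entirely inside the ground truth.
truth = [(0, 0), (10, 0), (10, 10), (0, 10)]   # area 100
pred = [(2, 2), (6, 2), (6, 4), (2, 4)]        # area 8
precision, recall, f_score = area_metrics(pred, truth)
print(precision, recall, round(f_score, 3))
```

The toy case mirrors the pattern reported for the algorithmic polygons: a prediction that sits well inside the true compound has near-perfect area precision but low area recall, which pulls the F-score down. Real PoI polygons are often non-convex, in which case a general-purpose geometry library would be needed for the intersection.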
