A hybrid method for Chinese address segmentation

ABSTRACT Chinese address segmentation is a major challenge in geocoding for geographic information systems (GIS). Most previous studies rely on predefined gazetteers and do not exploit the information contained in a raw address corpus. This paper proposes a hybrid method that combines rule-based and statistical techniques to segment Chinese addresses without a predefined gazetteer. The statistical component extracts address information from a raw address corpus, and the rule-based component segments the addresses. The hybrid method is compared with two typical statistical methods, and with the combinations of those methods with rule-based methods, in an experiment on approximately 460,000 address items from Shenzhen City, China. The proposed method achieves an F-score above 0.8, higher than those of the compared methods, which supports its effectiveness.
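
The F-score reported above is, in its standard form, the harmonic mean of precision and recall. The abstract does not state the exact evaluation protocol, so the following is only a minimal sketch assuming word-level scoring, in which a predicted address element counts as correct when its character span exactly matches a span in the reference segmentation. The function names and the sample address are hypothetical illustrations, not taken from the paper.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmented address into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f_score(
    gold: List[str], predicted: List[str]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) for one segmented address.

    Assumption: a predicted element is correct only if its span exactly
    matches a gold span. The paper may score differently.
    """
    gold_spans = to_spans(gold)
    pred_spans = to_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical example address, not drawn from the paper's corpus.
gold = ["深圳市", "南山区", "科技路", "1号"]
predicted = ["深圳市", "南山区", "科技", "路", "1号"]
precision, recall, f_score = segmentation_f_score(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F={f_score:.3f}")
```

In this example three of the five predicted elements match the four reference elements, so precision is 3/5, recall is 3/4, and the F-score is about 0.667. A corpus-level score would typically pool the correct, predicted, and reference counts over all address items before computing the ratios.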
