A hybrid method for Chinese address segmentation

ABSTRACT Chinese address segmentation is a major challenge in geocoding for geographic information systems. Most previous studies have relied on predefined gazetteers and have not used the information contained in a raw address corpus. This paper proposes a hybrid method for Chinese address segmentation that combines rule-based and statistical methods and requires no predefined gazetteer. The approach uses statistical methods to extract address information from a raw address corpus and a rule-based method to segment Chinese addresses. Two typical statistical methods, and their combinations with rule-based methods, are compared with the hybrid method in an experiment involving approximately 460,000 address items from Shenzhen City, China. The experimental results indicate that the proposed method achieves an F-score of over 0.8, which is higher than the scores of the existing methods it was compared with.
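The abstract reports its results as an F-score but does not say how the score is computed. As a point of reference only, the sketch below shows one common word-level definition for segmentation: a predicted segment counts as correct when its character span exactly matches a gold segment. The function names and the sample address are hypothetical and are not taken from the paper or its corpus, and the paper's actual evaluation protocol may differ.

```python
def spans(tokens):
    """Convert a token list into a set of (start, end) character spans."""
    out, pos = set(), 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out


def segmentation_f_score(predicted, gold):
    """Word-level precision, recall and F-score for one segmented string.

    Assumption: a predicted segment is correct only if its span exactly
    matches a gold segment. This is a common convention, not necessarily
    the one used in the paper.
    """
    if "".join(predicted) != "".join(gold):
        raise ValueError("predicted and gold must cover the same text")
    pred_spans, gold_spans = spans(predicted), spans(gold)
    correct = len(pred_spans & gold_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example address, not drawn from the Shenzhen corpus.
gold = ["深圳市", "南山区", "科技路", "1号"]
predicted = ["深圳市", "南山区", "科技", "路1号"]
precision, recall, f_score = segmentation_f_score(predicted, gold)
print(precision, recall, f_score)
```

In this example two of the four predicted segments match the gold segmentation, so precision, recall and F-score are all 0.5. Scoring a whole corpus would normally sum the correct, predicted and gold counts over all addresses before computing the ratios.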

Yu Zhang | Wei Wang | Lin Li | Biao He
