Consolidating client names in the lobbying disclosure database using efficient clustering techniques
A fuzzy-matching clustering algorithm is applied to group similar client names in the Lobbying Disclosure Database. Because of errors and inconsistencies in manual data entry, a client's name often appears in multiple forms, including misspellings and shorthand versions, which makes it difficult to associate lobbying activities and interests with a single client. The various forms of a client's name therefore need to be consolidated into one group, or cluster. To cluster efficiently, we apply a series of preprocessing techniques before calculating the string distance between two client names. An optimized threshold selection improves clustering accuracy, and single-linkage hierarchical clustering is used to group the client names. The algorithm proves effective at clustering similar client names and also helps identify a representative name for each client cluster.
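The pipeline described above (preprocess, compute pairwise string distance, apply a threshold, cluster by single linkage, pick a representative) can be illustrated with a minimal sketch. This is not the authors' implementation; several pieces are assumptions made for illustration: the suffix list `SUFFIXES` is hypothetical, the distance is `1 - difflib.SequenceMatcher.ratio()` as a stand-in for whatever string distance the paper uses, the threshold `0.2` is an arbitrary placeholder rather than the optimized value, and the representative is chosen as the cluster medoid. Cutting a single-linkage hierarchy at a distance threshold is equivalent to taking connected components of the graph whose edges join names within that threshold, which is what the union-find below computes.

```python
import re
from difflib import SequenceMatcher

# Hypothetical stop list; the paper's actual preprocessing rules are not given.
SUFFIXES = {"inc", "corp", "corporation", "co", "llc", "ltd", "company"}


def normalize(name):
    """Lowercase, replace punctuation with spaces, drop common corporate suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    kept = [t for t in tokens if t not in SUFFIXES]
    return " ".join(kept or tokens)  # fall back if every token was a suffix


def distance(a, b):
    """Stand-in string distance in [0, 1]; 0 means identical strings."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def single_linkage(names, threshold=0.2):
    """Group names whose normalized forms are chained by distances <= threshold."""
    keys = [normalize(n) for n in names]
    parent = list(range(len(names)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # O(n^2) pairwise comparison; fine for a sketch, slow for a full database.
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if distance(keys[i], keys[j]) <= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())


def representative(cluster):
    """Assumed rule: the medoid, i.e. the name with the smallest total distance."""
    keys = [normalize(n) for n in cluster]
    return min(cluster, key=lambda n: sum(distance(normalize(n), k) for k in keys))


names = [
    "Microsoft Corp.",
    "Microsoft Corporation",
    "Mircosoft Corp",
    "Boeing Co.",
    "Boeing Company",
]
for group in single_linkage(names):
    print(representative(group), "<-", group)
```

Single linkage is prone to chaining, where a run of near-duplicates links two otherwise distinct names, so the threshold choice matters a great deal in practice.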