Paper information - Consolidating client names in the lobbying disclosure database using efficient clustering techniques

A fuzzy-matching clustering algorithm is applied to group similar client names in the Lobbying Disclosure Database. Because of errors and inconsistencies in manual data entry, a client's name often appears in multiple forms, including misspellings and abbreviated forms, which makes it difficult to associate lobbying activities and interests with a single client. The various forms of the same client's name therefore need to be consolidated into one group, or cluster. For efficient clustering, we apply a series of preprocessing techniques before calculating the string distance between two client names. An optimized threshold selection is adopted to improve clustering accuracy, and a single-linkage hierarchical clustering technique is used to cluster the client names. The algorithm proves effective in clustering similar client names and also helps identify a representative name for each client cluster.
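The pipeline outlined in the abstract (preprocessing, pairwise string distance, a distance threshold, single-linkage clustering, and a representative name per cluster) can be illustrated with a minimal sketch. It is not the authors' implementation: the suffix list, the use of normalized Levenshtein distance, the threshold value of 0.25, and the "most frequent spelling" rule for the representative name are all assumptions made for illustration. Cutting a single-linkage hierarchy at a fixed distance is equivalent to taking connected components of the graph that links every pair within that distance, which is what the union-find loop computes.

```python
import re
from collections import Counter

# Hypothetical list of corporate suffixes to drop during preprocessing.
SUFFIXES = {"inc", "llc", "corp", "co", "ltd", "corporation", "company", "incorporated"}

def normalize(name):
    """Lowercase, replace punctuation with spaces, and drop common suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def single_linkage(names, threshold=0.25):
    """Group names whose normalized forms are chained by distances <= threshold.

    The threshold is an assumed value, not one taken from the paper.
    """
    keys = [normalize(n) for n in names]
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(keys)):
        for j in range(i + 1, len(keys)):
            if distance(keys[i], keys[j]) <= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())

def representative(cluster):
    """Assumed rule: the most frequent raw spelling represents the cluster."""
    return Counter(cluster).most_common(1)[0][0]

names = ["Microsoft Corp.", "Microsoft Corporation", "Micorsoft Corp",
         "Boeing Company", "Boeing Co."]
clusters = single_linkage(names)
```

With these sample names, the three Microsoft variants fall into one cluster and the two Boeing variants into another. The pairwise loop is quadratic in the number of names, so a real database would likely need blocking or indexing before this step.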

Chengcui Zhang | Ariel D. Smith | Grant T. Savage | Rajan Kumar Kharel | Niju Shrestha
