E-mail address categorization based on semantics of surnames

Surname (family name) analysis is used in geography to understand population origins, migration, identity, social norms and cultural customs, some of which are thought to have evolved over generations. Surnames exhibit good statistical properties that can be used to extract information from names data sets, such as the automatic detection of ethnic or community groups. An e-mail address often contains a surname as a substring, and this containment may be full or partial. The objective of this paper is to categorize e-mail addresses based on the semantics of surnames. This is achieved in two phases. The first phase deals with surname representation and clustering: a vector space model is proposed in which latent semantic analysis is performed, and clustering is done using the average-linkage method. In the second phase, an e-mail address is categorized as belonging to one of the categories discovered in the first phase. This requires substring matching, which is done efficiently using a suffix tree data structure. We perform an experimental evaluation on the 500 most frequently occurring surnames in India and the United Kingdom, and we categorize the e-mail addresses that contain these surnames as substrings.
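The second phase described above can be illustrated with a minimal sketch. It is not the authors' implementation. It swaps the suffix tree for an uncompressed suffix trie built from nested dictionaries, which answers the same substring queries but uses quadratic rather than linear space. It handles only full containment of a surname, not partial containment. The function names, the surname-to-category table and the cluster labels are all hypothetical, standing in for whatever the first phase would produce.

```python
from typing import Dict, Optional


def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of `text` into a nested-dict trie.

    This is a simple stand-in for a suffix tree: same queries,
    but O(n^2) construction instead of linear time.
    """
    root: dict = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie: dict, pattern: str) -> bool:
    """A pattern is a substring of the text iff it labels a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def categorize(address: str, surname_to_category: Dict[str, str]) -> Optional[str]:
    """Return the category of the longest surname fully contained in the local part.

    Only the part before '@' is searched. Returns None when no surname matches.
    """
    local_part = address.split("@", 1)[0].lower()
    trie = build_suffix_trie(local_part)
    matches = [s for s in surname_to_category if contains(trie, s.lower())]
    if not matches:
        return None
    # Prefer the longest match; ties fall back to dictionary insertion order.
    best = max(matches, key=len)
    return surname_to_category[best]


# Hypothetical phase-one output (assumption, not data from the paper):
# each surname mapped to the label of the cluster it was assigned to.
SURNAME_TO_CATEGORY = {
    "patel": "cluster_A",
    "smith": "cluster_B",
    "reddy": "cluster_A",
}

print(categorize("j.smith42@example.com", SURNAME_TO_CATEGORY))
print(categorize("kpatel@example.org", SURNAME_TO_CATEGORY))
print(categorize("info@example.net", SURNAME_TO_CATEGORY))
```

Building the index over the e-mail address rather than over the surname list keeps each lookup proportional to the surname's length. A production version would more plausibly use a linear-time suffix tree construction, as the abstract indicates.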

Muttukrishnan Rajarajan | Paul A. Longley | Yogachandran Rahulamathavan | P. Viswanath | Suresh Veluru
