Contact Deduplication on Mobile Devices Using Textual Similarity and Machine Learning

This paper presents a method for identifying duplicate contacts, i.e., records that represent the same person or organization, among contacts collected automatically from multiple data sources. Contacts are compared using several similarity functions, whose scores are combined by a classification model based on decision trees. This eliminates the need for an expert to configure similarity thresholds manually. In the experiments, the proposed method correctly identified up to 92% of duplicate contacts.
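
To make the idea concrete, the following minimal Python sketch computes a vector of similarity scores for a pair of contacts and learns a split threshold from labeled pairs instead of having it set by hand. It is an illustration only, not the implementation evaluated in the paper: the field names (`name`, `phone`), the three similarity functions, the helper names, and the example contacts are all assumptions, and a depth-one tree (a decision stump) stands in for the full decision-tree model.

```python
from difflib import SequenceMatcher


def char_ratio(a, b):
    """Character-level similarity in [0, 1] (assumed similarity function)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_jaccard(a, b):
    """Jaccard overlap of whitespace-separated tokens (assumed function)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def phone_match(a, b):
    """1.0 if both numbers have the same non-empty digit string, else 0.0."""
    da = "".join(ch for ch in a if ch.isdigit())
    db = "".join(ch for ch in b if ch.isdigit())
    return 1.0 if da and da == db else 0.0


def features(c1, c2):
    """Similarity score vector for a pair of contacts."""
    return [
        char_ratio(c1["name"], c2["name"]),
        token_jaccard(c1["name"], c2["name"]),
        phone_match(c1["phone"], c2["phone"]),
    ]


def fit_stump(X, y):
    """Learn a depth-one tree: predict 'duplicate' when X[i][f] >= t."""
    best = None
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds are midpoints between distinct observed scores.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            errors = sum((row[f] >= t) != label for row, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    if best is None:
        raise ValueError("every feature is constant; no split is possible")
    return best[1], best[2]


def predict(stump, x):
    f, t = stump
    return x[f] >= t


# Hypothetical labeled pairs: (contact, contact, is_duplicate).
pairs = [
    ({"name": "Maria Silva", "phone": "(11) 91234-5678"},
     {"name": "Maria da Silva", "phone": "11 91234 5678"}, True),
    ({"name": "João Souza", "phone": "21 99876-5432"},
     {"name": "Joao Souza", "phone": "21998765432"}, True),
    ({"name": "Ana Costa", "phone": "31 98888-1111"},
     {"name": "Ana Castro", "phone": "31 97777-2222"}, False),
    ({"name": "Pedro Lima", "phone": "41 90000-0001"},
     {"name": "Carla Dias", "phone": "41 90000-0002"}, False),
]

X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
stump = fit_stump(X, y)
print("split on feature", stump[0], "at threshold", stump[1])
```

The point of the sketch is that the threshold is a by-product of training on labeled pairs. A full decision tree applies the same split search recursively, which lets it combine several similarity scores in one decision path.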