Automatically locating salutation and signature blocks in emails

This paper addresses the problem of automatically locating salutation and signature blocks in the body of plain-text emails. The salutation and signature blocks of an email usually contain identity information about the email's sender or recipients. Locating and extracting these blocks has many potential applications, such as entity attribute extraction, person-entity-based email social network analysis, anonymization of email corpora, and the improvement of automatic content-based email classifiers and email threading. Our approach combines a statistical method with a rule-based restriction method, which can greatly improve locating efficiency while maintaining relatively high accuracy of the extracted blocks. We use the statistical method to roughly estimate the number of lines in the salutation and signature blocks, and then introduce restriction rules to refine the lines located by the statistical method. Results on the public subset of the Enron corpus demonstrate the high performance of our approach, with an average F1 value above 94%.
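As an illustration only, the two-stage idea can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the method evaluated in this paper: the window sizes `MAX_SALUTATION_LINES` and `MAX_SIGNATURE_LINES`, the patterns `SALUTATION_RE` and `CLOSING_RE`, and the function `locate_blocks` are all hypothetical names and values chosen for the example. The fixed windows stand in for the statistical estimate of block length, and the regular expressions stand in for the restriction rules.

```python
import re

# Hypothetical window sizes. In practice these would be estimated from
# line-count statistics over a labeled corpus (assumption for this sketch).
MAX_SALUTATION_LINES = 2
MAX_SIGNATURE_LINES = 5

# Hypothetical restriction rules, expressed as simple patterns.
SALUTATION_RE = re.compile(r"^(hi|hello|hey|dear)\b", re.IGNORECASE)
CLOSING_RE = re.compile(
    r"^(regards|best regards|thanks|thank you|sincerely|cheers)\b[,.!]?\s*$",
    re.IGNORECASE,
)


def locate_blocks(body):
    """Return (salutation_line_indices, signature_line_indices) for an email body."""
    lines = body.splitlines()
    non_empty = [i for i, line in enumerate(lines) if line.strip()]

    # Stage 1 (rough estimate): the first few non-empty lines are salutation
    # candidates. Stage 2 (rule): keep only those that look like a greeting.
    salutation = [
        i for i in non_empty[:MAX_SALUTATION_LINES]
        if SALUTATION_RE.match(lines[i].strip())
    ]

    # Stage 1 (rough estimate): the last few non-empty lines are signature
    # candidates, excluding anything already labeled as salutation.
    candidates = [
        i for i in non_empty[-MAX_SIGNATURE_LINES:] if i not in salutation
    ]

    # Stage 2 (rule): the signature starts at the first closing phrase in
    # the candidate window and runs to the end of the window.
    signature = []
    for pos, i in enumerate(candidates):
        if CLOSING_RE.match(lines[i].strip()):
            signature = candidates[pos:]
            break

    return salutation, signature


if __name__ == "__main__":
    example = (
        "Hi John,\n"
        "\n"
        "Please review the attached report before Friday.\n"
        "Let me know if anything is unclear.\n"
        "\n"
        "Thanks,\n"
        "Mary Smith\n"
        "Example Corp\n"
        "713-555-0100"
    )
    print(locate_blocks(example))
```

For the example message, the sketch labels line 0 as the salutation and lines 5 through 8 as the signature block. Restricting the rules to a small window of candidate lines is what keeps the rule stage cheap: the patterns are never applied to the main body of the message.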