Unsupervised authorship analysis of phishing webpages

Authorship analysis of phishing websites lets investigators examine phishing attacks beyond basic analysis. In authorship analysis, salient features of documents are used to determine properties of the author, such as which of a set of candidate authors wrote a given document. In unsupervised authorship analysis, the aim is to group documents so that all documents by one author fall in the same group. Applying this to cyber-attacks reveals the size and scope of attacks from specific groups. This in turn allows investigators to focus their attention on specific attacking groups rather than trying to profile multiple independent attackers. In this paper, we analyse phishing websites using the current state-of-the-art unsupervised authorship analysis method, called NUANCE. The results indicate that the method produces clusters which correlate strongly with authorship, as evaluated using expert knowledge and external information, and that it improves on a previous approach with known flaws.
