Detecting Fake Websites: The Contribution of Statistical Learning Theory

Fake websites have become increasingly pervasive, generating billions of dollars in fraudulent revenue at the expense of unsuspecting Internet users. The design and appearance of these websites make it difficult for users to manually identify them as fake. Automated detection systems have emerged as a mechanism for combating fake websites; however, most are fairly simplistic in terms of the fraud cues and detection methods they employ. Consequently, existing systems are susceptible to the myriad obfuscation tactics used by fraudsters, resulting in poor fake website detection performance. In light of these deficiencies, we propose the development of a new class of fake website detection systems based on statistical learning theory (SLT). Using a design science approach, we developed a prototype system to demonstrate the potential utility of this class of systems. We conducted a series of experiments comparing the proposed system against several existing fake website detection systems on a test bed encompassing 900 websites. The results indicate that systems grounded in SLT can detect various categories of fake websites more accurately by utilizing richer sets of fraud cues in combination with problem-specific knowledge. Given the hefty cost exacted by fake websites, the results have important implications for e-commerce and online security.
