Informed software installation through License Agreement Categorization

Spyware can be detected with machine learning techniques that identify patterns in the End User License Agreements (EULAs) presented by application installers. However, existing solutions have required manual input from the user and have shown varying degrees of accuracy. We have implemented a prototype that automatically extracts and classifies EULAs, and used it to generate a large data set of EULAs. This data set is used to compare four machine learning algorithms on the task of classifying EULAs. Furthermore, we investigate the effect of feature selection and, for the two best-performing algorithms, whether parameter tuning improves performance. Our conclusion is that feature selection and parameter tuning yield only limited performance gains in this context. However, both the Bagging and the Random Forest algorithms show promising results, with Bagging reaching an AUC of 0.997 and a False Negative Rate of 0.062. This demonstrates the applicability of License Agreement Categorization for realizing informed software installation.
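For readers unfamiliar with the two reported measures, the following is a minimal sketch of how an AUC and a False Negative Rate can be computed for a binary EULA classifier. It is not the evaluation code used in this work. The function names, the toy labels and scores, the 0.5 decision threshold, and the choice of treating spyware EULAs as the positive class (label 1) are all assumptions made for illustration.

```python
def false_negative_rate(y_true, y_pred):
    """FNR = FN / (FN + TP): share of positive instances predicted as negative."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return fn / (fn + tp)


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = spyware EULA (assumed positive class), 0 = legitimate EULA.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]  # made-up classifier scores

# Assumed decision threshold of 0.5 to turn scores into class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

fnr_value = false_negative_rate(y_true, y_pred)
auc_value = auc(y_true, scores)
print(f"FNR = {fnr_value:.3f}, AUC = {auc_value:.3f}")
```

Under this assumed labeling, a false negative is a spyware EULA classified as legitimate, which is the costly error for a user deciding whether to install. AUC is independent of the threshold, while the False Negative Rate depends on it.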
