DeMalC: A Feature-rich Machine Learning Framework for Malicious Call Detection

Malicious phone calls are a plague: unscrupulous salesmen or criminals place them to obtain money illegally from victims. As a result, there has been broad interest in developing systems that alert end users when they receive such calls. Typically, these systems judge phone numbers either by consulting a crowd-generated blacklist or by applying machine learning techniques to extracted features. However, the former is fragile because few users contribute reports and those who do are rarely diligent, while the latter suffers from a scarcity of effective features. In this work, we propose a solution named DeMalC that addresses these problems by applying a machine learning algorithm to a novel set of discriminative features. These features consist of properties and behaviors that are powerful enough to characterize phone numbers from different perspectives. We extensively evaluated DeMalC using massive call detail records. The experimental results show the effectiveness of our extracted features. DeMalC achieves 91.86% overall accuracy and a 79.34% F1-score on the detection of malicious phone numbers, has been deployed online, and has been demonstrated to be a competitive solution for detecting malicious calls.
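The abstract reports both overall accuracy and F1-score, and the two can diverge noticeably when malicious numbers are a small minority of the data. The following is a minimal sketch of how the two metrics are computed from a binary confusion matrix. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the DeMalC evaluation.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Return (accuracy, F1) for a binary confusion matrix.

    tp: malicious numbers correctly flagged
    fp: benign numbers wrongly flagged
    fn: malicious numbers missed
    tn: benign numbers correctly passed
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Hypothetical, imbalanced toy counts (not from the paper):
# 80 malicious numbers among 1000 in total.
acc, f1 = accuracy_and_f1(tp=60, fp=20, fn=20, tn=900)
print(f"accuracy={acc:.2%}, F1={f1:.2%}")
```

In this toy case the large number of correctly passed benign numbers lifts accuracy well above F1, which depends only on how the malicious class is handled. That is one reason to report both figures for a detection task of this kind.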
