Automatically Discovering Surveillance Devices in the Cyberspace

Surveillance devices with IP addresses are accessible on the Internet and play a crucial role in monitoring physical worlds. Discovering surveillance devices is a prerequisite for ensuring high availability, reliability, and security of these devices. However, today's device search depends on keywords of packet head fields, and keyword collection is done manually, which requires enormous human efforts and induces inevitable human errors. The difficulty of keeping keywords complete and updated has severely impeded an accurate and large-scale device discovery. To address this problem, we propose to automatically generate device fingerprints based on webpages embedded in surveillance devices. We use natural language processing to extract the content of webpages and machine learning to build a classification model. We achieve real-time and non-intrusive web crawling by leveraging network scanning technology. We implement a prototype of our proposed discovery system and evaluate its effectiveness through real-world experiments. The experimental results show that those automatically generated fingerprints yield very high accuracy of 99% precision and 96% recall. We also deploy the prototype system on Amazon EC2 and search surveillance devices in the whole IPv4 space (nearly 4 billion). The number of devices we found is almost 1.6 million, about twice as many as those using commercial search engines.

[1]  David Brumley,et al.  Towards Automated Dynamic Analysis for Linux-based Embedded Firmware , 2016, NDSS.

[2]  T. Kohno,et al.  Remote physical device fingerprinting , 2005, 2005 IEEE Symposium on Security and Privacy (S&P'05).

[3]  Franck Veysset,et al.  New Tool And Technique For Remote Operating System Fingerprinting , 2002 .

[4]  Douglas Comer,et al.  Probing TCP Implementations , 1994, USENIX Summer.

[5]  Aurélien Francillon,et al.  A Large-Scale Analysis of the Security of Embedded Firmwares , 2014, USENIX Security Symposium.

[6]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[7]  Robert Wall Emerson The χ2 (Chi-Square) Statistic , 2017 .

[8]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[9]  Ramesh Govindan,et al.  Census and survey of the visible internet , 2008, IMC '08.

[10]  Eric Wustrow,et al.  ZMap: Fast Internet-wide Scanning and Its Security Applications , 2013, USENIX Security Symposium.

[11]  Vern Paxson,et al.  How to Own the Internet in Your Spare Time , 2002, USENIX Security Symposium.

[12]  Christopher M. Bishop,et al.  Pattern Recognition and Machine Learning (Information Science and Statistics) , 2006 .

[13]  Qiang Li,et al.  Characterizing industrial control system devices on the Internet , 2016, 2016 IEEE 24th International Conference on Network Protocols (ICNP).

[14]  Fang Yu,et al.  Populated IP addresses: classification and applications , 2012, CCS.

[15]  Vern Paxson,et al.  A brief history of scanning , 2007, IMC '07.

[16]  Jeffrey H. Meyerson,et al.  The Go Programming Language , 2014, IEEE Softw..

[17]  Brian W. Kernighan,et al.  The Go Programming Language , 2015 .

[18]  Fang Yu,et al.  How dynamic are IP addresses? , 2007, SIGCOMM '07.

[19]  Dmitri Loguinov,et al.  Demystifying service discovery: implementing an internet-wide scanner , 2010, IMC '10.