Real-time query systems for complex data sources

This dissertation presents techniques for building scalable systems that allow real-time querying of complex data sources. In recent years, advances in networking and sensing have dramatically increased the volume of information available to data consumers. Coping with such scale and such high data rates, however, often requires processing data in real time, as it arrives, rather than storing it for later analysis. Our thesis is that by including the data acquisition process in the overall system design, it is possible to build scalable, real-time stream processing systems for complex data sources. We have built two systems that demonstrate design features required for scalable operation in our chosen domains.

The first system, Cobra, uses online RSS feeds (such as blogs, news articles, and websites' user comments) as its data source. Cobra repeatedly crawls a set of RSS feeds and matches their contents against keyword-based user queries similar to those used in Web search engines. Because RSS-based content can change frequently, the design keeps the latency between crawls low while still scaling to a large number of RSS feeds and many concurrent user queries.

The second system, Argos, performs widely distributed, outdoor wireless network monitoring. By capturing 802.11 WiFi traffic across a large urban area, Argos supports a wide range of user queries, such as mobile node tracking, malware detection, and traffic characterization. The deployed sniffer nodes are connected by a wireless mesh network, whose limited bandwidth introduces additional challenges. To address this restriction, we designed a novel in-network packet merging process and demonstrate its bandwidth savings. Argos also provides a variety of channel management schemes: 802.11 defines up to 14 radio channels, but each sniffer can capture from only one channel at a time, so policies are needed to decide when to capture from which channel.
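To make the channel management problem concrete, the following is a minimal sketch of two possible capture policies: a fixed rotation, and an allocation that gives busier channels more capture slots. It is not taken from Argos. The function names, the slot-based model, the smoothing term, and the traffic counts are all assumptions introduced for illustration.

```python
# Illustrative sketch only: hypothetical channel-selection policies for a
# sniffer that can capture from one 802.11 channel at a time.

CHANNELS = list(range(1, 15))  # 802.11 defines up to 14 radio channels


def round_robin(slot: int) -> int:
    """Return the channel to capture during time slot `slot` (fixed rotation)."""
    return CHANNELS[slot % len(CHANNELS)]


def allocate_slots(frames_seen: dict[int, int], total_slots: int) -> dict[int, int]:
    """Split `total_slots` capture slots across channels.

    Every channel gets at least one slot so that no channel is ignored.
    Each spare slot then goes to the channel with the highest ratio of
    (frames observed + 1) to slots already granted. The +1 is a smoothing
    term (an assumption) so that silent channels still share spare slots.
    Ties go to the lower channel number.
    """
    if total_slots < len(CHANNELS):
        raise ValueError("need at least one slot per channel")
    alloc = {ch: 1 for ch in CHANNELS}
    for _ in range(total_slots - len(CHANNELS)):
        best = max(
            CHANNELS,
            key=lambda ch: ((frames_seen.get(ch, 0) + 1) / alloc[ch], -ch),
        )
        alloc[best] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical per-channel frame counts from an earlier observation window.
    observed = {1: 900, 6: 90, 11: 10}
    print([round_robin(s) for s in range(16)])
    print(allocate_slots(observed, total_slots=20))
```

The fixed rotation treats all channels equally, while the weighted allocation trades coverage of quiet channels for more capture time on busy ones. Which trade-off is appropriate depends on the queries being served.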
These systems are built around three design principles that aid real-time querying of complex data sources: query interfaces tailored to the application's specific data types, optimized data collection processes, and feedback from queries to the collection process.
