A bottom-up approach to real-time search in large networks and clouds

Networked systems, such as telecom networks and cloud infrastructures, generate and hold vast amounts of configuration and operational data. The goal of this work is to make all this data available through a real-time search process named network search, which will enable new real-time management solutions. The thesis contains several contributions towards engineering a network search system. Key elements of our design are a weakly structured information model that includes spatial properties, a query language that supports location- and schema-oblivious search queries, a peer-to-peer architecture, an echo protocol for scalable query processing, and an indexing protocol for efficient routing of spatial queries. The data against which network search is performed is maintained in local real-time databases close to the data sources. The design follows a bottom-up approach in the sense that the topology for query routing is constructed from the underlying network topology. We have built a prototype of the system on a cloud testbed and developed applications that use network search functionality. Testbed measurements suggest that it is feasible to engineer a network search system that processes queries with low latency and low overhead and that can scale to 100,000 nodes. Simulation results for spatial queries show that query processing achieves response times and incurs overhead close to those of an optimal protocol, and that it remains accurate under significant churn.
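To make the idea of echo-based query processing more concrete, the following is a minimal sketch, not the protocol specified in the thesis. It assumes a generic echo pattern: a query spreads outward from the node that receives it, and partial results are merged on the way back along the spanning tree that the outward wave induces. The function name `echo_query`, the adjacency map `neighbors`, the per-node store `local_data`, and the predicate `matches` are all hypothetical names chosen for illustration. The wave is simulated sequentially here, whereas a real deployment would exchange asynchronous messages between nodes.

```python
from collections import deque


def echo_query(neighbors, local_data, root, matches):
    """Simulate one echo wave over an undirected network graph.

    neighbors:  dict mapping node -> list of adjacent nodes (assumed symmetric)
    local_data: dict mapping node -> list of objects held in its local database
    root:       node that receives the query
    matches:    predicate deciding whether an object satisfies the query
    """
    parent = {root: None}
    order = []                       # nodes in the order the query reaches them
    queue = deque([root])

    # Expansion phase: the first copy of the query a node receives
    # determines its parent in the spanning tree.
    while queue:
        node = queue.popleft()
        order.append(node)
        for nb in neighbors[node]:
            if nb not in parent:
                parent[nb] = node
                queue.append(nb)

    # Each node evaluates the query against its own local data only.
    partial = {
        node: [obj for obj in local_data.get(node, []) if matches(obj)]
        for node in order
    }

    # Contraction phase: visiting nodes in reverse order handles children
    # before parents, so each node forwards an already merged result.
    for node in reversed(order):
        if parent[node] is not None:
            partial[parent[node]].extend(partial[node])

    return partial[root]


# Hypothetical four-node network with made-up operational data.
neighbors = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
local_data = {
    "a": [{"type": "vm", "cpu": 0.2}],
    "b": [{"type": "vm", "cpu": 0.9}],
    "d": [{"type": "vm", "cpu": 0.95}, {"type": "link", "load": 0.1}],
}
result = echo_query(neighbors, local_data, "a", lambda obj: obj.get("cpu", 0) > 0.8)
```

In this sketch the querying node needs no knowledge of where matching objects are stored, which mirrors the location-oblivious queries mentioned above. Because results are merged inside the network, each node sends its parent a single reply on the way back.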
