Scalable SNMP-based monitoring systems for network computing

Traditional centralized monitoring systems (MS) do not scale to emerging large network computing systems (NCS) with varying participating nodes, sizable network distances, and unpredictable delays. The manager, a single point of control and information gathering, becomes a bottleneck, resulting in increased delays and overheads. In this research, the scalability problem is addressed using an architectural approach within the SNMP (Simple Network Management Protocol) context. New mechanisms using SNMP primitives are designed for the proposed architecture. The result is SIMONE, an SNMP-based MS for NCS. SNMP provides wide acceptability, allows the monitoring of heterogeneous systems, permits integration with other SNMP systems, and reduces implementation costs. Individual components of SIMONE are implemented, and metrics are defined to evaluate the performance of a manager-agent pair. Comparisons with alternative monitoring methods indicate resolution, latency, and overhead improvements. Distribution is achieved by introducing one or more levels of a dual entity, called the intermediate-level manager (ILM), between the manager and agents. An ILM accepts monitoring tasks described as scripts, which are delegated by the next-higher-level entity. Operations are SNMP-derived and retain the manager-agent model. Experiments conducted on a 1024-element testbed showed poor round-trip delays in a centralized configuration with more than 200 monitoring elements, and significant improvements (reducing delays from seconds to less than a tenth of a second) with the introduction of even two ILMs. Static distribution overlooks the dynamic nature of the MS-NCS environment and performs poorly over extended periods. A reconfiguration mechanism reusing SNMP primitives is devised, whereby logical connections among agents and ILMs are dynamically modified to change the number of nodes managed by each ILM.
A localized decision process determines the transformations required (merge, split, migrate) at each ILM based on current values of a local node status parameter, called temperature. The interactions between the MS elements and different classes of jobs are modeled as a queuing system and evaluated via simulation for different configuration scenarios. Results indicate that reconfiguration improves performance over static distribution by lowering processing delays at the ILMs.
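
The localized decision process described above can be illustrated with a minimal sketch. The abstract does not state how temperature is computed or which rules select a transformation, so everything specific here is an assumption made for illustration: the normalized temperature range, the HOT and COLD thresholds, the ILM record, and the decide function are hypothetical names and values, not SIMONE's actual mechanism.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the abstract gives no concrete values.
HOT = 0.8
COLD = 0.2


@dataclass
class ILM:
    """Illustrative record for an intermediate-level manager (assumed fields)."""
    name: str
    temperature: float  # local node status parameter, assumed normalized to [0, 1]
    agents: int         # number of nodes currently managed by this ILM


def decide(ilm, neighbor_temps):
    """Pick a transformation using only locally visible values.

    Returns one of 'split', 'migrate', 'merge', or 'none'.
    """
    coolest = min(neighbor_temps, default=None)
    if ilm.temperature >= HOT:
        # Overloaded: move agents to a cool neighbor if one exists,
        # otherwise split the load onto a new ILM.
        if coolest is not None and coolest <= COLD:
            return "migrate"
        return "split"
    if ilm.temperature <= COLD and coolest is not None and coolest <= COLD:
        # Two lightly loaded ILMs can be merged into one.
        return "merge"
    return "none"


if __name__ == "__main__":
    print(decide(ILM("ilm-1", 0.9, 300), [0.5]))
    print(decide(ILM("ilm-2", 0.1, 5), [0.15]))
```

In this sketch each ILM consults only its own temperature and those of neighboring ILMs, which mirrors the localized character of the decision; a real deployment would need the actual temperature definition and transformation rules from the full work.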