Fault Management in Grid Using Multi-agents

Network faults are changeable in nature and, given their importance, cannot be ignored regardless of grid type and setup. Managing them is not easy, because it requires complete knowledge of the entire grid workflow. This paper proposes an agent-based sensor network that senses the workflow for grid network fault management. The agents are trained to regularize the flow of activities within the network. Because the grid is large and highly distributed, a cluster-based approach is adopted. Working nodes concentrate on the jobs assigned to them rather than on management activity, which is carried out by proxies created for that purpose. Because the proxies are configured per individual link, they reach consensus quickly during detection. Failure detection also runs in linear time with minimal bandwidth utilization. Timer agents determine the relationship between inter-cluster and intra-cluster timings, and this analysis makes detection more reliable. An agent repository is created that can serve as a sensor system for any kind of grid, supporting scalability and distribution.
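
The division of labour between working nodes and proxies can be illustrated with a minimal sketch. It is not the paper's algorithm: the heartbeat-and-timeout mechanism, the class name `ClusterProxy`, the node names, and the 3-second timeout are all assumptions chosen for illustration. The only properties taken from the abstract are that a proxy, not the working node, performs detection, and that a detection pass is linear in the number of nodes in the cluster.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterProxy:
    """Hypothetical proxy that monitors the working nodes of one cluster."""
    name: str
    timeout: float  # assumed: seconds of silence before a node is suspected
    last_seen: dict[str, float] = field(default_factory=dict)

    def heartbeat(self, node: str, now: float) -> None:
        # Record the most recent time this node reported in.
        self.last_seen[node] = now

    def suspected(self, now: float) -> list[str]:
        # Single pass over the known nodes, so the cost is linear
        # in the size of the cluster.
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]


proxy = ClusterProxy(name="cluster-a", timeout=3.0)
proxy.heartbeat("node-1", now=0.0)
proxy.heartbeat("node-2", now=0.0)
proxy.heartbeat("node-1", now=4.0)  # node-2 stays silent after t=0
failed = proxy.suspected(now=5.0)
print(failed)
```

Timestamps are passed in explicitly rather than read from a clock, which keeps the sketch deterministic; a real deployment would presumably derive the timeout from measured intra-cluster and inter-cluster timings, which is the role the abstract assigns to the timer agents.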
