Making smart investments to reduce unplanned downtime

Reproduction of this publication in any form without prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable. GartnerGroup disclaims all warranties as to the accuracy, completeness or adequacy of such information. GartnerGroup shall have no liability for errors, omissions or inadequacies in the information contained herein or for interpretations thereof. The reader assumes sole responsibility for the selection of these materials to achieve its intended results. The opinions expressed herein are subject to change without notice.

Unplanned application downtime causes havoc and great expense. Conventional vendor wisdom focuses on redundancy to improve availability. Redundancy, however, solves just 20 percent of the problem. Based on extensive feedback from clients, we estimate that, on average, unplanned application downtime is caused (see Figure 1):

20 percent of the time by hardware (e.g., server and network), operating systems, environmental factors (e.g., heating, cooling and power failures) and disasters;

40 percent of the time by application failures, including "bugs," performance issues or changes to applications that cause problems (whether in the application code itself or in layered software on which the application depends); and

40 percent of the time by operator errors, including not performing a required operations task or performing a task incorrectly (e.g., changes made to infrastructure components that cause problems and unexpected downtime).

Thus, approximately 80 percent of unplanned downtime is caused by people and process issues, while the remainder is caused by technology failures and disasters. Improving availability requires a different strategy and set of investment choices for each of the three unplanned downtime categories.
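The arithmetic behind these estimates can be sketched in a few lines of Python. This is only an illustration: the three shares come from the estimates above, while the category names, the function names and the 100-hour total are hypothetical and chosen for the example.

```python
# Estimated share of unplanned downtime by cause, taken from the text above.
# The dictionary keys are illustrative labels, not terms from the source.
CAUSE_SHARE = {
    "technology_and_disasters": 0.20,
    "application_failures": 0.40,
    "operator_errors": 0.40,
}

# The two categories the text attributes to people and process issues.
PEOPLE_AND_PROCESS = ("application_failures", "operator_errors")


def split_downtime(total_hours: float) -> dict[str, float]:
    """Apportion a total of unplanned downtime hours across the three causes."""
    return {cause: total_hours * share for cause, share in CAUSE_SHARE.items()}


def people_and_process_share() -> float:
    """Combined share of downtime caused by people and process issues."""
    return sum(CAUSE_SHARE[cause] for cause in PEOPLE_AND_PROCESS)


if __name__ == "__main__":
    # Hypothetical total: 100 hours of unplanned downtime in some period.
    for cause, hours in split_downtime(100.0).items():
        print(f"{cause}: {hours:.1f} hours")
    print(f"people and process: {people_and_process_share():.0%}")
```

Applied to an enterprise's own downtime total, a split like this gives a rough first view of where the hours are likely to fall. Actual incident records, where available, would be a better basis for investment decisions than these averages.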
Technology Failures and Disasters: Although they account for just 20 percent of unplanned downtime, these failures can be catastrophic and result in a significant amount of downtime per incident (see Note 1). To mitigate this risk, enterprises should take the following steps.

Monitor components for availability, since identifying a failure is the first step toward resolving it. This is typically done with agents or sensors. Ideally, monitoring is predictive and warns the operator or vendor of potential failures before they occur.

Buy vendor service contracts to reduce time to repair. Many vendors offer time-to-repair commitments for increased fees.

Implement redundancy to ensure alternate processing capabilities in the event of a catastrophic failure. Data mirroring, clustering and diesel generators are examples of redundancies that limit downtime when failures occur. In comparing potential solutions, pay particular consideration …