Service Level Management Using IBM Tivoli Service Level Advisor and Tivoli Business Systems Manager
Integrate Tivoli Business Systems Manager and Tivoli Service Level Advisor
Map business service management to service level management
Achieve proactive service level management
Edson Manoel, Kimberly Cox, Eswara Kosaraju, Matt Roseblade, Alex Shafir, Venkat Surath, Eduardo Tanaka, Brian Watson
ibm.com/redbooks
International Technical Support Organization
Service Level Management Using IBM Tivoli Service Level Advisor and Tivoli Business Systems Manager
December 2004
SG24-6464-00
Note: Before using this information and the product it supports, read the information in Notices on page ix.
First Edition (December 2004)

This edition applies to IBM Tivoli Business Systems Manager V3.1, IBM Tivoli Service Level Advisor V2.1, IBM Tivoli Enterprise Console V3.9, and IBM Tivoli Monitoring for Transaction Performance V5.3 products.

Note: This book is based on a pre-GA version of a product and may not apply when the product becomes generally available. We recommend that you consult the product documentation or follow-on versions of this redbook for more current information.
Copyright International Business Machines Corporation 2004. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Notices
  Trademarks
Preface
  The team that wrote this redbook
  Become a published author
  Comments welcome

Part 1. Fundamentals

Chapter 1. Introduction to service level management
  1.1 Service level management overview
  1.2 Service level management benefits
  1.3 Service level management components
    1.3.1 Processes
    1.3.2 Documentation
    1.3.3 People
    1.3.4 Tools
  1.4 Business service management approach to service level management
    1.4.1 Convergence of business service management and service level management
  1.5 Improving service level management through integration
  1.6 Scope of this book

Chapter 2. General approach for implementing service level management
  2.1 A look at the ITIL process improvement model
  2.2 Planning for service level management implementation
    2.2.1 Identifying roles and responsibilities
    2.2.2 Understanding the services
    2.2.3 Assessing the ability to deliver
  2.3 Implementing service level management
    2.3.1 Developing service level objectives
    2.3.2 Negotiating on service level agreements
    2.3.3 Implementing service level management tools
    2.3.4 Establishing a reporting function
    2.3.5 Adjusting IT processes to include service level management
  2.4 Ongoing service level management program
    2.4.1 Maintenance of service definitions
    2.4.2 Service level agreement management via historical reporting
    2.4.3 Priority management of real-time faults
  2.5 Continuous improvement
    2.5.1 Improving quality of service levels
    2.5.2 Improving efficiency of service level management
    2.5.3 Improving effectiveness of service level management

Chapter 3. IBM Tivoli products that assist in service level management
  3.1 IBM Tivoli product mapping
    3.1.1 The monitoring and measurement layer
    3.1.2 The service level management layer
  3.2 IBM Tivoli Business Systems Manager
    3.2.1 Business goals
    3.2.2 High level description and main functions
    3.2.3 Benefits of using IBM Tivoli Business Systems Manager
    3.2.4 Key concepts in IBM Tivoli Business Systems Manager
    3.2.5 IBM Tivoli Business Systems Manager architecture
  3.3 IBM Tivoli Data Warehouse
    3.3.1 Business goals
    3.3.2 High level description and main functions
    3.3.3 Benefits of using Tivoli Data Warehouse
    3.3.4 Key concepts in Tivoli Data Warehouse
    3.3.5 Tivoli Data Warehouse architecture
  3.4 IBM Tivoli Service Level Advisor
    3.4.1 Business goals
    3.4.2 High level description and main functions
    3.4.3 Benefits of using IBM Tivoli Service Level Advisor
    3.4.4 Key concepts in IBM Tivoli Service Level Advisor
    3.4.5 IBM Tivoli Service Level Advisor architecture
  3.5 IBM Tivoli Monitoring for Transaction Performance
    3.5.1 Business goals
    3.5.2 High level description and main functions
    3.5.3 Benefits of using IBM Tivoli Monitoring for Transaction Performance
    3.5.4 Key concepts in IBM Tivoli Monitoring for Transaction Performance
    3.5.5 IBM Tivoli Monitoring for Transaction Performance architecture
  3.6 IBM Tivoli Enterprise Console
    3.6.1 Business goals
    3.6.2 High level description and main functions
    3.6.3 Benefits of using IBM Tivoli Enterprise Console
    3.6.4 Key concepts of event groups in IBM Tivoli Enterprise Console
    3.6.5 IBM Tivoli Enterprise Console architecture
  3.7 IBM Tivoli Monitoring
    3.7.1 Business goals
    3.7.2 High level description and main functions
    3.7.3 Benefits of using IBM Tivoli Monitoring
    3.7.4 Key concepts in IBM Tivoli Monitoring
    3.7.5 IBM Tivoli Monitoring architecture
  3.8 Bringing it all together in support of SLM processes
    3.8.1 Service definitions
    3.8.2 Real-time monitoring
    3.8.3 Historical monitoring
    3.8.4 Fault management
    3.8.5 SLA reporting and alerting
    3.8.6 Problem and change management

Chapter 4. Planning to implement service level management using Tivoli products
  4.1 Implementing SLM using Tivoli products
    4.1.1 Planning
    4.1.2 Implementation
    4.1.3 Ongoing SLM program
    4.1.4 Improvement process
  4.2 IBM Tivoli Business Systems Manager V3.1
    4.2.1 Propagation, alerts, and events
    4.2.2 Basic business system building
    4.2.3 Best practices for business system building
    4.2.4 IBM Tivoli Business Systems Manager business system types
    4.2.5 IBM Tivoli Business Systems Manager views in an SLM context
    4.2.6 IBM Tivoli Business Systems Manager roles in an SLM context
    4.2.7 Understanding your services
    4.2.8 Using IBM Tivoli Business Systems Manager 3.1 features for the benefit of SLM
    4.2.9 Using PBT and RLP to manage high availability scenarios
  4.3 Tivoli Data Warehouse V1.2
  4.4 IBM Tivoli Service Level Advisor V2.1
    4.4.1 Building SLAs in IBM Tivoli Service Level Advisor
    4.4.2 Supporting SLM with IBM Tivoli Service Level Advisor
    4.4.3 Realistic expectations for real-time SLAs
    4.4.4 Integrating IBM Tivoli Service Level Advisor with IBM Tivoli Business Systems Manager
  4.5 Additional products supporting SLM
    4.5.1 IBM Tivoli Monitoring for Transaction Performance
    4.5.2 IBM Tivoli Monitoring for Operating Systems
    4.5.3 IBM Tivoli Monitoring for Databases
    4.5.4 IBM Tivoli Monitoring for Web Infrastructure

Part 2. Case study scenarios

Chapter 5. Case study scenario: IRBTrade Company
  5.1 Background of the business and its current issues
    5.1.1 The business perspective
    5.1.2 The Information Technology perspective
  5.2 Existing IT infrastructure
    5.2.1 Systems environment
    5.2.2 Systems management
    5.2.3 Reporting
  5.3 A service level management solution
    5.3.1 Where we want to be
    5.3.2 Where we are now
    5.3.3 How we will get there
    5.3.4 How we will know we have arrived
  5.4 Implementation
    5.4.1 Additional instrumentation required
    5.4.2 Identifying the business service
    5.4.3 Identifying necessary users roles
    5.4.4 Required resource types
    5.4.5 Creating business systems based on business functions
    5.4.6 Defining executive dashboard views
    5.4.7 Agreeing to and defining service level objectives
    5.4.8 Identifying metrics
    5.4.9 Enabling data sources in IBM Tivoli Service Level Advisor
    5.4.10 Setting up schedules, realms, and customers
    5.4.11 Setting up offerings
    5.4.12 Setting up SLA in IBM Tivoli Service Level Advisor
  5.5 How the new solution works in practice
  5.6 Continuous improvement

Chapter 6. Case study scenario: Greebas Bank
  6.1 Background to the business and its current issues
    6.1.1 The business unit perspective
    6.1.2 IT management perspective
  6.2 Existing IT infrastructure
    6.2.1 Systems environment
    6.2.2 Systems management
    6.2.3 Existing service level management
    6.2.4 Business service management
  6.3 A service level management solution
    6.3.1 Where we want to be
    6.3.2 Where we are now
    6.3.3 How we will get there
    6.3.4 How we will know we have arrived
  6.4 Implementation
    6.4.1 Stage 1: Defining services
    6.4.2 Stage 2: Enhancing instrumentation
    6.4.3 Stage 3: Determining users and roles
    6.4.4 Stage 4: Determining IBM Tivoli Business Systems Manager resource types
    6.4.5 Stage 5: Creating IBM Tivoli Business Systems Manager business systems
    6.4.6 Stage 6: Creating IBM Tivoli Business Systems manager views
    6.4.7 Stage 7: Agreeing to service level agreement objectives
    6.4.8 Stage 8: Defining metrics
    6.4.9 Stage 9: Preparing for ETLs
    6.4.10 Stage 10: Preparing IBM Tivoli Service Level Advisor
    6.4.11 Stage 11: Creating offerings
    6.4.12 Stage 12: Creating SLAs and OLAs
    6.4.13 Stage 13: SLA reporting
  6.5 How the SLM solution works in practice
    6.5.1 Example 1: Component failure without loss of service
    6.5.2 Example 2: Component failure terminates a service
    6.5.3 Root cause analysis
    6.5.4 Assessing the SLM solution
  6.6 Continuous improvement

Part 3. Appendixes

Appendix A. Service management and the ITIL
  The ITIL
  Service management
    Service delivery
    Service support
  Service support disciplines
    Configuration management
    Service desk
    Incident management
    Problem management
    Change management
    Release management
  Service delivery disciplines
    Capacity management
    Availability management
    Financial management for IT services
    IT service continuity management
    Service level management
  Bringing it all together
    Organization
    Processes
    Tools
  Constant improvement is a must
    Planning
    Delivery
    Measurement
    Calibration
  The power of integration

Appendix B. Important concepts and terminology
  IBM Tivoli Service Level Advisor concepts
  IBM Tivoli Business Systems Manager concepts

Appendix C. Scripts and rules used in this book

Abbreviations and acronyms

Related publications
  IBM Redbooks
  Other publications
  Online resources
  How to get IBM Redbooks
  Help from IBM

Index
Notices
This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive Armonk, NY 10504-1785 U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrates programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to IBM's application programming interfaces.
Trademarks
The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both:
Eserver, DB2, Redbooks (logo), ibm.com, IBM, Redbooks, z/OS, IMS, Tivoli Enterprise, AIX, Lotus, Tivoli Enterprise Console, CICS, NetView, Tivoli, CICSPlex, OMEGAMON, TME, Database 2, OS/390, WebSphere, Domino, OS/400, DB2 Universal Database, Rational

The following terms are trademarks of other companies:
Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
Intel, Intel Inside (logos), MMX, and Pentium are trademarks of Intel Corporation in the United States, other countries, or both.
UNIX is a registered trademark of The Open Group in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, and service names may be trademarks or service marks of others.
Peregrine ServiceCenter is a trademark of Peregrine.
Preface
Traditional availability management focuses on managing the state of IT resources at a component level, without the context of the service required to support vital business functions. As IT organizations mature and focus more on meeting business objectives, they recognize the value of providing sustained levels of availability. They also improve service quality in a way that is consistent with business objectives and cost constraints.

Managing IT costs requires repeatable and measurable processes such as the best practices for service level management (SLM) documented in the IT Infrastructure Library (ITIL). Central to the ITIL best practices are the service management processes. These are subdivided into the core areas of service support (day-to-day operation and support) and service delivery (long-term planning and improvement).

This IBM Redbook takes a top-down approach that starts from the business requirement to improve service management. This includes the need to align IT services with the needs of the business, to improve the quality of the IT services delivered, and to reduce the long-term cost of service provision. It focuses on how clients accomplish this by implementing SLM processes supported by IBM Tivoli Service Level Advisor and IBM Tivoli Business Systems Manager.

The approach used in this book leverages Tivoli and non-Tivoli monitoring sources. IBM Tivoli Monitoring for Transaction Performance, IBM Tivoli Monitoring, and various IBM Tivoli Monitoring PACs, along with Peregrine ServiceCenter, serve as interface points to provide the end-user perspective of service delivery.

IT managers and technical staff who are responsible for providing services to their customers can use this IBM Redbook as a practical guide to SLM with IBM Tivoli products. It takes you from a general outline of SLM to specific implementation examples in banking and trading that incorporate the Tivoli monitoring products.
The key elements that are addressed in this redbook are:
- Organizational considerations for implementing the ITIL processes
- Identifying which services or business functions will be used for the initial deployment
- Determining the metrics and monitoring sources required for operational and service level agreement (SLA) definition and evaluation, including business schedules and maintenance periods
- Leveraging IBM Tivoli Business Systems Manager for configuration and availability management of services and Peregrine ServiceCenter for the service desk at a component level for SLAs, as well as for managing service incidents in real time
- The value of understanding the impact of end-user response time on service delivery
- Managing end-to-end services that include mainframe and distributed components
- Improving service delivery with proactive service management using predictive analysis and operational status alerts
- Providing ongoing executive-level status and on-demand reporting
- The next steps for expanding the deployment using the ITIL continuous improvement process approach
- Overall business value attained through the implementation of these processes and tools
Matt Roseblade is a services consultant with the PAN-EMEA Services for Tivoli Software based in the United Kingdom (UK). He has worked for IBM for nine years and has four years of experience in working with IBM Tivoli Business Systems Manager on engagements throughout Europe. Prior to working for IBM Software Group, Matt worked for IGS SSO leading a team responsible for the systems management of IBM and outsourced z/OS systems across EMEA. During his 14 years in IT, Matt has acquired 12 years of experience in system management disciplines on the mainframe.

Alex Shafir is an advisory software engineer with the IBM Tivoli Software Group in Research Triangle Park, North Carolina. He has been working with IBM Tivoli Business Systems Manager since 1997 and joined IBM in 2000. He has over 30 years of IT experience in both technical and management positions. He has been involved in SLM, capacity planning, and performance management since 1984. He holds a master's degree in electrical engineering from Polytechnical Institute, Riga, Latvia.

Venkat Surath is a senior IT specialist, as well as an IBM Certified IT Specialist, and part of IBM Software Services for Tivoli Americas. He holds a master's degree in computer science from Illinois Institute of Technology, Chicago. Upon graduation, he joined the Communications Products Division, IBM Research Triangle Park, NC, in 1983 as a software engineer developing network management software. In 1997, he joined Tivoli Services North America, where he provides Tivoli Business Systems Management services. His areas of expertise include IBM Tivoli Business Systems Manager (Distributed) and Tivoli Monitoring for Transaction Performance.

Eduardo Tanaka is a software engineer for the IBM Software Group, Tivoli Division in Research Triangle Park, North Carolina. He worked nine years in UNIX server hardware and software development and management for a Brazilian company. Then, in 1990, he joined IBM where he served as the development, function and system test team leader for various system and network management products. He holds a degree in electronic engineering from the Instituto Tecnologico de Aeronautica in Brazil.

Brian Watson is a consulting IT specialist from Tivoli Services, EMEA North Region, IBM Software Group. He has worked for IBM for over three years, has over 25 years of IT experience in both public and private sectors, and specializes in systems management. He was one of the first people to be ITIL certified in 1995, and has successfully completed many large and complex systems management projects including implementations of IBM Tivoli Business Systems Manager.
Front row (left to right): Matt Roseblade, Kimberly Cox, and Venkat Surath; back row: Edson Manoel, Eswara Kosaraju, Eduardo Tanaka, Alex Shafir, and Brian Watson
Thanks to the following people for their contributions to this project:

Peer van Beljouw, Ruth van Ouwerkerk
ABN AMRO Bank, Netherlands

Budi Darmawan, Morten Moeller
ITSO, Austin Center

Rosalind Radcliffe
BSM Integration Architect, IBM Software Group, Raleigh

Eduardo Patrocinio
Tivoli SWAT Team, IBM Software Group, Raleigh

Jayne T. Regan
Service Level Advisor Development Manager, IBM Software Group, Raleigh

Michael D. Tabron
Tivoli Service Level Advisor Interaction Designer, IBM Software Group, Raleigh

Joe Belna, Shawn Clymer, Subhayu Chatterjee
TSLA Development team, IBM Software Group, Raleigh
Gareth Holl
TSLA L2 Support, IBM Software Group, Raleigh

Tom Odefey
TBSM SVT Specialist, IBM Software Group, Raleigh

Tony Bhe
ITM SVT Specialist, IBM Software Group, Raleigh

Jon O. Austin, John Irwin, Yoichiro Ishii
Tivoli Customer Programs, IBM Software Group, Raleigh
Comments welcome
Your comments are important to us! We want our Redbooks to be as helpful as possible. Send us your comments about this or other Redbooks in one of the following ways:
- Use the online Contact us review redbook form found at:
  ibm.com/redbooks
- Mail your comments to:
  IBM Corporation, International Technical Support Organization
  Dept. JN9B Building 003 Internal Zip 2834
  11400 Burnet Road
  Austin, Texas 78758-3493
Part 1. Fundamentals
This part includes the following chapters:
- Chapter 1, Introduction to service level management on page 3
- Chapter 2, General approach for implementing service level management on page 23
- Chapter 3, IBM Tivoli products that assist in service level management on page 53
- Chapter 4, Planning to implement service level management using Tivoli products on page 109
Chapter 1. Introduction to service level management
According to the highly popular, process-based IT Infrastructure Library (ITIL) methodology, SLM is the process of negotiating, documenting, agreeing on, and reviewing business service requirements and targets, within service level requirements and agreements between service providers and their customers. These relate to the measurement, monitoring, reporting, reviewing, and continuous improvement of service quality as delivered by the IT organization to the business. ITIL's methodology provides two models for IT activities: service delivery and service support.
Service delivery
SLM, along with availability management, capacity management, IT service continuity management, and financial management for IT services, comprises the service delivery model. The primary role of this model is to offer a proactive process of planning and management of service according to the plan.
Service support
The service support model includes incident management, problem management, change management, release management, and configuration management. The primary role of this model is to offer operational implementation and monitoring of service according to the plan. Figure 1-1 shows how the service delivery and service support models fit in the ITIL roadmap for service management.
Figure 1-1 labels: Service Delivery (providing IT services cost-effectively); Service Support (providing IT services support and maintenance); Applications Management; Security Management; IT Infrastructure Management
According to the ITIL, SLM relates to the other aforementioned disciplines as follows:
- Supported by availability management, IT service continuity management, capacity management, problem management, and configuration management
- Provides information to incident management and change management
- Monitored via financial management for IT services, incident management, capacity management, and availability management
- Supports application management, business processes, and event management

SLM is a disciplined, proactive methodology. Its procedures ensure that adequate levels of service are delivered to all IT users in accordance with business priorities and at an acceptable cost. Service levels typically are defined in terms of availability, responsiveness, integrity, and security delivered to users of the service.
ITIL on page 447, for a definition of quality of services and how it is perceived by users and customers of IT services. Both an IT organization, as a seller, and a business unit, as a buyer, need a contract that clearly defines both the capabilities and limitations of this process. For reasons of customer satisfaction and cost control, the product must meet the specifications of this contract.
Goals
The goals of SLM are:
- Understand and meet the requirements of customers and end users
- Use resources efficiently and effectively, and provide value for money
- Improve continuously through a process of learning and growth
- Use internal processes to generate added value for customers and survive
- Establish a business-like relationship between the customer and supplier
Challenges
The challenges of SLM are:
- Divergent views of business and IT organizations
- Diversity of organization business areas
- Changing the mind set from products and systems to services
- Perception of IT (historically not always good)
- Unknown components, dependencies, and ownership
- Poor quality management information and metrics
- Unable to justify investment or assess risk
- No measure of proof of improvement
- Coping with infrastructure complexity
- Providing consistent and stable services
Faced with many constraints, an IT organization wants recognition for providing good services based on component-centric measurement metrics. At the same time, business units feel that they are paying for a service but cannot perform their work, and they do not trust an IT organization that always reports good service. SLM offers an evolution in measuring IT effectiveness by moving from component-based evaluation of service to service-based management. Figure 1-2 illustrates a situation where the reduction of component downtime reported by the IT organization does not improve customer satisfaction because the damage has already been done. It emphasizes the fact that business units and IT organizations have different views of the customer perception of the quality of the services provided.
Figure 1-2 IT and business views often differ. Labels: business manager (business measurements of customer impact); IT manager (IT measurements of IT components downtime); outages; time.
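One way the two views can diverge is sketched below in Python. The sketch assumes a hypothetical service that is unavailable whenever any one of its components is down; the component names and outage intervals are invented for illustration and do not come from this book. It merges component outages into service outages, so that each component can report modest downtime while users experience considerably more.

```python
def merge_outages(intervals):
    """Merge overlapping (start, end) outage intervals, in minutes."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous outage: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def downtime(intervals):
    """Total minutes covered by the merged outage intervals."""
    return sum(end - start for start, end in merge_outages(intervals))


# Hypothetical outages per component over one day (minutes since midnight).
component_outages = {
    "database": [(60, 90)],
    "web_server": [(80, 100), (600, 620)],
    "network": [(900, 930)],
}

# Component-centric view: each component is measured on its own.
per_component = {name: downtime(o) for name, o in component_outages.items()}

# Service-centric view (assumption: any component outage interrupts the service).
all_outages = [i for o in component_outages.values() for i in o]
service_downtime = downtime(all_outages)
```

In this invented data set no single component is down for more than 40 minutes, yet the service as users see it is down for 90 minutes.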
When used correctly, SLM helps an IT organization to deploy resources fairly, defend itself from user attacks, and advertise good service.
How can SLM help IT to deploy resources fairly?

Client satisfaction
SLM requires IT management to initiate a dialog with business units to understand the requirements for service. It also forces business units to clearly state their requirements and expectations. Improved client satisfaction is the main benefit of SLM, which achieves it through negotiated SLAs, established benchmarks for service measurement, and continuing dialog through reporting and reviews.

Managing expectations
SLM makes it possible to avoid expectation creep, in which IT clients' undocumented expectations keep rising. Undocumented user requirements and expectation levels usually lead to expectations staying ahead of the service that is being delivered. SLAs document negotiated requirements and establish expectations. They also serve as brakes when users want higher levels of service than IT committed to deliver.

Resource regulations
SLM provides a mechanism for governing IT resources. It allows IT to reject resource demands from applications that unfairly tie up resources and, therefore, to regulate workload based on business priorities. SLM helps to avoid capacity problems by providing early warning of SLAs being violated. Additional equipment might be required to support IT commitments.

Cost control
SLM helps IT to determine, through dialog with users, the level of service required and the capacity and staffing it needs to provide that service. SLM can demonstrate that desirable service is not always affordable and can affect costs by moderating user demands for higher levels of service. It allows IT to explain the financial impact of higher levels of service and to avoid unnecessary cost by forcing users to justify the additional cost.

SLM helps to change relationships between business units and IT from a negative acceptance of IT as a necessary evil to viewing IT as an asset in executing their mission. When clear service objectives are documented and negotiated measurement reporting is in place, IT has the means to manage its resources as well as user dissatisfaction.
Benefits
In summary, the benefits of SLM are:
- IT service designed to meet agreed requirements
- Clearly defined roles (activities, responsibilities, and authority)
- Measurable, realistic SLAs for improved customer and supplier relationships
- Balances service requirements against the costs
- Reduces risk of unpredictable demand and capacity problems
- Helps identify service weaknesses
- Allows underpinning of supplier management
- Provides basis for charging and measuring value
- Establishes an improvement baseline
1.3.1 Processes
The functions in SLM can be divided as follows:

Identify users' expectations and define parameters for service.
Ideally, IT must identify all of the business processes that must be managed. In practice, it is acceptable to select the critical business processes during the first stages of the SLM process implementation and then incorporate additional business processes as the SLM process matures. The IT organization can work with business owners to pinpoint the elements of these business processes. They can define service parameters such as end-user expectations of service, participating IT application and infrastructure components, and metrics for measuring service levels.

Assess service capabilities and negotiate service agreements.
First, an IT organization must have a clear understanding of service expectations, composition of service elements, and service level measurement metrics. Then it must collect data and assess its current capabilities for meeting a customer's expectations of service levels. After studying current capabilities for delivering all services required and identifying opportunities for improvement, IT management is ready to talk with customers about the service levels that it can provide. IT should avoid technical terminology and describe services and expectations in a manner that is understandable to its customers. At the same time, IT should fully understand what service levels it can deliver and achieve agreement from its customers on service level measurement and reporting criteria. IT must document negotiated expectations and measurement metrics as well as the agreed-upon acceptable service level values.

Manage to meet service level objectives (SLOs).
IT must align its processes to proactively monitor, measure, and manage against negotiated SLAs. Accordingly, IT must develop SLOs to meet SLA obligations for underlying IT components, measure actual values against SLOs, and associate the measured status with the SLAs. Upon recognition of service level degradation (preferably through real-time alerts), IT can immediately start finding the problem and restoring service to acceptable levels as defined by SLAs. If the problem is serious, IT may also notify users so they can avoid the affected services and calls to the help desk. Operational level agreements (OLAs), which relate to IT operations and support, help IT recognize component issues quickly and evaluate their measurements before they affect SLAs and IT customers. IT must come up with monitoring processes, measurement metrics, and automation that allow prompt responses to problems by technical staff in addition to reporting OLA status to management.

SLM uses reporting to communicate overall service level performance to IT and business management. Effective reporting should show IT performance against service-level commitments (successes and failures). It can be used together with financial incentives to improve IT processes and user behavior.

Continue service refinement and improvement.
The SLM process should always be examined for process effectiveness, service changes, and reporting accuracy. Customer expectations change as business processes grow and new applications and users are added. As monitoring technology improves, IT can expand metrics that measure component performance and customer satisfaction. IT must periodically re-evaluate the services it provides. Service improvement is a continuous process that allows IT to add more value, adjust to new realities, justify new technology, and often derive more revenue. The same can be said about the SLM process, which needs continuous improvement to gain the trust of business owners and to improve efficiency through automation and effectiveness through a better understanding of business-to-IT relationships.

Figure 1-3 illustrates the SLM functions.
Figure 1-3 labels: Negotiate SLAs
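The idea of evaluating component measurements against internal targets before they affect the SLA can be sketched as follows. This is a minimal illustration in Python, not the behavior of any product described in this book; the `Threshold` class, the limit values, and the status names are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    """Hypothetical response-time limits, in seconds, for one component."""
    ola_limit: float  # tighter internal target agreed in the OLA
    sla_limit: float  # customer-facing limit from the SLA


def classify(measured: float, t: Threshold) -> str:
    """Return 'ok', 'ola_warning', or 'sla_breach' for a measured value."""
    if measured > t.sla_limit:
        return "sla_breach"
    if measured > t.ola_limit:
        # Internal target missed, but the customer commitment still holds.
        return "ola_warning"
    return "ok"


limits = Threshold(ola_limit=2.0, sla_limit=3.0)
statuses = [classify(value, limits) for value in (1.5, 2.4, 3.2)]
```

Keeping the OLA limit tighter than the SLA limit gives technical staff a window in which to respond before customers are affected.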
1.3.2 Documentation
Because SLM relies on several parties involved in defining the processes, negotiations, penalties, and so on, documentation is a must. The following documents support SLM:

Service level agreements
An SLA is an agreement between business units (the customer) and the IT organization (the service provider). It describes the service and service level measurement metrics, defines the approval and reporting process, and identifies the primary users. It can also include financial terms and conditions. SLAs provide a mechanism for establishing accountability for both IT and its customers for the provided service levels, which are negotiated and agreed to based upon business requirements, priority, and cost. SLA measurements must be directly aligned with customer expectations. SLAs are the basis for service level evaluation and improvement processes that include periodic reviews and adjustments if needed.

Operational level agreements
An operational level agreement (OLA) is an internal agreement that should be established between all business and IT groups prior to the execution of an SLA. The OLA establishes specific requirements that each IT group needs to meet in support of service levels and makes each group accountable for its contribution to the overall improvement of service levels. Well-defined OLAs show IT management which areas have more impact on service levels, where to focus attention and financial rewards, and how each group can contribute if business requirements require a change of SLAs.

Underpinning contract
IT should establish underpinning contracts (UCs) for any service provided by external service providers and vendors. UCs add accountability for the external components of service levels in the same way as OLAs account for the internal components of service levels. IT can use the contractual agreements that it has with its third-party vendors and feed the pertinent data into the SLM process. As service levels need to be changed, IT may need to re-negotiate external contracts with vendors and modify the UCs. Figure 1-4 illustrates the flow of customer, internal, and external contracts.

Service catalog
The service catalog provides a place to document all services provided to the customers and to record such details as key features, components, charges, and dependencies for each service.
Figure 1-4 labels: Customers; SLA; IT Infrastructure; OLA (internal organization); Underpinning Contracts (external organizations)
Service level objectives
SLOs define the service levels that have been agreed to by the parties that negotiated the SLAs and that need to be monitored and reported. They include one or more service level indicators (SLIs) presented in the business context. The SLO defines the component of service and how it is being measured. SLIs determine measurement metrics for SLM quantification. SLIs should reflect the user perspective, such as pain points and priorities, service availability, and responsiveness. For example, the most common SLOs are availability and performance. A service availability SLO may include an SLI measured as the percentage of time that the service was in an available state. A performance SLO may include two SLIs: service responsiveness (response time) and completed work (number of transactions). An IT organization must use monitoring for measuring the actual results of SLIs and reporting for communicating these results to business and IT managers. The format, details, and period vary depending on the recipients of reports. SLM can also include real-time information, alerting IT when results approach or breach the service levels guaranteed by SLAs.
Service improvement program
SLM is a continuous process that includes service level improvement and SLM improvement activities. IT should never be satisfied with the current level of service even if it satisfies its obligations to customers. IT should develop a service improvement program and document a service quality plan. This plan should include how to maintain awareness of changing business objectives, cost-effectively add new technology, improve daily operations, and expand SLIs and reporting to match user perception of service as much as possible.
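The availability and performance SLIs described under service level objectives can be computed from monitoring samples along the following lines. This Python sketch is illustrative only: the `Sample` record, the sample values, and the SLO targets are assumptions, and a real deployment would take its data from whichever monitoring tools are chosen.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    """One hypothetical monitoring sample for a service."""
    available: bool
    response_time_s: float  # ignored when the service is unavailable


def availability_pct(samples):
    """SLI: percentage of samples in which the service was available."""
    up = sum(1 for s in samples if s.available)
    return 100.0 * up / len(samples)


def mean_response_time(samples):
    """SLI: mean response time over samples where the service was up."""
    return mean(s.response_time_s for s in samples if s.available)


def evaluate(samples, min_availability_pct, max_response_time_s):
    """Compare measured SLIs with SLO targets (the targets are assumptions)."""
    avail = availability_pct(samples)
    resp = mean_response_time(samples)
    return {
        "availability_pct": avail,
        "availability_met": avail >= min_availability_pct,
        "mean_response_time_s": resp,
        "response_time_met": resp <= max_response_time_s,
    }


samples = [
    Sample(True, 1.0),
    Sample(True, 2.0),
    Sample(False, 0.0),
    Sample(True, 3.0),
]
report = evaluate(samples, min_availability_pct=99.0, max_response_time_s=2.5)
```

The sketch assumes at least one sample in which the service was available; with none, `statistics.mean` raises an error, which a production report would need to handle.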
1.3.3 People
The SLM process requires the involvement of people at various levels within business and IT organizations. The request for service improvements often starts with the head of a business unit or a senior executive who begins demanding more consistent service and accountability from IT. IT management may respond with tactical improvements but may be forced to implement the SLM program. SLM is a collaborative effort. Its implementation includes a number of people in dedicated or supporting roles. Responsibility for overall management of the SLM program is most likely to be assigned to a senior IT executive. IT may also assign a dedicated project manager and a dedicated service level manager. The project manager is responsible for implementing the SLM project. A service level manager is active throughout the entire implementation phase as well as after the phase. This person also coordinates ongoing management and improvement programs. In their effort, both the project manager and the service level manager need support from line managers of IT and business groups. The SLM team must include representatives from both business units and IT service delivery and may require some assistance from consultants. However, SLM is primarily an IT effort as it is IT who must handle the technical aspects of the SLM implementation, deployment, and operation. The SLM program must have an executive sponsor who provides funding for the program and is ultimately responsible for the success of the SLM program. For more details about the roles and responsibilities of the people involved in implementing SLM, see 2.2.1, Identifying roles and responsibilities on page 26.
1.3.4 Tools
While developing the SLM plan, the IT organization must choose tools to enable the SLM process that is being developed. Depending on the selected measurement metrics and the service composition of related IT resources, these
tools support monitoring of the chosen service indicators and user experiences. They also provide analytical capabilities and aggregation for reporting. In addition, IT must organize the collected data and make it accessible to everybody with a stake in the SLM process. Analytics and reporting must present this data in a manner that aligns the service views of both IT and its customers, allowing them to reconcile the customers' perception of service with the service levels delivered by IT.

IT wants to understand how resource performance and availability affect service levels and what adjustments are needed to improve service. Customers want to make sure that IT delivers availability and responsiveness to the critical applications that they use for automating their business processes. When their business process is impacted, they want IT to accurately report it so they can impose the negotiated penalties on IT.

SLM is a hot topic, and many companies have made claims that their products provide SLM solutions. Some products are specifically designed for SLM. Others offer only aspects of monitoring capabilities but still market their products as SLM solutions. When implementing SLM, IT should choose the following tools to meet its design specifications:
- Monitoring tools to provide the measurement metrics that need to be collected
- Reporting tools that process the data being captured and satisfy all levels of report recipients
- Analytical tools that provide aggregation and analysis of the collected SLM data in a manner that offers fast recognition of business impact and proactive response
- Administration tools that improve the productivity of SLM operators and users as well as provide the integration of monitoring, reporting, and analytical tools

This book introduces solutions provided by IBM, which include a wide range of products that can monitor a variety of distributed and mainframe servers, databases, transactions, networks, Web servers, and end-user experiences.
In addition, IBM offers analytical products in the SLM space that provide a real-time integrated event console, event correlation, business service management (BSM), and proactive SLM. All these products accept data from the majority of today's monitoring products.
delivery health and business impact of IT based on performance and availability of IT resources. The visualization of BSM runs on federated event and monitoring data as well as business and IT relationship data.

The four aspects of BSM are:
- Identifying the components of a business system
- Measuring the performance and availability of those components
- Ensuring that the components are performing within SLOs
- Alerting on any deviation or potential deviation from SLOs

The concepts behind BSM include:
- Resources are components of the IT infrastructure.
- A business transaction is a group of IT resources supporting a particular IT workload.
- A business system is a group of resources that supports a business goal.
- A business process is composed of some automated (IT services based) and some manual steps.

When policy data or service level information is attached to a business system, it turns into an IT service. An IT service can be perceived as a collection of IT resources that make up the automated part of the business process.
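The concepts above can be pictured as a small data structure. The following Python sketch is an illustration under stated assumptions, not the data model of IBM Tivoli Business Systems Manager: the class names, the three status values, and the rule that a business system takes the worst status found among its resources and nested business systems are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical status values, ordered from best to worst.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


@dataclass
class Resource:
    """A component of the IT infrastructure."""
    name: str
    status: str = "ok"


@dataclass
class BusinessSystem:
    """A group of resources, and optionally other business systems,
    that supports a business goal."""
    name: str
    resources: list = field(default_factory=list)
    subsystems: list = field(default_factory=list)

    def status(self) -> str:
        """Worst status among own resources and nested business systems."""
        statuses = [r.status for r in self.resources]
        statuses += [s.status() for s in self.subsystems]
        return max(statuses, key=SEVERITY.__getitem__, default="ok")


database = Resource("database")
web_server = Resource("web server", status="warning")
balances = BusinessSystem("check balances", resources=[database, web_server])
banking = BusinessSystem("online banking", subsystems=[balances])
```

Here `banking.status()` reflects the degraded web server two levels down, which is the kind of business-oriented view of resource state that the text describes.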
- Get service violation and trend alerts for any deviation or potential deviation from the SLO
- Ensure that services are performing within the SLO

The Business/IT knowledge base provides the foundation for BSM and SLAs. In reality, BSM allows IT to decompose business processes into IT systems and document the negotiated service levels in SLAs to be managed by BSM via monitoring and analytics organized by business systems. BSM accepts data from a variety of performance and event data sources that monitor IT resources. The BSM analytics then consume this data to determine the status of business systems and understand the business impact. Figure 1-5 demonstrates that business systems are a cornerstone for establishing service levels and managing IT resources based on business objectives for IT services.
Figure 1-5 Business system organizes IT resources and other business systems. Labels: Underpinning Contracts; Historical Reporting; SLA; OLA; Business Systems; Service; IT Services (databases, web servers, banking application, application support, development); The Technology.
A successful SLM program that aims to solve user perception issues should establish a common understanding between business units and an IT organization on service delivery and quality of service measurements. As outlined earlier, the BSM approach to SLM helps this effort by collecting business knowledge and exposing the use of resources by services. This makes SLA contracts and measurement metrics more meaningful to both IT and business units.
Figure labels: Service Management; Service Evaluation; Service Composition; Service Delivery; Business Process; Business Knowledge; Applications; Infrastructure; The Business; Information Technology; Requirements
Large enterprise IT environments deploy many system management products to operate their diverse resources. It is difficult to integrate data from such a variety of data sources into the SLM process. BSM solutions meet this challenge by accepting data from all major monitoring vendors. BSM then integrates this data by supplying business analytics and automation that allow IT to define and manage services throughout the life cycle of SLM. Armed with business knowledge and negotiated service composition and measurement metrics, an IT organization can design its business system management, SLM, and monitoring processes to measure quality of service that correlates with user perception. To improve acceptance, IT must continue to
refine the service composition and measurement metrics until they become transparent to business units.
Chapter 2. General approach for implementing service level management
facilitate commitment during the entire SLM planning and implementation cycle by continually motivating the change and leading by example. This chapter describes a generic approach (Figure 2-1) for implementing SLM after the decision to do so has been made. This methodology starts with a planning phase, continues on to implementation, and concludes with ongoing management and improvement of the overall process. It follows the IT Infrastructure Library (ITIL) process improvement model.
Figure 2-1 labels: Planning; Established decision to implement SLM; Implementation; Develop service level objectives (describe services, determine service level indicators, determine metrics to be used); Improvement Process (improving quality of service levels, improving efficiency of SLM, improving effectiveness of SLM)
Chapter 1, Introduction to service level management on page 3, introduces the four key components of SLM: people, processes, documentation and tools. This chapter identifies and discusses each of these components in more detail.
at this time. These topics are also addressed in 2.3, Implementing service level management on page 35.
facilitate regular deployment checkpoint meetings. This ensures that everyone has a consistent level of information throughout the deployment. Choosing the correct people is critical. Whoever is chosen must represent the views of the decision makers from both IT and business organizations and have the final word on the SLM implementation plan. The SLM deployment team should include people from the areas shown in Figure 2-2.
Figure 2-2 labels: Business Representatives; IT Representatives
The following sections summarize the responsibilities for the key participants.
Executive sponsor
The executive sponsor is typically the head of the line of business and is responsible for delivery of business services to end users. This person understands the overall picture of the business process and can state the purpose of the business. This person has the ultimate go or no-go authority for the project and is the final arbiter for problems and disagreements.
Project manager
Implementation of SLM is a large-scale project and should be treated as one. Appoint a qualified, full-time project manager to work closely with the service level manager and other people involved in the project to incorporate the SLM activities into a project plan.
Business representatives
The primary responsibility for this role is to explain the overall and component-wise picture of the business. Business services may include a number of services that require IT support. Therefore, performance of business owners depends on IT performance. Business owners understand their service well but may not understand what comprises an IT service. In large environments, this can be several people, one for each operational unit. A secondary responsibility for this role is to keep the SLM implementation business-oriented.
IT representatives
There are many responsibilities for this role, and they are typically fulfilled by more than one person. The responsibilities include:
- Providing systems management information such as hardware and operating systems, network infrastructure, application monitoring tools, and so on
- Describing the IT components of the business service
- Providing information about the day-to-day operation of the business components
- Providing feedback from customers to the overall SLM implementation process
  This is typically the service desk or customer support group with a primary line of communication to the service users.
- Providing the business impact of problem and change management
- Taking on the role of technical lead for the tools used in an SLM implementation
  This group should have or be ready to learn the skills required to deploy the actual tools to be used, as described in 2.3.3, Implementing service level management tools on page 38.
have moderated discussions with multiple people so that information and expectations can be level set among the business and IT participants.
Defining services
For the purpose of this redbook, a service is defined as a logical grouping of IT systems and applications that together deliver one or more functions to one or more users. From the IT perspective, it is a set of applications that serve a specific business objective, with each application comprising components made of IT resources. From the business perspective, a service is the mapping of IT resources to business processes. According to the ITIL, a service is the IT system or systems that enable customers and users to implement business processes. For more information about the ITIL definition, see the SLM chapter in the ITIL Service Delivery book. This chapter also introduces and encourages the use of a service catalog.

Note: It is possible for a service to be made up of other services. For example, online banking can be a service that is made up of services for checking balances, depositing funds, withdrawing funds, and so on.

A high-level example definition of a service is as simple as this:
- My service is online banking.
- My service is a travel reservation system.
- My service is a payroll system.

To complete the definition of the service, you must now have an understanding of the underlying IT components that make up the service. Typically, a component represents a machine or an application with multiple event sources mapping to it. It is important to know what applications make up the components and how these applications relate to other applications, including dependencies.

The following list provides suggestions to assist in defining the business service:

Business information
- List the functions provided by the service. You may have to speak about applications if the concept of service is unfamiliar.
- Describe the relationships between the functions.
- Provide a schematic that describes how each function is integrated to create the service. The schematic may include a business flow diagram.

Technical information
- Name the applications or components that deliver the service.
- State the purpose of each application or component.
- Describe the relationships between the applications or components.
- Provide a schematic that describes how each application is integrated to create the service. The schematic may include a data flow diagram. The relationships may also be described in an architecture document.

Table 2-1 provides a useful template for keeping track of components and relationships between components.
Table 2-1 Business service component relationships Business component examples Application server Operating system server Network device Depends on Impact Comment
Application A
This application provides <...> to the business service. The operating system is the platform for applications A, B, and C.
The following list provides suggestions to assist in establishing the initial perception of service: Usage information Number of users of the service If applicable, a breakdown of function usage by company employees, business partners, the general public, etc. Patterns or hours of usage, including peak times How users access the service (Internet, intranet, extranet, legacy 3270 screens, etc.) The deficient and favorable points of current IT service delivery and how they are communicated to the IT organization The challenges faced by the business, including what is on the horizon by way of new or updated services Current issues with the business service functions Table 2-2 provides a useful template for keeping track of usage information.
Table 2-2 Business service usage and perception

Feature        Time of day   Number of users   Method of access or type of user   Perception
TransactionA   Morning       <num>             Intranet                           Good
TransactionB   Noon          <num>             Internet                           Slow
TransactionC   Evening       <num>             <method>                           Poor
TransactionD   Midnight      <num>             <method>                           Excellent
Record these expectations, so that you can address them during the assessment phase. Depending on the expectations for quality of service, you can expect changes and improvements to the existing IT infrastructure. Define quality of service objectives that make sense, are measurable, and are achievable. This helps to define the success criteria of the entire SLM implementation.
management processes can report on the performance and throughput achievements for SLA evaluation.
IT wants to understand how resource performance and availability affect service levels and what adjustments are needed to improve service. Customers want to make sure that IT delivers availability and responsiveness to the critical applications that they use for automating their business processes. When their business process is impacted, they want IT to accurately report it so they can impose the negotiated penalties on IT. Define a high-level design that provides an assessment of the existing monitoring capabilities as well as additional monitoring tools and processes. This forms a baseline for measurement of expected quality of services. Important: Do not include anything in an SLA unless you can effectively monitor and measure it at a commonly agreed point.
Achieving, or even approaching, the desirable level may require additional investment and may need to be addressed by a service improvement program. The negotiation stage is likely to be iterative.

SLOs are specifications of a metric that is associated with a guaranteed level of service that is defined in an SLA. The metrics by which SLOs are defined are often called service level indicators (SLIs). From a business perspective, the most important objective is the availability and responsiveness of the service that IT provides to the business. Typically, IT responds to these business requirements by quantifying availability and performance:
- Availability: The percentage of the evaluation period when the service was in an available state
- Performance: Usually represented by two SLIs, such as responsiveness (speed) and throughput (volume)

Additional SLOs may include accuracy (whether the service does what it is supposed to do), cost, security, number of incidents, time-to-repair, etc. SLOs must meet the following criteria before you can include them in SLAs:
- Attainable: The objective is worthless if IT will never be able to meet it.
- Measurable: The objective is worthless if it cannot be measured.
- Understandable: Reported statistics must relate to the user experience.
- Meaningful: The objective must be relevant to all parties.
- Controllable: Do not include objectives that cannot be controlled.
- Affordable: The objective may require additional funding that sponsors are not willing to provide. Additional budget allocation is a business-level decision.
- Mutually acceptable: One party cannot simply dictate the terms of the agreement.

When developing an SLO, an IT organization needs to carefully select measurement metrics that are indicative of this SLO. For example, measuring availability from a user's perspective is not a simple task. If an application is up and running, it does not mean that users can use it.
If IT measures the availability of resources, it does not guarantee that this represents the actual user experience. There is no perfect solution to this problem. Nevertheless an IT organization must use SLIs that can be directly measured. SLAs must document each chosen SLI that will represent each of the SLOs and specify its data source.
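The availability SLI defined above (the percentage of the evaluation period when the service was available) can be illustrated with a small calculation. The following is a minimal sketch, not part of any Tivoli product: the `Outage` record, the function names, the 30-day period, and the 99.5% target are all hypothetical, and outages are assumed not to overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outage:
    """One interval (minutes from the start of the period) when the service was down."""
    start_min: float
    end_min: float


def availability_percent(period_min: float, outages: list[Outage]) -> float:
    """Percentage of the evaluation period during which the service was available.

    Assumes the outages do not overlap and lie inside the period.
    """
    if period_min <= 0:
        raise ValueError("evaluation period must be positive")
    downtime = sum(o.end_min - o.start_min for o in outages)
    return 100.0 * (period_min - downtime) / period_min


def meets_slo(measured_percent: float, target_percent: float) -> bool:
    """True when the measured SLI satisfies the availability objective."""
    return measured_percent >= target_percent


# Hypothetical 30-day evaluation period with two outages (90 and 126 minutes).
period = 30 * 24 * 60
outages = [Outage(1_000, 1_090), Outage(20_000, 20_126)]
sli = availability_percent(period, outages)
print(f"availability: {sli:.2f}%  meets 99.5% target: {meets_slo(sli, 99.5)}")
```

Note that this measures whatever the outage records represent. If they come from resource monitors rather than from user-facing transaction monitors, the result is resource availability, which, as the text points out, may differ from the actual user experience.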
This simplistic view of IT domains does not account for the fact that each of these domains represents a number of different technologies integrated into complex configurations that can be managed by a variety of tools. However, when these domains are taken together, they control the quality of service. Therefore, it is necessary to install products for monitoring each domain. From a functional perspective, SLM monitoring of the IT domains should include event monitoring, performance monitoring, usage monitoring, security monitoring, etc. In our illustration of a generic SLM implementation in this chapter, we do not address the specific monitoring tools. However, the following chapters demonstrate an example of SLM implementation using IBM Tivoli products. The primary challenge before an IT organization, when it initiates the SLM program, is the question of which products to install and how to integrate them into the most suitable SLM solution. After IT completes the planning and the SLA negotiation phases, it usually has a clear understanding of the tools it needs to implement to support SLAs. It has already decided to acquire missing tools. When additional products are required, installing, customizing, and integrating the new products into the existing system management solution can be a significant part of the SLM implementation effort. Since service can traverse multiple SLM domains, an IT organization must be able to view and evaluate the collected domain monitoring data for each supported service. In addition, SLM necessitates monitoring of user experiences of the delivered service through use of transaction monitors that can generate transactions and record their execution.
improved service levels, IT must relate this improvement to increase in business volumes, improved productivity, and better customer satisfaction. The same can be said about service outages and degradation. IT needs to demonstrate their impact on business performance and costs. IT management The service reports that IT distributes to business management should also be reviewed by all levels of IT management. This helps IT managers to understand how component failures and performance degradation affect service levels and impact business performance. In addition, IT management should receive the traditional technology reports that report the outages and performance degradation of resources as well as the response time and volume of application transactions. Using time as a correlation factor for both technology and service level reports, IT managers can gain knowledge regarding how the technology area that they manage affects the overall quality of IT delivered services. In addition to the SLA historical reporting (daily detailed reports, weekly summaries, monthly overviews, quarterly business summaries), an IT organization should implement the real-time alerting and proactive notification of customers and IT staff. It is important for real-time alerting of service outages and degradation to show the components that cause the impact, which business users are affected, and communicate business impact. As explained in Chapter 1, Introduction to service level management on page 3, BSM is well suited to perform this function.
Event management
BSM provides facilities that allow consolidation of all enterprise events and provide a single point for event management based on business priorities. This increases the value and productivity of the IT operation and service desk personnel. It also prompts IT to establish a control center function that will be responsible for managing events. Important: There are some key benefits of well implemented event management processes. For example, IT management and business executives can evaluate the immediate business impact of IT events and understand how they affect SLA compliance. IT operations can prioritize fault management.
Availability management
SLM facilitates the transition from management of IT components to management of IT services and changes the metrics for measuring availability. When the underlying IT resources experience problems or become unavailable, the service may still perform satisfactorily if resources are duplicated. The focus of BSM on service state management significantly improves the understanding of services. It offers more robust capabilities to determine service states based on rules governing the impact of events received by the underlying resources. Important: When managing availability, an IT organization must focus on identifying critical events for each service that by definition impact this service availability. IT operations can significantly improve the availability of IT services through the proactive management of critical events.
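To illustrate why a resource outage does not necessarily mean a service outage when resources are duplicated, the following minimal sketch derives a service state from groups of redundant resources. The `RedundancyGroup` type, the state names, and the resource names are hypothetical and chosen for illustration only; they are not the rule syntax of any BSM product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyGroup:
    """A set of interchangeable resources, of which at least min_up must be running."""
    name: str
    members: tuple[str, ...]
    min_up: int


def service_state(groups: list[RedundancyGroup], down: set[str]) -> str:
    """Return 'available', 'degraded', or 'unavailable' for the service.

    The service is unavailable as soon as one group falls below its minimum.
    It is degraded when it still works but has lost some redundancy.
    """
    degraded = False
    for group in groups:
        up = sum(1 for member in group.members if member not in down)
        if up < group.min_up:
            return "unavailable"
        if up < len(group.members):
            degraded = True
    return "degraded" if degraded else "available"


# Hypothetical service: two duplicated Web servers and a single database.
groups = [
    RedundancyGroup("web", ("web1", "web2"), min_up=1),
    RedundancyGroup("database", ("db1",), min_up=1),
]
print(service_state(groups, down={"web1"}))  # one Web server lost, service still works
print(service_state(groups, down={"db1"}))   # the only database is lost
```

In this sketch, an event on `db1` is a critical event for the service by definition, while an event on a single Web server is not.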
Capacity management
Monitoring the performance of IT physical domains, defined in 2.3.3, Implementing service level management tools on page 38, is a well established discipline in the majority of IT organizations. When implementing SLM, an IT organization requires additional aggregations of collected performance information to meet SLA obligations for reporting on the service level performance. Important: With BSM facilitating the mapping of resource-to-service relationships, an IT organization can improve its performance management processes by prioritizing the management of IT resources based on their business value. This approach also applies to proactively planning for additional capacity when service levels are in danger.
Change management
An IT organization uses the change management process to evaluate the impact of requested changes and, therefore, to reduce risk of pending requests. Both SLM and BSM can significantly boost the effectiveness of any change management process by supplying the criteria for risk evaluation, provided by SLAs, and facilitating impact visualization provided by BSM. Important: An IT organization must adjust its change management process to evaluate implications of the requested changes on agreed service levels and understand their business impact.
Incident management
Some SLAs include SLOs for measuring service desk responsiveness and IT handling of faults. Service levels may include a time value for problem escalations and a mean-time-to-repair value. Every IT organization has some variation of an incident reporting system and escalation procedures. BSM improves event management and incident recording. It provides capabilities for proactive management of resources in need of repair. It often offers a bidirectional interface to a number of help desk solutions. The business focus of SLM and BSM enables an IT organization to improve its incident management process through timely recognition of faults, better understanding of their impact, and the added value of SLA reporting. Important: When implementing SLM, IT needs to integrate its manual processes and the help desk solution it uses for incident management with SLAs and BSM.
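The two incident-related service levels mentioned above, a mean-time-to-repair value and a time value for escalation, can be computed from incident open and close timestamps. This is a minimal sketch under stated assumptions: the incident records, the function names, and the four-hour escalation limit are hypothetical and do not come from any help desk product.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (resolved - opened) over a list of closed incidents."""
    if not incidents:
        raise ValueError("at least one closed incident is required")
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


def needs_escalation(opened: datetime, now: datetime, limit: timedelta) -> bool:
    """True when an open incident has exceeded the agreed escalation time."""
    return now - opened > limit


# Hypothetical closed incidents: repaired in 1 hour and in 3 hours.
closed = [
    (datetime(2004, 12, 1, 9, 0), datetime(2004, 12, 1, 10, 0)),
    (datetime(2004, 12, 2, 14, 0), datetime(2004, 12, 2, 17, 0)),
]
print("MTTR:", mean_time_to_repair(closed))

# Hypothetical open incident checked against a 4-hour escalation limit.
print(needs_escalation(datetime(2004, 12, 3, 8, 0),
                       datetime(2004, 12, 3, 13, 0),
                       timedelta(hours=4)))
```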
Cost management
SLM uses SLAs as a mechanism for governing use of IT resources to ensure that IT services are performing according to the SLA specifications. Customers become aware of cost implications while negotiating SLAs. An IT organization must balance service cost with service delivery. As the service provider, IT should use service pricing as the mechanism for accounting for resource usage by business units. However, both resource accounting and services charges become a contentious issue between IT and business units. Important: When implemented, both SLM and BSM should have input into the cost management process. This enables an IT organization to establish the regulation of resource use based on business value and improve communication with business units when applying charges for services.
Application support
Many enterprises have centralized all application development activities and infrastructure management activities under one IT organization. The scenarios in Part 2, Case study scenarios on page 195, use this model. IT development organizations typically develop and support such applications. Application support staff work for IT development management and interface with both business and IT support departments. For this reason, application support people can greatly contribute to SLA development, while greatly benefiting from the SLM and BSM implementation. Application support staff typically are well aware of the business process that IT is automating with its applications. The development organization often possesses the knowledge of service parameters such as the number of expected users, the expected response time, etc. In addition, the development organization may provide its own instrumentation to assist in managing performance of the applications that it implemented in support of business. However, application support staff often lack knowledge of the IT infrastructure and rely on IT support and operation staff when researching user problems. Important: Application support people must be included in both the planning and implementation of the SLM and BSM programs. They should be involved in the design of service compositions for both SLM and BSM and should provide further input during their ongoing application support activities.
The ongoing administration of business views includes the following activities: Adding new business views upon requests from the IT change management team Adjusting business views upon addition of new resources Deleting business views that are no longer needed Ongoing maintenance of business views
Without automation, ongoing SLA management often fails to deliver the intended value despite a well-planned and well-executed implementation. It is unacceptable to business executives when an IT organization takes several weeks to consolidate technical reports into a combined view of service.
Cost of support: Better understanding of faults, their priority, and impact can significantly increase productivity of control center personnel and IT support staff. Fault management by business priorities also improves quality of IT operations, increases productivity of root cause analysis, and provides more visibility of IT value. Ongoing management for the effective priority management of real-time faults is not practical without BSM tools. The remaining chapters of this book provide detailed examples of priority management of real-time events by IBM Tivoli products.
IT management should work with business executives to immediately address any issues of user distrust of the reported service levels and use these issues as an opportunity for additional tuning.
IT must investigate any deviations in the existing service levels. If it finds that service violations resulted from changes in business volumes or user behavior, IT must proactively communicate its findings to business units and renegotiate service levels as necessary. IT must also integrate the rollout of new business applications with its change management process and generate change requests for new service definitions and SLOs before deploying these applications in production.
Most management solutions today typically require a significant customization. Integrating them with IT processes to provide SLM is a difficult and laborious effort. Chapter 1, Introduction to service level management on page 3, introduces a business-oriented approach for managing IT services or BSM and the value of its integration with SLM. A proactive approach of process and tools integration around a single set of service definitions can significantly improve the efficiency and the effectiveness of any SLM program. The remainder of this book demonstrates, via detailed examples and case studies, an SLM solution design that involves monitoring IT resources, monitoring of user experiences, event correlation as well as BSM automation, analytics, and reporting. Two test cases describe the integration of eight Tivoli products in support of two different SLM initiatives.
Chapter 3.
[Figure: panel labels and product lists recovered from the diagram]
Availability (Event Correlation and Automation): IBM Tivoli Enterprise Console, IBM Tivoli Monitoring for Transaction Performance, IBM Tivoli NetView
Performance (Monitor Systems and Applications / User Experience): IBM Tivoli Monitoring for Transaction Performance, IBM Tivoli Monitoring, IBM Tivoli Monitoring for Databases, IBM Tivoli Monitoring for Business Integration, IBM Tivoli Monitoring for Web Infrastructure
The IBM products directly relevant to SLM are:
- IBM Tivoli NetView Family
- IBM Tivoli Enterprise Console
- IBM Tivoli Monitoring for Transaction Performance

Performance management
This includes products that measure the internal performance of systems and applications. They also provide information about the experience of end users. The functionality includes continuous monitoring and recording of information, raising alerts when thresholds are exceeded, and gauging user experience by making response time measurements and running synthetic transactions. These products can monitor hardware, databases, and applications. The IBM products directly relevant to SLM are:
- IBM Tivoli Monitoring for Transaction Performance
- IBM Tivoli Monitoring
- IBM Tivoli Monitoring for Databases
- IBM Tivoli Monitoring for Business Integration
- IBM Tivoli Monitoring for Web Infrastructure
Main functions
The main functions of IBM Tivoli Business Systems Manager are: Console consolidation IBM Tivoli Business Systems Manager provides a consolidated view of systems management information derived from a wide range of existing IT management solutions and IT platforms. In doing so, it enables you to maintain the value of existing tools while reducing complexity. For a full list of supported platforms and systems management tools, see IBM Tivoli Business Systems Manager Getting Started Guide, SC32-9088. This list includes:
Distributed systems products
- IBM Tivoli Enterprise Console 3.7.1 or later
- IBM Tivoli NetView Version 7.1 or later
- IBM Tivoli Monitoring Version 5.1 or later
- IBM Tivoli Monitoring for Database, Application, Business Integration, Web Infrastructure, and Collaboration
- IBM Tivoli Monitoring for Transaction Performance Version 5.1 or later
- BMC Patrol Version 3.4
- Computer Associates Unicenter TNG Versions 2.1, 2.2, and 2.4
- NetIQ AppManager Server Version 4.02
- Hewlett-Packard Openview Network Node Manager for Solaris and HP/UX

z/OS products
- IBM Tivoli System Automation for z/OS Version 2.3
- IBM Tivoli NetView for z/OS Version 5.1
- IBM Tivoli Workload Scheduler for z/OS Version 8.1 or later
- IBM Tivoli OMEGAMON products
- Various third-party schedulers and other systems management products from BMC, Computer Associates, and Allen Systems Group
Monitoring from a business services perspective IBM Tivoli Business Systems Manager provides monitoring capability for a complex combination of system resources across multiple platforms. As a result, it provides views that reflect the business services being provided across the enterprise. Executive awareness of service status By providing executive dashboards that reflect the status of business services, IBM Tivoli Business Systems Manager provides executives in your organization with a clear and simple view of the status of their key business services. Impact analysis and critical path management IBM Tivoli Business Systems Manager provides views that clearly show the impact of faults in the infrastructure on business services. In doing so, it facilitates prioritization of fault resolution effort based on business impact. It also helps with the identification of single points of failure. Root cause analysis The various views and reports available in IBM Tivoli Business Systems Manager can be used to assist the process of root cause analysis. The Business Impact view shows resources that are affected by a fault and their relation to the resource with the fault. Also the Event View displays the events that triggered the resource state change.
Reporting IBM Tivoli Business Systems Manager provides standard reports out of the box. It also provides a process to export systems management data to the Tivoli Data Warehouse for analysis. Basing service level agreements (SLAs) on business services The close coupling of IBM Tivoli Business Systems Manager with Tivoli Data Warehouse and IBM Tivoli Service Level Advisor enables construction of SLAs based on the availability of business systems using out-of-the-box interfaces. Visibility of SLA breaches and trends The Tivoli Data Warehouse and IBM Tivoli Service Level Advisor interfaces also enable SLA breaches and trends to be made visible in executive dashboard views. Resource discovery IBM Tivoli Business Systems Manager includes several tools to assist in discovery of resources present in an enterprise to reduce implementation time and costs. See Resource discovery on page 61.
Speeds implementation time; reduces errors; ensures currency and accuracy of management view
Features: Dynamically adjusts the business system view for components added, modified, or deleted
Advantages: Automatically keeps the business system view up to date, avoiding the obsolete information displays that manual entry causes
Business systems
Imagine a Web-based insurance application. The infrastructure for the service may consist of a set of applications running on UNIX and Microsoft Windows 2000 servers (some outside the company intranet and others behind firewalls), legacy mainframe database systems, miscellaneous load balancers and other network devices, and diverse other components. Together they deliver the service that customers know as Online Insurance. An IBM Tivoli Business Systems Manager business system is a logical container or folder that is populated with resources representing IT components. In this example, IBM Tivoli Business Systems Manager represents Online Insurance as a business system that contains icons that represent the resources that deliver the service. Business systems can be created manually from the console, automatically by giving IBM Tivoli Business Systems Manager a set of rules, or via Extensible Markup Language (XML) files. For full details, see Chapter 4, Planning to implement service level management using Tivoli products on page 109. There are three aspects of a business system:
- Resources: The group of resources that provide the business function
- Relationships: The hierarchical relationship between the resources
- Propagation rules: The method of dealing with events that affect the resources
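The three aspects of a business system (resources, hierarchical relationships, and propagation rules) can be pictured as a tree in which the state of each container is derived from the states below it. The sketch below is an illustration only, not the IBM Tivoli Business Systems Manager data model: the `Node` class, the three state names, and the "worst state wins" propagation rule are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Hypothetical ordering of states, from least to most severe.
SEVERITY = {"ok": 0, "warning": 1, "critical": 2}


@dataclass
class Node:
    """A business system or a resource inside it."""
    name: str
    state: str = "ok"  # own state, as set by events received for this node
    children: list["Node"] = field(default_factory=list)


def propagated_state(node: Node) -> str:
    """Assumed propagation rule: a node shows the worst state found in its subtree."""
    states = [node.state] + [propagated_state(child) for child in node.children]
    return max(states, key=SEVERITY.__getitem__)


# Online Insurance as a container of the resources that deliver the service.
web_server = Node("Web server")
mainframe_db = Node("Mainframe database")
online_insurance = Node("Online Insurance", children=[web_server, mainframe_db])

mainframe_db.state = "critical"  # an event changes the state of one resource
print(online_insurance.name, "->", propagated_state(online_insurance))
```

A real propagation rule is usually more selective than this, for example by ignoring events on duplicated resources, which is why the rules are treated as a separate aspect of the business system.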
Business systems may be built for different purposes, for example: Service based: A business system that contains a set of applications and other resources that support a service such as internet banking Department based: A business system that contains all resources supporting the accounting department Technology based: A business system that contains all UNIX servers in the enterprise Geographically based: A business system that contains all applications for the Europe, Middle East, Africa (EMEA) region
Work spaces
The IBM Tivoli Business Systems Manager systems administrator can design different work spaces for users. The work space setup determines what individual users see when they log on. The systems administrator must design work spaces carefully to reflect the roles of the people using them. The work spaces must also focus the attention of support staff on the most important business services. A help desk may need a work space that includes a business system view based on the physical organization of systems and applications. But a CIO may want a work space that shows all the business processes in the enterprise, at a lower level of detail than the help desk.
Resource discovery
Before IBM Tivoli Business Systems Manager can monitor a resource, it must be aware of its existence, understand what type of resource it is, and know where it belongs in the enterprise. Even a medium-sized enterprise contains too many resources to record manually, so IBM Tivoli Business Systems Manager provides several mechanisms for discovering resources:
- Bulk discovery: This runs as a batch job on z/OS systems. It also sends information about discovered resources to the IBM Tivoli Business Systems Manager database where Load/Discover scheduled jobs are run to complete the processing. A similar bulk discovery process is provided for Tivoli Workload Scheduler for z/OS, and for distributed systems resources instrumented with monitors. They communicate through the IBM Tivoli Business Systems Manager common listener interface, including IBM Tivoli NetView and CA Unicenter TNG.
- Rediscovery: This is similar to bulk discovery, except that resources already in the database are ignored. It is essentially a delta discovery.
- Auto discovery: When enabled, this process automatically discovers certain types of resources, including DB2, IMS, and CICSPlex resources. Similar script-driven processes are available to drive delta discoveries for resources instrumented through the common listener interface and the set of IBM Tivoli Monitoring products.
- Discovery by event: This process discovers resources that were not previously identified from messages and exceptions sent to IBM Tivoli Business Systems Manager. If an event is received for an unknown resource, the discovery process creates the resource and posts the event to it.
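Discovery by event, the last mechanism described above, follows a simple pattern: when an event names a resource that is not yet known, create the resource, then post the event to it. The following minimal sketch shows that pattern only. The `ResourceRegistry` class, its methods, and the sample resource names are hypothetical and do not represent the product's interfaces or database.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    resource_type: str
    events: list[str] = field(default_factory=list)


class ResourceRegistry:
    """In-memory stand-in for the database of known resources."""

    def __init__(self) -> None:
        self._resources: dict[str, Resource] = {}

    def post_event(self, name: str, resource_type: str, message: str) -> bool:
        """Post an event; create the resource first if it is unknown.

        Returns True when the event caused a new resource to be discovered.
        """
        discovered = name not in self._resources
        if discovered:
            self._resources[name] = Resource(name, resource_type)
        self._resources[name].events.append(message)
        return discovered

    def get(self, name: str) -> Resource:
        return self._resources[name]


registry = ResourceRegistry()
first = registry.post_event("db2prod", "database", "tablespace nearly full")
second = registry.post_event("db2prod", "database", "tablespace full")
print(first, second, registry.get("db2prod").events)
```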
[Figure: labels recovered from the architecture diagram: z/OS (Source/390, Tivoli NetView for z/OS); Tivoli Data Warehouse; TBSM Servers (Host Integration Server, Event Handler Server, History Server, Web Console, Propagation Server, Database Server, Console Server); Console; Agent Listener]
Console server: This supports IBM Tivoli Business Systems Manager Clients using the Java console. Propagation server: This performs impact analysis on events received by IBM Tivoli Business Systems Manager to determine what business systems are affected. Events are propagated to higher level business system objects in accordance with the business system hierarchy and propagation rules. Event handler server: This processes events coming to IBM Tivoli Business Systems Manager from z/OS environments if these are being managed. Host integration server: This is required if IBM Tivoli Business Systems Manager is to process events from z/OS machines that do not have TCP/IP communications protocol installed. It handles Systems Network Architecture (SNA)-based communications used on legacy systems. In practice, most client implementations of Tivoli Business Systems Manager do not require this service. Web Console application server: This supports clients accessing IBM Tivoli Business Systems Manager with a Web browser-based console. The Web console provides many of the views available to users of the Java console and is suitable for many types of users. Health monitor server: This monitors the health and availability of the other IBM Tivoli Business Systems Manager servers and their related components.
Main functions
There are four main functions within Tivoli Data Warehouse. Importing data from source applications: This involves running a source Extract-Transform-Load (ETL) program, commonly referred to as an ETL1, to move operational data from the source location into the central data warehouse. Data is condensed as this is done. Preparing data for use in reporting: This involves running a target ETL program, commonly known as an ETL2, to prepare data and move it into a data mart ready for use by the target reporting application. Design and production of reports: Apart from producing simple reports, this is done using the functionality of the reporting or business intelligence tools rather than the Tivoli Data Warehouse itself. Housekeeping: Various housekeeping jobs are run to maintain the database and archive old data at a predetermined point. Many IBM Tivoli products are delivered with warehouse enablement packs (WEPs), which provide the ETLs needed for the previously listed processes. The concepts of ETLs and data marts are explained further in 3.2.4, Key concepts in IBM Tivoli Business Systems Manager on page 59.
Data consolidation
Open, proven, and out-of-the box interfaces for many IBM Tivoli products Being built on a relational database management system (RDBMS) architecture provides a high degree of scalability
Features and advantages
- Ability to use many analysis and reporting tools: Provides the ability to use the reporting tool of choice for the organization
- Out-of-the-box reports for IBM Tivoli applications: Standard reports delivered with IBM Tivoli applications may be sufficient for many purposes
- Integration with IBM Tivoli Service Level Advisor: Out-of-the-box interface enables rapid development of SLAs based on data in the warehouse
- Built-in security: Ability to segregate data for different customers using out-of-the-box functionality

Benefits
- Reduced cost of designing and producing standard reports
- Rapid development of SLAs
- Ability to use one data warehouse for multiple customers to reduce costs and maintenance
ETL programs
ETL programs process data in three steps. 1. Extract: Data is extracted from the data source. 2. Transform: Data is validated, transformed, aggregated, and cleansed so that it fits the required format. 3. Load: The processed data is loaded into the target database. In Tivoli Data Warehouse, there are two types of ETLs whose operation is shown in the diagram in Figure 3-3. Central warehouse ETL: Otherwise known as a source ETL or ETL1, this ETL extracts the data from the source applications and loads it into the central data warehouse. Data mart ETL: Otherwise known as target ETL or ETL2, this ETL loads data into data marts and is discussed in the next section.
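The three ETL steps above can be sketched in a few lines. This is a generic illustration of extract, transform, and load, not the implementation of a Tivoli Data Warehouse ETL or warehouse enablement pack: the row layout, the validation rule, and the hourly averaging are assumptions made for the example, and plain Python lists stand in for the source and target databases.

```python
from collections import defaultdict
from statistics import mean


def extract(source: list[dict]) -> list[dict]:
    """Extract: read the raw rows from the data source."""
    return list(source)


def transform(rows: list[dict]) -> list[dict]:
    """Transform: drop invalid rows, then aggregate to one row per component and hour."""
    grouped: dict[tuple[str, int], list[float]] = defaultdict(list)
    for row in rows:
        value = row.get("response_ms")
        if value is None or value < 0:  # validation and cleansing
            continue
        grouped[(row["component"], row["hour"])].append(value)
    return [
        {"component": component, "hour": hour, "avg_response_ms": mean(values)}
        for (component, hour), values in sorted(grouped.items())
    ]


def load(records: list[dict], target: list[dict]) -> int:
    """Load: append the processed rows to the target and report how many were written."""
    target.extend(records)
    return len(records)


# Hypothetical operational data; two rows are invalid and are dropped.
source_rows = [
    {"component": "web", "hour": 9, "response_ms": 100},
    {"component": "web", "hour": 9, "response_ms": 300},
    {"component": "web", "hour": 9, "response_ms": None},
    {"component": "db", "hour": 9, "response_ms": -5},
    {"component": "db", "hour": 10, "response_ms": 50},
]
warehouse: list[dict] = []
written = load(transform(extract(source_rows)), warehouse)
print(written, warehouse)
```

The aggregation in the transform step corresponds to the condensing of data mentioned earlier: the target holds fewer, summarized rows than the source.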
[Figure 3-3: labels recovered from the diagram: Data Source, ETL1, ETL2, Web-based Reports]
Data marts
Although it is possible to run a query against the entire central data warehouse, this is inefficient because of the large volume and range of data that builds up over time. Instead, data is prepared in advance for use in target applications, such as Crystal Reports, and placed in a data mart. A data mart is a subset of the historical data that satisfies the needs of a specific department, team, or customer. It is optimized for interactive reporting and data analysis. The format of a data mart is specific to the reporting or analysis tool you plan to use. Each application that provides a data mart ETL creates its data marts in the appropriate format. The data mart ETL extracts a subset of historical data from the central data warehouse that contains data tailored to and optimized for a specific reporting or analysis task. The data mart ETL is also known as target ETL or ETL2.
[Figure 3-4: labels recovered from the diagram: Win NT/2000; TDW 1.2 Control Center; Web-based Reports; Crystal ePortfolio; IE 5.5 SP2 & 6.0, Netscape 6.2.3; WM Agent; DB2 UDB EE & DB2/390; Central Data Warehouse; ETL2; Data Mart (Star Schema); Win NT/2000/2003]
Tivoli Data Warehouse is implemented on a set of Intel or UNIX servers. The exact number of physical servers required depends on the size and type of the enterprise that is being managed. Tivoli Data Warehouse Release Notes Version 1.2, SC32-1399, provides guidance about hardware and software prerequisites, as well as the physical placement of the logical servers. Figure 3-4 gives an overview of the Tivoli Data Warehouse 1.2 architecture and supported software components. The architecture can comprise the following elements:
- Tivoli Data Warehouse Control Center Server
- One or more central data warehouse databases
- One or more data mart databases
- IBM DB2 warehouse agents and agent sites
- Crystal Enterprise server

The following sections explain each of these elements in detail.
Source databases
A source database holds operational data to be loaded into the Tivoli Data Warehouse environment. Typically, the source databases are application specific, and their number is likely to increase over time in a Data Warehouse installation. Most Tivoli products provide a warehouse enablement pack (WEP), which makes application-specific data available in a source database. This can be a dedicated warehouse source database, such as the one that comes with IBM Tivoli Monitoring. Or it can be an interface to the application's built-in database, as provided for IBM Tivoli Storage Manager or IBM Tivoli NetView. A WEP for Tivoli products also includes the means to upload data from the source database to the central data warehouse, minimizing the effort required for data collection.
Data marts
A separate set of IBM DB2 databases contains the data marts for your enterprise. Each data mart contains a subset of the historical data from the central data warehouse that satisfies the analysis and reporting needs of a specific department, team, customer, or application. You can have up to four data mart databases in a Tivoli Data Warehouse 1.2 deployment. Each data mart database can contain the data for multiple central data warehouse databases. A WEP for a Tivoli application provides all necessary means to fill data marts with their specific data.
IBM Tivoli Service Level Advisor helps IT service delivery organizations to increase the business value of their delivered service by providing the ability to understand and measure service level attainment within their organization. This service level understanding helps to:
- Maintain productivity and customer satisfaction
- Verify end user service levels
- Analyze historical data to predict future service levels
- Manage costs, and improve planning by assuring offered services
- Measure, manage, and report on availability and performance
- Automate SLM based on SLOs
- Evaluate service delivery based on business schedules
- Provide Web-based customer reports
IBM Tivoli Service Level Advisor depends on the collected performance and availability data from a variety of monitoring and performance tools to deliver SLA reports and SLA trends identification. Figure 3-5 illustrates the flow of data.
Figure 3-5 Data flow in the IBM Tivoli Service Level Advisor (diagram labels: Source Appl 2, Source Appl N, Source ETL1, ETL2, Registration ETL, Process ETL, ITSLA Database, SLM Server, ITSLA Environment)
Service level management life cycle with IBM Tivoli Service Level Advisor
SLM is an ongoing process. Both the service provider and the customer must regularly adjust the SLOs to achieve the best service level at reasonable cost and effort.
IBM Tivoli Service Level Advisor supports the full life cycle of the SLM process:
1. Creating the SLA
2. Monitoring and reporting the service level
3. Delivery and reviewing of SLA reports
4. Ongoing refinement of SLA agreements
IBM Tivoli Service Level Advisor offers easy-to-use interfaces, quick and easy customization of features, and default values where appropriate. It is delivered with several additional IBM applications that support the functionality:
- IBM DB2 Universal Database (DB2 UDB) Enterprise Edition: This database is used to store measurement data.
- IBM Tivoli Service Level Advisor warehouse enablement packs (also known as warehouse packs): This includes ETL routines both for collecting data from the central data warehouse and writing data back into the central data warehouse for use by other applications.
- IBM WebSphere Application Server: This is used by IBM Tivoli Service Level Advisor as the operating environment for the administrative user interface and the reporting interface.
Manage service level definition and business schedules across existing IT infrastructure Flexible, Web-based reporting
Advantages: Provides open, extensible aggregation point for all systems management data (including non-Tivoli data), and cross-domain reporting
Benefits: Leverages business intelligence tools for data mining, and provides an open interface to include additional monitoring data in SLAs
Offerings
An offering is a template used to describe a service, with agreed service levels, that forms the basis for SLAs in which it is ultimately included. Offerings can be differentiated to provide service level choices to customers, such as Gold, Silver, and Bronze services, or any other naming convention that suggests a unique level of service. An offering is associated with a business schedule that is defined with one or more schedule periods. Each schedule period is associated with a unique schedule state, such as peak, prime, standard, off hours, and others. Each of these states can be configured to represent a unique level of service for that schedule period. As a result, you can offer a wide range of service levels in your offering, while also providing for scheduled outages for maintenance or other downtime activities.
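The relationship between an offering, its business schedule, and the schedule states can be pictured with a small data structure. The following is a sketch under stated assumptions, not the product's data model: the class names, the `availability_pct` field, the "maintenance" state, and the times and targets of the sample Gold offering are all hypothetical.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional, Tuple

@dataclass(frozen=True)
class SchedulePeriod:
    state: str                          # schedule state, such as "peak"
    start: time                         # inclusive
    end: time                           # exclusive
    availability_pct: Optional[float]   # None: no service level applies

@dataclass(frozen=True)
class Offering:
    name: str
    periods: Tuple[SchedulePeriod, ...]

    def period_at(self, moment: time) -> Optional[SchedulePeriod]:
        """Return the schedule period that covers the given time of day."""
        for period in self.periods:
            if period.start <= moment < period.end:
                return period
        return None

# Hypothetical Gold offering: each period has its own state and target,
# and the maintenance window carries no service level at all.
gold = Offering("Gold", (
    SchedulePeriod("off hours", time(0, 0), time(2, 0), 99.0),
    SchedulePeriod("maintenance", time(2, 0), time(4, 0), None),
    SchedulePeriod("standard", time(4, 0), time(8, 0), 99.0),
    SchedulePeriod("peak", time(8, 0), time(18, 0), 99.9),
    SchedulePeriod("prime", time(18, 0), time(23, 59, 59), 99.5),
))
```

A Silver or Bronze offering would reuse the same structure with looser targets, which is what makes the offering a template rather than an agreement with one customer.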
Realms
The highest level of segregation is called a realm. A realm contains one or more customers. For example, you may create a realm for all customers in the United States and another realm for customers in Europe. You might also create a realm for customers in a particular line of business within your organization or another grouping that makes sense for your enterprise. Customers can be associated with more than one realm.
Customers
The second level of segregation is called a customer. A customer must be associated with at least one realm. When SLAs are defined in IBM Tivoli Service Level Advisor, they are associated with both realms and customers. When IBM Tivoli Service Level Advisor users are given access to reporting functionality, they are given permission to access specific realms and customers. They are unable to view data related to realms or customers for which they have not been granted permissions.
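The effect of realm and customer permissions on reporting can be sketched as a simple filter. This is an illustration only: the class names and sample data are hypothetical, and the rule that a user needs a grant for both the realm and the customer of an SLA is an assumption drawn from the description above, not a statement of how the product implements its checks.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Sla:
    name: str
    realm: str
    customer: str

@dataclass
class ReportUser:
    name: str
    realms: set = field(default_factory=set)      # realms the user may access
    customers: set = field(default_factory=set)   # customers the user may access

def visible_slas(user, slas):
    """Return the SLAs whose realm and customer are both granted to the user."""
    return [
        sla for sla in slas
        if sla.realm in user.realms and sla.customer in user.customers
    ]

slas = [
    Sla("Online banking", "United States", "Retail"),
    Sla("Online banking", "Europe", "Retail"),
    Sla("Payroll", "United States", "Finance"),
]
analyst = ReportUser("analyst", realms={"United States"}, customers={"Retail"})
allowed = visible_slas(analyst, slas)
```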
SLM reports
The report servlets use the functions of the IBM WebSphere Application Server to obtain SLA results data and generate summary reports in the form of tables and graphs that can be displayed in a Web browser. The enterprise can use these servlets to create customized Web pages for customers, displaying results of evaluation and trend analyses, such as:
- Actual level of service provided
- Number of SLA violations
- Trends toward future violations
- Simulate customer transactions: While mimicking the behavior of real users performing standard tasks, you can collect performance data that helps you assess the health of your on demand business components and configurations under different conditions and at different times.
- Reporting: You can produce comprehensive real-time reports that display recently collected data in a variety of formats and from a variety of perspectives. By integrating with Tivoli Data Warehouse, you can store collected data for use in historical analysis and long-term planning.
- Notification of performance issues: You can receive prompt automated notification of performance problems either directly through a console or by integration with IBM Tivoli Enterprise Console and IBM Tivoli Business Systems Manager.
- Root cause analysis: You can quickly isolate the source of performance problems as they occur, so that you can correct those problems before they produce expensive outages and lost revenue.
- IBM Tivoli Enterprise Console integration
- IBM Tivoli Business Systems Manager integration
- Tivoli Data Warehouse integration
comprehensive end-to-end management capability that includes measuring application availability, application performance, application usage, and end-to-end transaction response time. The ARM API defines a small set of functions that can be used to instrument an application to identify the start and stop of important transactions.
IBM Tivoli Monitoring for Transaction Performance provides an ARM engine to collect the data from ARM instrumented applications. This is a multithreaded application implemented as the tapmagent that exchanges data through an IPC channel, using the libarm library, with ARM instrumented applications. Data is collected and aggregated to generate useful information. It is correlated with other transactions, and then thresholds are checked against policies. Data is forwarded to the management server and placed into the database for reporting purposes. IBM Tivoli Monitoring for Transaction Performance Version 5.3 also provides a generic ARM component for more transaction monitoring coverage. The generic ARM capability enables you to monitor custom ARM-instrumented applications.
Note: ARM instrumentation does not support a 64-bit Java Virtual Machine (JVM).
The ARM engine notifies the IBM Tivoli Monitoring for Transaction Performance Management Server of transaction violations, new edge transactions appearing, and edge transaction status changes.
The following paragraphs provide an overview of the transaction correlation provided by IBM Tivoli Monitoring for Transaction Performance. For additional information, including instrumenting applications using ARM, see the IBM Tivoli Monitoring for Transaction Performance Administrator's Guide Version 5.3, GC32-9189.
ARM correlation is the method by which parent transactions are mapped to their respective child transactions across multiple processes and multiple servers. Each IBM Tivoli Monitoring for Transaction Performance component is automatically ARM-instrumented and generates a correlator.
The initial root/parent or edge transaction is the only transaction that does not have a parent correlator. From there, IBM Tivoli Monitoring for Transaction Performance can automatically connect parent correlators with child correlators to trace the path of a distributed transaction through the infrastructure. It provides the mechanisms to easily visualize this through the topology views.
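The idea of connecting parent correlators with child correlators can be sketched as follows. This is a simplified illustration, not the ARM API or the product's implementation: real ARM correlators are opaque byte sequences, whereas here they are short strings, and the record layout and transaction names are hypothetical.

```python
# Each record carries its own correlator and the correlator of its parent.
# The edge (root) transaction is the only one whose parent is None.

def trace_paths(records):
    """Return every root-to-leaf path of transaction names."""
    children = {}
    roots = []
    for rec in records:
        if rec["parent"] is None:
            roots.append(rec)
        else:
            children.setdefault(rec["parent"], []).append(rec)

    paths = []

    def walk(rec, prefix):
        path = prefix + [rec["name"]]
        kids = children.get(rec["correlator"], [])
        if not kids:
            paths.append(path)      # leaf: a complete path has been traced
        for kid in kids:
            walk(kid, path)

    for root in roots:
        walk(root, [])
    return paths

records = [
    {"name": "GET /login", "correlator": "c1", "parent": None},
    {"name": "LoginServlet", "correlator": "c2", "parent": "c1"},
    {"name": "JDBC query", "correlator": "c3", "parent": "c2"},
    {"name": "EJB call", "correlator": "c4", "parent": "c2"},
]
paths = trace_paths(records)
```

Because each child only needs to know its parent's correlator, the records can come from different processes and servers and still be joined into one path afterward.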
IBM Tivoli Monitoring for Transaction Performance implements the following ARM correlation mechanisms:
- Parent-based aggregation: This process collects transaction performance data on the parent of a subtransaction and displays transaction performance relative to its path. This provides the ability to monitor the connection points between transactions. It also monitors path-based transaction performance across farms of servers providing the same function.
- Policy-based correlators: A portion of the correlator is used to pass a unique policy identifier within the correlator. The associated policy controls the amount of data collected and the thresholds associated with that data.
- Instance and aggregated performance statistics: This provides both additional metrics and a complete and exact trace of the path taken by a specific transaction.
- Parent performance initiated trace: The trace flag within the ARM correlator is used by the agent in the trace field for transactions that are performing outside of their threshold. This provides for the dynamic collection of instance data across all systems where this transaction executes.
- Sibling transaction ordering: This is the ability to determine the order of execution of a set of child transactions relative to each other.
- Aggregated correlation: IBM Tivoli Monitoring for Transaction Performance carries out aggregated correlation. This provides a summary of a transaction over a period of time rather than a record for each and every instance of a transaction.
The Generic Windows component plays back a Rational Robot recording to provide timing measurements.
J2EE instrumentation
IBM Tivoli Monitoring for Transaction Performance provides enhanced J2EE instrumentation capabilities. The collection of ARM data generated by J2EE applications is invoked from the management server and is controlled by user-configured policies. The monitoring policy is then distributed to the management agent. The transactions to monitor are specified using edge definitions, for example, the first URI invoked when using the application. It is possible to define the level of monitoring for each edge. To monitor a J2EE application server, the computer must be running the IBM Tivoli Monitoring for Transaction Performance agent. A single IBM Tivoli Monitoring for Transaction Performance agent can monitor multiple J2EE application servers on the management agent's host. IBM Tivoli Monitoring for Transaction Performance J2EE monitoring uses Java byte-code insertion (BCI).
- Real-time reports: This interface is also accessed by a browser and provides graphical displays of performance data collected by the monitoring and playback components. There are reports to help you assess the performance and availability of your Web sites and Microsoft Windows applications.
- Event generation: Application events are generated when performance thresholds are exceeded; system events are generated for system errors and notifications. Events can be viewed and event severities configured to decide what action is to be taken when they are generated. The management server can send e-mail notification to specified recipients, run a specified script, or forward selected event types to the IBM Tivoli Enterprise Console or as Simple Network Management Protocol (SNMP) traps.
- Storage of policies and data: The management server controls a set of databases that store policy information, events, and performance data collected by management agents.
- Communication with management agents: The management server uses Web services and the Secure Sockets Layer (SSL) to communicate with the management agents. ARM data is uploaded to the management server from management agents at regularly scheduled intervals (the upload interval). By default, the upload interval is once per hour.
Servers. All of the management agents in the group can run a J2EE listening policy that you create to monitor the banking application.
- Threshold checking: When performance thresholds in listening or playback policies are exceeded, the management agent sends events to the management server. Events can be set for transactions, and in many cases, for the subtransactions within a transaction. A subtransaction is one step in an overall transaction.
products from independent software vendors (ISV) or custom in-house applications. The Generic ARM component can also detect and monitor custom metrics that are recorded from these ARM instrumented applications. All transaction data collected by the Quality of Service, J2EE, STI, and Generic Windows monitoring components of IBM Tivoli Monitoring for Transaction Performance is collected by ARM.
- To forward appropriate events to the IBM Tivoli Business Systems Manager to enable it to determine the business impact of faults
The Tivoli Enterprise Console product helps you effectively process the high volume of events in an IT environment by:
- Prioritizing events by their level of importance
- Filtering redundant or low-priority events
- Correlating events with other events from different sources
- Determining who should view and process specific events
- Initiating automatic corrective actions, when appropriate, such as escalation notification, and opening trouble tickets
- Identifying hosts and automatically grouping events from the hosts that are in maintenance mode in a predefined event group
or modify the event. If human intervention is required, the event server notifies the appropriate operator. The operator performs the required tasks and then notifies the event server when the condition that caused the event is resolved.
Incoming events are given a unique number and time stamped as they are entered into the event database. They are then evaluated by the rule engine. If the rule engine is busy, events are buffered and evaluated later.
Rules include actions to be taken when an event meets the specified rule conditions. This helps to reduce the amount of interpretation and responses required by operators. For example, a particular event may be known to trigger one or more instances of another event. In such a case, a rule can be used to automatically downgrade the severity of the event or close events that are known to be caused by the triggering event.
The event server can use rules to delay responses to an event. This may be used to deal with self-correcting faults to prevent an operator from needlessly responding to a problem that will shortly go away. Rules can be used, for example, to attempt to restart a router and give an operator a low-severity notice. If the attempts to restart the router within a designated time period fail, a rule can specify that attempts to retry be cancelled and that a higher-severity notice be sent to an operator.
If an operator does not respond to an event after a specified period of time, the event server can take additional actions including sending an e-mail, paging the operator, or sending an e-mail notice to an alternate contact.
You can use the predefined rules that the Tivoli Enterprise Console product provides, or you can create your own. For full information about the predefined rules, see IBM Tivoli Enterprise Console Rule Set Reference Version 3.9, SC32-1282. You can find information about creating your own rules in IBM Tivoli Enterprise Console Rule Developer's Guide Version 3.9, SC32-1234.
A rule can specify the following actions among others:
- Correlating events
- Responding automatically to events, such as running an application or script
- Delaying responses to events
- Escalating events
- Modifying event attributes
- Modifying attributes of other events
- Preventing duplicate events from being displayed
- Dispatching Tivoli or other administrative actions on resources
- Reevaluating a set of events
- Discarding an event
- Generating a new event
- Forwarding an event to another event server
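The correlation case described earlier, in which a triggering event is known to cause other events, can be sketched in Python. This is an illustration of the idea only and is not Tivoli Enterprise Console rule syntax: the `Event` class, the event class names, and the cause-to-effect table are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Event:
    event_id: int
    event_class: str
    status: str = "OPEN"

# Hypothetical table: cause event class -> event classes it is known to trigger.
KNOWN_EFFECTS = {"Router_Down": {"Host_Unreachable"}}

def apply_correlation_rule(cause, open_events):
    """Close open events whose class is a known effect of the cause event.

    Returns the IDs of the events that were closed.
    """
    effects = KNOWN_EFFECTS.get(cause.event_class, set())
    closed = []
    for event in open_events:
        if event.status == "OPEN" and event.event_class in effects:
            event.status = "CLOSED"
            closed.append(event.event_id)
    return closed

cause = Event(1, "Router_Down")
others = [
    Event(2, "Host_Unreachable"),
    Event(3, "Host_Unreachable"),
    Event(4, "Disk_Full"),          # unrelated, stays open
]
closed_ids = apply_correlation_rule(cause, others)
```

The operator is left with the router event and the unrelated disk event, rather than one event for every host behind the failed router.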
Support of multiple views:
- Configuration view to configure the event consoles
- Summary chart view to show a high-level overview of the health of resources represented by an event group
- Priority view, in which event groups are represented by buttons with the status indicated by color
Windows event log adapter on the host. When an event adapter receives information from its source, the adapter formats the information and forwards it to the event server for interpretation and response. You can configure an event adapter to discard selected events instead of forwarding them all to the event server to reduce network traffic and event server workload.
Tivoli NetView
IBM Tivoli NetView provides the network management function for the IBM Tivoli Enterprise Console product. It monitors the status of network devices and automatically filters and forwards network-related events to IBM Tivoli Enterprise Console.
- Ready-to-use resource models that report on specific aspects of a system's status. For example, the Process resource model provides information about the status of processes, CPU usage, and so forth.
- The ability to add resource models to a Tivoli profile, which can be distributed to multiple systems simultaneously
- The ability to modify resource models by changing, for example, threshold levels to match specific requirements
- The ability to view both real-time and historical data for any system from a centralized monitoring application, called the Web Health Console, which is supplied with the product
- The ability to send the results of data collection and analysis to the IBM Tivoli Enterprise Console or to the IBM Tivoli Business Systems Manager
- The ability to specify automatic corrective or preventive actions to resolve situations that could develop into real problems
- The ability to schedule monitoring to take place at user-specified times
- A heartbeat function that regularly checks the availability and status of attached endpoints and makes the information available to the IBM Tivoli Enterprise Console server, IBM Tivoli Business Systems Manager, or Tivoli Monitoring Notice Group
Feature: IBM Tivoli Business Systems Manager integration
  Advantage: Enables the business impact of events to be assessed and enables escalation
  Benefit: Ensures focus on the most important issues based on the business impact of a fault
Feature: Tivoli Data Warehouse integration
  Advantage: Enables long-term storage of performance and availability data and supports the use of data in SLAs created with IBM Tivoli Service Level Advisor
  Benefit: Reduced data storage costs and the creation of meaningful SLAs
Resource models
In IBM Tivoli Monitoring terminology, a resource model is defined as the logical modeling of one or more resources, along with the logic on which cyclical data collection, data analysis, and monitoring are based. In practical terms, a resource model is a pre-built set of rules for monitoring a resource. It is installed with IBM Tivoli Monitoring, for example on a server, and may take corrective action or send an event if an exception condition is detected. IBM Tivoli Monitoring provides a range of out-of-the-box, predefined resource models to specify which resource data is accessed from the system at runtime and how this data is processed. For example, the Process resource model obtains data related to processes running on the system. Performance data is automatically collected by the resource model and processed by an appropriate algorithm to determine whether the system is performing to your expectations. Generally, you can use the resource model default values and still obtain useful data. However, if necessary, you can customize the resource models to suit your requirements or even build your own resource models using the IBM Tivoli Resource Model Builder. For details about the resource models supplied with the product, see IBM Tivoli Monitoring Version 5.1.2 Resource Model Reference Guide, SH19-4570-03. For guidance about creating resource models, see IBM Tivoli Resource Model Builder Version 1.1.3 User's Guide, SC32-1391-02.
information every 60 seconds. The data collected is a snapshot of the status of the resources specified in the resource model. Each of the supplied resource models has a default cycle time, which you can modify. Each resource model defines one or more thresholds. A threshold is a named property of the resource with a default value that you can modify in the customization phase. Typically, the value specified for a threshold represents a significant reference level of a performance-related entity. If the level is exceeded or not reached, the operator or system administrator should be notified.
Indications
Each resource model generates an indication if certain conditions implied by the resource model's thresholds are not satisfied in a given cycle. Each resource model has its own algorithm to determine which combinations of thresholds should generate an indication. Indications may be generated in any one of the following circumstances:
- A single threshold is exceeded: For example, in the Windows Process resource model, the Process High CPU indication is generated when the High CPU Usage threshold is exceeded.
- A combination of two or more thresholds is exceeded: For example, in the Windows Logical Disk resource model, a High Read Bytes per Second indication is generated when both the following thresholds are exceeded:
  - The number of bytes transferred per second (being written or read) exceeds the High Bytes per Second threshold.
  - The percentage of time that the selected disk drive spends for read or write requests exceeds the High Percent Usage threshold.
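The Windows Logical Disk example, in which an indication requires two thresholds to be exceeded in the same cycle, can be sketched as follows. The threshold and indication names come from the text; the function name, the numeric threshold values, and the sample measurements are assumptions for illustration and are not the product's defaults.

```python
# Hypothetical threshold values; the product's defaults may differ.
THRESHOLDS = {
    "High Bytes per Second": 1_572_864,
    "High Percent Usage": 90,
}

def logical_disk_indications(bytes_per_sec, percent_usage, thresholds=THRESHOLDS):
    """Return the indications raised for one collection cycle.

    The indication is generated only when BOTH thresholds are exceeded.
    """
    both_exceeded = (
        bytes_per_sec > thresholds["High Bytes per Second"]
        and percent_usage > thresholds["High Percent Usage"]
    )
    return ["High Read Bytes per Second"] if both_exceeded else []

busy_cycle = logical_disk_indications(2_000_000, 95)
one_threshold_only = logical_disk_indications(2_000_000, 40)
```

Requiring a combination of thresholds keeps a disk that is moving a lot of data, but is not saturated, from raising an indication every cycle.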
Figure 3-10 An integrated view of SLM, BSM, and monitoring in process context (diagram labels: Management, Visualization, Historical, Real-Time, Events, Monitoring User Experiences, Monitoring Resources, Monitoring Transactions, Negotiate Agreements, Application, Infrastructure, Business Units, IT Development, IT Operations)
How can you integrate the existing Tivoli products to maximize their value in support of the process illustrated by Figure 3-10? Since software products are simply tools in support of processes deployed by an IT organization, and their solutions vary with each IT organization, the following sections outline a generic integration approach that is represented by Figure 3-10.
The integration approach addresses the following elements:
- Service definitions
- Real-time monitoring
- Historical monitoring
- Fault management
- SLA reporting and alerting
- Problem and change management
Level Advisor. This information includes business system hierarchical structures and the actual time for each of six states for every business system. IBM Tivoli Service Level Advisor operates based on service offerings that are defined manually and have a set of metrics that is linked to the service while it is created. Important: The practical approach to Tivoli Business Systems Manager and IBM Tivoli Service Level Advisor integration involves the IBM Tivoli Service Level Advisor service offering structures modeled on Tivoli Business Systems Manager services. Therefore, Tivoli Business Systems Manager business system data can be used for more accurate measurement of availability for each defined service offering while IBM Tivoli Service Level Advisor can notify the corresponding Tivoli Business Systems Manager service of the pending SLA violation and trending alerts.
Important: Tivoli Business Systems Manager expands real-time event monitoring into real-time monitoring of resource states. It adds value by processing incoming events and recognizing their impact on the state of the corresponding resources. Using the business systems constructs and propagation rules, Tivoli Business Systems Manager combines the states of related resources and allows real-time monitoring of services.
- Tivoli Business Systems Manager history server and reporting system that provide Tivoli Business Systems Manager ASP reports
- Reports available using the Tivoli Data Warehouse reporting interface: Crystal Enterprise Professional for Tivoli
Tivoli Business Systems Manager information in the central data warehouse database is also used by IBM Tivoli Service Level Advisor to generate SLA reports. IBM Tivoli Service Level Advisor uses a set of ETLs to extract data from the central data warehouse database to the SLM measurement data mart database for further analysis and reporting. For details about Tivoli Data Warehouse and IBM Tivoli Service Level Advisor data sources, see Chapter 4, "Planning to implement service level management using Tivoli products" on page 109. Each data source has a unique code that identifies the product with which it is associated.
Important: Tivoli Data Warehouse facilitates an integration of historical data from Tivoli and third-party products through a centralized database and a set of supported WEPs. The main task is to install and schedule these WEPs. Since the size of a database depends on the size of the IT enterprise, it is critical to plan runs and estimate timings for each WEP.
Tivoli Business Systems Manager is designed to manage events in the SLM context through automatic alert propagations to prebuilt and dynamically constructed business systems and services. Tivoli Business Systems Manager events are preclassified by the resource class, alert state, priority, and event type. Most of the defaults can be customized via a GUI, and new resource classes and events can be added. For details about Tivoli Business Systems Manager events and their classification, refer to IBM Tivoli Business Systems Manager Administrator's Guide, SC32-9085.
Tivoli Business Systems Manager provides management facilities, but a customer's preparedness plays a significant role in achieving effective fault management. Some of the preparation activities are:
- Identify which events can cause outages; tune Tivoli Business Systems Manager red defaults
- Identify which events can cause degradation; tune Tivoli Business Systems Manager yellow defaults
- Consider business impact when constructing business systems
- Customize alert propagation rules to maximize alert management
- Find the best use of available views to match operational processes
Customers need to classify faults. Tivoli Business Systems Manager red alerts, particularly of critical or high priority, can be classified as faults. Tivoli Business Systems Manager yellow alerts, and perhaps some red alerts of medium and low priorities, can be classified as warnings. Before rolling out Tivoli Business Systems Manager for production, do some preparation. Continuous adjustments and operational training help to improve the effectiveness of fault management and reduce the impact on service levels.
Important: A potential outage needs to be fixed as soon as possible to keep SLA attainment. Faults may arrive at a rapid rate and operators must respond to problems based on business impact. Prioritizing faults can greatly improve operators' productivity and reduce problem investigation time. Effective use of event, impact, and topology views to evaluate events and their impact is essential to efficient fault management.
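The fault and warning classification suggested above can be written down as a small decision function, which is a convenient way to agree on the policy before tuning the product defaults. This is a sketch of one possible policy, not Tivoli Business Systems Manager logic: the function name, the "informational" fallback, and the choice to treat all medium and low priority red alerts as warnings are assumptions.

```python
def classify_alert(state, priority):
    """Classify an alert as 'fault', 'warning', or 'informational'.

    state:    alert state color, such as 'red' or 'yellow'
    priority: 'critical', 'high', 'medium', or 'low'
    """
    state = state.lower()
    priority = priority.lower()
    if state == "red" and priority in {"critical", "high"}:
        return "fault"
    if state == "yellow" or (state == "red" and priority in {"medium", "low"}):
        return "warning"
    # Assumption: anything else is informational only.
    return "informational"

alerts = [("red", "critical"), ("red", "low"), ("yellow", "high"), ("green", "low")]
classes = [classify_alert(state, priority) for state, priority in alerts]
```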
provides management reports about the actual service levels, SLA violation statistics, and trends toward SLA violations. IBM Tivoli Service Level Advisor depends on the collected performance and availability data from a variety of monitoring and performance tools. This data is stored in the SLM measurement data mart, but all analysis and evaluation results are stored in the SLM database. You can retrieve the analysis data and summarize it into reports that you can view using a Web browser.
The SLM report console provides a colorful high level summary report that is displayed in table form, showing totals of trends and violations across the reporting period, grouped by realms and customers. Clicking the table cells invokes accompanying color charts and additional tables of summary information about trends and violations, key operations information, and specific details about particular customers and SLAs. For more details, refer to IBM Tivoli Service Level Advisor SLM Reports, SC32-1248.
IBM Tivoli Service Level Advisor analyzes data that is obtained from Tivoli Data Warehouse according to a predefined schedule. This data is evaluated for violations and trends toward future violations of the agreed upon levels of service. Notifications of violations and trends are sent automatically by way of e-mail, SNMP traps, or IBM Tivoli Enterprise Console events.
IBM Tivoli Service Level Advisor evaluates the aggregate data collected from Tivoli Data Warehouse against predefined breach values (for each metric and schedule state period) to determine if service levels are being maintained. If the breach value is violated, IBM Tivoli Service Level Advisor generates the violation event. For example, the breach value defined for total is compared to the sum of all hourly values reported over the entire evaluation period. Similarly, the breach value for maximum or minimum is compared to the highest or lowest single hourly value, respectively.
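The comparison of hourly values against a breach value can be sketched as follows. This is an illustration, not the product's evaluation code: the function and parameter names are hypothetical, and the `breach_when` direction flag is an assumption added because some metrics (response time) are violated by high values and others (availability) by low values.

```python
# Aggregation applied to the hourly values before the comparison:
# total -> sum, maximum -> highest hourly value, minimum -> lowest hourly value.
AGGREGATORS = {"total": sum, "maximum": max, "minimum": min}

def is_violation(hourly_values, aggregation, breach_value, breach_when="above"):
    """Return True if the aggregated value violates the breach value.

    breach_when: 'above' if exceeding the breach value is a violation,
                 'below' if falling under it is a violation (assumption).
    """
    observed = AGGREGATORS[aggregation](hourly_values)
    if breach_when == "above":
        return observed > breach_value
    return observed < breach_value

response_times = [1.2, 0.8, 3.5]          # seconds, one value per hour
availability = [99.9, 98.0, 100.0]        # percent, one value per hour
outages = [0, 1, 0]                       # count, one value per hour

slow = is_violation(response_times, "maximum", 3.0)
unavailable = is_violation(availability, "minimum", 99.0, breach_when="below")
too_many_outages = is_violation(outages, "total", 2)
```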
IBM Tivoli Service Level Advisor uses a linear algorithm and an exponential stress detection algorithm to analyze existing measurement data and to predict trends toward violations. Both algorithms are active and evaluate the same data for trends according to their methods of evaluation. Due to the iterative estimations and calculations used by the exponential stress detection algorithm, no graphical trend line associated with this algorithm is displayed with graph data. Trend lines that are displayed with graphs are associated with the linear algorithm only. If the predicted value approaches the breach value and if the value is predicted to exceed the breach value by either the linear or the exponential stress detection algorithm, then a trend detection event is reported. If there is an outstanding trend detection event, and the current evaluation value is significantly away from the breach value, a trend cancel event is reported. However, if a violation occurs after the trend detection event, a trend cancel event is never reported.
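The general idea of a linear trend toward a breach value can be sketched with an ordinary least-squares line. The source does not describe the product's actual linear algorithm, so this is an assumption-laden illustration: the function names, the use of plain least squares over equally spaced evaluations, and the `periods_ahead` horizon are all hypothetical. The exponential stress detection algorithm is not sketched here.

```python
def linear_trend(values):
    """Fit y = intercept + slope * x by least squares, with x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predicts_breach(values, breach_value, periods_ahead):
    """Return True if the fitted line exceeds the breach value within the horizon."""
    slope, intercept = linear_trend(values)
    predicted = intercept + slope * (len(values) - 1 + periods_ahead)
    return predicted > breach_value

history = [1.0, 2.0, 3.0, 4.0]            # for example, response time per evaluation
slope, intercept = linear_trend(history)
trend_detected = predicts_breach(history, breach_value=6.0, periods_ahead=3)
no_trend = predicts_breach(history, breach_value=10.0, periods_ahead=3)
```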
IBM Tivoli Business Systems Manager V3.1 introduced the Executive View console, which provides a dashboard approach to presenting a service status to executives. Optionally, a service can show status information for IBM Tivoli Service Level Advisor as the Secondary Impact Information (SII) indicator. SII indicators do not follow the normal Tivoli Business Systems Manager status propagation rules. The status of an SLA SII alert is shown by a symbol rather than by a color. IBM Tivoli Service Level Advisor can send SLA trend and violation events to IBM Tivoli Enterprise Console where they are trapped by an IBM Tivoli Enterprise Console rule and forwarded to Tivoli Business Systems Manager via the event enablement and the agent listener. SLA alerts are posted to the corresponding service object and can be viewed in the Executive View console as secondary impact indicators. In addition, SLA alerts can be forwarded automatically to people on the notification list via IBM Tivoli Enterprise Console e-mail and paging facilities.
Important: The actual evaluation takes place automatically when the IBM Tivoli Service Level Advisor ETL completes its operation of moving the most recent measurement data from the data warehouse into the SLM measurement data mart. However, IBM Tivoli Service Level Advisor also enables additional advanced settings for intermediate evaluations, frequency of trend analysis, and logging messages for missing data.
problem ticket processing. Then it transfers control to the user-written program for integration with the user's problem management application.
Change request processor: This implements interfaces for entering data and generating requests to create, query, search, find, retrieve, and update change requests. The Tivoli Business Systems Manager change integration function displays the menu options for Tivoli Business Systems Manager change request processing. Then it transfers control to the user-written program for integration with the user's change management application.
Automatic ticket request processor: This is any request processor written by users that can process command line input parameters, read a text-based input file containing the data passed from the Tivoli Business Systems Manager automatic ticket integration function, and create a text-based output file to contain the problem ID returned from the problem management application.
The automatic ticket integration function differs from the problem and change integration functions within the Tivoli Business Systems Manager product. It does not have a console interface. Its sole function is to create problem tickets and optionally generate automatic notifications by pager or e-mail. The automatic ticket integration function interacts with a user's request processor when message or exception events are sent to Tivoli Business Systems Manager. All events are processed by the automatic ticket integration function based on predefined automatic ticket event rules that provide criteria for passing the matched events to the request processor.
When the Tivoli Business Systems Manager console is set up to work with problem and change management systems, the user can perform the following tasks:
- Create, find, update, and close problem tickets. Two types of create are supported: from the context menu of a resource and from an ownership note.
- Create, find, update, and close change requests.
Important: Tivoli Business Systems Manager provides integration functions and request processors for problem, change, and automatic ticketing. Users must develop their own customized programs that interface with their change and problem management systems. Most problem and change management applications provide some type of API. After a Tivoli Business Systems Manager request is processed, interface programs must return control to the Tivoli Business Systems Manager exit point and provide notification of results.
Chapter 4.
[Figure: SLM stages. Planning: established decision to implement SLM. Implementation: develop service level objectives (describe services, determine service level indicators, determine metrics to be used). Improvement process: improving quality of service levels, improving efficiency of SLM, improving effectiveness of SLM.]
4.1.1 Planning
During the planning stage, you should become familiar with the capabilities and features of the IBM Tivoli products that are available to you. You must also become familiar with any new products and revise perceptions of existing and installed products. What may now be an under-used event monitor may well become a key tool in SLM. This idea is explored further in Understanding the services on page 111 and Implementing additional monitoring on page 113.
4.1.2 Implementation
The implementation phase is when you install new Tivoli products and review existing Tivoli and other systems management products for SLM.
External metrics are defined in the SLA contract. They are visible to the customer. An example of an external metric is Overall Response Time of Service. Internal metrics are accessory metrics from system monitors that can be used by the service provider in a proactive manner to ensure that the contract is being met. Internal metrics are not shown to the customer and are not part of the SLA contract. An example of an internal metric is Response time of DB2 Databases used by the Application.
These products measure the internal performance of systems and applications. The functionality includes continuous monitoring and recording of information, raising alerts when thresholds are exceeded, and gauging user experience by making response time measurements. These products can monitor hardware, databases, and applications.
Implement IBM Tivoli Monitoring for Transaction Performance to provide user-experience monitoring. User experience monitoring is key to providing an end-to-end view of a service. Implementing and exploiting IBM Tivoli Monitoring for Transaction Performance is explained in 4.5.1, IBM Tivoli Monitoring for Transaction Performance on page 190, and in Part 2, Case study scenarios on page 195.
Many IBM Tivoli Service Level Advisor capabilities can be used for this.
Trends toward violations: IBM Tivoli Service Level Advisor calculates trending toward violations for any metric selected to be part of an SLA. It analyzes the data for the metric and sends a trend event when the algorithm detects that the data shows a linear or exponential stress trend that may lead to a violation within a predetermined interval. See Chapter 5, Case study scenario: IRBTrade Company on page 197, for an example.
Intermediate evaluations: These evaluations are done more frequently than the evaluation for the reporting period. A common situation is a monthly evaluation with a daily intermediate evaluation. With this, the IT organization can check every day on the status of the various services it is providing and take action while it is still possible to affect the SLA result at the end of the month. For details about this function, refer to Part 2, Case study scenarios on page 195.
Adjudication: Some violations happen under conditions that, according to the SLA contract, can be adjudicated. An example is when the number of users of a certain application exceeds what was agreed in the contract, so the violation for the month can be adjudicated. Refer to Adjudication on page 170 for details.
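The value of an intermediate evaluation can be sketched as a month-to-date check that runs daily instead of only at the end of the reporting period. The following example is a simplified illustration, not the evaluation logic of IBM Tivoli Service Level Advisor. It assumes a single "higher is worse" metric evaluated as an average, and the function name, the dictionary keys, and the sample values are hypothetical.

```python
from statistics import mean
from typing import Sequence


def intermediate_evaluation(month_to_date: Sequence[float], breach: float) -> dict:
    """Evaluate the data collected so far in the reporting period.

    Returns the status, the month-to-date average, and the remaining headroom
    (breach value minus average). Assumes higher values are worse.
    """
    if not month_to_date:
        return {"status": "no data", "average": None, "headroom": None}
    average = mean(month_to_date)
    return {
        "status": "violation" if average > breach else "ok",
        "average": average,
        "headroom": breach - average,
    }


# Hypothetical daily average response times (seconds) for the first three days.
result = intermediate_evaluation([1.0, 2.0, 3.0], breach=2.5)
```

In this sketch, a shrinking headroom from one daily run to the next is the signal that action is needed before the monthly evaluation.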
Decommissioning resources is not reflected in IBM Tivoli Business Systems Manager. A decommissioned object remains in the business system and no longer receives events. These decommissioned objects from business views have no effect on continued IBM Tivoli Business Systems Manager function. They can be cleaned up as a maintenance function to avoid having too many decommissioned objects. You can use Automatic Business Systems (ABS) and Extensible Markup Language (XML) Business System building to ensure that changes to the service are reflected in IBM Tivoli Business Systems Manager. Failure to reflect service changes in IBM Tivoli Business Systems Manager reduces the effectiveness of SLM. Continued failure compromises SLM and renders the monitoring and metrics useless.
An event that affects a core business process causes the business system to be overlaid with a red or yellow icon (see following section) indicating the impact on the business process of the event. A similar event that affects a non-critical component does not light up the business system. Because IBM Tivoli Business Systems Manager graphically shows the event in the correct context, you can judge the impact and direct resolution efforts accordingly.
never cleared but are overlaid with other messages of the same or greater priority. For example, a high red message is overlaid with a high green message, sending the affected object to a green alert state. Exceptions are more flexible. Any number of exceptions can apply to a single object. Most events from system management tools are posted as exceptions by IBM Tivoli Business Systems Manager. Exceptions are not overlaid by other exceptions unless the exception has an identical exception ID. In that case, the exception count increments. Outstanding exceptions can be cleared automatically when the problem is resolved by sending the same exception with the exception text of OK. For details about message and event handling, see IBM Tivoli Business Systems Manager Administrator's Guide, SC32-9085.
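The difference between messages and exceptions can be modeled in a few lines. The following sketch only mirrors the behavior described in the text: a message is overlaid by a message of the same or greater priority, exceptions are counted per exception ID, and an exception with the text OK clears the outstanding one. The class, the method names, and the priority names are hypothetical and do not correspond to any IBM Tivoli Business Systems Manager interface.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Hypothetical priority names and ordering, used only for this illustration.
PRIORITY_RANK = {"low": 1, "medium": 2, "high": 3}


@dataclass
class ResourceState:
    message: Optional[Tuple[str, str]] = None                  # (priority, color)
    exceptions: Dict[str, int] = field(default_factory=dict)   # exception ID -> count

    def post_message(self, priority: str, color: str) -> None:
        """Overlay the current message only if the new priority is the same or greater."""
        current = self.message
        if current is None or PRIORITY_RANK[priority] >= PRIORITY_RANK[current[0]]:
            self.message = (priority, color)

    def post_exception(self, exception_id: str, text: str) -> None:
        """Count repeated exceptions by ID; the text OK clears the exception."""
        if text == "OK":
            self.exceptions.pop(exception_id, None)
        else:
            self.exceptions[exception_id] = self.exceptions.get(exception_id, 0) + 1


state = ResourceState()
state.post_message("high", "red")
state.post_message("high", "green")                  # same priority: overlays the red message
state.post_exception("DISK01", "File system full")
state.post_exception("DISK01", "File system full")   # identical ID: the count increments
state.post_exception("DISK01", "OK")                 # clears the outstanding exception
```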
ABS-created business systems are dynamically built and populated with all qualifying existing objects as defined in the ABS rules. Keeping these business systems up to date requires little maintenance, because newly discovered and created objects are automatically placed in business systems by ABS. For instructions on using ABS, see IBM Tivoli Business Systems Manager Administrator's Guide, SC32-9085.
XML
XML-built business systems are a new component introduced in IBM Tivoli Business Systems Manager V3.1. This feature allows business systems to be built and updated using XML and to be extracted and backed up as XML files. The XML method was not used for this IBM Redbook. You can learn more about this method in IBM Tivoli Business Systems Manager Administrator's Guide, SC32-9085.
BSSs are copies of a parent business system. The objects in a BSS are the same objects as in the parent business system. They are not duplicates. Most of the properties of the parent business system are inherited by the BSS, but you can change these properties in the BSS. If you change the parent's properties, then the change is reflected in the child BSSs. You can unlink the properties of a child BSS and change them to suit the requirements placed upon the BSS. If required, you can relink the child's properties to the parent so that the child has the parent's properties once again. Some properties are not inherited by the child BSS. A business system that is defined as an Executive View Service does not automatically pass on this property to a child BSS. We used BSSs to allow different propagation rules to apply to the same business system so that different roles can get different information from the same business system structure. Chapter 6, Case study scenario: Greebas Bank on page 315, offers more information about exploiting BSS.
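The link, unlink, and relink behavior of BSS properties can be pictured as a child that reads through to its parent unless a property was unlinked. The following sketch is a conceptual model only. The class names, the property keys, and the NOT_INHERITED set are hypothetical and are not part of the product.

```python
from typing import Any, Dict

# Hypothetical key for a property that is never inherited by a child.
NOT_INHERITED = {"executive_view_service"}


class BusinessSystem:
    def __init__(self, name: str, properties: Dict[str, Any]) -> None:
        self.name = name
        self.properties = dict(properties)


class Shortcut:
    """A child BSS: linked properties are read from the parent on every access."""

    def __init__(self, parent: BusinessSystem) -> None:
        self.parent = parent
        self._local: Dict[str, Any] = {}

    def get(self, key: str) -> Any:
        if key in self._local:
            return self._local[key]          # unlinked: the child's own value
        if key in NOT_INHERITED:
            return None                      # never passed on automatically
        return self.parent.properties[key]   # linked: the parent's current value

    def unlink(self, key: str, value: Any) -> None:
        self._local[key] = value

    def relink(self, key: str) -> None:
        self._local.pop(key, None)


parent = BusinessSystem("Payments", {"propagation": "default", "executive_view_service": True})
child = Shortcut(parent)
child.unlink("propagation", "operators-only")   # different rules for a different role
child.relink("propagation")                     # back to the parent's value
```

Reading through to the parent on each access, instead of copying values when the child is created, is what makes a later change to the parent visible in every linked child.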
Figure 4-3 shows a schematic diagram of a business process business system. It shows the business process broken down into functions and the functions broken down into applications. The applications are made up of aggregations of technologies, such as servers and databases. Underneath the aggregation layer is the technology layer that represents the actual hardware and software. The monitors layer shows the feeds that go into IBM Tivoli Business Systems Manager. It does not represent components of the IBM Tivoli Business Systems Manager business system.
One of the most challenging parts of IBM Tivoli Business Systems Manager implementation is correctly identifying the components that make up the business process. Processes for gathering the necessary business process information are discussed in Chapter 2, General approach for implementing service level management on page 23, and in Data gathering and business system decomposition on page 134.
This type of business system can be built by using ABS. However, the objects within scope must conform to naming standards so that they can be correctly placed by ABS. You can also use XML to build the business system. This method is especially effective if you can obtain an XML extract of the components from a federation of monitoring databases or some other repository that contains details about the business process. Figure 4-4 shows an example of a business process-based business system. For clarity, this view is only partially expanded.
Tree view
The IBM Tivoli Business Systems Manager tree view is the base view of IBM Tivoli Business Systems Manager. The Business Systems view and All Resources view are in tree format and all business systems open as a tree view by default. The tree view is useful for the administrator to manipulate logic within the business system structure. The tree view is less useful for operational management of the components in the business system. Refer to Figure 4-4 to see the partially-expanded tree view of a business system.
Event Viewer
For users to quickly use and understand IBM Tivoli Business Systems Manager, the tree view can be enhanced with the IBM Tivoli Business Systems Manager Event Viewer. Figure 4-5 shows the IBM Tivoli Business Systems Manager Event Viewer for CICS events.
Figure 4-5 Using the IBM Tivoli Business Systems Manager Event Viewer
The IBM Tivoli Business Systems Manager Event Viewer shows events in a linear way, similar to traditional systems management tools. This enables users to use IBM Tivoli Business Systems Manager quickly, without having to change working practices to adapt to it. Note that, in Figure 4-5, the columns were resized and rearranged to make the view of events more user friendly. From this view, users can take ownership of events, close out unnecessary events, and see who owns existing events.
Hyperview
Hyperview is a dynamic, real-time view of an exploded business system. This view offers a quick overview of a business system. Because the hyperview always centers on the point where the user clicks, it is a volatile view and can accidentally obscure events. Figure 4-6 shows a hyperview for a business system. The default for hyperview is a minimum alert state of green. This means that every object is shown. We recommend that you change this default because the console display becomes too busy.
Figure 4-6 Hyperview set to show the minimum alert state of green
Topology view
The topology view is automatically built from business systems. It can be used to display a business system and its components or simply the high-level icon for the business system. Where the hyperview is volatile, the topology view is static. Both views are real time and display events as they are received. Figure 4-7 shows the same business system as shown earlier, but in the general topology view. An option is available to show all details, as in the hyperview, but the icons shrink as the view expands and the desktop becomes more difficult to use.
Figure 4-7 Topology view of business system: Not all detail enabled
IBM Tivoli Business Systems Manager also provides complex topology views for some mainframe feeds, such as CICS, IMS, and DB2. Technical support teams can use these views. For IBM Tivoli Business Systems Manager V3.1, IMS and DB2 topologies are new and the CICS topology view no longer requires CICSplex to be implemented. See Figure 4-8.
For details about exploiting the IBM Tivoli Business Systems Manager topology view, see IBM Tivoli Business Systems Manager Administrator's Guide, SC32-9085.
Work spaces
The IBM Tivoli Business Systems Manager console can consist of several windows that contain any or all of the previously mentioned views. The IBM Tivoli Business Systems Manager administrator typically creates a set of views that are suitable for a role such as an operator or a database specialist. The administrator then saves the set of views in a work space. A work space can be assigned to specific operator and restricted operator IDs so that only these users can see the views. The administrator can also set work spaces to open on console startup. Most IBM Tivoli Business Systems Manager window examples in this document show work spaces. Figure 4-9 shows an example work space set up for three business systems, with an Event Viewer in another window that gives an overview of all three business systems.
Figure 4-9 Sample work space using three topology views and Event Viewer
Web Console
For IBM Tivoli Business Systems Manager 3.1, the Web Console was redesigned and introduces improved authentication using IBM WebSphere. It is a functional Web console based on Java that can be used by defined users to manage business systems and events. Some Java console functions, such as hyperview and the topology view, are not replicated in the Web Console. However, business system management is still easily achieved without these features. The Web Console introduces the Critical Watch List (CWL). This is an administrator-defined list of business systems and individual resources that is kept on the user's Web Console. From the CWL, a user can see events that are posted to a business system and can drill down, assess the business impact, and take ownership of the event. Actions taken on the Web Console are reflected in all other console types so that, for example, an event owned by a Web Console user shows as being owned in the Java console and the executive dashboard. Figure 4-10 shows a sample Web Console showing a CWL for a user with the operator role.
Executive dashboard
The executive dashboard is a new concept for IBM Tivoli Business Systems Manager 3.1. The executive dashboard is designed to inform senior managers of overall service status without providing technical detail that is not necessary to that level of user. An executive dashboard user can be notified of service status and SLA status but is not notified of problems and incidents that are not impacting the business process. The user can see that a business process is impacted and that the causing incident is being owned and managed. The user can also see when an SLA is trending toward violation and when an SLA is violated. The executive dashboard enables senior management to be aware of business process status without forcing unnecessary training and information onto them.
The executive dashboard is a non-intrusive console that can run minimized on a desktop. It is Web-based, is accessible via a Uniform Resource Locator (URL), and does not require any code installation on the desktop. There are two levels of executive dashboard user: executive and IT executive. The executive-level user is shown only the highest level of alerts and sees only non-technical messages. The IT executive level is intended for more technically aware managers. Therefore, IBM Tivoli Business Systems Manager provides more technical detail to supplement the high-level alerting given to the executive-level user. Figure 4-11 shows an executive dashboard that is seen by both executive and IT executive users.
Figure 4-12 shows the different information made available to each user. The dashboard on the left is for the executive user and shows service status. The dashboard on the right is for the IT executive and shows details about the affected resource.
Figure 4-12 Comparison of drill-down information available to each role (Executive User on the left, IT Executive User on the right)
However, the administrator role is responsible for developing the business systems and views used by other roles to aid SLM. Super administrators can create and administer CWLs for the Web Console and their equivalent in the Java console, Critical Resource Lists (CRLs). CRLs are not widely used but are detailed in IBM Tivoli Business Systems Manager Administrator's Guide, SC32-9085. The administrator cannot perform this task. Other than this, the two roles are identical. The IBM Tivoli Business Systems Manager administrator should work closely with the IBM Tivoli Service Level Advisor administrator so that the definition of IBM Tivoli Business Systems Manager services as IBM Tivoli Service Level Advisor services can be properly coordinated. See Marking an IBM Tivoli Business Systems Manager business system as a service on page 187 for more details.
Operator
The operator is responsible for monitoring the whole or parts of the enterprise. This person needs to see all severities of events that affect components of the enterprise. It is good practice to send only events for service level managed resources to operators. Sending events from non-SLM resources can be distracting to operations and divert attention from SLM resources. If a system has an SLA, send events to operations so that the system and the SLA can be managed. If a system has no SLA, then operations should not spend effort on resolving events for it.
Restricted operator
The restricted operator is the same as the operator with additional restrictions. That is, the restricted operator cannot view all business systems or add resources to their own CRLs.
The executive user also receives service status from IBM Tivoli Business Systems Manager and IBM Tivoli Service Level Advisor. However, this user does not receive details about events. Note: These user IDs do not have access to the other IBM Tivoli Business Systems Manager consoles. See Executive dashboard on page 130 for details and examples about the executive dashboards.
damage the credibility of both IBM Tivoli Business Systems Manager and the BSM approach. However, using IBM Tivoli Business Systems Manager with the awareness that not all the business process is covered still gives great value for the parts of the business process that are covered by IBM Tivoli Business Systems Manager. Monitoring gaps can be overcome by using customer-experience software, such as IBM Tivoli Monitoring for Transaction Performance, to report on the end-to-end performance of the business process. It is important that the remaining components of the business system are discovered and defined to IBM Tivoli Business Systems Manager as soon as possible. See Implementing additional monitoring on page 113 for an overview of the methods to fill in the gaps.
Enhancing monitoring
Business process decomposition frequently shows monitoring gaps. These occur when some components of the business process are not under the control of a systems management tool or organization. This is a common occurrence that is difficult to overcome quickly. It may be possible to plug gaps with existing systems management tools and then integrate them into IBM Tivoli Business Systems Manager. However, gaps often remain in the end-to-end monitoring of the business process. It can be argued that an early benefit of IBM Tivoli Business Systems Manager is that it drives the customer to discover gaps in their monitoring. Regardless of the BSM tool that is used, gaps in the monitoring of a business process are undesirable and should be closed as soon as possible. For large monitoring gaps, consider delaying the IBM Tivoli Business Systems Manager implementation while the gaps are filled. There are situations where a large part of the business process is not monitored because it is outside of the remit of the customer. A common example of this is when the network is outsourced. It is not desirable to bring network monitoring back in house for IBM Tivoli Business Systems Manager, because then both the network provider and the IBM Tivoli Business Systems Manager users monitor the network. If you prefer to have end-to-end monitoring and want to include the network, we recommend that you use IBM Tivoli Monitoring for Transaction Performance V5.3 to replay transactions and measure the network latency. Any severe network latency in the sample transactions can be reported to IBM Tivoli Business Systems Manager. For details about IBM Tivoli Monitoring for Transaction Performance network latency measurements, see IBM Tivoli Monitoring for Transaction Performance V5.3 Administrator's Guide, GC32-9189.
4.2.8 Using IBM Tivoli Business Systems Manager 3.1 features for the benefit of SLM
Of the many new features in IBM Tivoli Business Systems Manager V3.1, two of the most useful ones for effective SLM are resource level propagation (RLP) and percentage-based thresholding (PBT).
Percentage-based thresholding
With the PBT method, a group of immediate, weighted, child resources are monitored by rules. When a percentage of these resources have an alert state (such as red), a preconfigured event is sent to the parent object where the PBT rules are set. PBT rules are triggered when the following formula is satisfied:
%age_Min =< ((Alert_Weight / All_Weight) x 100 ) =< %age_Max
In this formula, note the following explanation:
%age_Min: The lower limit of the PBT rule percentage range
Alert_Weight: The total weight of resources in the desired alert state (for example, red)
All_Weight: The weight of all resources in the scope of the PBT rule
%age_Max: The upper limit of the PBT rule percentage range
A simple illustration is where four objects are covered by a rule. Each object has a weight of 25, and the rule has to fire when three of the objects are red. Three red objects is 75%, so the rule must fire when 75% of the total weight is red. We set the range from 51% to 76% so that the rule doesn't fire when two or four objects are red. This gives us the following values:
%age_Min = 51 (more than two reds)
Alert_Weight = 75 (three reds)
All_Weight = 100 (all four resources)
%age_Max = 76 (less than four reds)
The formula is:
51 =< ((75 / 100) x 100) =< 76 TRUE
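The PBT condition is simple enough to check by hand, but a short sketch makes it easy to try other weights and ranges before configuring a rule. The function below restates the formula from this section. The function and parameter names are our own and are not product identifiers.

```python
def pbt_rule_fires(alert_weight: float, all_weight: float,
                   pct_min: float, pct_max: float) -> bool:
    """Return True when pct_min <= (alert_weight / all_weight) * 100 <= pct_max."""
    if all_weight <= 0:
        raise ValueError("all_weight must be positive")
    percentage = alert_weight / all_weight * 100
    return pct_min <= percentage <= pct_max


# Worked example from the text: four objects of weight 25, range 51% to 76%.
OBJECT_WEIGHT = 25
ALL_WEIGHT = 4 * OBJECT_WEIGHT
fires_by_red_count = {
    reds: pbt_rule_fires(reds * OBJECT_WEIGHT, ALL_WEIGHT, 51, 76)
    for reds in range(5)
}
```

With these values, the rule fires only for three red objects (75%). Two reds give 50% and four reds give 100%, and both fall outside the 51% to 76% range.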
For a practical run through PBT, see 4.2.9, Using PBT and RLP to manage high availability scenarios on page 139, and Chapter 6, Case study scenario: Greebas Bank on page 315. Before you can use PBT, the IBM Tivoli Business Systems Manager administrator must enable it. You do this using the Administrator Preferences option (see Figure 4-14). After PBT is enabled, you see the Propagation tab in an object's properties window.
servers running at full capacity. The extra servers are provided for redundancy and service resiliency and to spread the workload across all the servers. Because of the overcapacity of the servers, up to two servers can be impacted by red events before the service is likely to be degraded. If three servers are impacted, there is a risk of service degradation because all the work is likely to be performed by one server. If all four servers are impacted, the service is severely impacted and possibly down. In this scenario, we use RLP and PBT to ensure the following criteria:
- Any red or yellow objects: Show alerts on affected objects.
- Up to two red or four yellow objects: Don't propagate to the PBT Demo business system.
- Three red objects: Propagate a yellow alert to PBT Demo.
- Four red objects: Propagate a red alert to PBT Demo.
- Remove PBT alerts when only two red alerts remain on objects.
This scenario demonstrates two desired event behaviors that are now possible with IBM Tivoli Business Systems Manager V3.1:
- Managing redundant groups
- Sending a yellow event from receiving red events
Usually, you must set the RLP at the level directly above the objects that are to be manipulated by RLP. If we set RLP at the PBT Demo business system, then the only child events that can propagate to this business system would have a priority of Critical.
Creating PBT rules for four red objects and three red objects
You must set the PBT threshold rules at one level above the objects that are affected by the PBT rules because the scope of the PBT rules is the objects in the next level down the tree. In this case, you set the rules against the redundancy business system.
You start with the easiest rule to define, which is to send a red event when all four objects are red. Each object represents 25% of the total, so the percentage criterion to satisfy this rule is to have between 76% and 100% of the in-scope weight red. The rule only fires when all four objects are red. See Figure 4-17. It is equally correct for this rule to specify 100% as both the minimum and maximum percentage. However, for more complex PBT rules, it helps to define the ranges so that, together, the rules cover all percentages. As the math becomes more complex, the need to ensure that all percentages are covered by rules increases.
This rule sends a critical red event when its criteria are satisfied. The event is posted against the redundancy business system object. Because this event is posted against the actual object, it is not a child event and so is not affected by the RLP settings made previously. The RLP settings only affect child events. The posted event is also propagated to the PBT Demo business system as desired. The second rule covers the situation of three red child objects. The percentage range of this rule is between 51% and 75%, so it fires only when three of the four objects have a red event against them. See Figure 4-18. Three red events cause a yellow event to be posted to the redundancy business system object and up to PBT Demo as desired.
The ability to send a yellow event on receipt of red child events adds a lot of flexibility to IBM Tivoli Business Systems Manager. It also enables a lower severity event to be sent when the service is, for example, degraded but still available and working.
Although some of the objects may have an outstanding red status, the green status is posted to the top-level business system because enough components are available and the business process is no longer impacted. Figure 4-20 shows the completed Propagation properties for the redundancy business system. All of the child objects have an equal weight of 100, so they are included in the PBT calculations. The three rules described earlier are set and now the business system is ready to manage this high availability scenario.
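The combined effect of the three rules can be sketched as a lookup from the percentage of red weight to the event that is posted. Rule 1 (76% to 100%, red) and rule 2 (51% to 75%, yellow) use the ranges given in this section. The range for rule 3 (0% to 50%, green) is an assumption on our part, chosen to match the stated goal of removing PBT alerts when no more than two red objects remain. The names in the code are hypothetical.

```python
from typing import List, Optional, Tuple

CHILD_WEIGHT = 100   # every child object has an equal weight of 100
CHILD_COUNT = 4

# (rule name, minimum %, maximum %, event color posted to the redundancy business system)
PBT_RULES: List[Tuple[str, float, float, str]] = [
    ("rule 1", 76, 100, "red"),
    ("rule 2", 51, 75, "yellow"),
    ("rule 3", 0, 50, "green"),   # assumed range, see the lead-in
]


def posted_event(red_children: int) -> Optional[Tuple[str, str]]:
    """Return (rule name, event color) for the rule whose range contains the red percentage."""
    if not 0 <= red_children <= CHILD_COUNT:
        raise ValueError("red_children is out of range")
    all_weight = CHILD_COUNT * CHILD_WEIGHT
    percentage = red_children * CHILD_WEIGHT / all_weight * 100
    for name, pct_min, pct_max, color in PBT_RULES:
        if pct_min <= percentage <= pct_max:
            return name, color
    return None   # a gap between ranges: no rule covers this percentage


outcome = {reds: posted_event(reds) for reds in range(CHILD_COUNT + 1)}
```

Returning None for an uncovered percentage makes gaps between rule ranges visible, which matters more as the weights stop being equal.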
The rules dictate that two reds do not cause propagation to the top-level business system. They also prevent propagation of any number of yellow events to the top-level business system. Without the rules, the red and yellow events would propagate to the PBT Demo business system. Figure 4-21 shows that the rules are holding. In this case, the RLP rules and the third PBT rule are in use.
A third red event is sent to the objects in the business system. This causes PBT rule 2 to fire. This rule is set to trigger when there are three red objects in the business system and to propagate a yellow event up to the high-level business system. Figure 4-22 shows how this happens.
Figure 4-22 Three reds: PBT rule 2 fired, yellow event sent
A fourth red event is sent, so PBT rule 1 is triggered and sends a red event to the PBT Demo business system. This is shown in Figure 4-23.
Figure 4-23 Four reds: PBT rule 1 fired, red event sent
When two of the events are owned, PBT rule 3 is triggered because, in this case, the alerts have been cleared from the objects. This sets them to a green status, so PBT rule 3 is eligible to fire. Figure 4-24 shows this.
Compare Figure 4-24 and Figure 4-25 where the alerts are not cleared from the owned events, so the objects stay red and PBT rule 1 is still in effect. Attention: The option to clear alerts from resources when taking ownership can be set globally by the IBM Tivoli Business Systems Manager Administrator using Administrator Preferences. By default, the alert is left posted against the resource. The user can override this in the Take Ownership window. The administrator can change the default to clear all alerts and can remove the override option from the Take Ownership window.
source and target databases. Such an environment enables the monitoring applications to run independently of each other. Data is moved from the source database to the Tivoli Data Warehouse database using extract, transform, and load (ETL) steps. Because the monitoring applications used in this solution provide warehouse enablement packs (WEPs), we deploy them for collecting monitoring and measurement data into the Tivoli Data Warehouse environment. Each application has a unique code identifying the application data in Tivoli Data Warehouse. The main task is to schedule the execution of these WEPs. The data must be stored, aggregated, and correlated from the source application databases into the data warehouse data mart databases. Therefore, it is essential for each WEP to complete its run before the next cycle. The size of the databases in Tivoli Data Warehouse depends on the size of the IT enterprise. IBM Tivoli Service Level Advisor mines data from Tivoli Data Warehouse. Therefore, you must schedule the WEPs so that the IBM Tivoli Service Level Advisor ETL runs after the completion of all the ETLs for the monitoring applications that provide data to IBM Tivoli Service Level Advisor, including IBM Tivoli Business Systems Manager. If an organization has monitoring applications, you must install the WEPs of these applications on the control center of the Tivoli Data Warehouse. Refer to the documentation provided to install these WEPs. The planning gives an estimated time to run each of these WEPs. Table 4-1 provides timing estimates.
Table 4-1 Monitoring applications with estimated runtime

Monitoring application                                             Estimated daily run time
IBM Tivoli Monitoring for Web Infrastructure V5.1.2: WebSphere      15 minutes
IBM Tivoli Monitoring for Web Infrastructure V5.1.2: Apache Server  15 minutes
IBM Tivoli Monitoring for Databases V5.1.0: DB2                     35 minutes
IBM Tivoli Monitoring for Transaction Performance V5.3              20 minutes
IBM Tivoli Monitoring for Web Infrastructure V5.1.2: OS Pack        40 minutes
Peregrine Service Center                                            10 minutes
Schedule the WEP of each application according to the estimated times. Set the WEP to run in test mode to confirm the estimated times. When you know the times, schedule the WEP accordingly and then move its steps into production mode. Similarly, plan and test the runtime for the WEP of IBM Tivoli Business Systems Manager.
Using the IBM Tivoli Service Level Advisor ETL to extract Tivoli product data from Tivoli Data Warehouse
As we explain in Chapter 3, "IBM Tivoli products that assist in service level management" on page 53, IBM Tivoli Service Level Advisor uses a set of ETL steps to extract data from the CDW database into the SLM databases. The ETL steps in IBM Tivoli Service Level Advisor are grouped into four processes. Figure 4-26 displays the four ETL processes for IBM Tivoli Service Level Advisor with msrc_cd value DYK. The details for each process are:
DYK_m00_Initiate_Process: This process is not to be scheduled. It is meant to be run only once, after migrating from previous versions of IBM Tivoli Service Level Advisor.
DYK_m05_Populate_Registration_Datamart_Process: This process extracts the resource definition data (component types, measurement types, attributes, and so on) from the CDW to the SLM database.
DYK_m10_Populate_Measurement_Datamart_process: This process extracts the measurement data of the resources from the CDW to the SLM database.
DYK_m15_Purge_Measurement_Datamart_process: This process prunes the aging measurement data periodically.
Figure 4-26 ETL processes for IBM Tivoli Service Level Advisor WEP
The DYK_m05_Populate_Registration_Datamart_Process is referred to as the Registration ETL. The Registration ETL extracts the measurement type data, the component type data, and the corresponding rules from the CDW to the SLM database. It also extracts the components, their attributes, and their relationships into the SLM database. This data helps in defining the service level objectives and SLAs. By default, the Registration ETL does not extract any of the data types available in the CDW until they are enabled. Before you run this step, you must enable specific source applications in IBM Tivoli Service Level Advisor. To determine the available types of data in the CDW, connect to the central data warehouse database (twh_cdw) from a DB2 command window and execute a select command as follows:
db2 connect to twh_cdw user <db2_Inst_Owner_ID> using <db2_Inst_Owner_PW>
db2 select * from twg.msrc
For example, if SLAs must be defined using data from IBM Tivoli Monitoring for Operating Systems, then the value in the MSRC_CD column for that source application must be enabled in IBM Tivoli Service Level Advisor. To do this, from the IBM Tivoli Service Level Advisor server machine, follow these steps:
1. Launch a command window and change the directory to the location of the IBM Tivoli Service Level Advisor installation (C:\TSLA for example).
2. Run the following command for your system:
For Windows
slmenv.bat
For UNIX
. ./slmenv.sh
This lists the applications that were added as shown in Example 4-2.
Example 4-2 List of source applications added by default

Measurement Application Flag: N   Source Code: BWM      Name: Tivoli Web Services Manager
Measurement Application Flag: N   Source Code: APF      Name: Tivoli Application Performance Management
Measurement Application Flag: N   Source Code: DMN      Name: Distributed Monitoring Classic Edition
Measurement Application Flag: N   Source Code: GTM      Name: Tivoli Business System Manager
Measurement Application Flag: N   Source Code: ECO      Name: Tivoli Enterprise Console
Measurement Application Flag: N   Source Code: MODEL1   Name: Tivoli Common Data Model v1
Measurement Application Flag: N   Source Code: AMW      Name: IBM Tivoli Monitoring
4. If the required source application is not listed, then enable the data sources using the codes as listed in Example 4-1. Add and enable the codes that apply.
scmd etl addApplicationData <msrc_cd> <msrc_nm>
scmd etl enable <msrc_cd>
Here msrc_cd and msrc_nm are listed in Example 4-1. An example of this is:
scmd etl addApplicationData AMY "IBM Tivoli Monitoring for Operating Systems"
scmd etl enable AMY
The process is the same for all the other source applications for which SLAs are to be created. Some applications may use the Tivoli Common Data Model, whose msrc_cd is MODEL1. This is documented in each individual WEP document. Check the TWG.MsmtTyp table. If it shows MODEL1 in the msrc_cd column, then enable MODEL1.
The DYK_m10_Populate_Measurement_Datamart_process is also referred to as the Process ETL. This process extracts the measurement data that is related to the components and measurement types that were extracted in the previous ETL process. This data is then evaluated against the existing SLAs.
The run times in Table 4-1 total two hours and 15 minutes. Assuming that the runtime of the IBM Tivoli Business Systems Manager WEP is 15 minutes, schedule the IBM Tivoli Service Level Advisor WEP for two hours and 30 minutes after the first WEP is scheduled. This ensures that IBM Tivoli Service Level Advisor obtains all the information from the Tivoli Data Warehouse database. It also avoids SLAs not being evaluated, because the evaluation of the data is tied to the completion of the IBM Tivoli Service Level Advisor WEP.
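The offset of two hours and 30 minutes follows from running the WEPs back to back. The following Python sketch reproduces that arithmetic from the Table 4-1 estimates plus the assumed 15 minutes for the IBM Tivoli Business Systems Manager WEP. The start time, the shortened application names, and the function name are illustrative assumptions. The sketch is not part of any Tivoli product.

```python
from datetime import datetime, timedelta

# Estimated daily run times in minutes, taken from Table 4-1.
# The last entry is the 15-minute TBSM runtime assumed in the text.
WEP_RUNTIMES = [
    ("Web Infrastructure: WebSphere", 15),
    ("Web Infrastructure: Apache Server", 15),
    ("Databases: DB2", 35),
    ("Transaction Performance", 20),
    ("Web Infrastructure: OS Pack", 40),
    ("Peregrine Service Center", 10),
    ("Tivoli Business Systems Manager", 15),
]


def plan_schedule(first_start, runtimes):
    """Return (name, start time) pairs for back-to-back runs and the finish time."""
    plan = []
    current = first_start
    for name, minutes in runtimes:
        plan.append((name, current))
        current += timedelta(minutes=minutes)
    return plan, current


# Hypothetical start of the first WEP; any start time gives the same offset.
first_start = datetime(2004, 12, 1, 0, 0)
plan, tsla_start = plan_schedule(first_start, WEP_RUNTIMES)
tsla_offset = tsla_start - first_start

for name, start in plan:
    print(f"{start:%H:%M}  {name}")
print(f"{tsla_start:%H:%M}  IBM Tivoli Service Level Advisor ETL")
```

Replace the estimates with the times measured in test mode before you rely on the resulting offset.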
In IBM Tivoli Service Level Advisor, you can define two types of schedules: auxiliary and business schedules. The periods defined in auxiliary schedules take precedence over the periods defined in a business schedule.
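The precedence rule can be pictured with a small Python sketch. It is only a model of the rule stated above, not the product's implementation. The class, the function names, and the Standard state are hypothetical. No Service is the only state name taken from this chapter.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Period:
    start: datetime
    end: datetime  # exclusive
    state: str


def state_in(periods: List[Period], when: datetime) -> Optional[str]:
    """Return the state of the first period that covers the given time."""
    for period in periods:
        if period.start <= when < period.end:
            return period.state
    return None


def effective_state(auxiliary: List[Period], business: List[Period],
                    when: datetime) -> Optional[str]:
    """Auxiliary periods win over business periods where both apply."""
    aux_state = state_in(auxiliary, when)
    if aux_state is not None:
        return aux_state
    return state_in(business, when)


# Hypothetical 24 x 7 business schedule for December 2004.
business = [Period(datetime(2004, 12, 1), datetime(2005, 1, 1), "Standard")]
# Hypothetical auxiliary schedule with a company holiday.
auxiliary = [Period(datetime(2004, 12, 25), datetime(2004, 12, 26), "No Service")]

holiday_state = effective_state(auxiliary, business, datetime(2004, 12, 25, 10))
normal_state = effective_state(auxiliary, business, datetime(2004, 12, 24, 10))
```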
Auxiliary schedules are used to define the schedule periods that are common to all the business units in the organization. For example, you can include the holidays of the organization, when the service levels of the objectives do not matter. Similarly, auxiliary schedules are used to define a maintenance period. You can include one or more auxiliary schedules in a business schedule, but auxiliary schedules cannot contain an auxiliary or a business schedule.
Enabling hourly evaluation in IBM Tivoli Service Level Advisor
IBM Tivoli Service Level Advisor supports evaluating SLOs every hour, every two, three, four, six, or eight hours, daily, weekly, and monthly. By default, only the daily, weekly, and monthly intervals are available. To enable the hourly intervals, run the following command from a command window that has the IBM Tivoli Service Level Advisor environment set:
scmd mem showHourlyFrequencyIntervals enable
Creating SLOs with an hourly frequency depends on the source monitoring application data collected and extracted into the CDW database within that frequency. If you do not consider these items, you may receive unwanted results.
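As a rough planning aid, the dependency between the evaluation frequency and the data flow can be written as a simple check. The rule encoded here, that data must reach the CDW at least as often as the SLO is evaluated, is our reading of the paragraph above. The function name and its parameters are hypothetical.

```python
# Hourly evaluation intervals listed in this section, in hours.
HOURLY_INTERVALS = (1, 2, 3, 4, 6, 8)


def frequency_is_feasible(evaluation_hours: int, data_interval_hours: float) -> bool:
    """Check that source data arrives at least as often as it is evaluated.

    data_interval_hours is the longest of the collection interval and the
    ETL interval for the source application (an assumption of this sketch).
    """
    if evaluation_hours not in HOURLY_INTERVALS:
        raise ValueError(f"unsupported hourly interval: {evaluation_hours}")
    return data_interval_hours <= evaluation_hours


four_hourly_with_two_hourly_etl = frequency_is_feasible(4, 2)
hourly_with_daily_etl = frequency_is_feasible(1, 24)
```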
Building an offering
We need a lot of information to build an offering. We concentrate on two items because they are less obvious than the other information. For a full, practical walk-through of defining an offering, see Chapter 5, "Case study scenario: IRBTrade Company" on page 197, and Chapter 6, "Case study scenario: Greebas Bank" on page 315. The two items are:
How to select the right resource type
How to select the evaluation and intermediate evaluation frequencies
Figure 4-28 TBSM business view showing resources that support services
We monitor a metric of either one business system or a component inside it. With this information, use the following steps to determine the resource type to select in the IBM Tivoli Service Level Advisor offering.
1. Knowing the metric and the type of the component, determine which application is used to monitor it. If this application is not installed yet, install it and its WEP. Then enable it inside IBM Tivoli Service Level Advisor. Refer to Getting Started with IBM Tivoli Service Level Advisor, SC32-0834-03.
2. Look in the application's Warehouse Enablement Pack Implementation Guide, which you can find on the application CD that contains the WEPs. Go to the directory that contains the WEPs (its name should contain an acronym such as wep, tdw, tedw, or etl). Then go down through the directories until you find the doc directory, which contains the document.
3. In the document, look for the following tables:
   Measurement type (table MsmtTyp)
   Component measurement rule (table MsmtRul)
4. Look either for a metric in the MsmtTyp table or for a component type in the MsmtRul table. If you start from the MsmtTyp table, you should see the MsmtTyp_ID (first column) of the metric you selected and the corresponding CompTyp_CD in the MsmtRul table. Sometimes more than one CompTyp_CD may correspond to a given metric. Choose the one that you want to monitor. At the end of this step, you should have a component type (CompTyp_CD column in the MsmtRul table).
5. Find the Component Type table (table CompTyp). With the CompTyp_CD information, find the corresponding CompTyp_Nm. This is the resource type to use in the IBM Tivoli Service Level Advisor offering. If you have more than one component type from the previous step, this table can help you decide which one to choose, because it gives you more information about each of the component types.
For example, with IBM Tivoli Business Systems Manager, if you go to the MsmtRul table in the enablement guide, you see only one component type, BUSINESS_SYSTEM. This translates to Business System in the CompTyp table. This is the resource type to choose during the offering creation. IBM Tivoli Monitoring for Transaction Performance is another simple case. In the enablement guide, the MsmtRul table has only one component type, BWM_TX_NODE, which translates to Transaction Node in the CompTyp table.
As another example, suppose that you want to use the CPU utilization of one of your servers as part of an SLA. IBM Tivoli Monitoring can collect this metric, specifically using IBM Tivoli Monitoring for Operating Systems. In the enablement guide, look at the MsmtTyp table, search for the word CPU in the metric names, and select, for example, Percent of time that the CPU is idle. This corresponds to MsmtTyp_ID 47. In the MsmtRul table, 47 corresponds to AMY_CPU. In the CompTyp table, AMY_CPU is a system processor. Use this as a resource inside the offering.
In a third example, you want the number of HTTP sessions as the metric. You can collect this metric with IBM Tivoli Monitoring for Web Infrastructure. In the enablement guide, in the MsmtTyp table, choose the Number of concurrently live servlet sessions (load) metric. This is MsmtTyp_ID 15. In the MsmtRul table, 15 corresponds to IZY_SERVLET_SESS. In the CompTyp table, IZY_SERVLET_SESS is the IBM WebSphere servlet session.
During the creation of the offering in IBM Tivoli Service Level Advisor, in the Select Resource Type pane (Figure 4-29), select one entry in the tree on the left. The resource types are then displayed in the table on the right. The resource type that you want for the offering may already appear in the table. This happens, for example, when the resource type is Business System or Transaction Node.
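The lookup from metric to resource type (steps 3 to 5) is a two-table join. The following Python sketch models it with dictionaries that hold only the rows quoted in the examples above. The dictionary layout, the guide names used as keys, and the function name are assumptions for illustration. The real tables are in each WEP implementation guide and in the TWG schema.

```python
# MsmtRul rows quoted in the text: (guide, MsmtTyp_ID) -> CompTyp_CD values.
# MsmtTyp_ID values are quoted per enablement guide, so the key includes the guide.
MSMT_RUL = {
    ("IBM Tivoli Monitoring for Operating Systems", 47): ["AMY_CPU"],
    ("IBM Tivoli Monitoring for Web Infrastructure", 15): ["IZY_SERVLET_SESS"],
}

# CompTyp rows quoted in the text: CompTyp_CD -> CompTyp_Nm.
COMP_TYP = {
    "BUSINESS_SYSTEM": "Business System",
    "BWM_TX_NODE": "Transaction Node",
    "AMY_CPU": "System Processor",
    "IZY_SERVLET_SESS": "IBM WebSphere servlet session",
}


def resource_types(guide: str, msmt_typ_id: int) -> list:
    """Return the candidate resource type names for a metric (steps 4 and 5)."""
    component_codes = MSMT_RUL.get((guide, msmt_typ_id), [])
    return [COMP_TYP[code] for code in component_codes]


cpu_idle = resource_types("IBM Tivoli Monitoring for Operating Systems", 47)
servlet_sessions = resource_types("IBM Tivoli Monitoring for Web Infrastructure", 15)
```

When a metric maps to more than one component type, the function returns all candidates and the choice remains a manual decision, as described in step 5.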
For System Processor, notice that it does not appear in the table. To enable it, select Host Monitored by IBM Tivoli Monitoring. This shows a table with three pages. If you advance to the last page, you see the System Processor resource type as shown in Figure 4-30. After you select a resource type, click Next and then click Add. Then you reach the Select Metrics page. From here, you follow the steps that are presented in Part 2, Case study scenarios on page 195.
Building SLAs
This section explains how to select a service, how to select a resource, and how to select the SLA Start Date when creating the SLA in IBM Tivoli Service Level Advisor. For a full walk-through of the SLA definition, refer to Part 2, Case study scenarios on page 195.
Selecting the service
On the Select Service page, associate the SLA with the business service that describes the service the SLA is monitoring. In this case, the name of the service is the same as the business system in IBM Tivoli Business Systems Manager. Define the business system in IBM Tivoli Business Systems Manager as a service to allow the association of an SLA with it. Refer to "Marking an IBM Tivoli Business Systems Manager business system as a service" on page 187 to do this. Then, run the IBM Tivoli Business Systems Manager WEP. Also run both IBM Tivoli Service Level Advisor ETLs (Populate Registration and Populate Measurement) to make the information about the newly created service available on the Select Service page. For example, assume that you are creating an SLA for the Online Accounts business system shown in Figure 4-28 on page 158. On the Select Service page, you select the Online Accounts service as shown in Figure 4-31.
Tip: When defining dynamic resources, select the Preview current evaluation filters option in the Filter Resources window to see the resources that currently match the filters.
Reports
In IBM Tivoli Service Level Advisor, the reports are on demand. This means that you can, at any time, obtain a report of what is currently happening with the SLAs. Depending on the type of user that is accessing the reports and that user's attributes, all the SLAs or a subset of them are available for viewing. The types of reports that are available depend on the variables listed in the following sections.
Types of users
There are three types of report users: operator, executive, and customer. This is particularly important when creating the various report users. Figure 4-32 shows the relationship among the various IBM Tivoli Service Level Advisor report users. The provider of services can be the internal IT department or an application service provider. The recipient of services can be the various lines of business inside an enterprise or the users of application services from the application service provider. In either case, there is an SLA between the provider and the recipient of services. Each user reports on this and other SLAs from his or her own perspective.
[Figure 4-32: Provider of Services (Executive, Operator), SLA, Recipient of Services (Customer)]
The operator and the executive belong to the provider organization. They are responsible for providing services to the customer. An SLA exists between the executive and the customer. The executive is responsible for the service, but the operator is the one who takes care of the day-to-day operations to guarantee the service level. Therefore, the operator needs maximum detail to diagnose any problems. The executive needs a high-level view of all the services provided, and the customer needs only the information about his or her own SLAs. The following two objects in IBM Tivoli Service Level Advisor are important when dealing with reports:
Customers are the recipients of service. In an operational level agreement (OLA), customers can help to distinguish the various internal providers of a service, or in an underpinning contract, to designate the external provider of service.
You create the users using the IBM Tivoli Service Level Advisor command line interface (CLI) as shown in this example:
scmd report addUser -name BankingExecutive -view 3 -customer Banking -userType 3
This command creates a report user called BankingExecutive with an external view. This user is a customer type of user and is restricted to viewing reports of the customer Banking. Refer to Command Reference for IBM Tivoli Service Level Advisor, SC32-0833-03, for details about this CLI.
Types of reports
Many types of reports are available to IBM Tivoli Service Level Advisor report users. Table 4-3 lists the reports that are available to each user type. These reports include all the SLAs to which a particular user can have access.
Table 4-3 Available reports by user type Operator Dashboard Customers by Realms SLA by Customers Ranking SLA SLA Type Customer Realm Offering Component Resource Details Overall details SLA Results Trends Violations Yes Yes Yes Yes Yes No No No Yes No No No Yes Yes Yes Yes Yes Yes Yes Yes Yes Yes No No Yes No No No No No Yes Default Default Yes No Default Executive Customer
The dashboard reports are, by default, the first page that a user sees when logging in. They give an overall idea of the status of all the SLAs that a user has access to or, for an executive user, of all the customers for whom the user is responsible. See Figure 4-37 on page 174 for an example. The user can modify the time range or the SLA types listed, using the Filter Criteria section in the report. In this view, the user can easily see where problems or potential problems are and explore details to find the causes. The user does this by clicking the cell that shows the violations or trends (red or yellow cell), which opens the SLA Details view. For more information about the contents of this type of report, see IBM Tivoli Service Level Advisor SLM Reports, SC32-1248.
Ranking reports (Figure 4-33) consider the number of violations, trends, and SLAs, and display them in ranked order. Use them to quickly find the most impacted objects (SLA, SLA type, resource, customer, realm, or offering component). An algorithm defines the rank. For details about the algorithm, see IBM Tivoli Service Level Advisor SLM Reports, SC32-1248.
Details reports show more details about a set of SLAs, such as SLO results, trends, and violations.
Summary graphs
In some of the reports, summary graphs are displayed. Two sets of graphs can be displayed depending on the type of report that is shown. For SLA details or Overall details reports, a pair of graphs is displayed at the top of the page. You can customize the type of graph and choose from the following variables:
Metrics or resources
Trends or violations
Bar or pie chart
The graph can be displayed for the metrics or the resources with the most trends or violations.
For the ranking reports, eight different graphs can be displayed per object type (SLA, SLA type, customer, realm, offering component, and resource):
Violations per object
Trends per object
Violations per time period
Trends per time period
Violations and trends per object
Rank per object
Top objects with the most violations
Top objects with the most trends
Figure 4-34 shows two examples of summary graphs. One example of using a ranking report is for the executive who wants to know which resources contributed most to violations in the last month.
Replacing resources
The second situation is when a resource is replaced. For example, Server1 breaks and is replaced by Server2. In this case, ideally the monitoring application that monitors Server1 starts monitoring Server2 as well. Then run the ETLs for both the monitoring application and IBM Tivoli Service Level Advisor. After this, a reference to Server2 is available during the Replace Resource task in IBM Tivoli Service Level Advisor.
For example, consider that you want to replace S2STI-TBSMWebCons_67 with the Step_1... resource as shown in Figure 4-45 on page 185. Follow these steps:
1. Log in to the IBM Tivoli Service Level Advisor administrator console.
2. Click Administer SLA > Replace Resource.
3. In the Find Resource window, click Browse.
4. In the Select Resource Type window, select Transaction Node and click Next.
5. In the Create Filter window, complete these tasks:
   a. Click Create Filter.
   b. In the Attribute field, select Transaction Management Policy.
   c. In the Value field, type S2STI-TBSMWebCons.
   d. Click Next.
6. In the Select Resources window, select S2STI-TBSMWebCons_67 and click Next.
7. In the Find Resource window, click Next.
8. In the Replace Resource window, repeat steps 3 to 7, but now choose the Step_1... resource. Select Online Accounts Trend SLA and click Finish.
9. In the Track Updated SLAs window, verify that the SLA is there and click Close.
This way, the resources are replaced in the Online Accounts Trend SLA.
Adjudication
IBM Tivoli Service Level Advisor provides a way to adjudicate violations. In the SLA, you can specify situations where a violation can be adjudicated. For example, one situation can be that the service level is guaranteed only up to a certain number of users connected to an application running in WebSphere. You can use the IBM Tivoli Monitoring for Web Infrastructure live servlet sessions metric to monitor the number of sessions on a given server. When the number of sessions exceeds a certain breach value, you receive a violation. This metric can be created in IBM Tivoli Service Level Advisor as an internal one, so that the customer does not receive the violation event. This gives you a well-documented way to justify the adjudication.
To adjudicate any violation, follow these steps:
1. Log in to the IBM Tivoli Service Level Advisor administrator console.
2. Click Administer SLAs > Manage Violations.
3. In the Manage Violations window, select the violation that is to be excluded and click Exclude.
4. In the Exclude Violation window, write the reason for excluding the violation and click OK.
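The effect of excluding a violation can be modeled in a few lines of Python. This is an illustrative sketch only. The record layout, the metric names, and the rule that a non-empty reason is required are assumptions of the sketch, not documented product behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Violation:
    sla: str
    metric: str
    excluded: bool = False
    reason: Optional[str] = None


def exclude(violation: Violation, reason: str) -> None:
    """Mark a violation as adjudicated and record the justification."""
    if not reason.strip():
        raise ValueError("a reason is needed to justify the adjudication")
    violation.excluded = True
    violation.reason = reason


def counted(violations: List[Violation]) -> List[Violation]:
    """Return the violations that still count against the SLA."""
    return [v for v in violations if not v.excluded]


# Hypothetical violations for one SLA.
violations = [
    Violation("Online Accounts", "Response Time"),
    Violation("Online Accounts", "Availability"),
]
exclude(violations[0], "Live servlet sessions exceeded the agreed user limit")
remaining = counted(violations)
```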
Tiered SLAs
IBM Tivoli Service Level Advisor can combine one or more SLAs into another one. Here you use this capability to create an SLA that includes all three banking SLAs. If any of these SLAs has a violation, the Banking SLA shows a violation. You also link this to the Banking business service, so that the Banking business system icon in the IBM Tivoli Business Systems Manager executive console shows any violations in any of the Banking services.
1. In the IBM Tivoli Service Level Advisor administrator console, click Administer Offerings > Create Offering.
2. In the Name Offering window, complete these tasks:
   a. For Name, type Banking Offering.
   b. For Description, type This offering includes all the SLAs in the Banking business unit.
   c. Click Next.
3. In the Select SLA Type window, click Next.
4. In the Include SLAs window, click Add.
5. In the Select SLAs window, select all three SLAs:
   Online Accounts
   Interbank Transfers
   Account Application
   Then click OK.
7. In the Select Business Schedule window, select 24 x 7 schedule and click Next.
8. In the next panels, click Next until you see the Summary window.
9. In the Summary window, select Publish the offering and click Finish.
Do not include any offering components.
To create the SLA, follow these steps:
1. Click Administer SLA > Create SLA.
2. In the Name SLA window, in the SLA Name field, type Banking SLA and click Next.
3. In the Select Customer window, select Banking and click Next.
4. In the Select Service window, select Banking and click Next.
5. In the Select Offering window, select Banking Offering and click Next.
6. In the Select SLA Start Date window, click Next.
7. In the Summary window, click Finish.
Now look at the reports for this SLA. Log in to the IBM Tivoli Service Level Advisor Reports interface as the SLA Administrator. Then click in one of the cells of the Banking SLA. Now you see the Banking SLA with the three other SLAs that it contains as shown in Figure 4-36.
If you go back to the high-level report, you see that the violations on two of the SLAs are reflected on the Banking SLA (the parent). Two of the component SLAs have one violation each, and the Banking SLA has two. Each violation of a component SLA is reflected in the parent, or tiered, SLA as shown in Figure 4-37.
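The roll-up of violations from component SLAs to the parent can be sketched as a recursive sum. The Python below only illustrates the counting described above. The class is hypothetical, and the assignment of the two violations to particular component SLAs is an assumption, because the text does not say which two SLAs are affected.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sla:
    name: str
    own_violations: int = 0
    children: List["Sla"] = field(default_factory=list)

    def total_violations(self) -> int:
        """Own violations plus everything propagated from component SLAs."""
        return self.own_violations + sum(
            child.total_violations() for child in self.children
        )


# Component SLAs from the Banking example; which two are violated is assumed.
online_accounts = Sla("Online Accounts", own_violations=1)
interbank_transfers = Sla("Interbank Transfers", own_violations=1)
account_application = Sla("Account Application", own_violations=0)

banking = Sla(
    "Banking SLA",
    children=[online_accounts, interbank_transfers, account_application],
)
banking_total = banking.total_violations()
```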
Details of what is seen for SLA violations are given in the case study scenarios presented in Part 2, Case study scenarios on page 195. If a violation or trend is propagated to this SLA from one of the associated ones, this event is sent to IBM Tivoli Business Systems Manager to be shown in the executive dashboard and is associated with the Banking business system.
Maintenance schedule
It is important to schedule preventive maintenance from time to time. Be sure to include a maintenance window in the signed SLA. The maintenance, in this case, should happen every three months on a Sunday, from 0:00 a.m. to 2:00 a.m. To define this to IBM Tivoli Service Level Advisor, the only prerequisite is that the maintenance window is in the future. To assign a maintenance window, create a new schedule with a No Service period that covers the maintenance window, and replace the existing schedule with it. Assume that today is 12 October 2004 and you want the maintenance to happen on 19 December 2004 from 0:00 a.m. to 2:00 a.m. Also assume that you want to do this maintenance on the resources that support the Online Banking service.
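Before entering the period, it can help to confirm that the planned window satisfies the conditions stated above. The following Python sketch checks the window that is typed in the Create Period step (19 December 2004, 0:00 to 2:00) against the assumed current date of 12 October 2004. The function name and the wording of the messages are illustrative.

```python
from datetime import date, datetime

SUNDAY = 6  # datetime.weekday(): Monday is 0, Sunday is 6


def check_maintenance_window(today: date, start: datetime, end: datetime) -> list:
    """Return a list of problems; an empty list means the window is acceptable."""
    problems = []
    if start.date() <= today:
        problems.append("window must be in the future")
    if start.weekday() != SUNDAY:
        problems.append("window must start on a Sunday")
    if end <= start:
        problems.append("end must be after start")
    return problems


today = date(2004, 10, 12)
window_start = datetime(2004, 12, 19, 0, 0)
window_end = datetime(2004, 12, 19, 2, 0)
problems = check_maintenance_window(today, window_start, window_end)
```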
7. In the Create Period window (Figure 4-38), complete these tasks:
   a. In the Frequency field, select Single Date.
   b. The window changes to show the options relative to Single Date.
      i. In the State field, select No Service.
      ii. Keep the Time Zone and Start Time as the default.
      iii. In the End Time field, select 01:59.
      iv. In the Date field, type 12/19/2004 or use the calendar icon on the right side of the field.
      v. Click OK.
8. You return to the Define Periods window (Figure 4-39). The difference is that you added the No Service period. Click Next.
In the Manage Schedules window (Figure 4-40), you see the added schedule.
3. In the Associate SLAs window (Figure 4-41), in the task list, click Select Compatible Business Schedule.
4. In the Compatible Business Schedule window (Figure 4-42), select 24 x 7 20041219M schedule and click Next.
5. Continue clicking Next until you reach the Summary window. 6. In the Summary window (Figure 4-43), at the bottom, there is a table with all the SLAs that are affected by this change. Click Finish.
7. In the Track Updated SLAs window, you see a table similar to the one in Figure 4-43 for tracking the SLAs that are affected by the change on this offering. Click Close. Now the maintenance window is included. At the end of the month (monthly SLA period), the SLA will be calculated taking into account this maintenance period.
This adds a No Service state on 12 November 2004 between 05:00 hours and 12:00 hours. This CLI is helpful if you must suddenly set up a maintenance period by adding a No Service period.
Trends
Another SLM tool in IBM Tivoli Service Level Advisor is the use of trends. Trends are automatically calculated for all the metrics selected for an SLA. To improve this capability, you can add another metric. This section explains, by example, how to add another metric and how to set the metric for trend analysis. The metric collects the performance of the same resource in IBM Tivoli Monitoring for Transaction Performance that is feeding an IBM Tivoli Business Systems Manager business system.
We already created the original SLA, Online Banking SLA. Now we modify this SLA to include this new metric and enhance the trend. For this, we include the same resource that is feeding events to the resources under Real-time Online Account Transactions. The first stage is to modify the offering. Because IBM Tivoli Service Level Advisor does not allow you to add new offering components to a published offering, create another offering using the original one as a base. IBM Tivoli Service Level Advisor behaves this way because the published offering can be assigned to SLAs other than the one you want to modify. A change to the offering would then change those SLAs when this was not the intention.
9. In the Include Metrics window, click Add.
10. In the Select Metrics window, select Response Time and click Next.
11. In the Define Breach Values window, complete these tasks:
   a. As defined in the OLA, in the Average files field, type 10.
   b. For Keep Violation Condition with, select Actual average greater than supplied average.
   c. Click Next.
12. In the Evaluation Frequency window, complete these tasks:
   a. In Access to Results, select Internal Use Only. We do not want business executives outside of the business unit to see this.
   b. In Evaluation Frequency, select Monthly.
   c. In Advanced Metric Settings, select Configure advanced metric settings.
   d. Click Next.
13. In the Advanced Metric Settings window, complete these tasks:
   a. In Intermediate Evaluations, select Perform intermediate evaluations.
   b. Still in Intermediate Evaluations, keep the Daily selection.
   c. In Trend Analysis, select Current evaluation Period Only.
   d. Click Finish.
14. In the Include Metrics window, click Next.
15. In the Name Offering Component window, in the Offering Component field, type Online account response time. Click Next.
16. In the Include Offering Components window, click Next.
17. In the Summary window, select Publish the offering and click Finish.
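The breach condition configured in steps 11 to 13 can be pictured as a comparison of an average against the supplied value of 10. The Python sketch below is a simplified model. How IBM Tivoli Service Level Advisor actually aggregates samples for intermediate evaluations is not described here, so the cumulative average used for the daily check is an assumption, as are the sample values and function names. The unit of the breach value is not stated in the text.

```python
from statistics import mean
from typing import List, Optional

SUPPLIED_AVERAGE = 10  # value typed in the Define Breach Values window


def is_violation(samples: List[float], supplied: float = SUPPLIED_AVERAGE) -> bool:
    """Violation condition: actual average greater than supplied average."""
    return mean(samples) > supplied


def first_breach_day(daily_samples: List[List[float]]) -> Optional[int]:
    """Return the 1-based day on which the running average first breaches.

    Models a daily intermediate evaluation inside a monthly period by
    averaging all samples collected so far (an assumption of this sketch).
    """
    collected: List[float] = []
    for day, samples in enumerate(daily_samples, start=1):
        collected.extend(samples)
        if is_violation(collected):
            return day
    return None


# Hypothetical response time samples for the first three days of the month.
daily = [[8, 9], [12, 14], [15]]
breach_day = first_breach_day(daily)
month_violated = is_violation([8, 9, 14, 13])
```

A daily check of this kind is what allows a problem to be seen before the monthly evaluation closes.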
   c. In the Value field, type any part of the name of the transaction management policy, for example, S2STI-TBSM.
   d. Click Next.
10. In the Select Resources window (Figure 4-45), you see Step_1_..., which is a subtransaction of the other transaction. Select S2STI-TBSMWebCons_67 and click Next.
11. In the Add Resources to Online Account Response Time window, click Next.
12. In the Select SLA Start Date window, complete these tasks:
   a. Make this SLA valid for the next month. In the SLA Start Date, specify the first day of the next month.
   b. Click Recalculate First Evaluation Dates.
   c. Click Next.
13. In the Summary window, click Finish.
The escalation message can take any of the following forms:
E-mail message
Simple Network Management Protocol (SNMP) event
TEC event
To enable TEC event escalation with service details when a violation or a trend toward violation occurs, load the sample ruleset provided with the SLM Event class definitions into the TEC rule base. See Command Reference for IBM Tivoli Service Level Advisor, SC32-0833-03, for details about customizing and enabling the event escalation. You can toggle the event escalation for parent SLAs in a tiered SLA on or off using the CLI:
scmd escalate parentSLAEscalation {true|false}
This disables any violation or trending toward violation event escalation to TEC. Load the sample TEC rule, slmDropParentEvents.rls, that is provided into TEC. After the rule is loaded and event escalation is switched on using the CLI, the parent SLA events can be controlled for escalation.
4.4.4 Integrating IBM Tivoli Service Level Advisor with IBM Tivoli Business Systems Manager
Section 4.4, "IBM Tivoli Service Level Advisor V2.1" on page 156, introduces the concept of loading IBM Tivoli Business Systems Manager data into Tivoli Data Warehouse and extracting it to IBM Tivoli Service Level Advisor. This enables IBM Tivoli Service Level Advisor to use IBM Tivoli Business Systems Manager data to calculate SLA metrics. In "Escalating the SLA events" on page 186, you can learn how to send IBM Tivoli Service Level Advisor events to TEC. In "Executive dashboard" on page 130, you learn how the IBM Tivoli Business Systems Manager executive dashboard can receive IBM Tivoli Service Level Advisor events. This section describes the process to pass IBM Tivoli Service Level Advisor events from TEC into IBM Tivoli Business Systems Manager.
Getting IBM Tivoli Service Level Advisor events into IBM Tivoli Business Systems Manager executive dashboard
For IBM Tivoli Service Level Advisor events to show in the correct icon on the IBM Tivoli Business Systems Manager executive dashboard, you must perform the following actions:
1. Place IBM Tivoli Business Systems Manager data into IBM Tivoli Service Level Advisor (TSLA). This is detailed in 4.4, "IBM Tivoli Service Level Advisor V2.1" on page 156.
2. Mark the IBM Tivoli Business Systems Manager business system as a service.
3. Build an SLA or SLAs around services defined in IBM Tivoli Business Systems Manager. This is detailed in "Building SLAs" on page 162.
4. Enable TSLA to TEC to TBSM event traffic and display it in the TEC console.
The following sections explain how to mark IBM Tivoli Business Systems Manager business systems as services. They also explain how to enable IBM Tivoli Service Level Advisor to send event data, using TEC, to IBM Tivoli Business Systems Manager for display in executive dashboard views.
Marking an IBM Tivoli Business Systems Manager business system as a service
The concept of services is shared between IBM Tivoli Business Systems Manager, Tivoli Data Warehouse, and IBM Tivoli Service Level Advisor. An entity defined as a service in IBM Tivoli Business Systems Manager is also a service within Tivoli Data Warehouse, and it is available as a service for selection during the SLA definition process in IBM Tivoli Service Level Advisor. In IBM Tivoli Business Systems Manager, you can mark both business systems and individual objects within a business system as services. Note that objects that are not in business systems cannot be marked as services.
To mark a resource as a service, click the resource's Properties tab and select the Executive View tab. This opens the Executive Dashboard panel (Figure 4-46) for defining a resource as a service.
The Executive Dashboard panel contains two check boxes and five text fields to complete (starting from the top of the right pane in Figure 4-46):
Running this script sets up everything. After this is done, IBM Tivoli Service Level Advisor events are sent to IBM Tivoli Business Systems Manager. If the events are for a service that is represented in the executive dashboard, the IBM Tivoli Service Level Advisor icons show that there are outstanding violations or trends. You only need to perform this process once for each TEC feeding into IBM Tivoli Business Systems Manager. Figure 4-47 shows an executive dashboard that has non-viewed SLA violations (red square) and viewed SLA trends (blue arrow).
Integrating IBM Tivoli Monitoring for Transaction Performance events into IBM Tivoli Business Systems Manager
IBM Tivoli Monitoring for Transaction Performance sends events to TEC through simple configuration of parameters on the IBM Tivoli Monitoring for Transaction Performance Management Server. You can pass these events on to IBM Tivoli Business Systems Manager by configuring TEC to forward them:
1. Add the IBM Tivoli Monitoring for Transaction Performance baroc file and rule to TEC.
2. Extend the Perl script to forward IBM Tivoli Monitoring for Transaction Performance events to IBM Tivoli Business Systems Manager.
3. Create generic objects in IBM Tivoli Business Systems Manager for IBM Tivoli Monitoring for Transaction Performance resources.
This is nearly the same process as the one used for sending any form of event from TEC to IBM Tivoli Business Systems Manager, which is described in IBM Tivoli Business Systems Manager Installation and Configuration Guide, SC32-9089, and in IBM Tivoli Business Systems Manager Administrator's Guide, SC32-9085.
IBM Tivoli Monitoring for Transaction Performance (TMTP) objects are standard generic TBSM objects, so they display whichever icon the IBM Tivoli Business Systems Manager administrator chooses for them when creating the generic objects. The actual IBM Tivoli Monitoring for Transaction Performance events contain many details about the transaction and the thresholds, as shown in Figure 4-48.
Figure 4-48 A TMTP event as seen in IBM Tivoli Business Systems Manager
Quality of Service (QoS) provides metrics of user response and the overall user experience of a transaction by using a reverse proxy to measure round-trip time. QoS is potentially a performance overhead and is not covered further in this redbook.
IBM Tivoli Monitoring for Transaction Performance has rich J2EE monitoring, which can be useful for monitoring a WebSphere-based J2EE application. In addition, the data from IBM Tivoli Monitoring for Transaction Performance can add a lot to SLM.
The final IBM Tivoli Monitoring for Transaction Performance component is the Rational Robot. You can use it to great effect by recording and replaying user transactions on desktop applications. The Robot is not restricted to Web browser transactions, so it has many uses. Application Response Measurement (ARM) calls must be added manually to the Robot script so that metrics can be returned to IBM Tivoli Monitoring for Transaction Performance. To learn about exploitation of the Robot, see Chapter 5, Case study scenario: IRBTrade Company on page 197.
allow you to define and create SLOs for the database that is provided by individual resource models of IBM Tivoli Monitoring for Databases.
Part 2
Chapter 5. Case study scenario: IRBTrade Company
Figure 5-1 Organization chart for the IRBTrade Company business units (chart boxes: CEO; Marketing; Financial Consultancy; Information Technology)
IRBTrade Company began as a small online trading company with a loyal customer base. Since going public one year ago, the company has seen its customer base increase steadily. In addition, the economic upturn of the past few months has led to exceptional growth of 50%, which is also due in part to promotions such as one free trade with every five and no commission day.
Recent research from marketing indicates that:
- Customers are satisfied with the low commission rates and the promptness and reliability of the service during off-peak hours.
- High-peak performance and availability are often unacceptable according to many customers. Specifically, it can take two to three attempts to log in successfully. This complaint is twice as common on promotional days.
- During peak times, transactions sometimes take so long to complete that the stock price has changed.
- Occasionally during peak times, the entire transaction times out and must be repeated.
- Overall performance on heavy trading days is poor. Heavy trading is usually caused by acts of terrorism, the exposure of corporate fraud, etc.
In this competitive market, customer loyalty is typically due to promptness, reliability, and per-trade commission rates. If customer satisfaction does not improve soon, customers will find another online trading company to use. The marketing business unit is concerned that poor performance on such days will decrease customer loyalty when less value for money is perceived. Further research by marketing has shown that the company can increase revenue if it can quantitatively prove its superior service compared to its competitors. As a result of this research, marketing is willing to fund a project to implement SLM to facilitate its marketing strategy.
Summary of issues:
- Low customer satisfaction
- Loss of customers in spite of promotional activities
- Decreased customer loyalty
- No tools to quantitatively prove IRBTrade Company's superior service
- Inability to understand the impact of peak loads on customer satisfaction
- Reports provided by the IT business unit are written in technical terms and do not contain information relevant to the business.
Figure 5-2 shows the organizational hierarchy for the IT business unit.
Figure 5-2 Organization chart for the IRBTrade Company IT business unit (chart boxes: Information Technology; Service Desk; Application Development; IT Infrastructure; Web Infrastructure; Databases; Network; Operating Systems)
The IT business unit is constantly enhancing the online trading application by adding new features, making it easier to use, and improving performance and system availability. However, since the results of these improvements are not visible to the marketing business unit, the IT business unit has been under pressure to demonstrate quality service. The IT business unit is responsible for planning and implementing the SLM project funded by the marketing business unit.
Summary of issues:
- IT services provided by the IT business unit are not aligned with the current and future needs of the business.
- Perception of quality of delivered IT services is low.
- There is a lack of visibility of the work being done to improve the online trading application and underlying infrastructure support.
- There is a lack of understanding of the impact of IT services on the overall business of IRBTrade Company.
- Existing systems management tools are underused.
- Reports are manually produced and do not provide information required by the marketing business unit as described in 5.2.3, Reporting on page 203.
5.2.3 Reporting
Individually, each team of the IT business unit provides reports indicating overall availability of the system or software being maintained. These reports are produced manually and are prone to errors. More detailed reports from the operating systems group indicate periodic episodes of high CPU utilization, but nothing on a regular basis. Similarly, the Web infrastructure team reports some periods of high usage, but is unable to identify any trends. All of the reports inform the IT infrastructure manager of periodic performance problems. However, there is no way to correlate this information with the performance and customer satisfaction complaints shown in the surveys of the marketing business unit. When reports are provided to the marketing manager, they mainly show good to average availability and performance of the systems. However, they are written in technical terms and are not consolidated, and therefore do not provide information that is relevant to the business.
The main issue seems to be the lack of correlation between the two organizations when evaluating the effective level of service that is being provided. Table 5-2 identifies this and other issues.
Table 5-2 Key issues

Issue: Low customer satisfaction
Impact: Loss of customers; diminished growth of customer base; reduced marketing potential

Issue: Absence of quantitative data to support the level of service being provided by each organization, and then in turn to the customer
Impact: Inability of IT to address customer perception. Inability to prioritize resolution of incidents

Issue: Underutilization of the existing IT infrastructure and tools
Impact: The inability to identify the areas of the IT infrastructure that are performing or not performing to the desired levels to meet the overall business goals of the company

Issue: No formal SLM processes in place; manual process for availability and performance reporting and analysis
Impact: Report creation takes too long; accuracy is questionable; reports are after the fact; there is no trend to failure; no proactive analysis

Issue: No clear understanding of the impact of IT failures on the business
Impact: No root cause analysis of business failures

Issue: No formal operational level agreements (OLA) or service level agreements (SLA) defined
Impact: Since there are no objectives to meet, there are no drivers to improve service levels
1. Identify the services and business processes that will be part of the SLM project.
2. Identify the consumers, customers, and providers of various services. In this case study scenario, from a point of view that is external to the IRBTrade Company, the consumers and customers are the users of the online trading application. The provider is the IRBTrade Company. From a point of view that is internal to the IRBTrade Company, the responsibilities of the provider go to the IT business unit, since it provides the IT services that make up the online trading application.
3. Identify and reconcile customer requirements and provider capabilities.
4. Define SLOs and SLAs.
5. Identify and implement additional systems management and monitoring tools.
6. Identify the resources and components that make up the defined services.
7. Identify proper metrics for each defined service. Determine the desired metrics and the current monitoring sources. Perform analysis to determine if additional ones are needed.
8. Identify, implement, and customize monitoring tools and procedures for collecting metric data.
9. Identify the reporting frequency.
10. Identify and define executive views and assign proper services to the views.
11. Implement a proactive warning mechanism for potential service breaches.
12. Review and adjust processes whenever necessary.
Warehouse enablement packs (WEPs) for the following products:
- IBM Tivoli Monitoring
- IBM Tivoli Monitoring for Databases
- IBM Tivoli Monitoring for Web Infrastructure
- IBM Tivoli Enterprise Console
- IBM Tivoli Monitoring for Transaction Performance
- IBM Tivoli Business Systems Manager
Chapter 3, IBM Tivoli products that assist in service level management on page 53, provides a high-level description of the IBM software tools that are used in this solution. This section explains how the specific features of IBM Tivoli Service Level Advisor V2.1 and IBM Tivoli Business Systems Manager V3.1 are used to meet the objectives of the SLM project for IRBTrade Company. Refer to 5.4.1, Additional instrumentation required on page 212, for specific information about how these features are implemented in our case study scenario. Table 5-3 summarizes the IBM Tivoli Business Systems Manager features that are used.
Table 5-3 IBM Tivoli Business Systems Manager features and usage

Feature: Business systems
Reason for use: To create representations of business services to monitor from a business perspective

Feature: Executive dashboard services
Reason for use: To enable critical business system status to be displayed in an executive view

Feature: Executive dashboard display
Reason for use: To provide executive views showing service status with SLA indicators

Feature: Executive dashboard secondary impact indicators (SIIs)
Reason for use: To provide visibility of SLA violations and trends for critical services

Feature: IBM Tivoli Business Systems Manager WEP
Reason for use: To enable IBM Tivoli Business Systems Manager business system availability data to be used in SLAs built with IBM Tivoli Service Level Advisor

Feature: Console consolidator
Reason for use: To consolidate views and representation of IT resources, based on the administrator's roles and responsibilities
Table 5-4 summarizes the IBM Tivoli Service Level Advisor features that are used.
Table 5-4 IBM Tivoli Service Level Advisor features and usage

Feature: Realm and customer definition
Reason for use: To segregate services for external and internal clients

Feature: Service offerings
Reason for use: To provide options to define different options for SLOs and targets

Feature: IBM Tivoli Business Systems Manager/IBM Tivoli Enterprise Console (TEC) integration
Reason for use: To enable breaches and trends for services to be displayed on IBM Tivoli Business Systems Manager executive dashboards

Feature: Service Level notification
Reason for use: To escalate via SNMP, TEC and e-mail when the SLO is breached or trending toward violation

Feature: Tiered SLAs
Reason for use: To group various SLAs via tiered SLAs

Feature: Ability to add a maintenance window
Reason for use: Add an unexpected maintenance period to an active SLA

Feature: Adjudicate Violations
Reason for use: To have the ability to adjudicate violations with an agreement between the customer and the service provider

Feature: Ability to plug-in any monitoring application
Reason for use: To have the ability to create SLAs using data from any monitoring application if the WEP is available

Feature: Create SLAs using Service Desk data
Reason for use: To have the ability to create SLAs using Peregrine ServiceCenter data using Peregrine's TDW connector

Feature: Executive dashboard
Reason for use: To display the SLA status using SLM reports

Feature: Provision of various evaluation intervals
Reason for use: To evaluate monitoring data using multiple interval frequencies such as hourly, two hourly, etc.

Feature: Wizard based Administration Console
Reason for use: To create an offering, schedule, customer, realm, etc.

Feature: Ability to deal with data from multiple warehouse databases
Reason for use: To use the provided WEP to extract data from multiple central data warehouse databases
The service level manager assigned a team formed by technical and business representatives to decide on how to implement the features and customize IBM Tivoli Business Systems Manager and IBM Tivoli Service Level Advisor to achieve the desired results for SLM in IRBTrade Company. As described in Chapter 4, Planning to implement service level management using Tivoli products on page 109, the team performed the following activities:
1. Identified all the services that will be considered in the project
2. Performed a service decomposition task to identify all the resources that make up the service
3. Decided on the relationships among the various resources
4. Identified the business system units
5. Outlined the business systems views for each of the executives
6. Defined the SLOs per business unit
7. Established agreements on SLOs between business unit representatives
8. Determined the service level reporting content and frequency
The team created a high-level representation of the various business systems, resources, executive views, SLAs and components, and reporting to use as a basis for IBM Tivoli Business Systems Manager (TBSM) and IBM Tivoli Service Level Advisor (TSLA) configurations for IRBTrade Company. See Figure 5-4.
Figure 5-4 (diagram labels: CEO; Marketing; User Experience; Trade Application; Development; Service Desk, with External and Internal; Web Servers Support, Web Application Servers Support, Database Servers Support, and OS Servers Support; SLA and OLA links. Legend entries: Service, SLA Definition, Dashboard, SLA Report, OLA, Service Provider, Service Receiver)
Refer to 5.4, Implementation on page 211, for details about how the configuration is performed in the systems management and monitoring environment of IRBTrade Company.
5.4 Implementation
This section shows how the SLM process is implemented in the IRBTrade Company. It also provides references to how the solution maps to ITIL recommendations, which supplement what we said in earlier sections. The high-level steps are:
1. Determine and implement additional instrumentation on the existing systems management environment of IRBTrade Company.
2. Determine and define business services and their infrastructure components at a high level.
3. Determine user roles for IBM Tivoli Business Systems Manager and IBM Tivoli Service Level Advisor.
4. Define the required IBM Tivoli Business Systems Manager resource types.
5. Create business systems based on business functions.
6. Agree and define content of executive dashboard views.
7. Agree and define SLOs.
8. Define the required metrics to measure SLOs.
9. Enable data sources in IBM Tivoli Service Level Advisor.
10. Set up IBM Tivoli Service Level Advisor schedules, realms, and customers.
11. Set up offerings in the SLAs in IBM Tivoli Service Level Advisor.
12. Define the required SLAs in IBM Tivoli Service Level Advisor.
Figure 5-5 illustrates the IRBTrade Company user experience transactions as defined in the IBM Tivoli Monitoring for Transaction Performance console.
Installing the IBM Tivoli Monitoring for Transaction Performance management server and management agents, and creating the playback recordings and policies that monitor the IRBTrade Company user experience, are outside the scope of this redbook. Refer to IBM Tivoli Monitoring for Transaction Performance Administrator's Guide, GC32-9189, for implementation details.
The business systems are defined to facilitate monitoring of service levels at each organization level (line management, senior management, and executive management) and to identify and define OLAs between the organizations. The existing monitoring capabilities using IBM Tivoli Monitoring are also integrated into the business systems to provide operational status of each IT resource that is critical to the business and to the service.
All IBM Tivoli Business Systems Manager business systems are defined using the IBM Tivoli Business Systems Manager Java Console and a drag-and-drop approach. For information about how business systems are defined, refer to the IBM Tivoli Business Systems Manager Administrator's Guide, SC32-9085. The business systems defined for the IRBTrade Company SLM solution are presented in 5.4.2, Identifying the business service on page 216.
IBM Tivoli Business Systems Manager distributed resource types are defined to represent various IT resources and the IRBTrade Company user experience-related transactions. The resource types defined (as listed in 5.4.4, Required resource types on page 225) are based on the existing monitoring capabilities, potential internal and external services, and SLAs that facilitate implementation of SLM.
Existing IT resource monitoring, IBM Tivoli Enterprise Console event sources, and the additional event sources that result from deploying IBM Tivoli Monitoring for Transaction Performance user experience-related transactions are analyzed. These events are mapped to the appropriate IBM Tivoli Business Systems Manager distributed resource types.
TEC events are integrated into the IBM Tivoli Business Systems Manager distributed solution by:
- Mapping TEC events to the appropriate IBM Tivoli Business Systems Manager resource type and then to a specific instance of that resource type
- Using the TEC rules and a single Perl script to forward the event to TBSM. Both the IBM Tivoli Enterprise Console rule and the script are listed in Appendix B, Important concepts and terminology on page 515.
- Forwarding the event data to IBM Tivoli Business Systems Manager using the IBM Tivoli Business Systems Manager Event Enablement component installed at TEC (via the ihstttec application programming interface (API) call)
The approach used in this case study scenario, a TEC rule and a script, is one of many ways to integrate TEC events into the TBSM distributed solution. Using the TEC rule and script to evaluate the event and then forward it to TBSM via the ihstttec API call allows the most flexibility in mapping TEC events to TBSM resource types. This approach also allows any automation (IBM Tivoli Enterprise Console rules, etc.) that is in place to take effect before events are forwarded to IBM Tivoli Business Systems Manager.
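The mapping step can be pictured as a lookup from a TEC event class to a TBSM resource type, with the host name used as the instance. The following Python sketch is illustrative only and is not the rule or script used in this redbook: the event class names, the CLASS_TO_RESOURCE_TYPE table, the event dictionary shape, and the map_event function are all hypothetical. Only the resource type names and the type;version form are taken from the examples in this chapter.

```python
# Hypothetical TEC event classes mapped to TBSM resource types.
# The resource type names come from this chapter; the class names are invented.
CLASS_TO_RESOURCE_TYPE = {
    "Example_DB2_Event": "DB2Server",
    "Example_WebSphere_Event": "WebApplServer",
    "Example_HTTP_Event": "WebServer",
}

RESOURCE_TYPE_VERSION = "1.0"


def map_event(event):
    """Return (resource type with version, instance) for an event, or None.

    `event` is a plain dict with 'class' and 'hostname' keys (an assumed shape).
    Events whose class has no mapping, or that carry no host name, are not
    forwarded and yield None.
    """
    resource_type = CLASS_TO_RESOURCE_TYPE.get(event.get("class"))
    hostname = event.get("hostname")
    if resource_type is None or not hostname:
        return None
    return f"{resource_type};{RESOURCE_TYPE_VERSION}", hostname


if __name__ == "__main__":
    sample = {"class": "Example_DB2_Event", "hostname": "bc1srv12.itso.ral.ibm.com"}
    print(map_event(sample))
```

Keeping the mapping in a table rather than in conditional logic means that adding a new event source is a one-line change, which is the flexibility the rule-and-script approach is chosen for.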
convention for service definitions. It is IRB Trade <ServiceName>, where IRB Trade represents the core business of IRBTrade Company as defined earlier.
To facilitate SLM of the IRB Trade service, we identify additional services based on the IRBTrade Company mission, organizational structure, responsibilities of each organization, and inter-dependencies between the organizations to provide the best possible service to its customers. With this in mind, and based on the information provided in Figure 5-4 on page 210, we identify the following business services that map to executive level management given the IRBTrade Company's organizational structure:
- Marketing services
  This is related to the services provided by the IRBTrade Company to its external customers. It is mainly concerned with customer traffic or volume, customer perception of the quality of the service, and end-user transaction promptness as perceived by its customers. This service is based on end user load and end user experience as monitored by the IRBTrade Company. We name this service IRB Trade Marketing.
- Financial consultancy services
  This service deals with stock analysis information provided to the IRBTrade Company customers. It is not addressed in any further detail, but is included here for the sake of making this case study scenario complete at this level. We name this service IRB Trade Research.
- IT services
  This service is based on the services provided by the IT business unit and its organizations: trade application production and support, trade application development, service desk, and IT infrastructure. It is made up of the IRBTrade Company Web site supporting software and all other IT infrastructure that is used to run the IRBTrade Company's day-to-day business. These services support the services listed earlier. We name this service IRB Trade IT Division.
Figure 5-6 shows an overview of these services and their relationships in terms of SLAs and OLAs. The hierarchy of identified business services for IRBTrade Company begins with IRB Trade. Underneath this level are:
- IRB Trade IT Division
- IRB Trade Marketing
- IRB Trade Research
We must perform further decomposition of the services provided by IRBTrade Company, as explained in the following sections.
Figure 5-6 (diagram labels: IRBTrade Company; Marketing; Information Technology; Financial Consultancy; Customers; Service Desk; Application Development; IT Infrastructure; SLA and OLA links)
Figure 5-7 shows the final breakdown of IRBTrade Company's business services.
IRB Trade
  IRB Trade IT Division
    IRB Trade Application
      IRB Trade Availability
      IRB Trade Web Servers
      IRB Trade Web Application Servers
      IRB Trade Database Servers
      IRB Trade Unix Servers
      IRB Trade Wintel Servers
    IRB Trade Development
    IRB Trade Infrastructure
      IRB Trade Infra Web Server Support
      IRB Trade Infra Web Application Server Support
      IRB Trade Infra Database Server Support
      IRB Trade Infra Unix System Support
      IRB Trade Infra Wintel System Support
    IRB Trade Service Desk
      IRB Trade External Customer Incident Management
      IRB Trade Internal Customer Incident Management
  IRB Trade Marketing
    IRB Trade User Load
    IRB Trade User Experience
      IRB Trade Application Availability
      IRB Trade External Customer Incident Management
      IRB Trade General Web Site Response or Experience
      IRB Trade On-line Quote Response time
      IRB Trade On-line Sell Transaction Response time
      IRB Trade On-line Buy Transaction Response time
  IRB Trade Research
- IRB Trade Infra Web Server Support
- IRB Trade Infra Web Application Server Support
- IRB Trade Infra Database Server Support
- IRB Trade Infra UNIX System Support
- IRB Trade Infra Wintel System Support
Senior level management that reports to the executives is responsible for providing these services and meeting the SLAs.
IRB Trade Service Desk
This service is related to the technical support provided by the IRBTrade Company help desk to external and internal customers. The service level measurement is based on the trouble ticket resolution time. This service is related to incidents created using the service desk management system implemented in the IRBTrade Company, which is Peregrine ServiceCenter 6. The IRB Trade Service Desk service consists of the following services:
- IRB Trade External Customer Incident Management
- IRB Trade Internal Customer Incident Management
The IRB Trade Service Desk service provides the ability to track the effectiveness of technical support for internal and external customers in terms of incident management, such as open incidents, closed incidents, and incident resolution time.
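Because the service level for IRB Trade Service Desk is measured on trouble ticket resolution time, a metric of this kind reduces to simple arithmetic over open and close timestamps. The Python sketch below is a minimal illustration under assumed data: the incident records, the four-hour target, and the function names are hypothetical, and they do not reflect how Peregrine ServiceCenter or IBM Tivoli Service Level Advisor store or evaluate incidents.

```python
from datetime import datetime, timedelta


def resolution_times(incidents):
    """Return resolution times for closed incidents; open ones are skipped."""
    return [i["closed"] - i["opened"] for i in incidents if i.get("closed")]


def summarize(incidents, target):
    """Summarize incident counts and resolution time against a target."""
    times = resolution_times(incidents)
    closed = len(times)
    within = sum(1 for t in times if t <= target)
    return {
        "open": len(incidents) - closed,
        "closed": closed,
        "mean_resolution": sum(times, timedelta()) / closed if closed else None,
        "percent_within_target": 100.0 * within / closed if closed else None,
    }


if __name__ == "__main__":
    t0 = datetime(2004, 12, 1, 9, 0)
    # Hypothetical incidents: two closed, one still open.
    incidents = [
        {"opened": t0, "closed": t0 + timedelta(hours=2)},
        {"opened": t0, "closed": t0 + timedelta(hours=6)},
        {"opened": t0, "closed": None},
    ]
    print(summarize(incidents, target=timedelta(hours=4)))
```

The summary covers the same three quantities named above: open incidents, closed incidents, and resolution time.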
IRB Trade User Experience
This service is related to the availability and response times associated with the customer transactions and activities performed using the IRBTrade Company Web site. The service level is calculated using the data collected on availability and response times for common IRBTrade Company customer transactions, such as the availability of the Web site, response time for quotes, and buy or sell orders. This service is managed by a senior level manager and is used by the marketing organization. It deals with the satisfaction of IRBTrade Company's users with the online application that is critical to the business. The IRB Trade User Experience service is based on the following services, which deal with user transaction performance and availability of the online trade application:
- IRB Trade Application Availability
- IRB Trade Customer Help desk Experience
- IRB Trade General Web Site Response or Experience
- IRB Trade On-line Quote Response time
- IRB Trade On-line Sell Transaction Response time
- IRB Trade On-line Buy Transaction Response time
irbSuperAdmin
IRBTrade Company executives or senior managers who own the business that a service represents. They do not want to see details about each IT resource but want to see a high-level summary of the services (business systems) in their scope of responsibility.
IRBTrade Company senior managers and line managers who manage either the IT business unit or an IT group and who are more interested in the details supporting the executive dashboard than users in the TBSM_Executives role.
TBSM Executives IT
irbItExec irbMarketingExec irbServiceDeskExec irbTradeApplSrMgr irbTradeDbaMgr irbTradeInfraSrMgr irbTradeSysSupMgr irbTradeWebInfraMgr irbUserExpSrMg irbOper1 irbOper2
TBSM_Operators
Note: Each IBM Tivoli Business Systems Manager dashboard user needs a user ID to view his or her dashboard.
SLMSupp OffrgSpl
SLMSupp OffrgSpl
The second group deals with the reports and their usage. Table 5-8 lists these users. They are created using the command line interface provided by IBM Tivoli Service Level Advisor V2.1. Refer to Command Reference for IBM Tivoli Service Level Advisor, SC32-0833-03, for usage details.
Table 5-8 TSLA report users

Local OS user role: SLMAdmin
TSLA user role: SLMAdmin
TSLA role description: SLM Administrator for the TSLA reports. This is equivalent to the operator role specified for TSLA reports.

Local OS user role: itexec
TSLA user role: itexec
TSLA role description: This user role gets the executive summary report for the realm IT Division.

Local OS user role: mktgexec
TSLA user role: mktgexec
TSLA role description: This user role receives the executive summary report for the customer marketing executive.

Local OS user role: inframgr
TSLA user role: inframgr
TSLA role description: This user role receives the executive summary report for the customer IRB Trade Infrastructure Sr Mgr.

Local OS user role: dbamgr
TSLA user role: dbamgr
TSLA role description: This user role receives the executive summary report of all database servers.
TSLA role description
- This user receives the executive summary report of all the servers for their hardware and operating system performance.
- This user gets the executive summary report of all the Web infrastructure performance.
- This user receives the executive summary report of the application production environment regarding its availability and performance.
IBM Tivoli Business Systems Manager Distributed generic object types are defined using the gemgenprod command. An icon is assigned to each object type using the LoadGEMIcons command. Refer to IBM Tivoli Business Systems Manager Command Reference Guide, SC32-1243, for additional details about these commands. Example 5-1 lists the commands that were executed to define the IBM Tivoli Business Systems Manager resource types for this case study scenario.
Example 5-1 TBSM object type definition
gemgenprod -m TBSM -p ExtUserIncident -v 1.0
LoadGEMIcons -p ExtUserIncident -v 1.0 -f ../cid_transactionServer_32.gif
gemgenprod -m TBSM -p IntUserIncident -v 1.0
LoadGEMIcons -p IntUserIncident -v 1.0 -f ../cid_transactionServer_32.gif
gemgenprod -m TBSM -p UserTransaction -v 1.0
LoadGEMIcons -p UserTransaction -v 1.0 -f ../cid_event_32.gif
gemgenprod -m TBSM -p WebServer -v 1.0
LoadGEMIcons -p WebServer -v 1.0 -f ../cid_webServer_32.gif
gemgenprod -m TBSM -p WebApplServer -v 1.0
LoadGEMIcons -p WebApplServer -v 1.0 -f ../cid_webServer_32.gif
gemgenprod -m TBSM -p DB2Server -v 1.0
LoadGEMIcons -p DB2Server -v 1.0 -f ../cid_databaseServer_32.gif
gemgenprod -m TBSM -p MSSQLServer -v 1.0
LoadGEMIcons -p MSSQLServer -v 1.0 -f ../cid_databaseServer_32.gif
gemgenprod -m TBSM -p UnixServer -v 1.0
LoadGEMIcons -p UnixServer -v 1.0 -f ../cid_server_32.gif
gemgenprod -m TBSM -p LinuxServer -v 1.0
LoadGEMIcons -p LinuxServer -v 1.0 -f ../cid_system_32.gif
gemgenprod -m TBSM -p WintelServer -v 1.0
LoadGEMIcons -p WintelServer -v 1.0 -f ../cid_system_32.gif
For example, the last two commands in Example 5-1 define an IBM Tivoli Business Systems Manager distributed resource type called WintelServer and then assign it the icon specified by the file cid_system_32.gif. If an icon is not specified, IBM Tivoli Business Systems Manager assigns a default icon.
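Because each resource type needs the same pair of commands, the list in Example 5-1 can be generated from a table of type names and icon files. The Python sketch below only builds and prints command lines; it does not run them. It reuses the options exactly as they appear in Example 5-1 (-m, -p, -v, -f). The RESOURCE_TYPES table and the definition_commands function are hypothetical helpers introduced for illustration.

```python
# (resource type, icon file) pairs taken from Example 5-1.
RESOURCE_TYPES = [
    ("WebServer", "cid_webServer_32.gif"),
    ("DB2Server", "cid_databaseServer_32.gif"),
    ("WintelServer", "cid_system_32.gif"),
]


def definition_commands(resource_types, version="1.0", icon_dir=".."):
    """Build the gemgenprod and LoadGEMIcons command lines for each type."""
    commands = []
    for name, icon in resource_types:
        # First command defines the object type, second assigns its icon.
        commands.append(f"gemgenprod -m TBSM -p {name} -v {version}")
        commands.append(f"LoadGEMIcons -p {name} -v {version} -f {icon_dir}/{icon}")
    return commands


if __name__ == "__main__":
    for line in definition_commands(RESOURCE_TYPES):
        print(line)
```

Generating the pairs from one table keeps the type name and version identical in both commands, which is easy to get wrong when the lines are typed by hand.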
Note that bc1srv7.itso.ral.ibm.com is the IBM Tivoli Business Systems Manager Database Server, and bc1srv5 is the IBM Tivoli Enterprise Console Server for the IRBTrade Company. To generate an initial set of instances of various resource types that are part of IRBTrade Company, we must issue a sequence of ihstttec commands. Each ihstttec command generates an event to IBM Tivoli Business Systems Manager that associates the resource with its resource type. Example 5-3 shows a sample of the commands used for IRBTrade Company.
Example 5-3 IRBTrade Company sample TBSM initial discovery commands
# Event for DB2 Servers - Repeat command for every DB2 server
d:/tbsm/TDS/EventService/ihstttec.exe -b 'DB2Server;1.0' -i 'bc1srv12.itso.ral.ibm.com' -d 'DB2Server' -h 'bc1srv12.itso.ral.ibm.com' -p 'CreateDB2ServerInstance' -s 'HARMLESS' -m 'Event to create the instance'
# Event for WebSphere Servers - Repeat command for every WebSphere server
d:/tbsm/TDS/EventService/ihstttec.exe -b 'WebApplServer;1.0' -i 'bc1srv11.itso.ral.ibm.com' -d 'WebApplServer' -h 'bc1srv11.itso.ral.ibm.com' -p 'CreateWebApplServerInstance' -s 'HARMLESS' -m 'Event to create the instance'
# Event for HTTP Servers - Repeat command for every HTTP server
d:/tbsm/TDS/EventService/ihstttec.exe -b 'WebServer;1.0' -i 'bc1srv35.itso.austin.ibm.com' -d 'WebServer' -h 'bc1srv35.itso.austin.ibm.com' -p 'CreateWebServerInstance' -s 'HARMLESS' -m 'Event to create the instance'
# Event for WinTel Servers - Repeat command for every WinTel server
d:/tbsm/TDS/EventService/ihstttec.exe -b 'WintelServer;1.0' -i 'bc1srv11.itso.ral.ibm.com' -d 'WintelServer' -h 'bc1srv11.itso.ral.ibm.com' -p 'CreateWintelServerInstance' -s 'HARMLESS' -m 'Event to create the instance'
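Since the discovery command is repeated for every server, it can help to generate the command lines from an inventory. The Python sketch below builds, but does not execute, one ihstttec command per host, using only the options that appear in Example 5-3 (-b, -i, -d, -h, -p, -s, -m). The INVENTORY table and the discovery_command function are hypothetical; the executable path is the one used in this scenario and will differ in other installations.

```python
IHSTTTEC = "d:/tbsm/TDS/EventService/ihstttec.exe"  # path used in Example 5-3

# Hypothetical inventory: resource type -> hosts (sample hosts from Example 5-3).
INVENTORY = {
    "DB2Server": ["bc1srv12.itso.ral.ibm.com"],
    "WebApplServer": ["bc1srv11.itso.ral.ibm.com"],
}


def discovery_command(resource_type, host, version="1.0"):
    """Return the ihstttec command line that creates one resource instance."""
    return (
        f"{IHSTTTEC} -b '{resource_type};{version}' -i '{host}' "
        f"-d '{resource_type}' -h '{host}' "
        f"-p 'Create{resource_type}Instance' -s 'HARMLESS' "
        f"-m 'Event to create the instance'"
    )


if __name__ == "__main__":
    for resource_type, hosts in INVENTORY.items():
        for host in hosts:
            print(discovery_command(resource_type, host))
```

Printing the commands first, rather than executing them directly, leaves room to review the list before any events are sent to IBM Tivoli Business Systems Manager.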
After IBM Tivoli Business Systems Manager receives and processes the events, each resource is placed in the IBM Tivoli Business Systems Manager Console under its respective resource type. Figure 5-8 shows a sample of the resources and resource types defined for IRBTrade Company.
Figure 5-8 Sample Resources view after the initial discovery: Topology view
Figure 5-9 shows the various IRBTrade Company IBM Tivoli Business Systems Manager resources that are created as a result of the initial discovery. These resources are used in creating the IRBTrade Company business systems.
Note: For these commands to create an initial set of IRBTrade Company resource instances in the IBM Tivoli Business Systems Manager database, you must define the IBM Tivoli Business Systems Manager database server as one of the event enablers by using the gemEEConfig command, as shown in Example 5-2.
Figure 5-9 Sample Resources view after the initial discovery: Table view
2. After the lower-level business systems are defined, the higher-level business systems are defined, and the lower-level business systems are associated with them to build the hierarchy of business systems. For example, the higher-level business systems defined include:
- IRBTrade Company User Experience
- IRBTrade Company Infrastructure Database Server Support
- IRBTrade Company Marketing
- IRBTrade Company Research
- IRBTrade Company Infrastructure
The strategy is to create the lower-level business systems first and then use them to build higher-level business systems by association. In the IRBTrade Company scenario, all business systems are created using the IBM Tivoli Business Systems Manager Java Console. They are not created using the Automatic Business Systems (ABS) configuration file and ABS commands. The ABS method is appropriate when the resource mapping can be determined consistently by resource type, or by some other supported resource attribute, based on a well-defined naming convention. In our scenario, the IRBTrade Company infrastructure business system and its lower-level business systems could have been defined with the ABS method instead of manually in the console.
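To make the idea of convention-based mapping concrete, the following Python sketch assigns a resource to a business system from its resource type and a name pattern. It is only an illustration of the principle: the rule table and function are hypothetical, the patterns are assumed, and nothing here reflects the actual ABS configuration file syntax or ABS commands. The business system names are taken from Figure 5-18.

```python
import fnmatch

# Hypothetical rules: (resource type, name pattern, business system).
# This is NOT the ABS configuration file format; it only illustrates mapping
# by resource type plus a naming convention.
RULES = [
    ("DB2Server", "bc1srv*", "IRB Trade Infra Database Server Support"),
    ("WebServer", "bc1srv*", "IRB Trade Infra Web Server Support"),
    ("WebApplServer", "bc1srv*", "IRB Trade Infra Web Application Server Support"),
    ("WintelServer", "bc1srv*", "IRB Trade Infra Wintel System Support"),
]


def map_resource(resource_type, name, rules=RULES):
    """Return the business system of the first matching rule, or None."""
    for rule_type, pattern, business_system in rules:
        if rule_type == resource_type and fnmatch.fnmatchcase(name, pattern):
            return business_system
    return None


if __name__ == "__main__":
    print(map_resource("DB2Server", "bc1srv12.itso.ral.ibm.com"))
```

A resource that matches no rule returns None, which corresponds to a resource that would still have to be placed manually in the console.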
Figure 5-10 shows all IBM Tivoli Business Systems Manager business systems defined for IRBTrade Company in the left pane, and the high-level business systems in the right pane.
The following sections go into more detail about the main business systems created for IRBTrade Company. Later in this chapter, SLAs are defined based on these business systems.
Figure 5-18 shows the main executive dashboard definitions for IRBTrade Company.
Executive level
  IRB Trade CEO Executive: IRB Trade IT Division, IRB Trade Marketing, IRB Trade Research
  IRB Trade IT Executive: IRB Trade Application, IRB Trade Infrastructure, IRB Trade Service Desk
  IRB Trade Marketing Executive: IRB Trade User Load, IRB Trade User Experience
Management level
  IRB Trade IT Infrastructure Manager: IRB Trade Infra Web Server Support, IRB Trade Infra Web Application Server Support, IRB Trade Infra Database Server Support, IRB Trade Infra Unix System Support, IRB Trade Infra Wintel System Support
  IRB Trade Service Desk Manager: IRB Trade External Customer Incident Management, IRB Trade Internal Customer Incident Management
  IRB Trade Application Manager: IRB Trade Availability, IRB Trade Web Servers, IRB Trade Web Application Servers, IRB Trade Database Servers, IRB Trade Unix Servers, IRB Trade Wintel Servers
  IRB Trade Marketing Manager User Experience: IRB Trade Application Availability, IRB Trade Customer Help desk Experience, IRB Trade General Web Site Response or Experience, IRB Trade On-line Quote Response time, IRB Trade On-line Sell Transaction Response time, IRB Trade On-line Buy Transaction Response time
Operational level
  IRB Trade Web Infrastructure Support Manager: IRB Trade Infra Web Server Support, IRB Trade Infra Web Application Server Support
  IRB Trade OS Support Manager
  IRB Trade DBA Support Manager: IRB Trade DB2 Servers Support, IRB Trade MSSQL Servers Support
Figure 5-18 Identified executive dashboards and services for IRBTrade Company
Each of the IBM Tivoli Business Systems Manager business systems has a service, an executive dashboard service, and an SLA supported service. Depending on the role, the appropriate IBM Tivoli Business Systems Manager business views or services are included in each dashboard. The left pane of Figure 5-19 lists all business systems related to IRBTrade Company. The right pane lists the executive dashboards with the appropriate business systems or services assigned to each dashboard user.
The figures in the following sections show the executive dashboards of some of the key players at IRBTrade Company in this case study scenario.
Figure 5-28 IRB Trade Web Infrastructure Support Manager executive dashboard
Reports are provided to the customers listed in Table 5-10 where SLAs and OLAs are involved. Some of the SLAs and OLAs are intended to measure the quality of the key infrastructure subsystems delivered by the infrastructure support teams.
Table 5-10 IRBTrade Company customers and providers of SLAs
Description: Trading user experience availability and performance; User level customer support; Trade application availability and performance; DB server availability and performance; Web infrastructure availability and performance; Hardware and operating systems availability and performance; Service desk
Customer: Marketing executive; Marketing executive; Trade application manager; Infra senior manager; Infra senior manager
Provider: IT executive; IT executive; IT Infra senior manager; DBA managers; Web infrastructure manager; Infra system support manager; IT executive
Type: SLA; SLA; OLA; OLA; OLA; OLA; OLA
Business systems: IRB Trade User Experience; IRB Trade Infra Database Service Support; IRB Trade Infra Web Server Support and IRB Trade Infra Web Application Server Support; IRB Trade Wintel System Support
The SLOs were divided into the following subgroups to match the business systems defined earlier:
- SLOs for database servers
- SLOs for Web infrastructure servers, for example, HTTP servers and Web application servers
- SLOs for operating system level performance. This is defined for the server part of the Wintel business systems.
- SLOs for service desk
- SLOs for availability of defined business systems
- SLOs for the IRB Trade Application business system
- SLOs for the user experience business system
Table 5-12 SLOs for Web servers
Service level objective | Breach condition | Schedule period
Apache server running | Average < 99.99% | Critical: 9 a.m. to 4 p.m. Monday through Friday
Apache server running | Average < 99.50% | All other times
Apache Web site running | Average < 99.99% | Critical: 9 a.m. to 4 p.m. Monday through Friday
Apache Web site running | Average < 99.50% | All other times
Apache failed pages | Average > 4 | Critical: 9 a.m. to 4 p.m. Monday through Friday
Apache failed pages | Average > 7 | All other times
The Web application servers used by IRBTrade Company run IBM WebSphere. The SLOs for the Web application servers cover the total used Java Virtual Machine (JVM) memory, the state of the IBM WebSphere administration server and IBM WebSphere application server, the average Enterprise JavaBean (EJB) response time, and the number of live servlet sessions. Table 5-13 lists the SLOs for IRBTrade Company's Web application servers.
Table 5-13 Web application server SLOs
Service level objective | Breach condition | Schedule period
WebSphere used JVM memory | Average > 512 MB | Critical: 9 a.m. to 4 p.m. Monday through Friday
WebSphere used JVM memory | Average > 512 MB | All other times
WebSphere server state up | Average < 99.99% | Critical: 9 a.m. to 4 p.m. Monday through Friday
WebSphere server state up | Average < 99.50% | All other times
Average EJB response time | Average > 350 msec | Critical: 9 a.m. to 4 p.m. Monday through Friday
Average EJB response time | Average > 450 msec | All other times
Number of live servlet sessions | Average > 20000 | Critical: 9 a.m. to 4 p.m. Monday through Friday
Number of live servlet sessions | Average > 15000 | All other times
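Each row in these SLO tables pairs a breach direction (average below or above a limit) with a limit that depends on the schedule period. The following Python sketch shows how such a rule could be evaluated against a set of samples. It is a simplified illustration under stated assumptions, not the evaluation logic of IBM Tivoli Service Level Advisor: the class, the function, and the sample values are hypothetical, and only the limits come from Table 5-13.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Slo:
    name: str
    direction: str         # "less": breach if average < limit; "greater": breach if average > limit
    critical_limit: float  # limit during the critical schedule period
    standard_limit: float  # limit at all other times


def is_breached(slo, samples, period):
    """Compare the average of the samples with the limit for the given period."""
    average = mean(samples)
    limit = slo.critical_limit if period == "critical" else slo.standard_limit
    if slo.direction == "less":
        return average < limit
    return average > limit


# Limits taken from Table 5-13.
EJB_RESPONSE = Slo("Average EJB response time (msec)", "greater", 350, 450)
SERVER_STATE = Slo("WebSphere server state up (%)", "less", 99.99, 99.50)

if __name__ == "__main__":
    samples = [300, 420]  # hypothetical response times, average 360 msec
    print(is_breached(EJB_RESPONSE, samples, "critical"))
    print(is_breached(EJB_RESPONSE, samples, "standard"))
```

With these hypothetical samples, the same average of 360 msec breaches the SLO during the critical period but not during the standard period, which is the effect of defining separate limits per schedule period.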
Table 5-17 presents the SLOs for availability as defined per business system.
Table 5-17 Business systems SLOs
Service level objective | Breach condition | Schedule period
Availability | Average < 99.99% | Critical: 9 a.m. to 4 p.m. Monday through Friday
Availability | Average < 99.50% | All other times
metrics of each monitoring application chosen to derive the SLOs defined in this case study scenario.
Metric name | Metric description
Average EJB response time | Average total method response time for the remote methods of the bean for the cycle
Live servlet session | Number of concurrently live servlet sessions
Web application server state up | Percentage of time the Web application server is up and running
1. Launch a command window.
2. Change the directory to the location of the IBM Tivoli Service Level Advisor installation and source the IBM Tivoli Service Level Advisor environment by issuing the following command:
   slmenv.bat
3. Run the following command to find the source applications that were added:
   scmd etl getApps
4. Enable the data sources using the sequence of scmd commands and the AVA codes (Table 5-25) as shown in the following example. The syntax of the scmd commands is:
   scmd etl addapplicationdata <avacode> <avacode description/Monitoring Application>
   (to add any new source applications if they are not present in the output of step 3)
   scmd etl enable <avacode>
AVA codes to be enabled
AMY
CTD
GWA
IZY
BWM, MODEL1*
GTM, MODEL1*
* The MODEL1 AVA code is part of the new Tivoli Common Data Model V1 and must also be enabled in IBM Tivoli Service Level Advisor.
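Because the enable command is repeated once per AVA code, the command sequence can be generated from the list of codes. The following Python sketch builds the scmd etl enable command lines for the codes listed above, listing MODEL1 only once even though it accompanies both BWM and GTM. It only produces the command strings, using the syntax shown in step 4; the function name is hypothetical, and the sketch does not call scmd itself.

```python
# AVA codes to be enabled, as listed above. MODEL1 accompanies both BWM and
# GTM, so it appears twice in the input and is de-duplicated below.
AVA_CODES = ["AMY", "CTD", "GWA", "IZY", "BWM", "MODEL1", "GTM", "MODEL1"]


def enable_commands(codes):
    """Return one 'scmd etl enable' command per distinct code, keeping order."""
    seen = []
    for code in codes:
        if code not in seen:
            seen.append(code)
    return [f"scmd etl enable {code}" for code in seen]


if __name__ == "__main__":
    for command in enable_commands(AVA_CODES):
        print(command)
```

The printed lines can be pasted into the command window opened in step 1 after the environment has been sourced with slmenv.bat.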
Schedule the WEPs for the monitoring applications so that they run one after another. For example, if the daily roll-up of data by IBM Tivoli Monitoring into its database finishes at 01:00, schedule the AMX ETL for 01:30 every day. This ensures that data collection for all IBM Tivoli Monitoring applications is complete. Then schedule the IBM Tivoli Service Level Advisor WEPs (DYK) for 30 minutes after the AMX ETL completes. Always schedule the WEPs so that only one WEP runs at a time. After the IBM Tivoli Service Level Advisor WEP runs successfully, the metrics described in this section are available in the IBM Tivoli Service Level Advisor databases (DYK_CAT and DYK_DM).
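The sequencing rule above (one WEP at a time, with a 30-minute gap after the previous one completes) can be worked out as simple date arithmetic. The following Python sketch computes start times from assumed run durations. The durations and the date are hypothetical placeholders; in practice the duration of each run would be taken from the observed run times in your environment.

```python
from datetime import datetime, timedelta


def sequence_weps(first_start, jobs, gap=timedelta(minutes=30)):
    """Return (name, start time) pairs so that each job starts `gap` after
    the expected end of the previous one.

    jobs is a list of (name, expected_duration) tuples in run order.
    """
    schedule = []
    start = first_start
    for name, duration in jobs:
        schedule.append((name, start))
        start = start + duration + gap
    return schedule


if __name__ == "__main__":
    # The roll-up finishes at 01:00, so the AMX ETL starts at 01:30.
    # The 45- and 20-minute durations are assumptions for illustration.
    plan = sequence_weps(
        datetime(2004, 10, 1, 1, 30),
        [("AMX", timedelta(minutes=45)), ("DYK", timedelta(minutes=20))],
    )
    for name, start in plan:
        print(name, start.strftime("%H:%M"))
```

Under these assumed durations, the AMX ETL runs from 01:30 to 02:15 and the DYK WEP is scheduled for 02:45.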
Creating schedules
Schedules are made up of one or more periods, each with a start and end time. IBM Tivoli Service Level Advisor categorizes schedules into business schedules and auxiliary schedules. A business schedule can contain one or more auxiliary schedules. An auxiliary schedule is used to specify maintenance periods and holidays. Each schedule period represents the SLO that applies in that period. Because the business model for IRBTrade Company treats 9 a.m. to 4 p.m., Monday through Friday, as the critical period, create a business schedule with this period marked as critical. This business schedule also contains a maintenance schedule and a holiday schedule, which are created as auxiliary schedules.
To create the auxiliary schedules for IRBTrade Company, follow these steps:
1. Launch the IBM Tivoli Service Level Advisor Administration console.
2. Select Manage Schedules → Create in the portfolio.
3. Name the schedule (for example, Maintenance schedule) and specify it as type auxiliary schedule.
4. Click the Create button to create a schedule period.
   a. Specify the No Service option.
   b. Select the interval from 00:00 hrs to 14:59 hrs.
   c. Set the frequency to every first Saturday of the month.
5. Similarly, create the holiday schedule, specifying the holidays of the site.
After you create the auxiliary schedules, use the following steps to create the business schedule for IRBTrade Company.
1. On the IBM Tivoli Service Level Advisor Administration console, select Manage Schedules → Create.
2. Name the schedule IRB Trade Business Schedule and specify it as type business schedule.
3. Add the previously created auxiliary schedules to the business schedule. We define two auxiliary schedules. One has a period of no service on the first Saturday of every month. The other has a no-service period on predefined public holidays. Select Manage Schedules → Add and select the auxiliary schedules.
4. Define the business schedule periods.
   a. Under the Define the schedule state to be active during unspecified periods option, select Standard.
   b. Select Create a new schedule period.
   c. Mark this period as Critical.
   d. Select the start time as 9:00 and end time as 15:59.
   e. Select the frequency as Weekly.
   f. Deselect Saturday and Sunday.
   g. Proceed to the next page and select Finish to complete the schedule creation.
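The combined effect of the business schedule and its auxiliary schedules can be summarized as a function that returns the schedule state for any point in time. The following Python sketch models the IRB Trade Business Schedule as described in the steps above. It makes several assumptions that the text does not state: auxiliary no-service periods take precedence over business periods, a holiday counts as no service for the whole day, and the holiday list shown is a hypothetical placeholder. It is a model for reasoning about the schedule, not a representation of how the product stores it.

```python
from datetime import date, datetime, time

# Hypothetical holiday list; replace with the holidays of the site.
HOLIDAYS = {date(2004, 12, 25)}


def is_first_saturday(day):
    """True if the date is the first Saturday of its month."""
    return day.weekday() == 5 and day.day <= 7


def schedule_state(moment, holidays=HOLIDAYS):
    """Return 'No Service', 'Critical', or 'Standard' for a point in time."""
    day, clock = moment.date(), moment.time()
    # Assumption: auxiliary (no-service) periods override business periods.
    if day in holidays:
        return "No Service"
    if is_first_saturday(day) and clock < time(15, 0):  # 00:00 to 14:59
        return "No Service"
    # Critical period: 9:00 to 15:59, Monday through Friday.
    if day.weekday() < 5 and time(9, 0) <= clock < time(16, 0):
        return "Critical"
    # State during unspecified periods.
    return "Standard"


if __name__ == "__main__":
    print(schedule_state(datetime(2004, 10, 4, 10, 0)))  # a Monday morning
    print(schedule_state(datetime(2004, 10, 2, 8, 0)))   # first Saturday of October 2004
```

A function like this is also a convenient way to check which breach value (critical or standard) applies to a given measurement timestamp.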
For our case study scenario, we must create the following business schedules. Their periods and auxiliary schedules are similar to those of the IRB Trade business schedule. Separate schedules are created for maintainability. For example, if the business schedule for the IRB Trade DB Servers business unit must be changed, changing that one schedule affects only the corresponding SLA. The business schedules are:
- IRB Trade DBSchedule
- IRB Trade WebSchedule
- IRB Trade OSSchedule
- IRB Trade SDSchedule
- IRB Trade Availability Schedule
- IRB Trade User Exp Schedule
To create these schedules, complete these tasks:
1. Select Manage Schedules.
2. Select IRB Trade Business Schedule.
3. Select the Create Like option.
Figure 5-31 shows an example of a realm created for our case study scenario.
Table 5-26 lists the customers to be defined for IRBTrade Company and their respective realm relationships.
Table 5-26 Customer and realms relationships Customer name IRB Network Manager IRB Infra DBA Manager IRB Infra Sys Support Manager IRB Web Infrastructure Manager IRB WebServer Administrator IRB WebAppServer Administrator IRB Trade Application Manager IRB Trade Development Manager IRB.IT infrastructure Marketing executive Service desk IRB.Marketing Division IRB.IT Division IRB.Marketing Division IRB.IT Infrastructure IRB Web Infrastructure Manager IRB.IT Infrastructure IRB Web Infrastructure Manager IRB.IT Division Realm IRB.IT Infrastructure
Customers are created in IBM Tivoli Service Level Advisor using this process:
1. Launch the IBM Tivoli Service Level Advisor Administration console.
2. Select Create Customer.
3. Provide the customer name and a description. For example, we type the customer name IRB Infra DBA Administrator and the description Manages all the DB2 Servers in the Organization. Then click Next.
4. Because we must relate this customer to a realm, click Add.
5. Choose the appropriate realm. In this example, because the IRB Infra DBA Administrator customer belongs to the IT infrastructure, we select the realm IRB.IT Infrastructure. Click Next.
6. Click Next again to reach the Summary page.
7. On the Summary page, click Finish to finalize the customer creation.
Table 5-27 Business systems and offerings relationship
Business system | Offerings
IRB Trade Database Server | IRB DB Offering
IRB Trade WebServer | IRB Web Server Offering
IRB Trade Web Application Servers | IRB WebApp Offering
IRB Trade Wintel Servers | IRB SysSupport Offering
IRB Trade Availability | The offerings are included in a tiered SLA for the resources that the trade application is running and name it the IRB TradeApplication business system offering.
User Experience | IRB User Experience business system offering includes the user experience metrics of TMTP.
The offerings are created using the IBM Tivoli Service Level Advisor Administration console using the process shown in Figure 5-33.
Name Offering → Select Metrics → Publish Offering
The following process creates an offering, using the DBServer offering as an example:
1. Launch the IBM Tivoli Service Level Advisor Administration console.
2. In the portfolio, select Manage Offerings → Create.
3. Type a name for the offering, for example IRB DB Offering. Click Next.
4. For SLA type, select Internal and click Next.
5. Select the Use an existing business schedule option. Select an existing business schedule for the offering. In our case, we selected the business schedule IRB Trade DB Schedule. Click Next.
6. The next page displays a resource type tree and the resource types that are available. Expand the resource type tree and select the appropriate resource type for the offering. Figure 5-34 shows the selection of the resource type for our example. Click Next.
7. Add the metrics for the offering. Depending on the chosen resource type, available metrics are presented. Select the appropriate metric and define the breach values for the metric. Figure 5-35 shows the breach selection for our example. Click Next.
8. Define trend analysis and the evaluation frequency for the offering. Select the appropriate evaluation frequency and select the Advanced Metric Settings check box. See Figure 5-36. Click Next.
9. In the Advanced Metric Settings panel (Figure 5-37), under the Intermediate Evaluations section, select the check box. Then define the Trend Analysis and Current Evaluation period. Provide a name for this metric, for example, DB2 Distributed Instance-DB2Up.
10. Similarly, define the other SLOs. For the Resource Type Tree object, choose DB2 Distributed Database. Define the other details listed in Table 5-28.
11. Publish the offering after all of the metrics are configured.
The following tables provide details about the various offerings that are created for IRBTrade Company in our case study scenario. IRB DB Offering: Define this using the resource type tree object, resource type, metrics, breach values, and condition listed in Table 5-28.
Table 5-28 DBServer offering settings Resource type DB2 Distributed Database DB2 Distributed Database DB2 Distributed Database DB2 Distributed Instance Metric Index Hits Average value 90 - Critical 85 - Standard (%) 90 - Critical 85 - Standard (%) 90 - Critical 80 - Standard 99.9 - Critical 99.50 - Standard Breach condition Average greater than supplied average Average greater than supplied average Average greater than supplied average Average less than supplied average
IRB Web Server Offering: Choose the resource type tree object as [Root] and the resource type, metrics, breach values, and conditions listed in Table 5-29.
Table 5-29 Web server resource types, metrics, breach conditions
Resource type | Metric | Average value | Breach condition
IBM HTTP Server (powered by Apache) | Server Running | 99.99 - Critical, 99.50 - Standard (%) | Average value less than supplied average
Apache HTTP Web site | Web Site Running | 99.99 - Critical, 99.50 - Standard (%) | Average value less than supplied average
Apache HTTP Web site | Failed Pages | 1 - Critical, 3 - Standard (Quantity) | Average value greater than supplied average
IRB WebApp Offering: Select the resource type tree object as IBM WebSphere Administration Server and the resource type, metrics, breach values, and condition listed in Table 5-30.
Table 5-30 Web Administration Server resource types, metrics, breach conditions
Resource type | Metric | Average value | Breach condition
IBM WebSphere Administration Server | WebSphere Server state up | 99.99 - Critical, 99.50 - Standard (%) | Average value less than supplied average
IBM WebSphere application server | Average EJB Response Time | 350 - Critical, 450 - Standard (msec) | Average value greater than supplied average
Select the IBM WebSphere Application Server as a resource type tree object and the resource type, metrics, breach values, and condition listed in Table 5-31.
Table 5-31 Web Application Server resource types, metrics, breach conditions Resource type IBM WebSphere Java Virtual Machine IBM WebSphere Servlet Session Metric Used Memory Average value 536870912 - Critical 536870912 - Standard (Bytes) 20000 - Critical 15000 - Critical (Quantity) Breach condition Average value greater than supplied average Average value greater than supplied average
IRB SysSupport Offering: Select the resource type tree object as Host Monitored by ITM and resource type, metrics, breach values, and conditions listed in Table 5-32.
Table 5-32 Wintel server resource types, metrics, breach conditions
Resource type | Metric | Average value | Breach condition
Logical Disk | Free space on Logical Disk | 10 - Critical, 10 - Standard (%) | Average value less than supplied value
Memory | Total Available Memory | 64 - Critical, 64 - Standard (MB) | Average value less than supplied value
System Processor | Processor Time Used By the process | 70 - Critical, 60 - Standard | Average value greater than supplied average
IRB User Experience: Select the resource type tree as [Root] and resource type, metrics, breach values, and conditions listed in Table 5-33.
Table 5-33 User experience resource types, metrics, breach conditions
Resource type | Metric | Average value | Breach condition
Transaction Node (Measurement source is Tivoli Common Data Model V1) | Response Time | 300 - Critical, 400 - Standard (msec) | Average value greater than supplied average
Transaction Node | Successful Transactions | 99.9 - Critical, 99.2 - Standard (%) | Average value less than supplied value
After we create all of the offerings, we see the Manage Offerings page in the IBM Tivoli Service Level Advisor Administration Console (Figure 5-38).
SLAs are created using the IBM Tivoli Service Level Advisor Administration Console and following the process illustrated in Figure 5-39.
Name SLA → Select Customer → Select Service → Select Offering → Add Resources
For our case study scenario, the SLAs that we define can be mapped to the business systems defined in IBM Tivoli Business Systems Manager, because we created the SLAs from offerings that reflect these business units. These SLAs can be further divided into two groups.
SLAs that are mapped to the lower-level business systems (Table 5-34): These form the infrastructure of the organization.
Table 5-34 SLAs that are mapped to the low level business systems
SLA name | Description
IRBInfraDBServer SLA | The SLA for all the database servers in the organization
IRBInfraWintelSeverSLA | SLA for all Windows servers in the organization
IRBInfraWebServer SLA | SLA for the Web servers in the organization
IRBWebAppServer SLA | SLA for all Web application servers in the organization
IRBTradeUserExperience SLA | SLA for the success and response time of the transactions
IRBTradeDBServer SLA | SLA of the database servers hosting the trade application
IRBTradeWintelServer SLA | SLA for the Windows servers hosting the trade application
IRBTradeWebServer SLA | SLA for the Web servers hosting the trade application
IRBTradeWebAppServer SLA | SLA for the Web application servers hosting the trade application
SLAs that are mapped to the higher level business systems (Table 5-35): Here we use the tiered SLA function explained in Chapter 4, Planning to implement service level management using Tivoli products on page 109.
Table 5-35 SLAs that were mapped to the higher level business systems
SLA name | Description
IRBUserExperience business system SLA | Maps resource of the user experience business system to the offering that includes IRBTradeUserExperience SLA, the availability metric of the business system
IRBTradeApplication business system SLA | Maps the resource of the trade application business system to the offering that includes IRBTradeDBServer SLA, IRBTradeWintelServerSLA, IRBTradeWebServer SLA, and IRBTradeWebAppServer SLA
IRBInfrastructure SLA | Maps the resource of the IRB Trade IT infrastructure business system to the offering that includes IRBInfraDBServer SLA, IRBInfraWintel Server SLA, IRBInfraWebServer SLA, and IRBInfraWebAppServer SLA
For example, the IRBUserExperience business system SLA is made up of two items:
- One SLA, the IRBTradeUserExperience SLA, that measures the transaction response time and the number of successful transactions. These metrics are obtained from monitoring data collected by IBM Tivoli Monitoring for Transaction Performance.
- The availability of the various business systems that make up the user experience business system. These metrics are obtained using IBM Tivoli Business Systems Manager.
We use tiered SLAs to build this SLA. Tiered SLAs are used to include one or more SLAs in an offering. This enables the tracking of OLAs against underpinning contracts or business systems that depend on these OLAs. To create such a tiered SLA, we use a three-step approach:
1. Create SLAs using transaction response time and successful transaction measurements for each IT infrastructure business system.
2. Create an offering that contains the SLAs defined in step 1.
3. Create an overall SLA for user experience using the offering created in step 2.
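The tiered structure can be pictured as a small tree: a higher-level SLA has its own metrics plus the lower-level SLAs included through its offering. The following Python sketch models that structure. The roll-up rule it uses (the higher-level SLA counts as violated when any of its own metrics or any included SLA is violated) is an assumption made for illustration; the text does not describe how IBM Tivoli Service Level Advisor combines the results, and the class and field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Sla:
    name: str
    # Metric name -> True if that metric is currently breached.
    metric_breaches: dict = field(default_factory=dict)
    # Lower-level SLAs included through the offering (the tiers).
    included: list = field(default_factory=list)

    def violated(self):
        """Assumed roll-up rule: violated if any own metric or included SLA is."""
        own = any(self.metric_breaches.values())
        return own or any(sla.violated() for sla in self.included)


if __name__ == "__main__":
    lower = Sla(
        "IRBTradeUserExperience SLA",
        {"Response Time": False, "Successful Transactions": True},
    )
    upper = Sla(
        "IRBUserExperience business system SLA",
        {"Availability": False},
        [lower],
    )
    print(upper.violated())
```

In this hypothetical state, the availability metric is healthy, but the included SLA has a breached metric, so the higher-level SLA reports a violation under the assumed rule.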
4. In the next panel (Figure 5-41), select an existing customer to be associated with the SLA, such as IRB Trade Application Manager. Click Next.
5. In the Select Offering panel (Figure 5-42), select the offering to be part of the SLA definition, such as IRB User Experience. Click Next.
7. In the Select Resource List Type panel (Figure 5-43), define the type of resources to add to the SLA. A dynamic resource list groups resources by means of a filter. A static resource list is used when particular resources are to be added. Click Next.
8. In the Filter Resource panel (Figure 5-44), create a filter so that only relevant resources are listed. Select the attribute, condition, and value for the filter. For example, for Attribute, select Name; for Condition, select Contains; and for Value, select Trade. Click Next.
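A resource filter of this kind is an attribute, a condition, and a value applied to each candidate resource. The following Python sketch shows the idea with the Name / Contains / Trade filter from step 8. Only the Contains condition comes from the text; the Equals condition, the function names, and the sample resource names are hypothetical additions for illustration, and the sketch says nothing about how the product implements filtering.

```python
def matches(resource, attribute, condition, value):
    """Apply one filter (attribute, condition, value) to a resource dictionary."""
    actual = str(resource.get(attribute, ""))
    if condition == "Contains":
        return value in actual
    if condition == "Equals":  # hypothetical extra condition
        return actual == value
    raise ValueError(f"Unsupported condition: {condition}")


def filter_resources(resources, attribute, condition, value):
    """Return the resources that satisfy the filter, in their original order."""
    return [r for r in resources if matches(r, attribute, condition, value)]


if __name__ == "__main__":
    # Hypothetical resource names, for illustration only.
    resources = [
        {"Name": "Trade Quote"},
        {"Name": "Trade Sell"},
        {"Name": "Payroll Report"},
    ]
    for resource in filter_resources(resources, "Name", "Contains", "Trade"):
        print(resource["Name"])
```

With the hypothetical data above, the filter keeps the two resources whose names contain Trade and drops the third.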
9. The resources are displayed, and you can select the ones to include in the SLA definition. You can add or change resources in this panel. Resources must be assigned to every metric used in the SLA. For example, our UserExperience offering has two metrics defined, so resources must be assigned to both metrics. Figure 5-45 shows the resources included for the first metric in the offering. Click Next.
10. The Select SLA Start Date panel (Figure 5-46) is displayed. The start date of the SLA is used to evaluate earlier monitoring data so that the SLOs can be verified immediately. If there is no data, choose the default date (the current date). Optionally select the time zone in which the SLA is to be evaluated. Click the Recalculate the First Evaluation button to refresh the first evaluation date based on the SLA start date. Figure 5-46 shows the details used in the UserExperienceSLA definition.
11. The summary of the SLA creation is displayed. Click Finish to complete the SLA creation. If the SLA start date is in the past, the SLA is evaluated immediately.
The following process creates the IRBUserExperience business system offering:
1. Launch the IBM Tivoli Service Level Advisor Administration Console.
2. In the portfolio, select Manage Offerings → Create.
3. Provide a name for the offering, for example, IRBUserExperience business system offering, and a description, such as Offering for the business system that describes the user experience. Click Next.
4. For SLA type, select External, and click Next.
5. In the Include SLAs panel (Figure 5-47), complete these tasks:
   a. Click the Add button.
   b. Select the SLA IRBTradeUserExperience SLA.
   c. Click OK to display the included SLA.
   d. Click Next.
6. In the next panel, select Use an existing business schedule. For the schedule, select IRB Trade UserExperience Schedule. Click Next.
7. Click Add to include the offering components.
8. The Select Resource Type panel (Figure 5-48) is displayed. For Resource Type, select Business System. Click Next.
Figure 5-48 Selecting the business system resource type for the offering
9. Click Add and for the metric, select Availability.
10. Define the breach values for the user experience business system.
   a. For the breach value, specify 99.99.
   b. For the critical period, select Average value less than supplied average.
   c. Define another breach value of 99.20.
   d. For the Standard period, select Average value less than supplied average.
   e. Click Next.
11. In the next panel, complete these tasks:
   a. For Evaluation frequency, select Weekly.
   b. Select Advanced Metric Settings.
   c. Click Next.
12. Complete these tasks:
   a. Select the Perform Intermediate evaluations check box.
   b. For Intermediate evaluation frequency, select Daily.
   c. Finish the SLO creation.
We enable the intermediate evaluations because they evaluate the SLO for the metric from the start of the evaluation period up to the current day. The result is reflected in the SLA reports.
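The effect of combining a weekly evaluation with daily intermediate evaluations can be illustrated with a running average. The following Python sketch recomputes, for each day of the week so far, the average availability since the start of the evaluation period and compares it with the breach value. Treating the intermediate result as a plain running average is an assumption for illustration; the text does not describe the calculation that IBM Tivoli Service Level Advisor performs, and the daily values are hypothetical.

```python
def intermediate_results(daily_values, limit):
    """Return (day number, running average, breached) for each day so far.

    The breach rule follows the offering definition above:
    average value less than the supplied average.
    """
    results = []
    total = 0.0
    for day, value in enumerate(daily_values, start=1):
        total += value
        average = total / day
        results.append((day, average, average < limit))
    return results


if __name__ == "__main__":
    # Hypothetical daily availability percentages for the first three days
    # of a weekly evaluation period, checked against the critical value 99.99.
    for day, average, breached in intermediate_results([100.0, 100.0, 99.9], 99.99):
        print(day, round(average, 3), breached)
```

In this hypothetical week, the first two daily checks pass, and the third shows the running average dropping below 99.99, several days before the weekly evaluation would report it.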
13. Provide a name and a description for the offering component. Optionally, use the default name if it is unique in this offering. Click Next.
14. Select Publish the offering to complete the offering creation.
Figure 5-49 Selecting the service for the SLA being created
We can further enhance the IRBUserExperience business system SLA by adding an SLA for the Number of Live Servlet Sessions metric provided by IBM Tivoli Monitoring for WebSphere. To do this, we use these steps:
1. Create a new offering, IRBUserLoadOffering, and include this metric.
2. Define the breach values and evaluation frequency similar to the IRB User Experience Offering.
3. Create an SLA using the customer name IRB Trade Application Manager.
4. Assign the resources of the trade application.
5. Include this SLA in the IRBUserExperience business system offering.
Doing so gives service details for the user load. This provides the information needed to plan for future load, because a higher load may require extra resources.
Using the IRB TradeApplication business system SLA as an example, we follow a procedure similar to what was explained in the previous example. This requires multiple SLAs defined as described in the previous section. Figure 5-50 shows the list of SLAs. You must define these SLAs with the resources used in the trade application.
After we define the SLAs, we build an SLA that encompasses all of the resources and applications used by the trade application.
1. From the portfolio, click Manage Offerings.
2. Click Create to create an offering.
3. Provide a name, for example, TradeApplicationBSO, and optionally provide a description. Click Next.
4. For SLA Type, select External. Click Next.
5. Click the Add button to add the SLAs. Add the SLAs as listed in Figure 5-50. Click Next.
6. Each SLA that is added appears in the list. Click Next.
7. For the business schedule, select IRB Trade Business Schedule. Click Next.
8. Click Add to include the offering components. For Resource type, select Business System. Click Next.
9. Click Add to include the metrics. Select Availability. Click Next.
10. Define the breach values for the IRB Trade Application business system.
   a. Define a breach value of 99.99.
   b. For Critical period, select Average value less than supplied average.
   c. Define another breach value of 99.20.
   d. For Standard period, select Average value less than supplied average.
   e. Click Next.
11. In the next panel, complete these tasks:
   a. For Evaluation frequency, select Weekly.
   b. Select the Advanced Metric Settings check box.
   c. Click Next.
12. In the next panel, follow these steps:
   a. Select the Perform Intermediate evaluations check box.
   b. Set the intermediate evaluation frequency as Daily.
   c. Finish the SLO creation.
13. Proceed to the next page. Provide a name to the offering component and a description. Optionally, use the default name if it is unique in this offering. Click Next.
14. Click Publish the offering to complete the offering creation.
15. From the portfolio, select Manage SLAs.
16. In the Manage SLAs panel, click the Create button.
17. In the Create SLA panel, type the name IRB TradeApplication BS SLA and optionally type a description. Click Next.
18. For the service, select IRB.Trade.Application. Click Next.
19. For the offering, select IRBTradeApplication BS Offering. Click Next.
20. Click the Add button to include the resources. For Filter type, select static resource filter. Click Next.
21. Create the filter. For Attribute, select Name; for Condition, select Contains; and for Filter value, select IRB.Trade. Click Next.
22. For the resource, select /IRB.Trade.IT.Dividion/IRB.Trade.Application. Click Next.
23. Select SLA start date as 10/01/04, by using the calendar widget that is provided or by typing the value. Choose the default time zone. Click Next.
24. Click the Finish button.
This completes the creation of the IRB TradeApplication BS SLA. Figure 5-51 shows all the SLAs defined for IRBTrade Company in our case study scenario.
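The breach values defined in step 10 can be read as a simple comparison: the SLO is breached when the average availability measured over the evaluation period is less than the breach value supplied for that schedule period. The following Python sketch illustrates that reading only. The function and dictionary names are hypothetical and are not part of IBM Tivoli Service Level Advisor.

```python
# Hypothetical sketch: these names are illustrative, not a TSLA API.
# Breach values from step 10, keyed by business schedule period.
BREACH_VALUES = {"critical": 99.99, "standard": 99.20}


def is_breached(samples, period):
    """Return True when the average availability (in percent) of the
    samples is less than the breach value supplied for the period."""
    if not samples:
        raise ValueError("no availability samples to evaluate")
    average = sum(samples) / len(samples)
    return average < BREACH_VALUES[period]


# Example: six days at 100% and one day at 99% average about 99.86%.
week = [100.0] * 6 + [99.0]
critical_breach = is_breached(week, "critical")
standard_breach = is_breached(week, "standard")
```

With these made-up samples, the weekly average falls below the 99.99 breach value for the critical period but stays above the 99.20 breach value for the standard period.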
Usage example: Monitoring business system views using IBM Tivoli Business Systems Manager Administrative Console
The line manager or the senior DBA responsible for database administrative services may monitor the IBM Tivoli Business Systems Manager business system view (BSV) shown in Figure 5-52. This person may notice that the DB2 server running on bc1srv12.itso.ral.ibm.com is in an exception state and will soon turn red. When this happens, this person takes the appropriate action to correct the problem, which keeps the impact on applications that use the DB2 server (bc1srv12) to a minimum.
The line manager responsible for the IRBTrade Company user experience may monitor the IBM Tivoli Business Systems Manager view shown in Figure 5-53. He or she may notice that the IRBTrade Company customers are experiencing slow response time (yellow/warning condition) when requesting stock quotes and stock trading (sell) online (red/critical condition). By looking at the TBSM event view from this view, the line manager can see that the simulated stock quote and stock sell transactions are exceeding the specified thresholds. These transactions are monitored using IBM Tivoli Monitoring for Transaction Performance playback policy running from the IBM Tivoli Monitoring for Transaction Performance management agent bc1srv6.itso.ral.ibm.com.
The IRBTrade Company IT Executive who is concerned with trade application, IT infrastructure, and service desk may monitor the IBM Tivoli Business Systems Manager executive dashboard shown in Figure 5-56. He or she may notice that the IRB Trade IT Infrastructure service level is trending toward a violation. This trend indicator may provide an opportunity to investigate the underlying issues (by looking at the IBM Tivoli Service Level Advisor reports, for example) and resolve them in time to stop the negative trend and avoid the SLA violation.
Figure 5-56 IRBTrade Company IT executive dashboard with SLA trend indicator
The IRBTrade Company IT Executive who is concerned with trade application, IT infrastructure, and service desk may look at the IBM Tivoli Business Systems Manager executive dashboard shown in Figure 5-57. He or she may realize that the IRB Trade IT Infrastructure and Trade Application service levels violated their SLAs for the previous period.
Figure 5-58 and Figure 5-59 show that the IT infrastructure service is trending toward a violation for the upcoming period as well. This trend indicator may provide an opportunity to further investigate (by looking at the IBM Tivoli Service Level Advisor reports, for example) the underlying issues and resolve them in time to stop the negative trend. This helps to avoid the SLA violation for the upcoming period.
Figure 5-58 IT executive dashboard with an SLA violation and trend indicator
Note: Escalation is enabled in IBM Tivoli Service Level Advisor to send events to the TEC server. You can do this during installation or after installation. Refer to Getting Started with IBM Tivoli Service Level Advisor, SC32-0834-03, for help with enabling escalation after installation. The types of events that are escalated (posted to TEC) are violation of an SLA, trending toward a violation of an SLA, trend cancel for a previously sent trending toward violation event, and application type events. You can configure the TEC server to forward IBM Tivoli Service Level Advisor events to IBM Tivoli Business Systems Manager. The trending evaluation period is set to daily for all SLOs.
The IRBTrade Company Infrastructure senior manager checks the SLA high level report and finds that a trending toward a violation event was escalated during the intermediate evaluation period of 10/11/2004 to 10/14/2004. Refer to Figure 5-61 for the sample report.
Figure 5-61 IRB Infrastructure senior manager's TSLA high level report
The IRBTrade Company Infrastructure senior manager clicks the high level report and sees the details of the trend, as shown in Figure 5-62.
Figure 5-62 Trend details as seen by the IRB Trade Infrastructure senior manager
Further investigation indicates that the trend was due to available memory decreasing steadily toward a violation, as shown in Figure 5-63.
The manager was also informed that the problem may be due to a memory leak in the application on the WebSphere application server and that the development team is looking into it. The trending toward a violation condition is investigated and escalated for immediate attention from the system support group. The system support group finds that the WebApplication server process is the root cause of the problem. The process had higher CPU usage, and the JVM runtime indicated that the memory used was increasing. Figure 5-64 shows the CPU usage of the Java process.
The IRBTrade Company system support manager looks into the details of the intermediate evaluations of the system that is trending toward violation. The manager finds that the total available memory is decreasing day by day and may violate on 10/16/2004 at 8 p.m. The IRBTrade Company system support manager's report provides further details. Refer to Figure 5-63 for the sample report. The problem was transferred to the Web Infrastructure group for further evaluation.
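A projected violation time such as the one above can be estimated from the daily intermediate evaluations. This section does not describe how IBM Tivoli Service Level Advisor computes its trend, so the following Python sketch simply assumes a straight-line (least-squares) fit through equally spaced daily values. The function name and the sample figures are hypothetical.

```python
def projected_violation_day(values, threshold):
    """Fit a least-squares line through equally spaced daily values and
    return the (fractional) day index at which the line reaches the
    threshold, or None if the values are not decreasing.

    Assumption: a linear trend; the product's own algorithm may differ.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two evaluations to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    if slope >= 0:
        return None  # no downward trend, so no projected violation
    intercept = mean_y - slope * mean_x
    return (threshold - intercept) / slope


# Made-up example: available memory in MB for four daily evaluations,
# with a hypothetical SLO threshold of 200 MB.
day = projected_violation_day([500, 450, 400, 350], 200)
```

In this made-up example the memory drops by 50 MB per day from 500 MB, so the fitted line reaches 200 MB on day index 6, counting the first evaluation as day 0.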
Web infrastructure support is informed about the findings. The team looks into the issue and finds that the application hosted on the server in question may be having a memory leak. This was reported to the development team of the application. While the development team investigates the issue (for resolution), the Web infrastructure support group suggests increasing the memory on the system in question so that the SLO is satisfied. This trend event is propagated to the SLO of the trade application because the SLA that is measuring the SLO is the parent of the SLA that is measuring the SLOs of the Wintel servers. Note: The trade application is the business service. The trend is propagated to the executive dashboard of the IT executive, which can result in taking timely action.
The marketing executive logs into the TSLA reports and sees the high level report shown in Figure 5-66.
Figure 5-66 Marketing executive IBM Tivoli Service Level Advisor Reports
The marketing executive drills down into the report and sees the violations of the availability of the business system IRB Trade User Experience and the response time of the trade sell response (Figure 5-67).
At the same time, the IT executive dashboard shows the SLA violations for IRBTrade Company IT Infrastructure (see Figure 5-68). This starts the investigation into the underlying cause. The marketing executive contacts the IT executive and calls for a meeting to discuss the SLA violation.
Figure 5-68 IT executive dashboard with SLA violation and trend indicator
IT executive management logs in to the SLA reports and sees the high level report shown in Figure 5-69.
After drilling through the details, the IT executive management gathers the following information: The violations in the IRBTrade Company Infrastructure are due to the DB2 and WebSphere servers that were hosted on bc1srv12 and bc1srv21. See Figure 5-70. Because the trade application is hosted on these servers, the availability and the user experience SLOs are also affected by this outage.
The outage impacted the availability of the trade application from the end-user experience. The trade application production environment has violations because the DB2 server (bc1srv12) and WebSphere server (bc1srv21) were down. This is indicated in Figure 5-52 on page 293 and Figure 5-54 on page 295. The availability of the trade application suffers and the number of successful transactions was lower than the specified SLO. The IT executive sees the violation report (Figure 5-70).
The Trade Application Manager sees the report for the period shown in Figure 5-71.
The Trade application manager report displays the violations due to the user experience. The unavailability of the servers that caused the outage is shown in the violations report (Figure 5-72).
Similarly, the IRB Trade IT Infrastructure manager sees the violations for the two systems in question. After the analysis, the team suggested the following options to the IT executive to address the problem and reduce the potential for future SLA violations of this nature:
Make a backup of the production system available at all times.
Replicate the data on the production system. Then when any system in the production environment goes down, the backup system immediately takes over.
Employing one of these options will satisfy the SLO of the availability of the production environment.
Closely monitor the usage of the hard drive space and memory to plan for future requirements.
Specify the start date of the SLA in the past so that it gives an idea of the current performance of the enterprise infrastructure for each SLO. This may lead to a better SLO.
Determine the bottlenecks and improve the performance of the application by using better tuning parameters and assigning better resources, depending on the mission criticality of the application. When this is done, try to improve the SLO.
Use the adjudication function of IBM Tivoli Service Level Advisor when the violations can be adjudicated and agreement is reached between the service provider (IT infrastructure team or Trade Application team) and the user (marketing executive).
Send e-mail escalation to the service desk, so that each violation is treated as an incident. This helps to measure the violations.
Chapter 6.
Figure 6-1 shows the CEO's organization chart. This case study focuses on the banking business and the IT department.
[Figure 6-1: organization chart with boxes for the CEO, Banking Director, Trading Director, and IT Director]
The other business unit directors think they have insufficient information about services delivered by the IT department. Service level agreements (SLAs) are in place, but they are based on the availability of technology components rather than business services, and they almost always show that SLA targets are met. At best, the monthly SLA reports appear two weeks after the end of the reporting period, but often are delivered much later. The mismatch between the SLA reports and customer perception has sparked heated discussions between the business unit directors. The concern is that, if the bank acquires a reputation for poor service, the loss of clients will grow exponentially. Because of the threat to the company, the board of directors has agreed to fund a program of service improvement proposed by the IT director. The board expects to see results within six months.
Summary of the issues:
There is an increase in account closures and a loss of repeat business.
Online checking for accounts is unreliable at peak periods.
SLAs are delivered late and are not meaningful to the business.
SLA results do not tally with reported user experiences.
Improvements must be made within six months.
IT department organization
The IT director is responsible for all IT services across the company. He has two managers reporting to him who are responsible for software development and service delivery respectively.
Figure 6-2 shows the high-level organization chart for the IT department. The IT department is highly centralized with most staff working at the bank's headquarters building in central London. The bank's lights-out data center has been designed with multiple instances of most components to provide high availability and disaster recovery. There are small teams of technical staff at the other main locations who report to the operations manager and provide local support for desktop computers, e-mail, file/print servers, and networks.
[Figure 6-2: IT department organization chart with boxes for the IT Director, Development Manager, Operations Manager, Development TL Banking, and Development TL Trading]
Summary of the issues:
The cause of the problem with the online checking systems is not known.
The IT staff are working in a reactive rather than proactive mode.
Current tools do not provide data on the user experience.
Separate tools provide disjointed technology-based views of the infrastructure.
Judging the impact of component failure depends on knowledge held in the heads of key technicians who are not always available.
SLM processes are known to be ineffective and are based on unsuitable software tools.
The IT director knew that he had to respond urgently to the concerns of the business or risk losing his job. He has already set up a task force to work on the service improvement program and placed a contract with IBM to provide consultants to give best practice advice and guidance on systems management.
Mainframe infrastructure
The company mainframe infrastructure consists of 22 logical partitions (LPARs) on five z/OS machines. DB2 and IMS are used for data storage and CICS is widely used for transaction processing by legacy applications. All major production services have multiple instances of software components running on different LPARs to provide high availability.
available to measure end-to-end performance of applications, and therefore the user experience.
Table 6-1 Maturity of systems management tools
Product | Platform | Maturity of exploitation
System Automation for z/OS | Mainframe | Very mature
IBM Tivoli NetView for z/OS | Mainframe and distributed | Very mature
Tivoli Workload Scheduler | Mainframe | Very mature
Omegamon XE for z/OS and OS/390 | Mainframe | Very mature
Omegamon II for CICS | Mainframe | Very mature
Omegamon II for IMS | Mainframe | Very mature
Omegamon XE for DB2 | Mainframe | Very mature
IBM Tivoli Monitoring | Distributed | Mature
IBM Tivoli Enterprise Console | Distributed | Mature
IBM Tivoli Business Systems Manager | Mainframe and distributed | Immature
Tivoli Data Warehouse | Distributed | Immature
The IT director shares the view of the business unit directors that it is better to have SLAs that reflect the business. He knows he has to resolve the issues around production of SLA reports, but is not sure how to make this happen.
Figure 6-4 Console consolidation using IBM Tivoli Business Systems Manager
The success of console consolidation enabled the IT department to reduce the number of consoles on the operations bridge, making it much less cluttered. Operators now have one screen to watch for all the events received by all the monitoring tools. The operators and technical support teams still log in and use other tools when necessary, but the need to do this has been greatly reduced through the use of the TBSM Task Server. This enables an operator to launch a software tool in the context of an object selected (by right-clicking) on the IBM Tivoli Business Systems Manager console.
Tip: To learn about setting up the IBM Tivoli Business Systems Manager Task Server, see IBM Tivoli Business Systems Manager Administrator's Guide, SC32-9085.
The IBM Tivoli Business Systems Manager implementation team should now gather information to build business systems based on business services. This work has been put on hold by the IT Manager because the technical staff with the key information about the infrastructure was told to focus on resolving the immediate difficulties with the Internet Banking service.
4 SLAs based on the availability or performance of business services, agreed between IT and business unit directors, and implemented
Who benefits? IT director, operations manager and service level manager Business unit and IT directors, service level manager, and operations manager All stakeholders
6 | SLA reports available within one day of the end of the reporting period; intermediate SLA evaluation reports produced on demand throughout the reporting period | Check date SLA reports are received; include a statement of due dates and actual dates of reports in an SLA reporting pack
7 | Demonstrated improvement in business services as measured by the SLA reports and a reduction of instances of lost clients | Demonstrate measurable improvement in availability or performance of business services in SLA reports
8 | OLAs agreed and implemented between technical team leaders and the IT director | Count how many OLAs are in productive use within six months of implementation
9 | New IT systems and processes in line with ITIL recommendations | Audit systems management processes as part of a continuous improvement program
Key issues
The task force also documented their understanding of the key issues that the IT department needed to tackle, and the impact this was having, as summarized in Table 6-3.
Table 6-3 Key issues
Issue | Impact
Business services are not performing as expected | Client dissatisfaction
No effective way of measuring services | Ineffective service management and inability to construct meaningful SLAs
No clear understanding of how the infrastructure maps onto business services | The business impact of component failure is either not known or relies on expertise of individuals; systems management cannot account for business impact
Technical staff does not always target incidents causing the greatest business impact | Potential for serious impacts to business services because of inappropriate prioritization in the absence of reliable business impact data
SLAs do not reflect delivery of business services | Poor SLM and dissatisfied internal customers
Production of SLA reports is expensive, slow, and erratic | Poor SLM, dissatisfied internal customers, and wasted IT resources
Table 6-4 Key tasks in the service improvement programs
Task description | Desired outcomes addressed
Detailed analysis of potential causes of the poor performance of the Internet Banking service | 7
Build business systems based on those starting with banking applications | 1, 2, 3, 7
Review operations and technical team processes for incident prioritization | 3, 7, 8
Update the systems management architecture to deliver the desired outcomes | All outcomes
Agree the success criteria | All outcomes
Plan implementation of the solution | All outcomes
Implement the solution | All outcomes
Review the implementation against the success criteria and refine if necessary | All outcomes
Put a continuous improvement plan in place | 7, 8
Other items agreed at an early stage were:
Production of the current SLA reporting will stop immediately to enable the SLA team to assist in implementing meaningful SLAs.
Business representatives will be appointed to the task force.
6.4 Implementation
Chapter 4, Planning to implement service level management using Tivoli products on page 109, covers the implementation of Tivoli products for SLM. This scenario uses the stages that are summarized in Table 6-7.
Table 6-7 Stages of implementation for the scenario
Stage | Description | Reference
1 Define services | Identify and define business services and their infrastructure components at a high level | Stage 1: Defining services on page 332
2 Enhance instrumentation | Identify and implement additional instrumentation to enable the service to be measured | Stage 2: Enhancing instrumentation on page 333
3 Determine users and roles | Decide who will use IBM Tivoli Business Systems Manager and IBM Tivoli Service Level Advisor and what type of access they need | Stage 3: Determining users and roles on page 337
4 Determine IBM Tivoli Business Systems Manager resource types | Create any special IBM Tivoli Business Systems Manager objects if required | Stage 4: Determining IBM Tivoli Business Systems Manager resource types on page 339
5 Create IBM Tivoli Business Systems Manager business systems | | Stage 5: Creating IBM Tivoli Business Systems Manager business systems on page 340
6 Create IBM Tivoli Business Systems Manager views | Configure IBM Tivoli Business Systems Manager to meet the requirements of the various users and user roles | Stage 6: Creating IBM Tivoli Business Systems manager views on page 351
7 Agree SL objectives | Decide what service parameters will be measured in SLAs | Stage 7: Agreeing to service level agreement objectives on page 363
8 Define metrics | Decide which specific metrics will be used in SLAs | Stage 8: Defining metrics on page 366
9 Prepare for ETLs | Check IBM Tivoli Service Level Advisor implementation; test and schedule running of ETLs | Stage 9: Preparing for ETLs on page 369
10 Prepare IBM Tivoli Service Level Advisor | Set up IBM Tivoli Service Level Advisor realms, customers and schedules | Stage 10: Preparing IBM Tivoli Service Level Advisor on page 371
11 Create offerings | Create service offerings for use in SLAs | Stage 11: Creating offerings on page 375
12 Create SLAs and OLAs | Create the SLAs and OLAs to support the defined services | Stage 12: Creating SLAs and OLAs on page 395
13 SLA reporting | Produce the SLA and OLA reports | Stage 13: SLA reporting on page 409
You can find the details of these stages in Chapter 2, General approach for implementing service level management on page 23. Or in the context of Tivoli products, refer to Chapter 4, Planning to implement service level management using Tivoli products on page 109.
Figure 6-5 shows the high level implementation tasks. The numbers in the boxes correspond to the stages listed in Table 6-7.
[Figure 6-5: high level implementation tasks. Box labels recovered from the figure: #2 Enhance Instrumentation, #8 Define Metrics]
What are the services: From the business representatives
How are the services architected: From the application development representatives
Figure 6-6 shows our first stage analysis of the banking services.
[Figure 6-6: first stage analysis of the banking services. Labels recovered from the figure: Banking, Asset Management, Batch, CICS, ATM System, ATM Networks, ATM Servers, ATM Transactions, Online Accounts checking and savings, Inter-bank Transfers, BACS Clearing Processes, Commercial Interbanking, DTS data Transmissions, Personal Interbanking, Online Accounts, Checking Accounts, Daily Batch, Monthly Interest Batch, Savings Account]
products that assist in service level management on page 53, and Chapter 4, Planning to implement service level management using Tivoli products on page 109.
IBM Tivoli Monitoring for Transaction Performance simulates standard user transactions and measures how long they take to complete. The time to complete each transaction is measured and the result is sent as an event to TEC and, from there, to IBM Tivoli Business Systems Manager. Response time data is transferred from IBM Tivoli Monitoring for Transaction Performance and IBM Tivoli Business Systems Manager to the Tivoli Data Warehouse. We explain how data from IBM Tivoli Monitoring for Transaction Performance is used to measure user experience in Online accounts performance data on page 367.
You can find detailed technical instructions for installing IBM Tivoli Monitoring for Transaction Performance, configuring it to forward events to TEC, and installing the Tivoli Data Warehouse WEP in IBM Tivoli Monitoring for Transaction Performance Administrator's Guide, GC32-9189, and IBM Tivoli Monitoring for Transaction Performance Warehouse Enablement Pack Implementation Guide, SC32-9109. For additional information, see Business Service Management Best Practices, SG24-7053.
We assume that the implementation of the product and integration with TEC has already been completed and tested. We concentrate on explaining how to use IBM Tivoli Monitoring for Transaction Performance to provide information about availability and response time.
The MA code must be installed on machines that are capable of running the synthetic transactions as described in IBM Tivoli Monitoring for Transaction Performance Administrator's Guide, GC32-9189.
Configuring the IBM Tivoli Monitoring for Transaction Performance playback policies
Important: You must deploy the STI playback component to at least one IBM Tivoli Monitoring for Transaction Performance MA in your environment before you begin this step.
We decide on which management agent machines the synthetic transactions will run and the schedule used to run the transactions. We also decide on the thresholds that will be used to determine whether events should be sent to TEC and IBM Tivoli Business Systems Manager. We consider these points first for playback:
The schedule must be set up to ensure that transactions are given time to complete.
STI transactions must be run from locations that represent user locations, for example different countries or regions.
The more frequently transactions are run, the better they represent the user experience.
Configuring IBM Tivoli Business Systems Manager for IBM Tivoli Monitoring for Transaction Performance events
For an overview of how IBM Tivoli Monitoring for Transaction Performance events are forwarded and displayed in IBM Tivoli Business Systems Manager, see Chapter 4, Planning to implement service level management using Tivoli products on page 109. For this scenario, we keep IBM Tivoli Monitoring for Transaction Performance objects and events in a separate child business system: Real-time User Experience Banking. We did this because:
Events indicating degradation to user experience usually propagate to the top-level business system so they can come to the attention of the business process owner.
IBM Tivoli Monitoring for Transaction Performance events can put other events received for the technology objects in the business system into context. For example, a technology event received by a server shows the server's criticality to the business system. Corresponding IBM Tivoli Monitoring for Transaction Performance events indicating an increase in user response times show the impact upon users of the server hit.
Incorrect event management at the source can result in giving insufficient priority to an event. If this were the case for the server hit, we would want the IBM Tivoli Monitoring for Transaction Performance event to always be visible in the business system. This is most easily achieved by keeping it in a separate business system that is subject to different propagation rules from the technology business systems.
If components of the infrastructure are not instrumented, IBM Tivoli Business Systems Manager may not receive events that show that they are defective. We can overcome this deficiency by using complementary events from IBM Tivoli Monitoring for Transaction Performance, which may notice that the user experience has deteriorated without any information about the cause. If this occurs, it may be necessary to either implement additional instrumentation or to modify the business system.
To enable this, we recommend that you create separate business systems for the user experience with their own propagation rules. We adapt the business systems to suit the requirements of the IBM Tivoli Business Systems Manager executive dashboard users. This requires us to have the IBM Tivoli Monitoring for Transaction Performance objects subject to different propagation rules from the other objects. The business system structure that we use is shown in Figure 6-8 on page 342.
The application of propagation rules to suit IBM Tivoli Monitoring for Transaction Performance events is described in detail in Setting PBT rules to allow propagation to top-level business system on page 348.
business systems for other users. There is no requirement to use IBM Tivoli Service Level Advisor. In IBM Tivoli Business Systems Manager, this maps to the super administrator and administrator roles with Java console access.
Operators
Operators need an IBM Tivoli Business Systems Manager work space which allows them to manage the entire production enterprise using all available IBM Tivoli Business Systems Manager views. They have no requirement to use IBM Tivoli Service Level Advisor. Although in some organizations, operators focus on specific services or customers, in this scenario, the shift operators share responsibility for all computer systems and there is no need to restrict the resources they can manage. In IBM Tivoli Business Systems Manager, this maps to the operator and restricted operator role with Java console or Web console access.
a key application are experiencing poor response time. They also want to be aware of potential and actual SLA breaches that relate to key business services. They want a simple display without the technical details. They want a nominated deputy to view the display, access current SLA reports online, and receive SLA reports sent via e-mail from the SLM team. In IBM Tivoli Business Systems Manager, this maps to the TBSM_Executives user role with executive dashboard access. In IBM Tivoli Service Level Advisor, this maps to an SLM Reports Console user role.
6.4.4 Stage 4: Determining IBM Tivoli Business Systems Manager resource types
No additional IBM Tivoli Business Systems Manager resource types were required for the solution in this scenario, so no action was necessary. The section is included to remind you that this may be necessary depending on the event sources you are using. To learn how to define IBM Tivoli Business Systems Manager resource types as generic objects, see Chapter 5, Case study scenario: IRBTrade Company on page 197.
6.4.5 Stage 5: Creating IBM Tivoli Business Systems Manager business systems
The Banking business system was built using the structure shown in Figure 6-7. There are six child business systems, each of which contains its own child business system. The business system was built using drag and drop, but could have been built using Automatic Business Systems (ABS) or Extensible Markup Language (XML) as discussed in IBM Tivoli Business Systems Manager Administrator's Guide, SC32-9085. The Online Accounts business system is critical, the ATM System business system is important, and the other business systems are of equal low criticality. The banking director must be informed, without delay, of impacts to ATM System and Online Accounts. The banking director is less concerned about the other business systems but should be notified if they are affected by severe problems.
Although the director is the primary customer of this business system, we need to ensure that the customizations do not impair the ability of the other IBM Tivoli Business Systems Manager users to fulfil their responsibilities. We configure the business system to the director's requirements using these steps:
1. Set resource level propagation (RLP) to stop child events from propagating to the top level business system.
2. Configure RLP for the Real-time Account Application Transaction child business system to allow for single IBM Tivoli Monitoring for Transaction Performance transaction failures.
3. Set child business system weighting to prioritize business system alerting.
4. Set the priority of Real-time User Experience Banking business system to permit propagation to override percentage-based thresholding (PBT) rules.
5. Set PBT threshold rules to permit child business systems to propagate to a top level business system.
6. Define the business system as a service, and configure the executive dashboard for the business system.
7. Verify that the business system is valid for other user roles.
Configuring RLP to allow for single IBM Tivoli Monitoring for Transaction Performance failures
The Real-time Account Application Transaction business system is a child of the Banking business system as shown in Figure 6-9. It contains five child objects. Each object represents an instance of the same IBM Tivoli Monitoring for Transaction Performance STI running on a different Management Agent. A short-duration network problem could cause an individual STI transaction to fail.
RLP is used to ensure that propagation only happens based on the settings in Table 6-8 to prevent such transient faults from alarming the Executive Console users.
Table 6-8 RLP settings for Real-time Account Application Transaction business system
Propagation conditions | High | Medium | Low
Red | >1 | >2 | >3
Yellow | >2 | >3 | >5
This configuration of RLP requires the business system to receive at least two events before an event is propagated up to the next level. This is done by determining the desired thresholds and defining them in the Child Event window of the Real-time Account Application Transaction business system properties (Figure 6-10).
Figure 6-10 RLP settings for Real-time Account Application Transaction business system
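The thresholds in Table 6-8 can be read as a simple counting rule. The following Python sketch is illustrative only. It assumes that each cell is the number of child events of that color and priority that must be exceeded before an event is propagated. The names RLP_THRESHOLDS and should_propagate are hypothetical and are not part of IBM Tivoli Business Systems Manager.

```python
# Thresholds copied from Table 6-8. A count must be strictly greater
# than the value before propagation happens (assumed interpretation).
RLP_THRESHOLDS = {
    "red": {"high": 1, "medium": 2, "low": 3},
    "yellow": {"high": 2, "medium": 3, "low": 5},
}


def should_propagate(color, priority, child_event_count):
    """Return True when the child event count exceeds the RLP threshold."""
    return child_event_count > RLP_THRESHOLDS[color][priority]
```

Under this reading, one high-priority red event from a single failed STI transaction is held back, and a second one causes propagation.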
Figure 6-11 Different weights for child business systems based on priority
Table 6-9 summarizes the importance of each business system based on weight.
Table 6-9 Importance of child business systems based on weight

Business system             Importance                     Weight
ATM System                  High                           200
Asset Management            Low                            50
Batch                       Low                            50
Interbank Transfers         Low                            50
Online Accounts             Very high                      250
Real-time User Experience   Not included in calculations   0
Important: The weight values for the business systems are based on what makes the PBT mathematics work to satisfy the requirements. Some trial and error is involved to ensure that more complex scenarios work as required. See Chapter 4, Planning to implement service level management using Tivoli products on page 109, for more details.
Setting the priority of the business system to override RLP and PBT rules
The Real-time User Experience Banking business system has a weight of 0. This means that it will not participate in the PBT calculations and will not send any events to the Banking business system. This may seem odd considering that we already stated that we want this business system to send its events to the Banking business system. However, giving this business system a weight would complicate the PBT rules. We resolve this apparent contradiction by using an override mechanism. Real-time User Experience Banking only sends up user experience events that have already passed the thresholds set by its own child business systems. Any event sent to the Real-time User Experience Banking business system indicates a problem with user experience.
We want to propagate this to the relevant executive dashboard users. To do this, we set the priority of the Real-time User Experience Banking business system to Critical as shown in Figure 6-12. This overrides all propagation rules and allows the event to be propagated directly to the Banking business system.
Figure 6-13 Sending a low yellow event for one or two red non-critical business systems
For three red child low-criticality business systems or a red event on the ATM System business system, send a high yellow event to Banking. This rule is for three non-critical business systems that have red events or for the ATM System business system that has a red alert. The weighting is set up so that this rule fires when the ATM System is the only business system to have a red event or when all three of the non-critical business systems have a red event. A high yellow event is sent to the Banking business system.
Figure 6-14 Sending a high yellow event for three red non-critical or ATM System
For a red event on the Online Accounts business system, send a high red event to Banking. This rule fires when the Online Accounts business system is the only business system to have a red event. It also fires when the ATM System business system and one or more non-critical business systems have red events. It does not fire if all the non-critical business systems have red events.
Figure 6-15 Sending a high red when Online Accounts has a red event
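The interaction between the weights in Table 6-9 and the threshold rules can be checked with a small calculation. The following Python sketch is a simplified model, not the product's PBT engine. It sums the weights of the child business systems that are red and compares the sum against cutoffs of 50, 150, and 250. These cutoffs are assumptions inferred from the behavior described for each rule, not the values configured in the figures, and banking_event is a hypothetical name.

```python
# Weights copied from Table 6-9.
WEIGHTS = {
    "ATM System": 200,
    "Asset Management": 50,
    "Batch": 50,
    "Interbank Transfers": 50,
    "Online Accounts": 250,
    "Real-time User Experience": 0,
}

# Assumed cutoffs on the summed weight of red children, highest first.
RULES = [
    (250, "high red"),
    (150, "high yellow"),
    (50, "low yellow"),
]


def banking_event(red_systems):
    """Return the event sent to Banking for the given red child systems."""
    red_weight = sum(WEIGHTS[name] for name in red_systems)
    for cutoff, event in RULES:
        if red_weight >= cutoff:
            return event
    return None  # no rule fires


print(banking_event(["Asset Management"]))
print(banking_event(["ATM System", "Batch"]))
```

With these assumed cutoffs, the ATM System alone (200) and all three non-critical systems together (150) both produce a high yellow event, and the ATM System plus one non-critical system (250) reaches the high red rule, which matches the rule descriptions. Real-time User Experience contributes nothing to the sum because its weight is 0.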
For green child business systems, clear PBT events from Banking. This rule is set to clear out all PBT-generated events when all child business systems are in green status. It is similar to the green threshold rule for the Personal Finance business system except that the event ID is set to be the same as the other Banking business system PBT threshold rule event IDs. This allows the PBT-generated events to be cleared. Tip: It is possible, and sometimes desirable, to set green rules to match every red and yellow PBT rule to clear each PBT-generated event when it is no longer applicable. We chose not to do this here because the business system is already complex, and the extra refinement presents administrative overhead with little benefit. The rules are set to notify the Executive Console users when there is a problem impacting their business and, when the problem is resolved, the rules clear the notification from the executive dashboard. Further refinement is possible but not necessary.
Role: Operators
The IBM Tivoli Business Systems Manager administrator creates a work space for the operations team that contains the whole enterprise represented as business systems. No special customization is required other than to create the business systems. IBM Tivoli Business Systems Manager operators have an extensive range of IBM Tivoli Business Systems Manager views available to them as explained in Chapter 4, Planning to implement service level management using Tivoli products on page 109. They normally access IBM Tivoli Business Systems Manager using a Java console to allow them to use hyperviews, topology views and the Event Viewer.
In this scenario, the operations team sees an initial view as shown in Figure 6-16 when they first log on to IBM Tivoli Business Systems Manager. The work space includes two windows containing:
- A hierarchical topology view
- An Event Viewer
Event Viewer
Underneath the topology view is an Event Viewer. It enables operators to view the events that affect the business systems shown in the top view and take action on individual events as required. This conforms to the operators' accustomed working practices and smooths the transition to IBM Tivoli Business Systems Manager. The column adjustments done for the consolidation consoles (Figure 6-4 on page 324) are retained for this use of the IBM Tivoli Business Systems Manager Event Viewer.
They can view the IBM Tivoli Business Systems Manager Web Console over a secure link to assess the business impact of a failed component and direct fault resolution. A Critical Watch List (CWL) (Figure 6-17) was created for operations using the same business systems as used in the Java Console work space.
This second set of icons on the dashboard indicates component failures that have not impacted services but must still be fixed. It reflects the status of the business systems that represent the services. These business systems are shortcuts of the child business systems of the Banking, Trading and Personal Finance business systems. The child business systems have new high-level business systems created without any thresholding rules. All events propagate up to these business systems. This satisfies the requirements of the people in these roles to be aware of all technology events. The structure of the three business systems that support the services is shown as children of the Operations Manager business system (see Figure 6-21). The business system icons may show a different status than the previous set of icons because there may be some issues with components that are not so serious as to cause a failure of services.
Figure 6-21 Business systems that support services and their executive dashboard icons
To test event behavior, we sent red alerts to objects in the Asset Management business system. As expected, when Asset Management turned red, the top of the business system tree, the Banking object itself, received a yellow alert as shown in Figure 6-24.
We cleared the previous alert and sent another red event to an object in the Online Accounts business system. This one red event caused the Banking object to turn red as the rules state that it should (see Figure 6-25).
Figure 6-25 Red propagating to the top of the Banking business system
This red event turns the executive dashboard red for the banking executive view as shown in Figure 6-26.
The drill down of this event shows that it is the PBT-configured event that is sent to the Executive and not the technical alert that caused the incident as shown in Figure 6-27.
We issued and cleared events across the Banking business system until we exhaustively tested the rules that were configured for this business system and verified that it performs as expected for all roles. It is fit for use in production.
Tip: You can develop business systems in a test IBM Tivoli Business Systems Manager environment. You can also subject them to behavior verification without impacting the production IBM Tivoli Business Systems Manager environment. After the business system is verified, you can extract it from the test environment and implant it into the production environment using the XML facilities provided with IBM Tivoli Business Systems Manager V3.1. For details about using XML to export and import business systems, see IBM Tivoli Business Systems Manager Administrator's Guide, SC32-9085.
Table 6-10 SLA and OLAs for production Online Accounts services

Description                                               Client              Provider                   Type
Online accounts performance and availability              Banking director    IT director                SLA
Account application performance and availability          Banking director    IT director                SLA
Interbank transfers performance and availability          Banking director    IT director                SLA
OS availability for z/OS servers (production banking)     Operations manager  Technical support manager  OLA
OS availability for Windows servers (production banking)  Operations manager  Technical support manager  OLA
OS availability for UNIX servers (production banking)     Operations manager  Technical support manager  OLA
WebSphere service availability (production banking)       Operations manager  Technical support manager  OLA
Network service availability (production banking)         Operations manager  Technical support manager  OLA
DB2 database availability (production banking)            Operations manager  Technical support manager  OLA
CICS region availability (production banking)             Operations manager  Technical support manager  OLA
CICS availability (production banking)                    Operations manager  Technical support manager  OLA
SLA targets
Having determined the SLAs and OLAs required, we must define the targets for each of them. We discuss examples of an SLA and an OLA here. Later in this chapter, we show how they are implemented.
The service parameters are:
- The service hours are 24 hours per day, 7 days per week.
- The service should be available for at least 99.5% of the time during service hours.
- The response time of the service should be no greater than 10 seconds for 99.5% of transactions during the service hours.
- The reporting period for measurement is one month.
- Interim weekly SLA reports should be provided for at least the first three months, after which the requirement will be reviewed.
- Reports should be available for review within one day of the end of the reporting period.
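The availability and response time targets above translate into concrete numbers that are useful when reviewing reports. The following Python sketch is a minimal illustration and is not how IBM Tivoli Service Level Advisor performs its evaluations. The 30-day month and the function names are assumptions made for the example.

```python
def permitted_downtime_minutes(target_pct, days):
    """Minutes of outage allowed in a period at the given availability target."""
    total_minutes = days * 24 * 60  # service hours are 24 x 7
    return total_minutes * (100 - target_pct) / 100


def response_time_met(samples_seconds, limit_seconds=10.0, target_pct=99.5):
    """True when enough transactions completed within the response time limit."""
    if not samples_seconds:
        raise ValueError("no transactions measured")
    within = sum(1 for s in samples_seconds if s <= limit_seconds)
    return 100 * within / len(samples_seconds) >= target_pct


# Assumed 30-day reporting period for illustration.
print(permitted_downtime_minutes(99.5, 30))
```

For a 30-day month, a 99.5% availability target allows 216 minutes of outage during the reporting period.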
Thereafter, database reorganization must be done periodically to maintain the performance of IBM Tivoli Service Level Advisor. To run the database reorganization, use these steps: 1. Stop the IBM Tivoli Service Level Advisor service from Windows Services. 2. Open a DB2 command window. Type the following command to check that there are no connections to the IBM Tivoli Service Level Advisor database:
db2 list active databases
Note: If there are connections to the IBM Tivoli Service Level Advisor database, you must terminate them before you run this command.
The response should show no connections listed for the DYK_CAT database. 3. Connect to the IBM Tivoli Service Level Advisor database:
db2 connect to DYK_CAT
5. Restart the IBM Tivoli Service Level Advisor service from Windows Services. Tip: To see the time that each ETL took to run, you can go to the Work in Progress window of the DB2 Data Warehouse Center tool.
Realms
In this scenario, it is possible to use only a single realm because everyone works for the same company. However, we set up two realms: one for the business units and one for the IT department. See Figure 6-32.
Customers
We also set up the first customers as required for the SLAs and OLAs that we identified. The initial IBM Tivoli Service Level Advisor customers are:
- Banking
- Trading
- Personal Finance
- Operations
These customers are identified in Figure 6-33.
Creating schedules
Schedules specify the time period over which IBM Tivoli Service Level Advisor offerings (and ultimately SLAs) are evaluated. For detailed instructions about setting up schedules, see Creating SLAs with IBM Tivoli Service Level Advisor 2.1, SC32-1247. The service level manager logs on to IBM Tivoli Service Level Advisor using the SLM administrator role. He navigates to Manage Schedules, and then selects Create to create a new schedule. The services we are working with in this scenario all have a requirement for service hours of 24 hours per day, 7 days per week. Figure 6-34 shows the schedule after it is created.
Figure: steps for creating an offering (labels shown: #1 Name Offering, #6 Select Metrics, #9 Publish Offering)
For complete instructions to set up offerings, see Creating Offerings in Creating SLAs with IBM Tivoli Service Level Advisor 2.1, SC32-1247.
3. You now see the Include Offering Components panel (Figure 6-40). At this stage, you have not entered any components and the bottom section of the panel is empty. Click Add.
5. The customer has asked for daily evaluations of the SLA. In the Advanced Metrics Settings panel (Figure 6-46), complete these tasks:
   a. Select the Perform Intermediate Evaluations check box.
   b. For Define the frequency for intermediate evaluations, select Daily.
   c. Set Range of Data to Current evaluation period only because we only want to examine data from the current reporting period.
   d. Click Finish.
6. You return to the Include Metrics panel (Figure 6-47), which now shows the metric that you added. In this case, you only use a single metric for the business system. If necessary, you can enter another metric on this panel. Click Next on this panel to continue.
Figure 6-47 Include Metrics panel after adding the first metric
7. In the Name Offering Component panel (Figure 6-48), complete these steps:
   a. Change the Offering Component Name from the default entry of Business System to Business System Availability.
   b. Leave the description field blank.
   c. Click Next.
8. You return to the Include Offering Components panel (Figure 6-49), which shows the offering component that represents availability of the business system that we added. Add a second component to deal with the performance of the Online Accounts service. This requires data from a business system as explained in Online accounts performance data on page 367. To set this up, repeat Step 5: Including the offering components on page 381 through Step 9: Publishing the offering on page 392 with exactly the same selections and entries in the panels. However, in Step 9, change the Offering Component Name from the default entry to Business System Performance.
Attention: You create two offering components that use exactly the same resources, metrics, and breach values. However, they are set up for different purposes and are exploited in different ways to suit your requirements.
Figure 6-49 Include Offering Components panel after adding the first component
9. You return to the Include Offering Components panel (Figure 6-50). Click Next.
Figure 6-52 Manage Offerings panel with the Online Accounts Offering
Figure: steps for creating an SLA (labels shown: #1 Name SLA, #2 Select Customer, #3 Select Service, #4 Select Offering, #5 Add Resources)
2. In the Filter Resources panel (Figure 6-61), set a filter to restrict the number of business system resources shown. If you do not set a filter, you see an error message indicating that there are too many resources to display. To create the filter, click Create Filter.
3. In the next Filter Resources panel (Figure 6-63), in the Value field, type Online Accounts. Click Next.
4. In the Select Resources panel (Figure 6-64), select /Banking/Online Accounts and click Next.
5. You return to the Add Resources to Business System Availability panel. Click Next. Tip: You can help find the resource by looking in the IBM Tivoli Business Systems Manager console. The business system we are looking for is called Online Accounts and is located in the Banking business system in IBM Tivoli Business Systems Manager. In IBM Tivoli Service Level Advisor, you see it in the Select Resources panel as /Banking/Online Accounts. You now see the Add Resources to Business System Performance panel (Figure 6-65).
5. In the Summary panel (Figure 6-67), click Finish. The SLA is now complete.
The Dynamic Resource List enables filtering based on the names or attributes of resources. This makes it suitable for OLA resources where naming standards are used for common resources such as servers. Create the OLA in the same way as an SLA until you reach the Select Resource List Type panel.
1. In the Select Resource List Type panel, select Dynamic Resource List and click Next.
2. In the Filter Resources panel, complete these steps:
   a. Click Create Filter.
   b. A row appears in the Resource Filter table. In the Value field, add Critical Server, which selects and isolates all resources in the business system.
   c. Select Preview current evaluation of filters.
   d. Click Next.
3. You see the View Dynamic Resource List panel (Figure 6-68) next because you selected the Preview current evaluation of filters option in the previous panel. You use this window to verify that the filter or filters selected the correct resources. Click Next.
4. In the Name Dynamic Resource List panel, name the Dynamic Resource List.
   a. In the Dynamic Resource List Name field, type Critical Server List.
   b. In the Dynamic Resource List Description field, type List of all the critical servers under OS availability of Windows servers.
   c. Click Next.
5. Complete the build of the OLA exactly the same as for an SLA.
The example OLA is now defined and active.
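The idea behind a Dynamic Resource List can be sketched as a filter that is re-evaluated against the current set of resources rather than a fixed list of names. The following Python example is a conceptual illustration only. It assumes that the filter matches the Value text anywhere in the resource name, and the resource names and the function dynamic_resource_list are hypothetical.

```python
def dynamic_resource_list(resource_names, value):
    """Return the resources whose names contain the filter value."""
    return sorted(name for name in resource_names if value in name)


# Hypothetical resource names that follow a naming standard.
resources = [
    "/Operations/Windows/Critical Server A",
    "/Operations/Windows/Test Server B",
    "/Operations/Windows/Critical Server C",
]

print(dynamic_resource_list(resources, "Critical Server"))

# A server added later is picked up the next time the filter is evaluated.
resources.append("/Operations/Windows/Critical Server D")
print(dynamic_resource_list(resources, "Critical Server"))
```

The benefit of this design is that resources that follow the naming standard join the OLA without any change to the OLA definition.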
IBM Tivoli Service Level Advisor Reporting Console offers two types of reporting:
- Intermediate evaluation reports
- End of SLA evaluation period reports
The average availability for the first day was 97.01%. This equates to an outage of 43 minutes. Although this exceeds the daily permitted average outage, it is not close to the monthly permitted outage and there could be up to 614 minutes of additional outages before the SLA is violated.
Being aware of the position as the reporting period progresses, the IT department has an opportunity to focus effort on the relevant part of the infrastructure to seek improvement for the remainder of the month. However, if the intermediate evaluation was not run until day 15 in the month and the result was availability of 97.01% as before, this would represent a total outage of:
(2.99 x 60 x 24 x 15)/100 = 646 minutes
This would leave a downtime margin of 11 minutes for the remainder of the month. In this case, there would be little room in which to manoeuvre. This illustrates that intermediate SLA evaluation can give the IT department early warnings. However, it should be done regularly and must be followed up with urgent remedial action. Otherwise the exercise is pointless.
IBM Tivoli Service Level Advisor can calculate trends toward violations automatically. By linking SLAs to services defined to the IBM Tivoli Business Systems Manager executive dashboard, trending events are shown on the dashboard icon. This is explained in Chapter 4, Planning to implement service level management using Tivoli products on page 109.
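The outage arithmetic used in the intermediate evaluation example can be reproduced with a few lines of Python. This sketch is for illustration only. The monthly permitted outage of 657 minutes is not stated directly in the text. It is taken as the sum of the 43 minutes already used and the 614 minutes quoted as still available, and the function name is hypothetical.

```python
MINUTES_PER_DAY = 60 * 24


def outage_minutes(availability_pct, days):
    """Total outage implied by an average availability over a number of days."""
    return (100 - availability_pct) * MINUTES_PER_DAY * days / 100


# Monthly permitted outage implied by the text: 43 minutes used on day 1
# plus 614 minutes of additional outage still available.
permitted = 43 + 614

day_1 = round(outage_minutes(97.01, 1))
day_15 = round(outage_minutes(97.01, 15))

print(day_1, permitted - day_1)
print(day_15, permitted - day_15)
```

The same average availability leaves a large margin after one day and almost none after 15 days, which is why regular intermediate evaluations matter.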
Important: According to ITIL, there are cases where implementing SLM processes has failed because unrealistic SLA targets were set. Before you put formal agreements in place between customers and suppliers, we recommend that you set up interim SLAs and use them to measure what is currently being achieved with the infrastructure. Tune the SLAs to make sure that targets can be met. If the targets are lower than what is considered desirable by the business, address this using a service improvement project with goals to improve performance over time. SLA targets can then be progressively increased and used to demonstrate how services have been improved as a result of changes made. You can also set shorter evaluation periods, and set retrospective SLA start dates initially to get faster feedback of results. See Adjusting SLAs after reviews on page 441 for details about adjusting SLAs to suit targets.
Sample SLA
Table 6-11 shows a sample of the kind of information you can expect to find in the written SLA contract, based on the SLA described previously.
Table 6-11 Sample SLA

Name of the service: Online Banking service
Approvals: Names, positions, and signatures, for example, Banking director
Description: The Online Banking Service is the Greebas Bank application that enables clients to manage checking and savings accounts through a browser interface.
Hours: The service should be available 24 hours per day, 7 days per week and 365 days per year.
Measurement Period: The measurement period is one calendar month starting on the first of each month.
Availability: Availability of the service is determined from agreed measurements obtained from IBM Tivoli Business Systems Manager. The service should be available 99.5% of the time during the measurement period, excluding any planned and agreed maintenance windows.
Performance: Performance of the service is determined from agreed measurements obtained from IBM Tivoli Business Systems Manager and derived from synthetic transactions driven by IBM Tivoli Monitoring for Transaction Performance. A value of 99.5% of measured browser transactions should take less than 10 seconds.
Reporting: Reports should be available for review within one day of the end of the reporting period. The reports must contain the following minimum information:
- An overview report showing the status of all the SLAs of the business unit for the last reporting period
- Lists of SLA violations with details
- Weekly reports on service levels for three months from the date this agreement was accepted
Reviews: SLA review meetings are held each month to discuss performance levels and violations. SLA planning meetings are held every three months to discuss long-term trends, new services, and proposals to modify SLA targets.
Other details: This includes additional information such as customer support, change management, scheduled maintenance, and escalation.
The operator can view where the impacted object fits into the business system structure by selecting the event in the Event Viewer, right-clicking, and selecting Business Impact. In this case, as shown in Figure 6-72, the Business Impact view shows the operator that the failing component is part of the ATM System business system that is a child of the Banking business system. Although the failed component does not impact the business system, the operator can proactively resolve the problem before the business process is compromised. The operator also sees a lot of red business systems. These are the technology business systems for the operating system support teams. They have no propagation-limiting rules and therefore propagate events to the top of the tree. See the following section for details about what the operating system support team sees.
Figure 6-72 Business impact showing the business system containing an affected component
Figure 6-74 Operating system support team leader executive dashboard view
The STI alerts are from two STI objects in the Real-time Account Application business system. This business system is set to propagate only when two or more red events are received from IBM Tivoli Monitoring for Transaction Performance. In this case, the red event is propagated to the top of the tree although a red event is already generated by PBT for the WebSphere event.
Figure 6-78 Operator view for critical events affecting the Banking business system
assess business impact using the IBM Tivoli Business Systems Manager Business Impact facility. However, it would be worth considering adding the critical business systems to the OS Support team view. Tip: Refining views to suit user roles is a process of continuous improvement. It does not stop once views are used in production environments.
When the IT user logs in, he or she sees the dashboard shown in Figure 6-79.
The IT user is an internal user and can view more information than an external user such as a banking executive. The IT user sees all the internal metrics, whereas the banking executive sees only a summary. For example, in Figure 6-80, the IT user uses an Intermediate Evaluation for Response Time, which is an internal metric. Internal metrics added for IT department users can help in diagnosis without affecting the SLA.
Figure 6-81 Operations and technical support managers executive dashboard view
When the manager logs in to the Report interface, he or she sees the page shown in Figure 6-83. On this page, the manager has a view of all the services that are provided as organized by realms.
Figure 6-83 Service delivery manager IBM Tivoli Service Level Advisor view
The manager can see that last month there were four violations in the Banking business unit. Clicking in the relevant cell shows the resources with the most violations. In this case, the /Banking/Interbank Transfers component has the most violations as shown in Figure 6-84.
Figure 6-85 shows the banking director's view in IBM Tivoli Business Systems Manager.
The Banking icon is red as is the IBM Tivoli Service Level Advisor indicator. When the director drills down, the icon shows generic details of what has occurred as shown in Figure 6-86.
When the BankingExecutive user logs in, this person sees the dashboard in Figure 6-87. This dashboard shows the banking director all the SLAs. Notice that only banking SLAs are available in this view. By clicking in the cell as indicated in the figure, the user can view some of the details of last month's violation. Notice that the cell is exactly under the column of the last day of the month.
Figure 6-87 Banking executive view in IBM Tivoli Service Level Advisor
Figure 6-88 shows the resulting window. Notice that in the section Violations, the violation occurred in the /Banking/Online Accounts component. In the SLO Results section, you can see that the other component is fine. Notice that you can only see two metrics. The SLA contains more metrics, but the others are internal and are not visible to this user.
The banking director may also want to know how well the IT department is meeting the SLAs in the reporting period that is underway. The director checks this by clicking in the appropriate cell related to the current period on the initial panel. See Figure 6-89. This shows that Real-time User experience is a little under the target. If this is a matter of concern, the director can discuss this immediately with the IT department.
When the director clicks in one of the values of the table shown in Figure 6-89, he or she sees a graphical view of the values for a specific date. The director can also see measurements based on longer intervals by setting the Start Date in the Filter Criteria section and clicking Update. Figure 6-90 provides an example of the type of display.
The root cause is not always so obvious. This section explains how the IBM Tivoli Business Systems Manager console, IBM Tivoli Business Systems Manager historical reporting, and IBM Tivoli Service Level Advisor can assist in finding it.
Using the IBM Tivoli Business Systems Manager Console for root cause analysis
You can configure IBM Tivoli Business Systems Manager so that it monitors both the infrastructure and user experience. In a properly instrumented enterprise, an indication of bad user experience should match indications of failure of infrastructure components. By using well-designed business systems, the link between the user experience and infrastructure failure should be apparent from examination of the IBM Tivoli Business Systems Manager console and by navigating through the business system hierarchy using the various views that are available.
TBSM Historical Reporting has a selection of reports available for use. We recommend that you use this approach:
1. Run the Business System Availability report against the business system in which you are interested. Report around the approximate time of the outage. For example, Figure 6-91 shows a report run against the PBT Demo business system between 14 and 16 October 2004 and the report selection options.
2. Analyze the results and extract the start and end times for red and yellow status. Figure 6-92 shows an example of the output of the report request. The business system indicates that it entered red status at 4:40:57 p.m. on 21 October and returned to green status at 4:47:57 p.m. on the same day. The business system was red for seven minutes. What caused this?
3. Run the Business System Events report to establish which events were received to cause red status. Figure 6-93 shows the report selection options for the Business System Events report. We selected to search between the times of the outage and added a couple of minutes at either end of the time parameters.
4. Analyze the report to identify the components likely to cause an outage. The report for this business system, as shown in Figure 6-94, indicates that the red status was caused by four objects receiving red events at 4:40:47 on 21 October. The objects and the business system were set to green status at 4:47:47 when the events were owned by user ID S2Admin1. The option to clear the alerts from the objects was taken, so the red status was removed from the objects.
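The time handling in steps 2 and 3 can be expressed as a short calculation. The following Python sketch is illustrative and does not call any IBM Tivoli Business Systems Manager interface. It uses the red and green timestamps quoted in step 2 and assumes a padding of two minutes for the phrase a couple of minutes. The function name search_window is hypothetical.

```python
from datetime import datetime, timedelta

# Status change times taken from the Business System Availability report.
red_start = datetime(2004, 10, 21, 16, 40, 57)
green_again = datetime(2004, 10, 21, 16, 47, 57)

# How long the business system stayed red.
red_duration = green_again - red_start


def search_window(start, end, padding=timedelta(minutes=2)):
    """Widen the outage interval so events just outside it are included."""
    return start - padding, end + padding


window_start, window_end = search_window(red_start, green_again)
print(red_duration)
print(window_start, window_end)
```

Padding the window matters because the events that cause a status change are logged slightly before the status change itself, as the report in step 4 shows.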
Using IBM Tivoli Service Level Advisor and Tivoli Data Warehouse Reporting for root cause analysis
Further information to aid correlation can be extracted from IBM Tivoli Service Level Advisor. For instance, the Components with the Most Violations Report can show which component of the business system has the most failures. For non-specific components, such as business systems, this is of limited value. However, for an SLA or OLA built using granular components, such as every component in the business system specified as individual resources of the Service, the Components with the Most Violations Report shows the actual component that is the root cause of the outage.
4 SLAs based on the availability, performance, or both of business services agreed between IT and business unit directors and implemented
5 Early warnings of potential SLA breaches
6 SLA reports available within one day of the end of the reporting period, and intermediate SLA evaluation reports produced on demand throughout the reporting period
7 Demonstrated improvement in business services as measured by the SLA reports and a reduction in the instances of lost clients
  Extent of achievement: The SLM solution provides the means of measuring the quality of delivery of business services, but cannot in itself deliver service improvement. This must come via analysis, process changes, and corrective actions.
8 OLAs agreed and implemented between technical team leaders and the IT director
  Extent of achievement: Initial OLAs are in place. These must be extended and refined over time.
9 New IT systems and processes in line with ITIL recommendations
  Extent of achievement: The approach taken is based on ITIL recommendations.
The solution has met most of the desired outcomes. But as in the real world, there is still much work to be done. Chapter 3, IBM Tivoli products that assist in service level management on page 53, discusses the need for continuous improvement. The next section describes some specific actions that you can take to make further improvements to the scenario described in this chapter.
Let's say that the banking director agreed to a slightly lower level of service as an interim measure as follows:
- Online Accounts availability: 98.5%
- Online Accounts performance (Real Time): 98%
We can apply these changes to the current SLA without creating a new SLA. This is done by creating a new offering to reflect the changed measurement requirements. To ensure that measurements are consistent, we recommend that you make SLA changes from the first day of the next measurement period. This enables you to compare the effect of the change with the previous measurement period. The best time to make the change is after the final evaluation is finished. When DYK_M10_Populate_Measurement_Datamart_Process finishes in the data warehouse, the evaluation is complete. Now we can make SLA changes.
To change an SLA, we use these steps:
1. Create a new offering that includes the new breach values, ideally based on the old offering.
2. Replace the old offering in the SLA with the new one.
We can create a new offering, based on the Online Banking offering, using these steps.
1. In the IBM Tivoli Service Level Advisor Administrator Console, select Administer Offerings → Manage Offerings.
2. In the Manage Offerings window, select Online Accounts Offering and click Create Like. This creates a copy of the Online Banking offering.
3. In the Name Offering window, complete these tasks:
   a. In the Offering Name field, add Online Accounts Offering date.
   b. In the Offering Description field, add This offering was reviewed in date.
   c. Click Next.
4. Continue through the offering definition as before until you reach the Define Breach Values window.
5. In the Define Breach Values window, in the Average field, replace the value with 98.5. Click Next.
6. Continue through the offering definition as before until you reach the Include Offering Components window.
7. In the Include Offering Components window, repeat the same process again for Business System Performance. Enter a breach value of 98. Click Next.
8. Finish and then publish the offering.
To replace this offering in the SLA, follow these steps:
1. Click Administer SLAs Replace Offering.
2. In the Old Offering window, select Online Accounts Offering and click Next.
3. In the New Offering window, select Online Accounts Offering date and click Next.
4. In the Move Resources window, follow these steps:
   a. In the first To field, select Business System Availability.
   b. In the second To field, select Business System Performance.
   c. Click Next.
5. In the Select SLAs window, select Online Accounts SLA and click Next.
6. In the Summary window, click Finish.
7. In the Track Updated SLAs window, monitor the modified SLAs. Click Close.

The SLA is now updated with the new offering. From now on, the bank can use the new offering to calculate compliance with the SLA. It can use the Track Updated SLAs window to monitor and verify the SLAs that have been modified.
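The effect of replacing the offering can be illustrated with a minimal sketch. It does not use any IBM Tivoli Service Level Advisor interface: the Offering class, the create_like and breaches functions, the old breach values of 99.0, and the measured values are all hypothetical, and the sketch assumes that a breach occurs when a measured percentage falls below the breach value.

```python
from dataclasses import dataclass


@dataclass
class Offering:
    name: str
    breach_values: dict  # component name -> breach value in percent


def create_like(offering, new_name, new_breach_values):
    """Copy an offering under a new name, overriding selected breach values."""
    merged = {**offering.breach_values, **new_breach_values}
    return Offering(name=new_name, breach_values=merged)


def breaches(offering, measured):
    """Return the components whose measured value is below the breach value."""
    return sorted(
        component
        for component, limit in offering.breach_values.items()
        if measured[component] < limit
    )


# Hypothetical old breach values; the original values are not stated here.
old = Offering(
    "Online Accounts Offering",
    {"Business System Availability": 99.0, "Business System Performance": 99.0},
)
# Interim values agreed with the banking director (from the text above).
new = create_like(
    old,
    "Online Accounts Offering (reviewed)",
    {"Business System Availability": 98.5, "Business System Performance": 98.0},
)
# Hypothetical measurements for one evaluation period.
measured = {"Business System Availability": 98.7, "Business System Performance": 98.2}

print(breaches(old, measured))
print(breaches(new, measured))
```

With these sample numbers, the same measurements breach both components under the old offering and none under the new one, which is why the change should start on a measurement period boundary.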
Part 3. Appendixes
This part includes the following appendixes:
Appendix A, "Service management and the ITIL"
Appendix B, "Important concepts and terminology"
Appendix C, "Scripts and rules used in this book"
Appendix A. Service management and the ITIL
The ITIL
The ITIL is a series of documents that are used to aid the implementation of a framework for IT service management. This customizable framework defines how service management is applied within an organization. The ITIL was originally created by the Central Computing and Telecommunications Agency (CCTA), a United Kingdom (UK) Government agency that is now known as the Office of Government Commerce (OGC). It is becoming more popular and has been adopted and used across the world as the standard for best practice in the provision of IT service.

Although the ITIL covers many areas, its main focus is on IT service management. The ITIL's IT service management is organized into a series of sets, which are divided into two main areas: service support and service delivery. Each area contains several disciplines, which stipulate the ITIL practices or requirements. Service support is the practice of those disciplines that enable IT services to be provided effectively. Service delivery covers the management of the IT services themselves. It involves many management practices to ensure that IT services are provided as agreed upon between the service provider and the customer.

Refer to the following Web sites for details about what ITIL is and what it can provide:
IT Service Management Forum Web site
http://www.itsmf.com
Service management
Today, the service management revolution is well on its way. Almost every IT organization is moving toward business-oriented service delivery. IT is being called upon to participate as a partner in the corporate mission, which requires their functioning as a proactive group that is responsive to their customers.
Adopting this mindset is difficult for internal service providers, who face an increasingly less captive audience. The corporate IT organization is now challenged to operate as a stand-alone business, without the corrective forces that profit orientation and the threat of losing customers present for companies operating in a free market. In the absence of these forces, IT organizations are embracing a new competitive mindset: service level management. Through the process of establishing an SLM orientation, IT organizations can engage customers as though they were driven by market forces.

SLM is a means for the lines of business (LOB) and the IT organization to explicitly set their mutual expectations for the content and extent of IT services. It also allows them to determine in advance what steps to take if these conditions are not met. The concept and application of SLM allows IT organizations to provide a business-oriented, enterprise-wide service by varying the type, cost, and level of service for the individual LOB. For the IT organization to make and use the service level agreements (SLAs) with the LOBs as a tool for decision making, the IT organization must organize itself accordingly and establish internal procedures that support SLA management.

SLM is not an isolated activity. It interacts with, and draws upon, all the other disciplines that are part of IT infrastructure management. There is no point in agreeing to deliver a service if the basic tools and processes needed to deploy, manage, monitor, correct, and report the service level achieved are not established. All of these activities are grouped into two major disciplines (Figure A-1): service delivery and service support.
Figure A-1 The service management disciplines (diagram labels: Service Delivery: Service Level Management, Financial Management for IT Services, Capacity Planning, IT Service Continuity Management, Availability Management; Service Support: Incident Management, Problem Management, Change Management)
Service delivery
The primary objective of the service delivery discipline is proactive. It consists of planning and ensuring that the service is delivered according to plan and, in turn, to the SLA. The tasks that you must accomplish to make this happen are:

Service level management
This involves managing customer expectations and negotiating service delivery agreements. It involves determining the customer's requirements and how you can best meet them within the agreed-upon budget. Working together allows IT disciplines and departments to plan and ensure the delivery of services. This involves setting measurable performance targets, monitoring performance, and taking action where targets are not met. Refer to Chapter 1, "Introduction to service level management", and Chapter 2, "General approach for implementing service level management", for a description of the approach to SLM used in this redbook.

Financial management for IT services
You must register and maintain cost accounts related to the usage of IT services. You must also deliver cost statistics and reports to SLM to assist in obtaining the right balance between service cost and delivery. And you must assist in pricing the services in the service catalog and SLAs.

Capacity management
This involves planning and ensuring that adequate capacity with the expected performance characteristics is available to support the service delivery. It also entails delivering capacity usage, performance, and workload management statistics, and trend analysis to SLM.

IT services continuity management
This requires you to plan and ensure the continuing delivery, or minimum outage, of the service by reducing the impact of disasters, emergencies, and major incidents. You do this work in close collaboration with the company's business continuity management, which is responsible for protection of all aspects of the company's business, including IT.

Availability management
This entails planning and ensuring the overall availability of the services. It also requires you to provide management information in the form of availability statistics, including security violations, to SLM. This discipline may include negotiating underpinning contracts with external suppliers, and defining maintenance windows and recovery times.
Service support
The disciplines in the service support group are reactive and concerned with implementing the plans and providing management information regarding the levels of service achieved.

Service desk
This function is essential to effective service management and acts as the main point of contact for the users of the service. You register incidents, allocate severity, and coordinate the efforts of the support teams to ensure timely and correct resolution of problems. Escalation times are noted in the SLA and are, as such, agreed upon between the customer and the IT department. This discipline also requires you to provide statistics to SLM to demonstrate the service levels achieved.

Incident management
The goal of this discipline is to restore services to their normal operational levels as soon as possible, ensuring that service levels are maintained. You must maintain meaningful records of all reported incidents that cause, or may cause, interruption or degradation of the quality of IT services. You must also provide investigation and diagnosis of incidents, as well as incident ownership, monitoring, and tracking.

Problem management
For this discipline, you must ensure that resources are prioritized to resolve problems in the most appropriate order based on business needs. A problem is the unknown cause of one or more incidents. When the root cause is known and a temporary work-around or a permanent fix is determined, the problem becomes a known error. You must also agree on escalation times internally with SLM during the SLA negotiation. And you must provide problem resolution statistics to support SLM.

Change management
In the change management discipline, you must ensure that the impact of a change to any component of a service is well known, and that the implications regarding service level achievements are minimized. This includes changes to the SLA documents and the service catalog, as well as organizational changes and changes to hardware and software components.

Release management
For release management, you manage the master software repository, named the Definitive Software Library (DSL), and deploy software components of services. You must also deploy changes upon the request of change management. And you must provide management reports regarding deployment.

Configuration management
With configuration management, you must register all components in the IT service, including customers, contracts, SLAs, hardware and software components, and more. In addition, you must maintain a repository of configurable attributes and relationships among the components.

Figure A-2 shows the key relationships among the disciplines.
Figure A-2 Key relationships among the disciplines (diagram labels: groups Planning, Support, and Infrastructure; disciplines IT Service Continuity Management, Service Desk, Financial Management, Capacity Management, Change Management, Problem Management, Availability Management, Configuration Management, Release Management; Requirements: budget, performance, availability, disaster; Deliverables: costs, performance, availability, recovery, configuration data, software installations; Configurations: capacity, equipment, components, etc.)
To fully understand the responsibilities of each of the disciplines and the relationships among them, the following sections discuss both the service support and the service delivery disciplines.
attributes, users, and so on. Likewise, there is a need to know where the roles and responsibilities of different activities involved with service support are placed within the support organization. During all the phases of the life cycle, the IT organization as a whole should be able to answer the question: Who does what to which component: where, when, why, how, and authorized by whom?

Providing the answer requires contributions from all the disciplines in the service support group:
Configuration management: Answers the where and which
Service desk: Should be in a position to answer why
Incident and problem management: Are responsible for the what and how
Change management: Takes care of the when and whom
Release management: Depends upon the nature of the change; who is often placed here

Change requests may originate from sources other than incident management, problem management, and service desk. For example, if a request to increase the size of a file system is issued from capacity management, the change request is passed directly to change management without the knowledge of service desk. However, each change request should be registered with and governed by configuration management. This enables the service desk to find the answer to why, even though the change did not address a specific incident received by the service desk.
Configuration management
For day-to-day incident, problem, and change handling, as well as deployment of new services, information about all the components that are related to delivery of a service is vital. Configuration management is responsible for providing and maintaining this information, which is perhaps one of the toughest tasks related to service management. Configuration management, as a discipline of service support, is not restricted to the configuration management aspects of development. If it applies to the specific environment, development aspects are included. But configuration management includes all of the components within the IT infrastructure that are related to delivery of a service. Configuration management should be applied throughout the organization and should not be restricted to IT-related items.
The four main activities of configuration management are:

Identification: This involves identifying all the configuration items (CIs) in the IT infrastructure, as well as defining the information to hold about each of the CIs and the relationships between them. Additionally, it entails defining baselines and identifying variants. To summarize, this task is responsible for defining the policies regarding the type and level of information that is maintained in the organization. Identifying, gathering, and storing the information initially may require a huge effort, and maintaining the information may require even more. The basic principles for identifying the CIs are as follows:
  CIs must be uniquely identified.
  The identification must be prominent and clearly visible.
  Identities must be as meaningful as possible.
  Versioning must be supported.
  Growth must be catered to.
Control: This activity handles maintenance, updates, and access to the configuration repository, called the configuration management database (CMDB). Many of the other service management disciplines support this effort, but it requires adequate control procedures to be in place:
  Specifications of CIs are agreed upon and frozen.
  Only changes authorized through predefined change management procedures are allowed.

Status accounting: Since the CMDB is used by all system management disciplines, it is vital that the information is correct and timely. The CMDB holds active and historical configuration data. Therefore, attributes must be defined and maintained to track the configuration of CIs over time. These attributes must support the state of acquisition, development, testing, or implementation of the CIs, and state changes must be recorded as soon as they happen. Another way of expressing the responsibilities of this activity is to record and report all current and historical data for all CIs. Some useful reports are:
  The number of incidents from a particular CI in a particular period
  The change history for a CI in a particular period
  The total amount spent with a particular supplier over a particular period

Verification: It is important to audit the contents of the CMDB, that is, verify them to make sure that the repository reflects the actual configuration of the IT infrastructure. The configuration management staff themselves can accomplish this, or some of the operational procedures (for example, related to incident reception in the service desk) may assist. Review the consistency of the CMDB regularly.
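Two of the status accounting reports named above can be sketched over a simple list of records. This is only an illustration: the record layout, the CI identifiers, and the function names are hypothetical, and a real CMDB would be queried through its own interface.

```python
from collections import Counter
from datetime import date

# Hypothetical CMDB history: (CI ID, record type, date recorded).
records = [
    ("SRV-001", "incident", date(2004, 11, 2)),
    ("SRV-001", "change", date(2004, 11, 5)),
    ("SRV-001", "incident", date(2004, 11, 20)),
    ("DB-007", "incident", date(2004, 11, 21)),
    ("SRV-001", "incident", date(2004, 12, 3)),
]


def incidents_per_ci(records, start, end):
    """Count the incidents recorded against each CI in the period."""
    return Counter(
        ci for ci, kind, day in records
        if kind == "incident" and start <= day <= end
    )


def change_history(records, ci_id, start, end):
    """List the dates of changes recorded for one CI in the period."""
    return sorted(
        day for ci, kind, day in records
        if ci == ci_id and kind == "change" and start <= day <= end
    )


november = (date(2004, 11, 1), date(2004, 11, 30))
print(dict(incidents_per_ci(records, *november)))
print(change_history(records, "SRV-001", *november))
```

Both reports depend on every incident and change being recorded against the correct CI at the time it happens, which is the point of the status accounting activity.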
Maintaining the accuracy of the CMDB may be easier if:
  The CMDB is active rather than passive.
  The CMDB is updated automatically whenever possible.
  Configuration management activities are integrated into other relevant operational procedures.
  Automatic audits are built into the system.
You should record at least four basic attributes for every CI:

ID: A unique identification. To ensure uniqueness and easily differentiate types of CI from one another, you must develop a naming standard for CI IDs that supports type. This naming standard should not include elements of information that may change over time. Therefore, avoid location and owner information because it may change, requiring the CI to assume a new ID.

Location: Record the location where the CI may be found to assist all other service management disciplines. In particular, the impact analysis of the change management process relies on this piece of information. For mobile computers, it may not make sense (or it may be difficult) to determine the physical location of the CI. Each organization must determine for itself whether the effort to maintain the physical location is too high to take on this task.

Owner: To charge for services, monitor SLA achievements for different LOBs, determine maintenance policies, and so on, it is necessary to connect the CI to an owner or user. By linking this information to the organizational structure, which is also recorded in the CMDB, the CI can be associated with a specific group or department, providing the key to the desired information.

State: The state of the CI is vital to track the CI through its life cycle to ensure that each CI is made to cost, on time, complete, to specification, authorized, and more. During each state, you may track responsibilities, progress, and problems.

The information needed to manage a configuration varies as a function of the type of configuration and the management task performed. In the previous example, free space is an attribute of the configuration of type PC, and the serial number is an attribute of a configuration of type hard disk.
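The four basic attributes can be sketched as a small record type. This is a minimal illustration rather than a CMDB schema: the class names, the TYPE-NNNNNN naming standard, and the sample values are assumptions, while the state names follow the life cycle states mentioned under status accounting.

```python
import re
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ACQUISITION = "acquisition"
    DEVELOPMENT = "development"
    TESTING = "testing"
    IMPLEMENTATION = "implementation"


# Hypothetical naming standard: CI type prefix plus a six-digit serial number.
# It deliberately carries no location or owner information.
CI_ID_PATTERN = re.compile(r"^(?P<type>[A-Z]+)-(?P<serial>\d{6})$")


@dataclass
class ConfigurationItem:
    ci_id: str
    location: str
    owner: str
    state: State

    def __post_init__(self):
        # Reject IDs that do not follow the naming standard.
        if not CI_ID_PATTERN.match(self.ci_id):
            raise ValueError(f"CI ID {self.ci_id!r} does not follow the naming standard")

    @property
    def ci_type(self):
        """Derive the CI type from the ID prefix."""
        return CI_ID_PATTERN.match(self.ci_id).group("type")


pc = ConfigurationItem("PC-000123", "Building 2", "Finance", State.TESTING)
pc.location = "Building 5"  # A move changes the location, not the ID.
print(pc.ci_id, pc.ci_type, pc.location)
```

Because the ID encodes only the type and a serial number, moving the CI or reassigning its owner never forces it to assume a new ID.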
When you break down the IT infrastructure into configurations and configuration items, you must follow these three principles:
  Break down CIs only to the level at which they can be changed or amended independently.
  The level of CI breakdown and the attributes stored for each CI vary depending upon the individual organization and the purposes for which control is exercised.
  The cost of gathering and storing information must never exceed the value of the information.

Besides attributes, you may also use relationships to associate CIs with one another. This may be the most obvious way to break down CIs for tangible components, such as hardware. However, defining the CI structure for software and organizational configurations becomes more complicated. The CMDB must be able to handle relationships between:
  Hardware and hardware
  Hardware and software
  Software and subsystems
  Applications, hardware, and software
  Hardware, software, and operating systems
  Networks
  All of the previous items and their users
  Incidents, problems, solutions, and change requests
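Relationships of this kind are what make impact analysis possible. The following sketch stores hypothetical "depends on" relationships between CIs and walks them to find everything that relies, directly or indirectly, on a given CI. The CI names and the impacted_by function are illustrative assumptions, not part of any product.

```python
from collections import defaultdict, deque

# Hypothetical relationships: dependent CI -> CIs that it relies on.
depends_on = {
    "APP-BANKING": ["SW-WEBSERVER", "SW-DATABASE"],
    "SW-WEBSERVER": ["HW-SERVER1"],
    "SW-DATABASE": ["HW-SERVER2"],
    "USER-TELLERS": ["APP-BANKING"],
}


def impacted_by(ci_id, depends_on):
    """Return every CI that directly or indirectly relies on ci_id."""
    # Invert the relationships so that each CI maps to its dependents.
    dependents = defaultdict(list)
    for dependent, targets in depends_on.items():
        for target in targets:
            dependents[target].append(dependent)

    # Breadth-first walk from the changed or failing CI.
    seen, queue = set(), deque([ci_id])
    while queue:
        for dependent in dependents[queue.popleft()]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)


print(impacted_by("HW-SERVER2", depends_on))
```

In this sample data, a change to HW-SERVER2 reaches the database software, the banking application, and finally its users, which is the kind of answer change management needs before authorizing a change.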
This list is by no means complete. More benefits will be obvious for each specific discipline.
Service desk
The service desk provides a main point of contact for users of the services. Whenever users experience problems, have questions, or need information regarding the use of services, they should contact the service desk. The service desk is also responsible for notifying users about disruptions in service, planned outages, and availability of new functions. It serves as a two-way conveyer of information between the service users and the staff supporting the service. This section focuses on the one-way information flows from user to staff (Figure A-3). Providing quality service requires processes and procedures to detect and rectify problems as quickly as possible. Detection is either done by programs that monitor specific resources of the hardware and software components of the IT infrastructure or by the users of the service. When an issue is reported, it is recorded centrally with the service desk as an incident. This central incident control is required, partly to ensure that the issue is handled and partly to ensure that the same issue is handled only once, even though more incidents may have been opened against the issue. When the issue is reported, the service desk must provide a solution to it. The service desk may (but is not required to), through incident management processes, identify, test, and apply the solution. It must also keep track of the incident to ensure that the issue is solved within the time agreed in the SLA and to escalate the issue if necessary.
If a service desk cannot identify a solution to the issue on its own, the incident is recorded as a problem, which is stored in the CMDB. Now, problem management assumes responsibility to provide a solution for the problem by accepting the problem. When the root cause of the problem is known and a temporary work-around or permanent fix is identified, it is recorded as a known error. When a solution is available, it may require changes to the CI for which the incident was opened or another CI within the infrastructure on which the failing CI relies. The service desk is now required to open a request for change in order for change management to assess the impact and authorize the change. Once authorized, release management may take over to perform the actual implementation of the change. During this process, each service support discipline is responsible for recording status information in the CMDB. The service desk must also keep the user informed through all the stages in the life cycle of the incident. It must also confirm that the issue has been resolved, record the solution to the known error, and close the incident.
Figure A-3 Information flow from user to staff (flowchart labels: Incident occurs, Call answered, Known errors, Categorization, Initial investigation, Assign problem, Problem Management, Confirm, Resolve)
Incident management
As described earlier in this appendix, the goal of incident management is to restore services to their normal operational levels as soon as possible, to ensure that service levels are maintained. The service desk plays a key role in incident management. When an issue is reported, the service desk captures the data needed to open a new incident. This data must include an ID of the person (or proxy) who submitted the issue report, and the ID of the CI suffering the impact. With this information, the service desk can query the CMDB to investigate whether the CI exists in the CMDB, and whether any outstanding problems, changes, or other incidents are active for that particular CI. The service desk should also determine whether the particular issue was reported earlier.
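The checks that the service desk performs before opening a new incident can be sketched as follows. The dictionary-based CMDB, the record fields, and the open_incident function are hypothetical, and matching duplicates on an identical description is a deliberate simplification of how an earlier report would be recognized.

```python
def open_incident(cmdb, reporter_id, ci_id, description):
    """Check the CMDB, then open an incident unless it was already reported."""
    # 1. Does the CI exist in the CMDB?
    if ci_id not in cmdb["cis"]:
        return {"status": "unknown_ci", "ci_id": ci_id}

    # 2. Are problems, changes, or other incidents active for this CI?
    active = [r for r in cmdb["records"] if r["ci_id"] == ci_id and r["open"]]

    # 3. Was the same issue reported earlier? (simplified exact match)
    for record in active:
        if record["kind"] == "incident" and record["description"] == description:
            return {"status": "already_reported", "existing": record["id"]}

    incident = {
        "id": f"INC-{len(cmdb['records']) + 1}",
        "kind": "incident",
        "ci_id": ci_id,
        "reporter": reporter_id,
        "description": description,
        "open": True,
    }
    cmdb["records"].append(incident)
    return {"status": "opened", "id": incident["id"],
            "related": [r["id"] for r in active]}


cmdb = {
    "cis": {"SRV-001": {"owner": "Finance"}},
    "records": [{"id": "CHG-1", "kind": "change", "ci_id": "SRV-001",
                 "description": "Memory upgrade", "open": True}],
}
first = open_incident(cmdb, "user42", "SRV-001", "Logon fails")
second = open_incident(cmdb, "user77", "SRV-001", "Logon fails")
print(first)
print(second)
```

The second call does not create a new record, which reflects the requirement that the same issue is handled only once even when several users report it.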
If there are no indicators showing that the issue is being handled, the incident must be categorized. A type and an impact code are assigned to the incident. Do not confuse impact with priority, urgency, or severity, as defined here:
  Impact: Impact of the incident on the achievement of business objectives
  Severity: Impact of an incident on service provision
  Urgency: Determines the speed with which an incident must be resolved
  Priority: Order of handling incidents, based on a combination of impact, severity, urgency, and availability of resources to address the incident

Using these definitions, it is clear that an incident can have a high impact on the achievement of business objectives and yet have an insignificant impact on the provision of the service (and vice versa). The priority primarily depends on the impact on the business and secondly on the impact on the service. However, since the business relies on the service, incidents with a high service impact quickly affect the business as well. The priority of the incident is determined from both a business and a service perspective as shown in Figure A-4.
Figure A-4 Incident priority from the business and service perspectives (axis label: severity; levels: low, medium, high)
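The rule that priority depends first on business impact and second on service severity can be sketched as a sort key. The three-level scale, the sample incidents, and the priority_key function are assumptions for illustration; the sketch ignores urgency and the availability of resources, which the definition of priority also includes.

```python
LEVELS = {"low": 0, "medium": 1, "high": 2}


def priority_key(incident):
    """Order by business impact first, then by service severity."""
    # Negate the levels so that higher levels sort to the front.
    return (-LEVELS[incident["impact"]], -LEVELS[incident["severity"]])


# Hypothetical open incidents.
incidents = [
    {"id": "INC-1", "impact": "low", "severity": "high"},
    {"id": "INC-2", "impact": "high", "severity": "low"},
    {"id": "INC-3", "impact": "high", "severity": "medium"},
]

handling_order = [i["id"] for i in sorted(incidents, key=priority_key)]
print(handling_order)
```

With this rule, an incident with high business impact but low severity is still handled before one with low business impact and high severity.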
After the incident is categorized, an initial investigation may be carried out using incident management processes. This involves searching the CMDB for similar or related issues to identify the cause of the incident as a known error. If this is the case, the service desk can inform the user of the status of the problem, when to expect the issue to be fixed, or any actions the user can take to circumvent the issue.
If no immediate solution can be found, the incident becomes a problem, and a solution must be provided by the problem management discipline. When the service desk passes the problem to problem management, the responsibility of managing the problem still lies with the service desk. The service desk is now responsible for keeping the user informed about the progress and for escalating the problem if the times for problem resolution set out in the SLA cannot be met.
Problem management
The activities performed by problem management are similar to those of the service desk. Problems are received, accepted, diagnosed, and assessed for severity. This is known as problem control. Then, solutions are developed or identified, tested, verified, and recorded, which is all part of the error control process. The problem control process is concerned with identifying the real causes of incidents to prevent future recurrences. This process is made up of five phases:
1. Initially investigating the nature of the problem
2. Accepting the problem
3. Assigning priority (impact on service delivery and business objectives)
4. Allocating support effort
5. Performing further investigation and diagnosis
After the problem is accepted and a work-around or permanent fix is identified, it is recorded in the CMDB as a known error. There are two types of known errors:
  Accepted problems that are not yet rectified (root cause analysis has been done and a solution has been identified, but not implemented)
  Accepted problems for which a resolution or circumvention is available

Allocating the support effort to find a solution to a problem is important. Depending on the nature of the problem, the impact, urgency, and the severity, it may prove more productive to the business as a whole to live with the problem rather than using all available support staff and all the budget for external support to diagnose and rectify it. Making a decision such as this requires detailed impact analysis and acceptance from the service level manager as well as the sponsor. It may lead to renegotiation of the SLA.

When the cause of the problem is identified and a decision to provide a solution is approved, error control takes over. The primary objective of this function is to eliminate all known errors by providing solutions to the problems and ensuring that they are implemented on all CIs where the problem has occurred or may occur. To meet this objective, error control and change management go hand-in-hand since change control is responsible for approving any changes made to any CI. See Figure A-5.
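The path from incident to known error and on to an approved change can be sketched as a small state machine. The state names and the advance function are hypothetical simplifications of problem control and error control; they are meant only to show that a record cannot skip a stage.

```python
# Hypothetical life cycle states and the transitions allowed between them.
ALLOWED = {
    "incident": {"problem"},              # no immediate solution found
    "problem": {"accepted"},              # problem management accepts it
    "accepted": {"known_error"},          # work-around or fix identified
    "known_error": {"change_requested"},  # error control asks for a change
    "change_requested": {"resolved"},     # change approved and implemented
}


def advance(record, new_state):
    """Move a record to new_state, rejecting transitions that skip a stage."""
    if new_state not in ALLOWED.get(record["state"], set()):
        raise ValueError(f"cannot go from {record['state']} to {new_state}")
    record["history"].append(record["state"])
    record["state"] = new_state
    return record


record = {"id": "PRB-1", "state": "incident", "history": []}
for state in ("problem", "accepted", "known_error"):
    advance(record, state)
print(record)
```

A record that is still a problem cannot be marked as resolved directly; it must first become a known error and pass through change management.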
Figure A-5 (diagram labels: Service Desk, incident, Incident Management, problem, Problem Management with problem control and error control, Change Management with change control)
The verification of solutions is especially important. First, you must verify that the proposed solution targets the source of the problem rather than removing the symptoms. Second, you must ensure that implementation of the solution does not result in any undesired side effects. Otherwise, the solution implementation may lead to other (even worse) problems that will harm the overall service delivery.

All of the disciplines in service support should work together to avoid the vicious circle of change. Much too often, solutions, changes, and implementations are rushed through without proper testing, leading to even more severe incidents of higher impact. This requires even quicker resolution, so the solution is not tested properly and new incidents are the result. This is depicted on the left side in Figure A-6. On the right side of Figure A-6, error control has had enough time to assess the impact of the solution. Change management also has had adequate time to assess the impact of the change, and the implementation had exactly the foreseen implications. The source of the problem was eliminated, and the technical support staff can start working on the next problem.
Figure A-6 The vicious circle of change (cycle labels: incident, problem, change, implementation)
restore services that support the business as quickly as possible, performing tasks such as researching the CMDB for known errors, while problem management focuses on determining the root causes of incidents, their resolutions, and prevention.
Change management
After configuration management, change management is the most important discipline for continuing to deliver quality service. The responsibility of change management is to manage changes to configuration items such as:
  Hardware
  Software
  Communication equipment and software
  Production application software
  All documentation, plans, and procedures relevant to running, supporting, and maintaining the production systems
  Environmental equipment
  People

The term production indicates that changes to equipment and applications used for development and test purposes are normally not the responsibility of change management. The processes that are used to manage changes involve:
1. Change initiation
2. Change reception: Logging and filtering
3. Initial change prioritization
4. Change assessment and scheduling
5. Change building
6. Change testing
7. Change implementation
8. Change review
To support the processes, several players must be involved. In the typical IT organization, a dedicated change manager is appointed. The change manager must receive, assess, approve, and manage the changes. To assist the change manager, a change advisory board (CAB) is appointed. This board consists of members from all the support groups within the organization, such as service desk, networking, space management, platform support, and representatives of the business. The board is responsible for assessing proposed changes for impact and estimating the resource requirements needed to design, build, test, implement, and review a change. The CAB also advises the change manager in change acceptance matters and assists in scheduling changes.

The CAB may be divided into subcommittees that handle changes in specific areas as shown in Figure A-7. The LOB representative from finance does not have to attend the meeting when changes to the production control software are discussed. Also, the presence of the representative for networking is not always required when changes to the central disk configuration are handled.

A super-committee, the CAB/emergency committee (CAB/EC), is also appointed. The purpose of this committee is to meet to authorize urgent changes on short notice. Because of the size of the change advisory board, it is impractical to convene a full meeting to handle urgent changes. The change manager may be authorized to accept some urgent changes, but we do not recommend doing so without consulting other key personnel. The CAB/EC, for example, may be made up of the change manager and key staff members from the CAB. It acts as the safety net, or sparring partner, of the change manager. The selection of members of the CAB/EC is a matter of preference and the nature of the change, but the change manager should always be a standing member.
Figure A-7 (change advisory board members and subcommittees; diagram labels: LOB representatives, IT Manager, Security, Operations, Networking, Systems Support, Service Desk, Development, Subsystems, Solutions, Change Manager)
Figure (normal change procedure flowchart; recoverable labels): Change initiators (service desk, technical staff, or users); change manager receives and filters RFCs and allocates priority; Urgent? (yes: to urgent procedure; no: decides priority); estimate impact and resources, confirm agreement to change and priority, schedule change (may be iterative); Authorized?; change builder builds the change and devises back-out and test plans; Successful?; close change.
Change initiation
Usually, changes can be requested by any technical staff member in the organization. Users should also be allowed to submit requests for change (RFCs), but to provide initial filtering and coordination, user RFCs require the approval of an LOB manager.
Change building: If the change is authorized, the appropriate technical group is given the task of building the change and devising a test plan. Create backout plans to enable the implementation team to revert to a known trusted state in case problems arise during the implementation of the change.

Change testing: An independent testing authority should test both the change and backout procedures prior to implementation. The change cannot be allowed to be implemented before satisfactory tests have been completed.

Change implementation: Upon completion of testing, the change manager coordinates the implementation of the change. Advise all relevant staff in advance of the planned implementation, perhaps through the service desk. If anything fails, execute the backout plans and remove the change.

Change review: To ensure that the desired effects are achieved and to assess whether the resource estimates are accurate, review all changes after a predefined period of time. This also helps to improve future estimates.
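The build, test, implement, and back-out sequence described above can be sketched as a single function. The run_change function and the callables passed to it are hypothetical; the sketch assumes that the test step covers both the change and its backout procedure and that a failed implementation raises an exception.

```python
def run_change(build, test, implement, back_out):
    """Run a change through build, test, and implementation with a back-out path."""
    log = []
    build()
    log.append("built")

    # The change must not be implemented before tests are satisfactory.
    if not test():
        log.append("rejected: tests not satisfactory")
        return log
    log.append("tested")

    try:
        implement()
        log.append("implemented")
    except Exception as error:
        # If anything fails, execute the backout plan and remove the change.
        back_out()
        log.append(f"backed out: {error}")
    return log


def failing_implementation():
    raise RuntimeError("disk full")


ok = run_change(lambda: None, lambda: True, lambda: None, lambda: None)
failed = run_change(lambda: None, lambda: True, failing_implementation, lambda: None)
untested = run_change(lambda: None, lambda: False, lambda: None, lambda: None)
print(ok, failed, untested, sep="\n")
```

The returned log is the kind of record that the change review can use later to compare the planned effects and estimates with what actually happened.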
[Figure: Urgent change procedure flowchart. The change manager calls a CAB/EC meeting. The CAB/EC urgently assesses impact, resource requirements, and urgency; changes that are not urgent go to the normal procedure. The change builder urgently builds the change and creates back-out plans, and the change tester tests it. The change manager coordinates the change implementation. If the result is satisfactory, ensure that records are brought up to date and close the change. Otherwise, implement the back-out plans and refer the change back to the CAB/EC.]
This interdependence leads to the following:

Configuration management tasks to update the configuration repository should be prompted in several ways, a large number of which fall within the scope of change management. Some of these are:
When new CIs are added to the IT infrastructure
When the status of CIs changes
When the owners of CIs change
When the location of CIs changes
When relationships between CIs change
When old CIs are removed
When an unregistered CI is found or information regarding a CI is inaccurate
When a change is requested

Change management should assess the impact of changes on the business and identify other CIs that could be affected. If the CMDB is not up to date, this affects the way in which the change is treated.

Any change request is made using an RFC, which is reflected in the CMDB. Unless this is done, it is difficult to track progress and trace problems in the IT infrastructure back to previous changes.

Unless change management is functioning effectively, the CMDB cannot reflect the current status of specific CIs in the organization.

If changes fail, the CMDB can be used to indicate what state the CI should be reverted to. If the CMDB is out of date, time is wasted trying to remember what the CI looked like before the work started.
Release management
Whereas configuration management is responsible for managing the logical aspects of CIs (including software and hardware CIs), release management is responsible for the physical aspects. Release management is involved whenever a significant hardware or software rollout takes place. In relation to software, the main types that are to be controlled are:
Application programs developed in-house
Bought-in application software and utilities
System software provided by suppliers

All of this software must be stored in a common secure software library, called the Definitive Software Library (DSL). This library contains all the definitive quality-controlled versions of all the software CIs defined in the configuration repository. The DSL is one single library, separate from other parts of the environment. The DSL is logically a single library, but it may be practical to use several physical locations, formats, and backup storage as part of the contingency plan. For hardware control, set aside an area for the secure storage of approved hardware components, named the Definitive Hardware Store (DHS). As with all the approved software, record all details that relate to the hardware components in the CMDB. The tasks performed by release management are:
Planning and overseeing the successful rollout of new and changed software and hardware and associated documentation
Physical storage, protection, distribution, and implementation of all approved software and hardware
Control of access to authorized versions and support of change control in releasing software for distribution for further work
Ensuring that only correctly released and authorized versions of software are in use
Distributing software to remote locations
Implementing (or bringing into service) approved software and hardware
Managing the organization's rights and obligations regarding software and hardware

The release management processes include elements that are concerned with development and other elements that are concerned with the production environment. Both are managed to ensure that the required standards are met when the service is delivered and to control the way the software is being used in the production environment. This is why release management is considered a service management discipline. Figure A-10 shows the details of the release management process. The left part of the figure shows the tasks that are related to verifying and ensuring the functionality and quality of the new software CIs, which are developed in-house or bought in. This is the control part of release management.
After the required specifications are met, the software, along with its attributes, is registered in the CMDB and stored in the DSL. The right part of the figure shows the functions that are related to distribution. The software is copied from the DSL and built. The build process may be a simple copy or a complete (or partial) compilation and linkage. The main issue is to test and verify that the output from the build process can be distributed and implemented successfully. This must be tested before initiating any distributions and implementations.
[Figure A-10: Release management process. Stages shown: Test, System Testing, DSL, Build, Distribution, Implementation.]
The cost of delivering a one-of-a-kind service properly is much higher than the cost of delivering a standard service. The price that the customer pays reflects the cost. To determine the cost, and thereby predefine the price that the customer must pay, you must answer all of the questions concerning who, why, what, where, when, and how. That is, you must define the service in such detail that there can be no misinterpretations about:
The deliverable
Quantities and quality of the deliverable
Prerequisites and requirements for the delivery
Division of roles and responsibilities between customer and provider
How, where, and when the delivery takes place
The penalties for not delivering
Benefits/penalties for increased delivery

Finally, when all these items are defined, you must determine the price.

Discussing SLM in the context of IT services typically applies to volume-customization and one-of-a-kind services. Within the enterprise, the IT organization provides the same basic services to all LOBs (mail, office applications, Internet access, and so on). It fulfills particular needs for each LOB by providing specialized services designed solely for this purpose (for example, accounts payable/receivable, payroll, and procurement). Likewise, an external network service provider wants to sell similar networking services to many customers and perhaps design special services for customers with special needs. In the service management organization, SLM is responsible for defining services. It is also responsible for managing customer demand and negotiating the SLAs. After the services are established and delivery has begun, service providers need to assure that the service is delivered as expected. They must also ensure continued delivery, which is also the responsibility of SLM.
To do this, SLM needs assistance from other disciplines that focus on various aspects of the service delivery processes and the overall mission of the IT department:
Capacity management: Deals with the daily monitoring and reporting of workloads, resource usage, and component performance. It is also responsible for capacity planning by identifying trends and predicting future needs.
Availability management: Ensures that the services are available to the users that are authorized to use those services, when they are needed. This is primarily achieved by ensuring the availability of each of the components that is part of the service.
Financial management of IT services: Manages the IT budgets and negotiates contracts with suppliers. It also plays a key role in determining the cost of a service (often based on resource usage), therefore assisting SLM with pricing the service.
IT service continuity management: Ensures that the IT services delivery may continue, or be re-established quickly, after a disaster. IT services are often required to perform business transactions, so the IT organization must have completed and tested plans and procedures for disaster recovery and related subjects.

The following sections explore these four disciplines and their association with SLM.
Capacity management
Insufficient capacity often leads to bottlenecks, performance problems, and loss of availability, all of which contribute to degrading service delivery. Looking at a typical client/server service, it is evident that, since more components make up the service as it is perceived by the end user, the capacity of each individual component must balance with the capacity of the other components. In the IT community, more capacity is often synonymous with new technology. Capacity is an attribute of the hardware components that make up the service or the amount of hardware resources available to software components. Therefore, capacity management is often seen as managing procurement of new advanced technology. Too often, new technology is procured when performance or capacity problems are experienced, and then the capacity management function becomes reactive rather than proactive. This tends to happen in a very complex environment where many components are a part of more services and are tied together in a giant web. Considering capacity as the maximum performance or output of a component, we can say that, to manage capacity of a service, it is important to manage the workloads of the service to forecast the need for capacity. It is also important to know what workloads run where and when, and under what circumstances. In general, this means that the objective of capacity management is to ensure that the appropriate technology is used in the best way possible. The word appropriate is determined by the level of service that is to be provided to the business at all times. Also, the phrase best way is determined by how well any given technology supports the business requirements of the users. Ensuring that the right technology is used to provide the best support for the business is like trying to hit a moving target that varies in size. Not only does the business environment change constantly, but technology changes happen so fast
these days that ordered devices may be obsolete before they are received. The rapid development of new technologies may even open new possibilities and opportunities for the business, leading to business changes driven by the availability of new technology. The e-revolution is one of the best examples of technology-driven business changes. Some of the questions that capacity management helps to answer are:
How will the new technology affect the way business is conducted?
How can we make the best use of these technologies?
Will they really save us money?
Are they going to make us more productive?

To answer these questions, capacity management draws upon data about the past environment, where the variables are known, and compares this data to current and projected future variables. Data about the past and present environment also helps to optimize current performance, estimate future needs and demands, and take steps to be ready to meet them when required. To handle all this, capacity management is divided into the following subdisciplines, each covering a different aspect of capacity management:
Capacity management database: Maintains the data related to capacity management
Performance management: Monitors and optimizes the performance of the existing components
Workload management: Identifies, understands, and forecasts workloads
Application sizing: Predicts service levels, as well as cost and resource implications of future applications or major modifications to existing applications
Modeling: Predicts systems performance under given volumes and varieties of work
Resource management: Understands the IT infrastructure to ensure that the organization uses the available technology that best suits the business
Demand management: Prioritizes customer demand for use of component resources without adding more capacity
Capacity planning: Predicts when components reach their saturation point and identifies the action to be taken to prevent this
The type of information that is stored in the capacity management database is technical, business, and cost data required by capacity management to produce technical and management reports showing usage and trends.
Performance management
The objective of performance management is to ensure that the agreed-upon service level is maintained. In addition, performance management is responsible for ensuring that each hardware, software, and networking component delivers the expected capacity. This is a day-to-day task that involves monitoring the capacity delivered to quickly identify problems or bottlenecks. The information gathered for monitoring purposes is stored in the capacity management database to keep historical information and help determine trends. SLM delivers the required service levels to be achieved for performance management. These are in the form of thresholds for each component that must be met to provide the agreed-upon level of service. If these thresholds are not met or if indicators show that they will not be met in the near future, performance management investigates the reason, identifies actions to tune the systems to meet the thresholds, and implements the tuning activities shown in Figure A-11.
[Figure A-11: Performance management activities: monitoring, analysis, tuning, implementation.]
All the activities of performance management are conducted in close contact with configuration, problem, and change management.
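As a rough illustration of the monitoring step, the following Python sketch compares measured values with per-component thresholds and lists the breaches that performance management would then analyze. The metric names and threshold values are hypothetical and do not come from any Tivoli product.

```python
def find_breaches(measurements, thresholds):
    """Return the names of metrics whose measured value exceeds the threshold."""
    return sorted(
        name
        for name, value in measurements.items()
        if name in thresholds and value > thresholds[name]
    )


# Hypothetical thresholds derived from the agreed-upon service levels.
thresholds = {"cpu_util_pct": 80.0, "disk_util_pct": 90.0, "response_ms": 500.0}
# Hypothetical readings from one monitoring interval.
measurements = {"cpu_util_pct": 91.5, "disk_util_pct": 62.0, "response_ms": 510.0}

breaches = find_breaches(measurements, thresholds)
print(breaches)
```

In practice the readings would also be written to the capacity management database so that trends can be derived from the history.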
Workload management
Workload management has three objectives:
Understand and document all workloads
Establish interfaces with relevant parties in the IT department for interchange of information
Implement an effective workload forecasting system

Breaking down a service into individual workloads that execute on one or more components in the IT infrastructure is crucial to understanding and defining the capacity needs for any one component. Furthermore, workloads often depend on one another to form a hierarchy in which one workload must be completed before the next one occurs. All the workloads, and the relationships between them, must be defined and categorized in the workload catalog, which is part of the overall capacity management database, as shown in Figure A-12.
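A workload catalog with "must complete before" relationships is a dependency graph. The sketch below, using Python's standard `graphlib` module (Python 3.9 or later), derives one valid execution order from such a catalog. The workload names are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical workload catalog: each workload maps to the set of
# workloads that must complete before it can start.
catalog = {
    "extract_orders": set(),
    "load_warehouse": {"extract_orders"},
    "nightly_report": {"load_warehouse"},
    "invoice_run": {"extract_orders"},
}

# static_order() yields every workload after all of its predecessors.
order = list(TopologicalSorter(catalog).static_order())
print(order)
```

`TopologicalSorter` raises `graphlib.CycleError` if the catalog contains a circular dependency, which is a useful consistency check on the catalog itself.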
In addition to the existing workloads, capacity management must understand new workloads to estimate future capacity needs. The metrics used for this estimation are obtained from the application sizing and modeling tasks of capacity management.
Application sizing
The objective of this task is to establish a means of predicting the service level, resource, and cost implications of new applications and major changes to existing applications. Application sizing is of particular interest in the early stages of the life of a service. Part of determining the cost of providing the service is a clear picture of the required capacity. Capacity management, therefore, supports SLM through the application sizing activities in the preliminary cost and business implications analysis.
Modeling
The modeling activities involve estimating or predicting the performance of a system under a given volume and variety of work. Modeling is the application sizing of hardware and networking components. You can perform modeling with more or less accuracy. The most accurate method is benchmarking, where a load is run on a given system and the performance is measured. This is the most expensive way of modeling (Figure A-13).
At the other end of the scale is estimation. Based on historical performance data and known variables, the performance of a workload is estimated. This is the least accurate way of modeling, but also the cheapest. Between estimation and benchmarking are:
Trend analysis: More historical data representing different workloads on different systems is compared with the expected workload on a new system.
Analytical modeling: Statistical methods are brought into play to provide more detailed workload and system models.
Simulation modeling: A subset of a workload is run on the new system to obtain data that can be extrapolated to provide the expected performance figures.

Analytical models and even the equipment needed to run simulation and benchmarking tests may be provided by the hardware supplier. However, internally in the IT department, the most commonly found types of modeling are estimation, trend analysis, and common sense. Modeling must be regarded as a tool that is available to all the tasks of capacity management since it is equally important and applicable to each of them.
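One simple form of trend analysis is a straight-line fit through historical utilization figures. The Python sketch below uses ordinary least squares with evenly spaced periods. The sample data is invented, and a linear trend is itself an assumption that may not hold for real workloads.

```python
def linear_trend(values):
    """Fit y = intercept + slope * x by least squares, with x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def forecast(values, periods_ahead):
    """Extrapolate the fitted line periods_ahead periods past the last value."""
    intercept, slope = linear_trend(values)
    return intercept + slope * (len(values) - 1 + periods_ahead)


# Hypothetical monthly CPU utilization percentages.
history = [40.0, 45.0, 50.0, 55.0]
next_month = forecast(history, 1)
```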
Resource management
Resource management works together with the availability and configuration management disciplines. It helps to provide an understanding of the organization's hardware, software, infrastructure, and other resources and to ensure that the organization is aware of changes in technology. This information is vital when evaluating the business implications of acquiring new technology. It is also important when suggesting the application of new technologies to solve business challenges.
Demand management
Capacity management must also manage customer demand for IT resources of limited capacity. (Limited, in this sense, means that the available capacity cannot be increased for technical, financial, or business reasons.) Such a situation may occur when a component fails completely or when decreased capacity or exceptionally high demand is experienced. The capacity constraints may even be the result of a deliberate business decision not to invest in the full capacity needed to provide full service to all LOBs during peak hours. In a situation with limited capacity available, customers compete for service, and there is an evident need for prioritizing the tasks. Demand management is part of capacity management and prioritizes competing demands based on business reasons rather than technical or other reasons. In this role, capacity management has to make some unpopular decisions, such as stopping or decreasing the service delivered to some users while others receive the usual high service level. However, since the decisions are based on business reasons, chances are that they are supported by senior management. And capacity management certainly needs that support when prioritizing.
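The prioritization described here can be sketched as a simple allocation: grant the limited capacity to requests in business-priority order until it runs out. The request names, priorities, and capacity units below are hypothetical, and real demand management would weigh many more factors.

```python
def allocate(capacity, requests):
    """Grant capacity in business-priority order.

    requests is a list of (name, priority, demand) tuples, where a lower
    priority number means higher business importance (an assumption of
    this sketch). Returns a dict mapping each name to the granted amount.
    """
    granted = {}
    remaining = capacity
    for name, _priority, demand in sorted(requests, key=lambda r: r[1]):
        share = min(demand, remaining)
        granted[name] = share
        remaining -= share
    return granted


# Hypothetical peak-hour demand against 100 units of capacity.
requests = [("reporting", 3, 40), ("order_entry", 1, 50), ("payroll", 2, 30)]
plan = allocate(100, requests)
print(plan)
```

Here the lowest-priority workload receives only what is left over, which mirrors the "unpopular decisions" the text mentions.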
Capacity planning
Using all the other capacity management disciplines, the foundation to create a capacity plan has been established. The ITIL defines the capacity plan as a plan that predicts when components will reach their saturation point and identifies actions to prevent saturation. Often, the capacity management discipline is perceived as creating and maintaining the capacity plan. In this definition, it is implied that all the other tasks (performance, workload, resource, and demand management as well as application sizing and modeling) are accomplished to provide all the information necessary to create the capacity plan. Figure A-14 illustrates capacity planning.
[Figure A-14: Capacity planning. Labels: time, workloads, capacity, technology, business demand, applications, capacity plan.]

The capacity plan is by no means a static plan. Since both the business and technological environments change over time, demand, available capacity, service levels to deliver, and business priorities change accordingly, affecting the capacity plan.
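To make the idea of a saturation point concrete, the following Python sketch estimates how many periods remain before a component's load reaches its capacity. It assumes a constant compound growth rate per period, which is a simplification chosen for this example; the function name and figures are hypothetical.

```python
import math


def periods_until_saturation(current_load, capacity, growth_rate):
    """Whole periods until load first reaches capacity under compound growth.

    Returns 0 if the component is already saturated and None if the load
    is not growing, in which case saturation is never reached.
    """
    if current_load <= 0 or capacity <= 0:
        raise ValueError("load and capacity must be positive")
    if current_load >= capacity:
        return 0
    if growth_rate <= 0:
        return None
    return math.ceil(math.log(capacity / current_load) / math.log(1 + growth_rate))


# Hypothetical example: 50 units of load, 100 units of capacity,
# 10 percent growth per month.
months = periods_until_saturation(50, 100, 0.10)
```

Because the plan is not static, such an estimate would be recomputed whenever demand, capacity, or growth assumptions change.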
The primary collaboration is between capacity management and SLM. When negotiating new SLAs (or renegotiating existing ones), SLM consults capacity management to assess the capacity needs to accommodate the customer requirements. After the SLA is negotiated, SLM sets the targets for capacity management to deliver, and capacity management reports performance and throughput achievements back to SLM.
Availability management
Sometimes, availability management can be regarded as part of capacity management. However, the responsibilities of availability management include planning, implementation, management, and optimization of IT services so that they can be used where and when the business requires. Availability management, as defined by the ITIL, is involved with much more than system availability. Availability management focuses on entire services and ensures that the services are available where and when they are needed. In doing this, availability management is heavily influenced by the following factors:
The complexity of the services
The reliability of the IT components and environmental services
The level of maintenance provided by suppliers or elements of self-maintenance
The infrastructure on which the services are built
The configuration of the infrastructure used to provide the service

When conducting availability management, you must observe the key elements (combined for all the components that are part of the service) in the following sections.
Availability
Availability is one of the main attributes of the quality of service delivery perceived by users. The availability of components to meet user requirements as stipulated in the SLA (expressed as a percentage) depends on these factors:
The reliability of components
The resilience to failure
The quality of maintenance and support
The quality of operating procedures

To optimize the availability of the service, you must take into account all of these factors for all components of the service. In this context, it is important to remember that the user's perception of the service depends on the availability of the hardware, software, and networking components as well as the availability of the data that is used. A service that meets the required availability may be characterized as a service that has minimal interruptions yet, when an incident occurs, is recovered quickly and efficiently.
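As a worked illustration of availability expressed as a percentage, the sketch below computes availability from agreed service time and downtime, and combines component availabilities for a service that needs every component to be up. The serial, independent-failure model is an assumption made for this example; the source text does not prescribe a formula.

```python
def availability_pct(agreed_hours, downtime_hours):
    """Availability as a percentage of the agreed service time."""
    return 100.0 * (agreed_hours - downtime_hours) / agreed_hours


def serial_availability_pct(component_pcts):
    """Availability of a service that requires all components to be up.

    Assumes the components fail independently, so the individual
    availabilities multiply.
    """
    result = 1.0
    for pct in component_pcts:
        result *= pct / 100.0
    return 100.0 * result


# Hypothetical month: 720 agreed hours with 3.6 hours of downtime.
monthly = availability_pct(720, 3.6)
# Two hypothetical components, each 99 percent available.
end_to_end = serial_availability_pct([99.0, 99.0])
```

The second figure shows why every component matters: the end-to-end result is lower than the availability of either component on its own.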
Reliability
From a quality-of-service point of view, reliability can be defined as freedom from operational failure. It is often measured as the mean time between failures (MTBF), the mean time between system incidents (MTBSI), or the number of breaks in a period. All of these values help determine the reliability of a component to perform a required function under the stated conditions for a stated period of time. The reliability of a service is partly determined by the amount of resilience built into the service and partly by the pervasive management applied with the aim of preventing failures from occurring. The resilience of a service is the ability of the service to continue providing an operational service when components of the infrastructure are non-operational.
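The following Python sketch derives MTBF and MTBSI from a list of incident intervals. It assumes a commonly used reading of the two measures, which the text itself does not spell out: MTBF is the mean uptime between the end of one incident and the start of the next, and MTBSI is the mean time from the start of one incident to the start of the next. The incident data is invented.

```python
def reliability_metrics(incidents):
    """Return (mtbf, mtbsi) in the same time unit as the input.

    incidents is a list of (start, end) pairs, sorted by start time and
    not overlapping. At least two incidents are needed.
    """
    if len(incidents) < 2:
        raise ValueError("need at least two incidents")
    pairs = list(zip(incidents, incidents[1:]))
    # Uptime between recovery from one incident and the next failure.
    uptimes = [nxt[0] - cur[1] for cur, nxt in pairs]
    # Time between the starts of consecutive incidents.
    intervals = [nxt[0] - cur[0] for cur, nxt in pairs]
    return sum(uptimes) / len(uptimes), sum(intervals) / len(intervals)


# Hypothetical incidents, as (start_hour, end_hour) since measurement began.
incidents = [(100, 102), (300, 301), (700, 704)]
mtbf, mtbsi = reliability_metrics(incidents)
```

Under these definitions the difference between the two values reflects the time spent restoring service.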
Maintainability
Maintainability defines the ability of an IT service to be maintained in or restored to a satisfactory operational state. Maintaining or restoring a service involves five separate stages:
Anticipating failures
Detecting failures
Diagnosing failures
Resolving failures
Recovering from failures
Serviceability
As used by the ITIL, serviceability defines the reliability, maintainability, and maintenance support of components for which external suppliers are responsible. When an external party assumes complete responsibility for an entire IT service and its support (as when a service is outsourced), availability is equivalent to serviceability.
Security
Availability management has the responsibility of the last letter in the basic security CIA principle: confidentiality, integrity, and availability.
From the perspective of availability management, among the security considerations that you must address are:
Services must only be available to authorized personnel.
After failure, services must be recoverable without compromising confidentiality and integrity.
Services must be recoverable without contravening IT security policies.
Access for contractors to hardware and software should be clearly identifiable.
Data must only be available to authorized personnel and only at agreed-upon times as specified in the SLA.

Figure A-15 shows the availability management perspective of the relationships between users, the IT organization, and external suppliers of services and the agreements/contracts that govern these relationships.
[Figure A-15: Availability management relationships. Labels: Users, IT Services, IT Systems, Serviceability, Underpinning contracts, Software developers, Software maintenance, Other maintenance (hardware, software, networking).]
Charging should be implemented only after careful consideration has been made. It may work as a double-edged sword. While providing money to the IT department, it may scare off users so seriously that they refuse to deal with their internal IT service provider and seek services from external providers. This may lead to higher costs for the remaining users, giving them more incentives to go to external providers, and, before long, the entire IT department may be outsourced. Figure A-16 illustrates the vicious charging cycle.
[Figure A-16: The vicious charging cycle. Labels: Fewer Users, Higher Cost.]
For these reasons, you may consider using notional charging instead of hard charging. This creates user awareness of the costs involved in the service provision without affecting their budgets. However, notional charging is effective only if the normal financial management for IT services processes are functional and effective so the users have a realistic idea of the cost of a service. Implement charging only when it will give a clear value to the organization. An environment that is ready for charging has these characteristics:
Budgetary control by users
Charging exists for other resources
Freedom of choice
Commercial flexibility
Adequate monitoring capabilities

The reasons for charging may include:
Improved cost consciousness
Better utilization of resources
Allows comparisons
Demand management
To recover IT costs in an equitable manner
Inform users how charges are derived, so they can influence usage/charges
Raise revenue
The costing and charging mechanisms used to align the IT infrastructure more closely to the business objectives are referred to as the cost management system. This must be an integral part of the overall financial management system of the organization. The objectives for the cost management system are to:
Provide assistance in developing a sound investment strategy that evaluates the options available from technology in the light of business strategy and objectives
Set targets for financial performance and measure that performance in terms of budgeted versus actual costs
Provide a basis for prioritizing resource usage
Ensure sound stewardship of all assets employed in the organization
Provide information for management's decision making and planning requirements
Provide a flexible and fast response to changing business circumstances

The way financial management for IT services meets these objectives varies slightly depending on the nature of the IT department, that is, whether it is a profit center or a cost center. Following the ITIL, the two may be defined as follows:
Profit center: A computer services business center that operates as a separate business entity, but with its business objectives set by the organization. It provides clearly identified products that are sold to a market. Each of the provided services carries a price tag.
Cost center: A utility cost center that provides services to other cost centers. Performance is not measured in terms of projected or anticipated return but on how effectively and efficiently it provides services to its users.

The major difference between the two models is the extent to which they charge the users. The profit center must charge in order to generate a profit, whereas the cost center may charge primarily to raise cost awareness among the users. Both need to estimate and measure the costs of service provision.
In its simplest form, cost estimation begins by identifying the IT services to be provided and then estimating the total resources needed to provide them. The cost of the resources is then broken down into costs per unit of output. The aim of cost estimation is to understand (on a user-by-user level) the proportion of the IT resources being used. To do this, it is necessary to break costs down into cost units that can be measured according to workloads used by individual users. The cost estimation is based on the following areas:
Cost units: A way to accumulate and classify costs for the purpose of calculating a rate. Typical cost units include software, equipment, accommodation, transfer, and organization.
Cost classification: Breaking down costs into units is not enough. There is still no way to determine how much a cost or resource is related to a particular user or group. Cost accounting can assist by further classifying costs as direct, indirect, capital, operational, fixed, or variable.
Workload estimation and forecasting: A way to calculate how each service is going to be used. Input is typically provided by capacity management.
Standard cost calculation: A standard cost is a carefully predetermined unit cost that can be used as a basis for total cost calculations or the measure of financial performance.
Standard cost units: These are used to determine the overall budget estimates. During the year, standard costs are monitored, and updated forecasts are made. A comparison of standard costs to actual costs enables financial management for IT services to assess the need for cost reduction or price increases.
Cost monitoring: The identified costs are monitored on a regular basis to enable more effective financial planning and capacity planning. Monitoring is also a prerequisite to implement charging. Monitoring should be automatic.
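The arithmetic behind cost estimation can be sketched in a few lines of Python: break the total cost down into a cost per unit of output, attribute it to users according to their workload, and compare a standard cost with the actual cost. All figures and names below are hypothetical.

```python
def unit_cost(total_cost, total_units):
    """Cost per unit of output."""
    return total_cost / total_units


def user_costs(total_cost, usage_by_user):
    """Attribute the total cost to users in proportion to their usage."""
    rate = unit_cost(total_cost, sum(usage_by_user.values()))
    return {user: rate * units for user, units in usage_by_user.items()}


def cost_variance(standard_unit_cost, actual_cost, actual_units):
    """Actual cost minus the standard cost of the units delivered.

    A positive result means the service cost more than the standard.
    """
    return actual_cost - standard_unit_cost * actual_units


# Hypothetical service costing 12000 per month, measured in transactions.
usage = {"finance": 300, "sales": 500, "hr": 200}
costs = user_costs(12000, usage)
variance = cost_variance(12.0, 13000, 1000)
```

A persistent positive variance is the kind of signal that would prompt the cost reduction or price increase mentioned above.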
Pricing
Any pricing policy must take into account the objectives of charging, the direct and indirect costs, the demand for the commodity, the size of the market, and the nature of the competitors. Based on the type of IT department (cost or profit center), charging can now be performed according to one or more of the following methods:
Direct charging: Customers are charged directly upon receiving a service, such as charging for the delivery of a PC.
Resource usage: Charges are based on the use of specific IT components or resources, such as disk space or CPU seconds.
Output related: Customers are charged for specific printouts or reports.
Apportionment: The costs of shared facilities are split up between the users of that facility or resource.
Market related: Customers are charged based on what other organizations are charging.
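Two of these charging methods lend themselves to a short sketch: a charge based on resource usage, and an equal split of a shared facility's cost between its users. The rate card, resource names, and LOB names are hypothetical, and an equal split is only one of several possible ways to divide a shared cost.

```python
# Hypothetical rate card: price per unit of each resource.
RATES = {"disk_gb": 0.50, "cpu_seconds": 0.002}


def resource_usage_charge(usage, rates):
    """Charge for the measured use of specific resources."""
    return sum(rates[resource] * quantity for resource, quantity in usage.items())


def split_shared_cost(shared_cost, users):
    """Split the cost of a shared facility equally between its users."""
    share = shared_cost / len(users)
    return {user: share for user in users}


# Hypothetical monthly usage for one LOB.
charge = resource_usage_charge({"disk_gb": 200, "cpu_seconds": 50000}, RATES)
shares = split_shared_cost(900, ["finance", "sales", "hr"])
```

Under notional charging, the same figures would be reported to users for awareness without being billed against their budgets.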
Fraud, sabotage, extortion, or commercial espionage
Infiltration of IT systems by viruses and other forms of malicious users
Industrial action or other unavailability of key staff

The three objectives of IT service continuity management and BCM are:
To reduce or avoid identified risks
To plan for the recovery of business processes if the business is disrupted
To transfer all or part of the risk to a third party

All business units or LOBs within an enterprise should develop and maintain plans to continue business in case of a disaster. Figure A-17 shows the typical process model for business continuity.
[Figure A-17: Process model for business continuity. Stage 1: Initiation (initiate BCM). Stage 2: Requirements and strategy. Stage 3: Implementation (organisation and implementation planning, implement standby arrangements, develop business recovery plans, develop procedures, initial testing, implement risk reduction measures). Other labels: review, testing, change control, training, assurance.]
Since the LOBs rely on IT services to perform their business, the IT department is heavily involved in this process. As is the case with any other business unit, the IT department should develop and maintain a set of plans to use in case of an emergency. While the CEO is responsible for business continuity planning for the whole enterprise, the IT manager is responsible for the overall plan for the IT department. The IT manager is responsible for defining the strategy and organization to use for business recovery (stages 1 and 2). The responsibility to develop, test, verify, and maintain plans and procedures for recovery of the individual services is often delegated to the team leaders. Tactical stages 1 and 2 of the BCM process cover both proactive measures, which prevent the emergency from occurring, and reactive measures. Operational stages 3 and 4 focus mainly on the reactive aspects. In stage 3, the product support teams are brought in to develop, document, and test emergency procedures. In stage 4, the procedures are tested with the users and maintained. Stage 4 must be repeated periodically to maintain awareness of what to do should anything happen. The plans must be maintained and updated whenever major changes to the infrastructure or services are implemented. Figure A-18 shows the typical content of business continuity plans. Each plan describes specific roles and responsibilities as well as activities to perform. It also contains supporting data, such as addresses and telephone numbers, for different phases of an emergency. These phases are best illustrated using an example of a fire in an office building of a small company as follows:
1. Emergency response and salvage: Call the fire brigade, and, if possible, prevent the fire from spreading and secure vital assets; evacuate the building.
2. Crisis management: While the fire is being handled, inform senior management, employees, families, customers, and suppliers, and maybe the media. Put stand-by accommodations and equipment on alert.
3. Stand-by invocation: After the fire is extinguished, assess the damage and decide what action to take. Invoke standby arrangements if necessary.
4. Recover business processes: Re-establish the basic IT services and business processes in intermediate offices. Provide accommodations and transportation for employees if necessary.
5. Plan return to normal: Arrange for the normal office to be cleaned and redecorated, and re-establish the IT infrastructure. Make plans for the move back to the normal office and normal business procedures.
6. Return to normal: Place move-back plans into effect.
[Figure A-18: Typical content of business continuity plans. Plan contents: roles and responsibilities, action lists, and reference data (including contract details and inventories). Alert phase: emergency response, salvage, crisis management, damage assessment, and the decision whether to invoke stand-by arrangements. Invocation and recovery phase: invoke stand-by arrangements and recover business processes.]
Before you establish the individual recovery plans for each business unit, you must develop and agree on a framework for the business recovery plans. This framework should include:
- A master plan to coordinate the overall recovery effort
- A series of other plans for activities that may need to be coordinated across the organization
- Plans for each key support function
- Plans for each critical business process
Figure A-19 shows a template framework.
[Figure A-19: Template framework for business recovery plans. Master Plan (overall co-ordination); Emergency Response Plan; Damage Assessment Plan; Crisis Management & Public Relations Plan; Salvage Plan.]
service and the levels of service that must be delivered and provide feedback to indicate what levels of service have been achieved. But what is high quality? Some users of a service may feel that they are receiving the best service ever while other users are dissatisfied with the same quality of service, even though the IT department providing the service feels that the quality delivered is satisfactory. In most companies, the quality of service is an arbitrary issue. Therefore the judgement of the quality of service becomes a subjective matter based on personal (often short-term) criteria. This is why customers can be satisfied one week and demand the resignation of the entire IT department the next. Before going into SLM, let's look at service quality and customer satisfaction.
[Figure: Level of quality (scale 0 to 100) plotted over time for IT performance and for user perception of IT performance.]
Determining the right level to deliver is part of SLM. Working with intangibles, such as expectations, makes it a difficult task. From a service provider point of view, the challenge is to keep customer satisfaction as high as possible while keeping costs down. Usually, higher quality means higher costs. Since the service provider is paid only to deliver to expectations, the optimum level of service to be delivered is in the expected range. This gives the service provider a small amount of flexibility to deliver a service of a slightly higher or lower quality than what is expected. This depends on such factors as customer loyalty, delivery cost, and available capacity. The service provider can choose to deviate from this (typically, by providing higher quality than expected) to promote services or to cater for specific LOBs. Determining the right level to deliver is part of SLM. Again, working with intangibles, such as expectations, makes it a difficult and tricky task.
According to this definition, the customer may use and pay for the service. In business organizations, it is not practical to negotiate service delivery on a person-by-person basis. Services are typically delivered to departments or LOBs and paid for by the organization, and the one paying does not necessarily have to use the service. In this case, the one responsible for the cost is the customer, and those who are not financially responsible are called users.
Usually, during negotiations between the customer and the provider, service quality is adjusted to meet the needs of both parties. This adjustment often leads to reductions in both service quality and service price without a readjustment of the users' expectations. When the provider delivers the agreed-upon level of service, the users are disappointed because they receive a lower level of service than expected. However, customer satisfaction is as expected because the sponsor receives the expected level of service.
manage the technical assets and support business needs while keeping the technical aspects out of the customers' sight. The customers are more concerned with what is being delivered than with how it is delivered.
[Figure: Services to provide. Labels: Business Objectives, Internal Processes, External Suppliers, Internal IT Departments.]
Quantifying IT services
A key to the success of SLM is correctly quantifying the services that are being provided. Unless there is an agreed-upon method of how services are to be measured, there is no way of knowing whether targets have been met. SLM is responsible for understanding and documenting customer requirements and translating them into a set of understandable measures. Figure A-24 illustrates the service design process, which consists of four steps:
1. Understanding and documenting customer requirements
   The basis for any service is to understand the customer's demands and requirements. Through this process, SLM acquires detailed knowledge about the customer environment and requirements. This understanding is a prerequisite for defining the service, estimating the capacity needs, and defining the measurements needed to support service delivery.
2. Specifying external standards
   With a basic understanding of the customer's requirements and demands, SLM can define the external standards. These specify the planned deliverables (both in terms of functionality and capacity) and the measurements that are used to quantify these to the customer, using customer terminology. Before completing the external standards, SLM must negotiate them with the customer. The external standards specify the functions and capacities that are delivered and the way in which they are measured. All of these must be accepted by the customer. The external standards, however, cannot be finalized without consent from all the teams in the IT department that are going to deliver on the promise. This consent is obtained by SLM using the internal standards.
3. Translating to internal standards
   After the external standards are defined, or, rather, during the specification and negotiation processes, you must translate them into a set of standards to be used internally by the IT department. The internal standards specify, in IT terms, the functional and capacity-related requirements that the IT department must fulfill to support the delivery, and the ways in which the delivery is measured and optionally charged. These specifications are negotiated between SLM and the other disciplines of service management. Each of the other disciplines is committed to providing the specified levels of service. The internal standards are produced by SLM and must be revised and renegotiated when the external standards change.
4. Producing contracts and agreements
   Finally, when both the internal and the external negotiations are finalized, the external and internal standards are used to create the final documents: contracts and agreements. SLM produces a set of contracts and agreements aimed at the customer. This set includes (for internal use):
- Service level requirements
- External specifications
- Service level agreement
- Service catalog
There is another set of contracts and agreements produced to be used with external suppliers. In this set, the following items are found:
- Service quality plan
- Internal specifications
- Operational level agreement
- Underpinning contracts
[Figure: Demands and requirements passing between end users/consumers and the IT department. Internal documents: Internal Specsheets, Operational Level Agreement, Service Quality Plan, Underpinning Contracts.]
The use of specsheets is helpful to the SLA design process. The purpose of a specsheet is to specify, in detail, what the customer wants (external) and what consequences this has for the service provider (internal). Specsheets do not require signatures, but they are subject to document control. The SLA and the service catalog are built from specsheets. When a service level requirements document is changed, the specsheets must be updated. This in turn leads to rebuilding the SLA. Therefore, you can use the specsheets to keep internal quality targets in line with the external demands. Figure A-26 illustrates the use of external and internal specsheets.
[Figure: Suppliers and external documents: Service Level Requirements, External Specsheets, Service Level Agreement, Service Catalogue.]
[Figure A-26: Use of external and internal specsheets. Labels: Agreements, Corporate Level, Customer Level, Service Level.]
Seven types of documents are generated and maintained by the service specification process:
- External specsheet: The external specsheet contains information about customer demands, which are quantified as measurable targets. It also defines responsibilities for delivery and the assurance of the quality of service.
- Internal specsheet: The internal specsheet contains all the information related to the building, control, and monitoring of the components that make up the service.
After completion of the specsheets, the business demands should be successfully transformed into IT deliverables. It is now possible to draft the formal SLM documents:
- Service catalog: This document provides an overview of the services that are available to the customers of the IT organization. As a marketing tool, the service catalog presents a profile of the IT organization as a service provider and shows customers exactly what the IT organization can do. This also helps the IT organization manage the expectations of the business more effectively. The design of the document should be consistent with its marketing purpose. This means that it should use information that is interesting to the customer and expressed in non-technical language. The layout should also be professional and interesting.
- Service level agreement: The format of each SLA depends on several factors, including the physical, cultural, and business aspects of the organization. Where the organization consists of several fairly independent business units, these should be seen as independent customers. Often, SLAs are divided into parts: a part specific to the customer that specifies responsibilities, terms, and conditions; a general part that describes the service; and several optional appendixes specific to the actual agreement.
- Operational level agreement (OLA): The OLA is an internal document that is used only by the IT department. It serves as the internal SLA, specifying the service, responsibilities, terms, and conditions in IT terms rather than business terms.
- Underpinning contracts: Review all underpinning contracts regularly, both to accommodate changing service level requirements and as a routine measure. Underpinning contracts must be easily accessible for all participants in the SLM processes. Underpinning services supplied in-house are also vital to the service. It is important for you to review these and introduce OLAs (if they are not already in place) to safeguard the supporting services.
- Service Quality Plan: After the SLA is negotiated and signed, the difficult task of delivering on the promise begins. Even more difficult is the ongoing monitoring and review of the services delivered to the customer. This can only be accomplished with a full understanding of the total IT service delivery situation in terms of:
  - The capabilities of the IT service
  - Agreed-upon service levels
  - The demands for internal and external suppliers
  This information is contained in a comprehensive Service Quality Plan, which aims to balance the customer requirements with the IT organization. The Service Quality Plan achieves this in the following ways:
  - Specification of process parameters
  - Specification of required management information
  - Specification of key performance indicators
  The Service Quality Plan document is the written definition of the internal targets, responsibilities, and delivery times that are necessary to live up to the agreed-upon service levels.
[Figure: Organization, processes, and tools.]
Organization
While the organization, roles, and responsibilities are covered in previous sections, it is important to emphasize that the ITIL model is only a suggestion. When organizing the service management organization, you may adjust the model to fit the specific needs and policies of a particular company. Chances are that, when transforming the current IT organization into a service management organization, many of the disciplines are already, at least partially, implemented. Use this as the starting point for the service management organization.
It is equally important not to implement all of the disciplines at one time. This can create too great a disturbance for the entire organization and, most probably, can lead to a chaotic situation that threatens the welfare of the entire company. Implementing service management is a gradual process of taking small steps and implementing the disciplines that provide the most benefit to the company first. In most situations, the two most obvious candidates are SLM and configuration management. Configuration management is one of the most difficult disciplines to implement. It requires a lot of hard work and discipline to combine many data repositories (often, with a lot of built-in redundancy) into one all-encompassing repository and to build the processes around it that ensure data consistency and integrity. Furthermore, the benefits are more long term. More immediate results are realized by implementing SLM. Doing this helps to shift the focus of the entire IT department to be much more business-oriented than if no SLM was in place. This shift in focus also helps to create an atmosphere in which the need for discipline and processes supporting the other service management disciplines is nurtured.
Processes
Processes are the bread and butter of service management. Where the organization defines roles and responsibilities (who does what), the processes define the achievements and procedures (inputs, outputs, and how to). Without processes, there can be no service management. In a highly-dynamic environment, such as the IT world, the organization, tools, and processes may change. The technology undergoes constant changes, and organizations are constantly aligned to the businesses. People move from one job to another and from company to company. Also companies are acquired and sold (almost at the speed of light) to the benefit of the overall business. In the middle of the chaotic structure that forms business today, the processes are the most stable of the three, despite having to be adjusted to support both the organizations and the underlying technology. In most cases, changes in technology or organization do not affect the nature (inputs and outputs) of the processes. Of course, processes need to be aligned to business requirements and company policies. They must also be constantly monitored for relevance and optimum efficiency. The success of service management relies more on processes than any other discipline. The execution of the processes ensures delivery of services according to the SLA. The processes ensure that incidents and problems are raised and
that solutions are identified and implemented when the service delivery is in jeopardy. Also processes ensure consistency of the data in the configuration repository. Processes are everything. Tools are merely there to assist.
Tools
Applying tools and technology alone will not solve any of the challenges of service management. The basis of a successful service management operation is well-defined processes that ensure that everyone knows what their responsibilities are, what deliverables they are supposed to provide and in what quality, and why they are doing it. Tools are nevertheless necessary to help the processes work, to automate processes where possible, and to handle the volumes. In some cases, monitoring system resources being the most obvious example, tools are a necessity to make the process work. The two most important parameters in deciding what tools are needed to support service management are integration and openness. How well the tools integrate and enable interdisciplinary processes and data usage is the key to a successful implementation. Using tools that are open (enabling integration into the current IT infrastructure and customization to support the specific organization and its processes) is a must. Failing to do so results in islands of management that are difficult, or even impossible, to bridge. This, in turn, leads to a loss of business focus, autonomous sub-optimization for specific needs, and loss of control.
This ongoing improvement process can, for example, be achieved by periodically performing the following tasks:
- Monitoring and reporting on service achievements
  Incorporate details of performance against all SLA targets, together with details of any trends or specific actions being undertaken to improve service quality, into the periodic report.
- Holding service review meetings with customers
  Hold review meetings on a regular basis with customers (or their representatives) to review the service achievement.
- Implementing a formal Service Improvement Program
  The Service Improvement Program (SIP) is a project that the organization establishes to continuously identify improvements in customer satisfaction and service quality as delivered by IT. When the analysis of service levels and achievement reports identifies issues that impact, or may impact, service quality, SLM in conjunction with problem management and availability management can initiate a SIP to identify and implement actions to overcome the issues and restore service quality.
- Maintaining SLAs
  Keep current all SLAs that are in place to ensure that the services covered and the targets for each are still relevant and represent the needs of the customers.
As shown in Figure A-28, all of the disciplines within service management encompass four distinct activities:
- Planning
- Delivery or deployment
- Measurement, and action based on the measurements
- Calibration and changes for improvement
At the outset, the IT organization and its customers plan the nature of the service to be provided. Next, the IT organization delivers according to the plan. It takes calls, resolves problems, manages change, monitors inventory, opens the service desk to end users, and connects to the network and systems management platforms. The IT organization then measures its performance to determine whether it is delivering superior service based on the explicit needs of the LOB. Finally, the IT organization and the LOB continually reassess their agreements to ensure that those agreements meet changing business needs.
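The periodic report on performance against SLA targets, described in the first task of this section, can be sketched in a few lines of Python. This is only an illustration under stated assumptions and is not part of any Tivoli product: the `SlaTarget` structure, the target names, the `higher_is_better` flag, and the sample values are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class SlaTarget:
    """One measurable SLA target (hypothetical structure for illustration)."""
    name: str
    target: float
    higher_is_better: bool = True  # availability: True; response time: False


def is_met(target: SlaTarget, achieved: float) -> bool:
    """Return True when the achieved value satisfies the target."""
    if target.higher_is_better:
        return achieved >= target.target
    return achieved <= target.target


def achievement_report(targets, history):
    """Summarize, per target, how many reporting periods met the target.

    history maps a target name to the achieved values, oldest first.
    """
    report = {}
    for t in targets:
        values = history[t.name]
        met = sum(1 for v in values if is_met(t, v))
        report[t.name] = {
            "periods": len(values),
            "met": met,
            "latest_met": is_met(t, values[-1]),
        }
    return report


# Invented sample data: three reporting periods per target.
targets = [
    SlaTarget("availability_pct", 99.5),
    SlaTarget("response_time_s", 2.0, higher_is_better=False),
]
history = {
    "availability_pct": [99.7, 99.4, 99.9],
    "response_time_s": [1.8, 1.9, 2.3],
}
report = achievement_report(targets, history)
```

A report of this shape gives the service review meeting a per-target view of how often each target was met and whether the most recent period met it, which is the kind of input that might prompt a Service Improvement Program.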
Planning
During the planning phase, IT and the LOB determine what services will be provided, at what levels, and for what ends. This effort leads to the establishment of SLAs, or contracts, that specify the who, what, when, and how of IT service. The most effective SLAs focus on key issues, such as:
- The needs of the LOB
- Business system availability
- Device and service quality
- Device usage and maintenance
SLAs succeed when they are simple, clearly stated, and measurable. Clear and concise SLAs form an IT organization's SLM foundation, matching the LOB's need with IT service as well as cost. For example, consider an organization that has highly-skilled, relatively self-sufficient engineers who can deal with a four-hour response time during normal business hours. That organization should not have to pay the same for its IT service as a customer-billing organization with less experienced staff running real-time, important applications that require a one-hour response time 24 hours a day. SLAs, while conceptually simple, can quickly become complex. When specifying the terms of the agreement, we recommend that you offer several basic levels of service rather than tailoring one for each organization. In this way, the total number of service options stays at a manageable level, and IT's ability to monitor them effectively is greatly enhanced.
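The recommendation to offer a few basic levels of service can be illustrated with a small catalog from which each LOB receives the lowest-cost level that still meets its needs. The following is a minimal sketch under assumptions: the level names, the eight-hour Bronze response time, and the relative cost figures are invented for the example, while the four-hour and one-hour cases mirror the engineering and customer-billing organizations described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceLevel:
    name: str
    response_hours: int      # maximum response time
    around_the_clock: bool   # True: 24 hours a day; False: business hours only
    relative_cost: int       # invented cost units, for comparison only


# Hypothetical catalog with three basic levels of service.
CATALOG = [
    ServiceLevel("Bronze", 8, False, 1),
    ServiceLevel("Silver", 4, False, 2),
    ServiceLevel("Gold", 1, True, 5),
]


def cheapest_level(max_response_hours: int, needs_24h: bool) -> ServiceLevel:
    """Pick the lowest-cost catalog level that satisfies the LOB's needs."""
    candidates = [
        level for level in CATALOG
        if level.response_hours <= max_response_hours
        and (level.around_the_clock or not needs_24h)
    ]
    if not candidates:
        raise ValueError("no catalog level satisfies the requirement")
    return min(candidates, key=lambda level: level.relative_cost)


# The self-sufficient engineers and the customer-billing organization.
engineering = cheapest_level(max_response_hours=4, needs_24h=False)
billing = cheapest_level(max_response_hours=1, needs_24h=True)
```

With only three entries to monitor, the engineers land on the middle level and the billing organization on the top level, so the two groups pay differently without a custom agreement for each.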
Delivery
Comprehensively delivering service at a competitive cost as outlined and mutually agreed upon in the plan is a difficult task. As shown in the previous sections, delivery involves many separate disciplines that span the IT functional groups (such as network operations, application development, hardware procurement and deployment, and software distribution and training) and that support all these elements. It also involves incident and problem resolution, configuration management, service request and change management, end-user empowerment, and the complete spectrum of network and systems management. Successful service delivery requires these functions to be integrated seamlessly.
Measurement
How can an IT organization determine whether it is meeting the service levels established with its customer? Much of the measurement step is built around monitoring those terms outlined in the SLAs. Therefore, an IT organization relies on technologies to actively monitor these service levels through the various delivery stages. These stages include the service delivery, monitoring of LOB assets, ensuring the health of LOB networks and systems, and managing changes to the LOB infrastructure. Two types of technologies support this measurement: real-time reporting tools and static historical reporting tools. For example, two calls may come to the service desk simultaneously. One call is covered under an agreement that entitles the caller to a one-hour resolution, while the second is entitled to a four-hour resolution. The service desk technology presents this information to the technician, who prioritizes the calls to ensure that both callers receive timely support. These technologies also include intelligent escalation utilities, operating in real time, to alert service desk management when agreements are in danger of being breached. Real-time reporting technologies enable management to initiate corrective action before service deteriorates. In addition to these real-time metrics, it is important for the service desk to monitor other key performance indicators including first-call resolution rates, SLA thresholds, high-priority open problems, problem time open, and call queue by analyst. Historical reporting is also vital to management for planning purposes. The data generated by these reporting tools substantiates the discussion that IT and LOBs have when they determine the appropriate level of service required. It also assesses the effectiveness of the service delivered.
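The service desk example above, in which two simultaneous calls carry different resolution entitlements, can be sketched as deadline-based prioritization plus a simple escalation check. This is an illustrative sketch, not the behavior of any specific service desk product: the `Call` structure, the sample timestamps, and the rule of escalating once 75% of the allowed time has elapsed are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Call:
    caller: str
    opened: datetime
    resolution_hours: int  # resolution time the caller's agreement entitles

    @property
    def deadline(self) -> datetime:
        """Latest time by which the call must be resolved."""
        return self.opened + timedelta(hours=self.resolution_hours)


def prioritize(calls):
    """Order open calls so the earliest SLA deadline is handled first."""
    return sorted(calls, key=lambda c: c.deadline)


def needs_escalation(call, now, warn_fraction=0.75):
    """Flag a call once warn_fraction of its allowed time has elapsed.

    The 75% default is an assumption for this sketch.
    """
    allowed = timedelta(hours=call.resolution_hours)
    return (now - call.opened) >= allowed * warn_fraction


# Two calls arrive at the same moment with different entitlements.
opened = datetime(2004, 12, 1, 9, 0)
calls = [Call("engineer", opened, 4), Call("billing", opened, 1)]
queue = prioritize(calls)

# Fifty minutes later, only the one-hour call is close to its deadline.
now = opened + timedelta(minutes=50)
at_risk = [c.caller for c in queue if needs_escalation(c, now)]
```

Sorting by deadline rather than by arrival time is what lets the technician serve the one-hour caller first while still resolving the four-hour call within its agreement.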
Calibration
The process of planning, delivering, and measuring the delivery of customized IT support to its LOB is continuous because competitive pressures, technologies, capabilities, and needs change over time. Planning is the foundation of SLM. Calibrating the plan keeps IT responsive to the continually-changing conditions throughout the entire organization. To calibrate the service delivered, successful IT organizations employ a combination of historical reporting tools and a decision support framework. While the real-time monitoring tools described earlier assist IT in running the day-to-day operations, decision support tools provide a framework for exploring data more completely to make better-informed decisions. These tools, often built around multidimensional analysis techniques, enable IT management to see relationships in the volumes of data generated by one or more operational systems, relationships that are rarely apparent in real-time or static reporting methodologies.
For example, as an IT manager, you are tasked with managing your organization efficiently and effectively. This means that you need to use the best means to support the LOBs in your company, and the best means are not always the same for each LOB. For instance, let's return to the earlier example of highly technical users, such as engineers, and less technical users, such as customer billing representatives. The engineers are relatively self-sufficient, while the billing representatives depend more heavily on your support. Given this, IT will likely support these two groups differently. By analyzing problem and usage data, service desk management determines how best to support each group or user, whether by telephone, e-mail, Web, voice mail, or a combination of these. The true power of decision support frameworks and static reporting technologies is to ensure that IT remains in sync with the LOB it supports.
The calibration step of SLM is an explicit reminder for IT and LOBs to constantly evaluate the effectiveness and appropriateness of the service delivered.
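The kind of analysis described for calibration, looking at problem and usage data along more than one dimension, can be approximated with simple grouping. The sketch below is a hedged illustration rather than a decision support tool: the record layout (LOB, contact channel, resolved on first call) and all sample records are invented for the example.

```python
from collections import Counter

# Invented service desk records: (LOB, contact channel, resolved on first call)
records = [
    ("engineering", "web", True),
    ("engineering", "web", True),
    ("engineering", "e-mail", True),
    ("billing", "telephone", False),
    ("billing", "telephone", True),
    ("billing", "web", False),
]


def preferred_channel(records):
    """For each LOB, return the channel that carries the most contacts."""
    counts = Counter((lob, channel) for lob, channel, _ in records)
    best = {}
    for (lob, channel), n in counts.items():
        if lob not in best or n > counts[(lob, best[lob])]:
            best[lob] = channel
    return best


def first_call_resolution(records):
    """For each LOB, return the fraction of contacts resolved on first call."""
    totals, resolved = Counter(), Counter()
    for lob, _, ok in records:
        totals[lob] += 1
        resolved[lob] += ok  # True counts as 1, False as 0
    return {lob: resolved[lob] / totals[lob] for lob in totals}


channels = preferred_channel(records)
fcr = first_call_resolution(records)
```

In this invented data set, the engineers mostly use the Web and resolve everything on first contact, while the billing representatives mostly telephone and need follow-up, which is the sort of relationship that would inform how each group is supported.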
repair technician is dispatched and determines that the asset needs to be replaced, the technologies generate the appropriate change order and initiate that process. When the change is approved and executed, the asset discovery tools confirm the work, close the change request, and report the new status in the asset management system. Finally, if the same end user initiates a second call, the service desk technician sees the updated inventory and a history of the change. In addition to the disciplines mentioned in the previous example, network and systems management integration encompasses other enterprise IT technologies. These include technologies for software distribution, event management, systems management, applications management, remote control, and security. The seamless integration of these technologies can reduce the burden for many labor-intensive IT operations.
Appendix B.
Customer order: The action of setting up an SLA by associating customers with service offerings.
Data collection: The process of obtaining performance and availability metric data from source applications for storage and later evaluation.
Dependency: The relationship between SLAs in which the validation of one SLA depends upon the validation of another SLA. Typically used when one or more SLAs, which are internal to a service provider organization, are monitored for the purpose of guaranteeing an external customer's SLA.
End time: The end time of a defined period in the schedule that is associated with a particular state of peak, standard, or no-service hours.
Evaluation: The examination of performance and availability data from one or more monitoring applications to determine if a violation or a trend toward a violation of an SLA has occurred.
Frequency: Can have one of the following meanings:
- In business schedules: How often the associated period is active
- In metric evaluation: How often the evaluation is to be performed
Measurement and metric: A standard of measurement or a measurable quantity, associated with guaranteed service levels to create SLOs. Metrics evaluate performance, availability, or utilization of resources, such as response time, CPU, and disk utilization.
Measurement source: The source application from where a measurement originates. Performance and availability measurements are collected by the source application and written to a central data warehouse for processing later. A measurement source can provide measurements for one or more components. Examples of measurement sources are:
- IBM Tivoli Monitoring for Transaction Performance
- IBM Tivoli Business Systems Manager
- IBM Tivoli Enterprise Console
- IBM Tivoli Monitoring
No-service: The state of a period in a business schedule in which SLAs are not evaluated. This time is typically used for down time or maintenance hours that do not count against the SLOs established in SLAs.
Offering: A service with guaranteed service levels. Offerings are associated with business schedules and form the building blocks for customer orders and SLAs. They can be differentiated to provide service level choices to customers (such as Gold, Silver, and Bronze levels of service). An offering must be in the published state to be included in an SLA order.
Offering component: Supplies the metrics for offerings and customer orders. At the time of an offering creation, one or more offering components are selected. IBM Tivoli Service Level Advisor checks to determine the number of measurement sources for a component.
Offering state: The state of a service offering. Valid values include:
- Draft: The offering is being created. It is not yet published and is not yet available to be included in a customer order.
- Published: The offering has been defined and is made available for inclusion in customer orders.
- Withdrawn: A previously published offering has been removed from the list of available offerings and can no longer be included in customer orders.
Order: The process by which an SLA is entered into the Tivoli Service Level Management solution. It includes customer information, a service offering, and the specific elements that make up the SLA.
Order ID: The assigned identification number that distinguishes one customer order from another.
Peak: The state of a period in a business schedule that defines the hours in which levels of service are the most critical to the customer. Typically it defines a more stringent level of service than that specified for standard hours.
Period: A component of a business schedule that divides the timeline into named intervals, such as critical, peak, prime, standard, low impact, off hours, and no service. The general meaning of those intervals is defined by the customer during SLA negotiations. For example, you may define different SLOs (thresholds) for each period, depending on how critical that particular period is for the business.
Published offering: An offering that is complete and made available to customers to be included in an SLA.
Realm: A grouping of customers that is used to organize customer information and, in some cases, to control access to that information. Customers may be grouped by region, by company, by a division within a company, or by some other logical grouping. Customers can be assigned to one or more realms.
Reports: Summarize the evaluated measurement data for an SLA. IBM Tivoli Service Level Advisor provides the following types of reports:
- Results reports show monitoring information for the peak or standard states of a specified metric in an order.
- Violations reports display the SLA violations during a specified period of time.
- Trends reports display trends toward the violation of breach values, that is, tendencies to violate SLAs.
Resource: A hardware, software, or data entity that is managed by Tivoli management software. In IBM Tivoli Service Level Advisor, the entity is monitored by performance and availability monitoring applications. Rollback: The capability of IBM Tivoli Service Level Advisor to return to the last valid state if there is a failure during customer order deployment or cancellation, enabling failed orders to be restarted or deleted. Service: Any task performed by one person or group for another person or group. Refer to the definition provided in Chapter 2, General approach for implementing service level management on page 23. Service element: A component that provides a piece of an overall service. Service elements are the building blocks used to construct service offerings and customer orders. Service level agreement (SLA): An agreement or contract between a service provider and a customer of that service, which sets expectations for the level of service with respect to availability, performance, and other measurable objectives. Service level objective (SLO): A specification of a metric that is associated with a guaranteed level of service that is defined in an SLA. The SLO is part of an offering and is associated with a business schedule so that different breach values can be set for each schedule period. Choices include peak, critical, standard, prime, off hours, and no service. Service level management (SLM): The disciplined, proactive methodology and procedures used to ensure that adequate levels of service are delivered to all IT users in accordance with business priorities and at acceptable cost. Effective SLM requires the IT organization to thoroughly understand each service it provides, including the relative priority and business importance of each. SLM is the continuous process of measuring, reporting, and improving the quality of service provided by the IT organization to the business. 
Service offering: A defined level of service that associates a business schedule, including specified peak, standard, and no-service periods, with particular metrics to be evaluated.

Service provider: A person or organization that provides a service to a customer based on an SLA.

SLA state: The state of an active SLA. It can assume one of the following values:
- Violation: One or more breach values have been exceeded, indicating that the agreed-upon level of service is not being met.
- Steady: All levels of service are currently being met, and there is no detected trend toward a violation of the SLA.
- Trend: A trend toward a future violation of an SLA has been detected.
- None: The SLA is not fully processed yet. This is an initial state.

Standard: The state of a period in a business schedule that defines hours in which levels of service are not as critical as during peak business hours.

Start time: May have one of the following meanings:
- In defining business schedules, this is the start time of a defined period in the schedule that is associated with a particular state of peak, standard, or no-service hours.
- In defining the schedule for metric evaluation, this is the time that the evaluation will be initiated.

Trend: A series of related measurements that indicates a defined direction or a predictable future result.

Trend analysis: The examination of related measurements to determine whether a breach level for a level of service is being approached, so that corrective action can be taken to prevent a violation of an SLA.

View: The display of the details of a business schedule, period, offering, customer, or realm.

Violation: The state of an SLA when one or more SLOs are not met. SLA violations can be used to trigger a remediation policy for affected customers.

Web report: SLA results made available through a series of Java servlets. Each report servlet can be integrated independently into the service provider's existing Web content. Using Web server authentication, report data can be restricted by customer or realm. Reports are displayed in a user's Web browser and show the results of evaluation and trend analysis of SLA data to validate an SLA or to assist in identifying problem areas and taking corrective action.

Withdrawn order: An order that is removed from the list of active orders that are managed to guarantee levels of service. Note: Withdrawn orders are not deleted, but are no longer active.

Withdrawn offering: An offering that was published, but which has since been withdrawn and is not available to customers for inclusion in an SLA.
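The glossary entries for service level objective and SLA state can be read together as a small state model. The following Python sketch is illustrative only: the class and function names, the dictionary layout for breach values, and the rule that a measurement is in breach when it is greater than the breach value are assumptions made for this example, not IBM Tivoli Service Level Advisor interfaces.

```python
from dataclasses import dataclass
from enum import Enum


class SlaState(Enum):
    NONE = "None"            # initial state, SLA not fully processed yet
    STEADY = "Steady"        # all levels met, no trend toward a violation
    TREND = "Trend"          # a trend toward a future violation was detected
    VIOLATION = "Violation"  # one or more breach values were exceeded


@dataclass
class Slo:
    metric: str
    # Hypothetical layout: one breach value per business schedule state,
    # for example {"peak": 2.0, "standard": 5.0}.
    breach_values: dict

    def is_violated(self, schedule_state: str, measured: float) -> bool:
        # A schedule state without a breach value (such as "no service")
        # is never evaluated. "Greater than" is an assumption that suits
        # metrics such as response time.
        limit = self.breach_values.get(schedule_state)
        return limit is not None and measured > limit


def sla_state(violations, trends, processed=True) -> SlaState:
    """Derive the SLA state from per-SLO violation and trend flags."""
    if not processed:
        return SlaState.NONE
    if any(violations):
        return SlaState.VIOLATION
    if any(trends):
        return SlaState.TREND
    return SlaState.STEADY
```

For a metric such as availability, where lower values are worse, the comparison in is_violated would have to be reversed.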
Business systems
A business system is a representation of a group of diverse but interdependent enterprise resources that are used to deliver specific business functionality. These resources can include applications or other resources that are distributed over different networks and installed on different platforms. For example, a Web banking application that is distributed over mainframe database systems, application servers, a firewall, an intranet, and the Internet can be considered a business system.

A business system is a hierarchical view that displays IT resources that relate to a business process. IBM Tivoli Business Systems Manager provides a flexible user interface that enables viewing of the resources that are of interest to a user (such as a manager of the Web Services group) or a group of users (such as the Web banking support team). It presents these resources in ways that reflect the business process that is monitored, the so-called business system.

A business system consists of:
- The system resources that provide the business function
- The appropriate prioritization of resources used to determine the health of the business system
- The relationship between system resources that may be shown

A business system can be created from the console or automatically upon receiving events. Effective business systems consider only resources that are important to the target business systems. An important factor in defining business systems is who will actually use the business system. A help desk may need a business system based more on the physical organization of systems and applications. However, a CIO may want a business system that shows all the business processes in the enterprise, but not at the level of detail needed by the help desk.

Business systems can be built according to the following aspects:
- An application or a set of applications (Web banking)
- A department (accounting department)
- A vertical area of responsibility (International Technical Support Organization)
- A geographic region (Europe, Middle East, Africa (EMEA) region for IBM)

Resources are represented as icons within the business system. To help determine the root causes of a business system outage, IBM Tivoli Business Systems Manager provides several viewing perspectives:
- Tree view: Lists the hierarchy of all resources
- Hyperview: The best viewing option for displaying a large number of resources at a glance
- Table view: Shows resources in a table format and is equipped with column filtering and sorting capabilities
- Topology view: Shows the topology of the business system to the desired level of detail
- Web Console: Shows browser versions of the tree view and hyperview
- Executive dashboard: Shows a high-level overview of the business system status

In addition, you can invoke the following views from any resource in the business system:
- Business impact view: Shows resources that are affected and their relationship to the impact-causing resource
- Event view: Displays the events that triggered the resource state change
Distributed Discovery
For distributed environments, an object type must be registered with IBM Tivoli Business Systems Manager. The object must then be discovered by the discovery process. This enables IBM Tivoli Business Systems Manager to identify and classify resources. Distributed resources can be discovered and monitored through the following interfaces:
- Agent listener
  IBM Tivoli Enterprise Console events can be forwarded through this interface. IBM Tivoli Enterprise Console rules can be developed to forward events to the IBM Tivoli Business Systems Manager database. The first event from a resource triggers the creation of the object, which serves as the discovery process.
- Common listener
  The common listener transport provides bulk and delta transactions. The bulk transaction populates the IBM Tivoli Business Systems Manager database with snapshots of the instrumented environments. The delta transaction keeps the IBM Tivoli Business Systems Manager database updated as new resources are introduced or removed from the instrumented environments.
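The difference between discovery by first event, bulk transactions, and delta transactions can be pictured with a small in-memory model. This is a sketch under stated assumptions: the dictionary-based resource store, the field names, and the three function names are hypothetical and do not represent the listener transports or the IBM Tivoli Business Systems Manager database schema.

```python
def apply_bulk(snapshot):
    """Build a resource store from a full snapshot of the environment."""
    return {res["id"]: dict(res) for res in snapshot}


def apply_delta(store, added=(), removed=()):
    """Apply resources introduced or removed since the last snapshot."""
    for res in added:
        store[res["id"]] = dict(res)
    for res_id in removed:
        # Removing an unknown resource is treated as a no-op.
        store.pop(res_id, None)
    return store


def on_event(store, event):
    """Agent listener style: the first event from a resource creates it.

    Returns True when the event caused the object to be created.
    """
    res_id = event["resource_id"]
    created = res_id not in store
    if created:
        store[res_id] = {"id": res_id, "type": event["object_type"]}
    store[res_id]["last_status"] = event["status"]
    return created
```

A bulk transaction replaces what is known about the environment in one step, while delta transactions and events adjust the existing store incrementally.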
z/OS Discovery
IBM Tivoli Business Systems Manager installation requires you to install three started tasks and run them on each z/OS system that feeds into IBM Tivoli Business Systems Manager. These started tasks perform a limited discovery of the objects running on the z/OS system. They feed the data to IBM Tivoli Business Systems Manager, where the objects are automatically discovered and placed.

For more detailed discovery, IBM Tivoli Business Systems Manager uses NetView for the z/OS family of products. It uses REXX routines within NetView to discover IMS, DB2, and CICS resources. These resources are sent automatically to IBM Tivoli Business Systems Manager and correctly placed in the object hierarchy.
Event propagation
Event processing is the process of capturing business-critical events from IBM Tivoli Enterprise Console or common listener and routing them to IBM Tivoli Business Systems Manager. The events are then processed and stored in the IBM Tivoli Business Systems Manager database.
Events affect the status of a resource. State changes are propagated upward to affect the resource's parents, which makes it possible to determine the status of business systems. Propagation is the process that allows events to escalate, or propagate, up the All Resources view or business systems. Propagation is implemented by generating a child event to the parent resources.

In a distributed implementation, all events are of the type exception. Depending on their priority, exceptions can be processed to affect the object alert state. If the exception threshold for the object in a specific priority bucket is exceeded, the object alert state is changed and child events are generated. In enterprise implementations, events can be either exceptions or messages.

A message is an object status event, and only one message can be posted against an object at a time. Examples of typical message event statuses are Up, Down, and Abended.
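The propagation behavior described above can be sketched as a tree of resources in which an exceeded threshold changes the alert state and sends one child event to the parent. The Python below is a simplified model, not the product's algorithm: the class name, the "high" and "low" priority buckets, the default thresholds, and the choice to count a child event in the parent's bucket of the same priority are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Resource:
    name: str
    parent: Optional["Resource"] = None
    # Hypothetical thresholds: open events tolerated per priority bucket.
    thresholds: dict = field(default_factory=lambda: {"high": 0, "low": 2})
    open_events: dict = field(default_factory=dict)
    alert_state: str = "normal"
    message: Optional[str] = None  # only one message at a time

    def post_exception(self, priority: str) -> None:
        """Record an exception and propagate if the bucket threshold is exceeded."""
        self.open_events[priority] = self.open_events.get(priority, 0) + 1
        if self.open_events[priority] > self.thresholds.get(priority, 0):
            already_alerting = self.alert_state == "alert"
            self.alert_state = "alert"
            # Generate a child event to the parent only on the state change.
            if self.parent is not None and not already_alerting:
                self.parent.post_exception(priority)

    def post_message(self, status: str) -> None:
        """A new message replaces the previous one (for example Up or Down)."""
        self.message = status
```

In this model a single high-priority exception on a leaf resource is enough to change the state of its business system, while low-priority exceptions must accumulate past the threshold first.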
Object types
In IBM Tivoli Business Systems Manager, an object type represents an IT component class, such as a machine, database, or application. An object type can have multiple event sources mapped to it. Examples of object types include Node, WindowsServer, OracleDatabase, CustomApp, Hub, and NetworkDevice.

Each object type can have:
- An icon associated with it
- Events that can appear under it
- A set of tasks associated with it
- One or more Uniform Resource Locators (URLs) associated with it
- One or more local applications associated with it

An object type can have multiple instances. Each actual IT component is an instance of that object type. For example, if you have an object type of NTServer and you have three NT servers called ServerA, ServerB, and ServerC, then you would have three instances of NTServer, which are NTServer on ServerA, NTServer on ServerB, and NTServer on ServerC. The Properties Page for each object instance lists the events that are received for that object instance.

Object types can be as granular as desired. Consider these points:
- All instances of a given object type will have the same icon, tasks, and URLs.
- Each instance will display only the events that have come in for that instance, even though the object type must have all possible event types for that object type defined to it.
- An instance of any given object type can appear in any or all business systems.

In an IBM Tivoli Business Systems Manager V3.1 distributed implementation, the only available object type is the generic object type.
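The relationship between an object type and its instances resembles the relationship between a shared definition and per-component state. The following Python sketch illustrates that idea with the NTServer example from the text; the class names, field names, icon file name, and event names are hypothetical and are not part of the product's data model.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectType:
    """Shared definition: every instance gets the same icon, tasks, and URLs."""
    name: str
    icon: str
    event_types: frozenset  # all event types defined to the object type
    tasks: tuple = ()
    urls: tuple = ()


@dataclass
class ObjectInstance:
    """One actual IT component, for example NTServer on ServerA."""
    object_type: ObjectType
    host: str
    received_events: list = field(default_factory=list)

    def receive(self, event_type: str) -> None:
        # An instance shows only the events it has received, but each of
        # them must be defined to the object type.
        if event_type not in self.object_type.event_types:
            raise ValueError(f"{event_type} is not defined for {self.object_type.name}")
        self.received_events.append(event_type)


nt_server = ObjectType("NTServer", "server.gif", frozenset({"Up", "Down"}))
instances = [ObjectInstance(nt_server, host) for host in ("ServerA", "ServerB", "ServerC")]
instances[0].receive("Down")
```

After this runs, only the instance on ServerA lists the Down event, while all three instances share the same icon through their common object type.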
send artificial events to IBM Tivoli Business Systems Manager if you want to populate it ahead of time with object instances.
Appendix C.
Example C-1 TEC to TBSM forwarding TEC rule example

rule: tec2tbsm_forward:
(
    description: 'invoke tec2tbsm.pl script to forward event to TBSM server.',
    event: _event of_class _class,
    reception_action:
    (
        exec_program(_event, 'D:/tbsmd/bin/tec2tbsm.pl', '', [], 'NO')
    )
).

change_rule: tec2tbsm_forward_Change:
(
    description: 'invoke tec2tbsm.pl script to forward event to TBSM server.',
    event: _event of_class _class,
    attribute: status set_to _new_status within ['ACK', 'RESPONSE', 'CLOSED'],
    action:
    (
        exec_program(_event, 'D:/tbsmd/bin/tec2tbsm.pl -n', '', [], 'NO')
    )
).
After you install or configure the TEC rule and script on the IBM Tivoli Enterprise Console server, examine the sample TEC events listed in Example C-2.
Example C-2 Sample TEC events processed by the tec2tbsm_forward rule

...
1~5911~1~1097778356(Oct 14 14:25:56 2004)
### EVENT ###
TEC_ITS_NODE_STATUS;source=NV6K;nvhostname=9.42.171.89;category=2;msg='Node Down.';nodestatus=2;adapter_host=bc1srv5;hostname=klywy0a;origin=9.42.170.86;sub_source=NET;iflist=['9.42.171.133'];END
### END EVENT ###
PROCESSED
...
1~5888~1~1097776526(Oct 14 13:55:26 2004)
### EVENT ###
TMTP-PERF-VIOLATION-BELOW;fqhostname='bc1srv6.itso.ral.ibm.com';parentTransactionId='null';rootTransactionId='5B976E80DBC0736F3B7F1C474C1AFACF0000006600000000';msg='Management Policy "TradeOnlineQuoteResponse", Transaction "TradeOnlineQuoteResponse.*" exceeded a lower performance threshold of 20 seconds. The transaction time is 13.088 seconds.';transactionName='TradeOnlineQuoteResponse.*';managementPolicyName='TradeOnlineQuoteResponse';userName='.*';hostname='bc1srv6.itso.ral.ibm.com';applicationName='GenWin';startTime='1096922434000';violatedThresholdValue=20.0;severity=MINOR;hostName='bc1srv6.itso.ral.ibm.com';returnCode=0;transactionDuration=13.088;transactionId='5B976E80DBC0736F3B7F1C474C1AFACF0000006600000000';date='Oct 4, 2004 4:40:47 PM EDT';thresholdId=113;END
### END EVENT ###
PROCESSED
...
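Each event in Example C-2 is an event class name followed by semicolon-separated attribute=value pairs and a closing END marker. As a rough illustration of how such a line could be split before it is mapped to a resource type, the Python sketch below parses that layout. The function name and the regular expression are assumptions derived only from the two sample events; quoted values that contain an embedded single quote are not handled, and this is not the logic of the tec2tbsm.pl script.

```python
import re

# One attribute: name=value; where the value is a single-quoted string,
# a bracketed list, or a bare token without semicolons.
_SLOT = re.compile(r"(\w+)=('[^']*'|\[[^\]]*\]|[^;]*);")


def parse_tec_event(text):
    """Split 'CLASS;attr=value;...;END' into (class_name, attributes)."""
    class_name, _, rest = text.strip().partition(";")
    slots = {}
    for name, raw in _SLOT.findall(rest):
        # Strip the surrounding single quotes from quoted values.
        if len(raw) >= 2 and raw[0] == raw[-1] == "'":
            raw = raw[1:-1]
        slots[name] = raw
    return class_name, slots


sample = ("TEC_ITS_NODE_STATUS;source=NV6K;nvhostname=9.42.171.89;category=2;"
          "msg='Node Down.';nodestatus=2;adapter_host=bc1srv5;hostname=klywy0a;"
          "origin=9.42.170.86;sub_source=NET;iflist=['9.42.171.133'];END")
event_class, attributes = parse_tec_event(sample)
```

List values such as iflist are kept as raw text in this sketch; a fuller parser would split them into individual entries.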
The tec2tbsm_forward rule invokes the tec2tbsm.pl script. The script issues ihstttec application programming interface (API) calls (as shown in Example C-3) to map the events to IBM Tivoli Business Systems Manager resource types. It then sends the events to IBM Tivoli Business Systems Manager for discovery, status change, or both.

The approach in Example C-3 (using an IBM Tivoli Enterprise Console rule and script) is one of many ways to integrate IBM Tivoli Enterprise Console events into the IBM Tivoli Business Systems Manager distributed solution. Evaluating each event and then forwarding it to IBM Tivoli Business Systems Manager through the ihstttec API call allows the most flexibility in mapping IBM Tivoli Enterprise Console events to IBM Tivoli Business Systems Manager resource types. It also allows any automation that is in place (such as IBM Tivoli Enterprise Console rules) to take effect before events are forwarded to IBM Tivoli Business Systems Manager.
Example C-3 Sample ihstttec API calls invoked by tec2tbsm.pl script

...
D:/Tivoli/bin/w32-ix86/TME/TEC/../../TDS/EventService/ihstttec.exe
    -b 'WintelServer;1.0'
    -i 'klywy0a'
    -p 'NetView node status'
    -s 'CRITICAL'
    -d 'WintelServer'
    -o '22'
    -h 'klywy0a'
    -m 'Host klywy0a is DOWN; nvhostname=9.42.171.89; category=netmon; nv_generic=0x0; nv_specific=0x0; nodestatus=DOWN; iflist=[9.42.171.133]'
...
D:/Tivoli/bin/w32-ix86/TME/TEC/../../TDS/EventService/ihstttec.exe
    -b 'UserTransaction;1.0'
    -i 'TradeOnlineQuoteResponse.*.bc1srv6'
    -p 'TMTP-PERF-VIOLATION-BELOW TradeOnlineQuoteResponse TradeOnlineQuoteResponse.* '
    -s 'MINOR'
    -d 'TradeOnlineQuoteResponse.*'
    -o '22'
    -h 'bc1srv6'
    -m 'Management Policy "TradeOnlineQuoteResponse", Transaction "TradeOnlineQuoteResponse.*" exceeded a lower performance threshold of 20 seconds. The transaction time is 13.088 seconds. ; fqhostname=bc1srv6.itso.ral.ibm.com; returnCode=0x0; thresholdId=0x71; hostName=bc1srv6.itso.ral.ibm.com; startTime=1096922434000; transactionDuration=1.308800000000000e+001; rootTransactionId=5B976E80DBC0736F3B7F1C474C1AFACF0000006600000000; violatedThresholdValue=2.000000000000000e+001; parentTransactionId=null; transactionName=TradeOnlineQuoteResponse.*; managementPolicyName=TradeOnlineQuoteResponse; userName=.*; applicationName=GenWin; transactionId=5B976E80DBC0736F3B7F1C474C1AFACF0000006600000000'
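The calls in Example C-3 suggest a two-step pattern: map the attributes of an IBM Tivoli Enterprise Console event to a set of flag values, then assemble the ihstttec argument list. The Python sketch below mirrors only that pattern. The flag letters and the sample values are copied from Example C-3; the meaning of each flag is not defined in this appendix, so the code does not interpret them. The function names, the decision to treat all eight flags as required, and the shortened -m text are assumptions, and the actual tec2tbsm.pl script may work differently.

```python
# Flag order as it appears in the calls shown in Example C-3.
FLAG_ORDER = ("-b", "-i", "-p", "-s", "-d", "-o", "-h", "-m")


def build_ihstttec_args(exe, values):
    """Return an argument list for the given executable and flag values."""
    missing = [flag for flag in FLAG_ORDER if flag not in values]
    if missing:
        raise ValueError(f"missing values for flags: {missing}")
    args = [exe]
    for flag in FLAG_ORDER:
        args += [flag, str(values[flag])]
    return args


def node_status_to_flags(slots):
    """Hypothetical mapping for a TEC_ITS_NODE_STATUS event.

    The constant values repeat the first call in Example C-3.
    """
    host = slots["hostname"]
    return {
        "-b": "WintelServer;1.0",
        "-i": host,
        "-p": "NetView node status",
        "-s": "CRITICAL",
        "-d": "WintelServer",
        "-o": "22",
        "-h": host,
        "-m": f"Host {host} is DOWN; nvhostname={slots['nvhostname']}",
    }


args = build_ihstttec_args(
    "ihstttec.exe",
    node_status_to_flags({"hostname": "klywy0a", "nvhostname": "9.42.171.89"}),
)
```

Building the arguments as a list rather than a single command string avoids quoting problems when a message contains spaces or quotation marks; the list could then be handed to a process launcher such as subprocess.run on a system where ihstttec is installed.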
SLA      service level agreement
SLI      service level indicator
SLM      service level management
SLO      service level objective
SNMP     Simple Network Management Protocol
SQL      Structured Query Language
STI      Synthetic Transaction Investigator
TBSM     IBM Tivoli Business Systems Manager
TCP/IP   Transmission Control Protocol/Internet Protocol
TDS      Topology Display Services
TDW      Tivoli Data Warehouse
TEC      IBM Tivoli Enterprise Console
TEDW     Tivoli Enterprise Data Warehouse
TMR      Tivoli Management Region
TMTP     IBM Tivoli Monitoring for Transaction Performance
TSLA     IBM Tivoli Service Level Advisor
UDB      Universal Database
URI      Uniform Resource Identifier
URL      Uniform Resource Locator
Index
Symbols
%age_Max 137 %age_Min 137 building SLAs 162 bulk discovery 61 bulk transaction 523 Business Continuity Management (BCM) 491 business decomposition 134 business goals 5556, 65, 72, 79, 87, 94, 206 business information 30 business knowledge base 19 business management 40 business owners 26 business process 18, 134 business process-based business system 122 business recovery 493 business representatives 28 business schedule 516 business service basing SLAs 58 functions 32 monitoring from this perspective 57 business service management (BSM) 17 business system 18, 59, 113, 115, 117, 525 best practices for building 120 business process based 122 concept 59 constructs 103 creation 119 Drag and Drop 119 folder 525 folder shortcut 525 hyperview 126 propagation rules 59 relationships 59 resource 525 resources 59 shortcut 525 technology-based IBM Tivoli Business Systems Manager 121 topology view 127 types 121 views 60 Web Console 129 Business System Shortcut (BSS) 120 business system tree 118 business system view 60, 521
A
ability to deliver 33 ABS (Automatic Business Systems) 116 adjudicate violations 170 adjudication 170 adjusting SLAs 116 administration tools 16 agent listener 102, 522 agent site 71 aggregated correlation 82 alert priority 118 alert propagation 118 alert state 118 AMR (Application Response Measurement) 80 analytical tools 16 API call, ihstttec 215 Application Response Measurement (ARM) 80 Application Response Monitoring 192 application sizing 478, 481 ARM API 81 ARM correlation 81 ARM engine 81 auto discovery 61 Automatic Business Systems (ABS) 116 automatic ticket request processor 108 availability 36, 484, 516 availability management 42, 450, 476, 484, 487
B
basing SLAs on business services 58 BCM (Business Continuity Management) 491 breach value 116, 516 BSM 17 solution 17, 21 tools 39 BSS (Business System Shortcut) 120 building business systems 119 building offerings 158
537
C
CAB (change advisory board) 466 CAB/EC (change advisory board/executive committee) 467 calibration 513 capacity management 33, 42, 450, 476, 483 subdisciplines 478 capacity management database 478479 capacity plan 482 capacity planning 478, 483 CCTA 448 central warehouse ETL 67 change 453, 516 assessment 469 initiation 469 prioritization 469 reception 469 urgent 470 change advisory board (CAB) 466 change advisory board/executive committee (CAB/EC) 467 change management 43, 107, 451, 454, 466 processes 466 change procedure normal 468 urgent 470 change request 454 change request processor 108 changing schedules 175 changing SLAs 169 changing SLOs 170 charging 487 child event 118 stopping from propagating 141 CI hardware and software 474 identification 455, 457 location 457 owner 457 state 457 client satisfaction 9 CMDB (configuration management database) 455, 458 common listener 102, 523 component 516
repair time 47 type 516 CompTyp_Cd column 159 computer services business center 489 configuration 456 configuration item (CI) 455456 attribute 456 configuration management 452, 454, 459 control 455 identification 455 status accounting 455 verification 455 configuration management database (CMDB) 455, 458 Configuration Repository 455 console consolidation 56 console server 64 constructing services and business systems 20 constructs 35 consumers 499 contingency planning 450 continuous improvement 48, 50, 205, 312 control center server 70 cost calculation 490 classification 490 estimation 489 monitoring 490 units 490 cost center 489 cost control 9 cost management 43, 450 system 489 cost of support 48 costing 487 creating offerings 158 crisis management 493 critical path management 57 Critical Watch List (CWL) 129 Crystal Enterprise Professional for Tivoli 104 Crystal Enterprise Server 71 customer 498, 516 order 517 requirements 501 satisfaction 497 segregation 76 transactions 79 CWL (Critical Watch List) 129 cycle 96
538
cycle time 96
D
dashboard roles 525 data collection 517 data mart 66, 68, 70 ETL 6768 database server 63 defining services in TBSM 187 Definitive Hardware Store (DHS) Definitive Software Library (DSL) delta transaction 523 demand management 478, 482 dependency 517 deployment review session 48 design specifications 16 desired quality 32 DHS (Definitive Hardware Store) discovering resources 61 discovery by event 61 discovery processing 522 Distributed Discovery 522 documentation 10 Drag and Drop business system DSL (Definitive Software Library) dynamic resource 164 Dynamic Resource List 407
473
event handler server 64 event management 42 event processing 523 event processing and propagation 62 event propagation 523 Event Viewer 125, 353 events propagation 118 exception 118119 executive awareness 57 executive dashboard 130, 525 executive management 40 executive sponsor 27 executive view service 525 executive view service resource 525 expected quality 26, 32 expected service 497 external metric 112 external specsheet 505 external standards 501 Extract-Transform-Load (ETL) 66
F
119 452, 472 fault management 104 financial management for IT services 477, 487, 491 folder 525 formula for PBT 137 frequency 517
E
effectiveness of SLM 50 efficiency of SLM 49 emergency response 493 end time 517 error control 463 escalating SLA events 186 escalation 459 ETL frequency 152 processes for Tivoli Service Level Advisor 152 runs 152 ETL (Extract-Transform-Load) 66 ETL1 66 ETL2 66 evaluation 158, 517 frequency 158, 162 of SLA 105, 157 event escalation 186 event group 89
G
gemEEConfig command 227, 230 gemgenprod command 226 generic object 190 generic object type 524 generic service 497 generic TBSM objects 190 generous service 497 GenWin playback 213 GTM schema 103
H
hard charging 488 health monitor server 64 heartbeat function 98 high availability managing using PBT and RLP 139 high priority 22 high-level design 26 historical monitoring 103
Index
539
historical reporting 46 history server 63 hole 97 host integration server 64 housekeeping 66 hyperview 60, 126
I
IBM Tivoli Business Systems Manager 56 functions 56 instrumentation 214 object types 524 overview 56 servers 63, 69 IBM Tivoli Monitoring architecture 98 benefits 95 business goals 94 concepts 96 functions 94 instrumentation 212 identification of CI 455, 457 ihstttec 524 ihstttec API call 215 impact of incident 462 improvement programs 15 improving SLM 117 incident 20, 459 impact 462 life cycle 460 management 43, 454, 461 priority 462 severity 462 instance, aggregated performance statistics 82 instrumentation 212 IBM Tivoli Business Systems Manager 214 IBM Tivoli Monitoring 212 IBM Tivoli Monitoring for Transaction Performance 213 IBM Tivoli Service Level Advisor 216 Tivoli Data Warehouse V1.2 216 integration with TBSM 186 integration, the power of 513 internal metric 112 internal specsheet 505 internal standards 502 IT domains 39 IT Infrastructure Library (ITIL) 5, 448
IT knowledge base 19 IT management 41 IT representatives 29 IT service 18 IT service continuity management 477, 491 IT_EXEC 525 ITIL 22 ITIL (IT Infrastructure Library) 56, 22, 448
J
J2EE components 312 J2EE instrumentation 83 J2EE monitoring 192 Java byte-code insertion 83 JVM memory 254
K
knowledge base business 18 IT 18 knowledge of the business function 10 known error 451, 462463
L
libarm library 81 life cycle of incident 460 of service 453 lines of business (LOB) 4, 449 live servlet sessions metric 171 LoadGEMIcons command 226 LOB (lines of business) 4, 449 location of CI 457 lower-level business system 231
M
maintainability 485 maintenance period 116, 175 maintenance schedule 175 managing expectations 9 mean time between failure (MTBF) 485 mean time between system incidents (MTBSI) 485 measurement 517 measurement layer 54 measurement metrics 34 measurement source 517 message 118, 523
540
metric 34, 517 external 112 internal 112 review 116 modeling 478, 481 monitor transactions 79 monitoring capabilities 34 enhancing 135 tools 16 MsmtRul table 159 MsmtTyp table 159 MsmtTyp_ID column 159 msrc_cd value 152 MTBF (mean time between failure) 485 MTBSI (mean time between system incidents) 485
P
parent performance initiated trace 82 parent-based aggregation 82 parentSLAEscalation 186 PBT (percentage-based thresholding) 136137, 312 PBT criteria 137 PBT formula 137 peak 517518 people 10 percentage-based thresholding (PBT) 136137, 312 perception of quality 26 perception of services 31 performance 36 performance issue 79 performance management 478479 activities 480 period 518 periodic reviews 49 physical domains 38 physical resource 525 physical tree 525 policy-based correlators 82 Populate Measurement ETL step 162 Populate Registration ETL step 162 predictive management 55 pricing 490 priority of incident 462 proactive improvement of SLM process 50 proactive integration tools and processes 51 proactive management of service levels 51 proactive response to business changes 50 problem 451 problem control 463 problem management 107, 451, 454, 463 tasks 465 problem request processor 107 problem tickets 107 process improvement model 25 processes 10 product mapping 54 production 466 profit center 489 project manager 27 propagation 118, 523 propagation of alerts 118 propagation rules 103 propagation server 64
N
negotiating OLAs 28 negotiating on SLAs 37 negotiating SLAs 28 no service 517 No Service period 175 notional charging 488
O
object discovery 61, 522 objects 117 occurrence 97 offering 75, 102, 517 offering component 517 offering evaluation 158 offering resource types 158 offering state 518 offerings 158, 375 Office of Government Commerce (OGC) 448 off-the-shelf 475 OGC 448 OLA (operational level agreement) 13, 506 OLA negotiation 28 one-of-a-kind 475 ongoing management 15 ongoing SLM process 44 operational level agreement (OLA) 13, 506 order 518 order ID 518 OS/390 adapter 102 owner of CI 457
Index
541
Q
QoS (Quality of Service) 87, 191 components 312 quality 496 Quality of Service (QoS) 87, 191, 312 quality of service level improvement 48 quality perception 26, 31 quality service 459, 496 quantifying IT services 501
RIM 92 RLP (resource level propagation) 136, 312 roles and responsibilities 26 rollback 519 root cause analysis 57, 79
S
satisfaction of customer 497 schedule 516 schedule changes 175 schedule replacement 178 scheduling maintenance 175 scmd command 261 scmd log handler 186 Secondary Impact Information (SII) 107 security 485 service 30, 519 availability 47 definitions 101 expected 497 generic 497 generous 497 life cycle 453 organization 507 processes 508 quality 496 quantifying 501 specification 503 tools 509 total 497 service catalog 13, 30, 505 service compositions 20 service context 20 service delivery 448450, 507 disciplines 453, 475 model 5 service desk 454, 459 service desk responsiveness 47 service element 519 service health 20 Service Improvement Program (SIP) 15, 510 service level agreement (SLA) 13, 506, 519 building 162 changes 169 evaluation 105, 157 management 449 negotiating 37 negotiation 28
R
Rational Robot 82, 192, 213 RDBMS Interface Module 92 realm 75, 518 real-time faults 47 real-time management 55 real-time monitoring 102 Redbooks Web site 536 Contact us xvi rediscovery 61 Registration ETL 153 release management 454, 472, 474 processes 473 tasks 474 reliability 485 replace schedule 178 replacing resources 170 reporting 79 function 40 IBM Tivoli Business Systems Manager 58 tools 16 reports 518 Request for Changes (RFC) 465, 469 resilience 485 resource 18, 519, 525 resource discovery 58, 61 resource level propagation (RLP) 136, 312 resource management 478, 482 resource models 96, 102 resource regulations 9 resource type 158 resources definitions 163 resources selection 163 restricted operator 133 review the metrics 116 RFC (Request for Changes) 465, 469
542
period 164 reporting, alerting 105 tiered 171 service level improvement 48 service level indicator (SLI) 14 service level management (SLM) 34, 198, 447, 449450, 495, 519 approach 24 benefits 7 challenges 7 components 10 convergence with business service management 18 definition 5 effectiveness 50 efficiency 49 external role 499 functions 12 goals 7 implementation 25, 35 integration 20 internal role 499 life cycle 74 life cycle with IBM Tivoli Service Level Advisor 73 management tools 38 measurement data mart 72, 78 monitoring 38 objectives 500 ownership 28 planning 26 pros and cons 6 responsibilities 499 Tivoli Service Level Advisor 72 service level manager 28 service level objective (SLO) 14, 519 changing 170 criteria 36 service levels reviews 49 service management 448, 500 service offering 519 service provider 519 service provision calibration 513 delivery 512 measurement 512 planning 511 service quality 496 Service Quality Plan 506
service support 5, 448-449, 451, 454, 464, 475, 507 disciplines 453 serviceability 485 services definitions in TBSM 187 servlet sessions 254 metric 171 severity of incident 462 shortcut 525 sibling transaction ordering 82 SII (Secondary Impact Information) 107 simulate customer transactions 79 SIP (Service Improvement Program) 15, 510 SLA (service level agreement) 13, 506, 519 SLA state 519 SLI (service level indicator) 14 SLM (service level management) 3-4, 7, 10, 12, 72-73, 198, 447, 449-450, 495, 499-500, 519 SLM administration server 77 SLM approach 110 SLM database 78 SLM improvement 117 SLM measurement data mart 72, 78 SLM reports 77 SLM server 77 SLO (service level objective) 14, 519 SNA protocol 102 SNMP managers 102 software control and distribution 452 solution 463 verification 464 source 525 source ETL 66 specsheet 504 external 505 internal 505 sponsor 499 standard 520 stand-by invocation 493 start time 520 state of CI 457 status accounting 455 steady 519 STI (Synthetic Transaction Investigator) 191 STI Recorder 82 Synthetic Transaction Investigator (STI) 82, 191
T
table view 60 tapmagent 81 target ETL 66 technical information 30 technology-based IBM Tivoli Business Systems Manager business system 121 threshold 97 ticket request processor 108 tiered SLA 171, 278 Tivoli Business Systems Manager architecture 62 console 129 overview 117 roles 132 roles in SLM 132 services 187 system types 121 user roles 132 views in SLM 125 Tivoli Data Warehouse 64 architecture 68 overview 64 reporting 103 V1.2 instrumentation 216 Tivoli Enterprise Console adapter 93 architecture 90 benefits 88 business goals 87 concepts 89 functions 87 Tivoli Enterprise Data Warehouse data mart 68 Tivoli Monitoring for Transaction Performance architecture 83 benefits 80 business goals 79 concepts 80 instrumentation 213 main functions 79 Tivoli Service Level Advisor and SLM 164 architecture 76 benefits 74 business goals 72 concepts 75 databases 77 ETLs 152 evaluations 157
instrumentation 216 integration 103, 186 main functions 72 offerings 158 processes 152 schedules 157 SLM life cycle 73 TMTP object 190 tools 10 topology view 60, 127 total service 497 tree view 60, 125 trend 520 trend analysis 520 trends calculation 181 types of business system 121
U
UC (underpinning contract) 13, 165 underpinning contract (UC) 13, 165, 506 understanding services 29 urgency 462 urgent change procedure 470 urgent changes 470 usage information 32 user perception 26 user roles in TBSM 132 utility cost center 489
V
view 520 violation 519-520 violations adjudications 170 visibility of SLA breaches 58 visibility of SLA trends 58 volume customization 475
W
warehouse agent 71 warehousing data 20 Web Console 129 application server 64 Web Health Console 95 Web report 520 withdrawn offering 520 withdrawn order 520 work space 61, 128
workload catalog 480 workload management 478, 480 objectives 480 workloads 480
X
XML BSV definition 121 XML Business System 116
Y
yellow event 144 yellow objects 140 yellow status of resources 131
Z
z/OS data 117
Service Level Management Using IBM Tivoli Service Level Advisor and Tivoli Business Systems Manager
Integrate Tivoli Business Systems Manager and Tivoli Service Level Advisor
Map business service management to service level management
Achieve proactive service level management
Managing IT costs requires repeatable and measurable processes such as the best practices for service level management (SLM) documented in the IT Infrastructure Library (ITIL). Central to the ITIL best practices are the service management processes, which are subdivided into the core areas of service support and service delivery.
This IBM Redbook takes a top-down approach that starts from the business requirement to improve service management. This requirement includes the need to align IT services with the needs of the business, to improve the quality of the IT services delivered, and to reduce the long-term cost of service provision. The book focuses on how clients meet these needs by implementing SLM processes supported by IBM Tivoli Service Level Advisor and IBM Tivoli Business Systems Manager.
IT managers and technical staff who are responsible for providing services to their customers can use this IBM Redbook as a practical guide to SLM with IBM Tivoli products. It takes you from a general outline of SLM to specific implementation examples in banking and trading that incorporate the Tivoli monitoring products.
BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE
IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment.