Sun Microsystems, Inc. UBRM05-104, 500 Eldorado Blvd., Broomfield, CO 80021 U.S.A. Revision A
Copyright 2007 Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved. This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers. Sun, Sun Microsystems, the Sun logo, iPlanet, Java, JumpStart, OpenBoot, Solaris, Solaris JumpStart, Solstice DiskSuite, Sun BluePrints, Sun Enterprise, Sun Java, Sun Enterprise NetBackup, SunPlex, Sun StorEdge, and Sun Enterprise Server 250 Internal RAID Storage Option are trademarks or registered trademarks of Sun Microsystems, Inc., in the U.S. and other countries. Netscape and the Netscape logo are trademarks or registered trademarks of Netscape Communications Corporation in the United States and in other countries. UNIX is a registered trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd. The OPEN LOOK and Sun Graphical User Interface was developed by Sun Microsystems, Inc., for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and otherwise comply with Sun's written license agreements. ORACLE is a registered trademark of Oracle Corporation. Federal Acquisitions: Commercial Software - Government Users Subject to Standard License Terms and Conditions. Export Laws: Products, Services, and technical data delivered by Sun may be subject to U.S. export controls or the trade laws of other countries.
You will comply with all such laws and obtain all licenses to export, re-export, or import as may be required after delivery to You. You will not export or re-export to entities on the most current U.S. export exclusions lists or to any country subject to U.S. embargo or terrorist controls as specified in the U.S. export laws. You will not use or provide Products, Services, or technical data for nuclear, missile, or chemical biological weaponry end uses. DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS, AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID. THIS MANUAL IS DESIGNED TO SUPPORT AN INSTRUCTOR-LED TRAINING (ILT) COURSE AND IS INTENDED TO BE USED FOR REFERENCE PURPOSES IN CONJUNCTION WITH THE ILT COURSE. THE MANUAL IS NOT A STANDALONE TRAINING TOOL. USE OF THE MANUAL FOR SELF-STUDY WITHOUT CLASS ATTENDANCE IS NOT RECOMMENDED. Export Commodity Classification Number (ECCN) assigned: 14 November 2005
Please Recycle
Table of Contents
About This Course ........ Preface-xvii
  Course Goals ........ Preface-xvii
  Course Map ........ Preface-xviii
  Topics Not Covered ........ Preface-xix
  How Prepared Are You? ........ Preface-xx
  Introductions ........ Preface-xxi
  How to Use Course Materials ........ Preface-xxii
  Conventions ........ Preface-xxiii
    Icons ........ Preface-xxiii
    Typographical Conventions ........ Preface-xxiv
    Additional Conventions ........ Preface-xxv
  Before You Begin: Course Setup ........ Preface-xxvi
    Preparation ........ Preface-xxvi
    Task 1 - Defining the Hardware and Software Components of Your Clusters ........ Preface-xxvii
    Task 2 - Verifying Installation and Configuration Information ........ Preface-xxvii
    Task 3 - Running the Setup Script on Your Cluster ........ Preface-xxviii
    Task 4 - Reviewing Cluster Architecture (PIC FROM 345 of Hardware) ........ Preface-xxviii

Upgrades in the Sun Cluster Environment ........ 1-1
  Objectives ........ 1-1
  Relevance ........ 1-2
  Additional Resources ........ 1-3
  Introduction to Upgrades in the Sun Cluster Environment ........ 1-4
  Sun Cluster Component Relationships ........ 1-5
    Upgrading the OS in the Sun Cluster Environment ........ 1-5
    Procedure for Upgrading the OS (Non-Live Upgrade) ........ 1-6
    Upgrading the Volume Manager Software in the Sun Cluster Environment ........ 1-6
    Upgrading Applications in the Sun Cluster Environment ........ 1-7
    The Upgrade Scenario for This Course ........ 1-8
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
  Ordering the Upgrades of Cluster Components ........ 1-9
  Introduction to Sun Cluster 3.2 Upgrade Strategies ........ 1-10
    Traditional Upgrades (Without Dual-Partition or Live Upgrade) ........ 1-10
    Dual-Partition Upgrades ........ 1-11
    Live Upgrade ........ 1-15
    Comparison of Upgrade Strategies: Application Downtime and Total Time to Perform the Upgrade ........ 1-17
  The Live Upgrade Process ........ 1-18
    Creating a Boot Environment ........ 1-20
    Upgrading a Boot Environment ........ 1-23
    Activating a Boot Environment ........ 1-26
    Synchronizing Files ........ 1-27
  Upgrading the VxVM Software ........ 1-28
    Removing the Previous VxVM From the Alternate Boot Environment ........ 1-28
    Using the pkgadd Utility to Install VxVM 5.0 Software ........ 1-29
    Upgrading Disk Groups ........ 1-30
  Exercise: Upgrading the Solaris OS and VxVM Software ........ 1-31
    Preparation ........ 1-31
    Task 1 - Verifying That Your Cluster Is Operating Correctly ........ 1-31
    Task 2 - Installing the Solaris 10 Live Upgrade Software and Partitioning the Target Disk ........ 1-33
    Task 3 - Creating the New Boot Environment as a Clone of the Original Root Disk ........ 1-34
    Task 4 - Removing VxVM 4.0 From the New Boot Environment ........ 1-35
    Task 5 - Upgrading to Solaris 10 OS in the New Boot Environment ........ 1-36
    Task 6 - Adding VxVM 5.0 to the New Boot Environment ........ 1-36
  Exercise Summary ........ 1-38

Upgrading the Sun Cluster Software and Completing Sun Cluster Upgrades ........ 2-1
  Objectives ........ 2-1
  Relevance ........ 2-2
  Additional Resources ........ 2-3
  Upgrading the Cluster Software (Non-Live Upgrades) ........ 2-4
    Upgrading the Shared Components (Non-Live Upgrades) ........ 2-4
    Upgrading the Sun Cluster Software Framework (Non-Live Upgrades) ........ 2-5
    Upgrading Sun-Supported Data Services ........ 2-10
  Managing Dual-Partition Upgrades (Non-Live Upgrade) Using scinstall ........ 2-14
    Applying Changes to the First Partition (Initiating the Flop-Over Menu Option 4) ........ 2-17
    Upgrading and Applying Changes to the Second Partition ........ 2-17
  Upgrading the Cluster Software (Live Upgrade) ........ 2-18
    Upgrading Java ES Shared Components (Live Upgrade) ........ 2-18
    Upgrading Sun Cluster Framework Packages (Live Upgrade) ........ 2-19
    Upgrading Sun Cluster Data Service Packages (Live Upgrade) ........ 2-19
    Booting Into the New Cluster (Live Upgrade) ........ 2-19
    Live Upgrade With Dual-Partition Rolling Reboot (Experimental) ........ 2-20
  Reviewing Sun Cluster Software Upgrade Issues (All Methods) ........ 2-21
  Examining Resource Types and Resource Upgrades (Post Cluster-Upgrade) ........ 2-22
    Identifying Resource Type Upgrade Criteria ........ 2-22
    Naming Resource Types ........ 2-22
    Performing the Resource Type Upgrade ........ 2-23
    Viewing an Example Resource Type Upgrade ........ 2-24
    Examining Resource Type Upgrade Issues ........ 2-26
  Exercise: Upgrading the Sun Cluster Software ........ 2-27
    Task 1 - Provisioning the New Boot Environment ........ 2-28
    Task 2 - Upgrading the Shared Components ........ 2-29
    Task 3 - Upgrading the Sun Cluster Software Framework ........ 2-29
    Task 4 - Upgrading Sun Cluster Software Data Services ........ 2-30
    Task 5 - Running the fixforzones Script ........ 2-30
    Task 6A - Rebooting the Cluster Nodes (Simultaneously; the Official Procedure) ........ 2-30
    Task 6B - Rebooting the Cluster Nodes (Dual-Partition Method; Experimental) ........ 2-31
    Task 7 - Upgrading Type Versions ........ 2-33
    Task 8 - Upgrading Disk Groups (VxVM Only) ........ 2-33
    Task 9 - Verifying Your Cluster Operation ........ 2-34
  Exercise Summary ........ 2-35

Advanced Data Service Configuration ........ 3-1
  Objectives ........ 3-1
  Relevance ........ 3-2
  Additional Resources ........ 3-3
  Introducing Sun Cluster 3.2 Software Data Services ........ 3-4
    Defining a Data Service ........ 3-4
    Data Service Methods and Resources ........ 3-5
    Callback Methods ........ 3-6
    Suspended Resource Groups ........ 3-8
    Resource Type Registration File ........ 3-9
  Writing Sun Cluster 3.x Software Data Services ........ 3-14
    Data Services Built Without DSDL ........ 3-15
    Data Services Built Using the DSDL ........ 3-16
    Process Monitoring Facility (PMF) ........ 3-17
    Behavior of PMF, Action Script, and Fault Monitor With DSDL ........ 3-19
    Details of a DSDL Fault Monitor ........ 3-20
    DSDL Fault Monitor Service Restarts ........ 3-22
    Fault Monitor Initiation of Group Failover ........ 3-22
    DSDL Resource Type Similarities and Variations ........ 3-23
    The Generic Data Service ........ 3-23
    Using Builders to Build a Data Service Skeleton ........ 3-25
  Controlling RGM Behavior Through Properties ........ 3-26
    Controlling Behavior Through Resource Group Properties ........ 3-26
    Advanced Control of the RGM Through Standard Resource Properties ........ 3-28
    Resource Dependencies ........ 3-31
    Cross-Group Dependencies ........ 3-33
    The Implicit_network_dependencies Group Property ........ 3-33
    Resource Group Dependencies ........ 3-34
  Advanced Resource Group Relationships ........ 3-35
    Weak Positive and Negative Affinities ........ 3-35
    Strong Positive Affinities ........ 3-36
    Strong Positive Affinity With Failover Delegation ........ 3-37
    Strong Negative Affinity ........ 3-38
    Example of Complex Affinity Relationships ........ 3-38
  Multimaster and Scalable Applications ........ 3-40
    The Desired_primaries and Maximum_primaries Properties ........ 3-40
    Controlling Load Balancing for Scalable Applications ........ 3-41
    Choosing Client Affinity ........ 3-41
    Strong Client Affinity and Weak Client Affinity ........ 3-42
    Affinity Timeout for Strong Client Affinity ........ 3-42
  Exercise 1: Creating Sun Cluster Software Data Services ........ 3-44
    Preparation ........ 3-44
    Task 1 - Creating a Wrapper Script for Your Application ........ 3-44
    Task 2 - Creating a New Resource Type ........ 3-46
    Task 3 - Installing the New Resource Type ........ 3-48
    Task 4 - Registering the New Resource Type ........ 3-48
    Task 5 - Instantiating a Resource of the New Resource Type ........ 3-48
    Task 6 - Putting the Resource Group Containing the New Resource Online ........ 3-49
    Task 7 - Testing the Fault Monitor for the New Resource Type ........ 3-49
  Exercise 2: Creating a Data Service Using GDS ........ 3-50
    Task 1 - Making a Version of an Application Wrapper Script Suitable for GDS ........ 3-50
    Task 2 - Creating and Enabling a Resource Group ........ 3-51
    Task 3 - Verifying Restart and Failover Behavior of the New Resource ........ 3-51
  Exercise 3: Advanced Resource and Resource Group Control ........ 3-53
    Task 1 - Investigating Cross-Group Dependencies and Restart Dependencies ........ 3-53
    Task 2 - Investigating Resource Group Affinities ........ 3-54
    Task 3 - Modifying a Failover Service failover_mode Property ........ 3-55
  Exercise Summary ........ 3-57

Performing Recovery and Maintenance Procedures ........ 4-1
  Objectives ........ 4-1
  Relevance ........ 4-2
  Additional Resources ........ 4-3
  Adding a Node to an Existing Cluster ........ 4-4
    Redefining a Cluster to Use Switches ........ 4-5
    Cabling the New Node ........ 4-6
    Configuring Solaris OS on the New Node ........ 4-6
    Preparing the Existing Cluster to Accept the New Node ........ 4-6
    Adding a vxio Major Number ........ 4-7
    Installing Sun Cluster Packages on the New Node ........ 4-8
    Creating Mount Points for Global File Systems ........ 4-8
    Checking the did Major Number ........ 4-8
    Running the scinstall Utility on the New Node ........ 4-9
    Managing Quorum Devices ........ 4-9
    Configuring Volume Management on the New Node ........ 4-11
    Adding a New Node to Existing Device Groups ........ 4-11
    Configuring IPMP ........ 4-13
    Preparing the New Node to Run Existing Applications ........ 4-13
    Adding the New Node to Existing Resource Groups ........ 4-14
  Removing a Node From an Existing Cluster ........ 4-15
    Switching Services Off the Node ........ 4-16
    Removing a Node From the Resource Group Nodelist ........ 4-17
    Removing a Node From the Device Group Nodelist ........ 4-17
    Rebooting the Node to Be Removed to Non-Cluster Mode ........ 4-18
    Removing Quorum Votes and Quorum Devices ........ 4-18
    Completing Node Removal (Orderly Removal) ........ 4-19
    Completing Node Removal (Dead Node) ........ 4-19
    Adding Back Quorum Devices ........ 4-19
  Replacing a Failed Node in a Cluster ........ 4-20
    Removing the Node Definition From the Cluster and Adding It Back ........ 4-20
    Using a Well-Managed Archive ........ 4-21
  Uninstalling Sun Cluster Software From a Node ........ 4-22
  Reviewing Disk Replacement Procedures ........ 4-23
    Identifying Individual Disk Drive Failures ........ 4-23
    Identifying Cable or Total Array Failures ........ 4-24
    Reviewing DID Consistency Issues ........ 4-24
    Updating the Physical Disk ID in Device Driver RAM (SCSI Disks) ........ 4-26
    Updating the Physical ID in Device Driver RAM (Fibre-Channel JBOD Disks) ........ 4-27
    Updating the DID Serial Number Information From the Device Driver RAM ........ 4-28
    Examining Disk Replacement and Mirror Fixing ........ 4-30
    Examining Failed Quorum Device Issues ........ 4-32
    Viewing an Example ........ 4-32
  Backing Up and Restoring the CCR ........ 4-34
  Exercise: Performing Maintenance and Recovery Procedures ........ 4-35
    Lab Task Order ........ 4-35
    Preparation ........ 4-35
    Task 1 - Removing a Cluster Node ........ 4-36
    Task 2 - Adding a Node to the Cluster ........ 4-39
    Task 3 - Replacing a Failed Fibre JBOD Drive ........ 4-46
    Task 4 - Replacing a Failed SCSI JBOD Drive ........ 4-48
  Exercise Summary ........ 4-52

Advanced Features (ZFS, QFS, and Zones) ........ 5-1
  Objectives ........ 5-1
  Relevance ........ 5-2
  Additional Resources ........ 5-3
  ZFS as a Failover File System Only ........ 5-4
    ZFS Includes a Volume Management Layer ........ 5-4
    ZFS Removes the Need for /etc/vfstab Entries ........ 5-4
    Example: Creating a Mirrored Pool and Some File Systems ........ 5-4
    ZFS Snapshots ........ 5-6
    HAStoragePlus and ZFS ........ 5-6
  Introducing the Features of the Sun StorEdge QFS File System ........ 5-8
    Features and Benefits of QFS ........ 5-8
    QFS Considerations for the Cluster ........ 5-10
  Configuring a Standard (Non-Shared) QFS File System ........ 5-11
    QFS File System and Component Device Types ........ 5-11
    Creating Underlying Storage Devices ........ 5-12
    Creating the Master Configuration File ........ 5-12
    Creating and Mounting the File System ........ 5-13
    Configuring Standard QFS as a Failover File System in the Cluster ........ 5-15
  Configuring a Shared QFS File System in the Cluster (for Use by Oracle RAC Only) ........ 5-17
    Shared QFS File Systems on Solaris Volume Manager Multiowner Diskset Devices ........ 5-18
    Configuring a Shared QFS File System ........ 5-19
    Creating a Shared QFS File System ........ 5-21
    Creating a Failover Sun Cluster Resource to Control the Metadata Server ........ 5-21
  Resource Group Manager Support for Non-Global Zones ........ 5-22
  Exercise 1: Running a Standard Failover Service on QFS (Optional) ........ 5-31
    Task 1 - Installing the QFS Software on Your Cluster Nodes ........ 5-31
    Task 2a - Adding a Volume on Which to Build a Failover QFS File System (With VxVM) ........ 5-32
    Task 2b - Adding a Volume on Which to Build a Failover QFS File System (With SVM) ........ 5-33
    Task 3 - Preparing a QFS File System Configuration ........ 5-33
    Task 4 - Creating, Mounting, and Switching the QFS File System ........ 5-35
    Task 5 - Migrating Your Oracle Application Data to the QFS File System ........ 5-35
    Task 6 - Rearranging Your Mount Points So That /oracle Is Mounted From the New QFS ........ 5-36
    Task 7 - Reconfiguring Your Cluster Resources to Use the New File System ........ 5-36
  Exercise 2: Configuring a Shared QFS File System (Optional) ........ 5-37
    Task 1 - Installing the QFS Software on Your Cluster Nodes (If Not Already Done in the QFS Failover Lab) ........ 5-38
    Task 2 - Installing RAC Framework Packages for Oracle RAC With SVM Multiowner Disksets ........ 5-38
    Task 3 - Installing the Oracle Distributed Lock Manager ........ 5-38
    Task 4 - Creating and Enabling the RAC Framework Resource Group ........ 5-39
    Task 5 - Adding Volumes on Which to Build a Shared QFS File System ........ 5-40
    Task 6 - Preparing a Shared QFS File System Configuration ........ 5-40
    Task 7 - Creating and Mounting the File System ........ 5-42
    Task 8 - Mounting the File System on Other Node(s) ........ 5-42
    Task 9 - Configuring the Metadata Server as a Failover Resource, and Testing Failover ........ 5-42
  Exercise 3: Running Oracle in Non-Global Zones (Optional) ........ 5-43
    Task 1 - Configuring and Installing the Zones ........ 5-43
    Task 2 - Migrating Oracle to Run in the Zone ........ 5-45
  Exercise 4: Migrating Your Oracle Data to ZFS (Optional) ........ 5-46
  Exercise Summary ........ 5-49

Best Practices ........ 6-1
  Objectives ........ 6-1
  Relevance ........ 6-2
  Additional Resources ........ 6-3
  IPMP Best Practices ........ 6-4
    Using IPMP Hardware Redundancy ........ 6-5
    Using Test Addresses or Link State Testing in Solaris 10 OS ........ 6-5
    Placing the Test IP Address on a Virtual Interface ........ 6-5
    Using the standby Keyword ........ 6-6
    Enabling Failback for IPMP Interfaces ........ 6-7
    Using the deprecated Flag on All Test Interfaces ........ 6-8
    Controlling Test Targets ........ 6-9
  Shared Storage File System Best Practices ........ 6-10
    Using Failover or Global File Systems ........ 6-10
    Configuring the /etc/vfstab File (Traditional Non-ZFS File Systems) ........ 6-11
    Using Affinity Switching ........ 6-12
    Using HAStoragePlus Resources With Scalable Services ........ 6-13
  Volume Management Software Best Practices ........ 6-15
    Managing Boot Disk Mirroring With VxVM or Solaris VM ........ 6-15
    Using VxVM Software to Mirror the Boot Disk ........ 6-15
    Using Solaris VM Software to Mirror the Boot Disk ........ 6-19
  Quorum Device Best Practices ........ 6-21
    Limiting Quorum Votes ........ 6-21
xii
www.chinaitproject.com IT QQ : 3264454 Disk Path Monitoring ............................................................. 6-24 Deciding When to Use a Quorum Server Device.............. 6-27 Best Practices for Campus Clusters ............................................... 6-29 Defining Campus Cluster Topologies................................. 6-30 Reducing the Performance Impact of Campus Clusters .. 6-32 Exercise: Using Best Practices ........................................................ 6-33 Preparation............................................................................... 6-33 Task 1 Mirroring the Boot Disk Using Solaris VM.......... 6-34 Task 2 Encapsulating and Mirroring the Boot Disk Using VxVM Software......................................................... 6-36 Task 3 Verifying IPMP Best Practices ............................... 6-37 Task 4 Implementing Quorum Device Monitoring ........ 6-38 Exercise Summary............................................................................ 6-39 Best Practices for Cluster Security ................................................7-1 Objectives ........................................................................................... 7-1 Relevance............................................................................................. 7-2 Additional Resources ........................................................................ 7-3 Using a Security Policy as a Framework for Decision Making ... 7-4 Developing a Security Policy .................................................. 7-4 Implementing a Security Policy .............................................. 7-4 Identifying Security Vulnerabilities ................................................ 7-6 Minimizing Compared to Hardening the Solaris OS Software................................................................................... 7-6 Securing the Oracle RAC Software Installation.................... 
7-6 Isolating Cluster Interconnects ............................................... 7-7 Disabling Internet Services ...................................................... 7-7 Identifying Sun Cluster 3.2 Software Services..................... 7-8 Securing Console Access.......................................................... 7-9 Securing Node Authentication During Installation............. 7-9 Using the Solaris Security Toolkit Software................................. 7-10 Introducing the Solaris Security Toolkit Software ............. 7-10 Structure of the Toolkit Software.......................................... 7-11 Executing the Toolkit Software............................................. 7-13 Undoing the Toolkit Software Security Modifications...... 7-13 Downloading and Installing Security Software .......................... 7-15 Downloading and Installing the Toolkit Software............. 7-15 Downloading Recommended and Security Patches......... 7-16 Downloading the FixModes Software (Solaris 9 OS Only)............................................................... 7-16 Downloading the MD5 Software (Solaris 9 OS Only) ....... 7-16 Implementing the Toolkit Software Modifications on a Cluster Node ......................................................................... 7-17 Providing Secure Clustered Services ............................................ 7-18 Using Secure NFS and Kerberized NFS............................... 7-18 Securing an LDAP Service..................................................... 7-19
xiii
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
www.chinaitproject.com IT QQ : 3264454 Using the Secure Apache Web Service ................................ 7-19 Using the Secure Sun Java System Web Server Software................................................................................. 7-20 Exercise: Hardening Security With the Toolkit Software .......... 7-21 Preparation............................................................................... 7-21 Task 1 Installing the Toolkit Software on the Selected Node................................................................. 7-21 Task 2 Running the suncluster3x-secure.driver Script ...................................................................................... 7-22 Task 3 Verifying That the Selected Node Is Hardened.................................................................................. 7-22 Task 4 Verifying That the Selected Node Operates Properly in the Cluster ........................................................ 7-23 Task 5 Hardening the Remaining Cluster Nodes (Optional) .............................................................................. 7-23 Task 6 Undoing the Security Modifications on Each Cluster Node.......................................................... 7-24 Exercise Summary............................................................................ 7-25 Examining Troubleshooting Tips ................................................... 8-1 Objectives ........................................................................................... 8-1 Relevance............................................................................................. 8-2 Additional Resources ........................................................................ 8-3 Defining How to Troubleshoot Clustered Services ...................... 8-4 Examining a Generic Troubleshooting Approach ............... 8-4 Triangulating the Causes of Failure ....................................... 
8-5 Defining the Sun Cluster Software Stack .............................. 8-6 Identifying Dependencies Within Layers of the Stack ........ 8-8 Deciding Where to Begin ......................................................... 8-9 Example: Troubleshooting Failure in an HA-Oracle Application ............................................................................. 8-9 Identifying Log Files for Each Layer............................................. 8-11 Identifying Application Log Files......................................... 8-11 Identifying Cluster Framework Log Files .......................... 8-14 Troubleshooting the Software ........................................................ 9-1 Objectives ........................................................................................... 9-1 Relevance............................................................................................. 9-2 Additional Resources ........................................................................ 9-3 Introducing the Troubleshooting Exercises ................................... 9-4 Troubleshooting Self-Induced Problems ............................... 9-4 Troubleshooting Instructor-Induced Problems .................... 9-4 Implementing Disaster Recovery ........................................... 9-5 Exercise 1: Inducing Problems and Observing Reactions............ 9-6 Preparation................................................................................. 9-6 Task 1 Inducing Daemon Failures....................................... 9-6 Task 2 Inducing a Full Root File System............................ 9-9
xiv
www.chinaitproject.com IT QQ : 3264454 Task 3 Setting an Incorrect maxusers Value................... 9-10 Task 4 Inducing Operator Errors ....................................... 9-10 Exercise 2: Troubleshooting Instructor-Induced Problems ....... 9-12 Preparation............................................................................... 9-12 Task 1 Troubleshooting IPMP Errors............................... 9-13 Task 2 Troubleshooting an Unknown State .................... 9-14 Task 3 Troubleshooting a Resource STOP_FAILED State....................................................................................... 9-15 Task 4 Troubleshooting Oracle Software Resource Group Errors........................................................................ 9-16 Task 5 Troubleshooting an Unbootable Cluster Node .. 9-17 Task 6 Troubleshooting oracle_server Resource Fault Monitor Errors........................................................... 9-18 Task 7 Troubleshooting the Failure to Start a Web Server ........................................................................ 9-19 Task 8 Troubleshooting iws-res Resource Failures on One Node........................................................................ 9-20 Exercise 3: Implementing Disaster Recovery............................... 9-21 Hints for Installing New OS on the Failed (New) Node.......................................................................................... 9-21 Hints for Getting the Node Back into the Cluster .............. 9-21 Exercise Summary............................................................................ 9-22 Upgrading Oracle Software ............................................................ A-1 Exercise: Oracle Software Installation and Database Upgrade........................................................................................... A-2 Preparation................................................................................ 
A-2 Task 1 Installing the New Oracle Software ....................... A-3 Task 2 Upgrading the Database.......................................... A-6 Task 3 Configuring the New Network Components...... A-8 Task 4 Changing and Enabling the Resources.................. A-9 Task 5 Verifying That the Oracle Database Upgrade Is Successful ......................................................................... A-10 Exercise Summary........................................................................... A-11 Installing and Configuring Oracle 10gR2 RAC on Shared QFS .. B-1 Exercise 3: Running Oracle 10g RAC in Sun Cluster 3.2 Software...............................................................................................B-2 Preparation................................................................................ B-3 Task 1 Shutting down Failover Oracle Instances ..............B-5 Task 2 Provisioning the Shared QFS File System..............B-5 Task 3 Configuring Oracle Virtual IPs................................ B-6 Task 4 Configuring the oracle User Environment .......... B-7 Task 5 Disabling Access Control on X Server of the Admin Workstation ..................................................................B-7 Task 6 Installing Oracle CRS Software .............................. B-8 Task 7 Installing Oracle Database Software.................... B-13
xv
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
www.chinaitproject.com IT QQ : 3264454 Task 8 Create Sun Cluster Resources to Control Oracle RAC Through CRS..................................................................B-16 Task 9 Verifying That Oracle RAC Works Properly in a Cluster ...............................................................................B-19 Exercise Summary............................................................................B-22
xvi
Preface
- Upgrade the Solaris Operating System (Solaris OS), VERITAS Volume Manager (VxVM), and ORACLE database software
- Upgrade Sun Cluster software
- Build Sun Cluster software data services and perform advanced resource and resource group management
- Perform recovery and maintenance procedures on Sun Cluster software
- Configure the ZFS, QFS, shared QFS, and zone advanced features of the cluster software
- Describe and use Sun Cluster best practices
- Describe and implement best practices for security in the Sun Cluster environment
- Use helpful tips to troubleshoot Sun Cluster software
- Perform Sun Cluster software troubleshooting exercises
Course Map
The following course map enables you to see what you have accomplished and where you are going in reference to the course goals.
Upgrading Software
- Upgrades in the Sun Cluster Environment
- Upgrading the Sun Cluster Software
Troubleshooting
- Examining Troubleshooting Tips
- Troubleshooting the Software
- VERITAS volume management: covered in ES-310: VERITAS Volume Manager Administration
- Sun Cluster 3.0 software administration: covered in ES-333: Sun Cluster 3.0 Administration
- Sun Cluster 3.1 software administration: covered in ES-338: Sun Cluster 3.1 Administration
- Sun Cluster 3.2 software administration: covered in ES-345: Sun Cluster 3.2 Administration
Refer to the Sun Educational Services catalog for specific information and registration.
- Can you perform routine system administration tasks in a Solaris OS environment?
- Can you perform routine volume management tasks in either a VxVM or Solaris Volume Manager (Solaris VM) software environment?
- Can you install and configure Sun Cluster 3.x software?
Introductions
Now that you have been introduced to the course, introduce yourself to the other students and the instructor, addressing the following items:
- Name
- Company affiliation
- Title, function, and job responsibility
- Experience related to topics presented in this course
- Reasons for enrolling in this course
- Expectations for this course
- Goals: You should be able to accomplish the goals after finishing this course and meeting all of its objectives.
- Objectives: You should be able to accomplish the objectives after completing a portion of instructional content. Objectives support goals and can support other higher-level objectives.
- Lecture: The instructor presents information specific to the objective of the module. This information helps you learn the knowledge and skills necessary to succeed with the activities.
- Activities: The activities take various forms, such as an exercise, self-check, description, and demonstration. Activities help you master an objective.
- Visual aids: The instructor might use several visual aids to convey a concept, such as a process, in a visual form. Visual aids commonly contain graphics, animation, and video.
Conventions
The following conventions are used in this course to represent various training elements and alternative learning resources.
Icons
Additional resources Indicates other references that provide additional information on the topics described in the module.
Discussion Indicates that a small-group or class discussion on the current topic is recommended at this time.
Note: Indicates additional information that can help you but is not crucial to your understanding of the concept being described. You should be able to understand the concept or complete the task without this information. Examples of notational information include keyword shortcuts and minor system adjustments.

Caution: Indicates that there is a risk of personal injury from a non-electrical hazard, or a risk of irreversible damage to data, software, or the operating system. A caution indicates that a hazard is possible (as opposed to certain), depending on the action of the user.
Typographical Conventions
Courier is used for the names of commands, files, directories, programming code, and on-screen computer output; for example:

Use ls -al to list all files.
system% You have mail.

Courier is also used to indicate programming constructs, such as class names, methods, and keywords; for example:

The getServletInfo method is used to get author information.
The java.awt.Dialog class contains the Dialog constructor.

Courier bold is used for characters and numbers that you type; for example:

To list the files in this directory, type the following:
# ls

Courier bold is also used for each line of programming code that is referenced in a textual description; for example:

1 import java.io.*;
2 import javax.servlet.*;
3 import javax.servlet.http.*;

Notice the javax.servlet interface is imported to allow access to its life cycle methods (Line 2).

Courier italic is used for variables and command-line placeholders that are replaced with a real name or value; for example:

To delete a file, use the rm filename command.

Courier italic bold is used to represent variables whose values are to be entered by the student as part of an activity; for example:

Type chmod a+rwx filename to grant read, write, and execute rights for filename to world, group, and users.

Palatino italic is used for book titles, new words or terms, or words that you want to emphasize; for example:

Read Chapter 6 in the User's Guide. These are called class options.
Additional Conventions
Java programming language examples use the following additional conventions:
Method names are not followed with parentheses unless a formal or actual parameter list is shown; for example: The doIt method... refers to any method called doIt. The doIt() method... refers to a method called doIt that takes no arguments.
Line breaks occur only where there are separations (commas), conjunctions (operators), or white space in the code. Broken code is indented four spaces under the starting code.

If a command used in the Solaris OS is different from a command used in the Microsoft Windows platform, both commands are shown; for example:

If working in the Solaris OS:
$CD SERVER_ROOT/BIN

If working in Microsoft Windows:
C:\>CD SERVER_ROOT\BIN
- Task 1: Define the hardware and software components of your clusters
- Task 2: Verify installation and configuration information
- Task 3: Run the setup script on your cluster
- Task 4: Review cluster architecture
Preparation
You will be running some scripts to set up your cluster. Each group will begin with a two-node cluster. The characteristics of the cluster are:
- Running Solaris 9 9/04 (Update 7) OS and Sun Cluster 3.1 9/04 (Update 3)
- Running HA-Oracle 9i as a failover service with a failover file system
- Running Sun Java System Web Server as a scalable (load-balanced) service
- Your group's choice of VERITAS Volume Manager (VxVM 4.0) or Solaris Volume Manager
If you have a third node you can also use the script to perform a Flash upgrade to Solaris 10, so that it is ready to be added to the cluster after the other nodes have already been upgraded in Modules 1 and 2. You can do this at your leisure.
Node Name        ora-lh IP        iws-lh IP
_______________  _______________  _______________
_______________  (Same as above)  (Same as above)
Each group needs two separate logical host addresses (ora-lh for Oracle and iws-lh for the web server). They need not yet be entries in the hosts file on your nodes; the script will add the entries. In the RLDC environment, existing entries on the vnchost may suggest which IP addresses to use. Consult your instructor.

Make sure you enter the transport adapter information in the correct order (the first one you enter on node one should be attached to the first one you enter on node two), and that you enter the same logical host addresses on each node. The script will not check for this.

Make sure you enter the same choice of volume manager on each node (the script will not check for this). For VxVM, the script will guide you through entering license information.
To verify that the root disk has been properly partitioned, type:

# df -k
# swap -l
Notes
The vnchost is intended to be your display host inside the RLDC. Each student can have a separate VNC session, used for the web browser to iws-lh and for the xclock graphics in the data services module. Note that the display number will be different for different students for the xclock material (for example, vnchost:3).
Module 1
- Describe high availability issues when performing upgrades in the Sun Cluster environment
- Describe the required relationships for upgrading the Sun Cluster software
- Describe the different upgrade strategies
- Describe and perform an upgrade of the Solaris Operating System using the Solaris Live Upgrade software
- Upgrade the VERITAS Volume Manager (VxVM) software into the Solaris Live Upgrade environment
Caution Scripts must be run to set up the initial state of your clusters. Your instructor may have done this on behalf of the entire class, or you may be launching the scripts for your own cluster. Please refer to Before You Begin: Course Setup on page xxvi of the About this Course section. These scripts must be launched before you begin this module so that your clusters are ready for the exercises at the end of the module.
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics described in this module. Although they are not expected to know the answers to these questions, the answers should interest them and inspire them to learn the material.
- What are the relationships and interdependencies among OS upgrades, cluster release upgrades, and volume manager upgrades?
- Which methods are available for upgrading the Solaris OS? Which methods give you the least downtime for your applications? Which are the safest?
- When must you upgrade the VxVM software?
Additional Resources
Additional resources The following references provide additional information on the topics described in this module:
- Man page for the live_upgrade(5) command
- Sun Microsystems, Inc. Solaris 10 Installation Guide: Solaris Live Upgrade and Upgrade Planning, part number 817-5505
- Sun Microsystems, Inc. Sun Cluster Software Installation Guide for Solaris OS, part number 819-2970
- ORACLE Corporation, ORACLE Technology Network. Installing the Oracle9i Database. 2001. [Online] Available at http://otn.oracle.com
- Symantec Software Corporation. VxVM software documentation, VERITAS Storage Foundation 5.0 Installation Guide
This module describes:

- The complicated relationships that exist in the Sun Cluster environment between cluster software versions, operating system versions, volume manager versions, and application software versions
- The variety of upgrade strategies that exist when upgrading specifically to Sun Cluster 3.2

The following are strategies for minimizing application downtime, and can be used for upgrades to Sun Cluster 3.2 from any previous revision of Sun Cluster 3.0 or Sun Cluster 3.1:
- You may be driven by a desire to upgrade the OS, but find that you must also upgrade your cluster version to support your OS. For example, you may be standardizing your enterprise on the Solaris 10 OS, and want to upgrade your cluster from Solaris 9 to Solaris 10. But in order to run Solaris 10, you have to upgrade to Sun Cluster 3.1 Update 4 (8/05) or Sun Cluster 3.2, as no lower version supports the Solaris 10 OS.
- You may be driven by a desire to upgrade the cluster version, but find that the new cluster version no longer supports your old OS. For example, you may want to upgrade to Sun Cluster 3.2. Sun Cluster 3.2 supports only Solaris 9 Update 8 and above, and Solaris 10 Update 3 and above. If you are running any other update, or any revision of Solaris 8, you will have to upgrade your OS in order to run Sun Cluster 3.2.
- You may want to do a major OS upgrade (Solaris 8 to 9, for example), and have no intention of upgrading the cluster version. However, any major OS upgrade (not just an update revision upgrade) implies that you have to do a cluster framework upgrade, even if you are staying on the same update of the cluster. Sun Cluster 3.1 Update 3 for Solaris 8, for example, is different cluster framework software than Sun Cluster 3.1 Update 3 for the Solaris 9 OS.
- Considering the previous item, any major OS upgrade must be done before the corresponding Sun Cluster framework upgrade.
- If you are upgrading to Sun Cluster 3.2, the dual-partition upgrade and live upgrade features let you perform your entire upgrade, including OS upgrades, with very little downtime.
You need to comment out any global file systems from the vfstab file before the upgrade, except for the /global/.devices/node@# file system (you can leave that one). Since you are also upgrading the cluster after the OS upgrade, you will need to boot your new OS into non-clustered mode the very first time. Therefore, make sure you choose not to let the upgrade procedure reboot for you after the upgrade, so that you can boot the new OS manually into non-clustered mode.
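The vfstab edit can be scripted. The sketch below (the device paths and mount points are invented examples, and the edit is shown on a scratch copy rather than on the live /etc/vfstab) comments out every global mount except the /global/.devices/node@# entry:

```shell
# Demonstrate on a scratch copy; on a real node you would edit /etc/vfstab
# after saving a backup to restore when the upgrade is complete.
tmp=$(mktemp -d)
cat > "$tmp/vfstab" <<'EOF'
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
/dev/md/dsk/d30 /dev/md/rdsk/d30 /global/.devices/node@1 ufs 2 no global
/dev/md/webdg/dsk/d100 /dev/md/webdg/rdsk/d100 /global/web ufs 2 yes global,logging
EOF

# Comment out /global mounts, but leave the /global/.devices/node@# line alone.
awk '/\/global\// && $0 !~ /\/global\/\.devices\/node@/ { print "#" $0; next }
     { print }' "$tmp/vfstab" > "$tmp/vfstab.new"

cat "$tmp/vfstab.new"
```

After the cluster upgrade is complete and the nodes are booted back into cluster mode, restore the saved copy so that the global mounts return.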
You may need to upgrade your VxVM version because you want to upgrade your OS. For example, if you want to upgrade to the Solaris 10 OS, you are required to upgrade to a minimum of VxVM 4.1.
You may need to upgrade VxVM because you want to upgrade the cluster revision. For example, you may be running Solaris 9 with VxVM 4.0, which is supported in Sun Cluster 3.1. But if you want to upgrade to Sun Cluster 3.2, this revision of VxVM is not supported. If you want to do a major OS upgrade (Solaris 9 to Solaris 10, for example), and do not intend to upgrade VxVM versions, you might still need to remove the VxVM packages and add them back on the new OS, so that the correct drivers get loaded.
There is a strong relationship between the OS version and the versions of applications supported (for example, Solaris 10 no longer supports Oracle Server 8i). This will affect your decision about whether to upgrade the OS. If you are not upgrading your OS but only your cluster revision, your existing applications should, in general, still be supported by the new cluster agents, but you need to check to make sure. It is conceivable that the dual-partition feature could even let you accomplish an application upgrade with very brief downtime, but only if both of the following are true:
- You installed separate application binaries locally on each node, so that you can upgrade one while the other is still running.
- The data itself can run equally well under the old and new software revisions (that is, you do not need to upgrade the data).
While the procedure to upgrade the actual application software may be the same inside and outside the cluster, you may have to update properties of your cluster resources to get the new version running in the cluster. For example, many cluster resources have properties that point to the directory that contains application binaries or configuration files.
Starting configuration:

- Solaris 9 OS 9/04 (Update 7)
- Sun Cluster 3.1 9/04 (Update 3)
- Optionally, VxVM 4.0 (or you could have chosen Solaris VM)
- Running Oracle Server 9i as a failover service
- Running Sun Java System Web Server 6.1 as a scalable service

Target configuration:

- Solaris 10 11/06 (Update 3)
- Sun Cluster 3.2
- VxVM 5.0 (if that is your volume manager of choice)
You will have to balance the goals of minimizing application downtime and minimizing the total time it takes to do the upgrade. We will take advantage of the Live Upgrade software's ability to upgrade all nodes simultaneously, while keeping the original cluster completely operational during the upgrade. If you want to experimentally combine the dual-partition upgrade and Live Upgrade in this lab, you can also experience the minimum possible application downtime. You may choose to upgrade Oracle to Oracle 10gR2. The procedure for this is in an optional lab in Appendix A.
Note: The initial setup has the Oracle binaries placed in the shared storage. Therefore, you will not be able to minimize downtime related to the upgrade of the Oracle software itself. If you choose to upgrade Oracle, you will have to take the whole application offline for the duration of the Oracle upgrade.
Application upgrade
- OS upgrades are made in the traditional way (booted from the new OS medium).
- Cluster software upgrades are made afterwards, and you must boot the new OS into non-cluster mode to do them.
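On SPARC hardware, booting a node into non-cluster mode is done from the OpenBoot PROM; as a minimal sketch:

```
ok boot -x
```

The -x option boots the node outside the cluster, so that the cluster framework upgrade can be performed.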
It would seem that you could leave some nodes up running the old cluster and OS version, as you perform OS and cluster upgrades on some other nodes. This assumption would be correct.

It would also seem, in the traditional upgrade, that you could take down the first set of nodes, do the upgrade, take down the second set of nodes, and immediately boot the first set into the new upgraded cluster environment. This assumption would be incorrect, because you would still be bound by the regular cluster rules that prevent cluster amnesia. In other words, you cannot boot any nodes into the new cluster environment until they are all upgraded.

Upgrading some nodes first may still be more robust: if something goes wrong, you still have your application running on the old nodes. But it will not reduce the total downtime. The downtime required for the traditional upgrade is the time to completely upgrade the entire OS and cluster software.
- You can use the dual-partition mechanism for all upgrades from any version of Sun Cluster 3.0 or 3.1 to Sun Cluster 3.2.
- You can use the dual-partition mechanism regardless of whether or not you are also upgrading your OS.
Figure 1-1
Introduction to Sun Cluster 3.2 Upgrade Strategies

The main difference between rolling and dual-partition upgrades lies in the transition of applications. In a rolling upgrade, as demonstrated in Figure 1-2, the nodes in the first partition join the cluster with the nodes of the second partition. Normal application switchovers can then occur, driven by commands issued manually:
Figure 1-2
In the dual-partition upgrade, the nodes of the first partition boot the new software but never join the cluster with the second-partition nodes, as illustrated in Figure 1-3:
Figure 1-3
The rest of the upgrade is very similar in the two scenarios. The second partition is upgraded in non-cluster mode, and then the nodes can join the cluster to complete the cluster upgrade.
Live Upgrade
Live Upgrade software is a way to clone your entire boot environment and then apply upgrades to the clone only, while your original software versions continue to run on your original, non-upgraded boot disks. This is illustrated in Figure 1-4:
Figure 1-4
Beginning with upgrades to Sun Cluster 3.2, from any previous version of Sun Cluster 3.0 or 3.1, the live upgrade mechanism can support the upgrade of any or all of the following components directly onto the new boot disk, while the old software versions are still running:
- Solaris OS
- VERITAS Volume Manager
- Sun Cluster framework and data services
The Live Upgrade strategy has the following advantages over any other upgrade strategy:
- It is the only upgrade mechanism that both minimizes application downtime and lets all of the nodes be upgraded simultaneously. You may, for example, have a restricted window for completing a cluster upgrade in its entirety. If you use the dual-partition strategy without the live upgrade strategy, you can minimize application downtime, but you cannot minimize the amount of time it takes to upgrade the entire cluster.
- It is the only upgrade mechanism that makes it completely trivial to back out of the upgrade and restore the original operations on the entire cluster. Since you never upgrade the original boot disk at all, you can always go back, boot the original boot disk, and start over again.
Caution: If you have completed the entire cluster upgrade, and have upgraded VxVM disk group version numbers, then you may not be able to return to your original cluster running an older version of VxVM.

For the original release of Sun Cluster 3.2, the Live Upgrade and dual-partition upgrade mechanisms are not supported together. You might think that if you were doing a live upgrade there would be no need for a dual-partition upgrade, since you can complete the actual upgrading on all nodes while still running the original software versions, and then reboot all nodes into the cluster. However, the amount of application downtime required to reboot all nodes, especially when upgrading to the Solaris 10 OS, can be significantly more than is required in the dual-partition upgrade strategy.
Comparison of Upgrade Strategies: Application Downtime and Total Time to Perform the Upgrade
The following table summarizes the course author's experience with the different upgrade strategies, comparing application downtime against the total time to perform the upgrade. The cluster being upgraded is a two-node cluster running the exact same scenario as is presented in this course. The recorded application downtime ranged from about two minutes to about three hours, depending on the strategy. The strategies compared are:

Traditional (upgrade some nodes first, leaving other nodes up in the cluster; safer)
Traditional (upgrade all nodes at the same time; faster)
Dual partition
Live Upgrade (reboot all nodes at once into the new cluster)
Live Upgrade (using dual partition to achieve a rolling reboot)
Boot Environments
The Solaris Live Upgrade software feature of the Solaris OS enables you to maintain multiple operating system images of a single system. An image, or boot environment, represents a set of operating system and application software packages. Different BEs might contain different operating system and application versions. A single system can have multiple BEs, but only one of them is the active BE; all other BEs are inactive.
Software Installation
You must install the Solaris Live Upgrade software from the OS version that you will upgrade to in order to get full upgrade functionality from Live Upgrade. The Solaris 10 distribution contains an installer called liveupgrade20 in the Solaris_10/Tools/Installers directory. It installs the requisite SUNWluu and SUNWlur packages. Solaris 10 also has a SUNWluzone package, used for upgrades of systems with zones starting from Solaris 10. It does no harm to install this package, even if you do not use it.
Command Summary
Table 1-1 describes some important Solaris Live Upgrade software commands. A complete list is described in the live_upgrade(5) man page.

Table 1-1 Solaris Live Upgrade Software Commands

Command       Purpose
lu            Access the Forms and Menu Language Interpreter (FMLI)-based interface for creating and administering BEs.
luactivate    Define or display which BE to use at the next reboot.
lucancel      Cancel a job scheduled with the FMLI-based interface.
Table 1-1 Solaris Live Upgrade Software Commands (Continued)

Command       Purpose
lucompare     Compare files in two BEs, or compare files in a BE with a previously taken compare database (a list of files, sizes, and checksums).
lucreate      Create a BE. The command can either create a BE and populate it by cloning the active boot environment (what you would do if you intend to continue by upgrading the new BE with luupgrade), or create a BE that has only empty, fresh file systems (what you would do if you intend to lay down a pre-existing Flash image on the new BE with luupgrade).
lucurr        Display the name of the active BE.
ludelete      Delete a BE.
lufslist      List the file systems within a BE.
lumake        Recreate a BE based on the active BE.
lurename      Rename a BE.
lumount       Mount the file systems of a non-active BE.
luumount      Unmount the file systems of a non-active BE.
lustatus      Report the status of all BEs present on a system.
luupgrade     Upgrade the OS on a BE, or lay down a Flash image on the BE (whatever was on the BE previously is lost; this is not really an upgrade at all).
The node is booted in the cluster. The root disk has only /, swap, and /global/.devices/node@X partitions.
If your current root disk is VxVM-encapsulated, your new boot environment will not be encapsulated. When the new boot environment is activated and booted, you still have access to your original root volumes from the original BE. You can then choose whether to delete these volumes and encapsulate the new root disk.
The /etc/vfstab file might contain an entry for a failover (non-global) file system on a node not currently mounting the file system. The Live Upgrade software complains about such an entry and refuses to create the new BE. The solution is to comment out any such entry before running the Live Upgrade software (only on nodes that are not currently mounting the failover file system).
The Live Upgrade software will not copy the contents of a file system mounted with the keyword global to the new boot disk. The Live Upgrade software treats all file systems with the global option as cluster data file systems and does not copy their contents (this is the correct behavior for real data file systems). The Live Upgrade software is also unhappy with DID device names.
The solution for both issues is to edit the /etc/vfstab file before running the Live Upgrade software: remove the global option and replace the DID device names with traditional c#t#d#s# names. The following shows fragments of a vfstab file edited so that Live Upgrade runs correctly. The fragment is from a node not currently mounting the failover file system /oracle. These lines are from the original file:
/dev/did/dsk/d6s3 /dev/did/rdsk/d6s3 /global/.devices/node@1 ufs 2 no global /dev/md/orads/dsk/d100 /dev/md/orads/rdsk/d100 /oracle ufs 2 no -
These are the changes you need to make for Live Upgrade to run properly (save your original file for easy restoration):
/dev/dsk/c0t0d0s3 /dev/rdsk/c0t0d0s3 /global/.devices/node@1 ufs 2 no #/dev/md/orads/dsk/d100 /dev/md/orads/rdsk/d100 /oracle ufs 2 no -
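For illustration only, the manual edits above can be scripted. This sketch works on a copy in /tmp rather than the live /etc/vfstab, and assumes that DID device d6 maps to c0t0d0 (an assumption; verify the mapping on your own cluster, for example with scdidadm -l):

```shell
# Hypothetical vfstab fragment on a node NOT mounting /oracle.
cat > /tmp/vfstab <<'EOF'
/dev/did/dsk/d6s3 /dev/did/rdsk/d6s3 /global/.devices/node@1 ufs 2 no global
/dev/md/orads/dsk/d100 /dev/md/orads/rdsk/d100 /oracle ufs 2 no -
EOF

# Keep the original for easy restoration after the upgrade.
cp /tmp/vfstab /tmp/vfstab.preLU

# 1) Replace the DID device names with c#t#d#s# names (c0t0d0 assumed).
# 2) Replace the trailing "global" mount option with "-".
# 3) Comment out the line for the failover file system /oracle.
sed -e 's|/dev/did/dsk/d6s3|/dev/dsk/c0t0d0s3|' \
    -e 's|/dev/did/rdsk/d6s3|/dev/rdsk/c0t0d0s3|' \
    -e 's/global$/-/' \
    -e 's|^.*/oracle .*|#&|' /tmp/vfstab.preLU > /tmp/vfstab
cat /tmp/vfstab
```

The result matches the edited fragment shown above; restoring the original is just a copy back from the .preLU file.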
Example BE Creation
Perform the following:
1. Determine the number and size of the file systems in the current BE:
# df -k
# cat /etc/vfstab
2. Verify that the correct number of target slices are created and properly sized. If your target disk has exactly the same geometry as your original root disk (which is easiest), you can follow one of these strategies:
If the original root disk is VxVM-encapsulated, you can retrieve the /etc/vx/reconfig.d/disk.d/c#t#d#/vtoc file, which has the original Volume Table of Contents (VTOC) from your pre-encapsulated root disk. You can then apply this partitioning to the target disk.
If the original root disk is not VxVM-encapsulated, you can just copy its partition table to the new disk.
3. Create the new boot environment on the target disk:
Note In this example, c0t1d0 is the target disk. The command entry is identical whether or not the original root disk is VxVM-encapsulated. The example shows in bold some output that is specific to a VxVM-encapsulated disk. The -c option names the current boot environment (this creates configuration entries for the original root configuration in the Live Upgrade software), and the -n option names the new boot environment.

# lucreate -c s9be \
-n s10be \
-m /:/dev/dsk/c0t1d0s0:ufs \
-m -:/dev/dsk/c0t1d0s1:swap \
-m /global/.devices/node@1:/dev/dsk/c0t1d0s3:ufs
Discovering physical storage devices
Discovering logical storage devices
Cross referencing storage devices with boot environment configurations
Determining types of file systems supported
Validating file system requests
Preparing logical storage devices
Preparing physical storage devices
Configuring physical storage devices
Configuring logical storage devices
Analyzing system configuration.
No name for current boot environment.
Current boot environment is named <s9be>.
Creating initial configuration for primary boot environment <s9be>.
WARNING: The device </dev/vx/dsk/bootdg/rootvol> for the root file system mount point </> is not a physical device.
WARNING: The system boot prom identifies the physical device </dev/dsk/c0t0d0s0> as the system boot device. Is the physical device </dev/dsk/c0t0d0s0> the boot device for the logical device </dev/vx/dsk/bootdg/rootvol>? (yes or no) yes
INFORMATION: Assuming the boot device </dev/dsk/c0t0d0s0> obtained from the system boot prom is the physical boot device for logical device </dev/vx/dsk/bootdg/rootvol>.
The device </dev/dsk/c0t0d0s0> is not a root device for any boot environment.
PBE configuration successful: PBE name <s9be> PBE Boot Device </dev/dsk/c0t0d0s0>.
Comparing source boot environment <s9be> file systems with the file system(s) you specified for the new boot environment.
Determining which file systems should be in the new boot environment.
The file system is not mounted for the currently running BE
The file system  is not mounted for the currently running BE
Updating boot environment description database on all BEs.
Searching /dev for possible boot environment filesystem devices
Updating system configuration files.
The device </dev/dsk/c0t1d0s0> is not a root device for any boot environment.
Creating configuration for boot environment <s10be>.
Source boot environment is <s9be>.
The file system  is not mounted for the currently running BE
Creating boot environment <s10be>.
Creating file systems on boot environment <s10be>.
Creating <ufs> file system for </> on </dev/dsk/c0t1d0s0>.
Creating <ufs> file system for </global/.devices/node@1> on </dev/dsk/c0t1d0s3>.
Mounting file systems for boot environment <s10be>.
Calculating required sizes of file systems for boot environment <s10be>.
Populating file systems on boot environment <s10be>.
Checking selection integrity.
Integrity check OK.
Populating contents of mount point </>.
Populating contents of mount point </global/.devices/node@1>.
Copying.
.
# lustatus
BE_name   Complete   Active   ActiveOnReboot   CopyStatus
---------------------------------------------------------
s9be      yes        yes      yes
s10be     yes        no       no

2. Make sure that the BE is not currently mounted, and unmount it if necessary. If the alternate BE is mounted, the mount point for its root file system will be /.alt.be-name, for example /.alt.s10be. One reason it might be mounted is that there was some manipulation you had to perform by hand, such as removing an older version of VxVM (discussed later in this module).
# df -k
# luumount s10be
3. Run the luupgrade utility to upgrade the inactive BE. The amount of time varies greatly according to the horsepower of your system. On the course developer's system (2 x 1.5 gigahertz (GHz) CPUs, 6 gigabytes (GB) RAM), it takes about two hours. The course developer has seen slower systems where it took over nine hours.
Note The Solaris OS image identified by the -s option must be the directory containing the .cdtoc, .install_config, .slicemapfile, and .volume.inf files. This directory then contains the Solaris_n directory.

# luupgrade -u -n s10be -s /net/server/sol10u3sparc
Validating the contents of the media </net/clustergw/sol10u3sparc>.
The media is a standard Solaris media.
The media contains an operating system upgrade image.
The media contains <Solaris> version <10>.
Constructing upgrade profile to use.
Locating the operating system upgrade program.
Checking for existence of previously scheduled Live Upgrade requests.
Creating upgrade profile for BE <s10be>.
Determining packages to install or upgrade for BE <s10be>.
Performing the operating system upgrade of the BE <s10be>.
CAUTION: Interrupting this process may leave the boot environment unstable or unbootable.
Upgrading Solaris: 100% completed
Installation of the packages from the media is complete.
Updating package information on boot environment <s10be>.
Package information successfully updated on boot environment <s10be>.
Adding operating system patches to the BE <s10be>.
The operating system patch installation is complete.
INFORMATION: The file </var/sadm/system/logs/upgrade_log> on boot environment <s10be> contains a log of the upgrade operation.
INFORMATION: The file </var/sadm/system/data/upgrade_cleanup> on boot environment <s10be> contains a log of cleanup operations required.
WARNING: <3> packages failed to install properly on boot environment <s10be>.
INFORMATION: The file </var/sadm/system/data/upgrade_failed_pkgadds> on boot environment <s10be> contains a list of packages that failed to upgrade or install properly.
INFORMATION: Review the files listed above. Remember that all of the files are located on boot environment <s10be>. Before you activate boot environment <s10be>, determine if any additional system maintenance is required or if additional media of the software distribution must be installed.
The Solaris upgrade of the boot environment <s10be> is partially complete.
Note This upgrade displays what appear to be pkgadd errors. The reason is that some of the Sun Cluster auxiliary components that were installed on top of the original Solaris 9 are now part of the base OS in Solaris 10; however, the versions already in the Solaris 9 cluster are newer than the ones Solaris 10 was trying to install. Therefore, there is no problem. You can see this detail in the /var/sadm/system/data/upgrade_failed_pkgadds file of the upgraded boot environment (not on the original root disk).
4. Check the status of the upgrade:
# lustatus s10be
# luactivate s10be WARNING: <3> packages failed to install properly on boot environment <s10be>. INFORMATION: </var/sadm/system/data/upgrade_failed_pkgadds> on boot environment <s10be> contains a list of packages that failed to upgrade or install properly. Review the file before you reboot the system to determine if any additional system maintenance is required.
********************************************************************** The target boot environment has been activated. It will be used when you reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You MUST USE either the init or the shutdown command when you reboot. If you do not use either init or shutdown, the system will not boot using the target BE. ********************************************************************** In case of a failure while booting to the target BE, the following process needs to be followed to fallback to the currently working boot environment: 1. Enter the PROM monitor (ok prompt). 2. Change the boot device back to the original boot environment by typing: setenv boot-device disk:a
3. Boot to the original boot environment by typing: boot ********************************************************************** Activation of boot environment <s10be> successful.
3. Verify that the desired BE is the one active for the next reboot:
# luactivate s10be
4. Bring the system down:
# init 0
The boot-device will automatically be changed so that the next time you boot, the system boots from the new boot environment.
The luactivate command makes it clear that you must use either the init or /usr/sbin/shutdown command to reboot the system with the newly activated BE. This ensures that the /etc/rc0.d/K62lu shutdown script runs. If booted in the cluster, you can use the /usr/cluster/bin/scshutdown command because it calls the /sbin/rc0 script.
Synchronizing Files
The first time you boot a newly created boot environment, the Solaris Live Upgrade software synchronizes the files defined in the /etc/lu/synclist file, using the corresponding files from the boot environment that was last active as the source. You can add and remove entries in the synclist file to control which files are synchronized to the new BE when it is first booted.

Note At the time of writing of this course, this feature does not work when using the OVERWRITE keyword to synchronize custom files and directories. It is supposed to be able to synchronize entire directory trees.
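For illustration, each synclist line pairs a file or directory with an action keyword such as OVERWRITE or APPEND. The entries below are examples only, not the shipped defaults; check the /etc/lu/synclist on your own system:

```
/etc/passwd             OVERWRITE
/etc/shadow             OVERWRITE
/var/mail               OVERWRITE
```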
The first three steps can proceed while your original boot environment is up and running the original OS, the original VxVM version, the original cluster framework, and the clustered applications. When upgrading to VxVM 5.0, you do not require a new license to perform the upgrade, unless your VxVM 4.x license has expired. VERITAS does not support direct upgrades to VxVM 5.0 from any release of VxVM earlier than 4.0. If you have an older VxVM release, you must first upgrade to version 4.0, and then to 5.0.
As the VxVM package is added, you are asked whether you want to restore the previous configuration, which was automatically preserved for you. You should answer yes to restore the configuration, as in the following example:
At the following prompt:
- Enter "y" if you are upgrading VxVM and want to use the existing VxVM configuration. Do not run vxinstall after installation.
- Enter "n" if this is a new installation. All disks will still have old configuration. You will need to run vxinstall and then vxdiskadm after installation to initialize and configure VxVM.
Restore and reuse the old VxVM configuration [y,n,q,?] (default: y): y
Task 1 Verify that your cluster is operating correctly
Task 2 Install the Solaris 10 Live Upgrade software and partition the target disk
Task 3 Create a new boot environment as a clone of the original root disk
Task 4 Remove VxVM 4.0 from the new boot environment (if you are using VxVM)
Task 5 Upgrade the boot environment to Solaris 10 OS
Task 6 Add VxVM 5.0 into the new boot environment (if you are using VxVM)
Preparation
You must identify the local disk where the new Live Upgrade boot environment will be created. This drive must be local, must not be a VxVM disk, and should have the identical size and geometry of the current boot disk. Use the format command to identify your local disks, and use the vxdisk -o alldgs list command to verify that they are not VxVM disks (if you are using VxVM). You also need to know the node ID of each node during this exercise. Run the following command on both cluster nodes to determine each node ID:
# clinfo -n
2. Verify that a failover resource group for Oracle, a scalable resource group for Sun Java System Web Server (iws), and a failover resource group for the scalable service load balancer (lb-rg) are configured:
# scstat -g
3. Verify that you can access the Oracle database. A small /oracli client environment has been installed on each node. You should be able to act as an Oracle client from either node, regardless of the node on which the failover service is running.
# ksh
# cd /oracli
# ls
clienv oraInventory/ product/
# . ./clienv
# which sqlplus
# sqlplus SYS@MYORA as sysdba
Enter password: cangetin
SQL> select * from mytable;
SQL> insert into mytable values ('yourname', age);
SQL> commit;
SQL> select * from mytable;
SQL> quit
4. On your administrative workstation or display station, edit /etc/hosts and add the IP address that is known on the nodes as iws-lh.
Note You can call this iws-lh on the admin workstation if you are not in a shared admin station environment. In an RLDC environment with a shared display server, the logical names should already be entered. Consult your instructor.

5. Invoke your web browser on your display station. Check your proxy settings; if you are using a proxy to get to the Internet, set a proxy exception for the name you entered in the previous step.
6. Navigate to http://iws-lh-name/cgi-bin/test-iws.cgi.
7. Click the reload or refresh button several times to verify that you are receiving responses from both nodes.
Task 2 Installing the Solaris 10 Live Upgrade Software and Partitioning the Target Disk
Perform the following steps on all of your cluster nodes:
1. Use the installer provided as part of the Solaris 10 distribution to install the new Live Upgrade packages:
# cd Sol10_distr_dir/Solaris_10/Tools/Installers
# ./liveupgrade20 -nodisplay
2. Accept the license and use the Typical option to install all the packages. This is a very quick install.
3. If the original root disk is VxVM-encapsulated, partition the target disk as follows. (If you are not using VxVM, or your root disk is not VxVM-encapsulated, skip this step and do step 4 instead.) Perform the following on all the nodes in the cluster:
a. Determine the device being used as the current boot disk:
# ls /etc/vx/reconfig.d/disk.d
b. Copy the vtoc file that contains the partition table for the current boot disk before it was encapsulated:
# cp /etc/vx/reconfig.d/disk.d/boot-disk/vtoc /tmp
c. Edit the vtoc file (# vi /tmp/vtoc):
1. Delete the first two comment lines.
2. Remove all instances of the string 0x from the second column.
3. Remove all instances of the string 0x2 from the third column (in total, you are removing two characters from the second column and three characters from the third column).
4. Save the file.
d. Create partitions on the target disk according to the vtoc file. Be careful that you format the target disk, not the current boot disk:
# fmthard -s /tmp/vtoc /dev/rdsk/c#t#d#s2
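The manual edits in step c are simple text surgery, so they can also be scripted. This is a sketch against a hypothetical vtoc file (the slice values below are invented for illustration; your real file's contents will differ, so always inspect it first):

```shell
# Hypothetical sample of a saved vtoc file: two comment lines,
# then slice / tag / flags / start / size columns.
cat > /tmp/vtoc <<'EOF'
#THE PARTITIONING OF /dev/rdsk/c0t0d0s2
#SLICE TAG FLAGS START SIZE
0 0x2 0x200 0 4194828
1 0x3 0x201 4194828 4194828
EOF

# Step c as a script: drop the two comment lines, strip the 0x
# prefix from column 2 and the 0x2 prefix from column 3, leaving
# a datafile in the form fmthard -s expects.
sed '1,2d' /tmp/vtoc | \
    awk '{ sub(/^0x/, "", $2); sub(/^0x2/, "", $3); print }' \
    > /tmp/vtoc.fmthard
cat /tmp/vtoc.fmthard
```

The resulting /tmp/vtoc.fmthard would then be fed to fmthard exactly as in step d.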
4. If the original root disk is not VxVM-encapsulated, partition the target disk as follows (do this step instead of step 3). Copy the partitioning from the original root disk to the new disk. In this example, cAtAdA is the original disk and cBtBdB is the new disk:
# prtvtoc /dev/rdsk/cAtAdAs2 | fmthard -s - \
/dev/rdsk/cBtBdBs2
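Because fmthard writes destructively, a small guard around this copy is cheap insurance. This sketch uses the placeholder disk names from the exercise and only echoes the command rather than running it:

```shell
# Placeholder disk names from the exercise; substitute your real devices.
SRC=cAtAdA   # original root disk
DST=cBtBdB   # target disk

# Refuse to proceed if both variables name the same disk:
# running fmthard against the current boot disk would destroy it.
if [ "$SRC" = "$DST" ]; then
    echo "source and target are the same disk; aborting" >&2
    exit 1
fi

# Build the copy command; on a real node you would execute it
# instead of just printing it.
CMD="prtvtoc /dev/rdsk/${SRC}s2 | fmthard -s - /dev/rdsk/${DST}s2"
echo "$CMD"
```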
Task 3 Creating the New Boot Environment as a Clone of the Original Root Disk
Perform the following steps on all cluster nodes. Do not wait for the lucreate command to complete on one node before starting the next, but heed the warning that the commands are not identical on each node.
1. Verify that the target disk was partitioned correctly:
# format c#t#d#
The tasks in this exercise assume that the boot disk has a slice 0 for the root file system, a slice 1 for swap, a slice 3 for global devices, and a slice 7 for Solaris Volume Manager software replicas.
2. If you are using Solaris Volume Manager (only), add metadevice database replicas on slice 7 of the new disk, and verify that you have three copies on each disk:
# metadb -a -c 3 c#t#d#s7
# metadb -i
3. Save a copy of the original vfstab on each node:
# cp /etc/vfstab /etc/vfstab.preLU
4. Edit the /etc/vfstab file:
a. If you are using SVM: on the line for the /global/.devices/node@# file system, change /dev/did/dsk/d#s3 and /dev/did/rdsk/d#s3 to /dev/dsk/c#t#d#s3 and /dev/rdsk/c#t#d#s3, using the same c#t#d# as the root disk.
b. Regardless of your volume manager, replace the word global in the seventh field of the line for /global/.devices/node@# with a minus sign (-).
c. Only on the node not mounting /oracle (check carefully), comment out the line for /oracle.
5. Create a clone of the current boot environment on the target disk, using the node ID as the value of the variable X:

Warning This command is not identical on each node of the cluster. The node@X will differ, and you might also have different target disks. Be careful: c#t#d# identifies the target disk. If you are using VxVM, confirm when the command asks about the identity of your underlying root drive.

# lucreate -c s9be -n s10be \
-m /:/dev/dsk/c#t#d#s0:ufs \
-m -:/dev/dsk/c#t#d#s1:swap \
-m /global/.devices/node@X:/dev/dsk/c#t#d#s3:ufs
Note The lucreate command can take approximately 10-20 minutes to complete.
Note The upgrade to Solaris 10 OS can take two to seven hours to complete, depending on the speed of your hardware. Your lecture will probably be continuing at this point, and you will be continuing the lab later. Near the end of the upgrade you will see information and warnings about some failed pkgadds, as discussed in the body of the module. The cluster installation had some versions of packages that were newer than those bundled in the base release of Solaris 10. You can ignore these warnings.
# cp VRTSvlic.tar.gz VRTSvxvm.tar.gz \
VRTSvmman.tar.gz /var/tmp
# cd /var/tmp
# gzcat VRTSvlic.tar.gz | tar xf -
# gzcat VRTSvxvm.tar.gz | tar xf -
# gzcat VRTSvmman.tar.gz | tar xf -
3. Add the new VxVM packages. Answer yes when you are asked about using the saved configuration:
# pkgadd -d /var/tmp -R /.alt.s10be \
VRTSvlic VRTSvxvm VRTSvmman
4. Add any VxVM 5.0 patches directly into the new boot environment:
# cd veritas_50_patch_dir
# patchadd -R /.alt.s10be xxxxx-yy
Note At the time of writing this course, there are no VxVM patches required by the course itself.
5. Unmount the new boot environment:
# luumount s10be
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
Module 2
Upgrading the Sun Cluster Software and Completing Sun Cluster Upgrades
Objectives
Upon completion of this module, you should be able to do the following:
Upgrade Sun Cluster software when not using Live Upgrade Use the scinstall options that control the dual-partitioned upgrade method, when not using Live Upgrade Use Live Upgrade to upgrade the Sun Cluster software Upgrade resource types and resources
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics described in this module. Although they are not expected to know the answers to these questions, the answers should interest them and inspire them to learn the material.
Can functionality already present in commands like pkgadd and patchadd install software directly into a new boot environment that is created with the Live Upgrade Software? Is the time required for a reboot the minimum that you could possibly expect your applications to be down during an upgrade? Or could you achieve even less downtime than that in the cluster environment? What would happen if you never upgraded from the Sun Cluster 3.1 resource type versions after an upgrade to Sun Cluster 3.2? Should existing resources still work? Always? What might break in the future?
Additional Resources
The following references provide additional information on the topics described in this module:
Sun Microsystems, Inc. Sun Cluster Software Installation Guide For Solaris OS, part number 819-2970. Sun Microsystems, Inc. Sun Cluster Software Administration Guide For Solaris OS, part number 819-2971.
Note You use all of these same procedures whether or not you are taking advantage of the dual-partition upgrade feature. If you are using the dual-partition upgrade feature, you use the new scinstall dual-partition management options, discussed in the next major section of this module, to manage the transitions between the partitions. When the nodes of each partition are booted in non-cluster mode, you use the normal OS upgrade procedures to upgrade the OS, and these procedures to upgrade the cluster components.
Upgrading the Cluster Software (Non-Live Upgrades)

When you do upgrades of the Sun Cluster framework, you must use the Java ES installer to upgrade the shared components first. For example, if you use the graphical installer, you can choose just the shared components. The installer indicates that the Sun Cluster software itself cannot be upgraded by the Java ES installer (you have to do that afterward).

Note The Java DB software is another required component that is listed separately from the other shared components. Make sure you select the Java DB software when upgrading the shared components.
Figure 2-1 shows the flow of the scinstall -u update framework upgrade: the user runs the command, which saves the cluster configuration, removes the old packages (pkgrm), adds the new packages (pkgadd), and then restores the configuration.

Figure 2-1  The scinstall -u update Process
Sun Cluster 3.0 software uses Public Network Management (PNM) network adapter failover (NAFO) groups for public network fault monitoring. You must convert this PNM configuration to an IPMP configuration for the Sun Cluster 3.1 or Sun Cluster 3.2 software. The scinstall utility performs this conversion when you upgrade the framework. If you are upgrading from Sun Cluster 3.1, it is assumed that IPMP is already configured.
*** Upgrade Menu ***
Please select from any one of the following options:
1) Upgrade Sun Cluster framework on this node
2) Upgrade Sun Cluster data service agents on this node
3) Upgrade Sun Cluster Support for Oracle RAC on this node
?) Help with menu options
q) Return to the Main Menu
Option: 1
The node must be booted in noncluster mode in order to upgrade the framework. Press Control-d at any time to return to the Main Menu.
yes
scinstall -u update
Starting upgrade of Sun Cluster framework software
Saving current Sun Cluster configuration
Do not boot this node into cluster mode until upgrade is complete.
Renamed "/etc/cluster/ccr" to "/etc/cluster/ccr.upgrade".
** Removing Sun Cluster framework packages **
Removing SUNWscspmr..done
Removing SUNWscspmu..done
Removing SUNWscspm...done
Removing SUNWscva....done
Removing SUNWscmasa..done
Removing SUNWmdm.....done
Removing SUNWscvm....done
Removing SUNWscsam...done
Removing SUNWscsal...done
Removing SUNWscman...done
Removing SUNWscgds...done
Removing SUNWscdev...done
Removing SUNWscnm....done
Removing SUNWscsck...done
Removing SUNWscu.....done
Removing SUNWscr.....done
** Installing SunCluster 3.2 framework **
SUNWscu.....done
SUNWsccomu..done
SUNWsczr....done
SUNWsccomzu..done
SUNWsczu....done
SUNWscsckr..done
SUNWscscku..done
SUNWscr.....done
SUNWscrtlh..done
SUNWscnmr...done
SUNWscnmu...done
SUNWscdev...done
SUNWscgds...done
SUNWscsmf...done
SUNWscman...done
SUNWscsal...done
SUNWscsam...done
SUNWscvm....done
SUNWmdmr....done
SUNWmdmu....done
SUNWscmasa..done
SUNWscmasar..done
SUNWscmasasen..done
SUNWscmasau..done
SUNWscmautil..done
SUNWscmautilr..done
SUNWjfreechart..done
SUNWscspmr..done
SUNWscspmu..done
SUNWscderby..done
SUNWsctelemetry..done
Dec 19 12:09:10 rico java[9874]: pkcs11_softtoken: Keystore version failure.
Ensure that the EEPROM parameter "local-mac-address?" is set to "true" ... done
Restored /etc/cluster/ccr.upgrade to /etc/cluster/ccr
Completed Sun Cluster framework upgrade
Updating nsswitch.conf ... done
Press Enter to continue:
This option is used to upgrade Sun Cluster data service agents on this node. Press Control-d at any time to return to the Main Menu.
You must specify the location of the Java Enterprise System (JES) distribution that contains the Sun Cluster data service agents. The name that you give must be the full path to the directory that contains the "Solaris_sparc" subdirectory.
Where is it located? /net/srvr/sc32
Select the data service agents you want to upgrade:
Identifier   Description
1) iws       Sun Cluster HA Sun Java System Web Server
2) oracle    Sun Cluster HA for Oracle
3) All       All data services in this menu
List of upgradable data services agents:
(*) indicates selected for upgrade.

    * iws
    * oracle

This is the complete list of data services you selected:

    oracle
    iws

Is it correct (yes/no) [yes]?
Is it okay to upgrade these data services now (yes/no) [yes]?
Do not boot this node into cluster mode until upgrade is complete.

** Removing HA oracle Data Service on Sun Cluster **
Removing SUNWscor....done
** Installing Sun Cluster HA for Oracle **
SUNWscor....done
** Removing HA Sun Java System Web Server **
Removing SUNWschtt...done
** Installing Sun Cluster HA Sun Java System Web Server **
SUNWschtt...done
Completed upgrade of Sun Cluster data services agents

Press Enter to continue:
The packages for the previous version are removed, but the resource types remain registered. The packages for the new data services are added, but any new resource type versions are not registered. The scinstall command does not upgrade the already instantiated resources for any types, such as oracle_server and oracle_listener, that require type version upgrades. This must be done after booting the cluster, and is discussed later in the module.
Managing Dual-Partition Upgrades (Non Live-Upgrade) Using scinstall

This option is completely harmless and can be called at any time, or not at all if you know what your partitioning scheme will be. For example, this was called on a three-node cluster where all nodes were attached to the shared data storage:

rico:/# scinstall -u plan

Option 1
  First partition
    rico
    midnight
  Second partition
    noodle

Option 2
  First partition
    rico
    noodle
  Second partition
    midnight

Option 3
  First partition
    rico
  Second partition
    midnight
    noodle
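The option list that scinstall -u plan prints is essentially an enumeration of the distinct two-way splits of the node set (on a real cluster it is also filtered by shared-storage connectivity rules). A stand-alone sketch of that enumeration for the same three node names, with the connectivity filtering omitted:

```shell
#!/bin/sh
# Sketch: enumerate the distinct two-way partition splits of a
# three-node cluster, the raw material behind scinstall -u plan.
# Node names are taken from the transcript above; real planning
# also filters by shared-storage connectivity, omitted here.

nodes="rico midnight noodle"
max=7                           # 2^3 - 1: bitmask of all three nodes

opt=0
i=1
while [ $i -lt $max ]; do
    comp=$((max - i))           # bitmask of the complementary subset
    if [ $i -lt $comp ]; then   # count each split only once
        opt=$((opt + 1))
        first=""; second=""
        bit=1
        for n in $nodes; do
            if [ $((i & bit)) -ne 0 ]; then
                first="$first $n"
            else
                second="$second $n"
            fi
            bit=$((bit * 2))
        done
        echo "Option $opt: first partition:$first / second:$second"
    fi
    i=$((i + 1))
done
# Prints three options, matching the three splits in the transcript.
```

The complementary-subset check is why a three-node cluster yields exactly (2^3 - 2) / 2 = 3 options: swapping the first and second partitions does not create a new plan.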
You run it from any one node before any upgrades at all have been performed. You run it from the Sun Cluster 3.2 installation media, regardless of which version of Sun Cluster you are currently running (previous to the upgrade). You specify which nodes will be in the partition that is shut down first. Quorum votes are manipulated on the remaining nodes: once the nodes you specify are shut down, the only quorum votes remaining will be those belonging to the remaining nodes (with quorum device votes set as if only the remaining nodes were attached). The nodes that you specify (the first-partition nodes) are automatically halted (through ssh or rsh). This may or may not include the node from which you are running the option. The intention is that you do the complete upgrade of the nodes that have been halted, not booting them into the cluster until upgrades of all layers have been completed.
Note: The utility automatically inserts scripts into the first-partition nodes to prevent you from accidentally trying to boot them into the cluster until you are ready to use the next option (apply) to perform the flop-over.

If you use the interactive menus, the dialogue will have you choose the first-partition nodes. If you use the non-interactive command-line version, you specify the first-partition nodes using the -h option, as in the following example:

rico:/# scinstall -u begin -h rico

Broadcast Message from root (???) on rico Tue Dec 19 11:30:45...
THE SYSTEM rico IS BEING SHUT DOWN NOW ! ! !
Log off now or risk your files being damaged
Applying Changes to the First Partition (Initiating the Flop-Over Menu Option 4)
After performing all your upgrades to the nodes in the first partition, you use this menu option on any one of the first-partition nodes. The noninteractive command-line version is:

# /usr/cluster/bin/scinstall -u apply

This operation performs all of the following operations for you automatically:

1. Nodes of the first partition are rebooted back into cluster mode. They do not communicate with the non-upgraded, second-partition nodes. Rather, for a short amount of time, you have two separate clusters running side by side.

2. A node in the first partition will (through an automatically provisioned boot-service):

   a. ssh or rsh to the nodes of the second partition
   b. Halt the clustered applications there
   c. Halt those nodes
   d. Initiate application takeover on the first partition
This is the whole beauty of the dual-partition upgrade strategy: your applications are down only for the length of time it takes the first-partition nodes to halt the second-partition nodes and to take over the applications.
Java ES shared components
Sun Cluster software framework packages
Sun Cluster data service packages
You can perform all these upgrades directly into your new boot environment on all cluster nodes simultaneously, while your entire cluster is still booted and available using your original root disks.
Note: The current bug in this procedure, and part of the reason it is not supported, is a timing issue. The first-partition nodes may cause a disk reservation conflict after calling init 6 without giving the other nodes sufficient time to shut down. This may cause a kernel panic. In the lab exercises, if you want to try this experimental procedure, you will modify the run_reserve script to put in a delay to work around this problem.
Reviewing Sun Cluster Software Upgrade Issues (All Methods)
The cluster framework upgrade procedure (scinstall -u update, or submenu item 1 of the upgrade menu) is idempotent. This means that you can resume the command after it is interrupted. The estimated time to upgrade a single node can vary greatly, depending on the node horsepower. Output from the upgrade process is logged in the /var/cluster/upgrade directory (configuration and state information) and the /var/cluster/logs/install directory (files created during upgrade). If you are using Live Upgrade, these will be in the new boot environment, not the old one.
Once a particular node is upgraded, that node's upgrade (on that particular boot environment) is irreversible.
Multiple versions of the same resource type must be able to coexist within the same cluster. The contents of the RTR files for these versions fully describe each resource type version. You can upgrade resources from an old type-version to a new type-version without having to delete and recreate them.
Examining Resource Types and Resource Upgrades (Post Cluster-Upgrade)
ANYTIME
    You can upgrade the resource type version whether the resource is online or offline.

WHEN_DISABLED
    You must disable the resource in order to upgrade its type version.

WHEN_OFFLINE
    The resource must be offline in order to upgrade (it could still be enabled, if its group were completely offline).

WHEN_UNMONITORED
    You can upgrade the resource type version when the resource is offline or online, but not monitored. Likely, the new resource type version has new monitoring code.

WHEN_UNMANAGED
    You can upgrade the resource type version only when the resource is in a group that is unmanaged.

AT_CREATION
    This is a more polite way of saying never. You need to delete the resource and add it back as a new version.
For example, the oracle_server RTR file for resource type version 6 contains directives:

#$upgrade_from "1"   anytime
#$upgrade_from "3.1" anytime
#$upgrade_from "4"   anytime
#$upgrade_from "5"   anytime
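Directives like these can be checked mechanically before an upgrade. The following stand-alone sketch embeds a sample RTR fragment in a variable (the file path named in the comment and the version values are illustrative) and maps the tunability keyword for a given old version to the state change required:

```shell
#!/bin/sh
# Sketch: determine what state a resource must be in before its
# Type_version can be raised, by reading the #$upgrade_from directive
# for the version you are upgrading FROM. The RTR fragment below is
# sample data; on a real node you would read the installed RTR file,
# for example /opt/cluster/lib/rgm/rtreg/SUNW.oracle_server.

OLD_VERSION="4"

RTR_SAMPLE='#$upgrade
#$upgrade_from "1"   anytime
#$upgrade_from "3.1" anytime
#$upgrade_from "4"   anytime
#$upgrade_from "5"   when_disabled'

# Pull out the tunability keyword for the old version
keyword=$(printf '%s\n' "$RTR_SAMPLE" |
    awk -v v="\"$OLD_VERSION\"" '$1 == "#$upgrade_from" && $2 == v { print $3 }')

case $keyword in
    anytime)          echo "No state change needed; just set Type_version." ;;
    when_disabled)    echo "Disable the resource first (clrs disable)." ;;
    when_offline)     echo "Take the group offline first (clrg offline)." ;;
    when_unmonitored) echo "Unmonitor the resource first (clrs unmonitor)." ;;
    when_unmanaged)   echo "Unmanage the group first (clrg unmanage)." ;;
    "")               echo "No upgrade_from directive: treat as AT_CREATION." ;;
esac
# For OLD_VERSION="4" this prints:
# No state change needed; just set Type_version.
```

The awk pattern simply matches the quoted version token in the second field, which mirrors how you would eyeball the directive by hand.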
Assuming that the package containing the methods and description for the new version of the resource type is installed (for example, by the procedure that you used to upgrade the data service packages), you use the following steps to register new versions of resource types and upgrade resource instances to these new resource type versions:

1. Determine whether you need to change the state of the resource to upgrade its type. Look for upgrade_from directives in the RTR file (for example, in /opt/cluster/lib/rgm/rtreg/SUNW.oracle_server). These files are discussed in more detail in the next module.

2. Put the resource in the appropriate state. Here are some example commands:

# //do nothing, if it can be upgraded anytime
# clrs disable res
# clrs unmonitor res
# clrg offline rg-containing-res
# clrg unmanage rg-containing-res

3. Register the new resource type version:

# clrt register res-type

4. For each resource of the old version, change the Type_version property to the new version:

# clrs set -p Type_version=new-version res

5. Restore the state of the resource, the resource monitor, or the resource group, if necessary. Type one of the following commands:

# //nothing, if it is already online
# clrs enable res
# clrs monitor res
# clrg online -M rg-containing-res

6. (Optional) Unregister the old resource types. Although you removed the resource type packages using the scinstall upgrade procedure, they are still registered and can cause confusion for subsequent resource creation.

# clrt unregister old-resource-type
4. Check the upgrade directives in the RTR file:

# grep upgrade /opt/cluster/lib/rgm/rtreg/SUNW.oracle_server
#$upgrade
#$upgrade_from "1" anytime
#$upgrade_from "3.1" anytime
#$upgrade_from "4" anytime
#$upgrade_from "5" anytime

5. Upgrade the ora-server-res resource to the new version type:

# clrs set -p Type_version=6 ora-server-res

6. Verify that the upgrade succeeded:

# clrs show ora-server-res

=== Resources ===

Resource:                     ora-server-res
  Type:                       SUNW.oracle_server:6
  Type_version:               6
  Group:                      ora-rg
  R_description:
  Resource_project_name:      default
  Enabled{rico}:              True
  Enabled{midnight}:          True
  Monitored{rico}:            True
  Monitored{midnight}:        True
Version 3.1 and 3.2 RTR files contain the #$upgrade directive; version 3.0 software RTR files do not. All Sun Cluster 3.0 resources are considered to be of Type_version 1. Sun Cluster 3.0 did not have any functionality to manipulate resource type versions, but Sun Cluster 3.1 and 3.2 type upgrades all consider resource type version 1 a valid type version from which to upgrade. Sun Cluster 3.1 and 3.2 RTR files must define the RT_VERSION resource type property in addition to the START, STOP, and RESOURCE_NAME properties. The Sun Cluster 3.1 and 3.2 software stores 3.1 and 3.2 resource types in the Cluster Configuration Repository (CCR) under a concatenated name. If you are upgrading from Sun Cluster 3.0, the software continues to store previous software version resource types in the CCR under a non-concatenated name. Unless there is a compelling reason not to do so, unregister all old Sun Cluster software data services after you upgrade all resource instances to the new version data services. Even though old version data services can coexist with new version data services in a cluster, it is usually confusing to have both types.
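The concatenated name has the shape vendor.typename:version, as in the SUNW.oracle_server:6 seen in the clrs show output earlier. Splitting such a name needs nothing more than POSIX parameter expansion; a small sketch:

```shell
#!/bin/sh
# Sketch: split a concatenated resource type name of the form
# VENDOR.typename:version (for example SUNW.oracle_server:6)
# into its parts using POSIX shell parameter expansion.

rt_name="SUNW.oracle_server:6"    # sample value

vendor=${rt_name%%.*}             # everything before the first '.'
rest=${rt_name#*.}                # everything after the first '.'
typename=${rest%%:*}              # everything before the ':'
version=${rt_name##*:}            # everything after the last ':'

echo "vendor=$vendor type=$typename version=$version"
# Prints: vendor=SUNW type=oracle_server version=6
```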
Task 1  Patching and restoring vfstab files
Task 2  Verify dependency software
Task 3  Upgrade the Sun Cluster software framework
Task 4  Upgrade the Sun Cluster software data services
Task 5  Run the fixforzones script
Task 6A Reboot the cluster nodes
Task 6B Reboot the cluster nodes (experimental rolling reboot)
Task 7  Upgrade resource instances whose types have new versions
Task 9  Upgrade disk groups (VxVM only)
Task 10 Verify your cluster operation
3. Patch your upgraded OS. At the time of writing this course, there are two required IDR patches.

# cd patch_location
# patchadd -R /.alt.s10be -M . patches
4. Check and fix the nsswitch.conf file in the new boot environment. Make sure the line for ipnodes references only the files keyword, regardless of any other name service you are using:

# vi /.alt.s10be/etc/nsswitch.conf
ipnodes: files
5. Create an empty state file so that you are not asked about NFSv4 domains the first time you boot your new OS.

# touch /.alt.s10be/etc/.NFS4inst_state.domain
Note: The remaining Live Upgrade procedures rely on the new boot environment still being mounted. Do not perform the luumount command here.
Use the upgrade state file produced in step 1 to perform a silent Java ES installation of the shared components directly into the alternate boot environment.

# cd sc32_directory/Solaris_sparc
# ./installer -noconsole -nodisplay \
  -altroot /.alt.s10be \
  -state /var/tmp/jesupgr.state
Exercise: Upgrading the Sun Cluster Software

You will need to wait for both nodes to load their SMF services, but eventually your new cluster should be active and your cluster services should run automatically. Try to time how long your applications were not available.
On the chosen node, initiate the dual-partition rolling reboot. Ignore any error messages that you see right after the command, unless you mistyped it.
# cd sc32_directory/Solaris_sparc/Product/sun_cluster
# cd Solaris_10/Tools
# ./scinstall -u begin -R /.alt.s10be \
  -h name_of_node_you_are_typing_on

6. Observe the reboot sequence and time the amount of time your applications are down. This will take more total time than the supported post-live-upgrade reboot, but your applications will be down for a shorter duration.

7. Restore the run_reserve file on the node you had driven from:
Warning: Make sure you do not do this until both nodes have rebooted successfully, in the rolling fashion, into the new OS.

# cd /usr/cluster/lib/sc
# mv run_reserve.save run_reserve
2. Upgrade the Oracle resources. Ignore validation error messages that occur, as usual, on the node not mounting the failover file system.

# clrs set -p Type_version=6 ora-server-res
# clrs set -p Type_version=5 ora-listener-res
# clrs set -p Type_version=4 ora-stor
3. Upgrade the iws resources.

# clrs set -p Type_version=5 iws-res
# clrs set -p Type_version=4 iws-stor
4. Unregister the old types:

# clrt unregister SUNW.oracle_server:4
# clrt unregister SUNW.oracle_listener:4
# clrt unregister SUNW.iws:4
# clrt unregister SUNW.HAStoragePlus:2
Exercise Summary
Discussion: Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
Module 3
Understand Sun Cluster data services
Write Sun Cluster 3.x software data services
Control RGM behavior through resource group properties and resource properties
Use advanced resource group relationships
Tune multimaster and scalable applications
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics described in this module. Although they are not expected to know the answers to these questions, the answers should interest them and inspire them to learn the material.
What are the benefits of being able to configure instances of a custom resource type using extension properties? Does Sun Cluster require that all application fault monitors use pretty much the same logic, or is it just a convention? What is the benefit of the convention? If GDS had been invented with the initial release of Sun Cluster 3.0, would there be so many specific resource types?
Additional Resources
The following references provide additional details on the topics described in this module:
hatimerun(1M)
pmfadm(1M)
property_attributes(5)
r_properties(5)
rg_properties(5)
rpc.pmfd(1M)
rt_callbacks(1HA)
rt_properties(5)
rt_reg(4)
scdsbuilder(1HA)
scdsconfig(1HA)
scdscreate(1HA)
scha_calls(3HA)
SUNW.gds(5)
Sun Microsystems, Inc. Sun Cluster Software Administration Guide for Solaris OS, part number 819-2971.
Sun Microsystems, Inc. Sun Cluster Data Service Developer's Guide for Solaris OS, part number 819-2972.
A data service consists of the application itself, which is installed separately from the cluster framework software, and the following:
Methods (also known as callback methods) that the cluster calls in response to automatic or manual requests
Fault probes that monitor the health of the application
A resource type registration (RTR) file that defines:
    Methods
    Properties
    Other directives, such as upgrade directives (as seen earlier in the course)
The state of the disabled/enabled flag for a resource is preserved during state transitions, unless you explicitly add a -e option to the clrg switch or clrg online commands. A managed but Offline group is still subject to going Online automatically upon cluster reconfiguration (that is, node failure or node joining), unless it is suspended.
Figure 3-1 shows the resource group state model: a managed group moves between Online (resources running, if enabled) and Offline (no resources running) with clrg online rg, clrg offline rg, and clrg switch -n node rg; the enabled/disabled state of each resource is preserved across these transitions and is changed with clrs enable res / clrs disable res (which affects whether a resource will run when its group is switched on). clrg manage rg brings an Unmanaged group into the managed, Offline state.

Figure 3-1
Callback Methods
There are several methods that you can create, but only those that start and stop the application are required. Table 3-1 includes the full list of methods and their descriptions. Refer to the man page for the rt_callbacks(1HA) command for more information on callback methods.

Table 3-1 Data Service Callback Methods

START
    This method starts the application, usually under PMF control. It is called when bringing a resource online.

STOP
    This method stops the application. It is called when bringing a resource offline.

MONITOR_START
    This method starts the fault monitors, usually under PMF control. It is called when bringing a resource online.

MONITOR_STOP
    This method stops the fault monitors. It is called when bringing a resource offline.

MONITOR_CHECK
    This method performs a sanity check before a failover to validate that a given data service can run on the proposed destination node.

PRENET_START
    This method is used in addition to or instead of the START method. It is called before the network resources are configured when bringing a resource online.

POSTNET_STOP
    This method is used in conjunction with or instead of the STOP method to stop the application. It is called after the network resources are unconfigured when bringing a resource offline.

VALIDATE
    This method is called when a resource is instantiated or a resource property is modified to validate that the requested change is okay. If it is not okay, this method vetoes the change (exits as non-zero).

INIT
    This method is called: (A) When a resource is added to a managed resource group. (B) For all resources in a group when a resource group moves from the unmanaged to the managed state.
Table 3-1 Data Service Callback Methods (Continued)

BOOT
    This method is called when a node joins the cluster, if the resource group containing the resource is already managed.

FINI
    This method is the opposite of INIT. It is called: (A) When a resource is removed from a managed resource group. (B) For all resources in a group when a resource group moves from the managed to the unmanaged state.

UPDATE
    This method is called when a resource is instantiated or a resource property is modified. If the VALIDATE method fails, then the UPDATE method is not called.
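Of the methods above, only START and STOP are mandatory. The following stand-alone sketch reduces such a pair to its essence; the pidfile path and the sleep stand-in daemon are illustrative only (a real START method would launch the daemon under PMF with pmfadm and read its configuration with scha_resource_get):

```shell
#!/bin/sh
# Sketch: the two required callback methods, reduced to their essence.
# A sleep stands in for the application daemon; the pidfile path is
# illustrative. Real methods run under the RGM with PMF supervision.

PIDFILE=/tmp/demo_svc.pid

svc_start() {
    sleep 300 &                 # stand-in for the application daemon
    echo $! > "$PIDFILE"
}

svc_stop() {
    if [ -f "$PIDFILE" ]; then
        kill "$(cat "$PIDFILE")" 2>/dev/null
        rm -f "$PIDFILE"
    fi
    return 0                    # STOP must succeed even if already gone
}

svc_start
kill -0 "$(cat "$PIDFILE")" && echo "daemon running"
svc_stop
[ ! -f "$PIDFILE" ] && echo "daemon stopped"
# Prints: daemon running
#         daemon stopped
```

Note the idempotent STOP: the RGM treats a failing STOP method as a serious error, so stop logic conventionally succeeds even when there is nothing left to kill.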
You can still transition the resource group any way you like using the commands presented on the previous pages.
You can still enable/disable any individual resources, using the commands presented on the previous pages. If the resource group is online, the resources will go on and off accordingly.
The fault monitors for resources will still be started.
Resources will not automatically be restarted by fault monitors, nor will entire groups automatically fail over, even if an entire node fails.
The reason you might want to suspend an online resource group is to perform maintenance on it; that is, to start and stop some applications manually while preserving the online status of the group and other components, so that dependencies can still be honored correctly. The reason you might suspend an offline resource group is so that it does not go online automatically when you did not intend it to do so. For example, when you put a resource group offline (but it is still managed and not suspended), a node failure still causes the group to go online. To suspend a group, type:

# clrg suspend grpname

To remove the suspension of a group, type:

# clrg resume grpname

To see whether a group is currently suspended, use the clrg status command.
Resource type properties, such as the Resource_type, RT_version, RT_basedir, START, STOP, and upgrade directive properties
Standard properties that specify specific minima, maxima, defaults, and requirements for this type
Extension properties applicable to this resource type only
Starting in SC3.2, by convention, application-oriented RTR files live in /opt/cluster/lib/rgm/rtreg, and RTR files for built-in types like LogicalHostname and SharedAddress still live in /usr/cluster/lib/rgm/rtreg.
RESOURCE_TYPE = "iws";
VENDOR_ID = SUNW;
RT_DESCRIPTION = "HA Sun Java System Web Server";
RT_VERSION = "5";
API_VERSION = 2;
INIT_NODES = RG_PRIMARIES;

(VALIDATE, INIT, BOOT, and FINI for resources of this type are called on all nodes on the nodelist for that resource group, which is logical. The alternate value is RT_INSTALLED, which would call the methods for a resource on all nodes where the type is installed, even ones not in the nodelist for that RG.)

RT_BASEDIR = /opt/SUNWschtt/bin;
FAILOVER = FALSE;

(In other words, it could be scalable as well.)

START         = iws_svc_start;
STOP          = iws_svc_stop;
VALIDATE      = iws_validate;
UPDATE        = iws_update;
MONITOR_START = iws_monitor_start;
MONITOR_STOP  = iws_monitor_stop;
MONITOR_CHECK = iws_monitor_check;

PKGLIST = SUNWschtt;

#
# Upgrade directives
#
#$upgrade
#$upgrade_from "1.0" anytime
#$upgrade_from "3.1" anytime
#$upgrade_from "4" anytime
# The paramtable is a list of bracketed resource property declarations
# that come after the resource-type declarations.
# The property-name declaration must be the first attribute
# after the open curly of a paramtable entry.

(Defining standard property types is done typically to set a default for THIS resource type; whichever ones are omitted can still be used, but you would just get the "default default" from man r_properties. There is an exception for standard properties relating to load balancing, such as Network_resources_used and Scalable. These are standard properties, but they must be mentioned in the RTR file in order to be used at all for this resource type.)

{
    PROPERTY = Start_timeout;
    MIN = 60;
    DEFAULT = 300;
}
{
    PROPERTY = Stop_timeout;
    MIN = 60;
    DEFAULT = 300;
}
.
. [Lists of similar ones are omitted to save paper.]
.
{
    PROPERTY = FailOver_Mode;
    DEFAULT = SOFT;
    TUNABLE = ANYTIME;
}
{
    PROPERTY = Network_resources_used;
    TUNABLE = AT_CREATION;
    DEFAULT = "";
}
{
    PROPERTY = Scalable;
    DEFAULT = FALSE;
    TUNABLE = AT_CREATION;
}
{
    PROPERTY = Load_balancing_policy;
    DEFAULT = LB_WEIGHTED;
    TUNABLE = AT_CREATION;
}
{
    PROPERTY = Load_balancing_weights;
    DEFAULT = "";
    TUNABLE = ANYTIME;
}
{
    PROPERTY = Port_list;
    DEFAULT = "80/tcp";
    TUNABLE = AT_CREATION;
}

#
# Extension Properties
#
# Not to be edited by end user
{
    PROPERTY = Paramtable_version;
    EXTENSION;
    STRING;
    DEFAULT = "1.0";
    DESCRIPTION = "The Paramtable Version for this Resource";
}

# Must specify installation path of iPlanet (on PXFS)
# Can be a SET of these for sticky mode scalable iPlanet
# Web servers (These need to be under the same resource).
{
    PROPERTY = Confdir_list;
    EXTENSION;
    STRINGARRAY;
    TUNABLE = AT_CREATION;
    DESCRIPTION = "The Configuration Directory Path(s)";
}

# These two control the restarting of the fault monitor itself
# (not the server daemon) by PMF.
{
    PROPERTY = Monitor_retry_count;
    EXTENSION;
    INT;
    MIN = -1;
    DEFAULT = 4;
    TUNABLE = ANYTIME;
    DESCRIPTION = "Number of PMF restarts allowed for the fault monitor";
}
{
    PROPERTY = Monitor_retry_interval;
    EXTENSION;
    INT;
    MIN = -1;
    DEFAULT = 2;
    TUNABLE = ANYTIME;
    DESCRIPTION = "Time window (minutes) for fault monitor restarts";
}

# This is an optional property, which determines whether to failover when
# retry_count is exceeded during retry_interval.
{
    PROPERTY = Failover_enabled;
    EXTENSION;
    BOOLEAN;
    DEFAULT = TRUE;
    TUNABLE = WHEN_DISABLED;
    DESCRIPTION = "Determines whether to failover when retry_count is exceeded during retry_interval";
}

# Time out value for the probe
{
    PROPERTY = Probe_timeout;
    EXTENSION;
    INT;
    MIN = 15;
    DEFAULT = 90;
    TUNABLE = ANYTIME;
    DESCRIPTION = "Time out value for the probe (seconds)";
}

# List of URIs to be probed.
# The iws agent probe will send HTTP/1.1 GET requests to each of
# the listed URIs. The probe looks at the http response code and
# regards 500 (Internal Server Error) as a failure.
{
    PROPERTY = Monitor_Uri_List;
    EXTENSION;
    STRINGARRAY;
    DEFAULT = "";
    TUNABLE = ANYTIME;
    DESCRIPTION = "URI(s) that will be monitored by the agent probe";
}
Callback methods
Resource Management application programming interface (RMAPI)
Process Monitor Facility (PMF)
Data Service Development Library (DSDL)
The hatimerun command
Figure 3-2 shows the overall architecture: resource type callback methods sit above the libdsdev (DSDL) and libscha (RMAPI) libraries, which in turn make use of PMF, hatimerun(1M), and the RGM.

Figure 3-2 Resource Types
Access property values for resources and resource groups
Request restart of a resource or failover of a whole group
Get or set resource status
PMF: The process monitoring facility, which provides a means of monitoring processes and their descendants, and restarting them if they should stop.
Note: PMF is not required. For example, the Oracle data service launches Oracle processes by issuing the database startup command, and the Oracle application itself is not monitored by PMF.
The hatimerun command: A facility for running programs under a timeout (likely an application probe).
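The idea behind hatimerun can be illustrated with a few lines of plain shell. This sketch is not hatimerun's implementation, just the concept: launch the command, arm a watchdog, and report failure if the watchdog fires first:

```shell
#!/bin/sh
# Sketch: run a command under a timeout, in the spirit of what
# hatimerun(1M) provides to data service probes. Illustration only,
# using plain POSIX shell job control.

run_with_timeout() {
    secs=$1; shift
    "$@" &                      # launch the probe/command
    cmd_pid=$!
    ( sleep "$secs"; kill "$cmd_pid" 2>/dev/null ) &
    watchdog=$!
    wait "$cmd_pid"             # command status, or kill status if timed out
    status=$?
    kill "$watchdog" 2>/dev/null
    return "$status"
}

# A "probe" that finishes quickly succeeds...
run_with_timeout 5 true && echo "probe ok"
# ...while one that hangs past the timeout is killed and reports failure.
run_with_timeout 1 sleep 10 || echo "probe timed out"
```

A fault monitor built this way treats any nonzero status, whether from the probe itself or from the watchdog kill, as a probe failure.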
Access to all of these facilities is via either C library routines or command-line utilities that can be called from a script. Therefore, the methods for data services built without DSDL can be written in any programming language. Figure 3-3 illustrates the interfaces for a data service built without DSDL.
Figure 3-3 shows the interfaces without DSDL: resource type callback methods call libscha (RMAPI) directly, together with PMF and hatimerun(1M), beneath which sits the RGM.

Figure 3-3
The libscha.so library: The C language implementation of the RMAPI.
The PMF service: The process monitoring facility, which provides a means of monitoring processes and their descendants, and restarting them if they should stop.
The hatimerun command: A facility for running programs under a timeout.
The libdsdev.so library contains the DSDL functions. It is accessible from C and C++ programs only. Figure 3-4 illustrates the interfaces for a data service built with DSDL.
Figure 3-4 shows the interfaces with DSDL: resource type callback methods call libdsdev (DSDL), which layers over libscha (RMAPI), PMF, and hatimerun(1M), beneath which sits the RGM.

Figure 3-4
Restart the daemon every time it dies
Restart the daemon a certain number of times within a certain number of minutes
Invoke an action script
A combination of the previous two (invoke an action script only if the daemon dies more times than the threshold)
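The second option, restarts counted within a sliding time window, can be illustrated stand-alone. This sketch is not PMF itself (pmfadm applies the policy to real process trees); the death timestamps and thresholds are sample data:

```shell
#!/bin/sh
# Sketch: the sliding-window restart policy (restart a daemon at most
# N times within a window of M seconds; give up past the threshold).
# "Death times" here are sample epoch timestamps, not real processes.

RETRIES=2          # allowed deaths-with-restart...
WINDOW=60          # ...within this many seconds

deaths="100 130 150 400"       # sample times at which the daemon died

history=""
for t in $deaths; do
    # Keep only the deaths that fall inside the window ending at time t
    kept=""
    count=0
    for h in $history $t; do
        if [ $((t - h)) -le $WINDOW ]; then
            kept="$kept $h"
            count=$((count + 1))
        fi
    done
    history=$kept
    if [ "$count" -gt "$RETRIES" ]; then
        echo "t=$t: threshold exceeded, invoke action script / give up"
    else
        echo "t=$t: restart daemon (death $count in window)"
    fi
done
# For the sample data: restarts at t=100 and t=130, threshold exceeded
# at t=150, and a fresh restart at t=400 (the old deaths aged out).
```

The key behavior is the last line of output: once the window slides past the earlier deaths, the failure count resets and restarts are allowed again.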
PMF monitors, but does not try to restart, the application daemon itself.
PMF uses the scds_pmf_action_script to tell the fault monitor about the application daemon death (every single data service written with DSDL uses the same action script).
The fault monitor decides whether to have the RGM restart the service or fail over the whole resource group.
Figure 3-5 shows the flow: PMF watches the application daemons; on daemon death it runs the action script, which informs the fault monitor; the fault monitor then asks the RGM for a restart or a failover.

Figure 3-5
Lets PMF inform it immediately of application daemon death, through the action script.
Periodically probes the application. For example:
To verify that the NFS service is healthy, the fault probe sends null RPC requests to the NFS service daemons.
To verify that the Sun Java Web Server or Apache Web Server is healthy (the probes are identical), the fault probe contacts the web server port and retrieves the head information from a configurable list of URLs (it defaults to just the root URL).
If these requests return to the probe within probe_timeout seconds, then the probe concludes that the service is healthy. Otherwise, it increments the failure history and either restarts the service or fails it over, depending on the values of the retry_count and retry_interval properties and the number of recorded failures. DSDL fault monitors define a partial failure that contributes a fractional quantity to the failure history. Actual values for these fractional quantities are determined by the resource type developer, but they must be scaled to a number between 0 and 1 before being added to the failure history. To illustrate this concept, suppose there is an instance of an Apache web server running in a given resource group. If the probe successfully establishes a TCP connection to the web server but fails to read all the requested data before the probe times out, the failure is considered a partial (50 percent) failure. Two such partial failures that accumulate within the same retry interval are considered a complete failure.
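The failure-history arithmetic described above can be sketched stand-alone. The outcome values and retry_count below are illustrative sample data; in a real DSDL monitor this bookkeeping is handled inside the library's fault monitor calls:

```shell
#!/bin/sh
# Sketch: how a DSDL-style fault monitor accumulates fractional
# failures into a failure history and compares it with retry_count.
# All values are illustrative sample data.

RETRY_COUNT=2

# Probe outcomes within one retry_interval, scaled to 0..1
# (0.5 = partial failure, 1 = complete failure)
outcomes="0.5 0.5 1 1"

history=0
for f in $outcomes; do
    # POSIX sh has no float arithmetic, so let awk do the sums
    history=$(awk -v h="$history" -v f="$f" 'BEGIN { print h + f }')
    over=$(awk -v h="$history" -v r="$RETRY_COUNT" 'BEGIN { print (h > r) }')
    if [ "$over" -eq 1 ]; then
        echo "failure history $history exceeds retry_count: request failover"
    else
        echo "failure history $history: restart service"
    fi
done
# The two 0.5 partial failures add up to one complete failure, and the
# history crosses retry_count only on the final probe outcome.
```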
Writing Sun Cluster 3.x Software Data Services

Figure 3-6 depicts how a DSDL fault monitor checks the health of a service.
Figure 3-6  DSDL fault monitor probe loop: the monitor sleeps for Thorough_probe_interval, then probes the service. On success it loops again. On a partial failure it increments the failure count by a fractional amount, sets the status to degraded, and updates the failure history to suggest a restart; on a complete failure it increments the failure count by 1. While the failure history is below retry_count the monitor requests a restart; once retry_count is reached it sets the status to failed and calls scha_control to suggest a failover.
Let PMF do it directly. Let the fault monitor do it directly, without the involvement of the RGM. Let the fault monitor restart the resource, but inform the RGM that the resource is being restarted. This is implemented with the low-level call:
scha_control -O RESOURCE_IS_RESTARTED -G RG -R RES
Let the fault monitor tell the RGM to restart the resource. This is implemented with the low-level call:
scha_control -O RESOURCE_RESTART
DSDL fault monitors always choose the last of these options (it is embedded in the DSDL restart function). There are several reasons to have the RGM perform resource restarts rather than restarting resources outside the scope of the RGM. This is discussed further in the advanced resource control section of this module.
They may have different extension properties that are used to customize different instances of that same resource type. The START method of each resource type will have code particular to that type that gets values for the extension properties and uses them to launch the application daemon with the correct parameters. The part of the fault monitor that actually probes the application will be particular to that application. The STOP method may have some customized code to stop an application (such as Apache calling apachectl stop).
The application to be launched (the Start_command)
The application probe to be called by the standard fault monitor framework (the Probe_command)
The command to stop the application (the Stop_command)
Invent some configuration file that you will put on all nodes (or in a global file system), usually with some VARIABLE=VALUE lines. Have the application specified by the Start_command be a wrapper around your real application. The wrapper reads in the configuration values from the file in order to correctly start the customized instance of your application. Use the configuration values in a similar way with the Probe_command and Stop_command.
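For example, a wrapper might read such a configuration file before launching the real daemon. This is only a sketch: the file name, variables, and daemon name below are invented for illustration (in practice the file would live on a global file system):

```shell
# Hypothetical wrapper: read VARIABLE=VALUE lines from a shared
# configuration file, then start the real application with the values.
CONFFILE=/tmp/myapp.conf    # illustration; normally a global FS path

# Create a sample configuration file, as an administrator would:
cat > "$CONFFILE" <<'EOF'
PORT=8080
DATADIR=/global/myapp/data
EOF

# The wrapper sources the file to pick up the values ...
. "$CONFFILE"

# ... and would then launch the real daemon with them:
echo "would run: myappd -p $PORT -d $DATADIR"
```

The same sourcing trick serves the probe and stop wrappers, so all three commands see identical configuration without any extension properties.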
Reducing code bloat. Ensuring that the 95 percent redundant part in every data service is implemented correctly with DSDL. If you program DSDL yourself, you may make programming errors in the calls or the convention. Simplifying configuration by using a custom configuration file rather than application-specific extension properties.
GDS does not allow you to configure different application instances using extension properties. Using GDS, you cannot use the standard clrs show command to see extension properties that customize particular instances. With GDS, configuration of your custom parameters cannot be validated with a VALIDATE method, which is valuable in a non-GDS agent. They must be validated by your wrappers.
Real new resource types whose methods are written in C using DSDL (you must have a C compiler in your PATH to be offered this option)
Real new resource types whose methods are written in ksh (no DSDL)
New applications to be configured as an instance of SUNW.gds
Note There is less value in using the builder for a new application configured as an instance of SUNW.gds, since 95 percent of the code is already encapsulated in the GDS. However, the builder does provide scripts to ease the burden of calling the correct cluster commands to build your application as a GDS. It does not help you create a framework for having application wrappers read a configuration file (to replace the absence of customized extension properties). The agent builders have a code-generation phase (scdscreate on the command line, or the first action of the GUI) that lays down the skeleton code for you. For real new types, a full RTR file and all the methods and probes are created. You then customize the code and move to the packaging phase (scdsconfig on the command line), which creates a Solaris package for your agent.
If a resource group fails to start twice on the same particular node (failure of START methods of same or different resources in the group) within the pingpong interval (expressed in seconds), then the RGM will not consider that node as a candidate for the group failover. If one particular resource successfully does a scha_control -O GIVEOVER to take a group off of a particular node, and then the same resource tries to do another scha_control -O GIVEOVER on a different node to bring the group back to the original node, RGM will reject it within the pingpong interval.
Note The Pingpong_interval property is meant to prevent faulty start scripts or properties, faulty fault monitors, or problem applications from causing endless pingponging between nodes.
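For example, the interval can be inspected and raised with the standard commands (the group name here is illustrative; the exercises later in this module use the same commands):

```
# clrg show -p Pingpong_interval my-rg
# clrg set -p Pingpong_interval=3600 my-rg
```

The value is expressed in seconds, so this example prevents a second failover attempt back to a node that failed a start within the last hour.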
Describes what should happen to a resource group if this resource fails to start up (a START method fails). Should the group move to another node, or should it live without this resource? Describes what happens to a resource group if this resource fails to stop (a STOP method fails). Should the group be frozen pending manual clearing by the administrator of a STOP_FAILED flag, or should you reboot the node where the method failed? Puts restrictions on whether the RGM will allow the fault monitor for this resource to cause group failover through scha_control -O GIVEOVER. By setting this property, you can have the RGM categorically deny giveover requests from this resource's fault monitor, although fault monitors for other resources in the same group might still cause the whole group to fail over. Puts restrictions on whether this resource's fault monitor can cause the RGM to restart this resource. This is one reason why having the fault monitor delegate resource restarts to the RGM is preferred, rather than the fault monitor (or PMF) restarting resources directly. The restart restriction cannot be enforced if the restart occurs outside of RGM control.
Controlling RGM Behavior Through Properties

Table 3-2 describes how the values of the Failover_mode property work.

Table 3-2  Operation of the Failover_mode Values

NONE – Failure to start: other resources in the same resource group can still start (if non-dependent). Failure to stop: the STOP_FAILED flag is set on the resource. Fault monitor can cause the RGM to fail the resource group over: Yes. Fault monitor can cause the RGM to restart the resource: Yes.

SOFT – Failure to start: the whole resource group is switched to another node. Failure to stop: the STOP_FAILED flag is set on the resource. Fail over: Yes. Restart: Yes.

HARD – Failure to start: the whole resource group is switched to another node. Failure to stop: the node reboots. Fail over: Yes. Restart: Yes.

RESTART_ONLY – Failure to start: other resources in the same resource group can still start (if non-dependent). Failure to stop: the STOP_FAILED flag is set on the resource. Fail over: No. Restart: Yes.

LOG_ONLY – Failure to start: other resources in the same resource group can still start (if non-dependent). Failure to stop: the STOP_FAILED flag is set on the resource. Fail over: No. Restart: No.
RESTART_ONLY and LOG_ONLY are new with Sun Cluster 3.1 Update 3 (9/04). Note that they are the same as NONE concerning START and STOP failures; the difference is that they put restrictions on what the RGM will do on behalf of the fault monitor (with either value set, the resource cannot be the cause of resource group failover).
If the STOP_FAILED flag is set, it must be manually cleared using the clrs clear command before the service can start again.
# clrs clear -f STOP_FAILED -n nodename resname
Resource Dependencies
Resource dependency properties form a special subset of the standard resource properties. The RGM enforces four kinds of resource dependencies: regular, weak, restart, and offline-restart.
The dependency is indicated by defining Resource_dependencies=ResB as a property of resource A. Resource B must be added first. Resource B must be started first. Without the dependency, the RGM might have been able to start them in parallel. Resource A must be stopped first. Without the dependency, the RGM might have been able to stop them in parallel. Resource A must be deleted first. The rgmd daemon will not try to start resource A if resource B fails to go online, and will not try to stop resource B if resource A fails to be stopped.
Note Starting in Sun Cluster 3.2 you are allowed to explicitly disable resource B (the dependee) even when resource A is still online.
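Using the property syntax above, a regular dependency of an existing resource A on resource B (names illustrative) is declared with:

```
# clrs set -p Resource_dependencies=ResB ResA
```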
Offline-Restart Dependencies
These dependencies are new to Sun Cluster 3.2 and are similar to restart dependencies. With the new offline-restart dependencies, when resource B is detected to be offline, or is put offline explicitly, resource A is restarted. But the actual start part of the restart for resource A will block until resource B actually goes online again. This can lead to more accurate semantics for the relationship between A and B. Regular restart dependencies do not do anything with the dependent (A) even if the RGM is aware that the dependee (B) is offline. However, if you know that A cannot operate properly without B, the offline-restart behavior may be more correct.
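The offline-restart variant uses its own property, as seen in the exercises later in this module (resource names illustrative):

```
# clrs set -p Resource_dependencies_offline_restart=ResB ResA
```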
Cross-Group Dependencies
Starting in Sun Cluster 3.1 9/04 (Update 3), any of the types of dependencies can be between resources in the same resource group or in different resource groups. If the resources are in different groups, the dependency does not imply any preference about whether the groups run on the same or different nodes (that is controlled by RG_affinities, presented later in this module). There are additional side effects of cross-group dependencies:
You will not be allowed to put a group containing the dependee (resource B) offline (with clrg offline) if the dependent (resource A) is online in its group. You are allowed to switch the group containing the dependee to a different node (clrg switch) while the dependent stays online. But if it is a restart dependency or offline-restart dependency, the RGM will also restart the dependent while leaving the rest of its group alone. This could daisy-chain. Assuming all resources are enabled, if you start the dependent's group before that of the dependee, the dependent's group will be in a Pending Online state with the dependent resource not started until the dependency can be satisfied.
Advanced Resource Group Relationships

The following will be affected by a weak positive affinity:
Failover of the source group – If the target is online, when the source group needs to fail over, it will fail over to the node running the target group, even if that node is not a preferred node on the source group's node list.
Putting the resource group online onto a non-specified node:
# clrg online source-grp
Similarly, when a source group goes online and you do not specify a specific node, it will go onto the same node as the target, even if that node is not a preferred node on the source group's node list.
Weak negative affinities affect the exact same scenarios, with the source group preferring to fail over or go online on a node not currently running the target group. However, weak affinities are not enforced when you manually bring or switch a group onto a specific node. The following command will succeed, even if the source group has a weak affinity for a target running on a different node.
# clrg switch -n specific-node src-grp
There can be multiple resource groups as the value of the property. In other words, a source can have more than one target. In addition, a source can have both weak positive and weak negative affinities. In these cases, the source prefers to choose a node satisfying the greatest possible number of weak affinities. For example, it will select a node that satisfies two weak positive affinities and two weak negative affinities rather than a node that satisfies three weak positive affinities and zero weak negative affinities.
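Weak affinities are declared with a single + or - prefix on the target group name, the same syntax used in the exercises later in this module (group names illustrative):

```
# clrg set -p RG_affinities=+target-rg source-rg
# clrg set -p RG_affinities=-target-rg source-rg
```

The first command declares a weak positive affinity for target-rg, the second a weak negative affinity.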
The only node or nodes on which the source can be online are nodes on which the target is online.
If the source and target are currently running on one node, and you switch the target to another node, it will drag the source with it. If you offline or online the target group, it will offline or online the source as well. An attempt to switch the source to a node where the target is not running will fail. If a resource in the source group fails, the source group still cannot fail over to a node where the target is not running. (The solution to this is discussed in the next section.)
The source and target are closely tied together. If you have two failover resource groups with a strong positive affinity relationship, it might make sense to make them one group. So why does strong positive affinity exist?
The relationship can be between a failover group (source) and a scalable group (target). That is, you are saying the failover group must run on some node already running the scalable group. You might want to be able to offline the source group but leave the target group running, which strong positive affinity allows. Some resources may reject being put in the same group, but you still want them all running on the same node or nodes.
The only difference between the +++ and the ++ is that with +++, if a resource in the source group fails and its fault monitor suggests a failover, the failover can succeed. The RGM will move the target group over to where the source wants to fail over to, and then the source gets dragged correctly.
Will these two groups run on the same node or different nodes? RG1 has a weak positive affinity for RG2. Normally RG1 will be running on a different node from RG2. If RG1 needs to fail over, it will fail over to the node where RG2 is running. Because RG2 has a strong negative affinity for RG1, it will have to move either to a third node, or to the node where RG1 was running previously (if the node is still alive). If there are no more nodes remaining, RG2 would have to go offline completely.
The relationship is illustrated in Figure 3-7:
Figure 3-7  Combining a weak positive affinity (RG1 for RG2) with a strong negative affinity (RG2 for RG1); RG2 is shown running on Node 2.
What kind of application needs this type of affinity? This affinity is used in the SAP Web Application data service. In the example, RG1 contains a master application. RG2 contains a memory state replica application, intended to run only on a different node from RG1's node, and intended to replicate memory state on behalf of the master application. The memory state replica application provides memory state so that if RG1 fails over, it will fail over to the node RG2 was on and get its memory state preserved. At that point, RG2 no longer needs to run on that same node; it needs to move to another node. If there is only one node left, there is no need for RG2 (the memory replicator), since its purpose is to run on a different node.
Multimaster applications run on more than one node at a time but do not make use of the internal load balancing features provided by the Sun Cluster software. Scalable applications specifically make use of the internal load balancing features provided by the SharedAddress resource. This is the type represented by a resource that has the value of the Scalable resource property set to TRUE.
LB_STICKY – Connections from the same client IP all go to the same node. Load balancing is only for different clients. This is only for ports listed in the Port_list property.
LB_STICKY_WILD – Connections from the same client to any server port go to the same node. This is good when port numbers are generated dynamically and not known in advance.
The default is no client affinity (the default value of the Load_balancing_policy property is LB_WEIGHTED).
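As a sketch (the resource and group names are illustrative, and the exact set of required properties varies by resource type), a sticky policy is chosen when the scalable resource is created:

```
# clrs create -t SUNW.apache -g web-rg \
-p Scalable=TRUE -p Port_list=80/tcp \
-p Load_balancing_policy=LB_STICKY \
-p Resource_dependencies=sa-res apache-res
```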
Multimaster and Scalable Applications

For example, if you set Affinity_timeout to 900 (15 minutes, a reasonable value for a shopping cart application), a client would lose the affinity 15 minutes after the last connection closes. This includes time that the connection is in a TIME_WAIT state. Setting an affinity timeout can prevent wasting all your memory if you really had millions and millions of possible clients. You can set Affinity_timeout to -1, or infinite, so that affinity is never lost, unless the node to which a client is mapped goes down or its application becomes disabled. The default value for Affinity_timeout is 0. Does this mean that affinity still exists? As long as the client makes a continuous series of connections and never reaches a state of all connections closed, it would still have affinity. If some connections were still in the TIME_WAIT state (this would give you an extra 60 seconds on later Solaris 9 OS versions and on Solaris 10), you would keep your affinity, but once the last connection closed, affinity would be gone. The default value of 0 might be appropriate for some applications. Consider something like a scalable FTP application, where a new data connection must be connected to the same node as an existing control connection.
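For the shopping cart example above, the timeout would be set on the scalable resource (the resource name is illustrative):

```
# clrs set -p Affinity_timeout=900 web-res
```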
Task 1 Create a wrapper script for your application
Task 2 Create a new resource type
Task 3 Install the newly created resource type
Task 4 Register the new resource type
Task 5 Instantiate a resource of the new resource type
Task 6 Put the resource group containing the new resource online
Task 7 Test the fault monitor for the new resource type
Preparation
There is no special preparation required for the following tasks.
Exercise 1: Creating Sun Cluster Software Data Services

2. Create a wrapper script for the xclock binary. This is the application for which you create a new resource type. Be careful to use the proper quote characters. On all cluster nodes, type the following (or use a model file provided by the instructor):
# vi /usr/openwin/bin/myxclock
#!/bin/ksh
# this script will get called as follows: myxclock $RESOURCE_NAME
# because you will subsequently modify the START method to do so
DISP_HOST=$(/usr/cluster/bin/scha_resource_get \
-O EXTENSION -R $1 Display_host | sed -n '$p')
CLOCKTYPE=$(/usr/cluster/bin/scha_resource_get \
-O EXTENSION -R $1 Clock_type | sed -n '$p')
/usr/openwin/bin/xclock -$CLOCKTYPE -display $DISP_HOST \
-title "xclock on $(uname -n)"

3. Make the wrapper script executable:
# chmod a+x /usr/openwin/bin/myxclock
Vendor Name – TEST
Application Name – xclock
RT Version – 1.0
Working Directory – /var/tmp
Scalable/Failover (radio button) – Failover
Network Aware check box – Deselect
Type of generated source – ksh
3. Click Create.
4. Click OK in the Success! dialog box.
5. Click Next.
6. Fill out the form as follows:
Start Command – /usr/openwin/bin/myxclock $RESOURCE_NAME
Stop Command – Leave this blank
Validate Command – Leave this blank
Probe Command – /usr/cluster/bin/pmfadm -q $PMF_TAG
7. Click Configure.
8. Click OK in the Success! dialog box.
9. Do not exit the builder application; go to a command-line window on the same node on which you are running the builder.
10. Modify the RTR file to add an extension property. The scdsbuilder executable left a section for you at the bottom of the RTR file. If you use the instructor's model file, make sure to insert its lines in the appropriate place in the TEST.xclock file.
# cd /var/tmp/TESTxclock/etc
# vi TEST.xclock

# User added code -- BEGIN vvvvvvvvvvvvvvv
{
PROPERTY = Display_host;
EXTENSION;
STRING;
DEFAULT = "";
TUNABLE = WHEN_DISABLED;
DESCRIPTION = "Display (host:#) on which to display xclock GUI";
}
{
PROPERTY = Clock_type;
EXTENSION;
Enum {digital, analog};
Default = analog;
TUNABLE = WHEN_DISABLED;
DESCRIPTION = "Type of clock (analog or digital)";
}
# User added code -- END ^^^^^^^^^^^^^^^

11. Back in the builder window, again click Configure.
12. Acknowledge the Success! dialog box, and exit the builder application.
1. Create an empty resource group:
# clrg create -n node1,node2 xclock-rg
2. Create an instance of the TEST.xclock resource type:
# clrs create -t TEST.xclock \
-g xclock-rg -p Clock_type=analog \
-p Thorough_probe_interval=10 \
-p Display_host=display_station:display# xclock-res
Task 6 Putting the Resource Group Containing the New Resource Online
From one cluster node, run the following commands:
# clrg online -M xclock-rg
# clrg status
# clrs status
Task 7 Testing the Fault Monitor for the New Resource Type
Perform the following steps:
1. Close the xclock window, and verify that it is restarted.
2. What are the values that control how many restarts can happen on the same node within a certain interval before a failover occurs?
# clrs show -p Retry_count -p Retry_interval xclock-res
3. What happens if you close the window three times within the 370-second interval? Do it (the three times includes the time in step 1, if it is still within the interval).
4. Now that your application has failed over to the next node, close the application three consecutive times on that node.
5. Why will your application not fail back to the original node? What kind of messages are you seeing on the node consoles or at the bottom of the /var/adm/messages file?
6. On either node, print out the current value of the Pingpong_interval, and change it to a lower value:
# clrg show -p Pingpong_interval xclock-rg
# clrg set -p Pingpong_interval=60 xclock-rg
7. Verify that the application can now fail back to its first node.
Task 1 Make a version of an application wrapper script suitable for GDS
Task 2 Create a variation of the data service using GDS
Task 3 Verify restart and failover behavior of the new resource
Note Every real-world application that is based on GDS has to be configured by some custom-built file such as this rather than by extension properties.
Verify in the console message that the fault probe is unable to restart the resource, and that it fails over to the other node.
Repeat steps 3 and 4 TWICE on the new node, rerunning the clrg status command each time until you verify that the resource is Online.
9. Close the GDS a third time on the new node. Verify that the application cannot fail over because of the Pingpong_interval.
10. Note the slightly different behavior of this resource (which conforms more to the DSDL standards) than the one from the previous exercise. The resource should be restarted.
11. On either node, print out the previous value of the Pingpong_interval, and change it to a lower value:
# clrg show -p Pingpong_interval xclockgds-rg
# clrg set -p Pingpong_interval=60 xclockgds-rg
12. Did you prefer the GDS version of the application to the other version? Is it less work to create? Is it fully functional? There are no correct answers to these questions.
Task 1 Investigate cross-group dependencies and restart dependencies
Task 2 Investigate resource group affinities
Task 3 Modify a failover service's Failover_mode property
Now try to disable the dependee (you will see a restart issued for the dependent GDS resource, but its start is blocked):
# clrs disable xclock-res
# clrg status xclock-rg xclockgds-rg
7. Restart the dependee and verify that the dependent is now unblocked as well:
# clrs enable xclock-res
# clrg status xclock-rg xclockgds-rg
# clrs status xclock-res xclockgds-res
# clrg offline xclock-rg
# clrg offline xclockgds-rg
# clrs set -p Resource_dependencies_offline_restart="" xclockgds-res
2. Place a weak negative affinity for each group on the other:
# clrg set -p RG_affinities=-xclockgds-rg xclock-rg
# clrg set -p RG_affinities=-xclock-rg xclockgds-rg
3. Bring the groups online without specifying a particular node. Do they end up on the same or different nodes?
# clrg online xclockgds-rg
# clrg online xclock-rg
4. Switch the groups so that they are on the same node. Why is this allowed?
# clrg switch -n othernode xclockgds-rg
5. Try to set a strong negative affinity. What message do you get?
6. Switch one of your groups and set the strong negative affinity. Note that you are making the non-GDS resource group the target of the affinity relationship. That group can switch, but the source group will always move out of its way.
# clrg set -p RG_affinities=--xclock-rg xclockgds-rg
Exercise 3: Advanced Resource and Resource Group Control

7. Try to switch the source group. Can you do it?
# clrg switch -n othernode xclockgds-rg
8. What happens when you switch the target group?
# clrg switch -n othernode xclock-rg
9. Take both groups offline, set a strong positive affinity relationship, and bring them online:
# clrg offline xclock-rg
# clrg offline xclockgds-rg
# clrg set -p RG_affinities="" xclock-rg
# clrg set -p RG_affinities=++xclock-rg xclockgds-rg
10. Bring them online. Try to bring the source group online first and see what happens. Then bring the target group online and it should drag the source online with it.
# clrg online xclockgds-rg
# clrg online xclock-rg
11. Try to switch the source and then the target. How do they behave?
# clrg switch -n othernode xclockgds-rg
# clrg switch -n othernode xclock-rg
12. Try to make the source fail over (by closing the GDS xclock three times). Wait each time until clrs status xclockgds-res reports that the resource is Online again. Why does it not fail over?
13. Modify the affinity so that it can perform failover delegation:
# clrg set -p RG_affinities=+++xclock-rg xclockgds-rg
14. Repeat step 12 and observe the results. You may see results after killing it fewer than three times because the RGM is still counting failures within the interval.
15. Can you set a strong positive affinity between a scalable group and a failover group? Try it.
# clrg set -p RG_affinities=++xclock-rg iws-rg
Replace the STOP method for the ora-listener-res resource on the node determined in Step 1:
# mv /opt/SUNWscor/oracle_listener/bin/oracle_listener_stop /var/tmp
# vi /opt/SUNWscor/oracle_listener/bin/oracle_listener_stop
#!/bin/ksh
exit 1
# chmod a+x /opt/SUNWscor/oracle_listener/bin/oracle_listener_stop
3. Modify the Failover_mode property from one node (ignore validation errors on the node not mounting the failover file system, as usual).
# clrs set -p Failover_mode=HARD ora-listener-res
4. Put the ora-rg resource group offline from one node:
# clrg offline ora-rg
Note The host running the ora-rg resource group reboots as a result of the failure of the STOP method and the value of the Failover_mode property.

5. After the node reboots, restore the STOP method for the oracle_listener resource on that node.
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
Module 4
Add a new node to a running cluster
Remove a node from a cluster
Replace a failed node in a cluster
Uninstall the Sun Cluster 3.2 software from a node
Replace failed disks
Back up and restore the Cluster Configuration Repository (CCR)
Note If you have a third node to add to the cluster as part of the exercises for this module, you will need to make sure it is upgraded to Solaris 10 by running the ./installES445 script and choosing option 3, as noted in the preface.
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics presented in this module. While they are not expected to know the answers to these questions, the answers should be of interest to them and inspire them to learn the material presented in this module.
Discussion – The following questions are relevant to understanding the content of this module:
What is required to get existing cluster services to run on a new node in the cluster?
When performing a fresh installation on a failed node, how do you get the existing nodes to accept it back into the cluster?
What is the difference between replacing hardware Redundant Array of Independent Disks (RAID) disks and Just a Bunch Of Disks (JBOD) disks?
Additional Resources
The following references provide additional information on the topics described in this module:
Sun Microsystems, Inc. Sun Cluster Software Installation Guide For Solaris OS, part number 819-2970. Sun Microsystems, Inc. Sun Cluster Software Administration Guide For Solaris OS, part number 819-2971.
10. Configure quorum devices properly to take the new node into consideration.
11. Configure volume management on the new node, if needed.
12. Add the new node to existing volume manager device groups.
13. Verify or configure IP multipathing (IPMP) on the new node.
14. Prepare the new node to run existing applications.
15. Add the new node to existing resource groups.
# clintr disable vincent:hme0,theo:hme0
# clintr remove vincent:hme0,theo:hme0
2. Add a transport switch to the configuration:
# clintr add sw1
3. Add a transport cable connecting node 1 to the transport switch:
# clintr add vincent:hme0,sw1
4. Add a transport cable connecting node 2 to the transport switch:
# clintr add theo:hme0,sw1
Modify the other transports in a similar way.
All cluster transport switches
Public network hubs or switches
Data storage devices or switches, as appropriate
Run the devfsadm utility, or reboot the new node with the boot -r command, to make sure it recognizes any new storage connections.
Set up IPMP on the public network connections. If you do not set up IPMP on your public network, the following occurs:
Sun Cluster 3.1 8/05 (Update 4) and later – When you use the scinstall command to configure the node, it builds an IPMP group sc_ipmpN for every adapter for which there is a non-IPMP /etc/hostname.xxx file.
Sun Cluster 3.1 Updates 1-3 – A singleton IPMP group sc_ipmpN will be added if you were to create a brand new LogicalHostname or SharedAddress resource. But in order to modify an existing resource, you would need to configure IPMP by hand.
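As a sketch of configuring a minimal single-adapter IPMP group by hand (the adapter name qfe0 and the host name are examples, and the group name follows the sc_ipmpN convention mentioned above):

```
# cat /etc/hostname.qfe0
node1 netmask + broadcast + group sc_ipmp0 up
```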
You must modify this list to include your new node. Authentication is almost always configured with the value sys. A reverse IP lookup is performed on a node trying to join the cluster, and the name is matched with one on the list. Use the claccess utility on an existing node in the cluster to allow the new node to join as follows:
# claccess allow -h noodle
# claccess show
=== Host Access Control ===
Cluster name:                  orangecat
Allowed hosts:                 noodle
Authentication Protocol:       sys
Note: If the vxio major number used by the existing nodes conflicts with a different major number on that new node, change the conflicting entry on the new node to a higher, unused number, and do a reconfiguration reboot (or wait until the one that occurs automatically at the end of scinstall). If you are using Solaris VM software in your cluster, the major number for the md driver is already in the /etc/name_to_major file as a result of the Solaris OS installation. This is a reserved number, so it will always match on old and new nodes.
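The major-number check this note describes can be scripted. The sketch below runs against a throwaway sample file rather than the real /etc/name_to_major, and the driver names and numbers in it are invented for illustration; only the comparison logic is the point.

```shell
# Work on a scratch copy, never the live /etc/name_to_major.
ntm=$(mktemp)
printf 'md 85\nvxio 270\nfoo 271\n' > "$ntm"   # 'foo 271' is a made-up clash

want=271                                   # vxio major on the existing nodes
have=$(awk '$1 == "vxio" {print $2}' "$ntm")
if [ "$have" != "$want" ]; then
  # Which driver already holds the wanted number?
  clash=$(awk -v n="$want" '$2 == n {print $1}' "$ntm")
  # Pick a number higher than any existing entry for the clashing driver.
  free=$(( $(awk '$2 > m {m = $2} END {print m}' "$ntm") + 1 ))
  echo "move $clash to $free, then set vxio to $want"
fi
```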
- Sun Cluster 3.2 and Sun Cluster 3.1 8/05 (Update 4): You are required to use the Java ES installer to install the packages (or install a Solaris Flash archive from a node where the Java ES installer had been used to install the packages but scinstall had not yet been run).
- Sun Cluster 3.1 9/04 (Update 3): You may use the installer or a Flash image (from an unconfigured system) to install the cluster packages, or you may wait and let scinstall install them for you. If you let scinstall install the packages, you must first install the Java Web Console, if it is not already installed (it is already part of the base OS in Solaris 10 OS).
- All earlier releases: You can use an installer to install the packages, or just wait and let scinstall install them. These releases do not use or require the Sun Java Web Console.
Note: If the did major number used by the existing nodes conflicts with a different major number on the new node, change the conflicting entry on the new node to a higher, unused number, and make the did major number match the other nodes. scinstall will conclude with the desired reconfiguration reboot.
- Select the option to add this machine as a node in an established cluster.
- You can use the Typical install if you have the standard placeholder /globaldevices file system.
- Any existing node can be the sponsor node.
- Use autodiscovery for the cluster transport if you use Ethernet. Autodiscovery reliably probes and discovers Ethernet transport adapters.
Note: Solaris VM allows only two hosts to be mediators even if three or more hosts are attached to the diskset, so if you are adding a third attached node you will not be adding it as a mediator.
For the VxVM software, use the cldg command to add the node to a device group. You can use a wildcard (+) here if it is appropriate:
# cldg add-node -n new-nodename dgname
You can also use the clsetup utility to perform this procedure.
Configuring IPMP
IPMP must be configured on the public network interfaces on the new node before you can add the node to any LogicalHostname or SharedAddress resources. IPMP can be set up before the new node is added to the cluster, either manually or as part of a Solaris JumpStart software installation. Verify the network configuration with the following command:
# ifconfig -a
If the public network interfaces are not in IPMP groups, set up /etc/hostname.xxx files on the new node and reboot it.
- Install any application software on the new node that you installed on the local disks of other nodes. Installing applications on local disks instead of global file systems allows rolling maintenance of the software, although it is more time-consuming.
- Create any necessary local files and directories, even for software installed in the global file systems. For scalable services, log files must be on the local storage.
- Modify the /etc/passwd, /etc/shadow, /etc/group, /etc/system, and /etc/project files with any application-specific changes that you already made on the other nodes.
- Add any lines to the /etc/vfstab file that reference global file systems.
- Add any lines to the /etc/vfstab file for failover file systems that the application accesses. These require that the node be physically connected to the storage and appear in the Nodelist property for the device group.
- Create any logical host (or shared address) IP address entries in /etc/hosts (or /etc/inet/ipnodes for IPv6).
- Install any data service agents that are needed on the new node.
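A quick sanity check over those files can catch a missing entry before you first try to run the application on the new node. The sketch below checks sample copies built on the spot (the metadevice path and addresses are placeholders; the host and mount-point names come from this chapter's examples), never the live /etc files.

```shell
# Build sample stand-ins for /etc/hosts and /etc/vfstab.
hosts=$(mktemp); vfstab=$(mktemp)
printf '192.168.1.50 ora-lh\n' > "$hosts"
printf '/dev/md/webds/dsk/d100 /dev/md/webds/rdsk/d100 /global/web ufs 2 yes global\n' > "$vfstab"

missing=0
# The logical host must resolve locally.
grep -qw ora-lh "$hosts" || missing=$((missing + 1))
# The global file system line must have mount-at-boot "yes" and option "global".
awk '$3 == "/global/web" && $6 == "yes" && $7 == "global"' "$vfstab" | grep -q . \
  || missing=$((missing + 1))
echo "missing entries: $missing"
```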
# clrs set -p netiflist=sc_ipmp0@1,sc_ipmp0@2,sc_ipmp0@3 ora-lh
# clrs set -p netiflist=sc_ipmp0@1,sc_ipmp0@2,sc_ipmp0@3 iws-lh
Now you can modify the Nodelist property for the existing resource groups to include the new node. In the new CLI, clrg add-node adds a new node to the node list while keeping the existing nodes intact. For scalable applications you must modify the Nodelist property for the failover group containing the SharedAddress resource before modifying the Nodelist property for the scalable application group. You could do them with the same command, or with the wildcard.
# clrg add-node -n noodle ora-rg
# clrg add-node -n noodle lb-rg
# clrg add-node -n noodle iws-rg
If you have a scalable resource group, change the Desired_primaries and Maximum_primaries properties of the resource group, assuming you want the group to run on all nodes at the same time:
# clrg set -p Desired_primaries=3 -p Maximum_primaries=3 iws-rg
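Building the new netiflist value is plain string concatenation on the existing list. A sketch; the sc_ipmp0 group name and node ID 3 are taken from the example above, and the command is printed rather than executed:

```shell
# Append the new node's IPMP group to the existing netiflist value.
existing="sc_ipmp0@1,sc_ipmp0@2"   # value currently on the resource
newgroup="sc_ipmp0"                # IPMP group name on the new node
newid=3                            # cluster node ID of the new node
netiflist="$existing,$newgroup@$newid"
echo "clrs set -p netiflist=$netiflist ora-lh"
```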
Now you can use clrg switch or clrg online to run your applications on the new node.
- Orderly removal of a node (includes deconfiguration of the cluster framework on that node)
- Removal of a dead node from the cluster configuration
The only interruption to service would be the switching of services off a live node that you are removing in an orderly fashion. These procedures have been greatly simplified in Sun Cluster 3.2. The following steps are required to remove a node:
1. Switch any services and device groups off the node, if it is still alive.
2. Remove the node from the Nodelist property of resource groups.
3. Remove the node from the Nodelist property of the VxVM and Solaris VM device groups.
4. Halt the node to be removed (orderly scenario) and boot it to noncluster mode.
5. Disable node quorum votes and remove attached quorum devices.
6. Remove or clear the node from the cluster, depending on the state of the node being removed.
   (A) Orderly Removal Scenario:
   a. Allow removal access for the node to be removed.
   b. On the node to be removed, comment out any global file systems from /etc/vfstab (you can leave /global/.devices/node@#).
   c. Run clnode remove on the node being removed. This automates all other steps of the node removal, including deconfiguration of the cluster framework on that node.
   (B) Dead Node Removal Scenario:
   a. Run clnode clear on a remaining cluster node.
7. Add back any quorum devices, as appropriate.
Note: This will also evacuate any services from non-global zones running on the specified node.
Note: If any resource group were configured in a non-global zone, its Nodelist property would not be modified by the above command. You would have to explicitly remove the non-global zone from the Nodelist. For example:
# clrg remove-node -n noodle:myzone myrg
Note The Desired_primaries and Maximum_primaries properties are automatically reduced for scalable groups.
Note The -f is required if the node being removed is already dead. At the time of writing there is an outstanding bug that these take an excruciatingly long time (more than 5 minutes) if the node being removed is already dead. The BugID is 6507093.
Note: The -F option overrides any objection that the node may have to the fact that a local tape drive is still listed in its own local copy of the CCR as a device group.
- Complete failure of the node hardware itself
- Failure or accidental corruption of the root disk or root file system
It makes no difference whether you need to install a new OS on a new replacement node, or on the same node that you used to have. Both are considered complete node replacements. The following subsections discuss two possible replacement procedures. Note that the first is more general (the replacement node could even have different disk controllers and network adapters).
Removing the Node Definition From the Cluster and Adding It Back
The most reliable solution to replacing a failed node in a cluster is to completely remove the node's definition from the cluster, and then add it back as a brand new node. This allows the new node to have different transport adapters, storage attachments, and even a different name (or any or all of these could be the same as before). Although this solution involves going through all the procedures mentioned earlier in the chapter to remove a node definition and then add it back, it is likely to be the fastest way to restore a cluster node. The alternative, restoring archives, takes much longer. To make Solaris and Sun Cluster package installation faster, you can install a Flash archive that was created on a node that had the Sun Cluster packages installed but had not been configured with scinstall.
Note: You might think you could create a Flash archive from a node already configured into the cluster, and use it to restore a node without having to remove its definition from the cluster and add it back. The problem is that standard Flash installation post-processing is inconsistent with the ability to boot right back into the cluster. While with some clever installation post-processing it was possible to make this work in Solaris 8 OS and Solaris 9 OS, it has never been supported, and this author has not been able to make it work in Solaris 10 OS.
- Individual disk drive failures in hardware RAID rather than software RAID
- Entire array (or cable to array) failures
- DID consistency issues
- Physical disk IDs for SCSI-JBOD and Fibre-JBOD disks
- Updating physical disk IDs in the DID database (in the CCR)
- VxVM software procedures for fixing broken mirrors
- Solaris VM software procedures for fixing broken mirrors
- Special issues for replacing a failed drive that is used as a quorum device
- The c#t#d# value from each node: If this changes, then you get a different DID number for the replacement.
- A disk serial number or worldwide name (WWN): The fact that the DID database contains a specific physical serial number or WWN from a disk is a tricky problem. In the cluster environment, this information exists and must be kept consistent across three different places:
  - The physical disk itself
  - The disk device driver (in each individual node's RAM)
  - The DID database portion of the CCR
Reviewing Disk Replacement Procedures

The illustration in Figure 4-1 shows the relationship among the different places that store a physical Disk ID:
Figure 4-1: Places that store the physical Disk ID. The ID lives on the disk itself; in each node's device driver (RAM), which is updated at boot, by cfgadm (SCSI JBOD disks), or by luxadm/devfsadm/devfsadmd (Fibre JBOD disks); and in the DID database in the CCR, which is synchronized with cldev repair.
The device driver in RAM on any particular node is automatically updated from the information that is physically present on the disk during a boot or reboot.
3. Switch the device group off the node whose disk you are replacing.
   Note: If the device group contains any failover file systems, you must switch the associated resource group (that is, the application) and let it drag over the device group. If there are no failover file systems, you can just switch the device group.
4. If you are using VxVM, temporarily remove the disk from VxVM control:
   # vxdisk offline c#t#d#
   # vxdisk rm c#t#d#
   # vxdmpadm -f disable path=c#t#d#s2
5. Use cfgadm to unconfigure the disk:
   # cfgadm -c unconfigure c#::dsk/c#t#d#
6. Physically replace the disk with a new disk (you can do this step anytime up until now, as long as the new disk is in place before the next step). If you are repeating these steps (as required) on a second node attached to the disk, the new disk will already be in place and you do not need to touch the hardware.
7. Use cfgadm to read the new physical disk information and configure it into device driver RAM:
   # cfgadm -c configure c#::dsk/c#t#d#
8. If you are using VxVM, the DMP device driver is automatically updated about the new disk. Type the following so that VxVM recognizes the new disk:
   # vxdisk scandisks
Note: The vxdisk scandisks option exists in VxVM version 4.0 and above. Prior to VxVM 4.0, you had to run vxdctl enable, whose documented purpose was to enable the configuration daemon but which had the side effect of scanning all the disks.
9. Repeat steps 3-8 on remaining nodes also connected to the disk. If you are using Solaris VM, in step 3, you must switch the device group or resource group off of that node (back to the first node, if there are only two attached nodes).
Note While this is the documented procedure for Fibre JBOD disk replacement in the cluster, in Solaris 9 OS and Solaris 10 OS the devfsadmd daemon on each node will automatically detect a new Fibre JBOD disk insertion and perform the equivalent of the above commands automatically for you.
Updating the DID Serial Number Information From the Device Driver RAM
The following procedure assumes that you physically replaced a JBOD drive. Because it has the same c#t#d# value as the drive it is replacing, the DID device number is the same. However, you still have to update the physical disk information stored in the DID database (in the CCR) for each DID. Proceed as follows:
1. Make sure you know which DID number you are working with. Use the cldev utility to map an individual c#t#d# value to a DID number, or the DID number to an individual c#t#d# value:
   # cldev list -v d13
   DID Device     Full Device Path
   ----------     ----------------
   d13            theo:/dev/rdsk/c2t1d0
   d13            vincent:/dev/rdsk/c2t1d0
2. From any node physically connected to the disk, print out the previous physical disk information stored in the CCR (for comparison purposes, so you can make sure your update succeeds):
# cldev show -v d13
=== DID Device Instances ===
DID Device Name:      /dev/did/rdsk/d13
Full Device Path:     theo:/dev/rdsk/c2t1d0
Full Device Path:     vincent:/dev/rdsk/c2t1d0
Replication:          none
default_fencing:      global
Disk ID:              46554a49545355204d4146333336344c2053554e33364720303036373731303320202020
Ascii Disk ID:        FUJITSU MAF3364L SUN36G 00677103
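When scripting this check, the DID-to-path mapping can be pulled out of saved `cldev list -v` output with awk. A sketch over the sample data rows above; the parsing, not any new command, is all that is being illustrated:

```shell
# Saved `cldev list -v` output (data rows only, copied from the example).
out='d13 theo:/dev/rdsk/c2t1d0
d13 vincent:/dev/rdsk/c2t1d0'

# Map a c#t#d# value back to its DID instance (first matching row).
did=$(echo "$out" | awk '/c2t1d0/ {print $1; exit}')
echo "DID for c2t1d0 is $did"
```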
3. From any node physically connected to the disk, use the cldev repair utility to synchronize the DID database with the new physical disk information from device driver RAM:
# cldev repair d13
# cldev show -v d13
=== DID Device Instances ===
DID Device Name:      /dev/did/rdsk/d13
Full Device Path:     vincent:/dev/rdsk/c2t1d0
Full Device Path:     theo:/dev/rdsk/c2t1d0
Replication:          none
default_fencing:      global
Disk ID:              46554a49545355204d4146333336344c2053554e33364720303036383532303920202020
Ascii Disk ID:        FUJITSU MAF3364L SUN36G 00685209
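A script can confirm the repair succeeded by extracting the serial number from the Ascii Disk ID line of the `cldev show` output taken before and after; if the two differ, the CCR was updated. A sketch using the sample lines above:

```shell
# Last field of the 'Ascii Disk ID' line is the drive serial number.
get_serial() { echo "$1" | awk '/Ascii Disk ID/ {print $NF}'; }

before='Ascii Disk ID: FUJITSU MAF3364L SUN36G 00677103'
after='Ascii Disk ID: FUJITSU MAF3364L SUN36G 00685209'

old=$(get_serial "$before")
new=$(get_serial "$after")
if [ "$old" != "$new" ]; then
  echo "repair took effect: $old -> $new"
fi
```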
A caution to this procedure is that mirrors involving broken disks might already be repaired by the Volume Manager software's vxrelocd process. If this has occurred, the Volume Manager does not consider these volumes broken anymore, even if they are now mirrored inside the same controller. You might need to manually examine every mirror carefully, deleting fixed plexes that are in the same storage array and remirroring onto your fixed disk either manually or with the vxunreloc command.
Note: When metadb replicas are added automatically to new diskset disks in Sun Cluster 3.2, they get the name /dev/did/rdsk/d# (a reverse-lookup problem; the replica really is on s7, but for non-EFI disks d# has the same major/minor number as d#s7). You need to delete using the same name that shows up in metadb -s setname -i. If you upgraded from Sun Cluster 3.1, the name will still include the s7.
3. If you use soft partitions on top of Solaris OS partitions, rewrite soft partition information to the new drive:
   # metarecover -s orads /dev/did/rdsk/d#s0 -p
   It is often easiest to mirror regular partitions first, and then make soft partitions on top of the mirrors. In this case no recovery is required for your soft partitions because, while the mirror is degraded, it remains usable.
4. Run the metastat -s dsname command to identify any broken mirrors. This command indicates where to run the metareplace command to fix mirrors. If mirrors are already fixed by hot spares,
you can still run the same metareplace command to put the fixed disk back into work and the spare back into the spare pool. Use the following command for each disk:
# metareplace -s nfsds -e d#_for_mirror fixed_component
In this command, the fixed_component argument is the soft partition device if you use soft partitions, or the /dev/did/rdsk/d#s# value if you do not use soft partitions.
Viewing an Example
This example shows how to fix a Sun StorEdge A5200 array disk failure with the VxVM software:
1. Identify the failed drive. This is often easiest to do through the volume management software:
   # vxdisk list
   # vxprint -g dgname
2. On the node hosting the device group to which the disk belongs, replace the failed disk. Let the luxadm utility guide you through removing the failed disk and inserting a new disk:
   # luxadm remove enclosure,position
   # luxadm insert enclosure,position
3. On the other node, run the devfsadm command.
Note: In Solaris 9 OS and Solaris 10 OS the devfsadmd utility detects physical disk insertion and automatically performs steps 2 and 3 for you.
4. On either node, reconfigure the DID information as follows:
   # cldev list -v c#t#d#
   # cldev repair d#
5. On all nodes attached to the device group, type the following:
   # vxdisk scandisks
6. On the node hosting the device group, type the following:
   # vxdiskadm
   a. You may need to use option 22, Change/Display the default disk layouts. Starting in VxVM 4.0, the built-in default is CDS disks. If you are repairing a disk that is in a non-CDS group, you need to change the default layout used by vxdiskadm.
   b. Use option 5, Replace a failed or removed disk. Let this utility guide you through the process.
7. Verify in the Volume Manager software that the failed mirrors are resynched, or that they were previously reconstructed by the vxrelocd process:
   # vxprint -g grpname
   # vxtask list
8. Move all the hot-relocated subdisks back to the repaired disk:
   # vxunreloc -g grpname repaireddiskname
9. If the failed drive was a quorum device, then create a new quorum device and remove the failed quorum device from the cluster configuration:
   # clq add new-quorum-did
   # clq remove old-quorum-did
Task 1: Remove a cluster node
Task 2: Add a node to the cluster
Task 3: Replace a failed Fibre JBOD drive
Task 4: Replace a failed SCSI JBOD drive
Preparation
Perform the following steps before beginning any tasks:
1. Some of the following tasks refer to Node 1 and Node 2. You need to resolve these node IDs to host names. On both nodes, type the following:
   # clinfo -n
2. Some of the following tasks distinguish between two-node clusters and three-node clusters.
3. Some of the following tasks distinguish between clusters using Solaris VM software and clusters using VxVM software.
Note: The idea here is to leave in IPMP information only for those nodes that will remain.
# clrs set -p netiflist=sc_ipmp0@1[,sc_ipmp0@2] ora-lh
# clrs set -p netiflist=sc_ipmp0@1[,sc_ipmp0@2] iws-lh
Exercise: Performing Maintenance and Recovery Procedures

5. Perform this step only if the node to be removed is physically attached to the shared storage (that is, go directly to step 6 if you are removing a third, non-storage node).
   For VxVM: Remove the node from the nodelist property of any VxVM device groups:
   # cldg remove-node -n node_to_be_removed +
   For Solaris Volume Manager: Remove the node as a mediator first, and then remove it from the disksets by using the metaset command. On any node, type the following:
   # metaset -s orads -d -f -m node_to_be_removed
   # metaset -s iwsds -d -f -m node_to_be_removed
Note: If you are going from three storage nodes down to two storage nodes, it is possible that the node you are removing was never a mediator in the first place.
   # metaset -s orads -d -f -h node_to_be_removed
   # metaset -s iwsds -d -f -h node_to_be_removed
6. Reboot the node to be removed into non-cluster mode:
   # reboot -- -x
7. Put the node into maintenance state. On a remaining node, type the following:
   # clq disable node_to_be_removed
8. If you are going from a two-node cluster to a one-node cluster (only), put the cluster into install mode:
   # cluster set -p installmode=enabled
9. Remove one quorum device. If the node in question is physically attached to the quorum device, this will be your one and only quorum device. If you are removing a non-storage (third) node, you should still remove one of the quorum devices so that you are left with the correct quorum votes.
   # clq status
   # clq remove d#
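The per-diskset metaset commands in step 5 follow a fixed pattern, so they can be generated as a dry run and reviewed before execution. A sketch; the diskset names are the lab's, and the node name is a placeholder:

```shell
node=node_to_be_removed            # placeholder hostname
cmds=$(for ds in orads iwsds; do
  echo "metaset -s $ds -d -f -m $node"   # remove as mediator first
  echo "metaset -s $ds -d -f -h $node"   # then remove as diskset host
done)
echo "$cmds"                       # review, then run by hand on a real node
```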
10. Enable remove/add access for the node to be removed:
    # claccess allow -h node_to_be_removed
11. On the node to be removed (booted in non-cluster mode):
    a. Comment out the vfstab entry for /global/web.
    b. Run the following to deconfigure the node and remove it from the cluster configuration:
       # clnode remove -F -n name_of_any_remaining_node
12. At this point, you could remove the cluster software if this were a real-world server that you no longer wanted in the cluster but whose OS you wanted to preserve. For lab purposes, you may be adding the node back into the cluster, so you might as well just leave the cluster packages installed.
2. Use the claccess command on any node already in the cluster to allow the new node to join the cluster:
   # claccess allow -h newnodename
3. Create the mount points /global/web and /oracle on the new node, if they do not already exist:
   # mkdir /oracle
   # mkdir -p /global/web
4. If you are using VxVM software in the cluster, verify that the vxio major number is the same on the new node as on existing nodes. Create the entry on the new node if it does not yet exist:
   # grep vxio /etc/name_to_major
   vxio same_number_as_other_node[s]
Note: You have to use the same number as the other node(s). If there is some other device driver entry in the file that already has that number, reassign the number for that other driver to a higher number (higher than any existing entry in the file).
5. If this is a brand new (third) node which never had any cluster framework packages installed, use the Java ES installer to install the cluster framework packages. You can run the installer in graphical mode (if your DISPLAY is set correctly) or command-line mode:
   # pkginfo -l | grep SUNWsc
   If no cluster packages exist (they may exist even on a new node if you installed a Flash archive that includes the cluster packages), then do the following:
   # cd sc32_software_location/Solaris_sparc
   # ./installer
   or
   # ./installer -nodisplay
   Choose the Sun Cluster Core packages (not the agents). Choose the Configure Later option.
6. Verify that the did major number is the same on the new node as on the current nodes:
   # grep did /etc/name_to_major
   did same_number_as_other_node[s]
   Note: You have to use the same number as the other nodes. If there is some other device driver entry in the file that already has that number, reassign the number for that other driver to a higher number (higher than any existing entry in the file).
7. Run the scinstall utility on the new node to add it to the cluster:
   # /usr/cluster/bin/scinstall
   a. Choose Option 1 from the Main Menu.
   b. Choose Option 3 from the Install Menu, Add this machine as a node in an existing cluster.
   c. From the Type of Installation Menu, choose Option 1, Typical.
   d. Provide the name of any node already in the cluster as a sponsoring node.
   e. Provide the name of the cluster that you want to join. Type cluster show -t global on an existing node if you have forgotten the name of the cluster.
   f. Answer no to avoid sccheck (to save time).
   g. Use auto-discovery for the transport adapters.
   h. Reply yes to the automatic reboot question.
   i. Examine and approve the scinstall command-line options.
8. After the new cluster node is rebooted into the cluster, delete any existing quorum devices which are also physically connected to the new node. You may have no quorum devices at all, if you are going from a one-node to a two-node cluster.
   a. List the existing quorum devices:
      # clq status
   b. Determine if any of the DIDs listed are physically connected to the new node:
      # cldev list -v did_number_listed_above
   c. Remove any such quorum device. Be careful to remove only existing quorum devices that are attached to your new node. Do not remove the existing quorum device if you are adding a third, non-storage node.
      # clq remove did_number
      # /usr/cluster/lib/sc/pgre -c pgre_scrub -d /dev/did/rdsk/d#s2
      # /usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d#s2
9. Add any additional required or desired quorum devices, and call clq reset in case you had previously enabled the installmode flag. If you are adding a third storage node, this could be the quorum device you just deleted in the previous step.
If you are adding a third node which is a non-storage node, you want to keep the original quorum device (listed in the previous step) and add a second quorum device:
# cldev list -v
# clq add d#
Note: At the time of writing there is a bug being investigated, and one of your original two nodes may still panic (specifically, when adding the same quorum device as before where a new storage node is attached). After that node reboots, the cluster is configured correctly and you can continue.
# clq reset     (only really required if you are adding back a second node, in order to reset the installmode flag)
10. If you are using VxVM and the new node is physically attached to the storage, install VxVM if it is not already installed:
    a. Remove the vxio major number from /etc/name_to_major.
    b. Install the VxVM 5.0 packages:
       # cd veritas50dir/volume_manager/pkgs
       # cp VRTSvlic.tar.gz VRTSvxvm.tar.gz VRTSvmman.tar.gz /var/tmp
       # cd /var/tmp
       # gzcat VRTSvlic.tar.gz | tar xf -
       # gzcat VRTSvxvm.tar.gz | tar xf -
       # gzcat VRTSvmman.tar.gz | tar xf -
       # pkgadd -d /var/tmp VRTSvlic VRTSvxvm VRTSvmman
       # vxinstall    (add a license and accept the default for all questions except the last one; say no when asked about a default disk group)
       # clvxvm initialize
       # reboot
11. If you are using Solaris VM, create local metadevice database replicas if they do not already exist:
    # metadb -i
    # metadb -a -f -c 3 c#t#d#s7
12. Add the new node to any existing VxVM or Solaris VM device groups to which the new node is physically attached (if you are adding a non-storage node, ignore this):
If you are using VxVM, type the following:
# cldg add-node -n newnodename oradg
# cldg add-node -n newnodename iwsdg
If you are using the Solaris VM software, type the following:
# metaset -s orads -a -h newnodename
# metaset -s iwsds -a -h newnodename
# metaset -s orads -a -m newnodename
# metaset -s iwsds -a -m newnodename
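The gzcat | tar xf - pipeline in step 10 is worth noting: tar must be told to read the archive from standard input with the trailing dash. A self-contained sketch using a scratch archive (the VRTS-style file name is a stand-in created on the spot, not a real package):

```shell
# Fall back to gunzip -c where gzcat is not present (e.g. on Linux).
if ! command -v gzcat >/dev/null 2>&1; then gzcat() { gunzip -c "$@"; }; fi

work=$(mktemp -d); cd "$work" || exit 1
mkdir pkgdir && echo hello > pkgdir/file
tar cf - pkgdir | gzip > VRTSexample.tar.gz    # build a scratch archive
rm -r pkgdir

gzcat VRTSexample.tar.gz | tar xf -            # '-' means read from stdin
cat pkgdir/file
```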
13. For VxVM only, synchronize your device groups. This will create the proper devices if you have a non-storage third node:
    # cldg sync oradg iwsdg
14. Make sure that IPMP is configured on the new node, and note the names of the IPMP groups:
    # clnode status -m
15. Make system changes on the new node to support the scalable web server application, including installing the agent:
Option A: Node that you removed in Task 1 but left software intact.
# vi /etc/vfstab
(Make sure the entry for /global/web is uncommented.)
Option B: Brand new node:
# cd SC32_loc/Solaris_sparc/Product
# cd sun_cluster_agents/Solaris_10/Packages
# pkgadd -d . SUNWschtt
# cd same_directory_with_install_script_from_day1
# gzcat iws-common.cpio.gz | cpio -ivmud
# mkdir -p /var/iws/logs
# chown webservd:webservd /var/iws/logs
# vi /etc/vfstab
(Add a line for /global/web. You can paste it from another node, but be careful about the paste procedure adding an unwanted new-line character.)
# vi /etc/hosts
(Add an entry for iws-lh.)
16. Make system changes on the new node to support running Oracle, including installing the agent. Note that if you want the failover Oracle application to run on a non-storage node, you need to change the failover file system to a global file system.
Option A: Node that you removed in Task 1 but left software intact:
(Everything should be left over here. There is no need to do anything.)
Option B: Brand new node:
# cd SC32_loc/Solaris_sparc/Product
# cd sun_cluster_agents/Solaris_10/Packages
# pkgadd -d . SUNWscor
# cd same_directory_with_install_script_from_day1
# gzcat oracli.cpio.gz | cpio -ivmud
# groupadd -g 8888 dba
# useradd -u 8888 -g dba -s /bin/ksh -c "Oracle User" -d /oracle oracle
# vi /etc/vfstab
(Add a line for /oracle; it can remain a failover file system. You can paste from another node, but be careful about the paste procedure adding an unwanted new-line character.)
# vi /etc/hosts
(Add an entry for ora-lh.)
Option C: New non-storage node (change to global file system):
1. Make basic administrative changes to the new node:
# cd SC32_loc/Solaris_sparc/Product
# cd sun_cluster_agents/Solaris_10/Packages
# pkgadd -d . SUNWscor
# cd same_directory_with_install_script_from_day1
# gzcat oracli.cpio.gz | cpio -ivmud
# groupadd -g 8888 dba
# useradd -u 8888 -g dba -s /bin/ksh -c "Oracle User" -d /oracle oracle
# vi /etc/hosts
(Add an entry for ora-lh.)
2. From any one node, bring the application offline:
3. On all nodes, modify the /etc/vfstab file so that /oracle is a global file system:
   # vi /etc/vfstab
   For the existing line for /oracle, modify the last two fields so that they are "yes global" rather than "no -". For the new node, add or paste the line as a global file system. You can paste it from another node, but be careful about the paste procedure adding an unwanted new-line character.
4. From any one node, change the AffinityOn property and resume the application again. You need to disable the dependents of ora-stor in order to disable ora-stor itself to change the property:
   # clrs disable ora-server-res
   # clrs disable ora-listener-res
   # clrs disable ora-stor
   # clrs set -p AffinityOn=false ora-stor
   # mount /oracle
   # clrg online -e ora-rg
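The vfstab edit just described can be done non-interactively with awk instead of vi. The sketch below works on a scratch copy (the metadevice path is a stand-in); never rewrite the real /etc/vfstab without a backup.

```shell
vfstab=$(mktemp)
printf '/dev/md/orads/dsk/d200 /dev/md/orads/rdsk/d200 /oracle ufs 2 no -\n' > "$vfstab"

# Flip the last two fields of the /oracle line to "yes global".
awk '$3 == "/oracle" { $6 = "yes"; $7 = "global" } { print }' "$vfstab" > "$vfstab.new"
cat "$vfstab.new"
```

Inspect the rewritten copy, then move it into place by hand on each node.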
17. Modify the NetIfList property of any LogicalHostname or SharedAddress resources in those resource groups to include the IPMP group of the new node:
    # clrs show -v iws-lh ora-lh | grep NetIfList
    # clrs set -p netiflist=existing-list,ipmp-group@new-nodeid ora-lh iws-lh
19. Verify that you can run existing resource groups on the new node:
    # clrg switch -n newnodename ora-rg
    # clrg switch -n newnodename lb-rg
    # clrg online -n newnodename iws-rg
Note: If you wanted to make the xclock-rg and/or xclockgds-rg run on a brand new node as well, you would have to copy over the custom wrapper software and configuration files before being able to add the new node to the resource groups. As this is not a formal requirement of this lab, it is left as an exercise for the reader. If you are adding back a second node that you had just deleted from the cluster without having rebuilt the whole OS, then you could just add the node to the resource groups as in step 18.
20. Deny all future nodes from adding themselves to the cluster:
# claccess deny-all
4. Identify the quorum device:
# cldev list -v
# clq status

5. If the disk in question is the quorum device, assign a different quorum device, and remove the original one:
# clq add dnew_number
# clq remove dold_number

6. Replace the failed drive. If you are simulating a failure by zeroing out Solaris partitions, make the drive look like a new disk (no partitions except s2, which covers the whole disk). On one connected node, type the following:
# luxadm remove_dev -F enclosure,position
# luxadm insert_dev enclosure,position
On the other node, type the following:
# devfsadm

7. Repair the appropriate DID device:
# cldev show -v did_number
# cldev repair did_number
# cldev show -v did_number

8. If you are running VxVM software, rescan the configuration on all connected nodes:
# vxdisk scandisks

9. If you are running VxVM software, fix the mirrors on the node which owns the device group:
# vxdiskadm

10. If you are running Solaris VM software, perform the following steps on the node which owns the device group.
a. Reformat the drive according to the way it was before the failure:
# fmthard -s /old-vtoc.txt /dev/did/rdsk/d#s2
b. Replace failed metastate database replicas:
# metadb -s disksetname
# metadb -s disksetname -d /dev/did/rdsk/d#s7
Note – This may be just /dev/did/rdsk/d#, as per the note earlier in the module.
c. Rebalance metastate database replicas:
# metaset -s disksetname -b
d. Re-enable the failed submirror component:
# metastat -s disksetname | grep metareplace

Note – The command that this procedure tells you to invoke is not exactly correct (you cannot abbreviate a DID device name, for example). However, it tells you which device is the mirror and which DID was broken.

# metareplace -s disksetname -e mirror /dev/did/rdsk/d#s0

e. Verify the status:
# metastat -s disksetname
For Solaris VM:
# metastat -s dsname
# metadb -s dsname -i

5. Replace the failed drive. If you are simulating a failure by zeroing out Solaris partitions, make the drive look like a new disk (no partitions except s2, which covers the whole disk).

6. Identify the quorum device:
# cldev list -v
# clq status

7. If the disk in question is the quorum device, assign a different quorum device, and remove the original one:
# clq add dnew_number
# clq remove dold_number
# clq status

8. If you are using Solaris Volume Manager, delete bad metadevice database replicas (so that you are only left with good ones on surviving disks):
# metadb -s disksetname -i
# metadb -s disksetname -d /dev/did/rdsk/d#s7
Note – That might be /dev/did/rdsk/d#s7 or /dev/did/rdsk/d#, as per the note earlier in the module.
# metadb -s disksetname -i

9. On each node physically connected to the disk (one at a time):
a. For Solaris VM only (do this step and then skip to c.):
1. Switch the device group to which the disk belongs to a different node (other than the one on which you are operating):
# cldg switch -n othernode devgrpname
2. If the device group has a failover file system (like /oracle) and the above command fails, then switch the associated resource group:
# clrg switch -n othernode ora-rg
b. For VxVM only: Temporarily remove the disk from VxVM control:
# vxdisk offline c#t#d#
# vxdisk rm c#t#d#
c. Use cfgadm -c unconfigure to unconfigure the disk:
# cfgadm -c unconfigure c#::dsk/c#t#d#
d. Use cfgadm -c configure to configure the disk. Ignore any notice you receive, such as cannot instrument return of fd_intr:
# cfgadm -c configure c#::dsk/c#t#d#
e. If you are using VxVM, the DMP device driver is automatically informed about the new disk. Type the following so that VxVM completely recognizes the new disk:
# vxdisk scandisks
f. Repeat steps a-e for the other connected nodes. For SVM, as you operate on each node, you need to switch the group off that node.
10. Repair the appropriate DID device (from any node connected to the disk). You will see the physical disk ID changed if you really were able to change the drive.
# cldev show -v did_number
# cldev repair did_number
# cldev show -v did_number

11. If you are running VxVM software, fix the mirrors in the VxVM software on the node which owns the device group:
# vxdiskadm
If you are fixing one of the disk groups built by the class scripts, it is a CDS disk group and you should not need to change the layout default. If you happen for some reason to have a non-CDS disk group, you need to use option 22 to change the layout default to sliced. Then you can choose to replace the drive.

12. If you are running Solaris VM software, perform the following steps on the node which owns the device group.
a. Restore the partition table of the drive:
# fmthard -s /old-vtoc.txt /dev/did/rdsk/d#s2
b. Rebalance metastate database replicas:
# metaset -s disksetname -b
c. Re-enable the failed submirror component:
# metastat -s disksetname | grep metareplace
Note – The command that this procedure tells you to invoke is not exactly correct (you cannot abbreviate a DID device name, for example). However, it tells you which device is the mirror and which DID was broken.
# metareplace -s disksetname -e mirror /dev/did/rdsk/d#s0
d. Verify the status:
# metastat -s disksetname
Exercise Summary
Discussion – Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
Module 5
- Configure data for any failover application in an HA-ZFS file system
- Understand the design and features of the QFS file system
- Configure a standard QFS file system
- Configure a shared QFS file system in the cluster using Solaris Volume Manager multiowner diskset devices
- Configure Sun Cluster resource groups in non-global zones
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics presented in this module. While they are not expected to know the answers to these questions, the answers should be of interest to them and inspire them to learn the material presented in this module.
Discussion – The following questions are relevant to understanding the content of this module:
- When was the basic technology of the UFS file system invented? Does it have any weaknesses?
- If we already have a global file system, why do we need another file system technology that supports simultaneous file access by multiple nodes?
- What is the advantage of the administrative sandbox provided by Solaris 10 zones?
Additional Resources
The following references provide additional information on the topics described in this module:
- Sun Microsystems, Inc. Sun Cluster Software Administration Guide for Solaris OS, part number 819-2971.
- Sun Microsystems, Inc. Sun StorEdge QFS Installation and Upgrade Guide, part number 819-4334.
- Sun Microsystems, Inc. Sun StorEdge QFS Configuration and Administration Guide, part number 819-4332.
ZFS as a Failover File System Only

  pool: marcpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        marcpool    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
errors: No known data errors

Now we can create file systems that occupy the pool. ZFS automatically creates mount points and mounts the file systems. You never need to make /etc/vfstab entries.
vincent:/# zfs create marcpool/myfs1
vincent:/# zfs create marcpool/myfs2
vincent:/# zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
marcpool         ...
marcpool/myfs1   ...
marcpool/myfs2   ...
vincent:/# df -k
. . .
marcpool         ...   1%   /marcpool
marcpool/myfs1   ...   1%   /marcpool/myfs1
marcpool/myfs2   ...   1%   /marcpool/myfs2
The mount points default to /poolname/fsname, but you can change them to whatever you want:
vincent:/# zfs set mountpoint=/oracle marcpool/myfs1
vincent:/# zfs set mountpoint=/shmoracle marcpool/myfs2
vincent:/# df -k | grep pool
marcpool         34836480    26  34836332   1%   /marcpool
marcpool/myfs1   34836480    24  34836332   1%   /oracle
marcpool/myfs2   34836480    24  34836332   1%   /shmoracle
ZFS Snapshots
ZFS has an instantaneous point-in-time snapshot feature. Initially, snapshots do not consume any room in the zpool. As the original (parent) copy of the file system changes, snapshots start to take up room in the zpool to record the old values of changed data at the time of the snapshot:

theo:/# zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
orapool         2.70G  30.5G  24.5K  /orapool
orapool/oracle  2.70G  30.5G  2.70G  /oracle
theo:/# zfs snapshot orapool/oracle@thursday_1feb07
theo:/# zfs list
NAME                             USED  AVAIL  REFER  MOUNTPOINT
orapool                         2.70G  30.5G  24.5K  /orapool
orapool/oracle                  2.70G  30.5G  2.70G  /oracle
orapool/oracle@thursday_1feb07   299K      -  2.70G  -
The parent file system can be rolled back. If you want to roll back to a snapshot that is not the most recent one, you can specify the -r option to the rollback subcommand, which will automatically destroy snapshots taken after the one you are rolling back to.
theo:/# zfs rollback orapool/oracle@thursday_1feb07
Note – A single HAStoragePlus instance can refer to multiple traditional (non-ZFS) file systems, or multiple ZFS zpools, but not both. When you use the Zpools property, the values of the properties for traditional file systems (FilesystemMountPoints and AffinityOn) are ignored.
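As an illustrative sketch of the Zpools property just described (the resource group and resource names here are assumptions, not from the original text):

```shell
# clrt register SUNW.HAStoragePlus
# clrs create -g ora-rg -t SUNW.HAStoragePlus \
-p Zpools=marcpool \
ora-zfs-stor
```

When this resource goes online on a node, the whole marcpool zpool is imported there, along with every file system it contains; when the group switches, the pool is exported and re-imported on the other node.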
Introducing the Features of the Sun StorEdge QFS File System
Choice of Dual Disk Allocation Unit (DAU) Components and Single-Size DAU Components
The Disk Allocation Unit (DAU) is the minimum amount of file data guaranteed to be contiguous on the underlying device. This is a similar concept to the file system block size in UFS. QFS supports the following:
- Dual DAU components – This supports files where the first eight DAUs of the file will each be 4 kilobytes (Kbytes), and the remaining DAUs of the file are larger (defaults to 16 Kbytes for a non-shared file system, and can be set to 32 Kbytes or 64 Kbytes at file system creation time). This will optimize space when you have a lot of small files.
- Single-size DAU components – This uses a single size for the DAU, which defaults to 64 Kbytes. At file system creation time, you are allowed to specify a much larger DAU (up to 64 MB).
Standard (non-shared) QFS can be used only as a failover file system. The Sun Cluster standard global file system (PxFS) is not supported, at the time of writing of this course, with an underlying QFS file system. One result of this caveat is that you could not use QFS to store file data for a failover application that needs to fail over to a non-storage node. If you have such a configuration, then UFS and VxFS are still the only supported file system types.
QFS requires separate licensing. Starting with QFS 4.3, this is a paper-only license. Like UFS, QFS does not support shrinking a file system.
- Choose QFS file system and component device types
- Create underlying storage devices (partitioning and/or volume management)
- Create the Master Configuration File (mcf)
- Create and mount the file system
- Configure an instance of HAStoragePlus to support cluster failover
- One or more component devices of the type mm. This is a dual DAU type that supports metadata only.
- One or more component devices to hold file data only.
You can choose dual allocation types (md) or single allocation types (mr, gXXX).
You cannot mix and match dual allocation types and single allocation types in the same file system. The gXXX type allows you to group single allocation subcomponents. Later, storage for specific directory trees in the file system can be assigned to specific groups.
id – File system or underlying device name. Note that in this example the underlying disk component is a Solaris VM device (likely a mirror or soft partition of a mirror).
ordinal – Record number that must be unique in the file but otherwise has no specific meaning.
type – File system or component type.
family – Name used to associate the file system with components, typically the same as the file system name.
state – The state can be on or off. It is meaningful for initialization of tape devices in SAMFS-QFS, and would always be on or - for QFS disk components.
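The mcf example that these field descriptions refer to is not reproduced in this excerpt; as a sketch, an mcf for a failover QFS file system with separate metadata (mm) and data (mr) components might look like the following (the family name qfsora and the metadevice names are assumptions for illustration):

```shell
# id                    ordinal  type  family  state
qfsora                  100      ma    qfsora  on
/dev/md/orads/dsk/d110  101      mm    qfsora  on
/dev/md/orads/dsk/d120  102      mr    qfsora  on
```

The columns map one-to-one onto the five fields described above.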
Note – The file system type is always samfs, even when using QFS without the archiving manager packages. The example here uses no in the mount-at-boot column, which will be required for a failover file system in the cluster. Outside of the cluster, you would likely use yes.

You mount the file system just like any other that is listed in the vfstab:
# mount /oracle
- Make sure you have no in the mount-at-boot field in the /etc/vfstab entry.
- Copy the mcf file to all other connected nodes and run samd config on each connected node.
- Add an HAStoragePlus instance to control QFS failover.
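The HAStoragePlus instance itself could be created as in the following sketch (the resource group and resource names, and the use of /bin/true as the check command, are assumptions for illustration; the course notes only that the FilesystemCheckCommand property is what distinguishes a QFS resource from a UFS or VxFS one):

```shell
# clrt register SUNW.HAStoragePlus
# clrs create -g ora-rg -t SUNW.HAStoragePlus \
-p FilesystemMountPoints=/oracle \
-p FilesystemCheckCommand=/bin/true \
qfs-ora-stor
```

Overriding the check command prevents the resource methods from trying to run a UFS-style fsck against the QFS device.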
The only difference between this resource and an HAStoragePlus instance controlling a standard UFS or VxFS failover file system is the FilesystemCheckCommand property value.
Configuring a Shared QFS File System in the Cluster (for Use by Oracle RAC Only)
The Shared QFS file system allows multiple servers to read and write file system file data directly to the storage medium. Shared QFS does not support the following:
- Block and character device files and named pipes
- Mandatory file locking (advisory file locking is supported; this is much like NFS)
One server at a time functions as the metadata server; that is, while file data can be simultaneously accessed from multiple servers, metadata must be accessed from only one server at a time for each particular file system. The data flow for Shared QFS in the Sun Cluster environment is illustrated by the diagram in Figure 5-1.
Figure 5-1
Note that the shared QFS file system is an alternative to the standard Sun Cluster global file system. Its advantage is the simultaneous file data access. Shared QFS can be used both inside and outside of the Sun Cluster environment. Inside the Sun Cluster, the following apply:
- The Shared QFS file system is accessed only by nodes in a single cluster physically connected to the data. It cannot be simultaneously accessed from other servers outside the cluster, even if they are physically attached to the same storage, because of data fencing.
- A resource of type SUNW.qfs (provided with the QFS software) must be configured to drive failover of the metadata server.
- The only choice for underlying component volume management (if you need software mirroring for the components) is the Solaris Volume Manager multiowner disksets. There is no support for any VxVM volume management.
- The only supported application for Shared QFS is Oracle RAC.
Note – You could get Shared QFS running in the cluster without any reference to Oracle RAC, if you were not using any volume manager. At this time, it is possible that other multimaster or scalable applications besides Oracle RAC could run on Shared QFS (given the limitations previously listed in this section). However, only Oracle RAC has been verified.
Shared QFS File Systems on Solaris Volume Manager Multiowner Diskset Devices
Beginning with QFS 4.4, shared QFS file systems in the Sun Cluster environment can use Solaris VM multiowner diskset devices as underlying components. Prior to QFS 4.4, no volume manager was supported underneath shared QFS, and it was therefore only suitable in the cluster with hardware RAID devices. Solaris Volume Manager multiowner disksets require the following minimum software:
- Solaris 9 OS 9/04 (update 7) or later (including all versions of Solaris 10 OS)
- Sun Cluster 3.1 9/04 (update 3) or later
At this time, the Solaris Volume Manager multiowner diskset feature does have a dependency on the Oracle RAC framework (the underlying RAC-specific cluster membership monitor). Thus, you do have to install and configure the RAC framework (which includes the ORCLudlm package from Oracle) in order to use Solaris Volume Manager with Shared QFS.
- On Solaris 9 OS, this is optional. If you do not create it, the RAC framework daemons will automatically still be launched by boot scripts.
- On Solaris 10 OS, this is required. Only these resources will launch the RAC framework daemons.
The rac-framework-rg resources do all the launching of the daemons in the BOOT and INIT methods, as long as the group is managed. Trying to stop and start the resources will do nothing. This is a protective mechanism to guarantee that the RAC framework is enabled on behalf of its dependents: Solaris VM multiowner disksets and Oracle RAC itself.
The following is an example of a master configuration file that specifies two shared QFS file systems suitable for Oracle RAC (one for the binaries and one for the data):

qfs1                         10  ma  qfs1  on  shared
/dev/md/orashareds/dsk/d100  11  mm  qfs1  on
/dev/md/orashareds/dsk/d200  12  mr  qfs1  on
qfs2                         20  ma  qfs2  on  shared
/dev/md/orashareds/dsk/d300  21  mm  qfs2  on
/dev/md/orashareds/dsk/d400  22  mr  qfs2  on

In this example, all of our underlying devices are mirrored volumes in the same Solaris Volume Manager multiowner diskset.
- A server name (as returned by hostname)
- An IP address or resolvable name used to reach that server. In the cluster, use the cluster private hostname so that metadata traffic goes over the cluster transport.
- A server priority (non-negative number)
- An unused field
- The word server to identify the initial metadata server
The following is an example of a hosts.fsname file suitable for a shared QFS configuration in the cluster:

# cat hosts.qfs1
vincent  clusternode1-priv  1  -  server
theo     clusternode2-priv  2  -

Note that only one node (it should not matter which one) is configured as the initial metadata server. The other node is still a potential metadata server, and failover in the cluster will be controlled by a Sun Cluster resource that will be discussed later in the module.
- The master configuration file and hosts.fsname file need to be manually replicated to each node that will be mounting the file system.
- The sammkfs -S fsname command is called from any one node.
- The /etc/vfstab entry on each node always has no in the mount-at-boot column and shared in the options column, like the following:

qfs1  -  /oracle   samfs  -  no  shared
qfs2  -  /oradata  samfs  -  no  shared
On the initial mount, the file system needs to be mounted first from the initial metadata server (the one identified with the server flag), and then the mount command needs to be issued from the other nodes as well. A boot script automatically mounts shared QFS file systems thereafter.
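Sketching the initial mount order just described, with the host names from the hosts.qfs1 example above:

```shell
vincent:/# mount /oracle    # first, on the initial metadata server
theo:/# mount /oracle       # then, on each additional node
```

Mounting from a non-metadata-server node first would hang, since the client cannot reach a metadata server that is not yet serving the file system.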
- You can give zones on different nodes the same or different names.
- The booting of these zones is not under control of the cluster. You are likely to want to set the autoboot property of the zone to true.
- You need to install the data service agents that you want in the non-global zones.
- You could install them in global zones, using pkgadd without the -G option, and just let the packages get inherited into current and future non-global zones.
- Alternatively, you could install the agents only in the non-global zones in which they are needed.
Note – You must type the clrt register command in the global zone, and by default, it looks for the resource-type registration (RTR) files only in the global zone. If the agent is installed only in a non-global zone, you can use the -f option of clrt register to refer to an RTR file in the non-global zone's root path.
There are two ways to specify a zone name in the place of a node name:
- The old CLI supports only a syntax of -h nodename:zonename to refer to a zone (in a node list, as a switch target, and so on).
- The new CLI supports both the -n nodename:zonename syntax and the alternate syntax -n nodename[,nodename...] -z zonename. When this latter syntax is used, it implies that you want to specify the same zone name for each node name listed with the -n option.
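For instance, the two new-CLI forms above are equivalent (the node and zone names here are borrowed from the Apache example later in this module, as an illustrative sketch):

```shell
# clrg create -n pecan:frozone,grape:frozone apache-rg    # per-node zone syntax
# clrg create -n pecan,grape -z frozone apache-rg         # same zone name on every node
```

The -z form is simply shorthand when every node in the list hosts a zone of the same name.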
LogicalHostname and SharedAddress resources work within a resource group that is mastered on a non-global zone, as expected. However, while they will appear to only ever belong inside the non-global zone, the underlying implementation of the methods will actually configure them in the global zone and then move them to the appropriate non-global zone. The only implication for their configuration is that the IP addresses in question must be resolvable in the global zone, rather than in the non-global zone in which the addresses will actually end up. Obviously, in order to have your application happy, you are likely to also need
to make the IP address resolvable in the non-global zone. But the dependency for proper operation of the cluster IP resources themselves is just for address resolution in the global zone. For failover applications, it is optional to configure dedicated zone IPs using zonecfg. For scalable applications, it is required that each zone in question have its own dedicated public network IP address configured using zonecfg.
Traditional (non-ZFS) HAStoragePlus instances that represent global or failover file systems are slightly strange. The file system needs to be mounted on the appropriate physical node (or nodes, in the case of global), and then a loopback file system mount is made in the non-global zone. The non-global zone mount point does not need to be the same as the physical node mount point. HAStoragePlus supports a new syntax of global-zone-mt-pt:non-global-zone-mt-pt as the value of the FilesystemMountPoints property if you want to specify a different mount point in the non-global zone. Without the new syntax, the methods of HAStoragePlus will assume that the global and non-global zone mount points are the same. ZFS is a little different in that the file system software is zone-aware and can be made available exclusively in a non-global zone when required. This is discussed further later in this document.
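A sketch of the two-mount-point syntax just described (the resource name, group name, and the /web path inside the zone are assumptions for illustration):

```shell
# clrs create -g apache-rg -t SUNW.HAStoragePlus \
-p FilesystemMountPoints=/global/web:/web \
-p AffinityOn=false \
web-stor
```

Here /global/web is mounted in the global zone as usual, and the loopback mount appears at /web inside the non-global zone.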
- Viewing status or configuration of almost anything
- Changing state or switching a resource group or individual resource, assuming the group's node list contains that specific non-global zone
- Creating an application resource (but not a LogicalHostname, SharedAddress, or HAStoragePlus) in a group whose node list contains that specific non-global zone. LogicalHostname, SharedAddress, and HAStoragePlus resources must still be created by commands typed in the global zone.
Example
The following example shows the creation of a full scalable Apache configuration running only in a pair of non-global zones on different nodes.
Resource Group Manager Support for Non-Global Zones

The following steps have been taken prior to this example:
- Separate non-global zones have been created and booted on the two cluster nodes. In this example, both zones have the name frozone.
- A standard global file system /global/web has been created in the global zone. The HAStoragePlus resource that we will demonstrate in this example will automatically perform the loopback mounts in the non-global zones.
- The IP entry for food-web is resolvable both on the physical nodes, so that you can create the IP resource, and in the non-global zones, so that the application can run correctly.
- Each non-global zone has its own dedicated non-failover IP address on the public network. This is a requirement for scalable applications in zones.
# clrg delete apache-sa-rg
# clrg create -n pecan,grape -z frozone apache-sa-rg
# clrg status

=== Cluster Resource Groups ===

Group Name      Node Name       Suspended
----------      ---------       ---------
apache-sa-rg    pecan:frozone   No
                grape:frozone   No
Create a shared address resource (this must be typed in a global zone):
# clrssa create -g apache-sa-rg food-web
Manage and online the resource group to demonstrate that the IP address actually goes online in the non-global zone (in this case, on pecan:frozone, as it is the first node in the group's node list). This command could be run either from a global or non-global zone. In the example we are still running from the global zone so we can see with ifconfig how the IP address is placed in the non-global zone:

# clrg online -M apache-sa-rg
# clrs status

=== Cluster Resources ===

Resource Name   Node Name       State     Status Message
-------------   ---------       -----     --------------
food-web        pecan:frozone   Online    Online - SharedAddress online.
                grape:frozone   Offline   Offline

# ifconfig -a
. . .
qfe2:1: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 3
        zone frozone
        inet 192.168.1.52 netmask ffffff00 broadcast 192.168.1.255
. . .
A global file system mount point /global/web has already been created and provisioned in the global zones only. The only thing already provisioned in the non-global zones is the mount point. In this example we will use the same mount points for the loopback in the non-global zones. You do not put vfstab entries in the non-global zones.
The following commands are entered in a global zone:
# clrt register HAStoragePlus
# clrg create -p Desired_primaries=2 -p Maximum_primaries=2 \
-n pecan,grape -z frozone apache-rg
# clrg status

=== Cluster Resource Groups ===

Group Name      Node Name       Suspended   Status
----------      ---------       ---------   ------
apache-sa-rg    pecan:frozone   No          Online
                grape:frozone   No          Offline
apache-rg       pecan:frozone   No          Unmanaged
                grape:frozone   No          Unmanaged
# clrs create -g apache-rg -t HAStoragePlus \ -p FilesystemMountpoints=/global/web \ -p AffinityOn=false \ web-stor # clrg online -M apache-rg
The loopback file system mount will automatically be performed in the non-global zones as the HAStoragePlus resource goes online.
# clrt register apache
# clrs create -g apache-rg -t apache \
-p Bin_dir=/global/web/bin \
-p SCALABLE=true \
-p Resource_dependencies=food-web,web-stor \
apache-res
The apache resource will go online on both zones immediately, since new resources are now created in an enabled state and the scalable resource group is already online.
- Make up a name for each particular zone's private IP. This name is not configured in any external name service, including the hosts file; rather, it is stored and resolved through the CCR, just like the node private host names.
- Allow the cluster to automatically select an IP address corresponding to your choice of name. The cluster will automatically choose an appropriate IP address in the correct private network range for the per-zone private network IP address.
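A sketch of setting the property on each zone (the host name priv-frozone-p matches the clnode show output later in this section; priv-frozone-g is an assumption following the same pattern):

```shell
# clnode set -p zprivatehostname=priv-frozone-p pecan:frozone
# clnode set -p zprivatehostname=priv-frozone-g grape:frozone
```

This mirrors the unset syntax shown at the end of this section, where the value is nulled out instead.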
Each command will automatically configure a new clprivnet virtual IP on the respective node. For example:

# zlogin frozone
[Connected to zone 'frozone' pts/6]
Last login: Mon Jun 19 11:34:14 on console
Sun Microsystems Inc.   SunOS 5.10      Generic January 2005
frozone-p:/# ifconfig -a
lo0:1: flags=20010008c9<UP,LOOPBACK,RUNNING,NOARP,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
qfe1:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 192.168.1.231 netmask ffffff00 broadcast 192.168.1.255
qfe2:1: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 3
        inet 192.168.1.52 netmask ffffff00 broadcast 192.168.1.255
clprivnet0:1: flags=1009843<UP,BROADCAST,RUNNING,MULTICAST,MULTI_BCAST,PRIVATE,IPv4> mtu 1500 index 6
        inet 172.16.4.66 netmask fffffe00 broadcast 172.16.5.255

Current values of the zone private host name can be displayed using clnode show. In the example we display information for one node, which would include zones on that node. Omitting a node name (or using the default wildcard +) would show all nodes.

# clnode show pecan

=== Cluster Nodes ===

Node Name:                      pecan
  Node ID:                      2
  Enabled:                      yes
  privatehostname:              clusternode1-priv
  reboot_on_path_failure:       disabled
  globalzoneshares:             1
  defaultpsetmin:               1
  quorum_vote:
  quorum_defaultvote:
  quorum_resv_key:
  Transport Adapter List:
  Node Zones:

--- Zones on node pecan ---

  Zone Name:                    pecan:frozone
  zprivatehostname:             priv-frozone-p
Once you set the zprivatehostname property for a zone, changing the name by repeating the command with a different name just changes the name and leaves the same IP address. There is currently no mechanism for configuring more than one private IP address in a zone. If you want to unconfigure the zone's private IP address, you can just null out the value of zprivatehostname:
# clnode set -p zprivatehostname='' grape:frozone
# clnode set -p zprivatehostname='' pecan:frozone
- Task 1 – Install the QFS software on your cluster nodes
- Task 2 – Add volumes on which to build a failover QFS file system (VxVM or SVM)
- Task 3 – Prepare a QFS file system configuration
- Task 4 – Create, mount, and switch the QFS file system
- Task 5 – Migrate your Oracle application data to the QFS file system
- Task 6 – Rearrange your mount points so that /oracle is mounted from the new QFS
- Task 7 – Reconfigure your cluster resources to use the new file system
Exercise 1: Running a Standard Failover Service on QFS (Optional)

1. Install the QFS packages:
# cd qfs_4.5_software_location/2.10
# pkgadd -d . -G SUNWqfsr SUNWqfsu
Answer yes to all the questions asked by pkgadd.
Task 2a Adding A Volume on Which to Build a Failover QFS File System (With VxVM)
Perform the following steps only on the node that is currently the owner of the oradg disk group. You can determine which node that is by running cldg status on any node.

1. Select two disks from shared storage (one from one array and one from the other array) for a new mirrored volume. Make sure you do not use any disks already in use in existing device groups. Note the logical device names (referred to as cAtAdA and cBtBdB in step 2). The following example checks against all volume managers that you could possibly be using, including ZFS:
# vxdisk -o alldgs list
# metaset
# zpool status        (run this one on all nodes)
# cldev list -v
2. Add the disks to your oradg disk group:
# /etc/vx/bin/vxdisksetup -i cAtAdA
# /etc/vx/bin/vxdisksetup -i cBtBdB
# vxdg -g oradg adddisk qfsd1=cAtAdA qfsd2=cBtBdB
3. Create a mirrored volume to hold the failover QFS file system. You mirror in the background so you can proceed without having to wait:
# vxassist -g oradg make qfsvol 6g qfsd1
# vxassist -g oradg mirror qfsvol qfsd2 &
4.
Synchronize the device group so that correct cluster global devices get created for the new volume. # cldg sync oradg
Task 2b Adding A Volume on Which to Build a Failover QFS File System (With SVM)
Perform the following steps only on the node that is currently the owner of the orads diskset. You can determine which node that is by running cldg status on any node.

1. Select two disks from shared storage (one from one array and one from the other array) for a new mirrored volume. Make sure you do not use any disks already in use in existing device groups. Note the DID device names (referred to as dA and dB in step 2). The following example checks against all volume managers that you could possibly be using, including ZFS:
# vxdisk -o alldgs list
# metaset
# zpool status      (run this one on all nodes)
# cldev list -v

2. Add the disks to your orads diskset:
# metaset -s orads -a /dev/did/rdsk/dA
# metaset -s orads -a /dev/did/rdsk/dB

3. Create a volume (soft partition on top of mirrored disks) to hold the failover QFS file system:
# metainit -s orads d21 1 1 /dev/did/rdsk/dAs0
# metainit -s orads d22 1 1 /dev/did/rdsk/dBs0
# metainit -s orads d20 -m d21
# metattach -s orads d20 d22
# metainit -s orads d200 -p d20 6g
For SVM:
qfsora                      100  ms  qfsora  on
/dev/md/orads/dsk/d200      101  md  qfsora  on
Note - In this example, just for ease of doing the lab, you are creating an ms type of file system, which has the metadata and file data on the same device.

2. Verify the configuration that you have just entered:
# /opt/SUNWsamfs/sbin/sam-fsd
You should see a bunch of trace output if everything looks OK. There will be configuration error messages if something is wrong with your configuration.

3. Notify the QFS daemon of your new configuration:
# /opt/SUNWsamfs/sbin/samd config

4. Make a mount point for your new file system:
# mkdir /oranew

5. Add an entry in /etc/vfstab for your new file system:
# vi /etc/vfstab
qfsora  -  /oranew  samfs  -  no  sync_meta=1
On a different storage node, relocate the device group (it will be dragged across by the ora-rg resource group) and verify that you can mount and unmount the new file system:
# clrg switch -n new-node ora-rg
# mount /oranew
# df -k
# umount /oranew
Task 5 Migrating Your Oracle Application Data to the QFS File System
Perform the following steps on the node that owns the oradg or orads device group:

1. If you have any non-storage nodes (three-node cluster in a pair+1 configuration), make sure the Nodelist property for the ora-rg resource group contains only the storage nodes:
# clrg remove-node -n any_non_storage_node ora-rg

2. Halt your Oracle resources and migrate the data:
# clrs disable ora-server-res ora-listener-res
# mount /oranew
# chown oracle:dba /oranew
# cd /oracle
# find . -print | cpio -pdmu /oranew
# cd /
# umount /oranew

3. Disable the old storage resource:
# clrs disable ora-stor
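The find | cpio -pdmu idiom used to migrate the data copies a tree in pass-through mode while creating directories and preserving modification times. A minimal sketch you can try against scratch directories (the paths here are placeholders, not the lab's /oracle):

```shell
#!/bin/sh
# Demonstrate the find | cpio -pdmu copy idiom on scratch directories.
SRC=/tmp/cpio-src.$$
DST=/tmp/cpio-dst.$$
mkdir -p "$SRC/data" "$DST"
echo "control file" > "$SRC/data/control.dbf"

# -p pass-through mode, -d create directories as needed,
# -m preserve modification times, -u overwrite unconditionally
cd "$SRC"
find . -print | cpio -pdmu "$DST" 2>/dev/null
cd /

cat "$DST/data/control.dbf"
rm -rf "$SRC" "$DST"
```

Because cpio runs in pass-through mode, no intermediate archive file is created, which is why the lab can copy directly from /oracle into the mounted /oranew.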
Task 6 Rearranging Your Mount Points So That /oracle Is Mounted From the New QFS
Perform the following on all the storage nodes. Edit /etc/vfstab:
Change the old /oracle mount point to /ora-old. Change the /oranew mount point to /oracle.
Task 7 Reconfiguring Your Cluster Resources to Use the New File System
Perform the following steps on any one node in the cluster:

1. Make a new cluster resource for the QFS file system and set the application dependencies (ignore validation errors, as usual, from the node where the new QFS failover file system is not mounted):
# clrs set -p Resource_dependencies=qfsora-stor ora-server-res
# clrs set -p Resource_dependencies=qfsora-stor ora-listener-res

2. Remove the old storage resource:
# clrs delete ora-stor

3. Enable all of the resources:
# clrs enable -g ora-rg +

4. Verify that Oracle is running properly with its data on the new QFS file system, and that you can switch over and fail over the ora-rg resource group. If you see error messages about busy Solaris VM volumes, it may be because any mirror resynchronization in progress has to be restarted when you do a switchover. This can be ignored.
Task 1 Install the QFS software on your cluster nodes if needed
Task 2 Install the RAC framework packages in order to support SVM multiowner disksets
Task 3 Install the Oracle distributed lock manager
Task 4 Create and enable the RAC framework resource group
Task 5 Add volumes on which to build a shared QFS file system
Task 6 Prepare a shared QFS file system configuration
Task 7 Create and mount the file system
Task 8 Mount the file system on other node(s)
Task 9 Configure the metadata server as a failover resource, and test failover
Task 1 Installing the QFS Software on Your Cluster Nodes (If Not Already Done in the QFS Failover Lab)
Perform the following steps on all nodes that are physically attached to your data storage. If you have a third node that is a non-storage node, it cannot host any QFS file systems.

1. Install the QFS packages:
# cd qfs_4.5_software_location/2.10
# pkgadd -d . -G SUNWqfsr SUNWqfsu
Answer yes to the questions asked by pkgadd.
Task 2 Installing RAC Framework Packages for Oracle RAC With SVM Multiowner Disksets
Perform the following steps on all cluster nodes connected to storage:

1. Install the appropriate packages from the data service agents CD:
# cd sc32_location/Solaris_sparc/Product/sun_cluster_agents
# cd Solaris_10/Packages
# pkgadd -d . SUNWscucm SUNWudlm SUNWudlmr SUNWscmd

2. List out the local metadbs:
# metadb

3. Add metadbs on the root drive if they do not yet exist:
# metadb -a -f -c 3 c#t#d#s7
Exercise 2: Configuring a Shared QFS File System (Optional)

The pkgadd command will prompt you for the group that is to act as the DBA of the database. Respond by typing dba:
Please enter the group which should be able to act as the
DBA of the database (dba):  [?] dba
If all goes well, you will see a message on both consoles:
Unix DLM version(2) and SUN Unix DLM Library Version (1):compatible
Create a new multiowner diskset and add the disks:
# metaset -s orashareds -M -a -h node1 node2
# metaset -s orashareds -a /dev/did/rdsk/dA
# metaset -s orashareds -a /dev/did/rdsk/dB
3. Create a volume (soft partition on top of a mirrored volume) to hold the shared QFS file system metadata:
# metainit -s orashareds d11 1 1 /dev/did/rdsk/dAs0
# metainit -s orashareds d10 -m d11
# metainit -s orashareds d100 -p d10 1g

4. Create a volume (soft partition on top of a mirrored volume) to hold the shared QFS file system file data:
# metainit -s orashareds d21 1 1 /dev/did/rdsk/dBs0
# metainit -s orashareds d20 -m d21
# metainit -s orashareds d200 -p d20 6g
Note - The metadata and file data are on separate spindles. Each is currently a soft partition of a mirrored volume with only one submirror. At the end of the lab, you could optionally choose additional disks and complete the mirroring of your data.
1. Create an entry in the QFS configuration file to configure the shared file system (add lines to your file if you have done the failover QFS exercise):
# cd /etc/opt/SUNWsamfs
# vi mcf
qfsorashared                     10  ma  qfsorashared  on  shared
/dev/md/orashareds/dsk/d100      11  mm  qfsorashared  on
/dev/md/orashareds/dsk/d200      12  mr  qfsorashared  on

2. Create a parameter file for the shared QFS file system that is suitable for Oracle RAC:
# vi /etc/opt/SUNWsamfs/samfs.cmd
fs = qfsorashared
stripe = 1
sync_meta = 1
mh_write
qwrite
nstreams = 1024
rdlease = 600

3. Verify the configuration that you have just entered:
# /opt/SUNWsamfs/sbin/sam-fsd
You should see a series of trace output if everything looks OK. You will see configuration error messages if something is wrong with your configuration.

4. Notify the QFS daemon of your new configuration:
# /opt/SUNWsamfs/sbin/samd config

5. Print out the association between node names and private network host names:
# clnode show -p privatehostname

6. Use the above output to create the shared QFS hosts file, associating shared QFS hosts with the private network host names. This file should be identical on all nodes. List the first node as the only server (this is the metadata server that will fail over when you add its failover agent):
# vi /etc/opt/SUNWsamfs/hosts.qfsorashared
nodename1   clusternode1-priv   1   -   server
nodename2   clusternode2-priv   2

7. Make a mount point for your new file system:
# mkdir /orashared
8. Add an entry to /etc/vfstab for your new file system:
# vi /etc/vfstab
qfsorashared -
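A complete shared QFS vfstab entry typically follows the same field layout as the failover example earlier, with the shared mount option; the fragment below is a sketch with assumed mount point and options, not a verbatim line from the lab:

```
qfsorashared  -  /orashared  samfs  -  no  shared
```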
Task 9 Configuring the Metadata Server as a Failover Resource, and Testing Failover
1. From any one cluster node, create and enable a resource group containing a single resource of type SUNW.qfs for metadata failover:
# clrt register SUNW.qfs
# clrg create -n node1,node2 qfsmeta-rg
# clrs create -g qfsmeta-rg -t SUNW.qfs \
-p QFSFileSystem=/orashared qfsmeta-res
# clrg online -M qfsmeta-rg

2. Verify that you can manually switch the metadata server. Note the messages on the node consoles:
# clrg switch -n othernode qfsmeta-rg

3. Halt the node that is the current metadata server. When it reboots, verify that the shared file system is automatically mounted.
Task 1 Configuring and Installing the Zones
Task 2 Migrating Oracle to Run in the Zone
# zonecfg -z orazone
orazone: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:orazone> create
zonecfg:orazone> set zonepath=/orazone
zonecfg:orazone> set autoboot=true
zonecfg:orazone> commit
zonecfg:orazone> exit

2. Install the zone:
# zoneadm -z orazone install
Preparing to install zone <orazone>.
Creating list of files to copy from the global zone.
.
Advanced Features (ZFS, QFS and Zones)
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
The file </orazone/root/var/sadm/system/logs/install_log> contains a log of the zone installation.

3. Boot the zone:
# zoneadm -z orazone boot

4. Connect to the zone console and configure the zone. It will look similar to a standard Solaris OS that is booting after a sys-unconfig:
# zlogin -C orazone
[Connected to zone 'orazone' console]
Wait until the SMF services are all loaded, and navigate through the configuration screens. Get your terminal type correct, or you may have trouble with the rest of the configuration screens. The choice for CDE Terminal Emulator seems to work best for ctelnet and cconsole windows, even in the Java Desktop environment. When you have finished system configuration of the zone, it will reboot automatically. You can stay connected to the zone console.

5. Log in and perform other zone post-installation steps:
orazone console login: root
Password: ***
# vi /etc/default/login
Comment out the CONSOLE=/dev/console line.

6. Add an oracle user to the zone:
# groupadd -g 8888 dba
# useradd -u 8888 -g dba -s /bin/ksh -d /oracle \
-c "Oracle User" oracle

7. Add an entry for ora-lh to the /etc/hosts file of the zone. Use the same IP address as ora-lh has in the node's hosts file in the global zone.

8. Make an /oracle directory in the zone:
# mkdir /oracle

9. Disconnect from the zone console using ~.
3. If your old /oracle is a global file system, because the application had been configured on a non-storage node, unmount it and delete it from the vfstab file:
# umount /oracle
# vi /etc/vfstab      (comment out the line for /oracle)
7. Disable your old (non-ZFS) storage resource, and then set the mount point of the ZFS file system to /oracle:
# clrs disable ora-stor      (this will be qfsora-stor if you already did the failover QFS exercise)
# zfs set mountpoint=/oracle orapool/oracle
# df -k
8. Reset your resources using the ZFS storage. If your resource group is already running in a non-global zone, you will now see the ZFS file system only in the non-global zone:
# clrs create -g ora-rg -t HAStoragePlus \
-p Zpools=orapool ora-zfs-stor
# clrs set -p Resource_dependencies=ora-zfs-stor \
ora-server-res
# clrs set -p Resource_dependencies=ora-zfs-stor \
ora-listener-res
# clrs delete ora-stor      (this will be qfsora-stor if you already did the failover QFS exercise)
# clrs enable -g ora-rg +
# clrs status
9. Observe the switchover and failover behavior of the Oracle application, which will now include the zpool containing your ZFS file system.
10. On the node or non-global zone where Oracle is running, take a snapshot of the data:
# zfs snapshot orapool/oracle@thedatathisminute

11. Make some modifications to the Oracle data:
# ksh
# cd /oracli
# . ./clienv
# sqlplus SYS@MYORA as sysdba
Enter password: cangetin
SQL> select * from mytable;
SQL> insert into mytable values
SQL> insert into mytable values
SQL> insert into mytable values
SQL> insert into mytable values
SQL> commit;
SQL> select * from mytable;
SQL> quit
12. Switch your Oracle application to the other node (or zone), just to prove that the snapshots fail over along with everything else.

13. Verify your new data, and then restore your snapshot on the node where Oracle is running. Do you get your old data back?
# ksh
# cd /oracli
# . ./clienv
Exercise 4: Migrating Your Oracle Data to ZFS (Optional)

# sqlplus SYS@MYORA as sysdba
Enter password: cangetin
SQL> select * from mytable;
SQL> quit
# clrs disable ora-server-res ora-listener-res
# zfs rollback orapool/oracle@thedatathisminute
# clrs enable ora-server-res ora-listener-res
# ksh
# cd /oracli
# . ./clienv
# sqlplus SYS@MYORA as sysdba
Enter password: cangetin
SQL> select * from mytable;
SQL> quit
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
Module 6
Best Practices
Objectives
Upon completion of this module, you should be able to do the following:
Define and implement best practices for Internet Protocol multipathing (IPMP)
Define and implement best practices for shared storage file systems
Define and implement best practices for boot disk encapsulation and mirroring
Define and implement best practices for quorum devices
Define and implement best practices for campus clusters
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics described in this module. Although they are not expected to know the answers to these questions, the answers should interest them and inspire them to learn the material.
Discussion The following questions are relevant to understanding the content of this module:
Why do you need the HAStoragePlus resource for global data if global file systems are mounted at boot time?
What is a failover file system?
What are the best ifconfig command options to use for IPMP?
Can you unencapsulate a boot mirror?
Additional Resources
Additional resources The following references provide additional information on the topics described in this module:
Gene Trantham and Ben Howard. "Towards a Reference Configuration for VxVM Managed Boot Disks." Sun BluePrints Online, August 2000.
Sun Microsystems, Inc. Sun Cluster System Administration Guide for Solaris OS, part number 819-0580.
Sun Microsystems, Inc. System Administration Guide: IP Services (from the Solaris 10 Collection), part number 816-4554.
IPMP supports active IP addresses on all interfaces in the group, while a NAFO group can have only one live interface at a time.
IPMP lets you choose between an active-standby group and an active-active group. In an active-active group, the Sun Cluster software automatically load balances cluster service IP addresses across the interfaces of a group.
The default failover time for IPMP is much faster than for NAFO: it starts at 10 seconds for IPMP and adjusts automatically, while NAFO takes about 45 seconds.
IPMP supports automatic repair detection of interfaces and failback of IP addresses, while NAFO does not.
IPMP is part of the base Solaris OS installation, so the Sun Cluster software does not need to repeat development to support network adapter failover.
IPMP supports IPv6. The Sun Cluster software supports IPv6 on the public network and IPv6 logical hostnames starting at Sun Cluster 3.1 9/04 (Update 3).
This section proposes some best practices for conguring and using IPMP, including the following:
Achieving the best hardware redundancy
Using test addresses or link state testing in the Solaris 10 OS
Placing test addresses on the virtual interfaces, if possible
Avoiding standby interfaces to achieve better load balancing
Ensuring you use the failback=true parameter with load balancing
Using the deprecated flag on all test interfaces
Controlling test targets
Note - The bug concerns failure of certain RPC applications when the deprecated flag is on the physical interface. Since using the deprecated flag with test addresses is recommended as a best practice, you want to make sure the test addresses are not on the physical interface.

proto192# cat /etc/hostname.qfe1
proto192 group therapy netmask + broadcast + up
addif proto192-qfe1-test -failover deprecated netmask + broadcast + up

The second interface in a group does not normally have an additional physical node IP associated with it. Unfortunately, there is no way to make a virtual interface without a physical interface. As an alternative, it is possible to have yet another placeholder IP for the second member of the group. An example is:

proto192# cat /etc/hostname.qfe2
proto192-phys-placeholder group therapy netmask + broadcast + up
addif proto192-qfe2-test -failover deprecated netmask + broadcast + up

This configuration requires that you allocate an additional subnet IP per node. You already need two extra subnet IPs for the test IPs. You can use the test IP alone and not use the placeholder, in the hope that you do not run into bug #4710499. For example:

proto192# cat /etc/hostname.qfe2
proto192-qfe2-test group therapy -failover deprecated \
netmask + broadcast + up
IPMP Best Practices

Without the standby keyword on either interface of the group, the Sun Cluster software automatically load balances the LogicalHostname and SharedAddress resource IP addresses across the members of the IPMP group. This achieves a measure of inbound load balancing, as shown in the following example.
proto192# ifconfig -a
. . .
qfe1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 172.20.4.192 netmask ffffff00 broadcast 172.20.4.255
        groupname therapy
        ether 8:0:20:f1:2b:d
qfe1:1: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 2
        inet 172.20.4.194 netmask ffffff00 broadcast 172.20.4.255
qfe1:2: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 2
        inet 172.20.4.182 netmask ffffff00 broadcast 172.20.4.255
qfe2: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
        inet 172.20.4.195 netmask ffffff00 broadcast 172.20.4.255
        groupname therapy
        ether 8:0:20:f1:2b:e
qfe2:1: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 3
        inet 172.20.4.183 netmask ffffff00 broadcast 172.20.4.255
qfe2:2: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 3
        inet 172.20.4.184 netmask ffffff00 broadcast 172.20.4.255
It is a best practice to not use the standby keyword on either interface of a two-member IPMP group.
having at least two routers. Some vendors provide true HA cluster-like solutions for routing as well.
Manually adding routes to hosts, solely so that more routers are added to the routing table and can then be chosen as targets.
For example, if you wanted to use 192.168.1.39 and 192.168.1.5 as the targets, you could run these commands:
# route add -host 192.168.1.39 192.168.1.39 -static
# route add -host 192.168.1.5 192.168.1.5 -static
You can put these commands in a boot script. The System Administration Guide: IP Services from docs.sun.com, listed in the resources section at the beginning of this module, suggests making a boot script named /etc/rc2.d/S70ipmp.targets. This works on both Solaris 9 and Solaris 10.
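The contents of such a boot script might look like the following sketch, reusing the two example target addresses above (the addresses, and the full path to route, are illustrative only):

```
#!/sbin/sh
# /etc/rc2.d/S70ipmp.targets -- add static host routes so that
# in.mpathd has extra probe targets (addresses are examples only)
/usr/sbin/route add -host 192.168.1.39 192.168.1.39 -static
/usr/sbin/route add -host 192.168.1.5 192.168.1.5 -static
```

Because the routes are static, they survive router discovery changes and remain available to in.mpathd as probe targets.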
When to use a global file system instead of a non-global failover file system
How to set up the /etc/vfstab file for file systems
When to use affinity switching
How to use the HAStoragePlus resource type with scalable services
The file system is for a failover service only.
The Nodelist property for the resource group contains only the nodes physically connected to the storage; it should be the same as the node list for the device group.
Only services in a single resource group use the file system.
If these conditions are true, you generally receive a performance benefit from using a failover file system, especially for file-system-intensive services.
The file system is for a scalable service.
The file system is for a failover service that must fail over to a node not physically connected to the storage.
The file system contains data for different failover services in different resource groups.
You have an administrative reason to need access to the data from a node not running the service.
The vfstab file global option automatically enables logging, but you must explicitly include the logging option for a failover file system. It is a good practice to keep the logging keyword for record keeping, and in case you convert a global file system to a failover file system.
Solaris 9 4/04 (Update 6) and Above (Including Solaris 10)

The logging option is the default for both global and failover file systems.
Failover file systems must have the value no in the "mount at boot" column and must not have the word global in the options column of the vfstab file, as in the following example. There is no harm in including the logging option, even if it is the default, as in the Solaris 10 OS:

/dev/vx/dsk/nfsdg/burns /dev/vx/rdsk/nfsdg/burns /localnfs ufs 2 no logging

The VxFS file system has always logged by default. The cluster software requires that the vfstab file entries be present and identical on all nodes in the Nodelist property for the resource group in which you put the HAStoragePlus resource. This includes nodes not connected to the storage, for global file systems only. The VALIDATE method for the HAStoragePlus resource type enforces this and does not distinguish between nodes physically connected and nodes not connected to the storage.
The node list for the resource group includes only nodes physically connected to the storage.
No other services outside of that resource group use the storage.
Shared Storage File System Best Practices

Global file systems in scalable services ignore the AffinityOn parameter setting. Set this parameter to false to indicate that you understand its function, although it has no real meaning in this context.
The purpose of the HAStoragePlus resource START method is to ensure access to the storage. You want to ensure access from every node that is to run the data service. You want to place a dependency between the scalable data service and the global storage. This properly prevents starting the data service on any node where the storage is not accessible.
The proper relationship between the resources and resource groups associated with a typical scalable service is shown in Figure 6-1.

[Figure 6-1: diagram showing a scalable resource group containing a SUNW.apache resource and a SUNW.HAStoragePlus resource, and a failover resource group containing a SUNW.SharedAddress resource, linked by Resource_dependencies]
Solaris VM mirroring is generally easier in recovery scenarios. VxVM root mirroring, if you are already using VxVM for your data, may be considered simpler in that you are only using one volume management product.
Partitioning your boot disk prior to encapsulation
Encapsulating the boot disk with clvxvm encapsulate
Making sure you still have a logging root file system
Properly mirroring the boot disk
Unencapsulating the boot disk onto either the original encapsulated disk or a mirrored copy
To acquire space for the non-overlapping private region, the boot disk encapsulation process removes a cylinder from the beginning of your swap slice. You lose little, and it is an acceptable practice to let the VxVM software boot encapsulation do this.

It is a good practice to separate the /var directory as a separate file system so that runaway logging does not fill the entire boot disk. Make sure that you give plenty of room for normally large-sized log files. On an 18-Gbyte or 36-Gbyte boot disk, 4 Gbytes or 6 Gbytes are average sizes for the /var file system.

The VxVM software can encapsulate (and later mirror) any file system on the boot disk, as long as you stay within the partition limit of five. You need a partition for the /global/.devices/node@# file system. This partition must be on a local disk. While theoretically it could be on a separate local disk from the root disk, it is not recommended to increase the number of local disks required for boot. You should put the global devices file system on the boot disk. The following highlights the points to remember:
Never put more than five partitions, including swap, on the boot disk.
Do not separate out the root, /usr, and /opt file systems.
Make a separate /var partition if you are concerned about runaway logging.
Put the /global/.devices/node@# file system on the boot disk.
The volume name for that file system is a different name on each node.
The minor number for that file system is a different minor number on each node.
The /etc/vfstab file is properly edited prior to the reboot so that the file system is recognized by the VxVM scripts as part of the boot disk, and the correct volume is inserted by these scripts.
The vxmirror command calls the vxrootmir command on the boot disk to mirror the boot disk volumes.
Performing Unencapsulation
If you properly mirror your boot disks, you can unencapsulate onto the original encapsulated disk or onto an initialized mirror copy. All versions of VxVM software supported by Sun Cluster 3.2 make the original boot disk and its mirror copy identical, and you can unencapsulate all partitions, leaving either copy as the remaining copy. To unencapsulate, you must unmirror each boot disk volume individually. This example shows the removal of the mirrored copy (the mirror halves living on the disk rootmir). You could unencapsulate just as easily by removing the halves on the original root disk:

# vxprint -g rootdg
# vxassist -g rootdg remove mirror rootvol !rootmir
# vxassist -g rootdg remove mirror swapvol !rootmir
# vxassist -g rootdg remove mirror rootdisk_13vol !rootmir
# /etc/vx/bin/vxunroot
The vxunroot command detects any problems before the attempt to unencapsulate begins.
The system will continue running if at least half of the state database replicas are physically accessible.
The system will panic if fewer than half of the state database replicas are physically accessible.
The system will not reboot into multiuser mode unless more than half of the state database replicas are physically accessible and consistent.
You can force Solaris VM to run, even if a majority of the database replicas are not available, by setting the tunable mirrored_root_flag to 1 in the /etc/system file. The default value of this tunable is disabled, which requires that a majority of all replicas be physically accessible before the Solaris VM software will start. To enable this tunable, type the following:

# echo "set md:mirrored_root_flag=1" >> /etc/system

Consider using at least three local disks when using Solaris VM to manage your boot disks. If for no other purpose, this third local disk can be a container for a replica in the event that one of the mirrored disks fails. You then have full availability, even after a single failure.
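The three replica rules reduce to simple arithmetic on the accessible count. This toy script (an illustration only, not a real Solaris VM check) evaluates each possible count against a hypothetical four-replica configuration:

```shell
#!/bin/sh
# Toy illustration of the SVM state-database replica rules.
# With TOTAL replicas and ACC of them accessible:
#   ACC*2 >  TOTAL -> runs, and can reboot to multiuser (strict majority)
#   ACC*2 == TOTAL -> keeps running, but cannot reboot to multiuser
#   ACC*2 <  TOTAL -> panics
TOTAL=4
for ACC in 4 3 2 1; do
  if [ $((ACC * 2)) -gt "$TOTAL" ]; then
    STATE="runs, reboots to multiuser"
  elif [ $((ACC * 2)) -eq "$TOTAL" ]; then
    STATE="runs, but no multiuser reboot"
  else
    STATE="panics"
  fi
  echo "$ACC of $TOTAL replicas accessible: $STATE"
done
```

The even-count case shows why three replicas on three disks is the recommended minimum: with four replicas split two and two, the system stays up but cannot return to multiuser mode after a reboot.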
Create one-way mirrors, and reboot using your new metadevices before attaching the second submirror. If you attach the second mirror before rebooting, it can be corrupted by the time you reboot (since only the original partition is still mounted before the reboot).
The metadevice number must be different on each node for the /global/.devices/node@# file system, since each of these is mounted as a global file system. Update /etc/vfstab to use the new metadevices:
The metaroot command updates only the entry for the root file system (and also adds the correct line to /etc/system). All other entries (swap, /global/.devices/node@#) must be edited manually before the reboot.
The lockfs command is used just before the reboot to flush all transactions out of any logs and write the transactions to the file system. You must use this just once before the reboot because there are file systems that use the logging option.
Adding quorum devices so that the number of quorum votes is one less than the number of node votes
Writing a script to periodically check for quorum device failure
Deciding when to use a quorum server
[Figure 6-2  Pair+N Cluster: Node 1 through Node 4 (one vote each) connected by a junction, with three quorum devices QD(1) attached to the storage pair]
You want to be able to boot the cluster with only Node 1 or only Node 2 and still have full access to the data. With only one or two quorum devices, you cannot do that, because you need at least two nodes to form a quorum of three out of five votes, or four out of six votes. The optimal number of quorum votes from devices allows you to form a one-node cluster of four votes, as long as it is a node connected to the storage. A more complicated example might be a six-node cluster with three pairs of two nodes, shown in Figure 6-3.
[Figure 6-3  Six-node cluster arranged as three pairs (N1/N2, N3/N4, N5/N6), each pair sharing one quorum device Q(1)]
With only the three quorum devices shown above, you cannot lose one pair and half of each of the other pairs of nodes. You might have two-thirds of your data storage serviceable, but you cannot form a cluster with only four out of the possible nine votes. By adding two more quorum devices, as shown in Figure 6-4 on page 6-23, you can form a cluster.
You can survive more possible outages with the configuration shown in Figure 6-4. For example, Nodes 2 and 4 give you a total of six quorum votes, as do Nodes 2 and 5 or Nodes 3 and 5.
Figure 6-4 Six-Node Cluster With Five Quorum Devices (Nodes N1 through N6, one vote each, and five quorum devices of one vote each)
/dev/did/rdsk/d11    theo       Ok
/dev/did/rdsk/d12    vincent    Ok
/dev/did/rdsk/d13    theo       Ok
/dev/did/rdsk/d14    vincent    Ok
/dev/did/rdsk/d15    theo       Ok
#!/bin/ksh
PATH=$PATH:/usr/cluster/bin
MYNAME=$(uname -n)
if [[ -x /usr/sbin/clinfo ]] && /usr/sbin/clinfo
then
    for QDEV in $(clq list -t scsi)
    do
        if cldev list -v $QDEV | grep " $MYNAME:" >/dev/null 2>&1
        then
            STATUS_OUTPUT=$(cldev status -n $MYNAME $QDEV | \
                nawk '$2=="'$MYNAME'" {print $3}')
            if [[ $STATUS_OUTPUT != Ok ]]
            then
                logger -p daemon.crit "Quorum device $QDEV is faulted"
            fi
            echo "$(date): Quorum $QDEV $STATUS_OUTPUT" >> /var/cluster/qcheck
        fi
    done
fi
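The status extraction used by the script can be seen in isolation in the following self-contained sketch; the node names, device paths, and column layout are illustrative stand-ins for real cldev status output, and awk substitutes for the Solaris nawk:

```shell
# Extract the status column for this node from canned status output
MYNAME=theo
STATUS_OUTPUT=$(awk -v me="$MYNAME" '$2 == me {print $3}' <<'EOF'
/dev/did/rdsk/d11 theo Ok
/dev/did/rdsk/d12 vincent Ok
EOF
)
echo "$STATUS_OUTPUT"
```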
Figure 6-5 Quorum Server Quorum Device (Nodes 1 through 4, one vote each, connected through a switch to an external machine running the quorum server software; the scqsd daemon provides three votes for this cluster and can serve as quorum for other clusters too)
Note that the quorum server daemon is automatically assigned a number of votes that is one fewer than the number of nodes. You would therefore never assign both a quorum server quorum device and another traditional quorum device; you would end up with too many quorum device votes. It might seem that using a quorum server quorum device could be a good practice for any cluster, but this is not necessarily true. Consider a Pair+1 (three-node) cluster. A quorum server quorum device will automatically be assigned two votes. Therefore any node, including the non-storage node, can form the cluster by itself, and any node can be the last node remaining in the cluster.
Is it a good idea for a non-storage node to be left in the cluster by itself? It might seem harmless at first: you might think that normal HAStoragePlus dependencies would prevent applications from running on the third node anyway. However, there is a problem:
HAStoragePlus dependencies can prevent applications from starting, but they cannot prevent an application from continuing to run if the storage disappears, because there is no fault monitor associated with HAStoragePlus. A non-storage node that is unable to access a global file system can still cause the global file system mount point to be busy, thus preventing the proper restoration of a global file system even when one of the storage nodes returns.
The conclusion is that the quorum server is not an ideal solution for all topologies. In particular, it is a best practice to use the quorum server quorum device only when you understand and desire the semantics of an all-connected quorum device.
The difference between these two topologies is the location in which the quorum device (always required for two-node clusters) is placed.
Figure 6-6 Two-Site Configuration (Node 1, Node 2, and storage)
In the two-site configuration, the quorum device is located at the site of one of the nodes. If this entire site goes down, you will have complete loss of availability.
Figure 6-7 Three-Site Configuration (Node 1, Node 2, two storage units, and the quorum device at a third site)
The three-site topology behaves much like a standard, non-campus, two-node cluster during split-brain scenarios. This is the preferred campus cluster, but it requires more hardware and a third site.
Choosing A Quorum Device, a Quorum Node, or a Quorum Server for the Third Site
In the three-site campus cluster, the third site could contain:
A traditional disk quorum device
A quorum node (a third node configured in the cluster just for the purposes of quorum)
A quorum server device
The third choice may be the easiest in that you need to make neither storage connections nor private network connections to the third site. You just need public network accessibility between each node and the quorum server.
The preferred plex feature of the VxVM software
The VxVM software allows you to define a preferred mirror plex. When configuring this feature, set the preferred plex to the plex that is local to that site. When the preferred plex is set, the VxVM software performs all read operations from that plex. Write operations are applied to both plexes. Only if the preferred plex fails does the VxVM software perform all read and write operations on the nonpreferred plex. You must manually perform or automate the change of the preferred plex after a switchover or failover of the resource group. The VxVM software does not perform this change automatically.
Note – Solaris VM has a feature somewhat analogous to preferred plex, but it would not be possible to take advantage of it in this scenario. In Solaris VM the preferred copy is always the first submirror. If you want to switch the preferred copy, you have to detach your first submirror and add it back. This would cause a full resync of your data, which is obviously not what you want to achieve.
To improve performance, always use failover file systems rather than global file systems for failover services. Scalable services, by contrast, must use global file systems because the data must be accessible from all nodes in the node list of the scalable resource group.
Task 1 – Mirror the boot device using Solaris VM software
Task 2 – Encapsulate and mirror the boot device using VxVM software
Task 3 – Verify IPMP best practices
Task 4 – Implement quorum device monitoring
Preparation
No preparation is required for this exercise.
Warning – Make sure you run prtvtoc on your current root disk, and apply its output with fmthard to your other local disk (the second disk may have been your root disk at the beginning of the course, before the live upgrade). If you do this the wrong way, you can destroy your current root disk.
# prtvtoc /dev/rdsk/cAtAdAs0 | fmthard -s - \
/dev/rdsk/cBtBdBs0
3. If you have only two local drives, make sure that both of them have the same number of replicas. You may already have replicas on both disks (if you were using Solaris VM for your data), or you may have no replicas at all and have to add them to both disks:
# metadb
# metadb -a -f -c 3 cAtAdAs7
# metadb -a -c 3 cBtBdBs7
4. Create a simple concatenation submirror containing the existing root partition:
# metainit -f d11 1 1 cAtAdAs0
5. Create a second submirror using slice 0 of the other local disk:
# metainit d12 1 1 cBtBdBs0
6. Create a one-way mirror of the root partition using the active submirror:
# metainit d10 -m d11
7. Make a backup of the /etc/vfstab and /etc/system files:
# cp /etc/system /system.backup
# cp /etc/vfstab /vfstab.backup
8. Modify the /etc/vfstab and /etc/system files with metadevice entries for the boot device:
# metaroot d10
9. Create a simple concatenation submirror containing the existing swap partition:
# metainit -f d21 1 1 cAtAdAs1
10. Create a second submirror for the swap partition:
# metainit d22 1 1 cBtBdBs1
11. Create a one-way mirror of the swap partition using the active submirror:
# metainit d20 -m d21
12. Create a simple concatenation submirror containing the global devices partition:
# metainit -f d31 1 1 cAtAdAs3
13. Create a second submirror for the global devices partition:
# metainit d32 1 1 cBtBdBs3
14. Create a one-way mirror of the global devices partition using the active submirror. Note that the device name must be different on different nodes. For example, use d301 for node 1, d302 for node 2, and so on:
# metainit d30X -m d31
15. Edit /etc/vfstab and change each of the standard Solaris partitions on the boot device to metadevices. You will notice that the entry for the root file system is already edited (by metaroot, above). You need to edit the lines for swap (use /dev/md/dsk/d20) and /global/.devices/node@# (use /dev/md/dsk/d30X and /dev/md/rdsk/d30X):
# vi /etc/vfstab
16. Flush the UFS logs to the file system:
# lockfs -fa
17. Add the mirrored root flag to /etc/system:
# echo "set md:mirrored_root_flag=1" >> /etc/system
18. Reboot the node:
# init 6
19. Attach the remaining submirrors:
# metattach d20 d22
# metattach d30X d32
# metattach d10 d12
Note – It would be more efficient to run these one at a time, waiting until one is finished (monitoring with metastat) before starting the next one. It takes longer for them all to complete in parallel, but the resync happens in the background, so you do not need to wait.
20. Modify the PROM boot-device parameter to include both submirrors.
a. Identify the path to each root partition submirror:
# ls -l /dev/dsk/rootdisks0
# ls -l /dev/dsk/rootmirrors0
The path begins after the string devices. For example, if the root disk is:
../../devices/sbus@1f,0/SUNW,fas@e,8800000/sd@0,0:a
then use the following as the path to the boot slice:
/sbus@1f,0/SUNW,fas@e,8800000/sd@0,0:a
b. Edit the boot-device parameter:
# eeprom boot-device="path-to-boot-slice path-to-mirror-slice"
Task 2 Encapsulating and Mirroring the Boot Disk Using VxVM Software
Perform the following steps if your cluster is using VxVM and you want to mirror with VxVM.
1. Remove the bootdg disk group on which your previous (pre-upgrade) root disk had been encapsulated:
# vxprint -g bootdg
# vxdg destroy bootdg
# vxprint -g bootdg
2. Encapsulate your boot disk:
# clvxvm encapsulate
3. After the reboots, edit the /etc/vfstab file and remove the nologging option for the root file system. Make sure you still have seven columns on that line. You can put in the word logging, or just leave a minus sign, as logging is the UFS default for all Solaris OS versions supported in Sun Cluster 3.2.
4. Reboot one more time to make your /etc/vfstab change take effect.
5. Identify the local disk that you plan to use as the boot disk mirror:
# vxdisk list
6-36 Sun Cluster 3.2 Advanced Administration
6. Verify that the boot disk mirror is the same size and geometry as the original boot disk:
# format
7. Mirror the boot drive:
# /etc/vx/bin/vxdisksetup -i c#t#d# format=sliced
# vxdg -g rootdg adddisk rootmir=c#t#d#
# /etc/vx/bin/vxmirror -g rootdg \
rootdisk-vm-name rootmir
Do you have multiple adapters in your IPMP group?
Are IPMP test addresses on the physical or virtual interface? Hint: Look for interfaces with the NOFAILOVER flag.
Is the STANDBY flag set for either test interface?
Is the DEPRECATED flag set on the test interfaces?
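These flags appear in ifconfig output, so a quick way to answer the questions above is to filter for them (the interface names on your system will differ):

```
# ifconfig -a | egrep "NOFAILOVER|STANDBY|DEPRECATED"
```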
2. Obtain IP addresses for the test interfaces from your instructor. Modify the /etc/hostname.xxx files as needed to satisfy the following conditions:
If you have both a physical and a test interface, the test interface is on the virtual interface.
No standby options are used.
The deprecated option is used on the test interfaces.
The following are examples of ideal files. Make sure you use your own correct interface names and IP names:
proto192# cat /etc/hostname.qfe1
proto192 group therapy netmask + broadcast + up addif proto192-qfe1-test -failover deprecated netmask + broadcast + up
proto192# cat /etc/hostname.qfe2
proto192-qfe2 group therapy netmask + broadcast + up addif proto192-qfe2-test group therapy -failover deprecated \
netmask + broadcast + up
3. Check whether the FAILBACK=yes option is set in the /etc/default/mpathd file. If it is not, set it.
4. Reboot your node to put any changes made into effect.
5. Repeat this exercise on the other node if you so desire.
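For reference, the relevant lines of /etc/default/mpathd are sketched below; the values shown are believed to be the shipped defaults, so you usually only need to confirm that FAILBACK has not been changed to no:

```
FAILURE_DETECTION_TIME=10000
FAILBACK=yes
TRACK_INTERFACES_ONLY_WITH_GROUPS=yes
```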
Configure the cron utility to run the job every minute:
# EDITOR=vi; export EDITOR
# crontab -e
(add a line)
* * * * * /full_pathname_to_script
5. Check the /var/cluster/qcheck file periodically.
6. If possible, generate a quorum device failure by pulling out the quorum device, and observe the /var/cluster/qcheck file output. If the quorum device is a hardware RAID LUN, you might need to power off the entire box or uncable it to simulate a failure. It might take the scdpmd daemon up to 10 minutes to detect the failed quorum drive.
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
Module 7
Identify the importance of a security policy
Identify security vulnerabilities in a Sun Cluster 3.x software environment
Use the Solaris Security Toolkit software
Download and install security software on the cluster nodes
Implement the Toolkit software secure cluster driver
Provide secure clustered services
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics described in this module. Although they are not expected to know the answers to these questions, the answers should interest them and inspire them to learn the material.
Do you have and implement a security policy?
What are the most common types of security threats you encounter?
What are the security vulnerabilities particular to a system running the Sun Cluster 3.x software?
Additional Resources
The following references provide additional information on the topics described in this module:
Sun Microsystems, Inc. Sun Cluster System Administration Guide for Solaris OS, part number 819-2971.
Alex Noordergraaf. Securing the Sun Cluster 3.x Software. [Online] Available at http://www.sun.com/solutions/blueprints/0203/817-1079.pdf. February 2003.
Joel Weise and Charles R. Martin. Developing a Security Policy. [Online] Available at http://www.sun.com/solutions/blueprints/1201/secpolicy.pdf. December 2001.
It is implementable through fair and realistic procedures.
It is enforceable with security tools.
It defines the responsibility of all members of the organization.
It is documented and distributed.
This module can help you define a realistic security policy for the Sun Cluster environment by identifying which services are required and which services may be deleted in the Sun Cluster environment. Your enterprise may have a general security policy in place, for example, that mandates that rpcbind be eliminated whenever possible. However, Sun Cluster requires certain RPC services, and your cluster will not be able to operate if you eliminate rpcbind.
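On Solaris 10, for example, you can confirm that the RPC binder the cluster depends on is still online after applying a policy; svc:/network/rpc/bind is the standard Solaris 10 SMF service name:

```
# svcs network/rpc/bind
```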
The remainder of this module describes resources available to you to implement security in a Sun Cluster software environment. There are several tools in the Solaris OS that you can use to implement security, including the following:
Secure clustered services
Secure Shell (SSH) utility
Basic Security Module (BSM)
Automated Security Enhancement Tool (ASET)
Solaris Security Toolkit software
Other security tools, such as Crack, TripWire, SATAN, SAINT, and TCP_Wrappers
This module focuses on using the Solaris Security Toolkit software to implement security in a Sun Cluster 3.x software environment.
Solaris OS software
Oracle Real Application Cluster (RAC) software
Cluster interconnects
Internet services
Cluster services
Console access
Node authentication
Note – The Toolkit Sun Cluster 3.x software driver disables both the rsh and rcp utilities by default. It is possible to install the RAC software on each node and set up configuration files manually on each node if you do not want to change security settings. Alternatively, you can use the Secure Shell (SSH) ssh and scp commands to replace the functionality of the rsh and rcp commands. These commands provide an encrypted and authenticated mechanism for Oracle and any other software to perform tasks on remote machines. This simplifies the installation and configuration of the Oracle RAC software in a secure manner. The Oracle runInstaller utility provides the ability to specify ssh and scp as the internode communication mechanism by specifying their paths on the command line:
$ ./runInstaller -remoteshell /usr/bin/ssh \
-remotecp /usr/bin/scp
These are defined in the /etc/inetd.conf file in Solaris 9 and as SMF services in the Solaris 10 OS.
none – Any node is allowed to add itself to the cluster.
sys – A node is authenticated if its host name and IP address are consistent with what the current cluster members think the host name and IP address are.
des – A node is authenticated using the Diffie-Hellman public-key mechanism. If you intend to use des, then you must manually configure the public keys, secret keys, and Secure RPC net names using files, Network Information Service (NIS), NIS+, or LDAP.
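The authentication protocol can be inspected and changed with the claccess command in Sun Cluster 3.2; a minimal sketch (the protocol value shown is just an example):

```
# claccess show
# claccess set -p protocol=sys
```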
Goal of the Toolkit software
Modes of operation in which the Toolkit software runs
Types of modifications the Toolkit software makes
Standalone mode – Run the Toolkit software from a command line. Standalone mode allows you to make security modifications without reinstalling the Solaris OS. It is particularly useful when re-hardening a system after packages or patches are installed. Applying patches and installing packages might overwrite or modify files that the Toolkit software previously modified. By re-running the Toolkit software, you can reimplement any security modifications undone by the patch or package installation.
JumpStart mode – Ideally, you harden systems during installation. You can use the Toolkit software to harden systems during the third phase of a JumpStart software installation by running the Toolkit software scripts in a finish script.
Categories of Modication
Each of the modifications performed by the Toolkit software to harden or minimize Sun Cluster 3.x software nodes falls into one of the following categories:
Finish Subdirectory
All the actual worker-bee scripts live in the Finish directory of the Toolkit arena. The name of this directory reflects how the Toolkit's origins emphasized its use in the JumpStart environment, but the same scripts are executed if you run the Toolkit in standalone mode, as is most likely in the cluster environment. In the 4.2.0 version of the software, there are 116 scripts in the Finish directory. You will see that not all scripts are appropriate for all environments. For this reason, it is not recommended to run these scripts directly, nor are you likely to even run them all.
Audit Subdirectory
This arena contains a collection of scripts similar to those in the Finish directory. Rather than actually hardening your system, these scripts audit your system; that is, they report on whether your system is already hardened.
Drivers Subdirectory
This arena contains master scripts that configure and execute the collections of Finish or Audit scripts that are most appropriate for certain environments. There is a specific driver for the Sun Cluster environment, a different driver appropriate for the Sun Fire 15K System Controller, a different driver appropriate for general non-clustered servers, and so on. Each driver is actually composed of three files in the Drivers arena, for example:
suncluster3x-secure.driver
suncluster3x-config.driver
suncluster3x-hardening.driver
When you execute the Toolkit, you refer to the appropriate ...-secure.driver file. This file then uses the ...-config.driver to set options and variables and then calls the ...-hardening.driver to actually drive the proper collection of Finish or Audit scripts.
Files Subdirectory
This directory contains files that are inserted into your system as part of the implementation of some of the scripts.
bin Subdirectory
This directory contains the actual jass-execute utility.
Select one of the listed runs as the final run to undo. All system modifications performed in that selected run, and in any runs made after it, are undone. There are two important limitations to keep in mind with this feature:
If you select the Toolkit software option to not archive files, the undo feature is not available.
You can undo a run only once. After a run is undone, all the files backed up by a Toolkit software run are restored to their original locations and are not backed up again.
If you manually change some of the security modifications that the Toolkit software has made, the jass-execute command warns you that the security modification has changed. The information needed for the undo feature is logged in the /var/opt/SUNWjass directory. For each run, a new subdirectory is created in the /var/opt/SUNWjass/runs directory. This subdirectory stores the necessary archive and log information for the undo feature. Note – Never modify the contents of the files in the /var/opt/SUNWjass/runs directory.
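An undo run is started with the jass-execute undo option from the Toolkit bin directory; a minimal example:

```
# cd /opt/SUNWjass/bin
# ./jass-execute -u
```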
Solaris Security Toolkit software
Recommended and security patches
FixModes freeware software
MD5 software
Installs and executes the FixModes software to tighten file system permissions
Installs the MD5 software
Installs the Recommended and Security Patch software
Implements over 100 Solaris OS security modifications
An outline of the hardening procedure is:
1. Run the hardening on one node at a time. Execute the suncluster3x-secure.driver script as follows:
# cd /opt/SUNWjass/bin
# ./jass-execute -d suncluster3x-secure.driver
2. Reboot the node.
3. Verify that the node is hardened.
4. Verify that the node operates properly in the cluster.
5. Repeat Steps 1-4 on the remaining nodes.
Note – The Solaris Security Toolkit will make this modification automatically.
Opens a TCP socket connection with the server
Runs an ldapsearch command on the base of the directory
If you run your service in secure mode, the agent opens a socket connection with the server but does not run the ldapsearch command when performing a probe. Note – In essence you have a trade-off: if you run the more secure cluster service, the fault monitoring for that service in the cluster is less robust. This is true for LDAP and for the two web server applications in the following two subsections.
Task 1 – Install the Toolkit software on the selected node
Task 2 – Execute the suncluster3x-secure.driver script
Task 3 – Verify that the selected node is hardened
Task 4 – Verify that the selected node operates properly in the cluster
Task 5 – Harden the remaining nodes (optional)
Task 6 – Undo the security modifications on each cluster node
Preparation
No preparation is required for this exercise.
Use the pkgadd utility to install the Toolkit software:
# pkgadd -d /var/tmp SUNWjass
Task 4 Verifying That the Selected Node Operates Properly in the Cluster
Perform the following steps to verify that the selected node operates correctly:
1. Verify that the node can run the ora-rg resource group as follows:
a. Switch the ora-rg resource group to the selected node. Type:
# clrg switch -n selected-node ora-rg
2. Verify that you can access the Oracle database. You should be able to act as an Oracle client from either node, regardless of the node on which the failover service is running.
# ksh
# cd /oracli
# ls
clienv  oraInventory/  product/
# . ./clienv
# which sqlplus
# sqlplus SYS@MYORA as sysdba
Enter password: cangetin
SQL> select * from mytable;
SQL> quit
3. Invoke your web browser on your display station.
4. Navigate to http://iws-lh-name/cgi-bin/test-iws.cgi
Click the reload or refresh button several times to verify the behavior of the scalable application. Verify that you get responses from the node that has been hardened, as well as from the other node.
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
Module 8
Describe how to troubleshoot clustered services
Identify log files for each layer
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics described in this module. Although they are not expected to know the answers to these questions, the answers should interest them and inspire them to learn the material.
Discussion The following questions are relevant to understanding the content of this module:
Are there special considerations when debugging application problems in a cluster?
What resources are available for troubleshooting cluster software failures?
What are the interdependencies among cooperating software in the cluster?
Additional Resources
The following references provide additional information on the topics described in this module:
Sun Microsystems, Inc. SunSolve online support Web site. [Online] Available at: http://sunsolve.sun.com.
Sun Microsystems, Inc. Sun Cluster Error Messages Guide for Solaris OS, part number 819-2973.
After narrowing the search, begin troubleshooting within that functional area. Problems might occur in one or more of these functional areas that prohibit you from using the application. Some of these issues are described in the following sections.
A user does not use the application properly, but everything else works well.
The application is misconfigured.
The application has bugs.
The client OS is misconfigured.
The client OS is heavily loaded, causing the application to hang.
The client host hardware is faulted or misconfigured.
The network is heavily loaded, causing application timeouts.
A name server for either the client or server is unreachable.
Routers are not forwarding IP packets between the hosts.
The network is faulted or misconfigured.
The service is misconfigured.
The service has bugs.
The cluster fault monitor is not detecting error conditions in the application.
The cluster is unable to start or stop the application.
The server OS is misconfigured.
No public network interfaces are available.
The cluster is shut down.
Figure 8-1 shows a clustered-service software stack.
Figure 8-1 Clustered-Service Software Stack (top to bottom: Application, Data Service, Cluster Framework, Operating System, Server Hardware)
Application Layer
The application layer contains the application, such as Oracle software or Sun Java System Web Server, and includes the application configuration files, scripts, and binaries.
Hardware Layer
The server hardware layer includes all hardware for the physical nodes of the cluster. The hardware layer can contain the following items:
Server chassis and system boards
Storage arrays and their switches, hubs, and cables
Cluster transport interfaces, switches, and cables
Public network interfaces
Many dependencies exist within components of the same layer in the stack. The following are some examples:
The Sun Java System Messaging Server depends on the Sun Java System Directory Server.
A resource can depend on another resource in the same resource group or in another resource group.
An I/O card depends on its system board.
Symptom
Every time you try to enable the Oracle failover resource group, the Oracle server resource fails to start. This causes the resource group to attempt to fail over to the next node, where the server resource fails as well. Finally, the entire group remains offline as the attempts to start are halted by the Pingpong_interval feature (as discussed in Module 3).
Strategies
The following strategies can help you isolate the layer in which the problem is occurring:
Offline the Oracle resource group. Verify that other application resource groups are operating correctly. Verify that you can successfully switch over other application resource groups.
Verify that the cluster framework utilities are behaving properly. Use the various status suboptions of the commands, and so on.
With the Oracle resource group offline, disable (clrs disable) only the Oracle server resource. Switch the rest of the Oracle resource group online on a particular node (you likely still need the IP address and storage provided by the rest of the group). Verify that the IP and storage resources behave properly.
Start the Oracle server by hand using sqlplus /nolog. If you are having problems starting the Oracle server, you can likely isolate the problem to the application configuration. The sqlplus utility gives you better error messages than anything you can find in cluster log files.
If you have no problems starting Oracle by hand and then accessing the database, the problem is likely in the agent configuration. Verify the values of the Oracle server properties (clrs show -v ora-server-res-name), which is where your problem likely lies.
8-10
- Application log files for the Oracle software and Sun Java System Web Server
- Data service agent log files
- Cluster framework log files
Access log The access log contains information about client requests and server responses.
Error log The error log contains information about errors that the server encountered after creation of the file. It also contains informational messages about the server, such as when the server started.
8-11
PID log The PID log contains the process ID for the web server watchdog daemon, which in turn starts the web server itself. The software needs to record the process ID of the watchdog daemon in order to stop the service.
Setup log The setup log contains general and error information concerning the installation of the web server using the setup utility and is found in the server-root/setup directory.
The following example shows how the required location of the first three logs has been changed to a location local to each node, rather than a location in the global file system, for a scalable service:
# cd /global/web/iws/https-iws-lh.sunedu.com/config
# grep /var/iws server.xml
<PROPERTY name="accesslog" value="/var/iws/logs/access"/>
<LOG file="/var/iws/logs/errors" loglevel="info" logvsid="false" logstdout="true" logstderr="true" logtoconsole="true" createconsole="false" usesyslog="false"/>
# grep /var/iws magnus.conf
PidLog /var/iws/logs/pid
Oracle Software
The Oracle database server maintains two different types of log files that you can use for troubleshooting and debugging purposes:
Trace files Each server and background process can write to an associated trace file. When a process detects an internal error, it writes information on the error to its trace file. The file name format of a trace file is:
processname_PID_sid.trc
Where:
8-12
- The processname value is a three- or four-character abbreviated process name identifying the process that generated the file (for example, the pmon, dbwr, ora, or reco name).
- The PID is the process ID number.
- The sid is the instance system identifier.
A sample trace file name is $ORACLE_BASE/admin/$ORACLE_SID/bdump/lgwr_1237_TEST.trc. All trace files for background processes are written to the destination directory specified by the BACKGROUND_DUMP_DEST initialization parameter. If you do not set this initialization parameter, the default is the $ORACLE_HOME/rdbms/log directory. All trace files for server processes are written to the destination directory specified by the USER_DUMP_DEST initialization parameter. Set the MAX_DUMP_FILE_SIZE initialization parameter to at least 5000 to ensure that the trace file is large enough to store error information.
Alert files The alert_sid.log file stores significant database events and messages. Anything that affects the database instance or global database is recorded in this file. This file is associated with a database and is located in the directory specified by the BACKGROUND_DUMP_DEST initialization parameter. If you do not set this initialization parameter, the default is the $ORACLE_HOME/rdbms/log directory.
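The trace file naming convention described above can be exercised with plain shell string handling. This is an illustrative sketch only; the sample file name lgwr_1237_TEST.trc comes from the text, and the parse_trace_name helper is hypothetical:

```shell
# Hypothetical helper: split an Oracle trace file name of the form
# processname_PID_sid.trc into its three parts.
parse_trace_name() {
  base=${1%.trc}      # strip the .trc suffix
  sid=${base##*_}     # the instance SID is the last underscore-separated field
  rest=${base%_*}
  pid=${rest##*_}     # the process ID is the middle field
  proc=${rest%_*}     # the process name is everything before it
  echo "$proc $pid $sid"
}

parse_trace_name lgwr_1237_TEST.trc   # prints: lgwr 1237 TEST
```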
8-13
- /var/cluster/logs/install/scinstall.log.PID Sun Cluster logs the actions of the scinstall utility in this file.
- /var/cluster/logs/install/scinstall.upgrade.log.PID Both the framework and data service upgrades are logged in this file.
- /var/cluster/upgrade Information regarding the files and packages installed through the upgrade is stored in this directory.
Syslog Logs
After you install the cluster software, it uses the Syslog software for logging informational and error messages. The cluster framework logs to the Syslog daemon and kern facilities, depending on the software component. Each message that the Syslog software writes comprises three components:
- Message ID An integer between 100000 and 999999 that uniquely identifies the message. Use this value as a search key in the Sun Cluster Error Messages Guide for Solaris OS.
- Description An explanation of the problem.
- Solution A proposed solution to the problem. Some solutions are worded specifically, while other solutions recommend contacting a system administrator or Sun support engineer.
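A message in this form can be picked apart with standard text tools. Only the bracketed [ID nnnnnn facility.level] tag below follows the real Solaris syslog message-ID convention; the rest of the sample line is invented for illustration:

```shell
# Invented sample syslog line carrying a six-digit Sun Cluster message ID.
msg='Jun 10 12:00:01 node1 Cluster.RGM: [ID 224900 daemon.notice] resource ora-server-res status msg'

# Pull out the six-digit message ID to use as a search key in the
# Sun Cluster Error Messages Guide.
id=$(echo "$msg" | sed -n 's/.*\[ID \([0-9]\{6\}\) .*/\1/p')
echo "$id"   # prints: 224900
```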
Note The Sun Cluster Error Messages Guide for Solaris OS is a searchable volume of all error messages generated by the Sun Cluster software framework. Use it to find potential solutions to problems.
8-14
RGM Logs
The RGM maintains disk-based logs in the /var/cluster/rgm directory for such data as pingpong timers.
8-15
Module 9
- Observe self-induced problems
- Resolve instructor-induced problems and faults in your cluster
- Perform node disaster recovery in practice
9-1
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics described in this module. Although they are not expected to know the answers to these questions, the answers should interest them and inspire them to learn the material.
What different ways can you think of to break the cluster software?
What is the most important source of error-logging information in the Sun Cluster software?
How do you know when something is broken in the cluster?
9-2
Additional Resources
The following references provide additional information on the topics described in this module:
Sun Microsystems, Inc. Sun Cluster Error Messages Guide for Solaris OS, part number 819-2973.
Sun Microsystems, Inc. SunSolve online support web site. [Online] Available at http://sunsolve.sun.com.
The goals of this module are to convey the following concepts:
1. The Sun Cluster software effectively resolves many problems without user intervention.
2. Some errors that the Sun Cluster software identifies are clearly communicated to the administrator, and it is a straightforward task to resolve such problems.
3. Some errors that the Sun Cluster software identifies are not clearly communicated to the administrator, and it requires experience to resolve such problems.
The labs in this module are contrived. It is not common, for example, to accidentally kill the wrong daemon, or to mistakenly revoke permissions previously granted to a database user. It is not the goal of this module to present specific problems that a student is likely to encounter in the real world. Instead, the goal is to expose the student to a variety of errors in this safe, non-production classroom environment so that the student can see how errors are expressed in the cluster. This module is designed to give the students a variety of labs from which to choose. The student is not expected to finish all of the tasks in both the self-induced and instructor-induced exercises. Rather, the students should examine all the task summaries and then decide which tasks they are interested in doing.
9-3
You can perform these exercises in any order. The self-induced exercises are entirely self-contained. The instructor-induced exercises require coordination with the instructor. Neither type relies on the completion of any other exercise.
9-4
9-5
- Gain familiarity with how errors are expressed while the cluster operates in error scenarios
- Test the resiliency of the cluster software and determine how effective it is in resolving errors without user intervention
- Identify situations that require operator intervention
Task 1 Induce daemon failures
Task 2 Induce a full root file system
Task 3 Set an incorrect maxusers value
Task 4 Induce operator errors
Preparation
No preparation is required for this exercise.
Note The system panics after 30 seconds. There is nothing you need to do to restore sane operation.
9-6
Kill the rpcbind process. From one cluster node, type the following:
# ps -ef | grep rpcbind
# pkill rpcbind
Note On Solaris 10 OS, this daemon is under control of the Service Management Facility (SMF), and is automatically restarted.
Note The system panics. There is nothing you need to do to restore sane operation.
Note The system panics. There is nothing you need to do to restore sane operation.
Note The system panics. There is nothing you need to do to restore sane operation.
Kill the iws_probe process five times consecutively:
# for i in 1 2 3 4 5
> do
> pkill iws_probe
> sleep 3
> done
Note The probe was under the control of PMF and restarts as many times as it is configured to do. After that threshold is exceeded, the PMF daemon does not do anything in response except log a message in the /var/adm/messages file. To restore sane operation, stop and start the fault monitor for the resource using the clrs unmonitor and clrs monitor commands.

9-7
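The restart-until-threshold behavior the note describes can be sketched as a toy shell loop. This is an illustration of the idea only, not how the PMF daemon is actually implemented:

```shell
# Toy sketch: restart a failing probe up to a configured number of
# retries, then just log a message and give up (as PMF does).
retries=3
count=0
while [ "$count" -lt "$retries" ]; do
  : "the probe process dies here in real life"
  count=$((count + 1))
  echo "restarting probe (attempt $count of $retries)"
done
echo "retry threshold exceeded; logging a message and giving up"
```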
If you are running VxVM in your cluster, perform the following steps:
a. Kill the vxconfigd process:
# pkill -9 vxconfigd
Note The system does not panic. The host on which that daemon was killed is unable to perform any further VxVM volume or disk group operations.

b. Try to perform a simple VxVM operation on that node:
# vxdisk list
c. Restart the vxconfigd process:
# vxconfigd -x syslog -m boot
If you are running Solaris VM in your cluster, perform the following steps:
a. Kill the rpc.metad process:
# pkill rpc.metad
Note The system does not panic. This daemon is restarted by inetd when diskset status is required from that host. There is nothing you need to do to restore sane operation.

b. Print status for any diskset:
# metastat -s ora-ds
9-8
Does the cluster node continue to operate normally? Are there any messages displayed to the console that indicate the root file system is full?
_____________________________________________________________
_____________________________________________________________
___________________________________________________
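One quick way to answer the first question is to check how full the root file system actually is. The df output line below is a fabricated sample; on a live node you would parse the output of df -k / the same way:

```shell
# Fabricated sample line in the usual df -k format for the root slice.
line='/dev/dsk/c0t0d0s0  8705501 8705501       0   100%    /'

# Field 5 is the capacity percentage; strip the % sign.
pct=$(echo "$line" | awk '{gsub(/%/, "", $5); print $5}')
echo "root file system is ${pct}% full"
if [ "$pct" -ge 100 ]; then
  echo "root file system is full"
fi
```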
9-9
Note You need to boot off the network or a CD-ROM into single-user mode, mount the root file system, and remove the entry from the /etc/system file to restore sane operation.
On any cluster node, bring down both private interfaces: # ifconfig first-private-interface down # ifconfig second-private-interface down
Note This causes a split brain, and one of the cluster nodes panics. If the system that panics is the node on which you ran the ifconfig command, then you do not need to do anything to restore sane operation. If the system that panics is the other node, then use the ifconfig command to bring each interface up.
On any cluster node, remove the CCR infrastructure table and make some change: # rm /etc/cluster/ccr/infrastructure # scconf -a -h venus
9-10
Note Use the ftp utility to get a copy of the infrastructure table from another cluster node.
On any cluster node, remove the CCR directory table and make some CCR change: # rm /etc/cluster/ccr/directory # clrg create -n node1,node2 somenewrg
Note Use the ftp utility to get the CCR directory table from another cluster node.
9-11
- Simulate some real-world scenarios that you might confront
- Test the effectiveness of the cluster messages, logs, and other tools discussed in this course to find and resolve problems
- Test your ability to find and resolve problems that occur in your cluster
Task 1 Troubleshoot IPMP errors
Task 2 Troubleshoot an unknown state
Task 3 Troubleshoot a resource STOP_FAILED state
Task 4 Troubleshoot Oracle software resource group errors
Task 5 Troubleshoot an unbootable cluster node
Task 6 Troubleshoot oracle_server resource fault monitor errors
Task 7 Troubleshoot the failure to start a web server
Task 8 Troubleshoot iws-res resource failures on one node
Preparation
No preparation is required for this exercise. Tell your instructor when you are ready to begin a particular task.
9-12
Where is the configuration stored for IPMP?
_______________________________________________
What is the correct syntax of the files?
_______________________________________________
How can you restart your public interfaces with a new configuration?
_______________________________________________
TO RESTORE: On any cluster node, perform the following steps:
1. Type # vi /etc/hostname.pubnet_adapter
2. Correct the typographic error.
3. Reconfigure the adapter using the ifconfig command.
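For reference, a probe-based IPMP configuration file on Solaris 9 or 10 might look like the following. The adapter name (qfe0), host names, and group name are all hypothetical; your class environment may use a different style:

```
# /etc/hostname.qfe0 (hypothetical example)
# Data address in IPMP group ipmp0, plus a non-failover test address.
node1 netmask + broadcast + group ipmp0 up \
addif node1-test deprecated -failover netmask + broadcast + up
```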
9-13
Is this likely to be a data service problem?
_______________________________________________
Where is the core cluster configuration data held?
_______________________________________________
How can you restore this configuration?
_______________________________________________
TO RESTORE: On the cluster node, restore the CCR by taking a backup copy from another cluster node and restoring it to this node.
9-14
What program is responsible for stopping a data service?
_______________________________________________
Was the Oracle software instance stopped? Was the Oracle listener data service also stopped?
_______________________________________________
What utilities and command options do you need to use to fix this problem?
_______________________________________________
TO RESTORE: Rename oracle_server_stop.old to oracle_server_stop. You also need to clear the STOP_FAILED flag of the ora-server-res resource and the ora-rg resource group.
9-15
What messages appear on the consoles of the server nodes that relate to starting the Oracle resource group software?
_______________________________________________
How can you find out the value of this resource parameter?
_______________________________________________
Why does the resource group fail to come online on both nodes?
_______________________________________________
9-16
What do the console messages indicate is the problem?
_______________________________________________
What is blocking Node 1, and what is the node waiting for?
_______________________________________________
What happens if Node 2 is on fire and cannot boot?
_______________________________________________
TO RESTORE: The student must boot node 2 first because it owns the quorum device.
9-17
Where are the error messages logged?
_______________________________________________
Does this error disable the service availability?
_______________________________________________
What does the ORA-1045 message mean? How can you find out?
_______________________________________________
TO RESTORE: Log in as user oracle and run the following commands. 1. Type $ sqlplus /nolog 2. Type SQL> connect / as sysdba 3. Type SQL> grant create session, create table to sc_fm; 4. Type SQL> quit 5. Type $ exit
9-18
Where are the errors logged?
_______________________________________________
Does this appear to be an application, cluster, or operating system problem?
_______________________________________________
How can you avoid problems like this?
_______________________________________________
TO RESTORE: Fix the typographic error in the /etc/hosts file on the cluster node.
9-19
Where are the errors logged?
_______________________________________________
Does this appear to be an application, cluster, or operating system problem?
_______________________________________________
How can you avoid problems like this?
_______________________________________________
TO RESTORE: Perform the following steps on the cluster node:
1. Type:
# useradd -u 80 -g 80 -d / webservd
2. Type:
# clrg online iws-rg
9-20
- Easiest: Do it on the third node (the non-storage node) of a Pair+1 cluster.
- Medium difficulty: Do it on a storage node where all nodes are storage nodes (for example, a node of a two-node cluster, or a node of a three-node cluster where all nodes are attached to storage).
- Hardest: Do it on a storage node of a Pair+1 cluster.
9-21
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
9-22
Appendix A
A-1
Task 1 Bring the application resources offline and start the application manually
Task 2 Install the new Oracle 10g software and upgrade the database
Task 3 Halt the listener and configure the new network components
Task 4 Change and enable the resources
Task 5 Verify that the Oracle database upgrade is successful
Preparation
1. Make sure that your administrative workstation or display machine permits X Windows clients to connect to it. On your administrative workstation or display machine, type the following:
# /usr/openwin/bin/xhost +
2. Make sure that you have at least 3 Gbytes of free space in the /oracle file system. If you do not have enough space, then you must either grow the file system or create a new volume on which you can install the Oracle 10g software binaries.
Note The amount of time required to perform this exercise can vary greatly with the horsepower of your server nodes. It has taken up to four hours on slow equipment.
A-2
2. Exit and return as the oracle user, and verify the environment change:
$ exit
# su - oracle
$ env
3. Run the Oracle 10g Universal Installer software:
$ cd Oracle10gR2-DB-Location
$ ./runInstaller
Respond to the dialogs using Table 9-1.
Table 9-1 The runInstaller Script Dialog Answers
Welcome to the Oracle Database 10g Installation - Select the Advanced Installation radio button (near the bottom). Click Next.
Specify Inventory Directory and Credentials - Click Next.
Select Installation Type - Select the Custom radio button, and click Next.
A-3
Exercise: Oracle Software Installation and Database Upgrade

Table 9-1 The runInstaller Script Dialog Answers (Continued)
Specify Home Details - Verify specifically that the Path matches the new value of $ORACLE_HOME in the .profile file. Click Next.
- Enterprise Edition Options 10.2.0.1.0
- Oracle Enterprise Manager Console DB 10.2.0.1.0
- Oracle Programmer 10.2.0.1.0
- Oracle XML Development Kit 10.2.0.1.0
Click Next.
Product Specific Prerequisite Checks - If you are running in the global zone, these will all succeed, and you should be taken to the next screen without any interaction. If you are running in a non-global zone, you will get a security warning (because there is no /etc/system in a non-global zone). Click Next, and then click Yes to proceed when you get the pop-up warning window.
Privileged Operating System Groups - Verify that both say dba, and click Next.
Create Database - Select the Install database Software only radio button, and click Next.
Summary - Verify, and click Install.
A-4
Table 9-1 The runInstaller Script Dialog Answers (Continued)
Execute Configuration Scripts - In another window, log in as root on the node or non-global zone in which you are running the installer. Execute the two scripts noted (you should be able to paste their names straight from the Oracle installer into your shell window):
/oracle/oraInventory/orainstRoot.sh
/oracle/products/10.2/root.sh
For the second script, accept the default pathname for the local bin directory. Once the scripts are completed, click OK.
End of Installation - Click Exit, and confirm.

4. Run the Network Configuration Assistant to create a new listener for the new version:
$ netca

Oracle Net Configuration Assistant Welcome - Select Listener Configuration, and click Next.
Listener Configuration, Listener - Select Add, and click Next.
Listener Configuration, Listener Name - Verify the listener name is LISTENER, and click Next.
Select Protocols - Verify that TCP is among the Selected Protocols, and click Next.
TCP/IP Protocol - Verify that the Use the standard port number of 1521 radio button is selected, and click Next.
More Listeners - Verify that the No radio button is selected, and click Next.
Listener Configuration Done - Click Next.
Welcome - Click Finish.
A-5
Halt the new listener that was automatically started (we will reconfigure it later to use the ora-lh address):
$ lsnrctl stop
A-6
Step 4: Backup - Lie, and verify that the I have already backed up my database radio button is selected (in order to speed up this lab exercise, pretend you have). Click Next.
Step 5: Management Options - Verify that the Enterprise Manager is not available (because you deselected it when installing the software). Click Next.
Step 6: Summary - Verify, and click Finish.
Database Upgrade Assistant: Progress - The upgrade could take 1-4 hours to complete. It will appear to be stuck at several points, but be patient. Enjoy a fine cup of coffee courtesy of your training center, and a well-deserved nap. Click OK when it is 100% finished.
Database Upgrade Assistant: Upgrade Results - Verify (you do not need to set any more passwords), and click Close.
A-7
A-8
Your entire file should end up looking identical to the following, assuming your logical host name is literally ora-lh:

SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (SID_NAME = PLSExtProc)
      (ORACLE_HOME = /oracle/products/10.2)
      (PROGRAM = extproc)
    )
    (SID_DESC =
      (SID_NAME = MYORA)
      (ORACLE_HOME = /oracle/products/10.2)
      (GLOBAL_DBNAME = MYORA)
    )
  )
LISTENER =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = ora-lh)(PORT = 1521))
      (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC0))
    )
  )

5. Configure the tnsnames.ora file by typing (as user oracle):
$ vi $ORACLE_HOME/network/admin/tnsnames.ora
Modify the HOST variables to match the logical host name ora-lh.
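If you prefer not to edit every HOST entry by hand, a sed substitution can make the change mechanically. The address line below is a made-up sample; back up the real file before editing it, and note that the exact file layout may differ in your environment:

```shell
# Rewrite any HOST = value inside an ADDRESS clause to ora-lh.
# A real edit might be: sed 's/HOST = [^)]*/HOST = ora-lh/' tnsnames.ora > tnsnames.ora.new
echo '(ADDRESS = (PROTOCOL = TCP)(HOST = oldhost)(PORT = 1521))' |
  sed 's/HOST = [^)]*/HOST = ora-lh/'
# prints: (ADDRESS = (PROTOCOL = TCP)(HOST = ora-lh)(PORT = 1521))
```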
A-9
# clrs enable ora-server-res ora-listener-res
# clrs monitor ora-server-res
A-10
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
A-11
Appendix B
B-1
Task 1 Shutting down Failover Oracle Instances
Task 2 Provisioning the Shared QFS File System
Task 3 Configuring Oracle Virtual IPs
Task 4 Configuring the oracle User Environment
Task 5 Disabling Access Control on X Server of the Admin Workstation
Task 6 Installing Oracle CRS Software
Task 7 Installing Oracle Database Software
Task 8 Create Sun Cluster Resources to Control Oracle RAC Through CRS
Task 9 Verifying That Oracle RAC Works Properly in a Cluster
B-2
Exercise 3: Running Oracle 10g RAC in Sun Cluster 3.2 Software
Preparation
Before continuing with this exercise, read the background information in this section.
Background
Oracle 10g RAC software in the Sun Cluster environment encompasses several layers of software, as follows:
RAC Framework - This layer sits just above the Sun Cluster framework. It encompasses the UNIX distributed lock manager (udlm) and a RAC-specific cluster membership monitor (cmm). In the Solaris 10 OS, you must create a resource group rac-framework-rg to control this layer (in the Solaris 9 OS, creating the resource group is optional; if you do not, the daemons are controlled by standard Solaris boot scripts).
Oracle Cluster Ready Services (CRS) - CRS is essentially Oracle's own implementation of a resource group manager. That is, for the Oracle RAC database instances and their associated listeners and related resources, CRS takes the place of the Sun Cluster resource group manager.
Oracle Database - The actual Oracle RAC database instances run on top of CRS. The database software must be installed separately (it is a different product) after CRS is installed and enabled. The database product has hooks that recognize that it is being installed in a CRS environment.
Sun Cluster control of RAC - The Sun Cluster 3.2 environment features new proxy resource types that allow you to monitor and control Oracle RAC database instances using standard Sun Cluster commands. The Sun Cluster proxy resources issue commands to CRS to achieve the underlying control of the database.
B-3
The various RAC layers are illustrated in Figure 9-1.
[Figure: a layered stack diagram. Top: the Oracle RAC database, with DB instances controlled by CRS; Sun Cluster RAC proxy resources let you use Sun Cluster commands to control the database by proxying through CRS. Middle: Oracle CRS. Bottom: the RAC framework, controlled by the rac-framework-rg resource group.]

Figure 9-1 Oracle 10g RAC Software Layers in the Sun Cluster Environment
- Raw devices using the VxVM Cluster Volume Manager (CVM) feature
- Raw devices using the Solaris Volume Manager multi-owner diskset feature
- Raw devices using no volume manager (assumes hardware RAID)
- Shared QFS file system on raw devices or Solaris Volume Manager multi-owner devices
- On a supported NAS device (starting in Sun Cluster 3.1 8/05, the only such supported device is a clustered Network Appliance Filer)
Note Use of global devices (using normal device groups) or a global file system is not supported. The rationale is that if you used global devices or a global file system, your cluster transport would be used for both the application-specific RAC traffic and the underlying device traffic. The resulting performance penalty might eliminate the advantages of using RAC in the first place.
B-4
The HA-Oracle installation, including the home directory for the user oracle, may be in a failover file system. We require the oracle home directory and binaries to be available to all nodes simultaneously. It is unknown whether the nodes you are running on have the horsepower to run Oracle failover and Oracle RAC simultaneously.
For this reason, shut down your failover Oracle application. Type the following from any one node:
# clrg offline ora-rg
# clrs disable -g ora-rg +
# clrg unmanage ora-rg
B-5
B-6
$ vi .profile

ORACLE_BASE=/orashared
ORACLE_HOME=$ORACLE_BASE/product/10.2.0/db_1
CRS_HOME=$ORACLE_BASE/product/10.2.0/crs
TNS_ADMIN=$ORACLE_HOME/network/admin
DISPLAY=display-station-name-or-IP:display#
if [ `/usr/sbin/clinfo -n` -eq 1 ]; then
    ORACLE_SID=sun1
fi
if [ `/usr/sbin/clinfo -n` -eq 2 ]; then
    ORACLE_SID=sun2
fi
PATH=/usr/ccs/bin:$ORACLE_HOME/bin:$CRS_HOME/bin:/usr/bin:/usr/sbin
export ORACLE_BASE ORACLE_HOME TNS_ADMIN CRS_HOME
export ORACLE_SID PATH DISPLAY

3. Make sure your actual X Windows display is set correctly on the line that begins with DISPLAY=.
4. Read in the contents of your new .profile file and verify the environment:
$ . ./.profile
$ env
5. Enable rsh for the oracle user:
$ echo + > /orashared/.rhosts
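The per-node ORACLE_SID logic in the .profile can be tried outside the cluster by stubbing out clinfo. Everything here is hypothetical scaffolding around the same conditional tests:

```shell
# clinfo_n stands in for /usr/sbin/clinfo -n (which reports this node's
# cluster node number) so the logic can be exercised on any machine.
clinfo_n() { echo 2; }

node=$(clinfo_n)
if [ "$node" -eq 1 ]; then
  ORACLE_SID=sun1
elif [ "$node" -eq 2 ]; then
  ORACLE_SID=sun2
fi
echo "$ORACLE_SID"   # prints: sun2
```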
B-7
Table 9-2 Oracle CRS Installation Dialog Actions
Welcome - Click Next.
Specify Inventory directory and credentials - Verify, and click Next.
Specify Home Details - Change the Path to /orashared/product/10.2.0/crs. Click Next.
Product Specific Prerequisite Checks - Most likely, these checks will all succeed, and you will be moved automatically to the next screen without having to touch anything. If you happen to get a warning, click Next, and if you get a pop-up warning window, click Yes to proceed.
B-8
Table 9-2 Oracle CRS Installation Dialog Actions (Continued)
Specify Cluster Configuration - Enter the name of the cluster (this is actually unimportant). Highlight the line containing the name of any nonstorage node (where shared QFS is not configured) and click Remove. For each remaining node listed, verify that the Virtual Host Name column contains nodename-vip, similar to what you entered in the /etc/hosts file. Click Next. If you get some error concerning a node that cannot be clustered (null), it is probably because you do not have an /orashared/.rhosts file, or a password for the oracle user. You must have one even on the node on which you are running the installer.
Specify Network Interface Usage - Be very careful with this section! To mark an adapter as described here, highlight it and click Edit, then choose the appropriate radio button and click OK. Mark your actual public adapters as public. Mark only the clprivnet0 interface as private. Mark all other adapters, including actual private network adapters, as Do Not Use. Click Next.
Specify Oracle Cluster Registry (OCR) Location - Choose the External Redundancy radio button. Enter /orashared/ocr_file. Click Next.
B-9
Table 9-2 Oracle CRS Installation Dialog Actions (Continued)
Specify Voting Disk Location - Choose the External Redundancy radio button. Enter /orashared/voting_file. Click Next.
Summary
B-10
Table 9-2 Oracle CRS Installation Dialog Actions (Continued)
Execute Configuration Scripts - On all selected nodes, one at a time, starting with the node on which you are running the installer, open a terminal window as user root and run the scripts:
/orashared/oraInventory/orainstRoot.sh
/orashared/product/10.2.0/crs/root.sh
The second script formats the voting device and enables the CRS daemons on each node. Entries are put in /etc/inittab so that the daemons run at boot time. On all but the first node, the messages:
EXISTING Configuration detected...
NO KEYS WERE Written.....
are correct and expected.
Please read the following carefully: On the last node only, the second script tries to configure Oracle's Virtual IPs. There is a known bug in the SPARC version: if your public net addresses are in the known non-Internet-routable ranges (10, 172.16 through 172.31, 192.168), this part fails right here. If you get the "The given interface(s) ... is not public" message, then, before you continue, as root (on that one node with the error):
Set your DISPLAY (for example, DISPLAY=machine:#; export DISPLAY)
Run /orashared/product/10.2.0/crs/bin/vipca
Use Table 9-3 to respond, and when you are done, RETURN HERE.
When you have run to completion on all nodes, including running vipca by hand if you need to, click OK on the Execute Configuration Scripts dialog.
Configuration Assistants - Let them run to completion (nothing to click).
Dialog: End of Installation
Action: Click Exit and confirm.
Table 9-3 VIPCA Dialog Actions

Dialog: Welcome
Action: Click Next.

Dialog: Network Interfaces
Action: Verify that all (or both) of your public network adapters are selected. Click Next.

Dialog: Virtual IPs for Cluster Nodes
Action: For each node, enter the nodename-vip name that you created in your hosts files previously. When you press TAB, the form might automatically fill in the IP addresses and the information for the other nodes. If not, fill in the form manually. Verify that the netmasks are correct. Click Next.

Dialog: Summary
Action: Verify, and click Finish.

Dialog: Configuration Assistant Progress Dialog / Configuration Results
Action: Confirm that the utility runs to 100%, and click OK.
Table 9-4 Install Oracle Database Software Dialog Actions

Dialog: Welcome
Action: Click Next.

Dialog: Select Installation Type
Action: Select the Custom radio button, and click Next.

Dialog: Specify Home Details
Action: Verify, especially that the destination path is /orashared/product/10.2.0/db_1. Click Next.

Dialog: Specify Hardware Cluster Installation Mode
Action: Verify that the Cluster Installation radio button is selected. Put check marks next to all of your selected cluster nodes. Click Next.

Dialog: Available Product Components
Action: Deselect the following components (to speed up the installation):
Enterprise Edition Options
Oracle Enterprise Manager Console DB 10.2.0.1.0
Oracle Programmer 10.2.0.1.0
Oracle XML Development Kit 10.2.0.1.0
Click Next.
Dialog: Product Specific Prerequisite Checks
Action: Most likely, these checks will all succeed, and you will be moved automatically to the next screen without having to touch anything. If you happen to get a warning, click Next, and if you get a pop-up warning window, click Yes to proceed.

Dialog: Privileged Operating System Groups
Action: Verify that dba is listed in both entries, and click Next.

Dialog: Create Database
Action: Verify that Create a Database is selected, and click Next.

Dialog: Summary
Action: Verify, and click Install.

Dialog: Oracle Net Configuration Assistant Welcome
Action: Verify that the Perform Typical Configuration check box is not selected, and click Next.

Dialog: Listener Configuration, Listener Name
Action: Verify that the Listener name is LISTENER, and click Next.

Dialog: Select Protocols
Action: Verify that TCP is among the Selected Protocols, and click Next.

Dialog: TCP/IP Protocol Configuration
Action: Verify that the Use the Standard Port Number of 1521 radio button is selected, and click Next.

Dialog: More Listeners
Action: Verify that the No radio button is selected, and click Next (be patient with this one; it takes a while to move on).

Dialog: Listener Configuration Done
Action: Click Next.

Dialog: Naming Methods Configuration
Action: Verify that the No, I Do Not Want to Configure Additional Naming Methods radio button is selected, and click Next.

Dialog: Done
Action: Click Finish.

Dialog: DBCA Step 1: Database Templates
Action: Select the General Purpose radio button, and click Next.

Dialog: DBCA Step 2: Database Identification
Action: Type sun in the Global Database Name text field (notice that your keystrokes are echoed in the SID Prefix text field), and click Next.

Dialog: DBCA Step 3: Management Options
Action: Verify that Enterprise Manager is not available (you eliminated it when you installed the database software), and click Next.
Dialog: DBCA Step 4: Database Credentials
Action: Verify that the Use the Same Password for All Accounts radio button is selected. Enter cangetin as the password, and click Next.

Dialog: DBCA Step 5: Network Configuration
Action: Verify that the Register this database with all the listeners radio button is selected. Click Next.

Dialog: DBCA Step 6: Storage Options
Action: Verify that the Cluster File System radio button is selected. Click Next.

Dialog: DBCA Step 7: Database File Locations
Action: Select Use Database File Locations from Template. Click Next.

Dialog: DBCA Step 8: Recovery Configuration
Action: Make sure all the boxes are unchecked (uncheck manually if necessary), and click Next.

Dialog: DBCA Step 9: Database Content
Action: Uncheck Sample Schemas, and click Next.

Dialog: DBCA Step 10: Database Services
Action: Click Next.

Dialog: DBCA Step 11: Initialization Parameters
Action: On the Memory tab, verify that the Typical radio button is selected. Change the percentage to a ridiculously small number (1%). Click Next and accept the error telling you the minimum memory required. The percentage will automatically be changed on your form. Click Next.

Dialog: DBCA Step 12: Database Storage
Action: Verify that the database storage locations are correct by clicking leaves in the file tree in the left pane and examining the values shown in the right pane. These locations are files under $ORACLE_BASE (/orashared). Click Next.
Dialog: DBCA Step 13: Creation Options
Action: Verify that the Create Database check box is selected, and click Finish.

Dialog: Summary
Action: Verify, and click OK (the database is built; this can take anywhere from 12 minutes to an hour depending on the age of your hardware). Click Exit. Wait a while (anywhere from a few seconds to a few minutes; be patient) and you will get a pop-up window informing you that Oracle is starting the RAC instances.

Dialog: End of Installation
Action: On each of your selected nodes, open a terminal window as root, and run the script /orashared/product/10.2.0/db_1/root.sh. Accept the default path name for the local bin directory. Click OK in the script prompt window.
Task 8 Create Sun Cluster Resources to Control Oracle RAC Through CRS
Perform the following steps on only one of your selected nodes to create Sun Cluster resources that monitor your RAC storage, and that allow you to use Sun Cluster to control your RAC instances through CRS:

1. Register the resource types required for RAC storage and RAC instance control:

# clrt register crs_framework
# clrt register ScalMountPoint
# clrt register scalable_rac_server_proxy

2. Create a CRS framework resource (the purpose of this resource is to try to cleanly shut down CRS if you are evacuating the node using cluster commands):
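The commands for step 2 fell at a page break and are missing from the source. The following is a plausible sketch only, assuming the group and resource names (rac-framework-rg, rac-framework-res) that later steps reference in their RG_affinities and Resource_dependencies settings:

```shell
# Hypothetical reconstruction -- names are inferred from later steps,
# not taken from the original text.
clrg create -n node1,node2 \
    -p Desired_primaries=2 \
    -p Maximum_primaries=2 \
    rac-framework-rg

clrs create -g rac-framework-rg -t crs_framework \
    rac-framework-res

clrg online -M rac-framework-rg
```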
3. Create a group to hold the resource to monitor the shared storage:
# clrg create -n node1,node2 \
-p Desired_primaries=2 \
-p Maximum_primaries=2 \
-p RG_affinities=++rac-framework-rg \
rac-storage-rg

4. Create the resource to monitor the shared file system:

# clrs create -g rac-storage-rg -t ScalMountPoint \
-p filesystemtype=s-qfs \
-p mountpointdir=/orashared \
-p targetfilesystem=qfsorashared \
-p Resource_dependencies=qfsmeta-res \
rac-storage-res

5. Bring the resource group that monitors the shared storage online.

6. Create a group and a resource to allow you to run cluster commands that control the database through CRS.
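Steps 5 and 6 give no explicit commands. Assuming the same clrg syntax used elsewhere in this exercise, they might look like the following sketch (the rac-control-rg creation is an assumption; the cr_rac_control script in the next step uses that group without showing how it was created):

```shell
# Step 5: bring the storage-monitoring group online on all nodes.
clrg online -M rac-storage-rg

# Step 6 (hypothetical): create the control group that the
# cr_rac_control script assumes already exists.
clrg create -n node1,node2 \
    -p Desired_primaries=2 \
    -p Maximum_primaries=2 \
    -p RG_affinities=++rac-framework-rg \
    rac-control-rg
```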
Note: For the following command, make sure you understand which node has node ID 1, which has node ID 2, and so forth, so that you match them correctly with the names of the database sub-instances. You can run clinfo -n on each node to verify its node ID.
# vi /var/tmp/cr_rac_control clrs create -g rac-control-rg -t scalable_rac_server_proxy \ -p DB_NAME=sun \ -p ORACLE_SID{name_of_node1}=sun1 \ -p ORACLE_SID{name_of_node2}=sun2 \ -p ORACLE_home=/orashared/product/10.2.0/db_1 \ -p Crs_home=/orashared/product/10.2.0/crs \ -p Resource_dependencies_offline_restart=rac-storage-res{local_node} \ -p Resource_dependencies=rac-framework-res \ rac-control-res # sh /var/tmp/cr_rac_control # clrg online -M rac-control-rg
7. On either node, reenable the instance through the Sun Cluster resource:

# clrs enable -n node2 rac-control-res

8. On that (affected) node, you should be able to repeat step 5. It might take a few attempts before the database is initialized and you can successfully access your data.

9. Cause a crash of node 1:

# <Control-]>
telnet> send break
On the surviving node, you should see (after 45 seconds or so) the CRS-controlled virtual IP for the crashed node migrate to the surviving node. Run the following as user oracle:

$ crs_stat -t | grep vip
ora.node1.vip   application   ONLINE   ONLINE   node2
ora.node2.vip   application   ONLINE   ONLINE   node2
10. While this virtual IP has failed over, verify that there is actually no failover listener controlled by Oracle CRS. This virtual IP fails over merely so that a client quickly gets a TCP disconnect without having to wait for a long time-out. Client software then has a client-side option to fail over to the other instance.

$ sqlplus SYS@sun1 as sysdba
SQL*Plus: Release 10.2.0.1.0 - Production on Tue May 24 10:56:18 2005
Copyright (c) 1982, 2005, ORACLE. All rights reserved.
Enter password:
ERROR:
ORA-12541: TNS:no listener
Enter user-name: ^D

11. Boot the node that you had halted by typing boot or go at the ok prompt on the console. If you choose the latter, the node will panic and reboot.

12. After the node boots, monitor the automatic recovery of the virtual IP, the listener, and the database instance by typing the following as user oracle on the surviving node:

$ crs_stat -t
$ /usr/cluster/bin/clrs status

It can take several minutes for the full recovery. Keep repeating the steps.

13. Verify the proper operation of the Oracle database by contacting the various sub-instances as the user oracle on the various nodes:

$ sqlplus SYS@sun1 as sysdba
Enter password: cangetin
SQL> select * from mytable;
SQL> quit
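The recovery checks in step 12 must be rerun until everything is back online. A small convenience loop, not part of the original lab and assuming a standard Solaris shell, reruns both status commands every 30 seconds:

```shell
# Repeat the recovery checks until you interrupt with Ctrl-C.
while true; do
    crs_stat -t
    /usr/cluster/bin/clrs status
    sleep 30
done
```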
Exercise Summary
Discussion

Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercises.
Manage the discussion based on the time allowed for this module, which was provided in the About This Course module. If you do not have time to spend on discussion, then just highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students what their overall experiences with this exercise have been. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Have students articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.