Sun Microsystems, Inc. UBRM05-104, 500 Eldorado Blvd., Broomfield, CO 80021 U.S.A. Revision A
Copyright 2007 Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved. This product or document is protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers. Sun, Sun Microsystems, the Sun logo, iPlanet, Java, JumpStart, OpenBoot, Solaris, Solaris JumpStart, Solstice DiskSuite, Sun BluePrints, Sun Enterprise, Sun Java, Sun Enterprise NetBackup, SunPlex, Sun StorEdge, and Sun Enterprise Server 250 Internal RAID Storage Option are trademarks or registered trademarks of Sun Microsystems, Inc., in the U.S. and other countries. Netscape and the Netscape logo are trademarks or registered trademarks of Netscape Communications Corporation in the United States and in other countries. UNIX is a registered trademark in the U.S. and other countries, exclusively licensed through X/Open Company, Ltd. The OPEN LOOK and Sun Graphical User Interface was developed by Sun Microsystems, Inc., for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and otherwise comply with Sun's written license agreements. ORACLE is a registered trademark of Oracle Corporation. Federal Acquisitions: Commercial Software - Government Users Subject to Standard License Terms and Conditions. Export Laws: Products, Services, and technical data delivered by Sun may be subject to U.S. export controls or the trade laws of other countries.
You will comply with all such laws and obtain all licenses to export, re-export, or import as may be required after delivery to You. You will not export or re-export to entities on the most current U.S. export exclusions lists or to any country subject to U.S. embargo or terrorist controls as specified in the U.S. export laws. You will not use or provide Products, Services, or technical data for nuclear, missile, or chemical biological weaponry end uses. DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS, AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID. THIS MANUAL IS DESIGNED TO SUPPORT AN INSTRUCTOR-LED TRAINING (ILT) COURSE AND IS INTENDED TO BE USED FOR REFERENCE PURPOSES IN CONJUNCTION WITH THE ILT COURSE. THE MANUAL IS NOT A STANDALONE TRAINING TOOL. USE OF THE MANUAL FOR SELF-STUDY WITHOUT CLASS ATTENDANCE IS NOT RECOMMENDED. Export Commodity Classification Number (ECCN) assigned: 14 November 2005
Please Recycle
Table of Contents
About This Course ........ Preface-xvii
  Course Goals ........ Preface-xvii
  Course Map ........ Preface-xviii
  Topics Not Covered ........ Preface-xix
  How Prepared Are You? ........ Preface-xx
  Introductions ........ Preface-xxi
  How to Use Course Materials ........ Preface-xxii
  Conventions ........ Preface-xxiii
    Icons ........ Preface-xxiii
    Typographical Conventions ........ Preface-xxiv
    Additional Conventions ........ Preface-xxv
  Before You Begin: Course Setup ........ Preface-xxvi
    Preparation ........ Preface-xxvi
    Task 1 - Defining the Hardware and Software Components of Your Clusters ........ Preface-xxvii
    Task 2 - Verifying Installation and Configuration Information ........ Preface-xxvii
    Task 3 - Running the Setup Script on Your Cluster ........ Preface-xxviii
    Task 4 - Reviewing Cluster Architecture (PIC FROM 345 of Hardware) ........ Preface-xxviii

Upgrades in the Sun Cluster Environment ........ 1-1
  Objectives ........ 1-1
  Relevance ........ 1-2
  Additional Resources ........ 1-3
  Introduction to Upgrades in the Sun Cluster Environment ........ 1-4
  Sun Cluster Component Relationships ........ 1-5
    Upgrading the OS in the Sun Cluster Environment ........ 1-5
    Procedure for Upgrading the OS (Non-Live Upgrade) ........ 1-6
    Upgrading the Volume Manager Software in the Sun Cluster Environment ........ 1-6
    Upgrading Applications in the Sun Cluster Environment ........ 1-7
    The Upgrade Scenario for This Course ........ 1-8
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
  Ordering the Upgrades of Cluster Components ........ 1-9
  Introduction to Sun Cluster 3.2 Upgrade Strategies ........ 1-10
    Traditional Upgrades (Without Dual-Partition or Live Upgrade) ........ 1-10
    Dual-Partition Upgrades ........ 1-11
    Live Upgrade ........ 1-15
    Comparison of Upgrade Strategies: Application Downtime and Total Time to Perform the Upgrade ........ 1-17
  The Live Upgrade Process ........ 1-18
    Creating a Boot Environment ........ 1-20
    Upgrading a Boot Environment ........ 1-23
    Activating a Boot Environment ........ 1-26
    Synchronizing Files ........ 1-27
  Upgrading the VxVM Software ........ 1-28
    Removing the Previous VxVM From the Alternate Boot Environment ........ 1-28
    Using the pkgadd Utility to Install VxVM 5.0 Software ........ 1-29
    Upgrading Disk Groups ........ 1-30
  Exercise: Upgrading the Solaris OS and VxVM Software ........ 1-31
    Preparation ........ 1-31
    Task 1 - Verifying That Your Cluster Is Operating Correctly ........ 1-31
    Task 2 - Installing the Solaris 10 Live Upgrade Software and Partitioning the Target Disk ........ 1-33
    Task 3 - Creating the New Boot Environment as a Clone of the Original Root Disk ........ 1-34
    Task 4 - Removing VxVM 4.0 From the New Boot Environment ........ 1-35
    Task 5 - Upgrading to Solaris 10 OS in the New Boot Environment ........ 1-36
    Task 6 - Adding VxVM 5.0 to the New Boot Environment ........ 1-36
  Exercise Summary ........ 1-38

Upgrading the Sun Cluster Software and Completing Sun Cluster Upgrades ........ 2-1
  Objectives ........ 2-1
  Relevance ........ 2-2
  Additional Resources ........ 2-3
  Upgrading the Cluster Software (Non-Live Upgrades) ........ 2-4
    Upgrading the Shared Components (Non-Live Upgrades) ........ 2-4
    Upgrading the Sun Cluster Software Framework (Non-Live Upgrades) ........ 2-5
    Upgrading Sun-Supported Data Services ........ 2-10
  Managing Dual-Partition Upgrades (Non-Live Upgrade) Using scinstall ........ 2-14
    Applying Changes to the First Partition (Initiating the Flop-Over Menu Option 4) ........ 2-17
    Upgrading and Applying Changes to the Second Partition ........ 2-17
  Upgrading the Cluster Software (Live Upgrade) ........ 2-18
    Upgrading Java ES Shared Components (Live Upgrade) ........ 2-18
    Upgrading Sun Cluster Framework Packages (Live Upgrade) ........ 2-19
    Upgrading Sun Cluster Data Service Packages (Live Upgrade) ........ 2-19
    Booting Into the New Cluster (Live Upgrade) ........ 2-19
    Live Upgrade With Dual-Partition Rolling Reboot (Experimental) ........ 2-20
  Reviewing Sun Cluster Software Upgrade Issues (All Methods) ........ 2-21
  Examining Resource Types and Resource Upgrades (Post Cluster-Upgrade) ........ 2-22
    Identifying Resource Type Upgrade Criteria ........ 2-22
    Naming Resource Types ........ 2-22
    Performing the Resource Type Upgrade ........ 2-23
    Viewing an Example Resource Type Upgrade ........ 2-24
    Examining Resource Type Upgrade Issues ........ 2-26
  Exercise: Upgrading the Sun Cluster Software ........ 2-27
    Task 1 - Provisioning the New Boot Environment ........ 2-28
    Task 2 - Upgrading the Shared Components ........ 2-29
    Task 3 - Upgrading the Sun Cluster Software Framework ........ 2-29
    Task 4 - Upgrading Sun Cluster Software Data Services ........ 2-30
    Task 5 - Running the fixforzones Script ........ 2-30
    Task 6A - Rebooting the Cluster Nodes (Simultaneously; the Official Procedure) ........ 2-30
    Task 6B - Rebooting the Cluster Nodes (Dual-Partition Method; Experimental) ........ 2-31
    Task 7 - Upgrading Type Versions ........ 2-33
    Task 8 - Upgrading Disk Groups (VxVM Only) ........ 2-33
    Task 9 - Verifying Your Cluster Operation ........ 2-34
  Exercise Summary ........ 2-35

Advanced Data Service Configuration ........ 3-1
  Objectives ........ 3-1
  Relevance ........ 3-2
  Additional Resources ........ 3-3
  Introducing Sun Cluster 3.2 Software Data Services ........ 3-4
    Defining a Data Service ........ 3-4
    Data Service Methods and Resources ........ 3-5
    Callback Methods ........ 3-6
    Suspended Resource Groups ........ 3-8
    Resource Type Registration File ........ 3-9
  Writing Sun Cluster 3.x Software Data Services ........ 3-14
    Data Services Built Without DSDL ........ 3-15
    Data Services Built Using the DSDL ........ 3-16
    Process Monitoring Facility (PMF) ........ 3-17
    Behavior of PMF, Action Script, and Fault Monitor With DSDL ........ 3-19
    Details of a DSDL Fault Monitor ........ 3-20
    DSDL Fault Monitor Service Restarts ........ 3-22
    Fault Monitor Initiation of Group Failover ........ 3-22
    DSDL Resource Type Similarities and Variations ........ 3-23
    The Generic Data Service ........ 3-23
    Using Builders to Build a Data Service Skeleton ........ 3-25
  Controlling RGM Behavior Through Properties ........ 3-26
    Controlling Behavior Through Resource Group Properties ........ 3-26
    Advanced Control of the RGM Through Standard Resource Properties ........ 3-28
    Resource Dependencies ........ 3-31
    Cross-Group Dependencies ........ 3-33
    The Implicit_network_dependencies Group Property ........ 3-33
    Resource Group Dependencies ........ 3-34
  Advanced Resource Group Relationships ........ 3-35
    Weak Positive and Negative Affinities ........ 3-35
    Strong Positive Affinities ........ 3-36
    Strong Positive Affinity With Failover Delegation ........ 3-37
    Strong Negative Affinity ........ 3-38
    Example of Complex Affinity Relationships ........ 3-38
  Multimaster and Scalable Applications ........ 3-40
    The Desired_primaries and Maximum_primaries Properties ........ 3-40
    Controlling Load Balancing for Scalable Applications ........ 3-41
    Choosing Client Affinity ........ 3-41
    Strong Client Affinity and Weak Client Affinity ........ 3-42
    Affinity Timeout for Strong Client Affinity ........ 3-42
  Exercise 1: Creating Sun Cluster Software Data Services ........ 3-44
    Preparation ........ 3-44
    Task 1 - Creating a Wrapper Script for Your Application ........ 3-44
    Task 2 - Creating a New Resource Type ........ 3-46
    Task 3 - Installing the New Resource Type ........ 3-48
    Task 4 - Registering the New Resource Type ........ 3-48
    Task 5 - Instantiating a Resource of the New Resource Type ........ 3-48
    Task 6 - Putting the Resource Group Containing the New Resource Online ........ 3-49
    Task 7 - Testing the Fault Monitor for the New Resource Type ........ 3-49
  Exercise 2: Creating a Data Service Using GDS ........ 3-50
    Task 1 - Making a Version of an Application Wrapper Script Suitable for GDS ........ 3-50
    Task 2 - Creating and Enabling a Resource Group ........ 3-51
    Task 3 - Verifying Restart and Failover Behavior of the New Resource ........ 3-51
  Exercise 3: Advanced Resource and Resource Group Control ........ 3-53
    Task 1 - Investigating Cross-Group Dependencies and Restart Dependencies ........ 3-53
    Task 2 - Investigating Resource Group Affinities ........ 3-54
    Task 3 - Modifying a Failover Service failover_mode Property ........ 3-55
  Exercise Summary ........ 3-57

Performing Recovery and Maintenance Procedures ........ 4-1
  Objectives ........ 4-1
  Relevance ........ 4-2
  Additional Resources ........ 4-3
  Adding a Node to an Existing Cluster ........ 4-4
    Redefining a Cluster to Use Switches ........ 4-5
    Cabling the New Node ........ 4-6
    Configuring Solaris OS on the New Node ........ 4-6
    Preparing the Existing Cluster to Accept the New Node ........ 4-6
    Adding a vxio Major Number ........ 4-7
    Installing Sun Cluster Packages on the New Node ........ 4-8
    Creating Mount Points for Global File Systems ........ 4-8
    Checking the did Major Number ........ 4-8
    Running the scinstall Utility on the New Node ........ 4-9
    Managing Quorum Devices ........ 4-9
    Configuring Volume Management on the New Node ........ 4-11
    Adding a New Node to Existing Device Groups ........ 4-11
    Configuring IPMP ........ 4-13
    Preparing the New Node to Run Existing Applications ........ 4-13
    Adding the New Node to Existing Resource Groups ........ 4-14
  Removing a Node From an Existing Cluster ........ 4-15
    Switching Services Off the Node ........ 4-16
    Removing a Node From the Resource Group Nodelist ........ 4-17
    Removing a Node From the Device Group Nodelist ........ 4-17
    Rebooting the Node to Be Removed to Non-Cluster Mode ........ 4-18
    Removing Quorum Votes and Quorum Devices ........ 4-18
    Completing Node Removal (Orderly Removal) ........ 4-19
    Completing Node Removal (Dead Node) ........ 4-19
    Adding Back Quorum Devices ........ 4-19
  Replacing a Failed Node in a Cluster ........ 4-20
    Removing the Node Definition From the Cluster and Adding It Back ........ 4-20
    Using a Well-Managed Archive ........ 4-21
  Uninstalling Sun Cluster Software From a Node ........ 4-22
  Reviewing Disk Replacement Procedures ........ 4-23
    Identifying Individual Disk Drive Failures ........ 4-23
    Identifying Cable or Total Array Failures ........ 4-24
    Reviewing DID Consistency Issues ........ 4-24
    Updating the Physical Disk ID in Device Driver RAM (SCSI Disks) ........ 4-26
    Updating the Physical ID in Device Driver RAM (Fibre-Channel JBOD Disks) ........ 4-27
    Updating the DID Serial Number Information From the Device Driver RAM ........ 4-28
    Examining Disk Replacement and Mirror Fixing ........ 4-30
    Examining Failed Quorum Device Issues ........ 4-32
    Viewing an Example ........ 4-32
  Backing Up and Restoring the CCR ........ 4-34
  Exercise: Performing Maintenance and Recovery Procedures ........ 4-35
    Lab Task Order ........ 4-35
    Preparation ........ 4-35
    Task 1 - Removing a Cluster Node ........ 4-36
    Task 2 - Adding a Node to the Cluster ........ 4-39
    Task 3 - Replacing a Failed Fibre JBOD Drive ........ 4-46
    Task 4 - Replacing a Failed SCSI JBOD Drive ........ 4-48
  Exercise Summary ........ 4-52

Advanced Features (ZFS, QFS, and Zones) ........ 5-1
  Objectives ........ 5-1
  Relevance ........ 5-2
  Additional Resources ........ 5-3
  ZFS as a Failover File System Only ........ 5-4
    ZFS Includes a Volume Management Layer ........ 5-4
    ZFS Removes the Need for /etc/vfstab Entries ........ 5-4
    Example: Creating a Mirrored Pool and Some File Systems ........ 5-4
    ZFS Snapshots ........ 5-6
    HAStoragePlus and ZFS ........ 5-6
  Introducing the Features of the Sun StorEdge QFS File System ........ 5-8
    Features and Benefits of QFS ........ 5-8
    QFS Considerations for the Cluster ........ 5-10
  Configuring a Standard (Non-Shared) QFS File System ........ 5-11
    QFS File System and Component Device Types ........ 5-11
    Creating Underlying Storage Devices ........ 5-12
    Creating the Master Configuration File ........ 5-12
    Creating and Mounting the File System ........ 5-13
    Configuring Standard QFS as a Failover File System in the Cluster ........ 5-15
  Configuring a Shared QFS File System in the Cluster (for Use by Oracle RAC Only) ........ 5-17
    Shared QFS File Systems on Solaris Volume Manager Multiowner Diskset Devices ........ 5-18
    Configuring a Shared QFS File System ........ 5-19
    Creating a Shared QFS File System ........ 5-21
    Creating a Failover Sun Cluster Resource to Control the Metadata Server ........ 5-21
  Resource Group Manager Support for Non-Global Zones ........ 5-22
  Exercise 1: Running a Standard Failover Service on QFS (Optional) ........ 5-31
    Task 1 - Installing the QFS Software on Your Cluster Nodes ........ 5-31
    Task 2a - Adding a Volume on Which to Build a Failover QFS File System (With VxVM) ........ 5-32
    Task 2b - Adding a Volume on Which to Build a Failover QFS File System (With SVM) ........ 5-33
    Task 3 - Preparing a QFS File System Configuration ........ 5-33
    Task 4 - Creating, Mounting, and Switching the QFS File System ........ 5-35
    Task 5 - Migrating Your Oracle Application Data to the QFS File System ........ 5-35
    Task 6 - Rearranging Your Mount Points So That /oracle Is Mounted From the New QFS ........ 5-36
    Task 7 - Reconfiguring Your Cluster Resources to Use the New File System ........ 5-36
  Exercise 2: Configuring a Shared QFS File System (Optional) ........ 5-37
    Task 1 - Installing the QFS Software on Your Cluster Nodes (If Not Already Done in the QFS Failover Lab) ........ 5-38
    Task 2 - Installing RAC Framework Packages for Oracle RAC With SVM Multiowner Disksets ........ 5-38
    Task 3 - Installing the Oracle Distributed Lock Manager ........ 5-38
    Task 4 - Creating and Enabling the RAC Framework Resource Group ........ 5-39
    Task 5 - Adding Volumes on Which to Build a Shared QFS File System ........ 5-40
    Task 6 - Preparing a Shared QFS File System Configuration ........ 5-40
    Task 7 - Creating and Mounting the File System ........ 5-42
    Task 8 - Mounting the File System on Other Node(s) ........ 5-42
    Task 9 - Configuring the Metadata Server as a Failover Resource, and Testing Failover ........ 5-42
  Exercise 3: Running Oracle in Non-Global Zones (Optional) ........ 5-43
    Task 1 - Configuring and Installing the Zones ........ 5-43
    Task 2 - Migrating Oracle to Run in the Zone ........ 5-45
  Exercise 4: Migrating Your Oracle Data to ZFS (Optional) ........ 5-46
  Exercise Summary ........ 5-49

Best Practices ........ 6-1
  Objectives ........ 6-1
  Relevance ........ 6-2
  Additional Resources ........ 6-3
  IPMP Best Practices ........ 6-4
    Using IPMP Hardware Redundancy ........ 6-5
    Using Test Addresses or Link State Testing in Solaris 10 OS ........ 6-5
    Placing the Test IP Address on a Virtual Interface ........ 6-5
    Using the standby Keyword ........ 6-6
    Enabling Failback for IPMP Interfaces ........ 6-7
    Using the deprecated Flag on All Test Interfaces ........ 6-8
    Controlling Test Targets ........ 6-9
  Shared Storage File System Best Practices ........ 6-10
    Using Failover or Global File Systems ........ 6-10
    Configuring the /etc/vfstab File (Traditional Non-ZFS File Systems) ........ 6-11
    Using Affinity Switching ........ 6-12
    Using HAStoragePlus Resources With Scalable Services ........ 6-13
  Volume Management Software Best Practices ........ 6-15
    Managing Boot Disk Mirroring With VxVM or Solaris VM ........ 6-15
    Using VxVM Software to Mirror the Boot Disk ........ 6-15
    Using Solaris VM Software to Mirror the Boot Disk ........ 6-19
  Quorum Device Best Practices ........ 6-21
    Limiting Quorum Votes ........ 6-21
xii
www.chinaitproject.com IT QQ : 3264454 Disk Path Monitoring ............................................................. 6-24 Deciding When to Use a Quorum Server Device.............. 6-27 Best Practices for Campus Clusters ............................................... 6-29 Defining Campus Cluster Topologies................................. 6-30 Reducing the Performance Impact of Campus Clusters .. 6-32 Exercise: Using Best Practices ........................................................ 6-33 Preparation............................................................................... 6-33 Task 1 Mirroring the Boot Disk Using Solaris VM.......... 6-34 Task 2 Encapsulating and Mirroring the Boot Disk Using VxVM Software......................................................... 6-36 Task 3 Verifying IPMP Best Practices ............................... 6-37 Task 4 Implementing Quorum Device Monitoring ........ 6-38 Exercise Summary............................................................................ 6-39 Best Practices for Cluster Security ................................................7-1 Objectives ........................................................................................... 7-1 Relevance............................................................................................. 7-2 Additional Resources ........................................................................ 7-3 Using a Security Policy as a Framework for Decision Making ... 7-4 Developing a Security Policy .................................................. 7-4 Implementing a Security Policy .............................................. 7-4 Identifying Security Vulnerabilities ................................................ 7-6 Minimizing Compared to Hardening the Solaris OS Software................................................................................... 7-6 Securing the Oracle RAC Software Installation.................... 
7-6 Isolating Cluster Interconnects ............................................... 7-7 Disabling Internet Services ...................................................... 7-7 Identifying Sun Cluster 3.2 Software Services..................... 7-8 Securing Console Access.......................................................... 7-9 Securing Node Authentication During Installation............. 7-9 Using the Solaris Security Toolkit Software................................. 7-10 Introducing the Solaris Security Toolkit Software ............. 7-10 Structure of the Toolkit Software.......................................... 7-11 Executing the Toolkit Software............................................. 7-13 Undoing the Toolkit Software Security Modifications...... 7-13 Downloading and Installing Security Software .......................... 7-15 Downloading and Installing the Toolkit Software............. 7-15 Downloading Recommended and Security Patches......... 7-16 Downloading the FixModes Software (Solaris 9 OS Only)............................................................... 7-16 Downloading the MD5 Software (Solaris 9 OS Only) ....... 7-16 Implementing the Toolkit Software Modifications on a Cluster Node ......................................................................... 7-17 Providing Secure Clustered Services ............................................ 7-18 Using Secure NFS and Kerberized NFS............................... 7-18 Securing an LDAP Service..................................................... 7-19
xiii
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
www.chinaitproject.com IT QQ : 3264454 Using the Secure Apache Web Service ................................ 7-19 Using the Secure Sun Java System Web Server Software................................................................................. 7-20 Exercise: Hardening Security With the Toolkit Software .......... 7-21 Preparation............................................................................... 7-21 Task 1 Installing the Toolkit Software on the Selected Node................................................................. 7-21 Task 2 Running the suncluster3x-secure.driver Script ...................................................................................... 7-22 Task 3 Verifying That the Selected Node Is Hardened.................................................................................. 7-22 Task 4 Verifying That the Selected Node Operates Properly in the Cluster ........................................................ 7-23 Task 5 Hardening the Remaining Cluster Nodes (Optional) .............................................................................. 7-23 Task 6 Undoing the Security Modifications on Each Cluster Node.......................................................... 7-24 Exercise Summary............................................................................ 7-25 Examining Troubleshooting Tips ................................................... 8-1 Objectives ........................................................................................... 8-1 Relevance............................................................................................. 8-2 Additional Resources ........................................................................ 8-3 Defining How to Troubleshoot Clustered Services ...................... 8-4 Examining a Generic Troubleshooting Approach ............... 8-4 Triangulating the Causes of Failure ....................................... 
8-5 Defining the Sun Cluster Software Stack .............................. 8-6 Identifying Dependencies Within Layers of the Stack ........ 8-8 Deciding Where to Begin ......................................................... 8-9 Example: Troubleshooting Failure in an HA-Oracle Application ............................................................................. 8-9 Identifying Log Files for Each Layer............................................. 8-11 Identifying Application Log Files......................................... 8-11 Identifying Cluster Framework Log Files .......................... 8-14 Troubleshooting the Software ........................................................ 9-1 Objectives ........................................................................................... 9-1 Relevance............................................................................................. 9-2 Additional Resources ........................................................................ 9-3 Introducing the Troubleshooting Exercises ................................... 9-4 Troubleshooting Self-Induced Problems ............................... 9-4 Troubleshooting Instructor-Induced Problems .................... 9-4 Implementing Disaster Recovery ........................................... 9-5 Exercise 1: Inducing Problems and Observing Reactions............ 9-6 Preparation................................................................................. 9-6 Task 1 Inducing Daemon Failures....................................... 9-6 Task 2 Inducing a Full Root File System............................ 9-9
xiv
www.chinaitproject.com IT QQ : 3264454 Task 3 Setting an Incorrect maxusers Value................... 9-10 Task 4 Inducing Operator Errors ....................................... 9-10 Exercise 2: Troubleshooting Instructor-Induced Problems ....... 9-12 Preparation............................................................................... 9-12 Task 1 Troubleshooting IPMP Errors............................... 9-13 Task 2 Troubleshooting an Unknown State .................... 9-14 Task 3 Troubleshooting a Resource STOP_FAILED State....................................................................................... 9-15 Task 4 Troubleshooting Oracle Software Resource Group Errors........................................................................ 9-16 Task 5 Troubleshooting an Unbootable Cluster Node .. 9-17 Task 6 Troubleshooting oracle_server Resource Fault Monitor Errors........................................................... 9-18 Task 7 Troubleshooting the Failure to Start a Web Server ........................................................................ 9-19 Task 8 Troubleshooting iws-res Resource Failures on One Node........................................................................ 9-20 Exercise 3: Implementing Disaster Recovery............................... 9-21 Hints for Installing New OS on the Failed (New) Node.......................................................................................... 9-21 Hints for Getting the Node Back into the Cluster .............. 9-21 Exercise Summary............................................................................ 9-22 Upgrading Oracle Software ............................................................ A-1 Exercise: Oracle Software Installation and Database Upgrade........................................................................................... A-2 Preparation................................................................................ 
A-2 Task 1 Installing the New Oracle Software ....................... A-3 Task 2 Upgrading the Database.......................................... A-6 Task 3 Configuring the New Network Components...... A-8 Task 4 Changing and Enabling the Resources.................. A-9 Task 5 Verifying That the Oracle Database Upgrade Is Successful ......................................................................... A-10 Exercise Summary........................................................................... A-11 Installing and Configuring Oracle 10gR2 RAC on Shared QFS .. B-1 Exercise 3: Running Oracle 10g RAC in Sun Cluster 3.2 Software...............................................................................................B-2 Preparation................................................................................ B-3 Task 1 Shutting down Failover Oracle Instances ..............B-5 Task 2 Provisioning the Shared QFS File System..............B-5 Task 3 Configuring Oracle Virtual IPs................................ B-6 Task 4 Configuring the oracle User Environment .......... B-7 Task 5 Disabling Access Control on X Server of the Admin Workstation ..................................................................B-7 Task 6 Installing Oracle CRS Software .............................. B-8 Task 7 Installing Oracle Database Software.................... B-13
xv
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
www.chinaitproject.com IT QQ : 3264454 Task 8 Create Sun Cluster Resources to Control Oracle RAC Through CRS..................................................................B-16 Task 9 Verifying That Oracle RAC Works Properly in a Cluster ...............................................................................B-19 Exercise Summary............................................................................B-22
xvi
Preface
- Upgrade the Solaris Operating System (Solaris OS), VERITAS Volume Manager (VxVM), and ORACLE database software
- Upgrade Sun Cluster software
- Build Sun Cluster software data services and perform advanced resource and resource group management
- Perform recovery and maintenance procedures on Sun Cluster software
- Configure the ZFS, QFS, shared QFS, and zone advanced features of the cluster software
- Describe and use Sun Cluster best practices
- Describe and implement best practices for security in the Sun Cluster environment
- Use helpful tips to troubleshoot Sun Cluster software
- Perform Sun Cluster software troubleshooting exercises
Course Map
The following course map enables you to see what you have accomplished and where you are going in reference to the course goals.
Upgrading Software
- Upgrades in the Sun Cluster Environment
- Upgrading the Sun Cluster Software
Troubleshooting
- Examining Troubleshooting Tips
- Troubleshooting the Software
- VERITAS volume management: covered in ES-310: VERITAS Volume Manager Administration
- Sun Cluster 3.0 software administration: covered in ES-333: Sun Cluster 3.0 Administration
- Sun Cluster 3.1 software administration: covered in ES-338: Sun Cluster 3.1 Administration
- Sun Cluster 3.2 software administration: covered in ES-345: Sun Cluster 3.2 Administration
Refer to the Sun Educational Services catalog for specific information and registration.
- Can you perform routine system administration tasks in a Solaris OS environment?
- Can you perform routine volume management tasks in either a VxVM or Solaris Volume Manager (Solaris VM) software environment?
- Can you install and configure Sun Cluster 3.x software?
Introductions
Now that you have been introduced to the course, introduce yourself to the other students and the instructor, addressing the following items:
- Name
- Company affiliation
- Title, function, and job responsibility
- Experience related to topics presented in this course
- Reasons for enrolling in this course
- Expectations for this course
- Goals: You should be able to accomplish the goals after finishing this course and meeting all of its objectives.
- Objectives: You should be able to accomplish the objectives after completing a portion of instructional content. Objectives support goals and can support other higher-level objectives.
- Lecture: The instructor presents information specific to the objective of the module. This information helps you learn the knowledge and skills necessary to succeed with the activities.
- Activities: The activities take various forms, such as an exercise, self-check, description, and demonstration. Activities help you master an objective.
- Visual aids: The instructor might use several visual aids to convey a concept, such as a process, in a visual form. Visual aids commonly contain graphics, animation, and video.
Conventions
The following conventions are used in this course to represent various training elements and alternative learning resources.
Icons
Additional resources Indicates other references that provide additional information on the topics described in the module.
Discussion Indicates that a small-group or class discussion on the current topic is recommended at this time.
Note: Indicates additional information that can help you but is not crucial to your understanding of the concept being described. You should be able to understand the concept or complete the task without this information. Examples of notational information include keyword shortcuts and minor system adjustments.

Caution: Indicates that there is a risk of personal injury from a non-electrical hazard, or a risk of irreversible damage to data, software, or the operating system. A caution indicates that a hazard is possible (as opposed to certain), depending on the action of the user.
Typographical Conventions
Courier is used for the names of commands, files, directories, programming code, and on-screen computer output; for example:

Use ls -al to list all files.
system% You have mail.

Courier is also used to indicate programming constructs, such as class names, methods, and keywords; for example:

The getServletInfo method is used to get author information.
The java.awt.Dialog class contains the Dialog constructor.

Courier bold is used for characters and numbers that you type; for example:

To list the files in this directory, type the following:
# ls

Courier bold is also used for each line of programming code that is referenced in a textual description; for example:

1 import java.io.*;
2 import javax.servlet.*;
3 import javax.servlet.http.*;

Notice the javax.servlet interface is imported to allow access to its life cycle methods (Line 2).

Courier italic is used for variables and command-line placeholders that are replaced with a real name or value; for example:

To delete a file, use the rm filename command.

Courier italic bold is used to represent variables whose values are to be entered by the student as part of an activity; for example:

Type chmod a+rwx filename to grant read, write, and execute rights for filename to world, group, and users.

Palatino italic is used for book titles, new words or terms, or words that you want to emphasize; for example:

Read Chapter 6 in the User's Guide. These are called class options.
Additional Conventions
Java programming language examples use the following additional conventions:
Method names are not followed with parentheses unless a formal or actual parameter list is shown; for example: The doIt method... refers to any method called doIt. The doIt() method... refers to a method called doIt that takes no arguments.
Line breaks occur only where there are separations (commas), conjunctions (operators), or white space in the code. Broken code is indented four spaces under the starting code.

If a command used in the Solaris OS is different from a command used in the Microsoft Windows platform, both commands are shown; for example:

If working in the Solaris OS:
$CD SERVER_ROOT/BIN

If working in Microsoft Windows:
C:\>CD SERVER_ROOT\BIN
- Task 1: Define the hardware and software components of your clusters
- Task 2: Verify installation and configuration information
- Task 3: Run the setup script on your cluster
- Task 4: Review cluster architecture
Preparation
You will be running some scripts to set up your cluster. Each group will begin with a two-node cluster. The characteristics of the cluster are:
- Running Solaris 9 9/04 (Update 7) OS and Sun Cluster 3.1 9/04 (Update 3)
- Running HA-Oracle 9i as a failover service with a failover file system
- Running Sun Java System Web Server as a scalable (load-balanced) service
- Your group's choice of VERITAS Volume Manager (VxVM 4.0) or Solaris Volume Manager
If you have a third node you can also use the script to perform a Flash upgrade to Solaris 10, so that it is ready to be added to the cluster after the other nodes have already been upgraded in Modules 1 and 2. You can do this at your leisure.
Node Name        ora-lh IP        iws-lh IP
_______________  _______________  _______________
_______________  (Same as above)  (Same as above)
Each group needs two separate logical host addresses (ora-lh for Oracle and iws-lh for the web server). They need not yet be entries in the hosts file on your nodes; the script will add the entries. In the RLDC environment, existing entries on the vnchost may suggest which IP addresses to use. Consult your instructor.

Make sure you enter the transport adapter information in the correct order (the first one you enter on node one should be attached to the first one you enter on node two), and that you enter the same logical host addresses on each node. The script will not check for this.

Make sure you enter the same choice of volume manager on each node (the script will not check for this). For VxVM, the script will guide you through entering license information.
To verify that the root disk has been properly partitioned, type:

# df -k
# swap -l
Notes
The vnchost is intended to be your display host inside the RLDC. Each student can have a separate VNC session, used for the web browser to iws-lh and for the xclock graphics in the data services module. Note that the display number will be different for different students for the xclock material (for example, vnchost:3).
Module 1
- Describe high availability issues when performing upgrades in the Sun Cluster environment
- Describe the required relationships for upgrading the Sun Cluster software
- Describe the different upgrade strategies
- Describe and perform an upgrade of the Solaris Operating System using the Solaris Live Upgrade software
- Upgrade the VERITAS Volume Manager (VxVM) software into the Solaris Live Upgrade environment
Caution Scripts must be run to set up the initial state of your clusters. Your instructor may have done this on behalf of the entire class, or you may be launching the scripts for your own cluster. Please refer to Before You Begin: Course Setup on page xxvi of the About this Course section. These scripts must be launched before you begin this module so that your clusters are ready for the exercises at the end of the module.
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics described in this module. Although they are not expected to know the answers to these questions, the answers should interest them and inspire them to learn the material.
- What are the relationships and interdependencies among OS upgrades, cluster release upgrades, and volume manager upgrades?
- Which methods are available for upgrading the Solaris OS? Which methods give you the least downtime for your applications? Which are the safest?
- When must you upgrade the VxVM software?
Additional Resources
Additional resources The following references provide additional information on the topics described in this module:
- Man page for the live_upgrade(5) command
- Sun Microsystems, Inc. Solaris 10 Installation Guide: Solaris Live Upgrade and Upgrade Planning, part number 817-5505
- Sun Microsystems, Inc. Sun Cluster Software Installation Guide for Solaris OS, part number 819-2970
- ORACLE Corporation, ORACLE Technology Network. Installing the Oracle9i Database. 2001. [Online] Available at http://otn.oracle.com
- Symantec Software Corporation. VxVM software documentation, VERITAS Storage Foundation 5.0 Installation Guide
This module describes:

- The complicated relationships that exist in the Sun Cluster environment between cluster software versions, operating system versions, volume manager versions, and application software versions
- The variety of upgrade strategies that exist when upgrading specifically to Sun Cluster 3.2

The following are strategies for minimizing application downtime, and can be used for upgrades to Sun Cluster 3.2 from any previous revision of Sun Cluster 3.0 or Sun Cluster 3.1:
- You may be driven by a desire to upgrade the OS, but find that you must also upgrade your cluster version to support your OS. For example, you may be standardizing your enterprise on the Solaris 10 OS, and want to upgrade your cluster from Solaris 9 to Solaris 10. But in order to run Solaris 10, you have to upgrade to Sun Cluster 3.1 Update 4 (8/05) or Sun Cluster 3.2, as no lower version supports the Solaris 10 OS.
- You may be driven by a desire to upgrade the cluster version, but find that the new cluster version no longer supports your old OS. For example, you may want to upgrade to Sun Cluster 3.2. Sun Cluster 3.2 supports only Solaris 9 Update 8 and above, and Solaris 10 Update 3 and above. If you are running any other update, or any revision of Solaris 8, you will have to upgrade your OS in order to run Sun Cluster 3.2.
- You may want to do a major OS upgrade (Solaris 8 to 9, for example), and have no intention of upgrading the cluster version. However, any major OS upgrade (not just an update revision upgrade) implies that you have to do a cluster framework upgrade, even if you are staying on the same update of the cluster. Sun Cluster 3.1 Update 3 for Solaris 8, for example, is different cluster framework software than Sun Cluster 3.1 Update 3 for the Solaris 9 OS.
- Considering the previous item, any major OS upgrade must be done before the corresponding Sun Cluster framework upgrade.
- If you are upgrading to Sun Cluster 3.2, the dual-partition upgrade and live upgrade features let you perform your entire upgrade, including OS upgrades, with very little downtime.
You need to comment out any global file systems from the vfstab file before the upgrade, except for the /global/.devices/node@# file system (you can leave that one). Since you are also upgrading the cluster after the OS upgrade, you will need to boot your new OS into non-clustered mode the very first time. Therefore, make sure you choose not to let the upgrade procedure reboot for you after the upgrade, so that you can boot the new OS manually into non-clustered mode.
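The vfstab edit can be scripted. The sketch below (the device paths and mount points are invented examples, and the edit is shown on a scratch copy rather than on the live /etc/vfstab) comments out every global mount except the /global/.devices/node@# entry:

```shell
# Demonstrate on a scratch copy; on a real node you would edit /etc/vfstab
# after saving a backup to restore when the upgrade is complete.
tmp=$(mktemp -d)
cat > "$tmp/vfstab" <<'EOF'
/dev/dsk/c0t0d0s0 /dev/rdsk/c0t0d0s0 / ufs 1 no -
/dev/md/dsk/d30 /dev/md/rdsk/d30 /global/.devices/node@1 ufs 2 no global
/dev/md/webdg/dsk/d100 /dev/md/webdg/rdsk/d100 /global/web ufs 2 yes global,logging
EOF

# Comment out /global mounts, but leave the /global/.devices/node@# line alone.
awk '/\/global\// && $0 !~ /\/global\/\.devices\/node@/ { print "#" $0; next }
     { print }' "$tmp/vfstab" > "$tmp/vfstab.new"

cat "$tmp/vfstab.new"
```

After the cluster upgrade is complete and the nodes are booted back into cluster mode, restore the saved copy so that the global mounts return.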
You may need to upgrade your VxVM version because you want to upgrade your OS. For example, if you want to upgrade to the Solaris 10 OS, you are required to upgrade to a minimum of VxVM 4.1.
You may need to upgrade VxVM because you want to upgrade the cluster revision. For example, you may be running Solaris 9 with VxVM 4.0, which is supported in Sun Cluster 3.1. But if you want to upgrade to Sun Cluster 3.2, this revision of VxVM is not supported. If you want to do a major OS upgrade (Solaris 9 to Solaris 10, for example), and do not intend to upgrade VxVM versions, you might still need to remove the VxVM packages and add them back on the new OS, so that the correct drivers get loaded.
There is a strong relationship between the OS version and the versions of applications supported (for example, Solaris 10 no longer supports Oracle Server 8i). This will affect your decision about whether to upgrade the OS. If you are not upgrading your OS but only your cluster revision, your existing applications should, in general, still be supported by the new cluster agents, but you need to check to make sure. It is conceivable that the dual-partition feature could even let you accomplish an application upgrade with very brief downtime, but only if both of the following are true:
- You installed separate application binaries locally on each node, so that you can upgrade one while the other is still running.
- The data itself can run equally well under the old and new software revisions (that is, you do not need to upgrade the data).
While the procedure to upgrade the actual application software may be the same inside and outside the cluster, you may have to update properties of your cluster resources to get the new version running in the cluster. For example, many cluster resources have properties that point to the directory that contains application binaries or configuration files.
Starting configuration:

- Solaris 9 OS 9/04 (Update 7)
- Sun Cluster 3.1 9/04 (Update 3)
- Optionally, VxVM 4.0 (or you could have chosen Solaris VM)
- Running Oracle Server 9i as a failover service
- Running Sun Java System Web Server 6.1 as a scalable service

Target configuration:

- Solaris 10 11/06 (Update 3)
- Sun Cluster 3.2
- VxVM 5.0 (if that is your volume manager of choice)
You will have to balance the goals of minimizing application downtime and minimizing the total time it takes to do the upgrade. We will take advantage of the Live Upgrade software's ability to upgrade all nodes simultaneously, while keeping the original cluster completely operational during the upgrade. If you want to experimentally combine the dual-partition upgrade and Live Upgrade in this lab, you can also experience the minimum possible application downtime. You may choose to upgrade Oracle to Oracle 10gR2. The procedure for this is in an optional lab in Appendix A.
Note: The initial setup has the Oracle binaries placed in the shared storage. Therefore, you will not be able to minimize downtime related to the upgrade of the Oracle software itself. If you choose to upgrade Oracle, you will have to take the whole application offline for the duration of the Oracle upgrade.
Application upgrade
- OS upgrades are made in the traditional way (booted from the new OS medium).
- Cluster software upgrades are made afterwards, and you must boot the new OS into non-cluster mode to do them.
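On SPARC hardware, booting a node into non-cluster mode is done from the OpenBoot PROM; as a minimal sketch:

```
ok boot -x
```

The -x option boots the node outside the cluster, so that the cluster framework upgrade can be performed.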
It would seem that you could leave some nodes up running the old cluster and OS version, as you perform OS and cluster upgrades on some other nodes. This assumption would be correct.

It would also seem, in the traditional upgrade, that you could take down the first set of nodes, do the upgrade, take down the second set of nodes, and immediately boot the first set into the new upgraded cluster environment. This assumption would be incorrect, because you would still be bound by the regular cluster rules that prevent cluster amnesia. In other words, you cannot boot any nodes into the new cluster environment until they are all upgraded.

Upgrading some nodes first may still be more robust: if something goes wrong, you still have your application running on the old nodes. But it will not reduce the total downtime. The downtime required for the traditional upgrade is the time to completely upgrade the entire OS and cluster software.
- You can use the dual-partition mechanism for all upgrades from any version of Sun Cluster 3.0 or 3.1 to Sun Cluster 3.2.
- You can use the dual-partition mechanism regardless of whether or not you are also upgrading your OS.
Figure 1-1
Introduction to Sun Cluster 3.2 Upgrade Strategies

The main difference between rolling and dual-partition upgrades lies in the transition of applications. In a rolling upgrade, as demonstrated in Figure 1-2, the nodes in the first partition join the cluster with the nodes of the second partition. Normal application switchovers can then occur, driven by commands issued manually:
Figure 1-2
In the dual-partition upgrade, the nodes of the first partition boot the new software but never join the cluster with the second-partition nodes, as illustrated in Figure 1-3:
Figure 1-3
The rest of the upgrade is very similar in the two scenarios. The second partition is upgraded in non-cluster mode, and then the nodes can join the cluster to complete the cluster upgrade.
Live Upgrade
Live Upgrade software is a way to clone your entire boot environment and then apply upgrades to the clone only, while your original software versions continue to run on your original, non-upgraded boot disks. This is illustrated in Figure 1-4:
Figure 1-4
Beginning with upgrades to Sun Cluster 3.2, from any previous version of Sun Cluster 3.0 or 3.1, the live upgrade mechanism can support the upgrade of any or all of the following components directly onto the new boot disk, while the old software versions are still running:
- Solaris OS
- VERITAS Volume Manager
- Sun Cluster framework and data services
The Live Upgrade strategy has the following advantages over any other upgrade strategy:
- It is the only upgrade mechanism that both minimizes application downtime and lets all of the nodes be upgraded simultaneously. You may, for example, have a restricted window for completing a cluster upgrade in its entirety. If you use the dual-partition strategy without the live upgrade strategy, you can minimize application downtime, but you cannot minimize the amount of time it takes to upgrade the entire cluster.
- It is the only upgrade mechanism that makes it completely trivial to back out of the upgrade and restore the original operations on the entire cluster. Since you never upgrade the original boot disk at all, you can always go back, boot the original boot disk, and start over again.
Caution: If you have completed the entire cluster upgrade, and have upgraded VxVM disk group version numbers, then you may not be able to return to your original cluster running an older version of VxVM.

For the original release of Sun Cluster 3.2, the Live Upgrade and dual-partition upgrade mechanisms are not supported together. You might think that if you were doing a live upgrade there would be no need for a dual-partition upgrade, since you can complete the actual upgrading on all nodes while still running the original software versions, and then reboot all nodes into the cluster. However, the amount of application downtime required to reboot all nodes, especially when upgrading to the Solaris 10 OS, can be significantly more than is required in the dual-partition upgrade strategy.
Comparison of Upgrade Strategies: Application Downtime and Total Time to Perform the Upgrade
The following table summarizes the course author's experience with the different upgrade strategies, comparing application downtime against the total time to perform the upgrade. The cluster being upgraded is a two-node cluster running the exact same scenario as is presented in this course. The recorded application downtime ranged from about two minutes to about three hours, depending on the strategy. The strategies compared are:

Traditional (upgrade some nodes first, leaving other nodes up in the cluster; safer)
Traditional (upgrade all nodes at the same time; faster)
Dual partition
Live Upgrade (reboot all nodes at once into the new cluster)
Live Upgrade (using dual partition to achieve a rolling reboot)
Boot Environments
The Solaris Live Upgrade software feature of the Solaris OS enables you to maintain multiple operating system images of a single system. An image, or boot environment, represents a set of operating system and application software packages. Different BEs might contain different operating system and application versions. A single system can have multiple BEs, but only one of them is the active BE; all other BEs are inactive.
Software Installation
You must install the Solaris Live Upgrade software from the OS version that you will upgrade to in order to get full upgrade functionality from Live Upgrade. The Solaris 10 distribution contains an installer called liveupgrade20 in the Solaris_10/Tools/Installers directory. It installs the requisite SUNWluu and SUNWlur packages. Solaris 10 also has a SUNWluzone package, used for upgrades of systems with zones starting from Solaris 10. It does no harm to install this package, even if you do not use it.
Command Summary
Table 1-1 describes some important Solaris Live Upgrade software commands. A complete list is described in the live_upgrade(5) man page.

Table 1-1 Solaris Live Upgrade Software Commands

Command       Purpose
lu            Access the Forms and Menu Language Interpreter (FMLI)-based interface for creating and administering BEs.
luactivate    Define or display which BE to use at the next reboot.
lucancel      Cancel a job scheduled with the FMLI-based interface.
Table 1-1 Solaris Live Upgrade Software Commands (Continued)

Command       Purpose
lucompare     Compare files in two BEs, or compare files in a BE with a previously taken compare database (a list of files, sizes, and checksums).
lucreate      Create a BE. The command can either create a BE and populate it by cloning the active boot environment (what you would do if you intend to continue by upgrading the new BE with luupgrade), or create a BE that has only empty, fresh file systems (what you would do if you intend to lay down a pre-existing Flash image on the new BE with luupgrade).
lucurr        Display the name of the active BE.
ludelete      Delete a BE.
lufslist      List the file systems within a BE.
lumake        Recreate a BE based on the active BE.
lurename      Rename a BE.
lumount       Mount the file systems of a non-active BE.
luumount      Unmount the file systems of a non-active BE.
lustatus      Report the status of all BEs present on a system.
luupgrade     Upgrade the OS on a BE, or lay down a Flash image on the BE (whatever was on the BE previously is lost; this is not really an upgrade at all).
The node is booted in the cluster. The root disk has only /, swap, and /global/.devices/node@X partitions.
If your current root disk is VxVM-encapsulated, your new boot environment will not be encapsulated. When the new boot environment is activated and booted, you still have access to your original root volumes from the original BE. You can then choose whether to delete these volumes and encapsulate the new root disk.
The /etc/vfstab file might contain an entry for a failover (non-global) file system on a node not currently mounting the file system. The Live Upgrade software complains about such an entry and refuses to create the new BE. The solution is to comment out any such entry before running the Live Upgrade software (only on nodes that are not currently mounting the failover file system).
The Live Upgrade software will not copy the contents of a file system mounted with the keyword global to the new boot disk. The Live Upgrade software treats all file systems with the global option as cluster data file systems and does not copy their contents (this is the correct behavior for real data file systems). The Live Upgrade software is also unhappy with DID device names.
The solution for both issues is to edit the /etc/vfstab file before running the Live Upgrade software: remove the global option and replace the DID device names with traditional c#t#d#s# names. The following shows fragments of a vfstab file edited so that Live Upgrade runs correctly. The fragment is from a node not currently mounting the failover file system /oracle. These lines are from the original file:
/dev/did/dsk/d6s3 /dev/did/rdsk/d6s3 /global/.devices/node@1 ufs 2 no global /dev/md/orads/dsk/d100 /dev/md/orads/rdsk/d100 /oracle ufs 2 no -
These are the changes you need to make for Live Upgrade to run properly (save your original file for easy restoration):
/dev/dsk/c0t0d0s3 /dev/rdsk/c0t0d0s3 /global/.devices/node@1 ufs 2 no #/dev/md/orads/dsk/d100 /dev/md/orads/rdsk/d100 /oracle ufs 2 no -
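For illustration only, the manual edits above can be scripted. This sketch works on a copy in /tmp rather than the live /etc/vfstab, and assumes that DID device d6 maps to c0t0d0 (an assumption; verify the mapping on your own cluster, for example with scdidadm -l):

```shell
# Hypothetical vfstab fragment on a node NOT mounting /oracle.
cat > /tmp/vfstab <<'EOF'
/dev/did/dsk/d6s3 /dev/did/rdsk/d6s3 /global/.devices/node@1 ufs 2 no global
/dev/md/orads/dsk/d100 /dev/md/orads/rdsk/d100 /oracle ufs 2 no -
EOF

# Keep the original for easy restoration after the upgrade.
cp /tmp/vfstab /tmp/vfstab.preLU

# 1) Replace the DID device names with c#t#d#s# names (c0t0d0 assumed).
# 2) Replace the trailing "global" mount option with "-".
# 3) Comment out the line for the failover file system /oracle.
sed -e 's|/dev/did/dsk/d6s3|/dev/dsk/c0t0d0s3|' \
    -e 's|/dev/did/rdsk/d6s3|/dev/rdsk/c0t0d0s3|' \
    -e 's/global$/-/' \
    -e 's|^.*/oracle .*|#&|' /tmp/vfstab.preLU > /tmp/vfstab
cat /tmp/vfstab
```

The result matches the edited fragment shown above; restoring the original is just a copy back from the .preLU file.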
Example BE Creation
Perform the following:
1. Determine the number and size of the file systems in the current BE:
# df -k
# cat /etc/vfstab
2. Verify that the correct number of target slices are created and properly sized. If your target disk has exactly the same geometry as your original root disk (which is easiest), you can follow one of these strategies:
If the original root disk is VxVM-encapsulated, you can retrieve the /etc/vx/reconfig.d/disk.d/c#t#d#/vtoc file, which has the original Volume Table of Contents (VTOC) from your pre-encapsulated root disk. You can then apply this partitioning to the target disk.
If the original root disk is not VxVM-encapsulated, you can just copy its partition table to the new disk.
3. Create the new boot environment on the target disk:
Note In this example, c0t1d0 is the target disk. The command entry is identical whether or not the original root disk is VxVM-encapsulated. The example shows in bold some output that is specific to a VxVM-encapsulated disk. The -c option names the current boot environment (this creates configuration entries for the original root configuration in the Live Upgrade software), and the -n option names the new boot environment.

# lucreate -c s9be \
-n s10be \
-m /:/dev/dsk/c0t1d0s0:ufs \
-m -:/dev/dsk/c0t1d0s1:swap \
-m /global/.devices/node@1:/dev/dsk/c0t1d0s3:ufs
Discovering physical storage devices
Discovering logical storage devices
Cross referencing storage devices with boot environment configurations
Determining types of file systems supported
Validating file system requests
Preparing logical storage devices
Preparing physical storage devices
Configuring physical storage devices
Configuring logical storage devices
Analyzing system configuration.
No name for current boot environment.
Current boot environment is named <s9be>.
Creating initial configuration for primary boot environment <s9be>.
WARNING: The device </dev/vx/dsk/bootdg/rootvol> for the root file system mount point </> is not a physical device.
WARNING: The system boot prom identifies the physical device </dev/dsk/c0t0d0s0> as the system boot device. Is the physical device </dev/dsk/c0t0d0s0> the boot device for the logical device </dev/vx/dsk/bootdg/rootvol>? (yes or no) yes
INFORMATION: Assuming the boot device </dev/dsk/c0t0d0s0> obtained from the system boot prom is the physical boot device for logical device </dev/vx/dsk/bootdg/rootvol>.
The device </dev/dsk/c0t0d0s0> is not a root device for any boot environment.
PBE configuration successful: PBE name <s9be> PBE Boot Device </dev/dsk/c0t0d0s0>.
Comparing source boot environment <s9be> file systems with the file system(s) you specified for the new boot environment.
Determining which file systems should be in the new boot environment.
The file system is not mounted for the currently running BE
The file system  is not mounted for the currently running BE
Updating boot environment description database on all BEs.
Searching /dev for possible boot environment filesystem devices
Updating system configuration files.
The device </dev/dsk/c0t1d0s0> is not a root device for any boot environment.
Creating configuration for boot environment <s10be>.
Source boot environment is <s9be>.
The file system  is not mounted for the currently running BE
Creating boot environment <s10be>.
Creating file systems on boot environment <s10be>.
Creating <ufs> file system for </> on </dev/dsk/c0t1d0s0>.
Creating <ufs> file system for </global/.devices/node@1> on </dev/dsk/c0t1d0s3>.
Mounting file systems for boot environment <s10be>.
Calculating required sizes of file systems for boot environment <s10be>.
Populating file systems on boot environment <s10be>.
Checking selection integrity.
Integrity check OK.
Populating contents of mount point </>.
Populating contents of mount point </global/.devices/node@1>.
Copying.
.
# lustatus
BE_name   Complete   Active   ActiveOnReboot   CopyStatus
---------------------------------------------------------
s9be      yes        yes      yes
s10be     yes        no       no

2. Make sure that the BE is not currently mounted, and unmount it if necessary. If the alternate BE is mounted, the mount point for its root file system will be /.alt.be-name, for example /.alt.s10be. One reason it might be mounted is that there was some manipulation you had to perform by hand, such as removing an older version of VxVM (discussed later in this module).
# df -k
# luumount s10be
3. Run the luupgrade utility to upgrade the inactive BE. The amount of time varies greatly according to the horsepower of your system. On the course developer's system (2 x 1.5 gigahertz (GHz) CPUs, 6 gigabytes (GB) RAM), it takes about two hours. The course developer has seen slower systems where it took over nine hours.
Note The Solaris OS image identified by the -s option must be the directory containing the .cdtoc, .install_config, .slicemapfile, and .volume.inf files. This directory then contains the Solaris_n directory.

# luupgrade -u -n s10be -s /net/server/sol10u3sparc
Validating the contents of the media </net/clustergw/sol10u3sparc>.
The media is a standard Solaris media.
The media contains an operating system upgrade image.
The media contains <Solaris> version <10>.
Constructing upgrade profile to use.
Locating the operating system upgrade program.
Checking for existence of previously scheduled Live Upgrade requests.
Creating upgrade profile for BE <s10be>.
Determining packages to install or upgrade for BE <s10be>.
Performing the operating system upgrade of the BE <s10be>.
CAUTION: Interrupting this process may leave the boot environment unstable or unbootable.
Upgrading Solaris: 100% completed
Installation of the packages from the media is complete.
Updating package information on boot environment <s10be>.
Package information successfully updated on boot environment <s10be>.
Adding operating system patches to the BE <s10be>.
The operating system patch installation is complete.
INFORMATION: The file </var/sadm/system/logs/upgrade_log> on boot environment <s10be> contains a log of the upgrade operation.
INFORMATION: The file </var/sadm/system/data/upgrade_cleanup> on boot environment <s10be> contains a log of cleanup operations required.
WARNING: <3> packages failed to install properly on boot environment <s10be>.
INFORMATION: The file </var/sadm/system/data/upgrade_failed_pkgadds> on boot environment <s10be> contains a list of packages that failed to upgrade or install properly.
INFORMATION: Review the files listed above. Remember that all of the files are located on boot environment <s10be>. Before you activate boot environment <s10be>, determine if any additional system maintenance is required or if additional media of the software distribution must be installed.
The Solaris upgrade of the boot environment <s10be> is partially complete.
Note This upgrade displays what appear to be pkgadd errors. The reason is that some of the Sun Cluster auxiliary components that were installed on top of the original Solaris 9 are now part of the base OS in Solaris 10; however, the versions already in the Solaris 9 cluster are newer than the ones Solaris 10 was trying to install. Therefore, there is no problem. You can see this detail in the /var/sadm/system/data/upgrade_failed_pkgadds file of the upgraded boot environment (not on the original root disk).
4. Check the status of the upgrade:
# lustatus s10be
# luactivate s10be WARNING: <3> packages failed to install properly on boot environment <s10be>. INFORMATION: </var/sadm/system/data/upgrade_failed_pkgadds> on boot environment <s10be> contains a list of packages that failed to upgrade or install properly. Review the file before you reboot the system to determine if any additional system maintenance is required.
********************************************************************** The target boot environment has been activated. It will be used when you reboot. NOTE: You MUST NOT USE the reboot, halt, or uadmin commands. You MUST USE either the init or the shutdown command when you reboot. If you do not use either init or shutdown, the system will not boot using the target BE. ********************************************************************** In case of a failure while booting to the target BE, the following process needs to be followed to fallback to the currently working boot environment: 1. Enter the PROM monitor (ok prompt). 2. Change the boot device back to the original boot environment by typing: setenv boot-device disk:a
3. Boot to the original boot environment by typing: boot ********************************************************************** Activation of boot environment <s10be> successful.
3. Verify that the desired BE is the one active for the next reboot:
# luactivate s10be
4. Bring the system down:
# init 0
The boot-device will automatically be changed so that the next time you boot, the system boots from the new boot environment.
The luactivate command makes it clear that you must use either the init or /usr/sbin/shutdown command to reboot the system with the newly activated BE. This ensures that the /etc/rc0.d/K62lu shutdown script runs. If booted in the cluster, you can use the /usr/cluster/bin/scshutdown command because it calls the /sbin/rc0 script.
Synchronizing Files
The first time you boot a newly created boot environment, the Solaris Live Upgrade software synchronizes the files defined in the /etc/lu/synclist file, using the corresponding files from the boot environment that was last active as the source. You can add and remove entries in the synclist file to control which files are synchronized to the new BE when it is first booted.

Note At the time of writing of this course, this feature does not work when using the OVERWRITE keyword to synchronize custom files and directories. It is supposed to be able to synchronize entire directory trees.
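For illustration, each synclist line pairs a file or directory with an action keyword such as OVERWRITE or APPEND. The entries below are examples only, not the shipped defaults; check the /etc/lu/synclist on your own system:

```
/etc/passwd             OVERWRITE
/etc/shadow             OVERWRITE
/var/mail               OVERWRITE
```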
The first three steps can proceed while your original boot environment is up and running the original OS, the original VxVM version, the original cluster framework, and the clustered applications. When upgrading to VxVM 5.0, you do not require a new license to perform the upgrade, unless your VxVM 4.x license has expired. VERITAS does not support direct upgrades to VxVM 5.0 from any release of VxVM earlier than 4.0. If you have an older VxVM release, you must first upgrade to version 4.0, and then to 5.0.
As the VxVM package is added, you are asked whether you want to restore the previous configuration, which was automatically preserved for you. You should answer yes to restore the configuration, as in the following example:
At the following prompt:
- Enter "y" if you are upgrading VxVM and want to use the existing VxVM configuration. Do not run vxinstall after installation.
- Enter "n" if this is a new installation. All disks will still have old configuration. You will need to run vxinstall and then vxdiskadm after installation to initialize and configure VxVM.
Restore and reuse the old VxVM configuration [y,n,q,?] (default: y): y
Task 1 Verify that your cluster is operating correctly
Task 2 Install the Solaris 10 Live Upgrade software and partition the target disk
Task 3 Create a new boot environment as a clone of the original root disk
Task 4 Remove VxVM 4.0 from the new boot environment (if you are using VxVM)
Task 5 Upgrade the boot environment to Solaris 10 OS
Task 6 Add VxVM 5.0 into the new boot environment (if you are using VxVM)
Preparation
You must identify the local disk where the new Live Upgrade boot environment will be created. This drive must be local, must not be a VxVM disk, and should have the identical size and geometry of the current boot disk. Use the format command to identify your local disks, and use the vxdisk -o alldgs list command to verify that they are not VxVM disks (if you are using VxVM). You also need to know the node ID of each node during this exercise. Run the following command on both cluster nodes to determine each node ID:
# clinfo -n
2. Verify that a failover resource group for Oracle, a scalable resource group for Sun Java System Web Server (iws), and a failover resource group for the scalable service load balancer (lb-rg) are configured:
# scstat -g
3. Verify that you can access the Oracle database. A small /oracli client environment has been installed on each node. You should be able to act as an Oracle client from either node, regardless of the node on which the failover service is running.
# ksh
# cd /oracli
# ls
clienv oraInventory/ product/
# . ./clienv
# which sqlplus
# sqlplus SYS@MYORA as sysdba
Enter password: cangetin
SQL> select * from mytable;
SQL> insert into mytable values ('yourname', age);
SQL> commit;
SQL> select * from mytable;
SQL> quit
4. On your administrative workstation or display station, edit /etc/hosts and add the IP address that is known on the nodes as iws-lh.
Note You can call this iws-lh on the admin workstation if you are not in a shared admin station environment. In an RLDC environment with a shared display server, the logical names should already be entered. Consult your instructor.

5. Invoke your web browser on your display station. Check your proxy settings; if you are using a proxy to get to the Internet, set a proxy exception for the name you entered in the previous step.
6. Navigate to http://iws-lh-name/cgi-bin/test-iws.cgi.
7. Click the reload or refresh button several times to verify that you are receiving responses from both nodes.
Task 2 Installing the Solaris 10 Live Upgrade Software and Partitioning the Target Disk
Perform the following steps on all of your cluster nodes:
1. Use the installer provided as part of the Solaris 10 distribution to install the new Live Upgrade packages:
# cd Sol10_distr_dir/Solaris_10/Tools/Installers
# ./liveupgrade20 -nodisplay
2. Accept the license and use the Typical option to install all the packages. This is a very quick install.
3. If the original root disk is VxVM-encapsulated, partition the target disk as follows. (If you are not using VxVM, or your root disk is not VxVM-encapsulated, skip this step and do step 4 instead.) Perform the following on all the nodes in the cluster:
a. Determine the device being used as the current boot disk:
# ls /etc/vx/reconfig.d/disk.d
b. Copy the vtoc file that contains the partition table for the current boot disk before it was encapsulated:
# cp /etc/vx/reconfig.d/disk.d/boot-disk/vtoc /tmp
c. Edit the vtoc file (# vi /tmp/vtoc):
1. Delete the first two comment lines.
2. Remove all instances of the string 0x from the second column.
3. Remove all instances of the string 0x2 from the third column (in total, you are removing two characters from the second column and three characters from the third column).
4. Save the file.
d. Create partitions on the target disk according to the vtoc file. Be careful that you format the target disk, not the current boot disk:
# fmthard -s /tmp/vtoc /dev/rdsk/c#t#d#s2
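The manual edits in step c are simple text surgery, so they can also be scripted. This is a sketch against a hypothetical vtoc file (the slice values below are invented for illustration; your real file's contents will differ, so always inspect it first):

```shell
# Hypothetical sample of a saved vtoc file: two comment lines,
# then slice / tag / flags / start / size columns.
cat > /tmp/vtoc <<'EOF'
#THE PARTITIONING OF /dev/rdsk/c0t0d0s2
#SLICE TAG FLAGS START SIZE
0 0x2 0x200 0 4194828
1 0x3 0x201 4194828 4194828
EOF

# Step c as a script: drop the two comment lines, strip the 0x
# prefix from column 2 and the 0x2 prefix from column 3, leaving
# a datafile in the form fmthard -s expects.
sed '1,2d' /tmp/vtoc | \
    awk '{ sub(/^0x/, "", $2); sub(/^0x2/, "", $3); print }' \
    > /tmp/vtoc.fmthard
cat /tmp/vtoc.fmthard
```

The resulting /tmp/vtoc.fmthard would then be fed to fmthard exactly as in step d.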
4. If the original root disk is not VxVM-encapsulated, partition the target disk as follows (do this step instead of step 3). Copy the partitioning from the original root disk to the new disk. In this example, cAtAdA is the original disk and cBtBdB is the new disk:
# prtvtoc /dev/rdsk/cAtAdAs2 | fmthard -s - \
/dev/rdsk/cBtBdBs2
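Because fmthard writes destructively, a small guard around this copy is cheap insurance. This sketch uses the placeholder disk names from the exercise and only echoes the command rather than running it:

```shell
# Placeholder disk names from the exercise; substitute your real devices.
SRC=cAtAdA   # original root disk
DST=cBtBdB   # target disk

# Refuse to proceed if both variables name the same disk:
# running fmthard against the current boot disk would destroy it.
if [ "$SRC" = "$DST" ]; then
    echo "source and target are the same disk; aborting" >&2
    exit 1
fi

# Build the copy command; on a real node you would execute it
# instead of just printing it.
CMD="prtvtoc /dev/rdsk/${SRC}s2 | fmthard -s - /dev/rdsk/${DST}s2"
echo "$CMD"
```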
Task 3 Creating the New Boot Environment as a Clone of the Original Root Disk
Perform the following steps on all cluster nodes. Do not wait for the lucreate command to complete on one node before starting the next, but heed the warning that the commands are not identical on each node.
1. Verify that the target disk was partitioned correctly:
# format c#t#d#
The tasks in this exercise assume that the boot disk has a slice 0 for the root file system, a slice 1 for swap, a slice 3 for global devices, and a slice 7 for Solaris Volume Manager software replicas.
2. If you are using Solaris Volume Manager (only), add metadevice database replicas on slice 7 of the new disk, and verify that you have three copies on each disk:
# metadb -a -c 3 c#t#d#s7
# metadb -i
3. Save a copy of the original vfstab on each node:
# cp /etc/vfstab /etc/vfstab.preLU
4. Edit the /etc/vfstab file:
a. If you are using SVM: on the line for the /global/.devices/node@# file system, change /dev/did/dsk/d#s3 and /dev/did/rdsk/d#s3 to /dev/dsk/c#t#d#s3 and /dev/rdsk/c#t#d#s3, using the same c#t#d# as the root disk.
b. Regardless of your volume manager, replace the word global in the seventh field of the line for /global/.devices/node@# with a minus sign (-).
c. Only on the node not mounting /oracle (check carefully), comment out the line for /oracle.
5. Create a clone of the current boot environment on the target disk, using the node ID as the value of the variable X:

Warning This command is not identical on each node of the cluster. The node@X will differ, and you might also have different target disks. Be careful: c#t#d# identifies the target disk. If you are using VxVM, confirm when the command asks about the identity of your underlying root drive.

# lucreate -c s9be -n s10be \
-m /:/dev/dsk/c#t#d#s0:ufs \
-m -:/dev/dsk/c#t#d#s1:swap \
-m /global/.devices/node@X:/dev/dsk/c#t#d#s3:ufs
Note The lucreate command can take approximately 10-20 minutes to complete.
Note The upgrade to Solaris 10 OS can take two to seven hours to complete, depending on the speed of your hardware. Your lecture will probably be continuing at this point, and you will be continuing the lab later. Near the end of the upgrade you will see information and warnings about some failed pkgadds, as discussed in the body of the module. The cluster installation had some versions of packages that were newer than those bundled in the base release of Solaris 10. You can ignore these warnings.
# cp VRTSvlic.tar.gz VRTSvxvm.tar.gz \
VRTSvmman.tar.gz /var/tmp
# cd /var/tmp
# gzcat VRTSvlic.tar.gz | tar xf -
# gzcat VRTSvxvm.tar.gz | tar xf -
# gzcat VRTSvmman.tar.gz | tar xf -
3. Add the new VxVM packages. Answer yes when you are asked about using the saved configuration:
# pkgadd -d /var/tmp -R /.alt.s10be \
VRTSvlic VRTSvxvm VRTSvmman
4. Add any VxVM 5.0 patches directly into the new boot environment:
# cd veritas_50_patch_dir
# patchadd -R /.alt.s10be xxxxx-yy
Note At the time of writing this course, there are no VxVM patches required by the course itself.
5. Unmount the new boot environment:
# luumount s10be
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
Module 2
Upgrading the Sun Cluster Software and Completing Sun Cluster Upgrades
Objectives
Upon completion of this module, you should be able to do the following:
Upgrade Sun Cluster software when not using Live Upgrade Use the scinstall options that control the dual-partitioned upgrade method, when not using Live Upgrade Use Live Upgrade to upgrade the Sun Cluster software Upgrade resource types and resources
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics described in this module. Although they are not expected to know the answers to these questions, the answers should interest them and inspire them to learn the material.
Can functionality already present in commands like pkgadd and patchadd install software directly into a new boot environment that is created with the Live Upgrade Software? Is the time required for a reboot the minimum that you could possibly expect your applications to be down during an upgrade? Or could you achieve even less downtime than that in the cluster environment? What would happen if you never upgraded from the Sun Cluster 3.1 resource type versions after an upgrade to Sun Cluster 3.2? Should existing resources still work? Always? What might break in the future?
Additional Resources
The following references provide additional information on the topics described in this module:
Sun Microsystems, Inc. Sun Cluster Software Installation Guide For Solaris OS, part number 819-2970. Sun Microsystems, Inc. Sun Cluster Software Administration Guide For Solaris OS, part number 819-2971.
Note You use all of these same procedures whether or not you are taking advantage of the dual-partition upgrade feature. If you are using the dual-partition upgrade feature, you use the new scinstall dual-partition management options, discussed in the next major section of this module, to manage the transitions between the partitions. When the nodes of each partition are booted in non-cluster mode, you use the normal OS upgrade procedures to upgrade the OS, and these procedures to upgrade the cluster components.
Upgrading the Cluster Software (Non-Live Upgrades)

When you do upgrades of the Sun Cluster framework, you must use the Java ES installer to upgrade the shared components first. For example, if you use the graphical installer, you can choose just the shared components. The installer indicates that the Sun Cluster software itself cannot be upgraded by the Java ES installer (you have to do that afterward).

Note The Java DB software is another required component that is listed separately from the other shared components. Make sure you select the Java DB software when upgrading the shared components.
Figure 2-1 shows the flow of the scinstall -u update framework upgrade: the user runs the command, which saves the cluster configuration, removes the old packages (pkgrm), adds the new packages (pkgadd), and then restores the configuration.

Figure 2-1  The scinstall -u update Process
Sun Cluster 3.0 software uses Public Network Management (PNM) network adapter failover (NAFO) groups for public network fault monitoring. You must convert this PNM configuration to an IPMP configuration for the Sun Cluster 3.1 or Sun Cluster 3.2 software. The scinstall utility performs this conversion when you upgrade the framework. If you are upgrading from Sun Cluster 3.1, it is assumed that IPMP is already configured.
*** Upgrade Menu ***
Please select from any one of the following options:
1) Upgrade Sun Cluster framework on this node
2) Upgrade Sun Cluster data service agents on this node
3) Upgrade Sun Cluster Support for Oracle RAC on this node
?) Help with menu options
q) Return to the Main Menu
Option: 1
The node must be booted in noncluster mode in order to upgrade the framework. Press Control-d at any time to return to the Main Menu.
yes
scinstall -u update
Starting upgrade of Sun Cluster framework software
Saving current Sun Cluster configuration
Do not boot this node into cluster mode until upgrade is complete.
Renamed "/etc/cluster/ccr" to "/etc/cluster/ccr.upgrade".
** Removing Sun Cluster framework packages **
Removing SUNWscspmr..done
Removing SUNWscspmu..done
Removing SUNWscspm...done
Removing SUNWscva....done
Removing SUNWscmasa..done
Removing SUNWmdm.....done
Removing SUNWscvm....done
Removing SUNWscsam...done
Removing SUNWscsal...done
Removing SUNWscman...done
Removing SUNWscgds...done
Removing SUNWscdev...done
Removing SUNWscnm....done
Removing SUNWscsck...done
Removing SUNWscu.....done
Removing SUNWscr.....done
** Installing SunCluster 3.2 framework **
SUNWscu.....done
SUNWsccomu..done
SUNWsczr....done
SUNWsccomzu..done
SUNWsczu....done
SUNWscsckr..done
SUNWscscku..done
SUNWscr.....done
SUNWscrtlh..done
SUNWscnmr...done
SUNWscnmu...done
SUNWscdev...done
SUNWscgds...done
SUNWscsmf...done
SUNWscman...done
SUNWscsal...done
SUNWscsam...done
SUNWscvm....done
SUNWmdmr....done
SUNWmdmu....done
SUNWscmasa..done
SUNWscmasar..done
SUNWscmasasen..done
SUNWscmasau..done
SUNWscmautil..done
SUNWscmautilr..done
SUNWjfreechart..done
SUNWscspmr..done
SUNWscspmu..done
SUNWscderby..done
SUNWsctelemetry..done
Dec 19 12:09:10 rico java[9874]: pkcs11_softtoken: Keystore version failure.
Ensure that the EEPROM parameter "local-mac-address?" is set to "true" ... done
Restored /etc/cluster/ccr.upgrade to /etc/cluster/ccr
Completed Sun Cluster framework upgrade
Updating nsswitch.conf ... done
Press Enter to continue:
This option is used to upgrade Sun Cluster data service agents on this node. Press Control-d at any time to return to the Main Menu.
You must specify the location of the Java Enterprise System (JES) distribution that contains the Sun Cluster data service agents. The name that you give must be the full path to the directory that contains the "Solaris_sparc" subdirectory.
Where is it located? /net/srvr/sc32
Select the data service agents you want to upgrade:
Identifier   Description
1) iws       Sun Cluster HA Sun Java System Web Server
2) oracle    Sun Cluster HA for Oracle
3) All       All data services in this menu
List of upgradable data services agents:
(*) indicates selected for upgrade.

    * iws
    * oracle

This is the complete list of data services you selected:

    oracle
    iws

Is it correct (yes/no) [yes]?
Is it okay to upgrade these data services now (yes/no) [yes]?
Do not boot this node into cluster mode until upgrade is complete.

** Removing HA oracle Data Service on Sun Cluster **
Removing SUNWscor....done
** Installing Sun Cluster HA for Oracle **
SUNWscor....done
** Removing HA Sun Java System Web Server **
Removing SUNWschtt...done
** Installing Sun Cluster HA Sun Java System Web Server **
SUNWschtt...done
Completed upgrade of Sun Cluster data services agents

Press Enter to continue:
The packages for the previous version are removed, but the resource types remain registered. The packages for the new data services are added, but any new resource type versions are not registered. The scinstall command does not upgrade the already instantiated resources for any types, such as oracle_server and oracle_listener, that require type version upgrades. This must be done after booting the cluster, and is discussed later in the module.
Managing Dual-Partition Upgrades (Non Live-Upgrade) Using scinstall

This option is completely harmless and can be called at any time, or not at all if you know what your partitioning scheme will be. For example, this was called on a three-node cluster where all nodes were attached to the shared data storage:

rico:/# scinstall -u plan

Option 1
  First partition
    rico
    midnight
  Second partition
    noodle

Option 2
  First partition
    rico
    noodle
  Second partition
    midnight

Option 3
  First partition
    rico
  Second partition
    midnight
    noodle
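The option list that scinstall -u plan prints is essentially an enumeration of the distinct two-way splits of the node set (on a real cluster it is also filtered by shared-storage connectivity rules). A stand-alone sketch of that enumeration for the same three node names, with the connectivity filtering omitted:

```shell
#!/bin/sh
# Sketch: enumerate the distinct two-way partition splits of a
# three-node cluster, the raw material behind scinstall -u plan.
# Node names are taken from the transcript above; real planning
# also filters by shared-storage connectivity, omitted here.

nodes="rico midnight noodle"
max=7                           # 2^3 - 1: bitmask of all three nodes

opt=0
i=1
while [ $i -lt $max ]; do
    comp=$((max - i))           # bitmask of the complementary subset
    if [ $i -lt $comp ]; then   # count each split only once
        opt=$((opt + 1))
        first=""; second=""
        bit=1
        for n in $nodes; do
            if [ $((i & bit)) -ne 0 ]; then
                first="$first $n"
            else
                second="$second $n"
            fi
            bit=$((bit * 2))
        done
        echo "Option $opt: first partition:$first / second:$second"
    fi
    i=$((i + 1))
done
# Prints three options, matching the three splits in the transcript.
```

The complementary-subset check is why a three-node cluster yields exactly (2^3 - 2) / 2 = 3 options: swapping the first and second partitions does not create a new plan.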
You run it from any one node before any upgrades at all have been performed. You run it from the Sun Cluster 3.2 installation media, regardless of which version of Sun Cluster you are currently running (previous to the upgrade). You specify which nodes will be in the partition that is shut down first. Quorum votes are manipulated on the remaining nodes: once the nodes you specify are shut down, the only quorum votes remaining will be those belonging to the remaining nodes (with quorum device votes set as if only the remaining nodes were attached). The nodes that you specify (the first-partition nodes) are automatically halted (through ssh or rsh). This may or may not include the node from which you are running the option. The intention is that you do the complete upgrade of the nodes that have been halted, not booting them into the cluster until upgrades of all layers have been completed.
Note: The utility automatically inserts scripts into the first-partition nodes to prevent you from accidentally trying to boot them into the cluster until you are ready to use the next option (apply) to perform the flop-over.

If you use the interactive menus, the dialogue will have you choose the first-partition nodes. If you use the non-interactive command-line version, you specify the first-partition nodes using the -h option, as in the following example:

rico:/# scinstall -u begin -h rico

Broadcast Message from root (???) on rico Tue Dec 19 11:30:45...
THE SYSTEM rico IS BEING SHUT DOWN NOW ! ! !
Log off now or risk your files being damaged
Applying Changes to the First Partition (Initiating the Flop-Over Menu Option 4)
After performing all your upgrades to the nodes in the first partition, you use this menu option on any one of the first-partition nodes. The noninteractive command-line version is:

# /usr/cluster/bin/scinstall -u apply

This operation performs all of the following operations for you automatically:

1. Nodes of the first partition are rebooted back into cluster mode. They do not communicate with the non-upgraded, second-partition nodes. Rather, for a short amount of time, you have two separate clusters running side by side.

2. A node in the first partition will (through an automatically provisioned boot-service):

   a. ssh or rsh to the nodes of the second partition
   b. Halt the clustered applications there
   c. Halt those nodes
   d. Initiate application takeover on the first partition
This is the whole beauty of the dual-partition upgrade strategy: your applications are down only for the length of time it takes the first-partition nodes to halt the second-partition nodes and to take over the applications.
Java ES shared components
Sun Cluster software framework packages
Sun Cluster data service packages
You can perform all these upgrades directly into your new boot environment on all cluster nodes simultaneously, while your entire cluster is still booted and available using your original root disks.
Note: The current bug in this procedure, and part of the reason it is not supported, is a timing issue. The first-partition nodes may cause a disk reservation conflict after calling init 6 without giving the other nodes sufficient time to shut down. This may cause a kernel panic. In the lab exercises, if you want to try this experimental procedure, you will modify the run_reserve script to put in a delay to work around this problem.
Reviewing Sun Cluster Software Upgrade Issues (All Methods)
The cluster framework upgrade procedure (scinstall -u update, or submenu item 1 of the upgrade menu) is idempotent. This means that you can resume the command after it is interrupted. The estimated time to upgrade a single node can vary greatly, depending on the node horsepower. Output from the upgrade process is logged in the /var/cluster/upgrade directory (configuration and state information) and the /var/cluster/logs/install directory (files created during upgrade). If you are using Live Upgrade, these will be in the new boot environment, not the old one.
Once a particular node is upgraded, that node's upgrade (on that particular boot environment) is irreversible.
Multiple versions of the same resource type must be able to coexist within the same cluster. The contents of the RTR files for these versions fully describe each resource type version. You can upgrade resources from an old type-version to a new type-version without having to delete and recreate them.
Examining Resource Types and Resource Upgrades (Post Cluster-Upgrade)
ANYTIME
    You can upgrade the resource type version whether the resource is online or offline.

WHEN_DISABLED
    You must disable the resource in order to upgrade its type version.

WHEN_OFFLINE
    The resource must be offline in order to upgrade (it could still be enabled, if its group were completely offline).

WHEN_UNMONITORED
    You can upgrade the resource type version when the resource is offline or online, but not monitored. Likely, the new resource type version has new monitoring code.

WHEN_UNMANAGED
    You can upgrade the resource type version only when the resource is in a group that is unmanaged.

AT_CREATION
    This is a more polite way of saying never. You need to delete the resource and add it back as a new version.
For example, the oracle_server RTR file for resource type version 6 contains directives:

#$upgrade_from "1"   anytime
#$upgrade_from "3.1" anytime
#$upgrade_from "4"   anytime
#$upgrade_from "5"   anytime
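Directives like these can be checked mechanically before an upgrade. The following stand-alone sketch embeds a sample RTR fragment in a variable (the file path named in the comment and the version values are illustrative) and maps the tunability keyword for a given old version to the state change required:

```shell
#!/bin/sh
# Sketch: determine what state a resource must be in before its
# Type_version can be raised, by reading the #$upgrade_from directive
# for the version you are upgrading FROM. The RTR fragment below is
# sample data; on a real node you would read the installed RTR file,
# for example /opt/cluster/lib/rgm/rtreg/SUNW.oracle_server.

OLD_VERSION="4"

RTR_SAMPLE='#$upgrade
#$upgrade_from "1"   anytime
#$upgrade_from "3.1" anytime
#$upgrade_from "4"   anytime
#$upgrade_from "5"   when_disabled'

# Pull out the tunability keyword for the old version
keyword=$(printf '%s\n' "$RTR_SAMPLE" |
    awk -v v="\"$OLD_VERSION\"" '$1 == "#$upgrade_from" && $2 == v { print $3 }')

case $keyword in
    anytime)          echo "No state change needed; just set Type_version." ;;
    when_disabled)    echo "Disable the resource first (clrs disable)." ;;
    when_offline)     echo "Take the group offline first (clrg offline)." ;;
    when_unmonitored) echo "Unmonitor the resource first (clrs unmonitor)." ;;
    when_unmanaged)   echo "Unmanage the group first (clrg unmanage)." ;;
    "")               echo "No upgrade_from directive: treat as AT_CREATION." ;;
esac
# For OLD_VERSION="4" this prints:
# No state change needed; just set Type_version.
```

The awk pattern simply matches the quoted version token in the second field, which mirrors how you would eyeball the directive by hand.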
Assuming that the package containing the methods and description for the new version of the resource type is installed (for example, by the procedure that you used to upgrade the data service packages), you use the following steps to register new versions of resource types and upgrade resource instances to these new resource type versions:

1. Determine whether you need to change the state of the resource to upgrade its type. Look for upgrade_from directives in the RTR file (for example, in /opt/cluster/lib/rgm/rtreg/SUNW.oracle_server). These files are discussed in more detail in the next module.

2. Put the resource in the appropriate state. Here are some example commands:

# //do nothing, if it can be upgraded anytime
# clrs disable res
# clrs unmonitor res
# clrg offline rg-containing-res
# clrg unmanage rg-containing-res

3. Register the new resource type version:

# clrt register res-type

4. For each resource of the old version, change the Type_version property to the new version:

# clrs set -p Type_version=new-version res

5. Restore the state of the resource, the resource monitor, or the resource group, if necessary. Type one of the following commands:

# //nothing, if it is already online
# clrs enable res
# clrs monitor res
# clrg online -M rg-containing-res

6. (Optional) Unregister the old resource types. Although you removed the resource type packages using the scinstall upgrade procedure, they are still registered and can cause confusion for subsequent resource creation.

# clrt unregister old-resource-type
4. Check the upgrade directives in the RTR file:

# grep upgrade /opt/cluster/lib/rgm/rtreg/SUNW.oracle_server
#$upgrade
#$upgrade_from "1" anytime
#$upgrade_from "3.1" anytime
#$upgrade_from "4" anytime
#$upgrade_from "5" anytime

5. Upgrade the ora-server-res resource to the new version type:

# clrs set -p Type_version=6 ora-server-res

6. Verify that the upgrade succeeded:

# clrs show ora-server-res

=== Resources ===

Resource:                     ora-server-res
  Type:                       SUNW.oracle_server:6
  Type_version:               6
  Group:                      ora-rg
  R_description:
  Resource_project_name:      default
  Enabled{rico}:              True
  Enabled{midnight}:          True
  Monitored{rico}:            True
  Monitored{midnight}:        True
Version 3.1 and 3.2 RTR files contain the #$upgrade directive; version 3.0 software RTR files do not. All Sun Cluster 3.0 resources are considered to be of Type_version 1. Sun Cluster 3.0 did not have any functionality to manipulate resource type versions, but Sun Cluster 3.1 and 3.2 type upgrades all consider resource type version 1 a valid type version from which to upgrade. Sun Cluster 3.1 and 3.2 RTR files must define the RT_VERSION resource type property in addition to the START, STOP, and RESOURCE_NAME properties. The Sun Cluster 3.1 and 3.2 software stores 3.1 and 3.2 resource types in the Cluster Configuration Repository (CCR) under a concatenated name. If you are upgrading from Sun Cluster 3.0, the software continues to store previous software version resource types in the CCR under a non-concatenated name. Unless there is a compelling reason not to do so, unregister all old Sun Cluster software data services after you upgrade all resource instances to the new version data services. Even though old version data services can coexist with new version data services in a cluster, it is usually confusing to have both types.
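The concatenated name has the shape vendor.typename:version, as in the SUNW.oracle_server:6 seen in the clrs show output earlier. Splitting such a name needs nothing more than POSIX parameter expansion; a small sketch:

```shell
#!/bin/sh
# Sketch: split a concatenated resource type name of the form
# VENDOR.typename:version (for example SUNW.oracle_server:6)
# into its parts using POSIX shell parameter expansion.

rt_name="SUNW.oracle_server:6"    # sample value

vendor=${rt_name%%.*}             # everything before the first '.'
rest=${rt_name#*.}                # everything after the first '.'
typename=${rest%%:*}              # everything before the ':'
version=${rt_name##*:}            # everything after the last ':'

echo "vendor=$vendor type=$typename version=$version"
# Prints: vendor=SUNW type=oracle_server version=6
```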
Task 1  Patching and restoring vfstab files
Task 2  Verify dependency software
Task 3  Upgrade the Sun Cluster software framework
Task 4  Upgrade the Sun Cluster software data services
Task 5  Run the fixforzones script
Task 6A Reboot the cluster nodes
Task 6B Reboot the cluster nodes (experimental rolling reboot)
Task 7  Upgrade resource instances whose types have new versions
Task 9  Upgrade disk groups (VxVM only)
Task 10 Verify your cluster operation
3. Patch your upgraded OS. At the time of writing this course, there are two required IDR patches.

# cd patch_location
# patchadd -R /.alt.s10be -M . patches
4. Check and fix the nsswitch.conf file in the new boot environment. Make sure the line for ipnodes references only the files keyword, regardless of any other name service you are using:

# vi /.alt.s10be/etc/nsswitch.conf
ipnodes: files
5. Create an empty state file so that you are not asked about NFSv4 domains the first time you boot your new OS.

# touch /.alt.s10be/etc/.NFS4inst_state.domain
Note: The remaining Live Upgrade procedures rely on the new boot environment still being mounted. Do not perform the luumount command here.
Use the upgrade state file produced in step 1 to perform a silent Java ES installation of the shared components directly into the alternate boot environment.

# cd sc32_directory/Solaris_sparc
# ./installer -noconsole -nodisplay \
  -altroot /.alt.s10be \
  -state /var/tmp/jesupgr.state
Exercise: Upgrading the Sun Cluster Software

You will need to wait for both nodes to load their SMF services, but eventually your new cluster should be active and your cluster services should run automatically. Try to time how long your applications were not available.
On the chosen node, initiate the dual-partition rolling reboot. Ignore any error messages that you see right after the command, unless you mistyped it.
# cd sc32_directory/Solaris_sparc/Product/sun_cluster
# cd Solaris_10/Tools
# ./scinstall -u begin -R /.alt.s10be \
  -h name_of_node_you_are_typing_on

6. Observe the reboot sequence and time the amount of time your applications are down. This will take more total time than the supported post-live-upgrade reboot, but your applications will be down for a shorter duration.

7. Restore the run_reserve file on the node you had driven from:
Warning: Make sure you do not do this until both nodes have rebooted successfully, in the rolling fashion, into the new OS.

# cd /usr/cluster/lib/sc
# mv run_reserve.save run_reserve
2. Upgrade the Oracle resources. Ignore validation error messages that occur, as usual, on the node not mounting the failover file system.

# clrs set -p Type_version=6 ora-server-res
# clrs set -p Type_version=5 ora-listener-res
# clrs set -p Type_version=4 ora-stor
3. Upgrade the iws resources.

# clrs set -p Type_version=5 iws-res
# clrs set -p Type_version=4 iws-stor
4. Unregister the old types:

# clrt unregister SUNW.oracle_server:4
# clrt unregister SUNW.oracle_listener:4
# clrt unregister SUNW.iws:4
# clrt unregister SUNW.HAStoragePlus:2
Exercise Summary
Discussion: Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
Module 3
Understand Sun Cluster data services
Write Sun Cluster 3.x software data services
Control RGM behavior through resource group properties and resource properties
Use advanced resource group relationships
Tune multimaster and scalable applications
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics described in this module. Although they are not expected to know the answers to these questions, the answers should interest them and inspire them to learn the material.
What are the benefits of being able to configure instances of a custom resource type using extension properties? Does Sun Cluster require that all application fault monitors use pretty much the same logic, or is it just a convention? What is the benefit of the convention? If GDS had been invented with the initial release of Sun Cluster 3.0, would there be so many specific resource types?
Additional Resources
The following references provide additional details on the topics described in this module:
hatimerun(1M)
pmfadm(1M)
property_attributes(5)
r_properties(5)
rg_properties(5)
rpc.pmfd(1M)
rt_callbacks(1HA)
rt_properties(5)
rt_reg(4)
scdsbuilder(1HA)
scdsconfig(1HA)
scdscreate(1HA)
scha_calls(3HA)
SUNW.gds(5)
Sun Microsystems, Inc. Sun Cluster Software Administration Guide for Solaris OS, part number 819-2971.
Sun Microsystems, Inc. Sun Cluster Data Service Developer's Guide for Solaris OS, part number 819-2972.
A data service consists of the application itself, which is installed separately from the cluster framework software, and the following:
Methods (also known as callback methods) that the cluster calls in response to automatic or manual requests
Fault probes that monitor the health of the application
A resource type registration (RTR) file that defines:
    Methods
    Properties
    Other directives, such as upgrade directives (as seen earlier in the course)
The state of the disabled/enabled flag for a resource is preserved during state transitions, unless you explicitly add a -e option to the clrg switch or clrg online commands. A managed but Offline group is still subject to going Online automatically upon cluster reconfiguration (that is, node failure or node joining), unless it is suspended.
Figure 3-1 shows the resource group state model: a managed group moves between Online (resources running, if enabled) and Offline (no resources running) with clrg online rg, clrg offline rg, and clrg switch -n node rg; the enabled/disabled state of each resource is preserved across these transitions and is changed with clrs enable res / clrs disable res (which affects whether a resource will run when its group is switched on). clrg manage rg brings an Unmanaged group into the managed, Offline state.

Figure 3-1
Callback Methods
There are several methods that you can create, but only those that start and stop the application are required. Table 3-1 includes the full list of methods and their descriptions. Refer to the man page for the rt_callbacks(1HA) command for more information on callback methods.

Table 3-1 Data Service Callback Methods

START
    This method starts the application, usually under PMF control. It is called when bringing a resource online.

STOP
    This method stops the application. It is called when bringing a resource offline.

MONITOR_START
    This method starts the fault monitors, usually under PMF control. It is called when bringing a resource online.

MONITOR_STOP
    This method stops the fault monitors. It is called when bringing a resource offline.

MONITOR_CHECK
    This method performs a sanity check before a failover to validate that a given data service can run on the proposed destination node.

PRENET_START
    This method is used in addition to or instead of the START method. It is called before the network resources are configured when bringing a resource online.

POSTNET_STOP
    This method is used in conjunction with or instead of the STOP method to stop the application. It is called after the network resources are unconfigured when bringing a resource offline.

VALIDATE
    This method is called when a resource is instantiated or a resource property is modified to validate that the requested change is okay. If it is not okay, this method vetoes the change (exits as non-zero).

INIT
    This method is called: (A) When a resource is added to a managed resource group. (B) For all resources in a group when a resource group moves from the unmanaged to the managed state.
Table 3-1 Data Service Callback Methods (Continued)

BOOT
    This method is called when a node joins the cluster, if the resource group containing the resource is already managed.

FINI
    This method is the opposite of INIT. It is called: (A) When a resource is removed from a managed resource group. (B) For all resources in a group when a resource group moves from the managed to the unmanaged state.

UPDATE
    This method is called when a resource is instantiated or a resource property is modified. If the VALIDATE method fails, then the UPDATE method is not called.
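Of the methods above, only START and STOP are mandatory. The following stand-alone sketch reduces such a pair to its essence; the pidfile path and the sleep stand-in daemon are illustrative only (a real START method would launch the daemon under PMF with pmfadm and read its configuration with scha_resource_get):

```shell
#!/bin/sh
# Sketch: the two required callback methods, reduced to their essence.
# A sleep stands in for the application daemon; the pidfile path is
# illustrative. Real methods run under the RGM with PMF supervision.

PIDFILE=/tmp/demo_svc.pid

svc_start() {
    sleep 300 &                 # stand-in for the application daemon
    echo $! > "$PIDFILE"
}

svc_stop() {
    if [ -f "$PIDFILE" ]; then
        kill "$(cat "$PIDFILE")" 2>/dev/null
        rm -f "$PIDFILE"
    fi
    return 0                    # STOP must succeed even if already gone
}

svc_start
kill -0 "$(cat "$PIDFILE")" && echo "daemon running"
svc_stop
[ ! -f "$PIDFILE" ] && echo "daemon stopped"
# Prints: daemon running
#         daemon stopped
```

Note the idempotent STOP: the RGM treats a failing STOP method as a serious error, so stop logic conventionally succeeds even when there is nothing left to kill.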
You can still transition the resource group any way you like using the commands presented on the previous pages.
You can still enable/disable any individual resources, using the commands presented on the previous pages. If the resource group is online, the resources will go on and off accordingly.
The fault monitors for resources will still be started.
Resources will not automatically be restarted by fault monitors, nor will entire groups automatically fail over, even if an entire node fails.
The reason you might want to suspend an online resource group is to perform maintenance on it; that is, to start and stop some applications manually while preserving the online status of the group and other components, so that dependencies can still be honored correctly. The reason you might suspend an offline resource group is so that it does not go online automatically when you did not intend it to do so. For example, when you put a resource group offline (but it is still managed and not suspended), a node failure still causes the group to go online. To suspend a group, type:

# clrg suspend grpname

To remove the suspension of a group, type:

# clrg resume grpname

To see whether a group is currently suspended, use the clrg status command.
Resource type properties, such as the Resource_type, RT_version, RT_basedir, START, STOP, and upgrade directive properties
Standard properties that specify specific minima, maxima, defaults, and requirements for this type
Extension properties applicable to this resource type only
Starting in SC3.2, by convention, application-oriented RTR files live in /opt/cluster/lib/rgm/rtreg, and RTR files for built-in types like LogicalHostname and SharedAddress still live in /usr/cluster/lib/rgm/rtreg.
RESOURCE_TYPE = "iws";
VENDOR_ID = SUNW;
RT_DESCRIPTION = "HA Sun Java System Web Server";
RT_VERSION = "5";
API_VERSION = 2;
INIT_NODES = RG_PRIMARIES;

(VALIDATE, INIT, BOOT, and FINI for resources of this type are called on all nodes on the nodelist for that resource group, which is logical. The alternate value is RT_INSTALLED, which would call the methods for a resource on all nodes where the type is installed, even ones not in the nodelist for that RG.)

RT_BASEDIR = /opt/SUNWschtt/bin;
FAILOVER = FALSE;

(In other words, it could be scalable as well.)

START         = iws_svc_start;
STOP          = iws_svc_stop;
VALIDATE      = iws_validate;
UPDATE        = iws_update;
MONITOR_START = iws_monitor_start;
MONITOR_STOP  = iws_monitor_stop;
MONITOR_CHECK = iws_monitor_check;

PKGLIST = SUNWschtt;

#
# Upgrade directives
#
#$upgrade
#$upgrade_from "1.0" anytime
#$upgrade_from "3.1" anytime
#$upgrade_from "4" anytime
# The paramtable is a list of bracketed resource property declarations
# that come after the resource-type declarations.
# The property-name declaration must be the first attribute
# after the open curly of a paramtable entry.

(Defining standard property types is done typically to set a default for THIS resource type; whichever ones are omitted can still be used, but you would just get the "default default" from man r_properties. There is an exception for standard properties relating to load balancing, such as Network_resources_used and Scalable. These are standard properties, but they must be mentioned in the RTR file in order to be used at all for this resource type.)

{
    PROPERTY = Start_timeout;
    MIN = 60;
    DEFAULT = 300;
}
{
    PROPERTY = Stop_timeout;
    MIN = 60;
    DEFAULT = 300;
}
.
. [Lists of similar ones are omitted to save paper.]
.
{
    PROPERTY = FailOver_Mode;
    DEFAULT = SOFT;
    TUNABLE = ANYTIME;
}
{
    PROPERTY = Network_resources_used;
    TUNABLE = AT_CREATION;
    DEFAULT = "";
}
{
    PROPERTY = Scalable;
    DEFAULT = FALSE;
    TUNABLE = AT_CREATION;
}
{
    PROPERTY = Load_balancing_policy;
    DEFAULT = LB_WEIGHTED;
    TUNABLE = AT_CREATION;
}
{
    PROPERTY = Load_balancing_weights;
    DEFAULT = "";
    TUNABLE = ANYTIME;
}
{
    PROPERTY = Port_list;
    DEFAULT = "80/tcp";
    TUNABLE = AT_CREATION;
}

#
# Extension Properties
#
# Not to be edited by end user
{
    PROPERTY = Paramtable_version;
    EXTENSION;
    STRING;
    DEFAULT = "1.0";
    DESCRIPTION = "The Paramtable Version for this Resource";
}

# Must specify installation path of iPlanet (on PXFS)
# Can be a SET of these for sticky mode scalable iPlanet
# Web servers (These need to be under the same resource).
{
    PROPERTY = Confdir_list;
    EXTENSION;
    STRINGARRAY;
    TUNABLE = AT_CREATION;
    DESCRIPTION = "The Configuration Directory Path(s)";
}

# These two control the restarting of the fault monitor itself
# (not the server daemon) by PMF.
{
    PROPERTY = Monitor_retry_count;
    EXTENSION;
    INT;
    MIN = -1;
    DEFAULT = 4;
    TUNABLE = ANYTIME;
    DESCRIPTION = "Number of PMF restarts allowed for the fault monitor";
}
{
    PROPERTY = Monitor_retry_interval;
    EXTENSION;
    INT;
    MIN = -1;
    DEFAULT = 2;
    TUNABLE = ANYTIME;
    DESCRIPTION = "Time window (minutes) for fault monitor restarts";
}

# This is an optional property, which determines whether to failover when
# retry_count is exceeded during retry_interval.
{
    PROPERTY = Failover_enabled;
    EXTENSION;
    BOOLEAN;
    DEFAULT = TRUE;
    TUNABLE = WHEN_DISABLED;
    DESCRIPTION = "Determines whether to failover when retry_count is exceeded during retry_interval";
}

# Time out value for the probe
{
    PROPERTY = Probe_timeout;
    EXTENSION;
    INT;
    MIN = 15;
    DEFAULT = 90;
    TUNABLE = ANYTIME;
    DESCRIPTION = "Time out value for the probe (seconds)";
}

# List of URIs to be probed.
# The iws agent probe will send HTTP/1.1 GET requests to each of
# the listed URIs. The probe looks at the http response code and
# regards 500 (Internal Server Error) as a failure.
{
    PROPERTY = Monitor_Uri_List;
    EXTENSION;
    STRINGARRAY;
    DEFAULT = "";
    TUNABLE = ANYTIME;
    DESCRIPTION = "URI(s) that will be monitored by the agent probe";
}
Callback methods
Resource Management application programming interface (RMAPI)
Process Monitor Facility (PMF)
Data Service Development Library (DSDL)
The hatimerun command
Figure 3-2 shows the overall architecture: resource type callback methods sit above the libdsdev (DSDL) and libscha (RMAPI) libraries, which in turn make use of PMF, hatimerun(1M), and the RGM.

Figure 3-2 Resource Types
Access property values for resources and resource groups
Request restart of a resource or failover of a whole group
Get or set resource status
PMF: The process monitoring facility, which provides a means of monitoring processes and their descendants, and restarting them if they should stop.
Note: PMF is not required. For example, the Oracle data service launches Oracle processes by issuing the database startup command, and the Oracle application itself is not monitored by PMF.
The hatimerun command: A facility for running programs under a timeout (likely an application probe).
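The idea behind hatimerun can be illustrated with a few lines of plain shell. This sketch is not hatimerun's implementation, just the concept: launch the command, arm a watchdog, and report failure if the watchdog fires first:

```shell
#!/bin/sh
# Sketch: run a command under a timeout, in the spirit of what
# hatimerun(1M) provides to data service probes. Illustration only,
# using plain POSIX shell job control.

run_with_timeout() {
    secs=$1; shift
    "$@" &                      # launch the probe/command
    cmd_pid=$!
    ( sleep "$secs"; kill "$cmd_pid" 2>/dev/null ) &
    watchdog=$!
    wait "$cmd_pid"             # command status, or kill status if timed out
    status=$?
    kill "$watchdog" 2>/dev/null
    return "$status"
}

# A "probe" that finishes quickly succeeds...
run_with_timeout 5 true && echo "probe ok"
# ...while one that hangs past the timeout is killed and reports failure.
run_with_timeout 1 sleep 10 || echo "probe timed out"
```

A fault monitor built this way treats any nonzero status, whether from the probe itself or from the watchdog kill, as a probe failure.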
Access to all of these facilities is via either C library routines or command-line utilities that can be called from a script. Therefore, the methods for data services built without DSDL can be written in any programming language. Figure 3-3 illustrates the interfaces for a data service built without DSDL.
Figure 3-3 shows the interfaces without DSDL: resource type callback methods call libscha (RMAPI) directly, together with PMF and hatimerun(1M), beneath which sits the RGM.

Figure 3-3
The libscha.so library: The C language implementation of the RMAPI.
The PMF service: The process monitoring facility, which provides a means of monitoring processes and their descendants, and restarting them if they should stop.
The hatimerun command: A facility for running programs under a timeout.
The libdsdev.so library contains the DSDL functions. It is accessible from C and C++ programs only. Figure 3-4 illustrates the interfaces for a data service built with DSDL.
Figure 3-4 shows the interfaces with DSDL: resource type callback methods call libdsdev (DSDL), which layers over libscha (RMAPI), PMF, and hatimerun(1M), beneath which sits the RGM.

Figure 3-4
Restart the daemon every time it dies
Restart the daemon a certain number of times within a certain number of minutes
Invoke an action script
A combination of the previous two (invoke an action script only if the daemon dies more times than the threshold)
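The second option, restarts counted within a sliding time window, can be illustrated stand-alone. This sketch is not PMF itself (pmfadm applies the policy to real process trees); the death timestamps and thresholds are sample data:

```shell
#!/bin/sh
# Sketch: the sliding-window restart policy (restart a daemon at most
# N times within a window of M seconds; give up past the threshold).
# "Death times" here are sample epoch timestamps, not real processes.

RETRIES=2          # allowed deaths-with-restart...
WINDOW=60          # ...within this many seconds

deaths="100 130 150 400"       # sample times at which the daemon died

history=""
for t in $deaths; do
    # Keep only the deaths that fall inside the window ending at time t
    kept=""
    count=0
    for h in $history $t; do
        if [ $((t - h)) -le $WINDOW ]; then
            kept="$kept $h"
            count=$((count + 1))
        fi
    done
    history=$kept
    if [ "$count" -gt "$RETRIES" ]; then
        echo "t=$t: threshold exceeded, invoke action script / give up"
    else
        echo "t=$t: restart daemon (death $count in window)"
    fi
done
# For the sample data: restarts at t=100 and t=130, threshold exceeded
# at t=150, and a fresh restart at t=400 (the old deaths aged out).
```

The key behavior is the last line of output: once the window slides past the earlier deaths, the failure count resets and restarts are allowed again.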
PMF monitors, but does not try to restart, the application daemon itself.
PMF uses the scds_pmf_action_script to tell the fault monitor about the application daemon death (every single data service written with DSDL uses the same action script).
The fault monitor decides whether to have the RGM restart the service or fail over the whole resource group.
Figure 3-5 shows the flow: PMF watches the application daemons; on daemon death it runs the action script, which informs the fault monitor; the fault monitor then asks the RGM for a restart or a failover.

Figure 3-5
Lets PMF inform it immediately of application daemon death, through the action script.
Periodically probes the application. For example:
To verify that the NFS service is healthy, the fault probe sends null RPC requests to the NFS service daemons.
To verify that the Sun Java Web Server or Apache Web Server is healthy (the probes are identical), the fault probe contacts the web server port and retrieves the head information from a configurable list of URLs (it defaults to just the root URL).
If these requests return to the probe within probe_timeout seconds, then the probe concludes that the service is healthy. Otherwise, it increments the failure history and either restarts the service or fails it over, depending on the values of the retry_count and retry_interval properties and the number of recorded failures. DSDL fault monitors define a partial failure that contributes a fractional quantity to the failure history. Actual values for these fractional quantities are determined by the resource type developer, but they must be scaled to a number between 0 and 1 before being added to the failure history. To illustrate this concept, suppose there is an instance of an Apache web server running in a given resource group. If the probe successfully establishes a TCP connection to the web server but fails to read all the requested data before the probe times out, the failure is considered a partial (50 percent) failure. Two such partial failures that accumulate within the same retry interval are considered a complete failure.
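The failure-history arithmetic described above can be sketched stand-alone. The outcome values and retry_count below are illustrative sample data; in a real DSDL monitor this bookkeeping is handled inside the library's fault monitor calls:

```shell
#!/bin/sh
# Sketch: how a DSDL-style fault monitor accumulates fractional
# failures into a failure history and compares it with retry_count.
# All values are illustrative sample data.

RETRY_COUNT=2

# Probe outcomes within one retry_interval, scaled to 0..1
# (0.5 = partial failure, 1 = complete failure)
outcomes="0.5 0.5 1 1"

history=0
for f in $outcomes; do
    # POSIX sh has no float arithmetic, so let awk do the sums
    history=$(awk -v h="$history" -v f="$f" 'BEGIN { print h + f }')
    over=$(awk -v h="$history" -v r="$RETRY_COUNT" 'BEGIN { print (h > r) }')
    if [ "$over" -eq 1 ]; then
        echo "failure history $history exceeds retry_count: request failover"
    else
        echo "failure history $history: restart service"
    fi
done
# The two 0.5 partial failures add up to one complete failure, and the
# history crosses retry_count only on the final probe outcome.
```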
Writing Sun Cluster 3.x Software Data Services

Figure 3-6 depicts how a DSDL fault monitor checks the health of a service.
Figure 3-6  DSDL fault monitor probe loop: the monitor sleeps for Thorough_probe_interval, then probes the service. On success it loops again. On a partial failure it increments the failure count by a fractional amount, sets the status to degraded, and updates the failure history to suggest a restart; on a complete failure it increments the failure count by 1. While the failure history is below retry_count the monitor requests a restart; once retry_count is reached it sets the status to failed and calls scha_control to suggest a failover.
Let PMF do it directly. Let the fault monitor do it directly, without the involvement of the RGM. Let the fault monitor restart the resource, but inform the RGM that the resource is being restarted. This is implemented with the low-level call:
scha_control -O RESOURCE_IS_RESTARTED -G RG -R RES
Let the fault monitor tell the RGM to restart the resource. This is implemented with the low-level call:
scha_control -O RESOURCE_RESTART
DSDL fault monitors always choose the last of these options (it is embedded in the DSDL restart function). There are several reasons to have the RGM perform resource restarts rather than restarting resources outside the scope of the RGM. This is discussed further in the advanced resource control section of this module.
They may have different extension properties that are used to customize different instances of that same resource type. The START method of each resource type will have code particular to that type that gets values for the extension properties and uses them to launch the application daemon with the correct parameters. The part of the fault monitor that actually probes the application will be particular to that application. The STOP method may have some customized code to stop an application (such as Apache calling apachectl stop).
The application to be launched (the Start_command)
The application probe to be called by the standard fault monitor framework (the Probe_command)
The command to stop the application (the Stop_command)
Invent some configuration file that you will put on all nodes (or in a global file system), usually with some VARIABLE=VALUE lines. Have the application specified by the Start_command be a wrapper around your real application. The wrapper reads in the configuration values from the file in order to correctly start the customized instance of your application. Use the configuration values in a similar way with the Probe_command and Stop_command.
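For example, a wrapper might read such a configuration file before launching the real daemon. This is only a sketch: the file name, variables, and daemon name below are invented for illustration (in practice the file would live on a global file system):

```shell
# Hypothetical wrapper: read VARIABLE=VALUE lines from a shared
# configuration file, then start the real application with the values.
CONFFILE=/tmp/myapp.conf    # illustration; normally a global FS path

# Create a sample configuration file, as an administrator would:
cat > "$CONFFILE" <<'EOF'
PORT=8080
DATADIR=/global/myapp/data
EOF

# The wrapper sources the file to pick up the values ...
. "$CONFFILE"

# ... and would then launch the real daemon with them:
echo "would run: myappd -p $PORT -d $DATADIR"
```

The same sourcing trick serves the probe and stop wrappers, so all three commands see identical configuration without any extension properties.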
Reducing code bloat. Ensuring that the 95 percent redundant part in every data service is implemented correctly with DSDL. If you program DSDL yourself, you may make programming errors in the calls or the convention. Simplifying configuration by using a custom configuration file rather than application-specific extension properties.
GDS does not allow you to configure different application instances using extension properties. Using GDS, you cannot use the standard clrs show command to see extension properties that customize particular instances. With GDS, configuration of your custom parameters cannot be validated with a VALIDATE method, which is valuable in a non-GDS agent. They must be validated by your wrappers.
Real new resource types whose methods are written in C using DSDL (you must have a C compiler in your PATH to be offered this option)
Real new resource types whose methods are written in ksh (no DSDL)
New applications to be configured as an instance of SUNW.gds
Note There is less value in using the builder for a new application configured as an instance of SUNW.gds, since 95 percent of the code is already encapsulated in the GDS. However, the builder does provide scripts to ease the burden of calling the correct cluster commands to build your application as a GDS. It does not help you create a framework for having application wrappers read a configuration file (to replace the absence of customized extension properties). The agent builders have a code-generation phase (scdscreate on the command line, or the first action of the GUI) that lays down the skeleton code for you. For real new types, a full RTR file and all the methods and probes are created. You then customize the code and move to the packaging phase (scdsconfig on the command line), which creates a Solaris package for your agent.
If a resource group fails to start twice on the same particular node (failure of START methods of same or different resources in the group) within the pingpong interval (expressed in seconds), then the RGM will not consider that node as a candidate for the group failover. If one particular resource successfully does a scha_control -O GIVEOVER to take a group off of a particular node, and then the same resource tries to do another scha_control -O GIVEOVER on a different node to bring the group back to the original node, RGM will reject it within the pingpong interval.
Note The Pingpong_interval property is meant to prevent faulty start scripts or properties, faulty fault monitors, or problem applications from causing endless pingponging between nodes.
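For example, the interval can be inspected and raised with the standard commands (the group name here is illustrative; the exercises later in this module use the same commands):

```
# clrg show -p Pingpong_interval my-rg
# clrg set -p Pingpong_interval=3600 my-rg
```

The value is expressed in seconds, so this example prevents a second failover attempt back to a node that failed a start within the last hour.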
Describes what should happen to a resource group if this resource fails to start up (a START method fails). Should the group move to another node, or should it live without this resource? Describes what happens to a resource group if this resource fails to stop (a STOP method fails). Should the group be frozen pending manual clearing by the administrator of a STOP_FAILED flag, or should you reboot the node where the method failed? Puts restrictions on whether the RGM will allow the fault monitor for this resource to cause group failover through scha_control -O GIVEOVER. By setting this property, you can have the RGM categorically deny giveover requests from this resource's fault monitor, although fault monitors for other resources in the same group might still cause the whole group to fail over. Puts restrictions on whether this resource's fault monitor can cause the RGM to restart this resource. This is one reason why having the fault monitor delegate resource restarts to the RGM is preferred, rather than the fault monitor (or PMF) restarting resources directly. The restart restriction cannot be enforced if the restart occurs outside of RGM control.
Controlling RGM Behavior Through Properties

Table 3-2 describes how the values of the Failover_mode property work.

Table 3-2  Operation of the Failover_mode Values

NONE – Failure to start: other resources in the same resource group can still start (if non-dependent). Failure to stop: the STOP_FAILED flag is set on the resource. Fault monitor can cause the RGM to fail the resource group over: Yes. Fault monitor can cause the RGM to restart the resource: Yes.

SOFT – Failure to start: the whole resource group is switched to another node. Failure to stop: the STOP_FAILED flag is set on the resource. Fail over: Yes. Restart: Yes.

HARD – Failure to start: the whole resource group is switched to another node. Failure to stop: the node reboots. Fail over: Yes. Restart: Yes.

RESTART_ONLY – Failure to start: other resources in the same resource group can still start (if non-dependent). Failure to stop: the STOP_FAILED flag is set on the resource. Fail over: No. Restart: Yes.

LOG_ONLY – Failure to start: other resources in the same resource group can still start (if non-dependent). Failure to stop: the STOP_FAILED flag is set on the resource. Fail over: No. Restart: No.
RESTART_ONLY and LOG_ONLY are new with Sun Cluster 3.1 Update 3 (9/04). Note that they are the same as NONE concerning START and STOP failures; the difference is that they put restrictions on what the RGM will do on behalf of the fault monitor (with either value set, the resource cannot be the cause of resource group failover).
If the STOP_FAILED flag is set, it must be manually cleared using the clrs clear command before the service can start again.
# clrs clear -f STOP_FAILED -n nodename resname
Resource Dependencies
Resource dependency properties form a special subset of the standard resource properties. The RGM enforces four kinds of resource dependencies: regular, weak, restart, and offline-restart.
The dependency is indicated by defining Resource_dependencies=ResB as a property of resource A. Resource B must be added first. Resource B must be started first. Without the dependency, the RGM might have been able to start them in parallel. Resource A must be stopped first. Without the dependency, the RGM might have been able to stop them in parallel. Resource A must be deleted first. The rgmd daemon will not try to start resource A if resource B fails to go online, and will not try to stop resource B if resource A fails to be stopped.
Note Starting in Sun Cluster 3.2 you are allowed to explicitly disable resource B (the dependee) even when resource A is still online.
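Using the property syntax above, a regular dependency of an existing resource A on resource B (names illustrative) is declared with:

```
# clrs set -p Resource_dependencies=ResB ResA
```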
Offline-Restart Dependencies
These dependencies are new to Sun Cluster 3.2 and are similar to restart dependencies. With the new offline-restart dependencies, when resource B is detected to be offline, or is put offline explicitly, resource A is restarted. But the actual start part of the restart for resource A will block until resource B actually goes online again. This can lead to more accurate semantics for the relationship between A and B. Regular restart dependencies do not do anything with the dependent (A) even if the RGM is aware that the dependee (B) is offline. However, if you know that A cannot operate properly without B, the offline-restart behavior may be more correct.
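The offline-restart variant uses its own property, as seen in the exercises later in this module (resource names illustrative):

```
# clrs set -p Resource_dependencies_offline_restart=ResB ResA
```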
Cross-Group Dependencies
Starting in Sun Cluster 3.1 9/04 (Update 3), any of the types of dependencies can be between resources in the same resource group or in different resource groups. If the resources are in different groups, the dependency does not imply any preference about whether the groups run on the same or different nodes (that is controlled by RG_affinities, presented later in this module). There are additional side effects of cross-group dependencies:
You will not be allowed to put a group containing the dependee (resource B) offline (with clrg offline) if the dependent (resource A) is online in its group. You are allowed to switch the group containing the dependee to a different node (clrg switch) while the dependent stays online. But if it is a restart dependency or offline-restart dependency, the RGM will also restart the dependent while leaving the rest of its group alone. This could daisy-chain. Assuming all resources are enabled, if you start the dependent's group before that of the dependee, the dependent's group will be in a Pending Online state with the dependent resource not started until the dependency can be satisfied.
Advanced Resource Group Relationships

The following will be affected by a weak positive affinity:
Failover of the source group – If the target is online, when the source group needs to fail over, it will fail over to the node running the target group, even if that node is not a preferred node on the source group's node list.
Putting the resource group online onto a non-specified node:
# clrg online source-grp
Similarly, when a source group goes online and you do not specify a specific node, it will go onto the same node as the target, even if that node is not a preferred node on the source group's node list.
Weak negative affinities affect the exact same scenarios, with the source group preferring to fail over or go online on a node not currently running the target group. However, weak affinities are not enforced when you manually bring or switch a group onto a specific node. The following command will succeed, even if the source group has a weak affinity for a target running on a different node.
# clrg switch -n specific-node src-grp
There can be multiple resource groups as the value of the property. In other words, a source can have more than one target. In addition, a source can have both weak positive and weak negative affinities. In these cases, the source prefers to choose a node satisfying the greatest possible number of weak affinities. For example, it will select a node that satisfies two weak positive affinities and two weak negative affinities rather than a node that satisfies three weak positive affinities and zero weak negative affinities.
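Weak affinities are declared with a single + or - prefix on the target group name, the same syntax used in the exercises later in this module (group names illustrative):

```
# clrg set -p RG_affinities=+target-rg source-rg
# clrg set -p RG_affinities=-target-rg source-rg
```

The first command declares a weak positive affinity for target-rg, the second a weak negative affinity.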
The only node or nodes on which the source can be online are nodes on which the target is online.
If the source and target are currently running on one node, and you switch the target to another node, it will drag the source with it. If you offline or online the target group, it will offline or online the source as well. An attempt to switch the source to a node where the target is not running will fail. If a resource in the source group fails, the source group still cannot fail over to a node where the target is not running. (The solution to this is discussed in the next section.)
The source and target are closely tied together. If you have two failover resource groups with a strong positive affinity relationship, it might make sense to make them one group. So why does strong positive affinity exist?
The relationship can be between a failover group (source) and a scalable group (target). That is, you are saying the failover group must run on some node already running the scalable group. You might want to be able to offline the source group but leave the target group running, which strong positive affinity allows. Some resources may reject being put in the same group, but you still want them all running on the same node or nodes.
The only difference between the +++ and the ++ is that with +++, if a resource in the source group fails and its fault monitor suggests a failover, the failover can succeed. The RGM will move the target group over to where the source wants to fail over to, and then the source gets dragged correctly.
Will these two groups run on the same node or different nodes? RG1 has a weak positive affinity for RG2. Normally RG1 will be running on a different node from RG2. If RG1 needs to fail over, it will fail over to the node where RG2 is running. Because RG2 has a strong negative affinity for RG1, it will have to move either to a third node, or to the node where RG1 was running previously (if the node is still alive). If there are no more nodes remaining, RG2 would have to go offline completely.
The relationship is illustrated in Figure 3-7:
Figure 3-7  Combining a weak positive affinity (RG1 for RG2) with a strong negative affinity (RG2 for RG1); RG2 is shown running on Node 2.
What kind of application needs this type of affinity? This affinity is used in the SAP Web Application data service. In the example, RG1 contains a master application. RG2 contains a memory state replica application, intended to run only on a different node from RG1's node, and intended to replicate memory state on behalf of the master application. The memory state replica application provides memory state so that if RG1 fails over, it will fail over to the node RG2 was on and get its memory state preserved. At that point, RG2 no longer needs to run on that same node; it needs to move to another node. If there is only one node left, there is no need for RG2 (the memory replicator), since its purpose is to run on a different node.
Multimaster applications run on more than one node at a time but do not make use of the internal load balancing features provided by the Sun Cluster software. Scalable applications specifically make use of the internal load balancing features provided by the SharedAddress resource. This is the type represented by a resource that has the value of the Scalable resource property set to TRUE.
LB_STICKY – Connections from the same client IP all go to the same node. Load balancing is only for different clients. This is only for ports listed in the Port_list property.
LB_STICKY_WILD – Connections from the same client to any server port go to the same node. This is good when port numbers are generated dynamically and not known in advance.
The default is no client affinity (the default value of the Load_balancing_policy property is LB_WEIGHTED).
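As a sketch (the resource and group names are illustrative, and the exact set of required properties varies by resource type), a sticky policy is chosen when the scalable resource is created:

```
# clrs create -t SUNW.apache -g web-rg \
-p Scalable=TRUE -p Port_list=80/tcp \
-p Load_balancing_policy=LB_STICKY \
-p Resource_dependencies=sa-res apache-res
```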
Multimaster and Scalable Applications

For example, if you set Affinity_timeout to 900 (15 minutes, a reasonable value for a shopping cart application), a client would lose the affinity 15 minutes after the last connection closes. This includes time that the connection is in a TIME_WAIT state. Setting an affinity timeout can prevent wasting all your memory if you really had millions and millions of possible clients. You can set Affinity_timeout to -1, or infinite, so that affinity is never lost, unless the node to which a client is mapped goes down or its application becomes disabled. The default value for Affinity_timeout is 0. Does this mean that affinity still exists? As long as the client makes a continuous series of connections and never reaches a state of all connections closed, it would still have affinity. If some connections were still in the TIME_WAIT state (this would give you an extra 60 seconds on later Solaris 9 OS versions and on Solaris 10), you would keep your affinity, but once the last connection closed, affinity would be gone. The default value of 0 might be appropriate for some applications. Consider something like a scalable FTP application, where a new data connection must be connected to the same node as an existing control connection.
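For the shopping cart example above, the timeout would be set on the scalable resource (the resource name is illustrative):

```
# clrs set -p Affinity_timeout=900 web-res
```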
Task 1 Create a wrapper script for your application
Task 2 Create a new resource type
Task 3 Install the newly created resource type
Task 4 Register the new resource type
Task 5 Instantiate a resource of the new resource type
Task 6 Put the resource group containing the new resource online
Task 7 Test the fault monitor for the new resource type
Preparation
There is no special preparation required for the following tasks.
Exercise 1: Creating Sun Cluster Software Data Services

2. Create a wrapper script for the xclock binary. This is the application for which you create a new resource type. Be careful to use the proper quote characters. On all cluster nodes, type the following (or use a model file provided by the instructor):
# vi /usr/openwin/bin/myxclock
#!/bin/ksh
# this script will get called as follows: myxclock $RESOURCE_NAME
# because you will subsequently modify the START method to do so
DISP_HOST=$(/usr/cluster/bin/scha_resource_get \
-O EXTENSION -R $1 Display_host | sed -n '$p')
CLOCKTYPE=$(/usr/cluster/bin/scha_resource_get \
-O EXTENSION -R $1 Clock_type | sed -n '$p')
/usr/openwin/bin/xclock -$CLOCKTYPE -display $DISP_HOST \
-title "xclock on $(uname -n)"

3. Make the wrapper script executable:
# chmod a+x /usr/openwin/bin/myxclock
Vendor Name – TEST
Application Name – xclock
RT Version – 1.0
Working Directory – /var/tmp
Scalable/Failover (radio button) – Failover
Network Aware check box – Deselect
Type of generated source – ksh
3. Click Create.
4. Click OK in the Success! dialog box.
5. Click Next.
6. Fill out the form as follows:
Start Command – /usr/openwin/bin/myxclock $RESOURCE_NAME
Stop Command – Leave this blank
Validate Command – Leave this blank
Probe Command – /usr/cluster/bin/pmfadm -q $PMF_TAG
7. Click Configure.
8. Click OK in the Success! dialog box.
9. Do not exit the builder application; go to a command-line window on the same node on which you are running the builder.
10. Modify the RTR file to add an extension property. The scdsbuilder executable left a section for you at the bottom of the RTR file. If you use the instructor's model file, make sure to insert its lines in the appropriate place in the TEST.xclock file.
# cd /var/tmp/TESTxclock/etc
# vi TEST.xclock

# User added code -- BEGIN vvvvvvvvvvvvvvv
{
PROPERTY = Display_host;
EXTENSION;
STRING;
DEFAULT = "";
TUNABLE = WHEN_DISABLED;
DESCRIPTION = "Display (host:#) on which to display xclock GUI";
}
{
PROPERTY = Clock_type;
EXTENSION;
Enum {digital, analog};
Default = analog;
TUNABLE = WHEN_DISABLED;
DESCRIPTION = "Type of clock (analog or digital)";
}
# User added code -- END ^^^^^^^^^^^^^^^

11. Back in the builder window, again click Configure.
12. Acknowledge the Success! dialog box, and exit the builder application.
1. Create an empty resource group:
# clrg create -n node1,node2 xclock-rg
2. Create an instance of the TEST.xclock resource type:
# clrs create -t TEST.xclock \
-g xclock-rg -p Clock_type=analog \
-p Thorough_probe_interval=10 \
-p Display_host=display_station:display# xclock-res
Task 6 Putting the Resource Group Containing the New Resource Online
From one cluster node, run the following commands:
# clrg online -M xclock-rg
# clrg status
# clrs status
Task 7 Testing the Fault Monitor for the New Resource Type
Perform the following steps:
1. Close the xclock window, and verify that it is restarted.
2. What are the values that control how many restarts can happen on the same node within a certain interval before a failover occurs?
# clrs show -p Retry_count -p Retry_interval xclock-res
3. What happens if you close the window three times within the 370-second interval? Do it (the three times includes the time in step 1, if it is still within the interval).
4. Now that your application has failed over to the next node, close the application three consecutive times on that node.
5. Why will your application not fail back to the original node? What kind of messages are you seeing on the node consoles or at the bottom of the /var/adm/messages file?
6. On either node, print out the current value of the Pingpong_interval, and change it to a lower value:
# clrg show -p Pingpong_interval xclock-rg
# clrg set -p Pingpong_interval=60 xclock-rg
7. Verify that the application can now fail back to its first node.
Task 1 Make a version of an application wrapper script suitable for GDS
Task 2 Create a variation of the data service using GDS
Task 3 Verify restart and failover behavior of the new resource
Note Every real-world application that is based on GDS has to be configured by some custom-built file such as this rather than by extension properties.
Verify in the console message that the fault probe is unable to restart the resource, and that it fails over to the other node.
Repeat steps 3 and 4 TWICE on the new node, rerunning the clrg status command each time until you verify that the resource is Online.
9. Close the GDS a third time on the new node. Verify that the application cannot fail over because of the Pingpong_interval.
10. Note the slightly different behavior of this resource (which conforms more to the DSDL standards) than the one from the previous exercise. The resource should be restarted.
11. On either node, print out the previous value of the Pingpong_interval, and change it to a lower value:
# clrg show -p Pingpong_interval xclockgds-rg
# clrg set -p Pingpong_interval=60 xclockgds-rg
12. Did you prefer the GDS version of the application to the other version? Is it less work to create? Is it fully functional? There are no correct answers to these questions.
Task 1 Investigate cross-group dependencies and restart dependencies
Task 2 Investigate resource group affinities
Task 3 Modify a failover service's Failover_mode property
Now try to disable the dependee (you will see a restart issued for the dependent GDS resource, but its start is blocked):
# clrs disable xclock-res
# clrg status xclock-rg xclockgds-rg
7. Restart the dependee and verify that the dependent is now unblocked as well:
# clrs enable xclock-res
# clrg status xclock-rg xclockgds-rg
# clrs status xclock-res xclockgds-res
# clrg offline xclock-rg
# clrg offline xclockgds-rg
# clrs set -p Resource_dependencies_offline_restart="" xclockgds-res
2. Place a weak negative affinity for each group on the other:
# clrg set -p RG_affinities=-xclockgds-rg xclock-rg
# clrg set -p RG_affinities=-xclock-rg xclockgds-rg
3. Bring the groups online without specifying a particular node. Do they end up on the same or different nodes?
# clrg online xclockgds-rg
# clrg online xclock-rg
4. Switch the groups so that they are on the same node. Why is this allowed?
# clrg switch -n othernode xclockgds-rg
5. Try to set a strong negative affinity. What message do you get?
6. Switch one of your groups and set the strong negative affinity. Note that you are making the non-GDS resource group the target of the affinity relationship. That group can switch, but the source group will always move out of its way.
# clrg set -p RG_affinities=--xclock-rg xclockgds-rg
Exercise 3: Advanced Resource and Resource Group Control

7. Try to switch the source group. Can you do it?
# clrg switch -n othernode xclockgds-rg
8. What happens when you switch the target group?
# clrg switch -n othernode xclock-rg
9. Take both groups offline, set a strong positive affinity relationship, and bring them online:
# clrg offline xclock-rg
# clrg offline xclockgds-rg
# clrg set -p RG_affinities="" xclock-rg
# clrg set -p RG_affinities=++xclock-rg xclockgds-rg
10. Bring them online. Try to bring the source group online first and see what happens. Then bring the target group online and it should drag the source online with it.
# clrg online xclockgds-rg
# clrg online xclock-rg
11. Try to switch the source and then the target. How do they behave?
# clrg switch -n othernode xclockgds-rg
# clrg switch -n othernode xclock-rg
12. Try to make the source fail over (by closing the GDS xclock three times). Wait each time until clrs status xclockgds-res reports that the resource is Online again. Why does it not fail over?
13. Modify the affinity so that it can perform failover delegation:
# clrg set -p RG_affinities=+++xclock-rg xclockgds-rg
14. Repeat step 12 and observe the results. You may see results after killing it fewer than three times because the RGM is still counting failures within the interval.
15. Can you set a strong positive affinity between a scalable group and a failover group? Try it.
# clrg set -p RG_affinities=++xclock-rg iws-rg
Replace the STOP method for the ora-listener-res resource on the node determined in Step 1:
# mv /opt/SUNWscor/oracle_listener/bin/oracle_listener_stop /var/tmp
# vi /opt/SUNWscor/oracle_listener/bin/oracle_listener_stop
#!/bin/ksh
exit 1
# chmod a+x /opt/SUNWscor/oracle_listener/bin/oracle_listener_stop
3. Modify the Failover_mode property from one node (ignore validation errors on the node not mounting the failover file system, as usual).
# clrs set -p Failover_mode=HARD ora-listener-res
4. Put the ora-rg resource group offline from one node:
# clrg offline ora-rg
Note The host running the ora-rg resource group reboots as a result of the failure of the STOP method and the value of the Failover_mode property.

5. After the node reboots, restore the STOP method for the oracle_listener resource on that node.
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
Module 4
Add a new node to a running cluster
Remove a node from a cluster
Replace a failed node in a cluster
Uninstall the Sun Cluster 3.2 software from a node
Replace failed disks
Back up and restore the Cluster Configuration Repository (CCR)
Note If you have a third node to add to the cluster as part of the exercises for this module, you will need to make sure it is upgraded to Solaris 10 by running the ./installES445 script and choosing option 3, as noted in the preface.
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics presented in this module. While they are not expected to know the answers to these questions, the answers should be of interest to them and inspire them to learn the material presented in this module.
Discussion – The following questions are relevant to understanding the content of this module:
What is required to get existing cluster services to run on a new node in the cluster?
When performing a fresh installation on a failed node, how do you get the existing nodes to accept it back into the cluster?
What is the difference between replacing hardware Redundant Array of Independent Disks (RAID) disks and Just a Bunch Of Disks (JBOD) disks?
Additional Resources
The following references provide additional information on the topics described in this module:
Sun Microsystems, Inc. Sun Cluster Software Installation Guide For Solaris OS, part number 819-2970. Sun Microsystems, Inc. Sun Cluster Software Administration Guide For Solaris OS, part number 819-2971.
10. Configure quorum devices properly to take the new node into consideration.
11. Configure volume management on the new node, if needed.
12. Add the new node to existing volume manager device groups.
13. Verify or configure IP multipathing (IPMP) on the new node.
14. Prepare the new node to run existing applications.
15. Add the new node to existing resource groups.
# clintr disable vincent:hme0,theo:hme0
# clintr remove vincent:hme0,theo:hme0
2. Add a transport switch to the configuration:
# clintr add sw1
3. Add a transport cable connecting node 1 to the transport switch:
# clintr add vincent:hme0,sw1
4. Add a transport cable connecting node 2 to the transport switch:
# clintr add theo:hme0,sw1
Modify the other transports in a similar way.
All cluster transport switches
Public network hubs or switches
Data storage devices or switches, as appropriate
Run the devfsadm utility, or reboot the new node with the boot -r command, to make sure it recognizes any new storage connections.
Set up IPMP on the public network connections. If you do not set up IPMP on your public network, the following occurs:
Sun Cluster 3.1 8/05 (Update 4) and later – When you use the scinstall command to configure the node, it builds an IPMP group sc_ipmpN for every adapter for which there is a non-IPMP /etc/hostname.xxx file.
Sun Cluster 3.1 Updates 1-3 – A singleton IPMP group sc_ipmpN will be added if you were to create a brand new LogicalHostname or SharedAddress resource. But in order to modify an existing resource, you would need to configure IPMP by hand.
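As a sketch of configuring a minimal single-adapter IPMP group by hand (the adapter name qfe0 and the host name are examples, and the group name follows the sc_ipmpN convention mentioned above):

```
# cat /etc/hostname.qfe0
node1 netmask + broadcast + group sc_ipmp0 up
```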
You must modify this list to include your new node. Authentication is almost always configured with the value sys. A reverse IP lookup is performed on a node trying to join the cluster, and the name is matched with one on the list. Use the claccess utility on an existing node in the cluster to allow the new node to join as follows:
# claccess allow -h noodle
# claccess show
=== Host Access Control ===
Cluster name:                  orangecat
Allowed hosts:                 noodle
Authentication Protocol:       sys
Note: If the vxio major number used by the existing nodes conflicts with a different major number on that new node, change the conflicting entry on the new node to a higher, unused number, and do a reconfiguration reboot (or wait until the one that occurs automatically at the end of scinstall). If you are using Solaris VM software in your cluster, the major number for the md driver is already in the /etc/name_to_major file as a result of the Solaris OS installation. This is a reserved number, so it will always match on old and new nodes.
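The major-number check this note describes can be scripted. The sketch below runs against a throwaway sample file rather than the real /etc/name_to_major, and the driver names and numbers in it are invented for illustration; only the comparison logic is the point.

```shell
# Work on a scratch copy, never the live /etc/name_to_major.
ntm=$(mktemp)
printf 'md 85\nvxio 270\nfoo 271\n' > "$ntm"   # 'foo 271' is a made-up clash

want=271                                   # vxio major on the existing nodes
have=$(awk '$1 == "vxio" {print $2}' "$ntm")
if [ "$have" != "$want" ]; then
  # Which driver already holds the wanted number?
  clash=$(awk -v n="$want" '$2 == n {print $1}' "$ntm")
  # Pick a number higher than any existing entry for the clashing driver.
  free=$(( $(awk '$2 > m {m = $2} END {print m}' "$ntm") + 1 ))
  echo "move $clash to $free, then set vxio to $want"
fi
```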
- Sun Cluster 3.2 and Sun Cluster 3.1 8/05 (Update 4): You are required to use the Java ES installer to install the packages (or install a Solaris Flash archive from a node where the Java ES installer had been used to install the packages but scinstall had not yet been run).
- Sun Cluster 3.1 9/04 (Update 3): You may use the installer or a Flash image (from an unconfigured system) to install the cluster packages, or you may wait and let scinstall install them for you. If you let scinstall install the packages, you must first install the Java Web Console, if it is not already installed (it is already part of the base OS in Solaris 10 OS).
- All earlier releases: You can use an installer to install the packages, or just wait and let scinstall install them. These releases do not use or require the Sun Java Web Console.
Note: If the did major number used by the existing nodes conflicts with a different major number on the new node, change the conflicting entry on the new node to a higher, unused number, and make the did major number match the other nodes. scinstall will conclude with the desired reconfiguration reboot.
- Select the option to add this machine as a node in an established cluster.
- You can use the Typical install if you have the standard placeholder /globaldevices file system.
- Any existing node can be the sponsor node.
- Use autodiscovery for the cluster transport if you use Ethernet. Autodiscovery reliably probes and discovers Ethernet transport adapters.
Note: Solaris VM allows only two hosts to be mediators even if three or more hosts are attached to the diskset, so if you are adding a third attached node you will not be adding it as a mediator.
For the VxVM software, use the cldg command to add the node to a device group. You can use a wildcard (+) here if it is appropriate:
# cldg add-node -n new-nodename dgname
You can also use the clsetup utility to perform this procedure.
Configuring IPMP
IPMP must be configured on the public network interfaces on the new node before you can add the node to any LogicalHostname or SharedAddress resources. IPMP can be set up before the new node is added to the cluster, either manually or as part of a Solaris JumpStart software installation. Verify the network configuration with the following command:
# ifconfig -a
If the public network interfaces are not in IPMP groups, set up /etc/hostname.xxx files on the new node and reboot it.
- Install any application software on the new node that you installed on the local disks of other nodes. Installing applications on local disks instead of global file systems allows rolling maintenance of the software, although it is more time-consuming.
- Create any necessary local files and directories, even for software installed in the global file systems. For scalable services, log files must be on the local storage.
- Modify the /etc/passwd, /etc/shadow, /etc/group, /etc/system, and /etc/project files with any application-specific changes that you already made on the other nodes.
- Add any lines to the /etc/vfstab file that reference global file systems.
- Add any lines to the /etc/vfstab file for failover file systems that the application accesses. These require that the node be physically connected to the storage and appear in the Nodelist property for the device group.
- Create any logical host (or shared address) IP address entries in /etc/hosts (or /etc/inet/ipnodes for IPv6).
- Install any data service agents that are needed on the new node.
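A quick sanity check over those files can catch a missing entry before you first try to run the application on the new node. The sketch below checks sample copies built on the spot (the metadevice path and addresses are placeholders; the host and mount-point names come from this chapter's examples), never the live /etc files.

```shell
# Build sample stand-ins for /etc/hosts and /etc/vfstab.
hosts=$(mktemp); vfstab=$(mktemp)
printf '192.168.1.50 ora-lh\n' > "$hosts"
printf '/dev/md/webds/dsk/d100 /dev/md/webds/rdsk/d100 /global/web ufs 2 yes global\n' > "$vfstab"

missing=0
# The logical host must resolve locally.
grep -qw ora-lh "$hosts" || missing=$((missing + 1))
# The global file system line must have mount-at-boot "yes" and option "global".
awk '$3 == "/global/web" && $6 == "yes" && $7 == "global"' "$vfstab" | grep -q . \
  || missing=$((missing + 1))
echo "missing entries: $missing"
```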
# clrs set -p netiflist=sc_ipmp0@1,sc_ipmp0@2,sc_ipmp0@3 ora-lh
# clrs set -p netiflist=sc_ipmp0@1,sc_ipmp0@2,sc_ipmp0@3 iws-lh
Now you can modify the Nodelist property for the existing resource groups to include the new node. In the new CLI, clrg add-node adds a new node to the node list while keeping the existing nodes intact. For scalable applications you must modify the Nodelist property for the failover group containing the SharedAddress resource before modifying the Nodelist property for the scalable application group. You could do them with the same command, or with the wildcard.
# clrg add-node -n noodle ora-rg
# clrg add-node -n noodle lb-rg
# clrg add-node -n noodle iws-rg
If you have a scalable resource group, change the Desired_primaries and Maximum_primaries properties of the resource group, assuming you want the group to run on all nodes at the same time:
# clrg set -p Desired_primaries=3 -p Maximum_primaries=3 iws-rg
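Building the new netiflist value is plain string concatenation on the existing list. A sketch; the sc_ipmp0 group name and node ID 3 are taken from the example above, and the command is printed rather than executed:

```shell
# Append the new node's IPMP group to the existing netiflist value.
existing="sc_ipmp0@1,sc_ipmp0@2"   # value currently on the resource
newgroup="sc_ipmp0"                # IPMP group name on the new node
newid=3                            # cluster node ID of the new node
netiflist="$existing,$newgroup@$newid"
echo "clrs set -p netiflist=$netiflist ora-lh"
```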
Now you can use clrg switch or clrg online to run your applications on the new node.
- Orderly removal of a node (includes deconfiguration of the cluster framework on that node)
- Removal of a dead node from the cluster configuration
The only interruption to service would be the switching of services off a live node that you are removing in an orderly fashion. These procedures have been greatly simplified in Sun Cluster 3.2. The following steps are required to remove a node:
1. Switch any services and device groups off the node, if it is still alive.
2. Remove the node from the Nodelist property of resource groups.
3. Remove the node from the Nodelist property of the VxVM and Solaris VM device groups.
4. Halt the node to be removed (orderly scenario) and boot it to noncluster mode.
5. Disable node quorum votes and remove attached quorum devices.
6. Remove or clear the node from the cluster, depending on the state of the node being removed.
   (A) Orderly Removal Scenario:
   a. Allow removal access for the node to be removed.
   b. On the node to be removed, comment out any global file systems from /etc/vfstab (you can leave /global/.devices/node@#).
   c. Run clnode remove on the node being removed. This automates all other steps of the node removal, including deconfiguration of the cluster framework on that node.
   (B) Dead Node Removal Scenario:
   a. Run clnode clear on a remaining cluster node.
7. Add back any quorum devices, as appropriate.
Note: This will also evacuate any services from non-global zones running on the specified node.
Note: If any resource group were configured in a non-global zone, its Nodelist property would not be modified by the above command. You would have to explicitly remove the non-global zone from the Nodelist. For example:
# clrg remove-node -n noodle:myzone myrg
Note The Desired_primaries and Maximum_primaries properties are automatically reduced for scalable groups.
Note The -f is required if the node being removed is already dead. At the time of writing there is an outstanding bug that these take an excruciatingly long time (more than 5 minutes) if the node being removed is already dead. The BugID is 6507093.
Note: The -F option overrides any objection that the node may have to the fact that a local tape drive is still listed in its own local copy of the CCR as a device group.
- Complete failure of the node hardware itself
- Failure or accidental corruption of the root disk or root file system
It makes no difference whether you need to install a new OS on a new replacement node, or on the same node that you used to have. Both are considered complete node replacements. The following subsections discuss two possible replacement procedures. Note that the first is more general (the replacement node could even have different disk controllers and network adapters).
Removing the Node Definition From the Cluster and Adding It Back
The most reliable solution to replacing a failed node in a cluster is to completely remove the node's definition from the cluster, and then add it back as a brand new node. This allows the new node to have different transport adapters, storage attachments, and even a different name (or any or all of these could be the same as before). Although this solution involves going through all the procedures mentioned earlier in the chapter to remove a node definition and then add it back, it is likely to be the fastest way to restore a cluster node. The alternative, restoring archives, takes much longer. To make Solaris and Sun Cluster package installation faster, you can install a Flash archive that was created on a node that had the Sun Cluster packages installed but had not been configured with scinstall.
Note: You might think you could create a Flash archive from a node already configured into the cluster, and use it to restore a node without having to remove its definition from the cluster and add it back. The problem is that standard Flash installation post-processing is inconsistent with the ability to boot right back into the cluster. While with some clever installation post-processing it was possible to make this work in Solaris 8 OS and Solaris 9 OS, it has never been supported, and this author has not been able to make it work in Solaris 10 OS.
- Individual disk drive failures in hardware RAID rather than software RAID
- Entire array (or cable to array) failures
- DID consistency issues
- Physical disk IDs for SCSI-JBOD and Fibre-JBOD disks
- Updating physical disk IDs in the DID database (in the CCR)
- VxVM software procedures for fixing broken mirrors
- Solaris VM software procedures for fixing broken mirrors
- Special issues for replacing a failed drive that is used as a quorum device
- The c#t#d# value from each node: If this changes, then you get a different DID number for the replacement.
- A disk serial number or worldwide name (WWN): The fact that the DID database contains a specific physical serial number or WWN from a disk is a tricky problem. In the cluster environment, this information exists and must be kept consistent across three different places:
  - The physical disk itself
  - The disk device driver (in each individual node's RAM)
  - The DID database portion of the CCR
Reviewing Disk Replacement Procedures

The illustration in Figure 4-1 shows the relationship among the different places that store a physical Disk ID:
Figure 4-1: Places that store the physical Disk ID. The ID lives on the disk itself; in each node's device driver (RAM), which is updated at boot, by cfgadm (SCSI JBOD disks), or by luxadm/devfsadm/devfsadmd (Fibre JBOD disks); and in the DID database in the CCR, which is synchronized with cldev repair.
The device driver in RAM on any particular node is automatically updated from the information that is physically present on the disk during a boot or reboot.
3. Switch the device group off the node whose disk you are replacing.
   Note: If the device group contains any failover file systems, you must switch the associated resource group (that is, the application) and let it drag over the device group. If there are no failover file systems, you can just switch the device group.
4. If you are using VxVM, temporarily remove the disk from VxVM control:
   # vxdisk offline c#t#d#
   # vxdisk rm c#t#d#
   # vxdmpadm -f disable path=c#t#d#s2
5. Use cfgadm to unconfigure the disk:
   # cfgadm -c unconfigure c#::dsk/c#t#d#
6. Physically replace the disk with a new disk (you can do this step anytime up until now, as long as the new disk is in place before the next step). If you are repeating these steps (as required) on a second node attached to the disk, the new disk will already be in place and you do not need to touch the hardware.
7. Use cfgadm to read the new physical disk information and configure it into device driver RAM:
   # cfgadm -c configure c#::dsk/c#t#d#
8. If you are using VxVM, the DMP device driver is automatically updated about the new disk. Type the following so that VxVM recognizes the new disk:
   # vxdisk scandisks
Note: The vxdisk scandisks option exists in VxVM version 4.0 and above. Prior to VxVM 4.0, you had to run vxdctl enable, whose documented purpose was to enable the configuration daemon but which had the side effect of scanning all the disks.
9. Repeat steps 3-8 on remaining nodes also connected to the disk. If you are using Solaris VM, in step 3, you must switch the device group or resource group off of that node (back to the first node, if there are only two attached nodes).
Note While this is the documented procedure for Fibre JBOD disk replacement in the cluster, in Solaris 9 OS and Solaris 10 OS the devfsadmd daemon on each node will automatically detect a new Fibre JBOD disk insertion and perform the equivalent of the above commands automatically for you.
Updating the DID Serial Number Information From the Device Driver RAM
The following procedure assumes that you physically replaced a JBOD drive. Because it has the same c#t#d# value as the drive it is replacing, the DID device number is the same. However, you still have to update the physical disk information stored in the DID database (in the CCR) for each DID. Proceed as follows:
1. Make sure you know which DID number you are working with. Use the cldev utility to map an individual c#t#d# value to a DID number, or the DID number to an individual c#t#d# value:
   # cldev list -v d13
   DID Device     Full Device Path
   ----------     ----------------
   d13            theo:/dev/rdsk/c2t1d0
   d13            vincent:/dev/rdsk/c2t1d0
2. From any node physically connected to the disk, print out the previous physical disk information stored in the CCR (for comparison purposes, so you can make sure your update succeeds):
# cldev show -v d13
=== DID Device Instances ===
DID Device Name:      /dev/did/rdsk/d13
Full Device Path:     theo:/dev/rdsk/c2t1d0
Full Device Path:     vincent:/dev/rdsk/c2t1d0
Replication:          none
default_fencing:      global
Disk ID:              46554a49545355204d4146333336344c2053554e33364720303036373731303320202020
Ascii Disk ID:        FUJITSU MAF3364L SUN36G 00677103
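When scripting this check, the DID-to-path mapping can be pulled out of saved `cldev list -v` output with awk. A sketch over the sample data rows above; the parsing, not any new command, is all that is being illustrated:

```shell
# Saved `cldev list -v` output (data rows only, copied from the example).
out='d13 theo:/dev/rdsk/c2t1d0
d13 vincent:/dev/rdsk/c2t1d0'

# Map a c#t#d# value back to its DID instance (first matching row).
did=$(echo "$out" | awk '/c2t1d0/ {print $1; exit}')
echo "DID for c2t1d0 is $did"
```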
3. From any node physically connected to the disk, use the cldev repair utility to synchronize the DID database with the new physical disk information from device driver RAM:
# cldev repair d13
# cldev show -v d13
=== DID Device Instances ===
DID Device Name:      /dev/did/rdsk/d13
Full Device Path:     vincent:/dev/rdsk/c2t1d0
Full Device Path:     theo:/dev/rdsk/c2t1d0
Replication:          none
default_fencing:      global
Disk ID:              46554a49545355204d4146333336344c2053554e33364720303036383532303920202020
Ascii Disk ID:        FUJITSU MAF3364L SUN36G 00685209
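A script can confirm the repair succeeded by extracting the serial number from the Ascii Disk ID line of the `cldev show` output taken before and after; if the two differ, the CCR was updated. A sketch using the sample lines above:

```shell
# Last field of the 'Ascii Disk ID' line is the drive serial number.
get_serial() { echo "$1" | awk '/Ascii Disk ID/ {print $NF}'; }

before='Ascii Disk ID: FUJITSU MAF3364L SUN36G 00677103'
after='Ascii Disk ID: FUJITSU MAF3364L SUN36G 00685209'

old=$(get_serial "$before")
new=$(get_serial "$after")
if [ "$old" != "$new" ]; then
  echo "repair took effect: $old -> $new"
fi
```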
A caution to this procedure is that mirrors involving broken disks might already be repaired by the Volume Manager software's vxrelocd process. If this has occurred, the Volume Manager does not consider these volumes broken anymore, even if they are now mirrored inside the same controller. You might need to manually examine every mirror carefully, deleting fixed plexes that are in the same storage array and remirroring onto your fixed disk either manually or with the vxunreloc command.
Note: When metadb replicas are added automatically to new diskset disks in Sun Cluster 3.2, they get the name /dev/did/rdsk/d# (a reverse-lookup problem; the replica really is on s7, but for non-EFI disks d# has the same major/minor number as d#s7). You need to delete using the same name that shows up in metadb -s setname -i. If you upgraded from Sun Cluster 3.1, the name will still include the s7.
3. If you use soft partitions on top of Solaris OS partitions, rewrite soft partition information to the new drive:
   # metarecover -s orads /dev/did/rdsk/d#s0 -p
   It is often easiest to mirror regular partitions first, and then make soft partitions on top of the mirrors. In this case no recovery is required for your soft partitions because, while the mirror is degraded, it remains usable.
4. Run the metastat -s dsname command to identify any broken mirrors. This command indicates where to run the metareplace command to fix mirrors. If mirrors are already fixed by hot spares,
you can still run the same metareplace command to put the fixed disk back into work and the spare back into the spare pool. Use the following command for each disk:
# metareplace -s nfsds -e d#_for_mirror fixed_component
In this command, the fixed_component argument is the soft partition device if you use soft partitions, or the /dev/did/rdsk/d#s# value if you do not use soft partitions.
Viewing an Example
This example shows how to fix a Sun StorEdge A5200 array disk failure with the VxVM software:
1. Identify the failed drive. This is often easiest to do through the volume management software:
   # vxdisk list
   # vxprint -g dgname
2. On the node hosting the device group to which the disk belongs, replace the failed disk. Let the luxadm utility guide you through removing the failed disk and inserting a new disk:
   # luxadm remove enclosure,position
   # luxadm insert enclosure,position
3. On the other node, run the devfsadm command.
Note: In Solaris 9 OS and Solaris 10 OS the devfsadmd utility detects physical disk insertion and automatically performs steps 2 and 3 for you.
4. On either node, reconfigure the DID information as follows:
   # cldev list -v c#t#d#
   # cldev repair d#
5. On all nodes attached to the device group, type the following:
   # vxdisk scandisks
6. On the node hosting the device group, type the following:
   # vxdiskadm
   a. You may need to use option 22, Change/Display the default disk layouts. Starting in VxVM 4.0, the built-in default is CDS disks. If you are repairing a disk that is in a non-CDS group, you need to change the default layout used by vxdiskadm.
   b. Use option 5, Replace a failed or removed disk. Let this utility guide you through the process.
7. Verify in the Volume Manager software that the failed mirrors are resynched, or that they were previously reconstructed by the vxrelocd process:
   # vxprint -g grpname
   # vxtask list
8. Move all the hot-relocated subdisks back to the repaired disk:
   # vxunreloc -g grpname repaireddiskname
9. If the failed drive was a quorum device, then create a new quorum device and remove the failed quorum device from the cluster configuration:
   # clq add new-quorum-did
   # clq remove old-quorum-did
Task 1: Remove a cluster node
Task 2: Add a node to the cluster
Task 3: Replace a failed Fibre JBOD drive
Task 4: Replace a failed SCSI JBOD drive
Preparation
Perform the following steps before beginning any tasks:
1. Some of the following tasks refer to Node 1 and Node 2. You need to resolve these node IDs to host names. On both nodes, type the following:
   # clinfo -n
2. Some of the following tasks distinguish between two-node clusters and three-node clusters.
3. Some of the following tasks distinguish between clusters using Solaris VM software and clusters using VxVM software.
Note: The idea here is to leave in IPMP information only for those nodes that will remain.
# clrs set -p netiflist=sc_ipmp0@1[,sc_ipmp0@2] ora-lh
# clrs set -p netiflist=sc_ipmp0@1[,sc_ipmp0@2] iws-lh
Exercise: Performing Maintenance and Recovery Procedures

5. Perform this step only if the node to be removed is physically attached to the shared storage (that is, go directly to step 6 if you are removing a third, non-storage node).
   For VxVM: Remove the node from the nodelist property of any VxVM device groups:
   # cldg remove-node -n node_to_be_removed +
   For Solaris Volume Manager: Remove the node as a mediator first, and then remove it from the disksets by using the metaset command. On any node, type the following:
   # metaset -s orads -d -f -m node_to_be_removed
   # metaset -s iwsds -d -f -m node_to_be_removed
Note: If you are going from three storage nodes down to two storage nodes, it is possible that the node you are removing was never a mediator in the first place.
   # metaset -s orads -d -f -h node_to_be_removed
   # metaset -s iwsds -d -f -h node_to_be_removed
6. Reboot the node to be removed into non-cluster mode:
   # reboot -- -x
7. Put the node into maintenance state. On a remaining node, type the following:
   # clq disable node_to_be_removed
8. If you are going from a two-node cluster to a one-node cluster (only), put the cluster into install mode:
   # cluster set -p installmode=enabled
9. Remove one quorum device. If the node in question is physically attached to the quorum device, this will be your one and only quorum device. If you are removing a non-storage (third) node, you should still remove one of the quorum devices so that you are left with the correct quorum votes.
   # clq status
   # clq remove d#
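The per-diskset metaset commands in step 5 follow a fixed pattern, so they can be generated as a dry run and reviewed before execution. A sketch; the diskset names are the lab's, and the node name is a placeholder:

```shell
node=node_to_be_removed            # placeholder hostname
cmds=$(for ds in orads iwsds; do
  echo "metaset -s $ds -d -f -m $node"   # remove as mediator first
  echo "metaset -s $ds -d -f -h $node"   # then remove as diskset host
done)
echo "$cmds"                       # review, then run by hand on a real node
```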
10. Enable remove/add access for the node to be removed:
    # claccess allow -h node_to_be_removed
11. On the node to be removed (booted in non-cluster mode):
    a. Comment out the vfstab entry for /global/web.
    b. Run the following to deconfigure the node and remove it from the cluster configuration:
       # clnode remove -F -n name_of_any_remaining_node
12. At this point, you could remove the cluster software if this were a real-world server that you no longer wanted in the cluster but whose OS you wanted to preserve. For lab purposes, you may be adding the node back into the cluster, so you might as well just leave the cluster packages installed.
2. Use the claccess command on any node already in the cluster to allow the new node to join the cluster:
   # claccess allow -h newnodename
3. Create the mount points /global/web and /oracle on the new node, if they do not already exist:
   # mkdir /oracle
   # mkdir -p /global/web
4. If you are using VxVM software in the cluster, verify that the vxio major number is the same on the new node as on existing nodes. Create the entry on the new node if it does not yet exist:
   # grep vxio /etc/name_to_major
   vxio same_number_as_other_node[s]
Note: You have to use the same number as the other node(s). If there is some other device driver entry in the file that already has that number, reassign the number for that other driver to a higher number (higher than any existing entry in the file).
5. If this is a brand new (third) node which never had any cluster framework packages installed, use the Java ES installer to install the cluster framework packages. You can run the installer in graphical mode (if your DISPLAY is set correctly) or command-line mode:
   # pkginfo -l | grep SUNWsc
   If no cluster packages exist (they may exist even on a new node if you installed a Flash archive that includes the cluster packages), then do the following:
   # cd sc32_software_location/Solaris_sparc
   # ./installer
   or
   # ./installer -nodisplay
   Choose the Sun Cluster Core packages (not the agents). Choose the Configure Later option.
6. Verify that the did major number is the same on the new node as on the current nodes:
   # grep did /etc/name_to_major
   did same_number_as_other_node[s]
   Note: You have to use the same number as the other nodes. If there is some other device driver entry in the file that already has that number, reassign the number for that other driver to a higher number (higher than any existing entry in the file).
7. Run the scinstall utility on the new node to add it to the cluster:
   # /usr/cluster/bin/scinstall
   a. Choose Option 1 from the Main Menu.
   b. Choose Option 3 from the Install Menu, Add this machine as a node in an existing cluster.
   c. From the Type of Installation Menu, choose Option 1, Typical.
   d. Provide the name of any node already in the cluster as a sponsoring node.
   e. Provide the name of the cluster that you want to join. Type cluster show -t global on an existing node if you have forgotten the name of the cluster.
   f. Answer no to avoid sccheck (to save time).
   g. Use auto-discovery for the transport adapters.
   h. Reply yes to the automatic reboot question.
   i. Examine and approve the scinstall command-line options.
8. After the new cluster node is rebooted into the cluster, delete any existing quorum devices which are also physically connected to the new node. You may have no quorum devices at all, if you are going from a one-node to a two-node cluster.
   a. List the existing quorum devices:
      # clq status
   b. Determine if any of the DIDs listed are physically connected to the new node:
      # cldev list -v did_number_listed_above
   c. Remove any such quorum device. Be careful to remove only existing quorum devices that are attached to your new node. Do not remove the existing quorum device if you are adding a third, non-storage node.
      # clq remove did_number
      # /usr/cluster/lib/sc/pgre -c pgre_scrub -d /dev/did/rdsk/d#s2
      # /usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d#s2
9. Add any additional required or desired quorum devices, and call clq reset in case you had previously enabled the installmode flag. If you are adding a third storage node, this could be the quorum device you just deleted in the previous step.
If you are adding a third node which is a non-storage node, you want to keep the original quorum device (listed in the previous step) and add a second quorum device:
# cldev list -v
# clq add d#
Note: At the time of writing there is a bug being investigated, and one of your original two nodes may still panic (specifically, when adding the same quorum device as before where a new storage node is attached). After that node reboots, the cluster is configured correctly and you can continue.
# clq reset     (only really required if you are adding back a second node, in order to reset the installmode flag)
10. If you are using VxVM and the new node is physically attached to the storage, install VxVM if it is not already installed:
    a. Remove the vxio major number from /etc/name_to_major.
    b. Install the VxVM 5.0 packages:
       # cd veritas50dir/volume_manager/pkgs
       # cp VRTSvlic.tar.gz VRTSvxvm.tar.gz VRTSvmman.tar.gz /var/tmp
       # cd /var/tmp
       # gzcat VRTSvlic.tar.gz | tar xf -
       # gzcat VRTSvxvm.tar.gz | tar xf -
       # gzcat VRTSvmman.tar.gz | tar xf -
       # pkgadd -d /var/tmp VRTSvlic VRTSvxvm VRTSvmman
       # vxinstall    (add a license and accept the default for all questions except the last one; say no when asked about a default disk group)
       # clvxvm initialize
       # reboot
11. If you are using Solaris VM, create local metadevice database replicas if they do not already exist:
    # metadb -i
    # metadb -a -f -c 3 c#t#d#s7
12. Add the new node to any existing VxVM or Solaris VM device groups to which the new node is physically attached (if you are adding a non-storage node, ignore this):
If you are using VxVM, type the following:
# cldg add-node -n newnodename oradg
# cldg add-node -n newnodename iwsdg
If you are using the Solaris VM software, type the following:
# metaset -s orads -a -h newnodename
# metaset -s iwsds -a -h newnodename
# metaset -s orads -a -m newnodename
# metaset -s iwsds -a -m newnodename
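The gzcat | tar xf - pipeline in step 10 is worth noting: tar must be told to read the archive from standard input with the trailing dash. A self-contained sketch using a scratch archive (the VRTS-style file name is a stand-in created on the spot, not a real package):

```shell
# Fall back to gunzip -c where gzcat is not present (e.g. on Linux).
if ! command -v gzcat >/dev/null 2>&1; then gzcat() { gunzip -c "$@"; }; fi

work=$(mktemp -d); cd "$work" || exit 1
mkdir pkgdir && echo hello > pkgdir/file
tar cf - pkgdir | gzip > VRTSexample.tar.gz    # build a scratch archive
rm -r pkgdir

gzcat VRTSexample.tar.gz | tar xf -            # '-' means read from stdin
cat pkgdir/file
```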
13. For VxVM only, synchronize your device groups. This will create the proper devices if you have a non-storage third node:
    # cldg sync oradg iwsdg
14. Make sure that IPMP is configured on the new node, and note the names of the IPMP groups:
    # clnode status -m
15. Make system changes on the new node to support the scalable web server application, including installing the agent:
Option A: Node that you removed in Task 1 but left software intact.
# vi /etc/vfstab
(Make sure the entry for /global/web is uncommented.)
Option B: Brand new node:
# cd SC32_loc/Solaris_sparc/Product
# cd sun_cluster_agents/Solaris_10/Packages
# pkgadd -d . SUNWschtt
# cd same_directory_with_install_script_from_day1
# gzcat iws-common.cpio.gz | cpio -ivmud
# mkdir -p /var/iws/logs
# chown webservd:webservd /var/iws/logs
# vi /etc/vfstab
(Add a line for /global/web. You can paste it from another node, but be careful about the paste procedure adding an unwanted new-line character.)
# vi /etc/hosts
(Add an entry for iws-lh.)
16. Make system changes on the new node to support running Oracle, including installing the agent. Note that if you want the failover Oracle application to run on a non-storage node, you need to change the failover file system to a global file system.
Option A: Node that you removed in Task 1 but left software intact:
(Everything should be left over here. There is no need to do anything.)
Option B: Brand new node:
# cd SC32_loc/Solaris_sparc/Product
# cd sun_cluster_agents/Solaris_10/Packages
# pkgadd -d . SUNWscor
# cd same_directory_with_install_script_from_day1
# gzcat oracli.cpio.gz | cpio -ivmud
# groupadd -g 8888 dba
# useradd -u 8888 -g dba -s /bin/ksh -c "Oracle User" -d /oracle oracle
# vi /etc/vfstab
(Add a line for /oracle; it can remain a failover file system. You can paste from another node, but be careful about the paste procedure adding an unwanted new-line character.)
# vi /etc/hosts
(Add an entry for ora-lh.)
Option C: New non-storage node (change to global file system):
1. Make basic administrative changes to the new node:
# cd SC32_loc/Solaris_sparc/Product
# cd sun_cluster_agents/Solaris_10/Packages
# pkgadd -d . SUNWscor
# cd same_directory_with_install_script_from_day1
# gzcat oracli.cpio.gz | cpio -ivmud
# groupadd -g 8888 dba
# useradd -u 8888 -g dba -s /bin/ksh -c "Oracle User" -d /oracle oracle
# vi /etc/hosts
(Add an entry for ora-lh.)
2. From any one node, bring the application offline:
3. On all nodes, modify the /etc/vfstab file so that /oracle is a global file system:
   # vi /etc/vfstab
   For the existing line for /oracle, modify the last two fields so that they are "yes global" rather than "no -". For the new node, add or paste the line as a global file system. You can paste it from another node, but be careful about the paste procedure adding an unwanted new-line character.
4. From any one node, change the AffinityOn property and resume the application again. You need to disable the dependents of ora-stor in order to disable ora-stor itself to change the property:
   # clrs disable ora-server-res
   # clrs disable ora-listener-res
   # clrs disable ora-stor
   # clrs set -p AffinityOn=false ora-stor
   # mount /oracle
   # clrg online -e ora-rg
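The vfstab edit just described can be done non-interactively with awk instead of vi. The sketch below works on a scratch copy (the metadevice path is a stand-in); never rewrite the real /etc/vfstab without a backup.

```shell
vfstab=$(mktemp)
printf '/dev/md/orads/dsk/d200 /dev/md/orads/rdsk/d200 /oracle ufs 2 no -\n' > "$vfstab"

# Flip the last two fields of the /oracle line to "yes global".
awk '$3 == "/oracle" { $6 = "yes"; $7 = "global" } { print }' "$vfstab" > "$vfstab.new"
cat "$vfstab.new"
```

Inspect the rewritten copy, then move it into place by hand on each node.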
17. Modify the NetIfList property of any LogicalHostname or SharedAddress resources in those resource groups to include the IPMP group of the new node:
    # clrs show -v iws-lh ora-lh | grep NetIfList
    # clrs set -p netiflist=existing-list,ipmp-group@new-nodeid ora-lh iws-lh
19. Verify that you can run existing resource groups on the new node:
    # clrg switch -n newnodename ora-rg
    # clrg switch -n newnodename lb-rg
    # clrg online -n newnodename iws-rg
Note: If you wanted to make the xclock-rg and/or xclockgds-rg run on a brand new node as well, you would have to copy over the custom wrapper software and configuration files before being able to add the new node to the resource groups. As this is not a formal requirement of this lab, it is left as an exercise for the reader. If you are adding back a second node that you had just deleted from the cluster without having rebuilt the whole OS, then you could just add the node to the resource groups as in step 18.
20. Deny all future nodes from adding themselves to the cluster:
# claccess deny-all
4. Identify the quorum device:
# cldev list -v
# clq status

5. If the disk in question is the quorum device, assign a different quorum device, and remove the original one:
# clq add dnew_number
# clq remove dold_number

6. Replace the failed drive. If you are simulating a failure by zeroing out Solaris partitions, make the drive look like a new disk (no partitions except s2, which covers the whole disk). On one connected node, type the following:
# luxadm remove_dev -F enclosure,position
# luxadm insert_dev enclosure,position
On the other node, type the following:
# devfsadm

7. Repair the appropriate DID device:
# cldev show -v did_number
# cldev repair did_number
# cldev show -v did_number

8. If you are running VxVM software, rescan the configuration on all connected nodes:
# vxdisk scandisks

9. If you are running VxVM software, fix the mirrors on the node which owns the device group:
# vxdiskadm

10. If you are running Solaris VM software, perform the following steps on the node which owns the device group.
a. Reformat the drive according to the way it was before the failure:
# fmthard -s /old-vtoc.txt /dev/did/rdsk/d#s2
b. Replace failed metastate database replicas:
# metadb -s disksetname
# metadb -s disksetname -d /dev/did/rdsk/d#s7
Note – This may be just /dev/did/rdsk/d#, as per the note earlier in the module.
c. Rebalance metastate database replicas:
# metaset -s disksetname -b
d. Re-enable the failed submirror component:
# metastat -s disksetname | grep metareplace

Note – The command that this procedure tells you to invoke is not exactly correct (you cannot abbreviate a DID device name, for example). However, it tells you which device is the mirror and which DID was broken.

# metareplace -s disksetname -e mirror /dev/did/rdsk/d#s0

e. Verify the status:
# metastat -s disksetname
For Solaris VM:
# metastat -s dsname
# metadb -s dsname -i

5. Replace the failed drive. If you are simulating a failure by zeroing out Solaris partitions, make the drive look like a new disk (no partitions except s2, which covers the whole disk).

6. Identify the quorum device:
# cldev list -v
# clq status

7. If the disk in question is the quorum device, assign a different quorum device, and remove the original one:
# clq add dnew_number
# clq remove dold_number
# clq status

8. If you are using Solaris Volume Manager, delete bad metadevice database replicas (so that you are only left with good ones on surviving disks):
# metadb -s disksetname -i
# metadb -s disksetname -d /dev/did/rdsk/d#s7
Note – That might be /dev/did/rdsk/d#s7 or /dev/did/rdsk/d#, as per the note earlier in the module.
# metadb -s disksetname -i

9. On each node physically connected to the disk (one at a time):
a. For Solaris VM only (do this step and then skip to c.):
1. Switch the device group to which the disk belongs to a different node (other than the one on which you are operating):
# cldg switch -n othernode devgrpname
2. If the device group has a failover file system (like /oracle) and the above command fails, then switch the associated resource group:
# clrg switch -n othernode ora-rg
b. For VxVM only: Temporarily remove the disk from VxVM control:
# vxdisk offline c#t#d#
# vxdisk rm c#t#d#
c. Use cfgadm -c unconfigure to unconfigure the disk:
# cfgadm -c unconfigure c#::dsk/c#t#d#
d. Use cfgadm -c configure to configure the disk. Ignore any notice you receive, such as cannot instrument return of fd_intr:
# cfgadm -c configure c#::dsk/c#t#d#
e. If you are using VxVM, the DMP device driver is automatically informed about the new disk. Type the following so that VxVM completely recognizes the new disk:
# vxdisk scandisks
f. Repeat steps a-e for the other connected nodes. For SVM, as you operate on each node, you need to switch the group off that node.
10. Repair the appropriate DID device (from any node connected to the disk). You will see the physical disk ID changed if you really were able to change the drive.
# cldev show -v did_number
# cldev repair did_number
# cldev show -v did_number

11. If you are running VxVM software, fix the mirrors in the VxVM software on the node which owns the device group:
# vxdiskadm
If you are fixing one of the disk groups built by the class scripts, it is a CDS disk group and you should not need to change the layout default. If you happen for some reason to have a non-CDS disk group, you need to use option 22 to change the layout default to sliced. Then you can choose to replace the drive.

12. If you are running Solaris VM software, perform the following steps on the node which owns the device group.
a. Restore the partition table of the drive:
# fmthard -s /old-vtoc.txt /dev/did/rdsk/d#s2
b. Rebalance metastate database replicas:
# metaset -s disksetname -b
c. Re-enable the failed submirror component:
# metastat -s disksetname | grep metareplace
Note – The command that this procedure tells you to invoke is not exactly correct (you cannot abbreviate a DID device name, for example). However, it tells you which device is the mirror and which DID was broken.
# metareplace -s disksetname -e mirror /dev/did/rdsk/d#s0
d. Verify the status:
# metastat -s disksetname
Exercise Summary
Discussion – Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
Module 5
- Configure data for any failover application in an HA-ZFS file system
- Understand the design and features of the QFS file system
- Configure a standard QFS file system
- Configure a shared QFS file system in the cluster using Solaris Volume Manager multiowner diskset devices
- Configure Sun Cluster resource groups in non-global zones
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics presented in this module. While they are not expected to know the answers to these questions, the answers should be of interest to them and inspire them to learn the material presented in this module.
Discussion – The following questions are relevant to understanding the content of this module:
- When was the basic technology of the UFS file system invented? Does it have any weaknesses?
- If we already have a global file system, why do we need another file system technology that supports simultaneous file access by multiple nodes?
- What is the advantage of the administrative sandbox provided by Solaris 10 zones?
Additional Resources
The following references provide additional information on the topics described in this module:
- Sun Microsystems, Inc. Sun Cluster Software Administration Guide for Solaris OS, part number 819-2971.
- Sun Microsystems, Inc. Sun StorEdge QFS Installation and Upgrade Guide, part number 819-4334.
- Sun Microsystems, Inc. Sun StorEdge QFS Configuration and Administration Guide, part number 819-4332.
ZFS as a Failover File System Only

  pool: marcpool
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        marcpool    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c2t0d0  ONLINE       0     0     0
errors: No known data errors

Now we can create file systems that occupy the pool. ZFS automatically creates mount points and mounts the file systems. You never need to make /etc/vfstab entries.
vincent:/# zfs create marcpool/myfs1
vincent:/# zfs create marcpool/myfs2
vincent:/# zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
marcpool         ...
marcpool/myfs1   ...
marcpool/myfs2   ...
vincent:/# df -k
. . .
marcpool         ...   1%   /marcpool
marcpool/myfs1   ...   1%   /marcpool/myfs1
marcpool/myfs2   ...   1%   /marcpool/myfs2
The mount points default to /poolname/fsname, but you can change them to whatever you want:
vincent:/# zfs set mountpoint=/oracle marcpool/myfs1
vincent:/# zfs set mountpoint=/shmoracle marcpool/myfs2
vincent:/# df -k | grep pool
marcpool         34836480    26  34836332   1%   /marcpool
marcpool/myfs1   34836480    24  34836332   1%   /oracle
marcpool/myfs2   34836480    24  34836332   1%   /shmoracle
ZFS Snapshots
ZFS has an instantaneous point-in-time snapshot feature. Initially, snapshots do not consume any room in the zpool. As the original (parent) copy of the file system changes, snapshots start to take up room in the zpool to record the old values of changed data at the time of the snapshot:

theo:/# zfs list
NAME             USED  AVAIL  REFER  MOUNTPOINT
orapool         2.70G  30.5G  24.5K  /orapool
orapool/oracle  2.70G  30.5G  2.70G  /oracle
theo:/# zfs snapshot orapool/oracle@thursday_1feb07
theo:/# zfs list
NAME                             USED  AVAIL  REFER  MOUNTPOINT
orapool                         2.70G  30.5G  24.5K  /orapool
orapool/oracle                  2.70G  30.5G  2.70G  /oracle
orapool/oracle@thursday_1feb07   299K      -  2.70G  -
The parent file system can be rolled back. If you want to roll back to a snapshot that is not the most recent one, you can specify the -r option to the rollback subcommand, which will automatically destroy snapshots taken after the one you are rolling back to.
theo:/# zfs rollback orapool/oracle@thursday_1feb07
Note – A single HAStoragePlus instance can refer to multiple traditional (non-ZFS) file systems, or multiple ZFS zpools, but not both. When you use the Zpools property, the values of the properties for traditional file systems (FilesystemMountPoints and AffinityOn) are ignored.
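As an illustrative sketch of the Zpools property just described (the resource group and resource names here are assumptions, not from the original text):

```shell
# clrt register SUNW.HAStoragePlus
# clrs create -g ora-rg -t SUNW.HAStoragePlus \
-p Zpools=marcpool \
ora-zfs-stor
```

When this resource goes online on a node, the whole marcpool zpool is imported there, along with every file system it contains; when the group switches, the pool is exported and re-imported on the other node.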
Introducing the Features of the Sun StorEdge QFS File System
Choice of Dual Disk Allocation Unit (DAU) Components and Single-Size DAU Components
The Disk Allocation Unit (DAU) is the minimum amount of file data guaranteed to be contiguous on the underlying device. This is a similar concept to the file system block size in UFS. QFS supports the following:
- Dual DAU components – This supports files where the first eight DAUs of the file will each be 4 kilobytes (Kbytes), and the remaining DAUs of the file are larger (defaults to 16 Kbytes for a non-shared file system, and can be set to 32 Kbytes or 64 Kbytes at file system creation time). This will optimize space when you have a lot of small files.
- Single-size DAU components – This uses a single size for the DAU, which defaults to 64 Kbytes. At file system creation time, you are allowed to specify a much larger DAU (up to 64 MB).
Standard (non-shared) QFS can be used only as a failover file system. The Sun Cluster standard global file system (PxFS) is not supported, at the time of writing of this course, with an underlying QFS file system. One result of this caveat is that you could not use QFS to store file data for a failover application that needs to fail over to a non-storage node. If you have such a configuration, then UFS and VxFS are still the only supported file system types.
QFS requires separate licensing. Starting with QFS 4.3, this is a paper-only license. Like UFS, QFS does not support shrinking a file system.
- Choose QFS file system and component device types
- Create underlying storage devices (partitioning and/or volume management)
- Create the Master Configuration File (mcf)
- Create and mount the file system
- Configure an instance of HAStoragePlus to support cluster failover
- One or more component devices of the type mm. This is a dual DAU type that supports metadata only.
- One or more component devices to hold file data only.
You can choose dual allocation types (md) or single allocation types (mr, gXXX).
You cannot mix and match dual allocation types and single allocation types in the same file system. The gXXX type allows you to group single allocation subcomponents. Later, storage for specific directory trees in the file system can be assigned to specific groups.
id – File system or underlying device name. Note that in this example the underlying disk component is a Solaris VM device (likely a mirror or soft partition of a mirror).
ordinal – Record number that must be unique in the file but otherwise has no specific meaning.
type – File system or component type.
family – Name used to associate the file system with components, typically the same as the file system name.
state – The state can be on or off. It is meaningful for initialization of tape devices in SAMFS-QFS, and would always be on or - for QFS disk components.
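The mcf example that these field descriptions refer to is not reproduced in this excerpt; as a sketch, an mcf for a failover QFS file system with separate metadata (mm) and data (mr) components might look like the following (the family name qfsora and the metadevice names are assumptions for illustration):

```shell
# id                    ordinal  type  family  state
qfsora                  100      ma    qfsora  on
/dev/md/orads/dsk/d110  101      mm    qfsora  on
/dev/md/orads/dsk/d120  102      mr    qfsora  on
```

The columns map one-to-one onto the five fields described above.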
Note – The file system type is always samfs, even when using QFS without the archiving manager packages. The example here uses no in the mount-at-boot column, which will be required for a failover file system in the cluster. Outside of the cluster, you would likely use yes.

You mount the file system just like any other that is listed in the vfstab:
# mount /oracle
- Make sure you have no in the mount-at-boot field in the /etc/vfstab entry.
- Copy the mcf file to all other connected nodes and run samd config on each connected node.
- Add an HAStoragePlus instance to control QFS failover.
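The HAStoragePlus instance itself could be created as in the following sketch (the resource group and resource names, and the use of /bin/true as the check command, are assumptions for illustration; the course notes only that the FilesystemCheckCommand property is what distinguishes a QFS resource from a UFS or VxFS one):

```shell
# clrt register SUNW.HAStoragePlus
# clrs create -g ora-rg -t SUNW.HAStoragePlus \
-p FilesystemMountPoints=/oracle \
-p FilesystemCheckCommand=/bin/true \
qfs-ora-stor
```

Overriding the check command prevents the resource methods from trying to run a UFS-style fsck against the QFS device.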
The only difference between this resource and an HAStoragePlus instance controlling a standard UFS or VxFS failover file system is the FilesystemCheckCommand property value.
Configuring a Shared QFS File System in the Cluster (for Use by Oracle RAC Only)
The Shared QFS file system allows multiple servers to read and write file system file data directly to the storage medium. Shared QFS does not support the following:
- Block and character device files and named pipes
- Mandatory file locking (advisory file locking is supported; this is much like NFS)
One server at a time functions as the metadata server; that is, while file data can be simultaneously accessed from multiple servers, metadata must be accessed from only one server at a time for each particular file system. The data flow for Shared QFS in the Sun Cluster environment is illustrated by the diagram in Figure 5-1.
Figure 5-1
Note that the shared QFS file system is an alternative to the standard Sun Cluster global file system. Its advantage is the simultaneous file data access. Shared QFS can be used both inside and outside of the Sun Cluster environment. Inside the Sun Cluster, the following apply:
- The Shared QFS file system is accessed only by nodes in a single cluster physically connected to the data. It cannot be simultaneously accessed from other servers outside the cluster, even if they are physically attached to the same storage, because of data fencing.
- A resource of type SUNW.qfs (provided with the QFS software) must be configured to drive failover of the metadata server.
- The only choice for underlying component volume management (if you need software mirroring for the components) is the Solaris Volume Manager multiowner disksets. There is no support for any VxVM volume management.
- The only supported application for Shared QFS is Oracle RAC.
Note – You could get Shared QFS running in the cluster without any reference to Oracle RAC, if you were not using any volume manager. At this time, it is possible that other multimaster or scalable applications besides Oracle RAC could run on Shared QFS (given the limitations previously listed in this section). However, only Oracle RAC has been verified.
Shared QFS File Systems on Solaris Volume Manager Multiowner Diskset Devices
Beginning with QFS 4.4, shared QFS file systems in the Sun Cluster environment can use Solaris VM multiowner diskset devices as underlying components. Prior to QFS 4.4, no volume manager was supported underneath shared QFS, and it was therefore only suitable in the cluster with hardware RAID devices. Solaris Volume Manager multiowner disksets require the following minimum software:
- Solaris 9 OS 9/04 (update 7) or later (including all versions of Solaris 10 OS)
- Sun Cluster 3.1 9/04 (update 3) or later
At this time, the Solaris Volume Manager multiowner diskset feature does have a dependency on the Oracle RAC framework (the underlying RAC-specific cluster membership monitor). Thus, you do have to install and configure the RAC framework (which includes the ORCLudlm package from Oracle) in order to use Solaris Volume Manager with Shared QFS.
- On Solaris 9 OS, this is optional. If you do not create it, the RAC framework daemons will automatically still be launched by boot scripts.
- On Solaris 10 OS, this is required. Only these resources will launch the RAC framework daemons.
The rac-framework-rg resources do all the launching of the daemons in the BOOT and INIT methods, as long as the group is managed. Trying to stop and start the resources will do nothing. This is a protective mechanism to guarantee that the RAC framework is enabled on behalf of its dependents: Solaris VM multiowner disksets and Oracle RAC itself.
The following is an example of a master configuration file that specifies two shared QFS file systems suitable for Oracle RAC (one for the binaries and one for the data):

qfs1                         10  ma  qfs1  on  shared
/dev/md/orashareds/dsk/d100  11  mm  qfs1  on
/dev/md/orashareds/dsk/d200  12  mr  qfs1  on
qfs2                         20  ma  qfs2  on  shared
/dev/md/orashareds/dsk/d300  21  mm  qfs2  on
/dev/md/orashareds/dsk/d400  22  mr  qfs2  on

In this example, all of our underlying devices are mirrored volumes in the same Solaris Volume Manager multiowner diskset.
- A server name (as returned by hostname)
- An IP address or resolvable name used to reach that server. In the cluster, use the cluster private hostname so that metadata traffic goes over the cluster transport.
- A server priority (non-negative number)
- An unused field
- The word server to identify the initial metadata server
The following is an example of a hosts.fsname file suitable for a shared QFS configuration in the cluster:

# cat hosts.qfs1
vincent  clusternode1-priv  1  -  server
theo     clusternode2-priv  2  -

Note that only one node (it should not matter which one) is configured as the initial metadata server. The other node is still a potential metadata server, and failover in the cluster will be controlled by a Sun Cluster resource that will be discussed later in the module.
- The master configuration file and hosts.fsname file need to be manually replicated to each node that will be mounting the file system.
- The sammkfs -S fsname command is called from any one node.
- The /etc/vfstab entry on each node always has no in the mount-at-boot column and shared in the options column, like the following:

qfs1  -  /oracle   samfs  -  no  shared
qfs2  -  /oradata  samfs  -  no  shared
On the initial mount, the file system needs to be mounted first from the initial metadata server (the one identified with the server flag), and then the mount command needs to be issued from the other nodes as well. A boot script automatically mounts shared QFS file systems thereafter.
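Sketching the initial mount order just described, with the host names from the hosts.qfs1 example above:

```shell
vincent:/# mount /oracle    # first, on the initial metadata server
theo:/# mount /oracle       # then, on each additional node
```

Mounting from a non-metadata-server node first would hang, since the client cannot reach a metadata server that is not yet serving the file system.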
- You can give zones on different nodes the same or different names.
- The booting of these zones is not under control of the cluster. You are likely to want to set the autoboot property of the zone to true.
- You need to install the data service agents that you want in the non-global zones.
- You could install them in global zones, using pkgadd without the -G option, and just let the packages get inherited into current and future non-global zones.
- Alternatively, you could install the agents only in the non-global zones in which they are needed.
Note – You must type the clrt register command in the global zone, and by default, it looks for the resource-type registration (RTR) files only in the global zone. If the agent is installed only in a non-global zone, you can use the -f option of clrt register to refer to an RTR file in the non-global zone's root path.
There are two ways to specify a zone name in the place of a node name:
- The old CLI supports only a syntax of -h nodename:zonename to refer to a zone (in a node list, as a switch target, and so on).
- The new CLI supports both the -n nodename:zonename syntax and the alternate syntax -n nodename[,nodename...] -z zonename. When this latter syntax is used, it implies that you want to specify the same zone name for each node name listed with the -n option.
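For instance, the two new-CLI forms above are equivalent (the node and zone names here are borrowed from the Apache example later in this module, as an illustrative sketch):

```shell
# clrg create -n pecan:frozone,grape:frozone apache-rg    # per-node zone syntax
# clrg create -n pecan,grape -z frozone apache-rg         # same zone name on every node
```

The -z form is simply shorthand when every node in the list hosts a zone of the same name.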
LogicalHostname and SharedAddress resources work within a resource group that is mastered on a non-global zone, as expected. However, while they will appear to only ever belong inside the non-global zone, the underlying implementation of the methods will actually configure them in the global zone and then move them to the appropriate non-global zone. The only implication for their configuration is that the IP addresses in question must be resolvable in the global zone, rather than in the non-global zone in which the addresses will actually end up. Obviously, in order to have your application happy, you are likely to also need
to make the IP address resolvable in the non-global zone. But the dependency for proper operation of the cluster IP resources themselves is just for address resolution in the global zone. For failover applications, it is optional to configure dedicated zone IPs using zonecfg. For scalable applications, it is required that each zone in question have its own dedicated public network IP address configured using zonecfg.
Traditional (non-ZFS) HAStoragePlus instances that represent global or failover file systems are slightly strange. The file system needs to be mounted on the appropriate physical node (or nodes, in the case of global), and then a loopback file system mount is made in the non-global zone. The non-global zone mount point does not need to be the same as the physical node mount point. HAStoragePlus supports a new syntax of global-zone-mt-pt:non-global-zone-mt-pt as the value of the FilesystemMountPoints property if you want to specify a different mount point in the non-global zone. Without the new syntax, the methods of HAStoragePlus will assume that the global and non-global zone mount points are the same. ZFS is a little different in that the file system software is zone-aware and can be made available exclusively in a non-global zone when required. This is discussed further later in this document.
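A sketch of the two-mount-point syntax just described (the resource name, group name, and the /web path inside the zone are assumptions for illustration):

```shell
# clrs create -g apache-rg -t SUNW.HAStoragePlus \
-p FilesystemMountPoints=/global/web:/web \
-p AffinityOn=false \
web-stor
```

Here /global/web is mounted in the global zone as usual, and the loopback mount appears at /web inside the non-global zone.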
- Viewing status or configuration of almost anything
- Changing state or switching a resource group or individual resource, assuming the group's node list contains that specific non-global zone
- Creating an application resource (but not a LogicalHostname, SharedAddress, or HAStoragePlus) in a group whose node list contains that specific non-global zone. LogicalHostname, SharedAddress, and HAStoragePlus resources must still be created by commands typed in the global zone.
Example
The following example shows the creation of a full scalable Apache configuration running only in a pair of non-global zones on different nodes.
Resource Group Manager Support for Non-Global Zones

The following steps have been taken prior to this example:
- Separate non-global zones have been created and booted on the two cluster nodes. In this example, both zones have the name frozone.
- A standard global file system /global/web has been created in the global zone. The HAStoragePlus resource that we will demonstrate in this example will automatically perform the loopback mounts in the non-global zones.
- The IP entry for food-web is resolvable both on the physical nodes, so that you can create the IP resource, and in the non-global zones, so that the application can run correctly.
- Each non-global zone has its own dedicated non-failover IP address on the public network. This is a requirement for scalable applications in zones.
# clrg delete apache-sa-rg
# clrg create -n pecan,grape -z frozone apache-sa-rg
# clrg status

=== Cluster Resource Groups ===

Group Name      Node Name       Suspended
----------      ---------       ---------
apache-sa-rg    pecan:frozone   No
                grape:frozone   No
Create a shared address resource (this must be typed in a global zone):
# clrssa create -g apache-sa-rg food-web
Manage and online the resource group to demonstrate that the IP address actually goes online in the non-global zone (in this case, on pecan:frozone, as it is the first node in the group's node list). This command could be run either from a global or non-global zone. In the example we are still running from the global zone so we can see with ifconfig how the IP address is placed in the non-global zone:

# clrg online -M apache-sa-rg
# clrs status

=== Cluster Resources ===

Resource Name   Node Name       State     Status Message
-------------   ---------       -----     --------------
food-web        pecan:frozone   Online    Online - SharedAddress online.
                grape:frozone   Offline   Offline

# ifconfig -a
. . .
qfe2:1: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 3
        zone frozone
        inet 192.168.1.52 netmask ffffff00 broadcast 192.168.1.255
. . .
A global file system mount point /global/web has already been created and provisioned in the global zones only. The only thing already provisioned in the non-global zones is the mount point. In this example we will use the same mount points for the loopback in the non-global zones. You do not put vfstab entries in the non-global zones.
The following commands are entered in a global zone:
# clrt register HAStoragePlus
# clrg create -p Desired_primaries=2 -p Maximum_primaries=2 \
-n pecan,grape -z frozone apache-rg
# clrg status

=== Cluster Resource Groups ===

Group Name      Node Name       Suspended   Status
----------      ---------       ---------   ------
apache-sa-rg    pecan:frozone   No          Online
                grape:frozone   No          Offline
apache-rg       pecan:frozone   No          Unmanaged
                grape:frozone   No          Unmanaged
# clrs create -g apache-rg -t HAStoragePlus \ -p FilesystemMountpoints=/global/web \ -p AffinityOn=false \ web-stor # clrg online -M apache-rg
The loopback file system mount will automatically be performed in the non-global zones as the HAStoragePlus resource goes online.
# clrt register apache
# clrs create -g apache-rg -t apache \
-p Bin_dir=/global/web/bin \
-p SCALABLE=true \
-p Resource_dependencies=food-web,web-stor \
apache-res
The apache resource will go online on both zones immediately, since new resources are now created in an enabled state and the scalable resource group is already online.
- Make up a name for each particular zone's private IP. This name is not configured in any external name service, including the hosts file; rather, it is stored and resolved through the CCR, just like the node private host names.
- Allow the cluster to automatically select an IP address corresponding to your choice of name. The cluster will automatically choose an appropriate IP address in the correct private network range for the per-zone private network IP address.
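A sketch of setting the property on each zone (the host name priv-frozone-p matches the clnode show output later in this section; priv-frozone-g is an assumption following the same pattern):

```shell
# clnode set -p zprivatehostname=priv-frozone-p pecan:frozone
# clnode set -p zprivatehostname=priv-frozone-g grape:frozone
```

This mirrors the unset syntax shown at the end of this section, where the value is nulled out instead.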
Each command will automatically configure a new clprivnet virtual IP on the respective node. For example:

# zlogin frozone
[Connected to zone 'frozone' pts/6]
Last login: Mon Jun 19 11:34:14 on console
Sun Microsystems Inc.   SunOS 5.10      Generic January 2005
frozone-p:/# ifconfig -a
lo0:1: flags=20010008c9<UP,LOOPBACK,RUNNING,NOARP,MULTICAST,IPv4,VIRTUAL> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
qfe1:2: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 192.168.1.231 netmask ffffff00 broadcast 192.168.1.255
qfe2:1: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 3
        inet 192.168.1.52 netmask ffffff00 broadcast 192.168.1.255
clprivnet0:1: flags=1009843<UP,BROADCAST,RUNNING,MULTICAST,MULTI_BCAST,PRIVATE,IPv4> mtu 1500 index 6
        inet 172.16.4.66 netmask fffffe00 broadcast 172.16.5.255

Current values of the zone private host name can be displayed using clnode show. In the example we display information for one node, which would include zones on that node. Omitting a node name (or using the default wildcard +) would show all nodes.

# clnode show pecan

=== Cluster Nodes ===

Node Name:                      pecan
  Node ID:                      2
  Enabled:                      yes
  privatehostname:              clusternode1-priv
  reboot_on_path_failure:       disabled
  globalzoneshares:             1
  defaultpsetmin:               1
  quorum_vote:
  quorum_defaultvote:
  quorum_resv_key:
  Transport Adapter List:
  Node Zones:

--- Zones on node pecan ---

  Zone Name:                    pecan:frozone
  zprivatehostname:             priv-frozone-p
Once you set the zprivatehostname property for a zone, changing the name by repeating the command with a different name just changes the name and leaves the same IP address. There is currently no mechanism for configuring more than one private IP address in a zone. If you want to unconfigure the zone's private IP address, you can just null out the value of zprivatehostname:
# clnode set -p zprivatehostname='' grape:frozone
# clnode set -p zprivatehostname='' pecan:frozone
- Task 1 – Install the QFS software on your cluster nodes
- Task 2 – Add volumes on which to build a failover QFS file system (VxVM or SVM)
- Task 3 – Prepare a QFS file system configuration
- Task 4 – Create, mount, and switch the QFS file system
- Task 5 – Migrate your Oracle application data to the QFS file system
- Task 6 – Rearrange your mount points so that /oracle is mounted from the new QFS
- Task 7 – Reconfigure your cluster resources to use the new file system
Exercise 1: Running a Standard Failover Service on QFS (Optional)

1. Install the QFS packages:
# cd qfs_4.5_software_location/2.10
# pkgadd -d . -G SUNWqfsr SUNWqfsu
Answer yes to all the questions asked by pkgadd.
Task 2a Adding A Volume on Which to Build a Failover QFS File System (With VxVM)
Perform the following steps only on the node that is currently the owner of the oradg disk group. You can determine which node that is by running cldg status on any node.

1. Select two disks from shared storage (one from one array and one from the other array) for a new mirrored volume. Make sure you do not use any disks already in use in existing device groups. Note the logical device names (referred to as cAtAdA and cBtBdB in step 2). The following example checks against all volume managers that you could possibly be using, including ZFS:
# vxdisk -o alldgs list
# metaset
# zpool status        (run this one on all nodes)
# cldev list -v
2. Add the disks to your oradg disk group:
# /etc/vx/bin/vxdisksetup -i cAtAdA
# /etc/vx/bin/vxdisksetup -i cBtBdB
# vxdg -g oradg adddisk qfsd1=cAtAdA qfsd2=cBtBdB
3. Create a mirrored volume to hold the failover QFS file system. You mirror in the background so you can proceed without having to wait:
# vxassist -g oradg make qfsvol 6g qfsd1
# vxassist -g oradg mirror qfsvol qfsd2 &
4.
Synchronize the device group so that correct cluster global devices get created for the new volume. # cldg sync oradg
Task 2b Adding A Volume on Which to Build a Failover QFS File System (With SVM)
Perform the following steps only on the node that is currently the owner of the orads diskset. You can determine which node that is by running cldg status on any node.

1. Select two disks from shared storage (one from one array and one from the other array) for a new mirrored volume. Make sure you do not use any disks already in use in existing device groups. Note the DID device names (referred to as dA and dB in step 2). The following example checks against all volume managers that you could possibly be using, including ZFS:
# vxdisk -o alldgs list
# metaset
# zpool status      (run this one on all nodes)
# cldev list -v

2. Add the disks to your orads diskset:
# metaset -s orads -a /dev/did/rdsk/dA
# metaset -s orads -a /dev/did/rdsk/dB

3. Create a volume (soft partition on top of mirrored disks) to hold the failover QFS file system:
# metainit -s orads d21 1 1 /dev/did/rdsk/dAs0
# metainit -s orads d22 1 1 /dev/did/rdsk/dBs0
# metainit -s orads d20 -m d21
# metattach -s orads d20 d22
# metainit -s orads d200 -p d20 6g
For SVM:
qfsora                      100  ms  qfsora  on
/dev/md/orads/dsk/d200      101  md  qfsora  on
Note - In this example, just for ease of doing the lab, you are creating an ms type of file system, which has the metadata and file data on the same device.

2. Verify the configuration that you have just entered:
# /opt/SUNWsamfs/sbin/sam-fsd
You should see a bunch of trace output if everything looks OK. There will be configuration error messages if something is wrong with your configuration.

3. Notify the QFS daemon of your new configuration:
# /opt/SUNWsamfs/sbin/samd config

4. Make a mount point for your new file system:
# mkdir /oranew

5. Add an entry in /etc/vfstab for your new file system:
# vi /etc/vfstab
qfsora  -  /oranew  samfs  -  no  sync_meta=1
On a different storage node, relocate the device group (it will be dragged across by the ora-rg resource group) and verify that you can mount and unmount the new file system:
# clrg switch -n new-node ora-rg
# mount /oranew
# df -k
# umount /oranew
Task 5 Migrating Your Oracle Application Data to the QFS File System
Perform the following steps on the node that owns the oradg or orads device group:

1. If you have any non-storage nodes (three-node cluster in a pair+1 configuration), make sure the Nodelist property for the ora-rg resource group contains only the storage nodes:
# clrg remove-node -n any_non_storage_node ora-rg

2. Halt your Oracle resources and migrate the data:
# clrs disable ora-server-res ora-listener-res
# mount /oranew
# chown oracle:dba /oranew
# cd /oracle
# find . -print | cpio -pdmu /oranew
# cd /
# umount /oranew

3. Disable the old storage resource:
# clrs disable ora-stor
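The find | cpio -pdmu idiom used to migrate the data copies a tree in pass-through mode while creating directories and preserving modification times. A minimal sketch you can try against scratch directories (the paths here are placeholders, not the lab's /oracle):

```shell
#!/bin/sh
# Demonstrate the find | cpio -pdmu copy idiom on scratch directories.
SRC=/tmp/cpio-src.$$
DST=/tmp/cpio-dst.$$
mkdir -p "$SRC/data" "$DST"
echo "control file" > "$SRC/data/control.dbf"

# -p pass-through mode, -d create directories as needed,
# -m preserve modification times, -u overwrite unconditionally
cd "$SRC"
find . -print | cpio -pdmu "$DST" 2>/dev/null
cd /

cat "$DST/data/control.dbf"
rm -rf "$SRC" "$DST"
```

Because cpio runs in pass-through mode, no intermediate archive file is created, which is why the lab can copy directly from /oracle into the mounted /oranew.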
Task 6 Rearranging Your Mount Points So That /oracle Is Mounted From the New QFS
Perform the following on all the storage nodes. Edit /etc/vfstab:
Change the old /oracle mount point to /ora-old. Change the /oranew mount point to /oracle.
Task 7 Reconfiguring Your Cluster Resources to Use the New File System
Perform the following steps on any one node in the cluster:

1. Make a new cluster resource for the QFS file system and set the application dependencies (ignore validation errors, as usual, from the node where the new QFS failover file system is not mounted):
# clrs set -p Resource_dependencies=qfsora-stor ora-server-res
# clrs set -p Resource_dependencies=qfsora-stor ora-listener-res

2. Remove the old storage resource:
# clrs delete ora-stor

3. Enable all of the resources:
# clrs enable -g ora-rg +

4. Verify that Oracle is running properly with its data on the new QFS file system, and that you can switch over and fail over the ora-rg resource group. If you see error messages about busy Solaris VM volumes, it may be because any mirror resynchronization in progress has to be restarted when you do a switchover. This can be ignored.
Task 1 Install the QFS software on your cluster nodes if needed
Task 2 Install the RAC framework packages in order to support SVM multiowner disksets
Task 3 Install the Oracle distributed lock manager
Task 4 Create and enable the RAC framework resource group
Task 5 Add volumes on which to build a shared QFS file system
Task 6 Prepare a shared QFS file system configuration
Task 7 Create and mount the file system
Task 8 Mount the file system on other node(s)
Task 9 Configure the metadata server as a failover resource, and test failover
Task 1 Installing the QFS Software on Your Cluster Nodes (If Not Already Done in the QFS Failover Lab)
Perform the following steps on all nodes that are physically attached to your data storage. If you have a third node that is a non-storage node, it cannot host any QFS file systems.

1. Install the QFS packages:
# cd qfs_4.5_software_location/2.10
# pkgadd -d . -G SUNWqfsr SUNWqfsu
Answer yes to the questions asked by pkgadd.
Task 2 Installing RAC Framework Packages for Oracle RAC With SVM Multiowner Disksets
Perform the following steps on all cluster nodes connected to storage:

1. Install the appropriate packages from the data service agents CD:
# cd sc32_location/Solaris_sparc/Product/sun_cluster_agents
# cd Solaris_10/Packages
# pkgadd -d . SUNWscucm SUNWudlm SUNWudlmr SUNWscmd

2. List out the local metadbs:
# metadb

3. Add metadbs on the root drive if they do not yet exist:
# metadb -a -f -c 3 c#t#d#s7
Exercise 2: Configuring a Shared QFS File System (Optional)

The pkgadd command will prompt you for the group that is to act as the DBA of the database. Respond by typing dba:
Please enter the group which should be able to act as the
DBA of the database (dba):  [?] dba
If all goes well, you will see a message on both consoles:
Unix DLM version(2) and SUN Unix DLM Library Version (1):compatible
Create a new multiowner diskset and add the disks:
# metaset -s orashareds -M -a -h node1 node2
# metaset -s orashareds -a /dev/did/rdsk/dA
# metaset -s orashareds -a /dev/did/rdsk/dB
3. Create a volume (soft partition on top of a mirrored volume) to hold the shared QFS file system metadata:
# metainit -s orashareds d11 1 1 /dev/did/rdsk/dAs0
# metainit -s orashareds d10 -m d11
# metainit -s orashareds d100 -p d10 1g

4. Create a volume (soft partition on top of a mirrored volume) to hold the shared QFS file system file data:
# metainit -s orashareds d21 1 1 /dev/did/rdsk/dBs0
# metainit -s orashareds d20 -m d21
# metainit -s orashareds d200 -p d20 6g
Note - The metadata and file data are on separate spindles. Each is currently a soft partition of a mirrored volume with only one submirror. At the end of the lab, you could optionally choose additional disks and complete the mirroring of your data.
1. Create an entry in the QFS configuration file to configure the shared file system (add lines to your file if you have done the failover QFS exercise):
# cd /etc/opt/SUNWsamfs
# vi mcf
qfsorashared                     10  ma  qfsorashared  on  shared
/dev/md/orashareds/dsk/d100      11  mm  qfsorashared  on
/dev/md/orashareds/dsk/d200      12  mr  qfsorashared  on

2. Create a parameter file for the shared QFS file system that is suitable for Oracle RAC:
# vi /etc/opt/SUNWsamfs/samfs.cmd
fs = qfsorashared
stripe = 1
sync_meta = 1
mh_write
qwrite
nstreams = 1024
rdlease = 600

3. Verify the configuration that you have just entered:
# /opt/SUNWsamfs/sbin/sam-fsd
You should see a series of trace output if everything looks OK. You will see configuration error messages if something is wrong with your configuration.

4. Notify the QFS daemon of your new configuration:
# /opt/SUNWsamfs/sbin/samd config

5. Print out the association between node names and private network host names:
# clnode show -p privatehostname

6. Use the above output to create the shared QFS hosts file, associating shared QFS hosts with the private network host names. This file should be identical on all nodes. List the first node as the only server (this is the metadata server that will fail over when you add its failover agent):
# vi /etc/opt/SUNWsamfs/hosts.qfsorashared
nodename1   clusternode1-priv   1   -   server
nodename2   clusternode2-priv   2

7. Make a mount point for your new file system:
# mkdir /orashared
8. Add an entry to /etc/vfstab for your new file system:
# vi /etc/vfstab
qfsorashared -
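A complete shared QFS vfstab entry typically follows the same field layout as the failover example earlier, with the shared mount option; the fragment below is a sketch with assumed mount point and options, not a verbatim line from the lab:

```
qfsorashared  -  /orashared  samfs  -  no  shared
```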
Task 9 Configuring the Metadata Server as a Failover Resource, and Testing Failover
1. From any one cluster node, create and enable a resource group containing a single resource of type SUNW.qfs for metadata failover:
# clrt register SUNW.qfs
# clrg create -n node1,node2 qfsmeta-rg
# clrs create -g qfsmeta-rg -t SUNW.qfs \
-p QFSFileSystem=/orashared qfsmeta-res
# clrg online -M qfsmeta-rg

2. Verify that you can manually switch the metadata server. Note the messages on the node consoles:
# clrg switch -n othernode qfsmeta-rg

3. Halt the node that is the current metadata server. When it reboots, verify that the shared file system is automatically mounted.
Task 1 Configuring and Installing the Zones
Task 2 Migrating Oracle to Run in the Zone
# zonecfg -z orazone
orazone: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:orazone> create
zonecfg:orazone> set zonepath=/orazone
zonecfg:orazone> set autoboot=true
zonecfg:orazone> commit
zonecfg:orazone> exit

2. Install the zone:
# zoneadm -z orazone install
Preparing to install zone <orazone>.
Creating list of files to copy from the global zone.
.
Advanced Features (ZFS, QFS and Zones)
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
The file </orazone/root/var/sadm/system/logs/install_log> contains a log of the zone installation.

3. Boot the zone:
# zoneadm -z orazone boot

4. Connect to the zone console and configure the zone. It will look similar to a standard Solaris OS that is booting after a sys-unconfig:
# zlogin -C orazone
[Connected to zone 'orazone' console]
Wait until the SMF services are all loaded, and navigate through the configuration screens. Get your terminal type correct, or you may have trouble with the rest of the configuration screens. The choice for CDE Terminal Emulator seems to work best for ctelnet and cconsole windows, even in the Java Desktop environment. When you have finished system configuration of the zone, it will reboot automatically. You can stay connected to the zone console.

5. Log in and perform other zone post-installation steps:
orazone console login: root
Password: ***
# vi /etc/default/login
Comment out the CONSOLE=/dev/console line.

6. Add an oracle user to the zone:
# groupadd -g 8888 dba
# useradd -u 8888 -g dba -s /bin/ksh -d /oracle \
-c "Oracle User" oracle

7. Add an entry for ora-lh to the /etc/hosts file of the zone. Use the same IP address as ora-lh has in the node's hosts file in the global zone.

8. Make an /oracle directory in the zone:
# mkdir /oracle

9. Disconnect from the zone console using ~.
3. If your old /oracle is a global file system, because the application had been configured on a non-storage node, unmount it and delete it from the vfstab file:
# umount /oracle
# vi /etc/vfstab      (comment out the line for /oracle)
7. Disable your old (non-ZFS) storage resource, and then set the mount point of the ZFS file system to /oracle:
# clrs disable ora-stor      (this will be qfsora-stor if you already did the failover QFS exercise)
# zfs set mountpoint=/oracle orapool/oracle
# df -k
8. Reset your resources using the ZFS storage. If your resource group is already running in a non-global zone, you will now see the ZFS file system only in the non-global zone:
# clrs create -g ora-rg -t HAStoragePlus \
-p Zpools=orapool ora-zfs-stor
# clrs set -p Resource_dependencies=ora-zfs-stor \
ora-server-res
# clrs set -p Resource_dependencies=ora-zfs-stor \
ora-listener-res
# clrs delete ora-stor      (this will be qfsora-stor if you already did the failover QFS exercise)
# clrs enable -g ora-rg +
# clrs status
9. Observe the switchover and failover behavior of the Oracle application, which will now include the zpool containing your ZFS file system.
10. On the node or non-global zone where Oracle is running, take a snapshot of the data:
# zfs snapshot orapool/oracle@thedatathisminute

11. Make some modifications to the Oracle data:
# ksh
# cd /oracli
# . ./clienv
# sqlplus SYS@MYORA as sysdba
Enter password: cangetin
SQL> select * from mytable;
SQL> insert into mytable values
SQL> insert into mytable values
SQL> insert into mytable values
SQL> insert into mytable values
SQL> commit;
SQL> select * from mytable;
SQL> quit
12. Switch your Oracle application to the other node (or zone), just to prove that the snapshots fail over along with everything else.

13. Verify your new data, and then restore your snapshot on the node where Oracle is running. Do you get your old data back?
# ksh
# cd /oracli
# . ./clienv
Exercise 4: Migrating Your Oracle Data to ZFS (Optional)

# sqlplus SYS@MYORA as sysdba
Enter password: cangetin
SQL> select * from mytable;
SQL> quit
# clrs disable ora-server-res ora-listener-res
# zfs rollback orapool/oracle@thedatathisminute
# clrs enable ora-server-res ora-listener-res
# ksh
# cd /oracli
# . ./clienv
# sqlplus SYS@MYORA as sysdba
Enter password: cangetin
SQL> select * from mytable;
SQL> quit
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
Module 6
Best Practices
Objectives
Upon completion of this module, you should be able to do the following:
Define and implement best practices for Internet Protocol multipathing (IPMP)
Define and implement best practices for shared storage file systems
Define and implement best practices for boot disk encapsulation and mirroring
Define and implement best practices for quorum devices
Define and implement best practices for campus clusters
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics described in this module. Although they are not expected to know the answers to these questions, the answers should interest them and inspire them to learn the material.
Discussion The following questions are relevant to understanding the content of this module:
Why do you need the HAStoragePlus resource for global data if global file systems are mounted at boot time?
What is a failover file system?
What are the best ifconfig command options to use for IPMP?
Can you unencapsulate a boot mirror?
Additional Resources
Additional resources The following references provide additional information on the topics described in this module:
Gene Trantham and Ben Howard. "Towards a Reference Configuration for VxVM Managed Boot Disks." Sun BluePrints Online, August 2000.
Sun Microsystems, Inc. Sun Cluster System Administration Guide for Solaris OS, part number 819-0580.
Sun Microsystems, Inc. System Administration Guide: IP Services (from the Solaris 10 Collection), part number 816-4554.
IPMP supports active IP addresses on all interfaces in the group, while a NAFO group can have only one live interface at a time.
IPMP lets you choose between an active-standby group and an active-active group. In an active-active group, the Sun Cluster software automatically load balances cluster service IP addresses across the interfaces of a group.
The default failover time for IPMP is much faster than for NAFO: it starts at 10 seconds for IPMP and adjusts automatically, while NAFO takes about 45 seconds.
IPMP supports automatic repair detection of interfaces and failback of IP addresses, while NAFO does not.
IPMP is part of the base Solaris OS installation, so the Sun Cluster software does not need to repeat development to support network adapter failover.
IPMP supports IPv6. The Sun Cluster software supports IPv6 on the public network and IPv6 logical hostnames starting at Sun Cluster 3.1 9/04 (Update 3).
This section proposes some best practices for conguring and using IPMP, including the following:
Achieving the best hardware redundancy
Using test addresses or link state testing in the Solaris 10 OS
Placing test addresses on the virtual interfaces, if possible
Avoiding standby interfaces to achieve better load balancing
Ensuring you use the failback=true parameter with load balancing
Using the deprecated flag on all test interfaces
Controlling test targets
Note - The bug concerns failure of certain RPC applications when the deprecated flag is on the physical interface. Since using the deprecated flag with test addresses is recommended as a best practice, you want to make sure the test addresses are not on the physical interface.

proto192# cat /etc/hostname.qfe1
proto192 group therapy netmask + broadcast + up
addif proto192-qfe1-test -failover deprecated netmask + broadcast + up

The second interface in a group does not normally have an additional physical node IP associated with it. Unfortunately, there is no way to make a virtual interface without a physical interface. As an alternative, it is possible to have yet another placeholder IP for the second member of the group. An example is:

proto192# cat /etc/hostname.qfe2
proto192-phys-placeholder group therapy netmask + broadcast + up
addif proto192-qfe2-test -failover deprecated netmask + broadcast + up

This configuration requires that you allocate an additional subnet IP per node. You already need two extra subnet IPs for the test IPs. You can use the test IP alone and not use the placeholder, in the hope that you do not run into bug #4710499. For example:

proto192# cat /etc/hostname.qfe2
proto192-qfe2-test group therapy -failover deprecated \
netmask + broadcast + up
IPMP Best Practices

Without the standby keyword on either interface of the group, the Sun Cluster software automatically load balances the LogicalHostname and SharedAddress resource IP addresses across the members of the IPMP group. This achieves a measure of inbound load balancing, as shown in the following example.
proto192# ifconfig -a
. . .
qfe1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 172.20.4.192 netmask ffffff00 broadcast 172.20.4.255
        groupname therapy
        ether 8:0:20:f1:2b:d
qfe1:1: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 2
        inet 172.20.4.194 netmask ffffff00 broadcast 172.20.4.255
qfe1:2: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 2
        inet 172.20.4.182 netmask ffffff00 broadcast 172.20.4.255
qfe2: flags=9040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4,NOFAILOVER> mtu 1500 index 3
        inet 172.20.4.195 netmask ffffff00 broadcast 172.20.4.255
        groupname therapy
        ether 8:0:20:f1:2b:e
qfe2:1: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 3
        inet 172.20.4.183 netmask ffffff00 broadcast 172.20.4.255
qfe2:2: flags=1040843<UP,BROADCAST,RUNNING,MULTICAST,DEPRECATED,IPv4> mtu 1500 index 3
        inet 172.20.4.184 netmask ffffff00 broadcast 172.20.4.255
It is a best practice to not use the standby keyword on either interface of a two-member IPMP group.
having at least two routers. Some vendors provide true HA cluster-like solutions for routing as well.
Manually adding routes to hosts, solely so that more routers are added to the routing table and can then be chosen as targets.
For example, if you wanted to use 192.168.1.39 and 192.168.1.5 as the targets, you could run these commands:
# route add -host 192.168.1.39 192.168.1.39 -static
# route add -host 192.168.1.5 192.168.1.5 -static
You can put these commands in a boot script. The System Administration Guide: IP Services from docs.sun.com, listed in the resources section at the beginning of this module, suggests making a boot script named /etc/rc2.d/S70ipmp.targets. This works on both Solaris 9 and Solaris 10.
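The contents of such a boot script might look like the following sketch, reusing the two example target addresses above (the addresses, and the full path to route, are illustrative only):

```
#!/sbin/sh
# /etc/rc2.d/S70ipmp.targets -- add static host routes so that
# in.mpathd has extra probe targets (addresses are examples only)
/usr/sbin/route add -host 192.168.1.39 192.168.1.39 -static
/usr/sbin/route add -host 192.168.1.5 192.168.1.5 -static
```

Because the routes are static, they survive router discovery changes and remain available to in.mpathd as probe targets.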
When to use a global file system instead of a non-global failover file system
How to set up the /etc/vfstab file for file systems
When to use affinity switching
How to use the HAStoragePlus resource type with scalable services
The file system is for a failover service only.
The Nodelist property for the resource group contains only the nodes physically connected to the storage; it should be the same as the node list for the device group.
Only services in a single resource group use the file system.
If these conditions are true, you generally receive a performance benefit from using a failover file system, especially for file-system-intensive services.
The file system is for a scalable service.
The file system is for a failover service that must fail over to a node not physically connected to the storage.
The file system contains data for different failover services in different resource groups.
You have an administrative reason to need access to the data from a node not running the service.
The vfstab file global option automatically enables logging, but you must explicitly include the logging option for a failover file system. It is a good practice to keep the logging keyword for record keeping, and in case you convert a global file system to a failover file system.
Solaris 9 4/04 (Update 6) and Above (Including Solaris 10)

The logging option is the default for both global and failover file systems.
Failover file systems must have the value no in the "mount at boot" column and must not have the word global in the options column of the vfstab file, as in the following example. There is no harm in including the logging option, even if it is the default, as in the Solaris 10 OS:

/dev/vx/dsk/nfsdg/burns /dev/vx/rdsk/nfsdg/burns /localnfs ufs 2 no logging

The VxFS file system has always logged by default. The cluster software requires that the vfstab file entries be present and identical on all nodes in the Nodelist property for the resource group in which you put the HAStoragePlus resource. This includes nodes not connected to the storage, for global file systems only. The VALIDATE method for the HAStoragePlus resource type enforces this and does not distinguish between nodes physically connected and nodes not connected to the storage.
The node list for the resource group includes only nodes physically connected to the storage.
No other services outside of that resource group use the storage.
Shared Storage File System Best Practices

Global file systems in scalable services ignore the AffinityOn parameter setting. Set this parameter to false to indicate that you understand its function, although it has no real meaning in this context.
The purpose of the HAStoragePlus resource START method is to ensure access to the storage. You want to ensure access from every node that is to run the data service. You want to place a dependency between the scalable data service and the global storage. This properly prevents starting the data service on any node where the storage is not accessible.
The proper relationship between the resources and resource groups associated with a typical scalable service is shown in Figure 6-1.

[Figure 6-1: diagram showing a scalable resource group containing a SUNW.apache resource and a SUNW.HAStoragePlus resource, and a failover resource group containing a SUNW.SharedAddress resource, linked by Resource_dependencies]
Solaris VM mirroring is generally easier in recovery scenarios. VxVM root mirroring, if you are already using VxVM for your data, may be considered simpler in that you are only using one volume management product.
Partitioning your boot disk prior to encapsulation
Encapsulating the boot disk with clvxvm encapsulate
Making sure you still have a logging root file system
Properly mirroring the boot disk
Unencapsulating the boot disk onto either the original encapsulated disk or a mirrored copy
To acquire space for the non-overlapping private region, the boot disk encapsulation process removes a cylinder from the beginning of your swap slice. You lose little, and it is an acceptable practice to let the VxVM software boot encapsulation do this.

It is a good practice to separate the /var directory as a separate file system so that runaway logging does not fill the entire boot disk. Make sure that you give plenty of room for normally large-sized log files. On an 18-Gbyte or 36-Gbyte boot disk, 4 Gbytes or 6 Gbytes are average sizes for the /var file system.

The VxVM software can encapsulate (and later mirror) any file system on the boot disk, as long as you stay within the partition limit of five. You need a partition for the /global/.devices/node@# file system. This partition must be on a local disk. While theoretically it could be on a separate local disk from the root disk, it is not recommended to increase the number of local disks required for boot. You should put the global devices file system on the boot disk. The following highlights the points to remember:
Never put more than five partitions, including swap, on the boot disk.
Do not separate out the root, /usr, and /opt file systems.
Make a separate /var partition if you are concerned about runaway logging.
Put the /global/.devices/node@# file system on the boot disk.
The volume name for that file system is a different name on each node.
The minor number for that file system is a different minor number on each node.
The /etc/vfstab file is properly edited prior to the reboot so that the file system is recognized by the VxVM scripts as part of the boot disk, and the correct volume is inserted by these scripts.
The vxmirror command calls the vxrootmir command on the boot disk to mirror the boot disk volumes.
Performing Unencapsulation
If you properly mirror your boot disks, you can unencapsulate onto the original encapsulated disk or onto an initialized mirror copy. All versions of VxVM software supported by Sun Cluster 3.2 make the original boot disk and its mirror copy identical, and you can unencapsulate all partitions, leaving either copy as the remaining copy. To unencapsulate, you must unmirror each boot disk volume individually. This example shows the removal of the mirrored copy (the mirror halves living on the disk rootmir). You could unencapsulate just as easily by removing the halves on the original root disk:

# vxprint -g rootdg
# vxassist -g rootdg remove mirror rootvol !rootmir
# vxassist -g rootdg remove mirror swapvol !rootmir
# vxassist -g rootdg remove mirror rootdisk_13vol !rootmir
# /etc/vx/bin/vxunroot
The vxunroot command detects any problems before the attempt to unencapsulate begins.
The system will continue running if at least half of the state database replicas are physically accessible.
The system will panic if fewer than half of the state database replicas are physically accessible.
The system will not reboot into multiuser mode unless more than half of the state database replicas are physically accessible and consistent.
You can force Solaris VM to run, even if a majority of the database replicas are not available, by setting the tunable mirrored_root_flag to 1 in the /etc/system file. The default value of this tunable is disabled, which requires that a majority of all replicas be physically accessible before the Solaris VM software will start. To enable this tunable, type the following:

# echo "set md:mirrored_root_flag=1" >> /etc/system

Consider using at least three local disks when using Solaris VM to manage your boot disks. If for no other purpose, this third local disk can be a container for a replica in the event that one of the mirrored disks fails. You then have full availability, even after a single failure.
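The three replica rules reduce to simple arithmetic on the accessible count. This toy script (an illustration only, not a real Solaris VM check) evaluates each possible count against a hypothetical four-replica configuration:

```shell
#!/bin/sh
# Toy illustration of the SVM state-database replica rules.
# With TOTAL replicas and ACC of them accessible:
#   ACC*2 >  TOTAL -> runs, and can reboot to multiuser (strict majority)
#   ACC*2 == TOTAL -> keeps running, but cannot reboot to multiuser
#   ACC*2 <  TOTAL -> panics
TOTAL=4
for ACC in 4 3 2 1; do
  if [ $((ACC * 2)) -gt "$TOTAL" ]; then
    STATE="runs, reboots to multiuser"
  elif [ $((ACC * 2)) -eq "$TOTAL" ]; then
    STATE="runs, but no multiuser reboot"
  else
    STATE="panics"
  fi
  echo "$ACC of $TOTAL replicas accessible: $STATE"
done
```

The even-count case shows why three replicas on three disks is the recommended minimum: with four replicas split two and two, the system stays up but cannot return to multiuser mode after a reboot.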
Create one-way mirrors, and reboot using your new metadevices before attaching the second submirror. If you attach the second mirror before rebooting, it can be corrupted by the time you reboot (since only the original partition is still mounted before the reboot).
The metadevice number must be different on each node for the /global/.devices/node@# file system, since each of these is mounted as a global file system. Update /etc/vfstab to use the new metadevices:
The metaroot command updates only the entry for the root file system (and also adds the correct line to /etc/system). All other entries (swap, /global/.devices/node@#) must be edited manually before the reboot.
The lockfs command is used just before the reboot to flush all transactions out of any logs and write the transactions to the file system. You must use this just once before the reboot because there are file systems that use the logging option.
Adding quorum devices so that the number of quorum votes is one less than the number of node votes
Writing a script to periodically check for quorum device failure
Deciding when to use a quorum server
[Figure 6-2  Pair+N Cluster: Node 1 through Node 4 (one vote each) connected by a junction, with three quorum devices QD(1) attached to the storage pair]
You want to be able to boot the cluster with only Node 1 or only Node 2 and still have full access to the data. With only one or two quorum devices, you cannot do that, because you need at least two nodes to form a quorum of three out of five votes, or four out of six votes. The optimal number of quorum votes from devices allows you to form a one-node cluster of four votes, as long as it is a node connected to the storage. A more complicated example might be a six-node cluster with three pairs of two nodes, shown in Figure 6-3.
[Figure 6-3  Six-node cluster arranged as three pairs (N1/N2, N3/N4, N5/N6), each pair sharing one quorum device Q(1)]
With only the three quorum devices shown above, you cannot lose one pair and half of each of the other pairs of nodes. You might have two-thirds of your data storage serviceable, but you cannot form a cluster with only four out of the possible nine votes. By adding two more quorum devices, as shown in Figure 6-4 on page 6-23, you can form a cluster.
You can survive more possible outages with the configuration shown in Figure 6-4. For example, Nodes 2 and 4 give you a total of six quorum votes, as do Nodes 2 and 5 or Nodes 3 and 5.
Figure 6-4 Six-Node Cluster With Five Quorum Devices (Nodes N1 through N6, one vote each, and five quorum devices of one vote each)
/dev/did/rdsk/d11    theo       Ok
/dev/did/rdsk/d12    vincent    Ok
/dev/did/rdsk/d13    theo       Ok
/dev/did/rdsk/d14    vincent    Ok
/dev/did/rdsk/d15    theo       Ok
#!/bin/ksh
PATH=$PATH:/usr/cluster/bin
MYNAME=$(uname -n)
if [[ -x /usr/sbin/clinfo ]] && /usr/sbin/clinfo
then
    for QDEV in $(clq list -t scsi)
    do
        if cldev list -v $QDEV | grep " $MYNAME:" >/dev/null 2>&1
        then
            STATUS_OUTPUT=$(cldev status -n $MYNAME $QDEV | \
                nawk '$2=="'$MYNAME'" {print $3}')
            if [[ $STATUS_OUTPUT != Ok ]]
            then
                logger -p daemon.crit "Quorum device $QDEV is faulted"
            fi
            echo "$(date): Quorum $QDEV $STATUS_OUTPUT" >> /var/cluster/qcheck
        fi
    done
fi
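The status extraction used by the script can be seen in isolation in the following self-contained sketch; the node names, device paths, and column layout are illustrative stand-ins for real cldev status output, and awk substitutes for the Solaris nawk:

```shell
# Extract the status column for this node from canned status output
MYNAME=theo
STATUS_OUTPUT=$(awk -v me="$MYNAME" '$2 == me {print $3}' <<'EOF'
/dev/did/rdsk/d11 theo Ok
/dev/did/rdsk/d12 vincent Ok
EOF
)
echo "$STATUS_OUTPUT"
```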
Figure 6-5 Quorum Server Quorum Device (Nodes 1 through 4, one vote each, connected through a switch to an external machine running the quorum server software; the scqsd daemon provides three votes for this cluster and can serve as quorum for other clusters too)
Note that the quorum server daemon is automatically assigned a number of votes that is one fewer than the number of nodes. You would therefore never assign both a quorum server quorum device and another traditional quorum device; you would end up with too many quorum device votes. It might seem that using a quorum server quorum device could be a good practice for any cluster, but this is not necessarily true. Consider a Pair+1 (three-node) cluster. A quorum server quorum device will automatically be assigned two votes. Therefore any node, including the non-storage node, can form the cluster by itself, and any node can be the last node remaining in the cluster.
Is it a good idea for a non-storage node to be left in the cluster by itself? It might seem harmless at first: you might think that normal HAStoragePlus dependencies would prevent applications from running on the third node anyway. However, there is a problem:
HAStoragePlus dependencies can prevent applications from starting, but they cannot prevent an application from continuing to run if the storage disappears, because there is no fault monitor associated with HAStoragePlus. A non-storage node that is unable to access a global file system can still cause the global file system mount point to be busy, thus preventing the proper restoration of a global file system even when one of the storage nodes returns.
The conclusion is that the quorum server is not an ideal solution for all topologies. In particular, it is a best practice to use the quorum server quorum device only when you understand and desire the semantics of an all-connected quorum device.
The difference between these two topologies is the location in which the quorum device (always required for two-node clusters) is placed.
Figure 6-6 Two-Site Configuration (Node 1, Node 2, and storage)
In the two-site configuration, the quorum device is located at the site of one of the nodes. If this entire site goes down, you will have complete loss of availability.
Figure 6-7 Three-Site Configuration (Node 1, Node 2, two storage units, and the quorum device at a third site)
The three-site topology behaves much like a standard, non-campus, two-node cluster during split-brain scenarios. This is the preferred campus cluster, but it requires more hardware and a third site.
Choosing A Quorum Device, a Quorum Node, or a Quorum Server for the Third Site
In the three-site campus cluster, the third site could contain:
A traditional disk quorum device
A quorum node (a third node configured in the cluster just for the purposes of quorum)
A quorum server device
The third choice may be the easiest in that you need to make neither storage connections nor private network connections to the third site. You just need public network accessibility between each node and the quorum server.
The preferred plex feature of the VxVM software
The VxVM software allows you to define a preferred mirror plex. When configuring this feature, set the preferred plex to the plex that is local to that site. When the preferred plex is set, the VxVM software performs all read operations from that plex. Write operations are applied to both plexes. Only if the preferred plex fails does the VxVM software perform all read and write operations on the nonpreferred plex. You must manually perform or automate the change of the preferred plex after a switchover or failover of the resource group. The VxVM software does not perform this change automatically.
Note – Solaris VM has a feature somewhat analogous to preferred plex, but it would not be possible to take advantage of it in this scenario. In Solaris VM the preferred copy is always the first submirror. If you want to switch the preferred copy, you have to detach your first submirror and add it back. This would cause a full resync of your data, which is obviously not what you want to achieve.
To improve performance, always use failover file systems rather than global file systems for failover services. Scalable services, by contrast, must use global file systems because the data must be accessible from all nodes in the node list of the scalable resource group.
Task 1 – Mirror the boot device using Solaris VM software
Task 2 – Encapsulate and mirror the boot device using VxVM software
Task 3 – Verify IPMP best practices
Task 4 – Implement quorum device monitoring
Preparation
No preparation is required for this exercise.
Warning – Make sure you run prtvtoc on your current root disk, and apply its output with fmthard to your other local disk (the second disk may have been your root disk at the beginning of the course, before the live upgrade). If you do this the wrong way, you can destroy your current root disk.
# prtvtoc /dev/rdsk/cAtAdAs0 | fmthard -s - \
/dev/rdsk/cBtBdBs0
3. If you have only two local drives, make sure that both of them have the same number of replicas. You may already have replicas on both disks (if you were using Solaris VM for your data), or you may have no replicas at all and have to add them to both disks:
# metadb
# metadb -a -f -c 3 cAtAdAs7
# metadb -a -c 3 cBtBdBs7
4. Create a simple concatenation submirror containing the existing root partition:
# metainit -f d11 1 1 cAtAdAs0
5. Create a second submirror using slice 0 of the other local disk:
# metainit d12 1 1 cBtBdBs0
6. Create a one-way mirror of the root partition using the active submirror:
# metainit d10 -m d11
7. Make a backup of the /etc/vfstab and /etc/system files:
# cp /etc/system /system.backup
# cp /etc/vfstab /vfstab.backup
8. Modify the /etc/vfstab and /etc/system files with metadevice entries for the boot device:
# metaroot d10
9. Create a simple concatenation submirror containing the existing swap partition:
# metainit -f d21 1 1 cAtAdAs1
10. Create a second submirror for the swap partition:
# metainit d22 1 1 cBtBdBs1
11. Create a one-way mirror of the swap partition using the active submirror:
# metainit d20 -m d21
12. Create a simple concatenation submirror containing the global devices partition:
# metainit -f d31 1 1 cAtAdAs3
13. Create a second submirror for the global devices partition:
# metainit d32 1 1 cBtBdBs3
14. Create a one-way mirror of the global devices partition using the active submirror. Note that the device name must be different on different nodes. For example, use d301 for node 1, d302 for node 2, and so on:
# metainit d30X -m d31
15. Edit /etc/vfstab and change each of the standard Solaris partitions on the boot device to metadevices. You will notice that the entry for the root file system is already edited (by metaroot, above). You need to edit the lines for swap (use /dev/md/dsk/d20) and /global/.devices/node@# (use /dev/md/dsk/d30X and /dev/md/rdsk/d30X):
# vi /etc/vfstab
16. Flush the UFS logs to the file system:
# lockfs -fa
17. Add the mirrored root flag to /etc/system:
# echo "set md:mirrored_root_flag=1" >> /etc/system
18. Reboot the node:
# init 6
19. Attach the remaining submirrors:
# metattach d20 d22
# metattach d30X d32
# metattach d10 d12
Note – It would be more efficient to run these one at a time, waiting until one is finished (monitoring with metastat) before starting the next one. It takes longer for them all to complete in parallel, but the resync happens in the background, so you do not need to wait.
20. Modify the PROM boot-device parameter to include both submirrors.
a. Identify the path to each root partition submirror:
# ls -l /dev/dsk/rootdisks0
# ls -l /dev/dsk/rootmirrors0
The path begins after the string devices. For example, if the root disk is:
../../devices/sbus@1f,0/SUNW,fas@e,8800000/sd@0,0:a
then use the following as the path to the boot slice:
/sbus@1f,0/SUNW,fas@e,8800000/sd@0,0:a
b. Edit the boot-device parameter:
# eeprom boot-device="path-to-boot-slice path-to-mirror-slice"
Task 2 Encapsulating and Mirroring the Boot Disk Using VxVM Software
Perform the following steps if your cluster is using VxVM and you want to mirror with VxVM.
1. Remove the bootdg disk group on which your previous (pre-upgrade) root disk had been encapsulated:
# vxprint -g bootdg
# vxdg destroy bootdg
# vxprint -g bootdg
2. Encapsulate your boot disk:
# clvxvm encapsulate
3. After the reboots, edit the /etc/vfstab file and remove the nologging option for the root file system. Make sure you still have seven columns on that line. You can put in the word logging, or just leave a minus sign, as logging is the UFS default for all Solaris OS versions supported in Sun Cluster 3.2.
4. Reboot one more time to make your /etc/vfstab change take effect.
5. Identify the local disk that you plan to use as the boot disk mirror:
# vxdisk list
6-36 Sun Cluster 3.2 Advanced Administration
6. Verify that the boot disk mirror is the same size and geometry as the original boot disk:
# format
7. Mirror the boot drive:
# /etc/vx/bin/vxdisksetup -i c#t#d# format=sliced
# vxdg -g rootdg adddisk rootmir=c#t#d#
# /etc/vx/bin/vxmirror -g rootdg \
rootdisk-vm-name rootmir
Do you have multiple adapters in your IPMP group?
Are IPMP test addresses on the physical or virtual interface? Hint: Look for interfaces with the NOFAILOVER flag.
Is the STANDBY flag set for either test interface?
Is the DEPRECATED flag set on the test interfaces?
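These flags appear in ifconfig output, so a quick way to answer the questions above is to filter for them (the interface names on your system will differ):

```
# ifconfig -a | egrep "NOFAILOVER|STANDBY|DEPRECATED"
```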
2. Obtain IP addresses for the test interfaces from your instructor. Modify the /etc/hostname.xxx files as needed to satisfy the following conditions:
If you have both a physical and a test interface, the test interface is on the virtual interface.
No standby options are used.
The deprecated option is used on the test interfaces.
The following are examples of ideal files. Make sure you use your own correct interface names and IP names:
proto192# cat /etc/hostname.qfe1
proto192 group therapy netmask + broadcast + up addif proto192-qfe1-test -failover deprecated netmask + broadcast + up
proto192# cat /etc/hostname.qfe2
proto192-qfe2 group therapy netmask + broadcast + up addif proto192-qfe2-test group therapy -failover deprecated \
netmask + broadcast + up
3. Check whether the FAILBACK=yes option is set in the /etc/default/mpathd file. If it is not, set it.
4. Reboot your node to put any changes made into effect.
5. Repeat this exercise on the other node if you so desire.
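For reference, the relevant lines of /etc/default/mpathd are sketched below; the values shown are believed to be the shipped defaults, so you usually only need to confirm that FAILBACK has not been changed to no:

```
FAILURE_DETECTION_TIME=10000
FAILBACK=yes
TRACK_INTERFACES_ONLY_WITH_GROUPS=yes
```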
Configure the cron utility to run the job every minute:
# EDITOR=vi; export EDITOR
# crontab -e
(add a line)
* * * * * /full_pathname_to_script
5. Check the /var/cluster/qcheck file periodically.
6. If possible, generate a quorum device failure by pulling out the quorum device, and observe the /var/cluster/qcheck file output. If the quorum device is a hardware RAID LUN, you might need to power off the entire box or uncable it to simulate a failure. It might take the scdpmd daemon up to 10 minutes to detect the failed quorum drive.
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
Module 7
Identify the importance of a security policy
Identify security vulnerabilities in a Sun Cluster 3.x software environment
Use the Solaris Security Toolkit software
Download and install security software on the cluster nodes
Implement the Toolkit software secure cluster driver
Provide secure clustered services
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics described in this module. Although they are not expected to know the answers to these questions, the answers should interest them and inspire them to learn the material.
Do you have and implement a security policy?
What are the most common types of security threats you encounter?
What are the security vulnerabilities particular to a system running the Sun Cluster 3.x software?
Additional Resources
The following references provide additional information on the topics described in this module:
Sun Microsystems, Inc. Sun Cluster System Administration Guide for Solaris OS, part number 819-2971.
Alex Noordergraaf. Securing the Sun Cluster 3.x Software. [Online] Available at http://www.sun.com/solutions/blueprints/0203/817-1079.pdf. February 2003.
Joel Weise and Charles R. Martin. Developing a Security Policy. [Online] Available at http://www.sun.com/solutions/blueprints/1201/secpolicy.pdf. December 2001.
It is implementable through fair and realistic procedures.
It is enforceable with security tools.
It defines the responsibility of all members of the organization.
It is documented and distributed.
This module can help you define a realistic security policy for the Sun Cluster environment by identifying which services are required and which services may be deleted in the Sun Cluster environment. Your enterprise may have a general security policy in place, for example, that mandates that rpcbind be eliminated whenever possible. However, Sun Cluster requires certain RPC services, and your cluster will not be able to operate if you eliminate rpcbind.
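On Solaris 10, for example, you can confirm that the RPC binder the cluster depends on is still online after applying a policy; svc:/network/rpc/bind is the standard Solaris 10 SMF service name:

```
# svcs network/rpc/bind
```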
The remainder of this module describes resources available to you to implement security in a Sun Cluster software environment. There are several tools in the Solaris OS that you can use to implement security, including the following:
Secure clustered services
Secure Shell (SSH) utility
Basic Security Module (BSM)
Automated Security Enhancement Tool (ASET)
Solaris Security Toolkit software
Other security tools, such as Crack, TripWire, SATAN, SAINT, and TCP_Wrappers
This module focuses on using the Solaris Security Toolkit software to implement security in a Sun Cluster 3.x software environment.
Solaris OS software
Oracle Real Application Cluster (RAC) software
Cluster interconnects
Internet services
Cluster services
Console access
Node authentication
Note – The Toolkit Sun Cluster 3.x software driver disables both the rsh and rcp utilities by default. It is possible to install the RAC software on each node and set up configuration files manually on each node if you do not want to change security settings. Alternatively, you can use the Secure Shell (SSH) ssh and scp commands to replace the functionality of the rsh and rcp commands. These commands provide an encrypted and authenticated mechanism for Oracle and any other software to perform tasks on remote machines. This simplifies the installation and configuration of the Oracle RAC software in a secure manner. The Oracle runInstaller utility provides the ability to specify ssh and scp as the internode communication mechanism by specifying their paths on the command line:
$ ./runInstaller -remoteshell /usr/bin/ssh \
-remotecp /usr/bin/scp
These are defined in the /etc/inetd.conf file in Solaris 9 and as SMF services in the Solaris 10 OS.
none – Any node is allowed to add itself to the cluster.
sys – A node is authenticated if its host name and IP address are consistent with what the current cluster members think the host name and IP address are.
des – A node is authenticated using the Diffie-Hellman public-key mechanism. If you intend to use des, then you must manually configure the public keys, secret keys, and Secure RPC net names using files, Network Information Service (NIS), NIS+, or LDAP.
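The authentication protocol can be inspected and changed with the claccess command in Sun Cluster 3.2; a minimal sketch (the protocol value shown is just an example):

```
# claccess show
# claccess set -p protocol=sys
```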
Goal of the Toolkit software
Modes of operation in which the Toolkit software runs
Types of modifications the Toolkit software makes
Standalone mode – Run the Toolkit software from a command line. Standalone mode allows you to make security modifications without reinstalling the Solaris OS. It is particularly useful when re-hardening a system after packages or patches are installed. Applying patches and installing packages might overwrite or modify files that the Toolkit software previously modified. By re-running the Toolkit software, you can reimplement any security modifications undone by the patch or package installation.
JumpStart mode – Ideally, you harden systems during installation. You can use the Toolkit software to harden systems during the third phase of a JumpStart software installation by running the Toolkit software scripts in a finish script.
Categories of Modication
Each of the modifications performed by the Toolkit software to harden or minimize Sun Cluster 3.x software nodes falls into one of the following categories:
Finish Subdirectory
All the actual worker-bee scripts live in the Finish directory of the Toolkit arena. The name of this directory reflects how the Toolkit's origins emphasized its use in the JumpStart environment, but the same scripts are executed if you run the Toolkit in standalone mode, as is most likely in the cluster environment. In the 4.2.0 version of the software, there are 116 scripts in the Finish directory. You will see that not all scripts are appropriate for all environments. For this reason, it is not recommended to run these scripts directly, nor are you likely to even run them all.
Audit Subdirectory
This arena contains a collection of scripts similar to those in the Finish directory. Rather than actually hardening your system, these scripts audit your system; that is, they report on whether your system is already hardened.
Drivers Subdirectory
This arena contains master scripts that configure and execute the collections of Finish or Audit scripts that are most appropriate for certain environments. There is a specific driver for the Sun Cluster environment, a different driver appropriate for the Sun Fire 15K System Controller, a different driver appropriate for general non-clustered servers, and so on. Each driver is actually composed of three files in the Drivers arena, for example:
suncluster3x-secure.driver
suncluster3x-config.driver
suncluster3x-hardening.driver
When you execute the Toolkit, you refer to the appropriate ...-secure.driver file. This file then uses the ...-config.driver to set options and variables and then calls the ...-hardening.driver to actually drive the proper collection of Finish or Audit scripts.
Files Subdirectory
This directory contains files that are inserted into your system as part of the implementation of some of the scripts.
bin Subdirectory
This directory contains the actual jass-execute utility.
Select one of the listed runs as the final run to undo. All system modifications performed in that selected run, and in any runs made after it, are undone. There are two important limitations to keep in mind with this feature:
If you select the Toolkit software option to not archive files, the undo feature is not available.
You can undo a run only once. After a run is undone, all the files backed up by a Toolkit software run are restored to their original locations and are not backed up again.
If you manually change some of the security modifications that the Toolkit software has made, the jass-execute command warns you that the security modification has changed. The information needed for the undo feature is logged in the /var/opt/SUNWjass directory. For each run, a new subdirectory is created in the /var/opt/SUNWjass/runs directory. This subdirectory stores the necessary archive and log information for the undo feature. Note – Never modify the contents of the files in the /var/opt/SUNWjass/runs directory.
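An undo run is started with the jass-execute undo option from the Toolkit bin directory; a minimal example:

```
# cd /opt/SUNWjass/bin
# ./jass-execute -u
```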
Solaris Security Toolkit software
Recommended and security patches
FixModes freeware software
MD5 software
Installs and executes the FixModes software to tighten file system permissions
Installs the MD5 software
Installs the Recommended and Security Patch software
Implements over 100 Solaris OS security modifications
An outline of the hardening procedure is:
1. Run the hardening on one node at a time. Execute the suncluster3x-secure.driver script as follows:
# cd /opt/SUNWjass/bin
# ./jass-execute -d suncluster3x-secure.driver
2. Reboot the node.
3. Verify that the node is hardened.
4. Verify that the node operates properly in the cluster.
5. Repeat Steps 1-4 on the remaining nodes.
Note – The Solaris Security Toolkit will make this modification automatically.
Opens a TCP socket connection with the server
Runs an ldapsearch command on the base of the directory
If you run your service in secure mode, the agent opens a socket connection with the server but does not run the ldapsearch command when performing a probe. Note – In essence you have a trade-off: if you run the more secure cluster service, the fault monitoring for that service in the cluster is less robust. This is true for LDAP and for the two web server applications in the following two subsections.
Task 1 – Install the Toolkit software on the selected node
Task 2 – Execute the suncluster3x-secure.driver script
Task 3 – Verify that the selected node is hardened
Task 4 – Verify that the selected node operates properly in the cluster
Task 5 – Harden the remaining nodes (optional)
Task 6 – Undo the security modifications on each cluster node
Preparation
No preparation is required for this exercise.
Use the pkgadd utility to install the Toolkit software:
# pkgadd -d /var/tmp SUNWjass
Task 4 Verifying That the Selected Node Operates Properly in the Cluster
Perform the following steps to verify that the selected node operates correctly:
1. Verify that the node can run the ora-rg resource group as follows:
a. Switch the ora-rg resource group to the selected node. Type:
# clrg switch -n selected-node ora-rg
2. Verify that you can access the Oracle database. You should be able to act as an Oracle client from either node, regardless of the node on which the failover service is running.
# ksh
# cd /oracli
# ls
clienv  oraInventory/  product/
# . ./clienv
# which sqlplus
# sqlplus SYS@MYORA as sysdba
Enter password: cangetin
SQL> select * from mytable;
SQL> quit
3. Invoke your web browser on your display station.
4. Navigate to http://iws-lh-name/cgi-bin/test-iws.cgi
Click the reload or refresh button several times to verify the behavior of the scalable application. Verify that you get responses from the node that has been hardened, as well as from the other node.
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
Module 8
Describe how to troubleshoot clustered services
Identify log files for each layer
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics described in this module. Although they are not expected to know the answers to these questions, the answers should interest them and inspire them to learn the material.
Discussion The following questions are relevant to understanding the content of this module:
Are there special considerations when debugging application problems in a cluster?
What resources are available for troubleshooting cluster software failures?
What are the interdependencies among cooperating software in the cluster?
Additional Resources
The following references provide additional information on the topics described in this module:
Sun Microsystems, Inc. SunSolve online support Web site. [Online] Available at: http://sunsolve.sun.com.
Sun Microsystems, Inc. Sun Cluster Error Messages Guide for Solaris OS, part number 819-2973.
After narrowing the search, begin troubleshooting within that functional area. Problems might occur in one or more of these functional areas that prohibit you from using the application. Some of these issues are described in the following sections.
A user does not use the application properly, but everything else works well.
The application is misconfigured.
The application has bugs.
The client OS is misconfigured.
The client OS is heavily loaded, causing the application to hang.
The client host hardware is faulted or misconfigured.
The network is heavily loaded, causing application timeouts.
A name server for either the client or server is unreachable.
Routers are not forwarding IP packets between the hosts.
The network is faulted or misconfigured.
The service is misconfigured.
The service has bugs.
The cluster fault monitor is not detecting error conditions in the application.
The cluster is unable to start or stop the application.
The server OS is misconfigured.
No public network interfaces are available.
The cluster is shut down.
Figure 8-1 shows a clustered-service software stack.
Figure 8-1 Clustered-Service Software Stack (top to bottom: Application, Data Service, Cluster Framework, Operating System, Server Hardware)
Application Layer
The application layer contains the application, such as Oracle software or Sun Java System Web Server, and includes the application configuration files, scripts, and binaries.
Hardware Layer
The server hardware layer includes all hardware for the physical nodes of the cluster. The hardware layer can contain the following items:
Server chassis and system boards
Storage arrays and their switches, hubs, and cables
Cluster transport interfaces, switches, and cables
Public network interfaces
Many dependencies exist within components of the same layer in the stack. The following are some examples:
The Sun Java System Messaging Server depends on the Sun Java System Directory Server.
A resource can depend on another resource in the same resource group or in another resource group.
An I/O card depends on its system board.
Symptom
Every time you try to enable the Oracle failover resource group, the Oracle server resource fails to start. This causes the resource group to attempt to fail over to the next node, where the server resource fails as well. Finally, the entire group remains offline as the attempts to start are halted by the Pingpong_interval feature (as discussed in Module 3).
Strategies
The following strategies can help you isolate the layer in which the problem is occurring:
Offline the Oracle resource group. Verify that other application resource groups are operating correctly. Verify that you can successfully switch over other application resource groups.
Verify that the cluster framework utilities are behaving properly. Use the various status suboptions of the commands, and so on.
With the Oracle resource group offline, disable (clrs disable) only the Oracle server resource. Switch the rest of the Oracle resource group online on a particular node (you likely still need the IP address and storage provided by the rest of the group). Verify that the IP and storage resources behave properly.
Start the Oracle server by hand using sqlplus /nolog. If you are having problems starting the Oracle server, you can likely isolate the problem to the application configuration. The sqlplus utility gives you better error messages than anything you can find in cluster log files.
If you have no problems starting Oracle by hand and then accessing the database, the problem is likely in the agent configuration. Verify the values of the Oracle server properties (clrs show -v ora-server-res-name), which is where your problem likely lies.
8-10
- Application log files for the Oracle software and Sun Java System Web Server
- Data service agent log files
- Cluster framework log files
Access log The access log contains information about client requests and server responses.
Error log The error log contains information about errors that the server encountered after creation of the file. It also contains informational messages about the server, such as when the server started.
8-11
PID log The PID log contains the process ID for the web server watchdog daemon, which in turn starts the web server itself. The software needs to record the process ID of the watchdog daemon in order to stop the service.
Setup log The setup log contains general and error information concerning the installation of the web server using the setup utility and is found in the server-root/setup directory.
The following example shows how the required location of the first three logs has been changed to a location local to each node, rather than a location in the global file system, for a scalable service:
# cd /global/web/iws/https-iws-lh.sunedu.com/config
# grep /var/iws server.xml
<PROPERTY name="accesslog" value="/var/iws/logs/access"/>
<LOG file="/var/iws/logs/errors" loglevel="info" logvsid="false" logstdout="true" logstderr="true" logtoconsole="true" createconsole="false" usesyslog="false"/>
# grep /var/iws magnus.conf
PidLog /var/iws/logs/pid
Oracle Software
The Oracle database server maintains two different types of log files that you can use for troubleshooting and debugging purposes:
Trace files Each server and background process can write to an associated trace file. When a process detects an internal error, it writes information on the error to its trace file. The file name format of a trace file is:
processname_PID_sid.trc
Where:
8-12
- The processname value is a three- or four-character abbreviated process name identifying the process that generated the file (for example, the pmon, dbwr, ora, or reco name).
- The PID is the process ID number.
- The sid is the instance system identifier.
A sample trace file name is $ORACLE_BASE/admin/$ORACLE_SID/bdump/lgwr_1237_TEST.trc. All trace files for background processes are written to the destination directory specified by the BACKGROUND_DUMP_DEST initialization parameter. If you do not set this initialization parameter, the default is the $ORACLE_HOME/rdbms/log directory. All trace files for server processes are written to the destination directory specified by the USER_DUMP_DEST initialization parameter. Set the MAX_DUMP_FILE_SIZE initialization parameter to at least 5000 to ensure that the trace file is large enough to store error information.
Alert files The alert_sid.log file stores significant database events and messages. Anything that affects the database instance or global database is recorded in this file. This file is associated with a database and is located in the directory specified by the BACKGROUND_DUMP_DEST initialization parameter. If you do not set this initialization parameter, the default is the $ORACLE_HOME/rdbms/log directory.
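The trace file naming convention described above can be exercised with plain shell string handling. This is an illustrative sketch only; the sample file name lgwr_1237_TEST.trc comes from the text, and the parse_trace_name helper is hypothetical:

```shell
# Hypothetical helper: split an Oracle trace file name of the form
# processname_PID_sid.trc into its three parts.
parse_trace_name() {
  base=${1%.trc}      # strip the .trc suffix
  sid=${base##*_}     # the instance SID is the last underscore-separated field
  rest=${base%_*}
  pid=${rest##*_}     # the process ID is the middle field
  proc=${rest%_*}     # the process name is everything before it
  echo "$proc $pid $sid"
}

parse_trace_name lgwr_1237_TEST.trc   # prints: lgwr 1237 TEST
```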
8-13
- /var/cluster/logs/install/scinstall.log.PID Sun Cluster logs the actions of the scinstall utility in this file.
- /var/cluster/logs/install/scinstall.upgrade.log.PID Both the framework and data service upgrades are logged in this file.
- /var/cluster/upgrade Information regarding the files and packages installed through the upgrade is stored in this directory.
Syslog Logs
After you install the cluster software, it uses the Syslog software for logging informational and error messages. The cluster framework logs to the Syslog daemon and kern facilities, depending on the software component. Each message that the Syslog software writes comprises three components:
- Message ID An integer between 100000 and 999999 that uniquely identifies the message. Use this value as a search key in the Sun Cluster Error Messages Guide for Solaris OS.
- Description An explanation of the problem.
- Solution A proposed solution to the problem. Some solutions are worded specifically, while other solutions recommend contacting a system administrator or Sun support engineer.
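A message in this form can be picked apart with standard text tools. Only the bracketed [ID nnnnnn facility.level] tag below follows the real Solaris syslog message-ID convention; the rest of the sample line is invented for illustration:

```shell
# Invented sample syslog line carrying a six-digit Sun Cluster message ID.
msg='Jun 10 12:00:01 node1 Cluster.RGM: [ID 224900 daemon.notice] resource ora-server-res status msg'

# Pull out the six-digit message ID to use as a search key in the
# Sun Cluster Error Messages Guide.
id=$(echo "$msg" | sed -n 's/.*\[ID \([0-9]\{6\}\) .*/\1/p')
echo "$id"   # prints: 224900
```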
Note The Sun Cluster Error Messages Guide for Solaris OS is a searchable volume of all error messages generated by the Sun Cluster software framework. Use it to find potential solutions to problems.
8-14
RGM Logs
The RGM maintains disk-based logs in the /var/cluster/rgm directory for such data as pingpong timers.
8-15
Module 9
- Observe self-induced problems
- Resolve instructor-induced problems and faults in your cluster
- Perform node disaster recovery in practice
9-1
Copyright 2007 Sun Microsystems, Inc. All Rights Reserved. Sun Services, Revision A
Relevance
Present the following questions to stimulate the students and get them thinking about the issues and topics described in this module. Although they are not expected to know the answers to these questions, the answers should interest them and inspire them to learn the material.
What different ways can you think of to break the cluster software?
What is the most important source of error-logging information in the Sun Cluster software?
How do you know when something is broken in the cluster?
9-2
Additional Resources
The following references provide additional information on the topics described in this module:
Sun Microsystems, Inc. Sun Cluster Error Messages Guide for Solaris OS, part number 819-2973.
Sun Microsystems, Inc. SunSolve online support web site. [Online] Available at http://sunsolve.sun.com.
The goals of this module are to convey the following concepts:
1. The Sun Cluster software effectively resolves many problems without user intervention.
2. Some errors that the Sun Cluster software identifies are clearly communicated to the administrator, and it is a straightforward task to resolve such problems.
3. Some errors that the Sun Cluster software identifies are not clearly communicated to the administrator, and it requires experience to resolve such problems.
The labs in this module are contrived. It is not common, for example, to accidentally kill the wrong daemon, or to mistakenly revoke permissions previously granted to a database user. It is not the goal of this module to present specific problems that a student is likely to encounter in the real world. Instead, the goal is to expose the student to a variety of errors in this safe, non-production classroom environment so that the student can see how errors are expressed in the cluster. This module is designed to give the students a variety of labs from which to choose. The student is not expected to finish all of the tasks in both the self-induced and instructor-induced exercises. Rather, the students should examine all the task summaries and then decide which tasks they are interested in doing.
9-3
You can perform these exercises in any order. The self-induced exercises are entirely self-contained. The instructor-induced exercises require coordination with the instructor. Neither type relies on the completion of any other exercise.
9-4
9-5
- Gain familiarity with how errors are expressed while the cluster operates in error scenarios
- Test the resiliency of the cluster software and determine how effective it is in resolving errors without user intervention
- Identify situations that require operator intervention
Task 1 Induce daemon failures
Task 2 Induce a full root file system
Task 3 Set an incorrect maxusers value
Task 4 Induce operator errors
Preparation
No preparation is required for this exercise.
Note The system panics after 30 seconds. There is nothing you need to do to restore sane operation.
9-6
Kill the rpcbind process. From one cluster node, type the following:
# ps -ef | grep rpcbind
# pkill rpcbind
Note On Solaris 10 OS, this daemon is under control of the Service Management Facility (SMF), and is automatically restarted.
Note The system panics. There is nothing you need to do to restore sane operation.
Note The system panics. There is nothing you need to do to restore sane operation.
Note The system panics. There is nothing you need to do to restore sane operation.
Kill the iws_probe process five times consecutively:
# for i in 1 2 3 4 5
> do
> pkill iws_probe
> sleep 3
> done
Note The probe was under the control of PMF and restarts as many times as it is configured to do. After that threshold is exceeded, the PMF daemon does not do anything in response except log a message in the /var/adm/messages file. To restore sane operation, stop and start the fault monitor for the resource using the clrs unmonitor and clrs monitor commands.

9-7
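The restart-until-threshold behavior the note describes can be sketched as a toy shell loop. This is an illustration of the idea only, not how the PMF daemon is actually implemented:

```shell
# Toy sketch: restart a failing probe up to a configured number of
# retries, then just log a message and give up (as PMF does).
retries=3
count=0
while [ "$count" -lt "$retries" ]; do
  : "the probe process dies here in real life"
  count=$((count + 1))
  echo "restarting probe (attempt $count of $retries)"
done
echo "retry threshold exceeded; logging a message and giving up"
```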
If you are running VxVM in your cluster, perform the following steps:
a. Kill the vxconfigd process:
# pkill -9 vxconfigd
Note The system does not panic. The host on which that daemon was killed is unable to perform any further VxVM volume or disk group operations.

b. Try to perform a simple VxVM operation on that node:
# vxdisk list
c. Restart the vxconfigd process:
# vxconfigd -x syslog -m boot
If you are running Solaris VM in your cluster, perform the following steps:
a. Kill the rpc.metad process:
# pkill rpc.metad
Note The system does not panic. This daemon is restarted by inetd when diskset status is required from that host. There is nothing you need to do to restore sane operation.

b. Print status for any diskset:
# metastat -s ora-ds
9-8
Does the cluster node continue to operate normally? Are there any messages displayed to the console that indicate the root file system is full?
_____________________________________________________________
_____________________________________________________________
___________________________________________________
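One quick way to answer the first question is to check how full the root file system actually is. The df output line below is a fabricated sample; on a live node you would parse the output of df -k / the same way:

```shell
# Fabricated sample line in the usual df -k format for the root slice.
line='/dev/dsk/c0t0d0s0  8705501 8705501       0   100%    /'

# Field 5 is the capacity percentage; strip the % sign.
pct=$(echo "$line" | awk '{gsub(/%/, "", $5); print $5}')
echo "root file system is ${pct}% full"
if [ "$pct" -ge 100 ]; then
  echo "root file system is full"
fi
```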
9-9
Note You need to boot off the network or a CD-ROM into single-user mode, mount the root file system, and remove the entry from the /etc/system file to restore sane operation.
On any cluster node, bring down both private interfaces: # ifconfig first-private-interface down # ifconfig second-private-interface down
Note This causes a split brain, and one of the cluster nodes panics. If the system that panics is the node on which you ran the ifconfig command, then you do not need to do anything to restore sane operation. If the system that panics is the other node, then use the ifconfig command to bring each interface up.
On any cluster node, remove the CCR infrastructure table and make some change: # rm /etc/cluster/ccr/infrastructure # scconf -a -h venus
9-10
Note Use the ftp utility to get a copy of the infrastructure table from another cluster node.
On any cluster node, remove the CCR directory table and make some CCR change: # rm /etc/cluster/ccr/directory # clrg create -n node1,node2 somenewrg
Note Use the ftp utility to get the CCR directory table from another cluster node.
9-11
- Simulate some real-world scenarios that you might confront
- Test the effectiveness of the cluster messages, logs, and other tools discussed in this course to find and resolve problems
- Test your ability to find and resolve problems that occur in your cluster
Task 1 Troubleshoot IPMP errors
Task 2 Troubleshoot an unknown state
Task 3 Troubleshoot a resource STOP_FAILED state
Task 4 Troubleshoot Oracle software resource group errors
Task 5 Troubleshoot an unbootable cluster node
Task 6 Troubleshoot oracle_server resource fault monitor errors
Task 7 Troubleshoot the failure to start a web server
Task 8 Troubleshoot iws-res resource failures on one node
Preparation
No preparation is required for this exercise. Tell your instructor when you are ready to begin a particular task.
9-12
Where is the configuration stored for IPMP?
_______________________________________________
What is the correct syntax of the files?
_______________________________________________
How can you restart your public interfaces with a new configuration?
_______________________________________________
TO RESTORE: On any cluster node, perform the following steps:
1. Type # vi /etc/hostname.pubnet_adapter
2. Correct the typographic error.
3. Reconfigure the adapter using the ifconfig command.
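For reference, a probe-based IPMP configuration file on Solaris 9 or 10 might look like the following. The adapter name (qfe0), host names, and group name are all hypothetical; your class environment may use a different style:

```
# /etc/hostname.qfe0 (hypothetical example)
# Data address in IPMP group ipmp0, plus a non-failover test address.
node1 netmask + broadcast + group ipmp0 up \
addif node1-test deprecated -failover netmask + broadcast + up
```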
9-13
Is this likely to be a data service problem?
_______________________________________________
Where is the core cluster configuration data held?
_______________________________________________
How can you restore this configuration?
_______________________________________________
TO RESTORE: On the cluster node, restore the CCR by taking a backup copy from another cluster node and restoring it to this node.
9-14
What program is responsible for stopping a data service?
_______________________________________________
Was the Oracle software instance stopped? Was the Oracle listener data service also stopped?
_______________________________________________
What utilities and command options do you need to use to fix this problem?
_______________________________________________
TO RESTORE: Rename oracle_server_stop.old to oracle_server_stop. You also need to clear the STOP_FAILED flag of the ora-server-res resource and the ora-rg resource group.
9-15
What messages appear on the consoles of the server nodes that relate to starting the Oracle resource group software?
_______________________________________________
How can you find out the value of this resource parameter?
_______________________________________________
Why does the resource group fail to come online on both nodes?
_______________________________________________
9-16
What do the console messages indicate is the problem?
_______________________________________________
What is blocking Node 1, and what is the node waiting for?
_______________________________________________
What happens if Node 2 is on fire and cannot boot?
_______________________________________________
TO RESTORE: The student must boot node 2 first because it owns the quorum device.
9-17
Where are the error messages logged?
_______________________________________________
Does this error disable the service availability?
_______________________________________________
What does the ORA-1045 message mean? How can you find out?
_______________________________________________
TO RESTORE: Log in as user oracle and run the following commands. 1. Type $ sqlplus /nolog 2. Type SQL> connect / as sysdba 3. Type SQL> grant create session, create table to sc_fm; 4. Type SQL> quit 5. Type $ exit
9-18
Where are the errors logged?
_______________________________________________
Does this appear to be an application, cluster, or operating system problem?
_______________________________________________
How can you avoid problems like this?
_______________________________________________
TO RESTORE: Fix the typographic error in the /etc/hosts file on the cluster node.
9-19
Where are the errors logged?
_______________________________________________
Does this appear to be an application, cluster, or operating system problem?
_______________________________________________
How can you avoid problems like this?
_______________________________________________
TO RESTORE: Perform the following steps on the cluster node:
1. Type:
# useradd -u 80 -g 80 -d / webservd
2. Type:
# clrg online iws-rg
9-20
- Easiest: Do it on the third node (the non-storage node) of a Pair+1 cluster.
- Medium difficulty: Do it on a storage node where all nodes are storage nodes (for example, a node of a two-node cluster, or a node of a three-node cluster where all nodes are attached to storage).
- Hardest: Do it on a storage node of a Pair+1 cluster.
9-21
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
9-22
Appendix A
A-1
Task 1 Bring the application resources offline and start the application manually
Task 2 Install the new Oracle 10g software and upgrade the database
Task 3 Halt the listener and configure the new network components
Task 4 Change and enable the resources
Task 5 Verify that the Oracle database upgrade is successful
Preparation
1. Make sure that your administrative workstation or display machine permits X Windows clients to connect to it. On your administrative workstation or display machine, type the following:
# /usr/openwin/bin/xhost +
2. Make sure that you have at least 3 Gbytes of free space in the /oracle file system. If you do not have enough space, then you must either grow the file system or create a new volume on which you can install the Oracle 10g software binaries.
Note The amount of time required to perform this exercise can vary greatly with the horsepower of your server nodes. It has taken up to four hours on slow equipment.
A-2
2. Exit and return as the oracle user, and verify the environment change:
$ exit
# su - oracle
$ env
3. Run the Oracle 10g Universal Installer software:
$ cd Oracle10gR2-DB-Location
$ ./runInstaller
Respond to the dialogs using Table 9-1.
Table 9-1 The runInstaller Script Dialog Answers
Welcome to the Oracle Database 10g Installation - Select the Advanced Installation radio button (near the bottom). Click Next.
Specify Inventory Directory and Credentials - Click Next.
Select Installation Type - Select the Custom radio button, and click Next.
A-3
Exercise: Oracle Software Installation and Database Upgrade

Table 9-1 The runInstaller Script Dialog Answers (Continued)
Specify Home Details - Verify specifically that the Path matches the new value of $ORACLE_HOME in the .profile file. Click Next.
- Enterprise Edition Options 10.2.0.1.0
- Oracle Enterprise Manager Console DB 10.2.0.1.0
- Oracle Programmer 10.2.0.1.0
- Oracle XML Development Kit 10.2.0.1.0
Click Next.
Product Specific Prerequisite Checks - If you are running in the global zone, these will all succeed, and you should be taken to the next screen without any interaction. If you are running in a non-global zone, you will get a security warning (because there is no /etc/system in a non-global zone). Click Next, and then click Yes to proceed when you get the pop-up warning window.
Privileged Operating System Groups - Verify that both say dba, and click Next.
Create Database - Select the Install database Software only radio button, and click Next.
Summary - Verify, and click Install.
A-4
Table 9-1 The runInstaller Script Dialog Answers (Continued)
Execute Configuration Scripts - In another window, log in as root on the node or non-global zone in which you are running the installer. Execute the two scripts noted (you should be able to paste their names straight from the Oracle installer into your shell window):
/oracle/oraInventory/orainstRoot.sh
/oracle/products/10.2/root.sh
For the second script, accept the default pathname for the local bin directory. Once the scripts are completed, click OK.
End of Installation - Click Exit, and confirm.

4. Run the Network Configuration Assistant to create a new listener for the new version:
$ netca

Oracle Net Configuration Assistant Welcome - Select Listener Configuration, and click Next.
Listener Configuration, Listener - Select Add, and click Next.
Listener Configuration, Listener Name - Verify the listener name is LISTENER, and click Next.
Select Protocols - Verify that TCP is among the Selected Protocols, and click Next.
TCP/IP Protocol - Verify that the Use the standard port number of 1521 radio button is selected, and click Next.
More Listeners - Verify that the No radio button is selected, and click Next.
Listener Configuration Done - Click Next.
Welcome - Click Finish.
A-5
Halt the new listener that was automatically started (we will reconfigure it later to use the ora-lh address):
$ lsnrctl stop
A-6
Step 4: Backup - Lie, and verify that the I have already backed up my database radio button is selected (in order to speed up this lab exercise, pretend you have). Click Next.
Step 5: Management Options - Verify that the Enterprise Manager is not available (because you deselected it when installing the software). Click Next.
Step 6: Summary - Verify, and click Finish.
Database Upgrade Assistant: Progress - The upgrade could take 1-4 hours to complete. It will appear to be stuck at several points, but be patient. Enjoy a fine cup of coffee courtesy of your training center, and a well-deserved nap. Click OK when it is 100% finished.
Database Upgrade Assistant: Upgrade Results - Verify (you do not need to set any more passwords), and click Close.
A-7
A-8
Your entire file should end up looking identical to the following, assuming your logical host name is literally ora-lh:

SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (SID_NAME = PLSExtProc)
      (ORACLE_HOME = /oracle/products/10.2)
      (PROGRAM = extproc)
    )
    (SID_DESC =
      (SID_NAME = MYORA)
      (ORACLE_HOME = /oracle/products/10.2)
      (GLOBAL_DBNAME = MYORA)
    )
  )
LISTENER =
  (DESCRIPTION_LIST =
    (DESCRIPTION =
      (ADDRESS = (PROTOCOL = TCP)(HOST = ora-lh)(PORT = 1521))
      (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC0))
    )
  )

5. Configure the tnsnames.ora file by typing (as user oracle):
$ vi $ORACLE_HOME/network/admin/tnsnames.ora
Modify the HOST variables to match the logical host name ora-lh.
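If you prefer not to edit every HOST entry by hand, a sed substitution can make the change mechanically. The address line below is a made-up sample; back up the real file before editing it, and note that the exact file layout may differ in your environment:

```shell
# Rewrite any HOST = value inside an ADDRESS clause to ora-lh.
# A real edit might be: sed 's/HOST = [^)]*/HOST = ora-lh/' tnsnames.ora > tnsnames.ora.new
echo '(ADDRESS = (PROTOCOL = TCP)(HOST = oldhost)(PORT = 1521))' |
  sed 's/HOST = [^)]*/HOST = ora-lh/'
# prints: (ADDRESS = (PROTOCOL = TCP)(HOST = ora-lh)(PORT = 1521))
```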
A-9
# clrs enable ora-server-res ora-listener-res
# clrs monitor ora-server-res
A-10
Exercise Summary
Discussion Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercise. Describe the following:
Manage the discussion based on the time allowed for this module. If you do not have time to spend on discussion, highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students to describe their overall experiences with this exercise. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Ask students to articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.
A-11
Appendix B
B-1
Task 1 Shutting down Failover Oracle Instances
Task 2 Provisioning the Shared QFS File System
Task 3 Configuring Oracle Virtual IPs
Task 4 Configuring the oracle User Environment
Task 5 Disabling Access Control on X Server of the Admin Workstation
Task 6 Installing Oracle CRS Software
Task 7 Installing Oracle Database Software
Task 8 Create Sun Cluster Resources to Control Oracle RAC Through CRS
Task 9 Verifying That Oracle RAC Works Properly in a Cluster
B-2
Exercise 3: Running Oracle 10g RAC in Sun Cluster 3.2 Software
Preparation
Before continuing with this exercise, read the background information in this section.
Background
Oracle 10g RAC software in the Sun Cluster environment encompasses several layers of software, as follows:
RAC Framework - This layer sits just above the Sun Cluster framework. It encompasses the UNIX distributed lock manager (udlm) and a RAC-specific cluster membership monitor (cmm). In the Solaris 10 OS, you must create a resource group rac-framework-rg to control this layer (in the Solaris 9 OS, creating the resource group is optional; if you do not, the daemons are controlled by standard Solaris boot scripts).
Oracle Cluster Ready Services (CRS) - CRS is essentially Oracle's own implementation of a resource group manager. That is, for the Oracle RAC database instances and their associated listeners and related resources, CRS takes the place of the Sun Cluster resource group manager.
Oracle Database - The actual Oracle RAC database instances run on top of CRS. The database software must be installed separately (it is a different product) after CRS is installed and enabled. The database product has hooks that recognize that it is being installed in a CRS environment.
Sun Cluster control of RAC - The Sun Cluster 3.2 environment features new proxy resource types that allow you to monitor and control Oracle RAC database instances using standard Sun Cluster commands. The Sun Cluster proxy resources issue commands to CRS to achieve the underlying control of the database.
B-3
The various RAC layers are illustrated in Figure 9-1.
[Figure: a layered stack diagram. Top: the Oracle RAC database, with DB instances controlled by CRS; Sun Cluster RAC proxy resources let you use Sun Cluster commands to control the database by proxying through CRS. Middle: Oracle CRS. Bottom: the RAC framework, controlled by the rac-framework-rg resource group.]

Figure 9-1 Oracle 10g RAC Software Layers in the Sun Cluster Environment
- Raw devices using the VxVM Cluster Volume Manager (CVM) feature
- Raw devices using the Solaris Volume Manager multi-owner diskset feature
- Raw devices using no volume manager (assumes hardware RAID)
- Shared QFS file system on raw devices or Solaris Volume Manager multi-owner devices
- On a supported NAS device (starting in Sun Cluster 3.1 8/05, the only such supported device is a clustered Network Appliance Filer)
Note Use of global devices (using normal device groups) or a global file system is not supported. The rationale is that if you used global devices or a global file system, your cluster transport would be used for both the application-specific RAC traffic and the underlying device traffic. The resulting performance penalty might eliminate the advantages of using RAC in the first place.
B-4
The HA-Oracle installation, including the home directory for the user oracle, may be in a failover file system. We require the oracle home directory and binaries to be available to all nodes simultaneously. It is unknown whether the nodes you are running on have the horsepower to run Oracle failover and Oracle RAC simultaneously.
For this reason, shut down your failover Oracle application. Type the following from any one node:
# clrg offline ora-rg
# clrs disable -g ora-rg +
# clrg unmanage ora-rg
B-5
B-6
$ vi .profile

ORACLE_BASE=/orashared
ORACLE_HOME=$ORACLE_BASE/product/10.2.0/db_1
CRS_HOME=$ORACLE_BASE/product/10.2.0/crs
TNS_ADMIN=$ORACLE_HOME/network/admin
DISPLAY=display-station-name-or-IP:display#
if [ `/usr/sbin/clinfo -n` -eq 1 ]; then
    ORACLE_SID=sun1
fi
if [ `/usr/sbin/clinfo -n` -eq 2 ]; then
    ORACLE_SID=sun2
fi
PATH=/usr/ccs/bin:$ORACLE_HOME/bin:$CRS_HOME/bin:/usr/bin:/usr/sbin
export ORACLE_BASE ORACLE_HOME TNS_ADMIN CRS_HOME
export ORACLE_SID PATH DISPLAY

3. Make sure your actual X Windows display is set correctly on the line that begins with DISPLAY=.
4. Read in the contents of your new .profile file and verify the environment:
$ . ./.profile
$ env
5. Enable rsh for the oracle user:
$ echo + > /orashared/.rhosts
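The per-node ORACLE_SID logic in the .profile can be tried outside the cluster by stubbing out clinfo. Everything here is hypothetical scaffolding around the same conditional tests:

```shell
# clinfo_n stands in for /usr/sbin/clinfo -n (which reports this node's
# cluster node number) so the logic can be exercised on any machine.
clinfo_n() { echo 2; }

node=$(clinfo_n)
if [ "$node" -eq 1 ]; then
  ORACLE_SID=sun1
elif [ "$node" -eq 2 ]; then
  ORACLE_SID=sun2
fi
echo "$ORACLE_SID"   # prints: sun2
```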
B-7
Table 9-2 Oracle CRS Installation Dialog Actions
Welcome - Click Next.
Specify Inventory directory and credentials - Verify, and click Next.
Specify Home Details - Change the Path to /orashared/product/10.2.0/crs. Click Next.
Product Specific Prerequisite Checks - Most likely, these checks will all succeed, and you will be moved automatically to the next screen without having to touch anything. If you happen to get a warning, click Next, and if you get a pop-up warning window, click Yes to proceed.
B-8
Table 9-2 Oracle CRS Installation Dialog Actions (Continued)
Specify Cluster Configuration - Enter the name of the cluster (this is actually unimportant). Highlight the line containing the name of any nonstorage node (where shared QFS is not configured) and click Remove. For each remaining node listed, verify that the Virtual Host Name column contains nodename-vip, similar to what you entered in the /etc/hosts file. Click Next. If you get some error concerning a node that cannot be clustered (null), it is probably because you do not have an /orashared/.rhosts file, or a password for the oracle user. You must have one even on the node on which you are running the installer.
Specify Network Interface Usage - Be very careful with this section! To mark an adapter as described here, highlight it and click Edit, then choose the appropriate radio button and click OK. Mark your actual public adapters as public. Mark only the clprivnet0 interface as private. Mark all other adapters, including actual private network adapters, as Do Not Use. Click Next.
Specify Oracle Cluster Registry (OCR) Location - Choose the External Redundancy radio button. Enter /orashared/ocr_file. Click Next.
B-9
Table 9-2 Oracle CRS Installation Dialog Actions (Continued)
Specify Voting Disk Location - Choose the External Redundancy radio button. Enter /orashared/voting_file. Click Next.
Summary
B-10
Table 9-2 Oracle CRS Installation Dialog Actions (Continued)
Execute Configuration Scripts - On all selected nodes, one at a time, starting with the node on which you are running the installer, open a terminal window as user root and run the scripts:
/orashared/oraInventory/orainstRoot.sh
/orashared/product/10.2.0/crs/root.sh
The second script formats the voting device and enables the CRS daemons on each node. Entries are put in /etc/inittab so that the daemons run at boot time. On all but the first node, the messages:
EXISTING Configuration detected...
NO KEYS WERE Written.....
are correct and expected.
Please read the following carefully: On the last node only, the second script tries to configure Oracle's Virtual IPs. There is a known bug in the SPARC version: if your public net addresses are in the known non-Internet-routable ranges (10, 172.16 through 172.31, 192.168), this part fails right here. If you get the "The given interface(s) ... is not public" message, then, before you continue, as root (on that one node with the error):
Set your DISPLAY (for example, DISPLAY=machine:#; export DISPLAY)
Run /orashared/product/10.2.0/crs/bin/vipca
Use Table 9-3 to respond, and when you are done, RETURN HERE.
When you have run to completion on all nodes, including running vipca by hand if you need to, click OK on the Execute Configuration Scripts dialog.
Configuration Assistants - Let them run to completion (nothing to click).
Dialog: End of Installation
Action: Click Exit and confirm.
Table 9-3 VIPCA Dialog Actions

Dialog: Welcome
Action: Click Next.

Dialog: Network Interfaces
Action: Verify that all (or both) of your public network adapters are selected. Click Next.

Dialog: Virtual IPs for Cluster Nodes
Action: For each node, enter the nodename-vip name that you created in your hosts files previously. When you press TAB, the form might automatically fill in the IP addresses and the information for the other nodes. If not, fill in the form manually. Verify that the netmasks are correct. Click Next.

Dialog: Summary
Action: Verify, and click Finish.

Dialog: Configuration Assistant Progress Dialog / Configuration Results
Action: Confirm that the utility runs to 100%, and click OK.
Table 9-4 Install Oracle Database Software Dialog Actions

Dialog: Welcome
Action: Click Next.

Dialog: Select Installation Type
Action: Select the Custom radio button, and click Next.

Dialog: Specify Home Details
Action: Verify, especially that the destination path is /orashared/product/10.2.0/db_1. Click Next.

Dialog: Specify Hardware Cluster Installation Mode
Action: Verify that the Cluster Installation radio button is selected. Put check marks next to all of your selected cluster nodes. Click Next.

Dialog: Available Product Components
Action: Deselect the following components (to speed up the installation):
Enterprise Edition Options
Oracle Enterprise Manager Console DB 10.2.0.1.0
Oracle Programmer 10.2.0.1.0
Oracle XML Development Kit 10.2.0.1.0
Click Next.
Dialog: Product Specific Prerequisite Checks
Action: Most likely, these checks will all succeed, and you will be moved automatically to the next screen without having to touch anything. If you happen to get a warning, click Next, and if you get a pop-up warning window, click Yes to proceed.

Dialog: Privileged Operating System Groups
Action: Verify that dba is listed in both entries, and click Next.

Dialog: Create Database
Action: Verify that Create a Database is selected, and click Next.

Dialog: Summary
Action: Verify, and click Install.

Dialog: Oracle Net Configuration Assistant Welcome
Action: Verify that the Perform Typical Configuration check box is not selected, and click Next.

Dialog: Listener Configuration, Listener Name
Action: Verify that the Listener name is LISTENER, and click Next.

Dialog: Select Protocols
Action: Verify that TCP is among the Selected Protocols, and click Next.

Dialog: TCP/IP Protocol Configuration
Action: Verify that the Use the Standard Port Number of 1521 radio button is selected, and click Next.

Dialog: More Listeners
Action: Verify that the No radio button is selected, and click Next (be patient with this one; it takes a while to move on).

Dialog: Listener Configuration Done
Action: Click Next.

Dialog: Naming Methods Configuration
Action: Verify that the No, I Do Not Want to Configure Additional Naming Methods radio button is selected, and click Next.

Dialog: Done
Action: Click Finish.

Dialog: DBCA Step 1: Database Templates
Action: Select the General Purpose radio button, and click Next.

Dialog: DBCA Step 2: Database Identification
Action: Type sun in the Global Database Name text field (notice that your keystrokes are echoed in the SID Prefix text field), and click Next.

Dialog: DBCA Step 3: Management Options
Action: Verify that Enterprise Manager is not available (you eliminated it when you installed the database software), and click Next.
Dialog: DBCA Step 4: Database Credentials
Action: Verify that the Use the Same Password for All Accounts radio button is selected. Enter cangetin as the password, and click Next.

Dialog: DBCA Step 5: Network Configuration
Action: Verify that the Register this database with all the listeners radio button is selected. Click Next.

Dialog: DBCA Step 6: Storage Options
Action: Verify that the Cluster File System radio button is selected. Click Next.

Dialog: DBCA Step 7: Database File Locations
Action: Select Use Database File Locations from Template. Click Next.

Dialog: DBCA Step 8: Recovery Configuration
Action: Make sure all the boxes are unchecked (uncheck manually if necessary), and click Next.

Dialog: DBCA Step 9: Database Content
Action: Uncheck Sample Schemas, and click Next.

Dialog: DBCA Step 10: Database Services
Action: Click Next.

Dialog: DBCA Step 11: Initialization Parameters
Action: On the Memory tab, verify that the Typical radio button is selected. Change the percentage to a ridiculously small number (1%). Click Next and accept the error telling you the minimum memory required. The percentage will automatically be changed on your form. Click Next.

Dialog: DBCA Step 12: Database Storage
Action: Verify that the database storage locations are correct by clicking leaves in the file tree in the left pane and examining the values shown in the right pane. These locations are files under $ORACLE_BASE (/orashared). Click Next.
Dialog: DBCA Step 13: Creation Options
Action: Verify that the Create Database check box is selected, and click Finish.

Dialog: Summary
Action: Verify, and click OK (the database is built; this can take anywhere from 12 minutes to an hour depending on the age of your hardware). Click Exit. Wait a while (anywhere from a few seconds to a few minutes; be patient) and you will get a pop-up window informing you that Oracle is starting the RAC instances.

Dialog: End of Installation
Action: On each of your selected nodes, open a terminal window as root, and run the script /orashared/product/10.2.0/db_1/root.sh. Accept the default path name for the local bin directory. Click OK in the script prompt window.
Task 8 Create Sun Cluster Resources to Control Oracle RAC Through CRS
Perform the following steps on only one of your selected nodes to create Sun Cluster resources that monitor your RAC storage, and that allow you to use Sun Cluster to control your RAC instances through CRS:

1. Register the resource types required for RAC storage and RAC instance control:

# clrt register crs_framework
# clrt register ScalMountPoint
# clrt register scalable_rac_server_proxy

2. Create a CRS framework resource (the purpose of this resource is to try to cleanly shut down CRS if you are evacuating the node using cluster commands):
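The commands for step 2 fell at a page break and are missing from the source. The following is a plausible sketch only, assuming the group and resource names (rac-framework-rg, rac-framework-res) that later steps reference in their RG_affinities and Resource_dependencies settings:

```shell
# Hypothetical reconstruction -- names are inferred from later steps,
# not taken from the original text.
clrg create -n node1,node2 \
    -p Desired_primaries=2 \
    -p Maximum_primaries=2 \
    rac-framework-rg

clrs create -g rac-framework-rg -t crs_framework \
    rac-framework-res

clrg online -M rac-framework-rg
```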
3. Create a group to hold the resource to monitor the shared storage:
# clrg create -n node1,node2 \
-p Desired_primaries=2 \
-p Maximum_primaries=2 \
-p RG_affinities=++rac-framework-rg \
rac-storage-rg

4. Create the resource to monitor the shared file system:

# clrs create -g rac-storage-rg -t ScalMountPoint \
-p filesystemtype=s-qfs \
-p mountpointdir=/orashared \
-p targetfilesystem=qfsorashared \
-p Resource_dependencies=qfsmeta-res \
rac-storage-res

5. Bring the resource group that monitors the shared storage online.

6. Create a group and a resource to allow you to run cluster commands that control the database through CRS.
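Steps 5 and 6 give no explicit commands. Assuming the same clrg syntax used elsewhere in this exercise, they might look like the following sketch (the rac-control-rg creation is an assumption; the cr_rac_control script in the next step uses that group without showing how it was created):

```shell
# Step 5: bring the storage-monitoring group online on all nodes.
clrg online -M rac-storage-rg

# Step 6 (hypothetical): create the control group that the
# cr_rac_control script assumes already exists.
clrg create -n node1,node2 \
    -p Desired_primaries=2 \
    -p Maximum_primaries=2 \
    -p RG_affinities=++rac-framework-rg \
    rac-control-rg
```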
Note: For the following command, make sure you understand which node has node ID 1, which has node ID 2, and so forth, so that you match them correctly with the names of the database sub-instances. You can run clinfo -n on each node to verify its node ID.
# vi /var/tmp/cr_rac_control clrs create -g rac-control-rg -t scalable_rac_server_proxy \ -p DB_NAME=sun \ -p ORACLE_SID{name_of_node1}=sun1 \ -p ORACLE_SID{name_of_node2}=sun2 \ -p ORACLE_home=/orashared/product/10.2.0/db_1 \ -p Crs_home=/orashared/product/10.2.0/crs \ -p Resource_dependencies_offline_restart=rac-storage-res{local_node} \ -p Resource_dependencies=rac-framework-res \ rac-control-res # sh /var/tmp/cr_rac_control # clrg online -M rac-control-rg
7. On either node, reenable the instance through the Sun Cluster resource:

# clrs enable -n node2 rac-control-res

8. On that (affected) node, you should be able to repeat step 5. It might take a few attempts before the database is initialized and you can successfully access your data.

9. Cause a crash of node 1:

# <Control-]>
telnet> send break
On the surviving node, you should see (after 45 seconds or so) the CRS-controlled virtual IP for the crashed node migrate to the surviving node. Run the following as user oracle:

$ crs_stat -t | grep vip
ora.node1.vip   application   ONLINE   ONLINE   node2
ora.node2.vip   application   ONLINE   ONLINE   node2
10. While this virtual IP has failed over, verify that there is actually no failover listener controlled by Oracle CRS. This virtual IP fails over merely so that a client quickly gets a TCP disconnect without having to wait for a long time-out. Client software then has a client-side option to fail over to the other instance.

$ sqlplus SYS@sun1 as sysdba
SQL*Plus: Release 10.2.0.1.0 - Production on Tue May 24 10:56:18 2005
Copyright (c) 1982, 2005, ORACLE. All rights reserved.
Enter password:
ERROR:
ORA-12541: TNS:no listener
Enter user-name: ^D

11. Boot the node that you had halted by typing boot or go at the ok prompt on the console. If you choose the latter, the node will panic and reboot.

12. After the node boots, monitor the automatic recovery of the virtual IP, the listener, and the database instance by typing the following as user oracle on the surviving node:

$ crs_stat -t
$ /usr/cluster/bin/clrs status

It can take several minutes for the full recovery. Keep repeating the steps.

13. Verify the proper operation of the Oracle database by contacting the various sub-instances as the user oracle on the various nodes:

$ sqlplus SYS@sun1 as sysdba
Enter password: cangetin
SQL> select * from mytable;
SQL> quit
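The recovery checks in step 12 must be rerun until everything is back online. A small convenience loop, not part of the original lab and assuming a standard Solaris shell, reruns both status commands every 30 seconds:

```shell
# Repeat the recovery checks until you interrupt with Ctrl-C.
while true; do
    crs_stat -t
    /usr/cluster/bin/clrs status
    sleep 30
done
```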
Exercise Summary
Discussion

Take a few minutes to discuss what experiences, issues, or discoveries you had during the lab exercises.
Manage the discussion based on the time allowed for this module, which was provided in the About This Course module. If you do not have time to spend on discussion, then just highlight the key concepts students should have learned from the lab exercise.
Experiences
Ask students what their overall experiences with this exercise have been. Go over any trouble spots or especially confusing areas at this time.
Interpretations
Ask students to interpret what they observed during any aspect of this exercise.
Conclusions
Have students articulate any conclusions they reached as a result of this exercise experience.
Applications
Explore with students how they might apply what they learned in this exercise to situations at their workplace.