
Tape Device Checklist for SAN Environments

I/O errors such as [90:190] Invalid format version of Data Protector medium, [90:51] Cannot write to device, and [90:54] Cannot open device are, in most cases, not caused by a hardware or Data Protector (DP) problem. They are caused by agents external to Data Protector that force the device to perform a rewind while a backup is writing to the medium. Because Data Protector is not aware of the rewind, it keeps writing data afterwards and overwrites the tape header. The tape is then no longer recognized and is reported with BLANK status. The medium that was registered in the pool, and that no longer exists because its header was lost, remains in the pool with POOR status.

In support experience, the agents that most commonly cause this behavior in a SAN environment are:

- device monitoring tools;
- SAN monitoring tools;
- other backup applications active on the SAN (ARCserve, for example);
- an incorrectly applied DP tape device locking policy;
- SAN resets, faulty switches, and SAN maintenance;
- reboots of Linux systems;
- outdated Fibre Channel and SCSI drivers.

This kind of interference can be clearly identified by analyzing the report from the NSR(s) and locating the string FCP_CDB 00000000 in the trace section:

    ...
    19505ms 289us Vx Date 04/24/08 Time 16:01:15
    0ms 1us FrmHdr 06040b00 00050c00 08290008 00000000 0136ffff Port 0
    0ms 2us FCP_LUN 00000000 00000000 FCP_CNTRL 00000000 FCP_DL 00000000
    0ms 0us FCP_CDB 00000000 00000001 44000000 00000000 IOCB 80F2B5DC
    0ms 20us RMI_getPortIdforRoute: routeIndex: x3, Route_Port: x0
    0ms 2us fcpTrns_cleanupPersistentCommand: Leaving
    ....
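A minimal Python sketch of the search described above, assuming the NSR report has been saved as plain text. The function name find_suspect_cdbs and the file name in the usage note are illustrative and not part of any HP tool; only the marker string FCP_CDB 00000000 comes from this checklist.

```python
import re

# Marker named in the checklist: an FCP_CDB entry whose first word is 00000000.
# Any further 8-digit hex words that follow are captured with it for context.
MARKER = re.compile(r"FCP_CDB[ \t]+00000000\b(?:[ \t]+[0-9a-fA-F]{8}\b)*")


def find_suspect_cdbs(trace_text):
    """Return every FCP_CDB entry in the trace text that starts with 00000000."""
    return MARKER.findall(trace_text)
```

For example, `find_suspect_cdbs(open("nsr_report.txt").read())` would list the matching entries of a saved report; an empty list means the marker was not found in that file.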

As reinforcement, here is an excerpt (in translation) from the Enterprise Backup Solution Design Guide, a document that should be used when configuring a backup environment (http://h18000.www1.hp.com/products/storageworks/ebs/):

Rogue applications
Rogue applications are a category of software products, frequently found in SAN environments, that can interfere with the normal operation of backup and restore. They include management agents, monitoring software, and a wide range of tape drive and system configuration utilities. A list of known rogue applications and the operating systems they run on is shown below. The list is not meant to cover every application; it is only a sample of the most common ones.

- Windows (all versions)
    - SAN Surfer (HBA configuration utility)
    - HBAnywhere/lputilnt (HBA configuration utilities)
    - HP System Insight Manager (management agents)
    - Removable Storage Manager
    - HP Library & Tape Tools (tape utilities)
- Linux (all versions)
    - SAN Surfer
    - HP Library & Tape Tools
    - mt commands (native to OS)
- Unix
    - mt commands (native to OS)
    - diagnostics
- Solaris
    - SUN Explorer (system configuration utility)

These applications, utilities, and commands are known to interfere with the components in the data path. When run concurrently with backup or restore operations, they can interrupt jobs, corrupt data, and raise false hardware alarms. For example, HBA utilities such as SAN Surfer and HBAnywhere can reset Fibre Channel ports; utilities such as HP Library and Tape Tools allow full device tests, device resets, and firmware upgrades; management agents and utilities such as HP Systems Insight Manager and SUN Explorer poll tape devices and can cause interruptions and/or contention when accessing them.

Recommendations

Implement a restrictive access policy for the library devices, so that new hosts do not inadvertently gain access to the library drives. Manual discovery and the assignment of an initial null map to each new host on the router serve this purpose.

The following actions must be carried out on all servers that have shared access, via SAN, to tape devices (drives and libraries):

- keep the Fibre Channel and SCSI/SCSI-tape drivers up to date on the operating systems;
- keep strict change control over the SAN hosts.

IMPORTANT:
Installing software or hardware updates (Windows Service Pack, ProLiant Support Pack) may undo the changes suggested here and bring back a condition in which the interference is present in the environment again.

REVIEW THIS CHECKLIST AFTER ANY SERVER UPDATE.

Windows Environment

Disable the following services:

RSM (Removable Storage Manager)


Start > Run > dcomcnfg. In the MMC, go to Component Services > Computers > My Computer > DCOM Config > Removable Storage Manager, right-click it, and then click Properties.

On the Location tab, check that the option "Run application on this computer" is disabled.

Also disable the service under Services in Control Panel > Administrative Tools.

TUR (Test Unit Ready)


Take the actions recommended in http://support.microsoft.com/default.aspx?scid=kb;en-us;842411 or http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&objectID=c00718488&jumpid=reg_R1002_USEN

Manually edit the system registry using RegEdit. Logged in to the system as a user with administrative privileges, run RegEdit and navigate to the following registry key:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\hplto

To disable RSM polling, edit the AutoRun value found in this key. A value of 0 (zero) indicates that polling is disabled; a value of 1 indicates that polling is enabled. If it does not exist, create it:

    Value: AutoRun
    Type:  REG_DWORD
    Data:  0 (polling disabled)

After completing these steps, reboot the affected system.


IMPORTANT: Adding or removing tape drives from the system may cause an older driver .inf file to be re-read, which in turn can re-enable RSM polling. If drives are added or removed, check the registry for the proper configuration and, if necessary, repeat the registry edit described above.
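Because the setting can silently revert, it may help to re-check it from an export of the key. The following is a minimal sketch, assuming the hplto key has been exported with RegEdit to a .reg text file; the helper names read_dwords and rsm_polling_disabled are illustrative only.

```python
import re

# One REG_DWORD line of a .reg export, for example: "AutoRun"=dword:00000000
DWORD_LINE = re.compile(r'^"([^"]+)"=dword:([0-9a-fA-F]{1,8})$')


def read_dwords(reg_text):
    """Collect the REG_DWORD values found in .reg export text as {name: int}."""
    values = {}
    for line in reg_text.splitlines():
        match = DWORD_LINE.match(line.strip())
        if match:
            values[match.group(1)] = int(match.group(2), 16)
    return values


def rsm_polling_disabled(reg_text):
    """True only when AutoRun is present in the export and set to 0."""
    return read_dwords(reg_text).get("AutoRun") == 0
```

RegEdit exports are normally UTF-16 text, so the file would typically be read with `open(path, encoding="utf-16")` before being passed to these functions. A missing AutoRun value is reported as not disabled, which matches the instruction above to create the value when it does not exist.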

HP Management Agents - Storage Agents


Start > Control Panel > HP Management Agents.

The title bar shows the version of the HP Management Agents installed.

On the Process Monitor tab, locate cqmgstor and click Stop. Click OK.

Start > Run > services.msc

HP Management Agents - Fibre Agent Tape Support


Start > Control Panel > HP Management Agents.

The title bar shows the version of the HP Management Agents installed.

Once the agent version has been identified, follow the instructions below to disable tape device polling.


Agents Version 7.30 and Later

Click the Storage tab and select the Disable Fibre Agent Tape Support checkbox.

Agents Version 7.20

Click the Storage tab and select the Disable Fibre Agent Tape Support checkbox.


Agents Versions 7.10 and 7.00

To disable Fibre Array Tape Support, apply SoftPaq SP25792, available at ftp://ftp.compaq.com/pub/softpaq/sp25501-26000/SP25792.EXE. The documentation for this SoftPaq is available at ftp://ftp.compaq.com/pub/softpaq/sp25501-26000/sp25792.txt

To confirm, check the registry:

    [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\CqMgStor\CPQFCA]
    "DisplayName"="Fibre Array Information"
    "Timeout"=dword:0001d4c0
    "DisableFlags"=dword:00000001


Fibre Channel HBAs

Emulex: If you are using an Emulex HBA, the Emulex HBA driver must be updated and ResetTPRLO must be set to 2, as per HP guidelines. This can be done directly in the system registry or with the Lputil utility, which is supplied with the Emulex device drivers. For servers with Emulex adapters using Storport:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\elxstor\Parameters\Device
    "DriverParameters"="NodeTimeOut=10;LinkTimeOut=40;QueueTarget=1;EmulexOption=0;ResetTPRLO=2;"

QLogic: If you are using a QLogic HBA, from the "Configuration Settings" menu in FastUTIL, select "Advanced Adapter Settings" and set "Enable Target Reset" to No (the default is Yes). If installed, the SANSurfer CLI or the SANSurfer GUI can be used instead; check the HBA manual for details. For QLogic FC adapters, go into the registry and change these parameters:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ql2300\Parameters\Device
    "DriverParameters"="UseSameNN=0;"

to

    "DriverParameters"="UseSameNN=0;rstbus=2;tapereset=0"

!! Some configurations (such as clusters) need SCSI reset to be enabled in order to work correctly. Check with the customer whether these settings can be made in their configuration.
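The DriverParameters strings above are semicolon-separated Name=Value pairs, so a copied value can be checked mechanically. This is a minimal sketch under that assumption; the function names are illustrative, and comparing parameter names without regard to case is a convenience chosen here, not documented driver behavior.

```python
# Settings this checklist asks for; other parameters in the string are ignored.
EMULEX_REQUIRED = {"ResetTPRLO": "2"}
QLOGIC_REQUIRED = {"rstbus": "2", "tapereset": "0"}


def parse_driver_parameters(value):
    """Split 'Name=Value;Name=Value;' into a dict with lower-cased names."""
    params = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece left by a trailing semicolon
        name, _, setting = item.partition("=")
        params[name.strip().lower()] = setting.strip()
    return params


def wrong_settings(value, required):
    """Return {name: current value or None} for each required setting that differs."""
    current = parse_driver_parameters(value)
    return {
        name: current.get(name.lower())
        for name, wanted in required.items()
        if current.get(name.lower()) != wanted
    }
```

An empty result from `wrong_settings(value, EMULEX_REQUIRED)` or `wrong_settings(value, QLOGIC_REQUIRED)` means the string already carries the values listed in this section; a value of None in the result means the parameter is absent.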


Other backup tools


Check for other installed backup tools, such as NTBackup or ARCserve, and disable them, along with any other service that monitors tape devices.


Linux Environment:

Disable the HP Linux Storage Agent service on ProLiant systems (http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c00715023):

    /opt/compaq/storage/etc/cmascsid stop
    vi /opt/compaq/storage/etc/cmascsid

Comment out the line:

    . $CMAINCLUDE

SUSE: http://support.novell.com/techcenter/psdb/f3f70d4088fdc8473c2b7d44afa82b30.html

Disable the rewind devices on Red Hat:

1. Rename the rewind devices. Edit /etc/udev/rules.d/50-udev.rules and add:

       KERNEL="st[0-9]*", BUS="scsi", NAME="xst%n"

2. Change the default parameters of the rewind devices. Edit /etc/udev/permissions.d/50-udev.permissions and change the st permissions line to:

       xst*:root:disk:0000

3. Reboot the server.
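After the reboot, one way to confirm that no auto-rewind node is left under its usual name is to inspect the names in /dev. The sketch below assumes the standard Linux st driver naming, where stN (optionally followed by l, m or a) rewinds on close and nstN does not; the function name rewind_devices is illustrative.

```python
import re

# Auto-rewind tape nodes: st0, st1, ... with an optional mode suffix (l, m, a).
# Non-rewind nodes start with "n" (nst0) and the renamed ones with "x" (xst0).
REWIND_NODE = re.compile(r"^st\d+[lma]?$")


def rewind_devices(names):
    """Return the device names that look like auto-rewind tape nodes."""
    return [name for name in names if REWIND_NODE.match(name)]
```

Calling `rewind_devices(os.listdir("/dev"))` (with `import os`) would be expected to return an empty list once the renaming rule above is in effect; any name it returns is a node that other tools could still open with rewind-on-close.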


HP-UX Environment (pre-11.31):

Check the kernel parameters:

    st_san_safe = 1
    st_ats_enable = 0


- st_san_safe

  This feature prevents another HP-UX host from opening a tape device with the automatic rewind-on-close functionality. It does not prevent another host from doing anything else with the tape drive; one could, for example, issue a different tape positioning command with mt. A tape drive is a non-tagged-queuing SCSI device, which means that only one command to a tape drive can be outstanding at a time. That limits the possibility that a host that is not using the drive for a backup will interfere with the ongoing backup, but it cannot guarantee that no other host interferes.

- st_ats_enabled

  This parameter enables a feature offered by the SCSI command set: a reserve and release mechanism. A host can reserve a drive so that only this host can access it. Any other host that tries to access the drive receives a check condition "reservation conflict" and cannot do anything with the drive. The host that holds the reservation must explicitly release the drive when it no longer uses it. The biggest problem arises when the host does not release the drive: no other host can then access the drive, nor can any of them break the reservation. The only way out is to reset the tape drive or to have the original host release it. This kernel tunable makes the (s)tape driver reserve the drive when the tape device is opened and release it automatically when the device is closed.

It is important to understand that only st_ats_enabled can ensure that no other host accesses a tape drive while another one is using it. This is a type of mandatory locking. Because of the difficulty of breaking the reservation when a host has "forgotten" to release it, Omniback and other backup solutions no longer use this mechanism. They favor the first mechanism, which only prevents other hosts from rewinding the tape. It is not a real locking mechanism, however, and cannot prevent others from doing the wrong thing with the tape drive. Omniback and other backup solutions try to coordinate tape access through cell servers, but they cannot prevent a system administrator from accidentally accessing a tape device file with mt or another command.


Check that EMS is at version A.29.00 0112 December 2001 or later, which should already contain the configuration below:

Set the POLL_INTERVAL value in the file /var/stm/config/tools/monitor/dm_stape.cfg to zero to stop the monitor from polling, and uncomment it (remove the leading #). The dm_stape.cfg config file is reread within 60 minutes if polling was disabled, otherwise within one current polling cycle (no reboot is necessary).

IMPORTANT NOTE: The diaglogd process must be running when you set the POLL_INTERVAL value to zero. Otherwise, the monitor will fill the api.log file with error messages (until the hard disk space is used up) and consume most of the CPU time. Under no circumstances should diaglogd or the STM diagnostics be shut down!

HP-UX Environment (B.11.31)

Install the latest HP-UX 11.31 estape cumulative patch (currently PHKL_39593) and its dependencies.

On previous versions of HP-UX, the following command had to be issued:

    # kctune st_san_safe=1

With HP-UX B.11.31, the command needed to get the same functionality is:

    # scsimgr set_attr -d estape -a norewind_close_disabled=1

To preserve the change across reboots, also run:

    # scsimgr save_attr -d estape -a norewind_close_disabled=1

To confirm the settings, run:

    # scsimgr -d estape get_attr

    DRIVER estape GLOBAL ATTRIBUTES:

    name = version
    current = 0.1
    default =
    saved =

    name = norewind_close_disabled
    current = 1      <--- Here is the set_attr change
    default = 0
    saved = 1        <--- The save_attr will save the setting across reboots

    name = st_ats_enable
    current = 0
    default = 0
    saved =

For more information, read the scsimgr(1M) and scsimgr_estape(7) man pages or refer to the I/O subsystem section of the release notes for HP-UX 11i v3.
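The confirmation step above can be automated by parsing the command output. This is a minimal sketch, assuming the output was captured to text and has one "key = value" pair per line as in the listing in this section; the function names are illustrative and not part of scsimgr.

```python
def parse_attrs(output):
    """Parse 'name = ...' blocks into {attribute: {'current', 'default', 'saved'}}."""
    attrs = {}
    current_name = None
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # header or blank line without a key/value pair
        key, value = key.strip(), value.strip()
        if key == "name":
            current_name = value
            attrs[current_name] = {}
        elif current_name is not None and key in ("current", "default", "saved"):
            attrs[current_name][key] = value
    return attrs


def norewind_close_setting_ok(output):
    """True when norewind_close_disabled is 1 both as current and as saved value."""
    attr = parse_attrs(output).get("norewind_close_disabled", {})
    return attr.get("current") == "1" and attr.get("saved") == "1"
```

Checking both fields matters because a value set with set_attr but never saved with save_attr would show current = 1 with an empty saved field, and would be lost at the next reboot.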


STM Info Tool

Root cause: The info tool sends TUR (Test Unit Ready) commands, which cause running backups to abort and tapes to rewind if they are using rewind device files. On HP-UX systems up to 11.23, the cstm info tool uses inquiries only, but on HP-UX 11.31 the cstm info tool sends a TUR in addition to the inquiry.

Solution: Install the Online Diagnostics September 2009 bundle, in which the tape driver for the online diagnostics is fixed.

WORKAROUND: As long as Online Diagnostics September 2009 is not available, install the binaries (PA or IA) as follows:

    -rw-r--r-- 1 root sys 159364 Mar 9 14:29 tlscsidev.sl_IA
    -rw-r--r-- 1 root sys  61440 Mar 9 14:28 tlscsidev.sl_PA

Procedure to use these binaries:

1) The binary to be replaced on the target system is /usr/sbin/stm/uut/lib/tlscsidev.sl
2) Take a backup of the existing binary at location /usr/sbin/stm/uut/lib using the command: mv tlscsidev.sl tlscsidev.sl_backup
3) Replace it with the corresponding binary (tlscsidev.sl_IA or tlscsidev.sl_PA) on the target system.
4) Change the permission of the binary file: chmod 555 tlscsidev.sl
5) Change the owner: chown bin:bin tlscsidev.sl
6) Issue the info command for tape.


In the DP Environment:
After a SAN environment has been configured, there may be multiple logical drives representing the same physical drive. Data Protector uses a locking mechanism that prevents a backup using a given logical drive from being disturbed by another backup using a logical device that maps to the same physical drive. This mechanism is called Lock Name and consists of using the same lock name for all logical drives that map to the same physical drive. The automatic device configuration in DP is the recommended way to avoid configuration errors, because it creates the lock names and adjusts the drive index/SCSI paths automatically.
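The Lock Name rule can be checked against an inventory of the configured devices. This is a minimal sketch, assuming such an inventory is available as (logical device, physical drive identifier, lock name) tuples, for instance with the drive serial number as identifier; the tuple layout and the function name are hypothetical and do not correspond to a Data Protector export format.

```python
from collections import defaultdict


def inconsistent_lock_names(devices):
    """Return {physical drive: sorted lock names} where more than one name is in use.

    devices is an iterable of (logical_device, physical_drive, lock_name) tuples.
    """
    names_by_drive = defaultdict(set)
    for _logical, physical, lock_name in devices:
        names_by_drive[physical].add(lock_name)
    return {
        physical: sorted(names)
        for physical, names in names_by_drive.items()
        if len(names) > 1
    }
```

Any physical drive reported here has logical drives that would not block each other, which is the misconfiguration this section warns about. An empty lock name counts as a distinct name, so a logical drive configured without one is flagged as well.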

Scoping the problem


To check for header corruption, the DP backup session messages can be searched to find the tape drives on which the problem has already occurred:

    cd /var/opt/omni/server/db40/msg/2007/06
    for i in `grep -l 90:190 *`
    do
        echo $i
        grep -e BMA -e 90:190 $i
    done

Once the drives on which the problem occurs are known, the systems that access them through the NSR can be identified, so that only those involved in the problem are investigated.
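For cell servers where a script is preferred over the shell loop, the same search can be sketched in Python. The logic mirrors the loop above (files containing 90:190, then the lines mentioning BMA or 90:190); the function names are illustrative, and the directory is the one from the example above.

```python
from pathlib import Path


def header_error_lines(text):
    """Return the BMA and 90:190 lines of a session file, or [] if 90:190 is absent."""
    if "90:190" not in text:
        return []
    return [line for line in text.splitlines() if "BMA" in line or "90:190" in line]


def scan_sessions(directory):
    """Map each session file name in the directory to its matching lines."""
    report = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            lines = header_error_lines(path.read_text(errors="replace"))
            if lines:
                report[path.name] = lines
    return report
```

For example, `scan_sessions("/var/opt/omni/server/db40/msg/2007/06")` would return a dictionary keyed by session file, from which the affected drives can be read off the BMA lines.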


LOCK CONFIGURATION
