Veritas Cluster
Veritas Cluster enables one system to fail over to the other: all related software processes are moved from one system to the other with minimal downtime. Veritas Cluster does NOT have both boxes up at once servicing requests; it only offers a hot standby system. This lets the service keep running (with a short transfer period) if a machine fails or system maintenance needs to be done.
Installation outline:

    vxdiskadm                                    [initialize drives in the array]
    vxassist -g oradg -maxsize drive1 drive2     [set up the array drives]
    mount the disks and add them to /etc/vfstab
    DBAs install oracle on ONE machine; update /etc/system on BOTH machines; add the table for vcs

On BOTH machines:
    /etc/init.d/volmgt start
    insert the cluster server CD
    cd /cdrom/cdrom0
    pkgadd -d .                  [add packages 3,2,5,1,4,6; yes to everything]
    eject cdrom; insert the oracle cluster agent CD
    cd /cdrom/cdrom0
    pkgadd -d .
    eject cdrom
    cd /opt/VRTSllt
    cp llttab /etc
    cd /etc
    vi /etc/llttab               [uncomment/change the following:]
        set-node 0               [on one machine set to 1]
        set-cluster 0
        link hme0 and qfe0
        link-lowpri qfe1
        start                    [at the bottom]
    vi /etc/gabtab               [uncomment gabconfig -c -n 2]
    cd /etc/rc2.d and start llt and gab on both machines
    /sbin/lltconfig -a list      [check for all 3 interfaces]
    gabconfig -a                 [check for membership]
    add /sbin and /opt/VRTSvcs/bin to the PATH in /.profile

On ONE machine:
    mkdir /etc/VRTSvcs/conf/config
    cd /etc/VRTSvcs/conf
    cp *.cf config
    cd sample-oracle
    cp main.cf ../config         [may need to copy other files - check]
    cd ../config
    vi main.cf
        update systema and systemb
        update SystemList and AutoStartList
        add diskgroups, IP, qfe1, nic-qfe1
        add mountpoints
        update oracle info
        build dependencies (listener, oracle, mount, volumes, diskgroup, vip, nic)
    hacf -verify .
    manually stop the listener           [lsnrctl stop]
    manually stop the db                 [svrmgrl; connect internal; shutdown immediate]
    take the mountpoints out of /etc/vfstab
    hastart                              [start the cluster]
    hagrp -switch oragrp -to systemb     [test switchover]
    run veritas testing on both machines.
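For reference, here is a sketch of what the two heartbeat files might look like after editing, assuming hme0 and qfe0 are the private heartbeat links and qfe1 is the low-priority (public) link. The device syntax is an assumption based on typical Solaris installs -- compare against the sample llttab shipped in /opt/VRTSllt before copying anything.

    # /etc/llttab (sketch -- this is node 0; use set-node 1 on the other machine)
    set-node 0
    set-cluster 0
    link hme0 /dev/hme:0 - ether - -
    link qfe0 /dev/qfe:0 - ether - -
    link-lowpri qfe1 /dev/qfe:1 - ether - -
    start

    # /etc/gabtab (sketch -- two-node cluster)
    /sbin/gabconfig -c -n2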
Cluster Startup
Here is what the cluster does at startup: the node checks whether the other node is already started; if so, it stays OFFLINE. If no other machine is running, it checks communication (gabconfig). This may need system admin intervention if the cluster requires both nodes to be available (/sbin/gabconfig -c -x). Once communication between the machines is open -- or gabconfig has been started -- it sets up the network (NIC and IP address) and starts the cluster server. It also brings up volume manager, the file system, and then oracle. If any of the critical processes fail, the whole system is faulted. The most common reason for failing is expired licenses, so check licenses before doing work with vxlicense -p.
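A quick pre-flight sketch that strings the startup checks above together (all commands appear elsewhere in these notes; run as root on each node):

    vxlicense -p              # licenses must not be expired
    /sbin/lltconfig -a list   # all heartbeat links should be listed
    /sbin/gabconfig -a        # GAB should show port membership for both nodes
    hastatus -summary         # overall cluster and group state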
Log files: /var/VRTSvcs/log -- see the Reviewing Log Files section below for a description of each log. Look at the most recent ones for debugging purposes (ls -ltr).

Conf files:
    LLT conf:       /etc/llttab      [should NOT need to access this]
    Network conf:   /etc/gabtab      [if it has /sbin/gabconfig -c -n2, you will need to run /sbin/gabconfig -c -x when only one system comes up after both systems were down]
    Cluster conf:   /etc/VRTSvcs/conf/config/main.cf    [has the exact details of what the cluster contains]

Most executables are in /opt/VRTSvcs/bin or /sbin.
Changing Configurations
ALWAYS be very careful when changing the cluster configuration. The only time I needed to change the cluster configuration was when Vipul upgraded Oracle versions and ORACLE_HOME changed directories. This is a very dangerous thing to do. There are two ways of changing the configuration.

If the cluster is up (running on at least one node, preferably on both):
    haconf -makerw
    run the needed commands (ie. hasys ...)
    haconf -dump -makero

If both systems are down:
    hastop -all                          [shouldn't need this as the cluster is down]
    cd /etc/VRTSvcs/conf/config
    cp main.cf main.cf.save
    vi main.cf
    hacf -verify /etc/VRTSvcs/conf/config
    hacf -generate /etc/VRTSvcs/conf/config
    hastart
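As a hedged example of the online method, here is roughly what an ORACLE_HOME change might look like. The resource name (ora-db) and the new path are made up for illustration, and I believe the VCS Oracle agent calls this attribute Home -- verify both against your main.cf before running anything.

    haconf -makerw                                         # open the config for writes
    hares -modify ora-db Home /apps/oracle/product/8.1.6   # point the Oracle resource at the new ORACLE_HOME
    hares -display ora-db                                  # confirm the change took
    haconf -dump -makero                                   # write it to main.cf and close the config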
Veritas Volume Manager

Veritas Volume Manager divides disks into disk groups and partitions these groups as desired. There is a nice GUI which helps a lot; you can even pull up a command window to see what the GUI is running. The newest version of the GUI is vmsa.

General commands:
    Veritas Volume Manager license info:                    /usr/sbin/vxlicense -p
    List volume groups:                                     vxdg list
    Import a volume group (see cluster debugging details):  vxdg import oradg
    Specific volume group info:                             vxprint -ht
    What is veritas doing (if another command is hanging):  vxtask list
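A short inspection sketch tying those commands together, assuming the oracle disk group is named oradg (it may be xxoradg at some sites, as noted later); vxprint -ht alone shows everything, -g limits it to one disk group:

    /usr/sbin/vxlicense -p    # check for expired licenses first
    vxdg list                 # which disk groups this node currently has imported
    vxprint -ht -g oradg      # volumes, plexes and subdisks in the oracle disk group
    vxtask list               # any in-flight VxVM operations (useful when a command hangs)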
Veritas Cluster Server

Veritas cluster server is a high availability server: processes switch between servers when a server fails. All database processes are run through this server, and as such, it needs to run smoothly. Note that the oracle process should only actually be running on the server which is active. On monitoring tools, the procs light for whichever box is secondary should be yellow, because oracle is not running there; yet the cluster is running on both systems.

Cluster Not Up -- HELP

The normal debugging steps are: check status, restart if not faulted, check licenses, clear faults if needed, and check the logs.

To find out the current status:
    /opt/VRTSvcs/bin/hastatus -summary    [general status of each machine and its processes]
    /opt/VRTSvcs/bin/hares -display       [much more detail - down to the resource level]

If hastatus fails on both machines (it reports that the cluster is not up, or returns nothing), try to start the cluster:
    /opt/VRTSvcs/bin/hastart
    /opt/VRTSvcs/bin/hastatus -summary    [tells you if processes started properly]
Note that hastart will NOT start processes on a FAULTED system.

Starting a Single System that is NOT Faulted

If the system is NOT FAULTED and only one system is up, the cluster probably needs to have gabconfig manually started. Do this by running:
    /sbin/gabconfig -c -x
    /opt/VRTSvcs/bin/hastart
    /opt/VRTSvcs/bin/hastatus -summary
If the system is faulted, check licenses and clear the faults as described next.

To check licenses:
    vxlicense -p
Make sure all licenses are current and NOT expired. If they are expired, that is your problem: call VERITAS to get temporary licenses.

There is a BUG with veritas licenses: veritas will not run if there are ANY expired licenses, even if you have the valid ones you need. To get veritas to run, you will need to MOVE the expired licenses out of the way. [Note: you will minimally need the VxFS, VxVM and RAID licenses to NOT be expired, from what I understand.]
    vxlicense -p             [note the NUMBER after each expired license, ie: Feature name: DATABASE_EDITION [100]]
    cd /etc/vx/elm
    mkdir old
    mv lic.number old        [do this for all expired licenses]
    vxlicense -p             [make sure there are no expired licenses AND your good licenses are still there]
    hastart
If it still fails, call veritas for temp licenses. Otherwise, be certain to do the same on your second machine.

To clear FAULTS:
    hares -display
    For each resource that is faulted, run:
        hares -clear resource-name -sys faulted-system
    If all of these clear, run hastatus -summary and make sure everything is clear. If some don't clear, you MAY be able to clear them at the group level. Only do this as a last resort:
        hagrp -disableresources groupname
        hagrp -flush group -sys sysname
        hagrp -enableresources groupname
    To get a group to go online:
        hagrp -online group -sys desired-system
If it did NOT clear, did you check licenses?
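A sketch of the license shuffle, assuming vxlicense -p showed an expired feature numbered 100 -- the number, and therefore the lic.100 filename, is only an example; use whatever numbers your own output shows:

    vxlicense -p              # note the [number] after each EXPIRED feature
    cd /etc/vx/elm
    mkdir old
    mv lic.100 old            # repeat for every expired license number
    vxlicense -p              # verify: no expired licenses, good ones still present
    hastart                   # then repeat the whole procedure on the second machine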
The system has the following EXACT status:

gedb002# hastatus -summary

-- SYSTEM STATE
-- System               State                Frozen

A  gedb001              RUNNING              0
A  gedb002              RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  oragrp          gedb001              Y          N               OFFLINE
B  oragrp          gedb002              Y          N               OFFLINE

gedb002# hares -display | grep ONLINE
nic-qfe3     State                 gedb001              ONLINE
nic-qfe3     State                 gedb002              ONLINE

gedb002# vxdg list
NAME         STATE           ID
rootdg       enabled         957265489.1025.gedb002

gedb001# vxdg list
NAME         STATE           ID
rootdg       enabled         957266358.1025.gedb001
Recovery Commands:
    hastop -all
    On one machine:                                   hastart
    Wait a few minutes, then on the other machine:    hastart

Reviewing Log Files

If you are still having troubles, look at the logs in /var/VRTSvcs/log. Look at the most recent ones for debugging purposes (ls -ltr). Here is a short description of the logs in /var/VRTSvcs/log:
    hashadow-log_A:   hashadow checks whether the ha cluster daemon (had) is up and restarts it if needed; this is the log of that process
    engine.log_A:     primary log, usually what you will be reading for debugging
    Oracle_A:         oracle process log (related to cluster only)
    Sqlnet_A:         sqlnet process log (related to cluster only)
    IP_A:             related to the shared IP
    Volume_A:         related to Volume Manager
    Mount_A:          related to mounting the actual filesystems
    DiskGroup_A:      related to Volume Manager/Cluster Server
    NIC_A:            related to the actual network device
By looking at the most recent logs, you can tell what failed last (or most recently). You can also tell what did NOT run, which may be just as much of a clue. Of course, if none of this helps, open a call with veritas tech support.

Calling Tech Support

If you have tried the previously described debugging methods, call Veritas tech support: 800-634-4747. Your company needs to have a Veritas support contract.

Restarting Services

If a system is gracefully shut down while it was running oracle or other high availability services, it will NOT transfer them; it only transfers services when the system crashes or has an error.
    hastart
    hastatus -summary    [tells you if processes started properly]
hastart will NOT start processes on a FAULTED system. If the system is faulted, clear the faults as described above.

Doing Maintenance on DBs

BEFORE working on the DB, run:
    hastop -all -force
AFTER working on the DBs, you MUST bring up oracle on the same machine. Once oracle is up, run:
    hastart        [on the same machine you started the work on - the system with oracle running]
    wait 3-5 minutes
    hastart        [on the other system]
If you need the instance to run on the other system, you can run:
    hagrp -switch oragrp -to othersystem
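A sketch of that DB-maintenance sequence end to end; the 3-5 minute wait is shown as a sleep purely for illustration, and dbhost1/dbhost2 stand in for whichever node the DBAs actually worked on:

    # on dbhost1, before the DBAs start:
    hastop -all -force            # stop cluster monitoring but leave services running

    # ... DBA maintenance happens; oracle is brought back up on dbhost1 ...

    # on dbhost1, once oracle is confirmed up:
    hastart
    sleep 300                     # give the first node 3-5 minutes to come up cleanly

    # then on dbhost2:
    hastart

    # optional: move the group if it needs to end up on dbhost2
    hagrp -switch oragrp -to dbhost2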
Shutting down db machines

If you shut down the machine that is running the veritas cluster, it will NOT start on the other machine; it only fails over if the machine crashes. You need to manually switch the services before shutting down the machine. To switch processes:
    Find out which groups to transfer over:    hagrp -display
    Switch over each group:                    hagrp -switch group-to-move -to new-system
Then shut down the machine as desired. When rebooted, it will start the cluster daemon automatically.

Doing Maintenance on the Admin Network

If the admin network that the veritas cluster uses is brought down, veritas WILL fault both machines AND bring down oracle (nicely). You will need to do the following to recover:
    hastop -all
    On ONE machine:          hastart
    Wait 5 minutes.
    On the other machine:    hastart

Manual start/stop WITHOUT the veritas cluster

THIS IS ONLY USED WHEN THERE ARE DB FAILURES. If possible, use the section on DB maintenance. Only use this if the system fails on coming up AND you KNOW that it is due to a db configuration error. If you manually start the filesystems/oracle, manually shut them down and restart using hastart when done. (See the sketch after this section.)

To start up:
    Make sure ONLY the rootdg volume group is active on BOTH NODES. This is EXTREMELY important: if the oracle disk group is active on both nodes, corruption occurs. [ie. oradg or xxoradg is NOT present]
        vxdg list
        hastatus         [stop on both, as you are faulted on both machines]
        hastop -all      [if either was active, make sure you are truly shut down!]
    Once you have confirmed that the oracle disk group is not active, on ONE machine do the following:
        vxdg import oradg        [this may be xxoradg, where xx is the client 2-char code]
        vxvol -g oradg startall
        mount -F vxfs /dev/vx/dsk/oradg/name /mountpoint    [find volumes and mount points in /etc/VRTSvcs/conf/config/main.cf]
    Let the DBAs do their stuff.

To shut down:
    umount /mountpoint        [for each mountpoint]
    vxvol -g oradg stopall
    vxdg deport oradg
    Clear faults; start the cluster as described above.
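A consolidated sketch of the manual procedure above. The disk group, volume and mount point names (oradg, vol01, /u01) are illustrative only; always take the real ones from /etc/VRTSvcs/conf/config/main.cf.

    # manual start (on ONE node only, after confirming oradg is imported nowhere)
    vxdg list                                    # should show only rootdg on BOTH nodes
    vxdg import oradg
    vxvol -g oradg startall
    mount -F vxfs /dev/vx/dsk/oradg/vol01 /u01   # repeat per volume/mount point in main.cf

    # ... DBAs work ...

    # manual stop (same node)
    umount /u01                                  # repeat for each mount point
    vxvol -g oradg stopall
    vxdg deport oradg
    # then clear any faults and restart the cluster with hastart as described above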
3. Verify Cluster is Running

First verify that veritas is up and running:
    hastatus -summary
If this command cannot be found, add /opt/VRTSvcs/bin to root's PATH in /.profile:
    vi /.profile
If /.profile does not already exist, use this one:
    PATH=/usr/bin:/usr/sbin:/usr/ucb:/usr/local/bin:/opt/VRTSvcs/bin:/sbin:$PATH
    export PATH
Re-read the profile and re-verify that the command now runs:
    . /.profile
    hastatus -summary

Here is the expected result (your SYSTEMs/GROUPs may vary): one system's group should be OFFLINE and the other's should be ONLINE, ie:

# hastatus -summary

-- SYSTEM STATE
-- System               State                Frozen

A  e4500a               RUNNING              0
A  e4500b               RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  oragrp          e4500a               Y          N               ONLINE
B  oragrp          e4500b               Y          N               OFFLINE
If your systems do not show the above status, try these debugging steps:
    If NO systems are up, run hastart on both systems and run hastatus -summary again.
    If only one system is shown, start the other system with hastart. Note: one system should ALWAYS be OFFLINE for the way we configure systems here. (If we ran oracle parallel server, this could change -- but currently we run standard oracle server.)
    If both systems are up but both are OFFLINE, hastart did NOT correct the problem, and the oracle filesystems are not running on either system, the cluster needs to be reset. (This happens under strange network situations with GE Access.) [You ran hastart and that wasn't enough to get the full cluster to work.]
Verify that the systems have the following EXACT status (though your machine names will vary for other customers):

gedb002# hastatus -summary

-- SYSTEM STATE
-- System               State                Frozen

A  gedb001              RUNNING              0
A  gedb002              RUNNING              0

-- GROUP STATE
-- Group           System               Probed     AutoDisabled    State

B  oragrp          gedb001              Y          N               OFFLINE
B  oragrp          gedb002              Y          N               OFFLINE

gedb002# hares -display | grep ONLINE
nic-qfe3     State                 gedb001              ONLINE
nic-qfe3     State                 gedb002              ONLINE

gedb002# vxdg list
NAME         STATE           ID
rootdg       enabled         957265489.1025.gedb002

gedb001# vxdg list
NAME         STATE           ID
rootdg       enabled         957266358.1025.gedb001
Recovery Commands:
    hastop -all
    On one machine:                                   hastart
    Wait a few minutes, then on the other machine:    hastart
    hastatus -summary        [make sure one is OFFLINE and one is ONLINE]
If none of these steps resolved the situation, contact Lorraine or Luke (possibly Russ Button or Jen Redman if they made it to Veritas Cluster class) or a Veritas consultant.

4. Verify Services Can Switch Between Systems

Once hastatus -summary works, note the GROUP name used. Usually it will be "oragrp", but the installer can use any name, so please determine its name.
First check whether the group can switch back and forth. On the system that is running (system1), switch veritas to the other system (system2):
    hagrp -switch groupname -to system2        [ie: hagrp -switch oragrp -to e4500b]
Watch the failover with hastatus -summary. Once it has failed over, switch it back:
    hagrp -switch groupname -to system1

5. Verify OTHER System Can Go Up & Down Smoothly For Maintenance

On the system that is OFFLINE (should be system2 at this point), reboot the computer:
    ssh system2
    /usr/sbin/shutdown -i6 -g0 -y
Make sure the system comes up and rejoins the cluster after the reboot; that is, when the reboot is finished, the second system should show as OFFLINE in hastatus:
    hastatus -summary
Once this is done, switch the group to system2 and repeat the reboot for the other system:
    hagrp -switch groupname -to system2
    ssh system1
    /usr/sbin/shutdown -i6 -g0 -y
Verify that system1 is in the cluster once rebooted:
    hastatus -summary

6. Test Actual Failover For System 2 (and pray the db is okay)

To do this, we will kill off the listener process, which should force a failover. This test SHOULD be okay for the db (that is why we choose the LISTENER), but there is a very small chance things will go wrong .. hence the "pray" part :).

On the system that is online (should be system2), kill off the ORACLE LISTENER process:
    ps -ef | grep LISTENER
Output should be like:
    root    1415   600  0 20:43:58 pts/0    0:00 grep LISTENER
    oracle   831     1  0 20:27:06 ?        0:00 /apps/oracle/product/8.1.5/bin/tnslsnr LISTENER -inherit
    kill -9 process-id        [the PID of the tnslsnr process - in this case 831]
The failover will take a few minutes. You will note that system2 is faulted and system1 is now online. You need to CLEAR the fault before trying to fail back over:
    hares -display | grep FAULT        [find the resource that is faulted - in this case, LISTENER]
Clear the fault:
    hares -clear resource-name -sys faulted-system        [ie: hares -clear LISTENER -sys e4500b]

7. Test Actual Failover For System 1 (and pray the db is okay)

Now we do the same thing for the other system. First verify that the other system is NOT faulted:
    hastatus -summary
Then repeat the test: kill off the listener process, which should force a failover.

On the system that is online (should be system1 now), kill off the ORACLE LISTENER process:
    ps -ef | grep LISTENER
Output should be like:
    oracle   987     1  0 20:49:19 ?        0:00 /apps/oracle/product/8.1.5/bin/tnslsnr LISTENER -inherit
    root    1330   631  0 20:58:29 pts/0    0:00 grep LISTENER
    kill -9 process-id        [the PID of the tnslsnr process - in this case 987]
The failover will take a few minutes. You will note that system1 is faulted and system2 is now online. You need to CLEAR the fault before trying to fail back over:
    hares -display | grep FAULT        [find the resource that is faulted - in this case, LISTENER]
Clear the fault:
    hares -clear resource-name -sys faulted-system        [ie: hares -clear LISTENER -sys e4500a]
Run:
    hastatus -summary
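A condensed sketch of one round of the listener-kill failover test (section 6). Hostnames follow the e4500a/e4500b example, the resource name is taken from the example above, and the <pid-from-above> placeholder stands for the PID printed by ps:

    # on the node currently ONLINE (e4500b in this example)
    ps -ef | grep LISTENER | grep -v grep     # find the tnslsnr PID
    kill -9 <pid-from-above>                  # force the fault; failover takes a few minutes
    hastatus -summary                         # watch until the group is ONLINE on e4500a

    # once failed over, clear the fault so the group can move back later
    hares -display | grep FAULT
    hares -clear LISTENER -sys e4500b         # resource name may differ -- take it from the FAULT output
    hastatus -summary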