Recovering from a Major Fibre Channel Fabric and VMware Incident

Periodically, a Fibre Channel (FC) fabric event cascades across some or all of the VMware hosts (and the virtual machines running on them) on the IBM blades in HIO, and rarely in SLC. In VMware vSphere Client this shows up as warning and/or critical alert status on the affected hosts, and the hosts stop responding to configuration or status updates. On hosts running full ESX (as opposed to ESXi), running "top" from the local console will show system load well above 5.0 across the board.
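A quick way to confirm this from the local console on a full ESX host is to check the load averages directly (a sketch; exact output varies by ESX build):

uptime
# load averages well above 5.0 on the affected blades are consistent with this condition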

During a major FC fabric incident, the VMware host hits enough I/O blocking that all of the disk read/write queues fill up, and the host spends many CPU cycles trying to complete the outstanding I/O transactions. Over a period of an hour or so, the Service Console no longer has enough CPU cycles and/or memory available to respond to ESX/ESXi management agent requests.
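The I/O backlog itself can be confirmed with esxtop from a console or SSH session (a sketch; the keys and column names below come from esxtop's built-in help and may differ slightly between ESX/ESXi versions):

esxtop
# press 'd' for disk adapters or 'u' for disk devices
# sustained QUED counts and DAVG/KAVG latencies in the hundreds of milliseconds point to blocked I/O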

When such an incident occurs, the blade-side FC ports on both QLogic FC Pass-through Modules in the chassis must be reset. In some cases, access to the LUNs presented by the Nexsan E18 array must also be reset. One thing that must not be done while working through the incident: do not disconnect/reconnect the hosts from vCenter or from the cluster. Doing so will leave you unable to manage the virtual machines or control the host without first power cycling the host.

To recover from such an incident, log on to the management console of both QLogic FC Pass-through Modules in the corresponding chassis from PDX-SANMGMT-01 on the MGMT network, using the URLs below. The credentials are in the Systems KeePass database under: Storage > IBM/QLogic FC IPT Module. If a Java warning appears asking whether you want to block the applet, choose "No".

HIO    Module 1    http://172.16.25.241
HIO    Module 2    http://172.16.25.242
SLC    Module 1    http://172.16.30.241
SLC    Module 2    http://172.16.30.242

Once logged in, select all of the ports displayed below the blade chassis graphic by clicking the first port, then, while holding down SHIFT, clicking the last port.

With all ports selected, open the "Port" menu, choose "Reset Port", and confirm that you want to reset all of the selected ports. Bulk resetting of ports can take up to a minute to complete. To verify that traffic starts flowing through the module, click the "Port Stats" tab at the bottom of the user interface, choose "Baseline" from the drop-down, and click "Clear Baseline".

This will zero out all of the counters, and within 1-2 minutes you should see the numbers for "Class3 Frames In" and "Class3 Frames Out" start to increment for all active ports. Wait an additional 1-2 minutes after that before doing the same on the second module in the chassis.

If you reset all of the blade-side ports on both modules at the same time, you risk dropping all of the presented LUNs and MetaLUNs, corrupting data, and crashing the running virtual machines.

Once all of the blade-side FC ports have been reset, the VMware hosts should start their recovery within 5-10 minutes.
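The recovery can also be followed from an SSH session on an affected host by watching the VMkernel log for paths coming back online (a sketch; on ESX 4.x the log is typically /var/log/vmkernel, on ESXi 5.x it is /var/log/vmkernel.log):

tail -f /var/log/vmkernel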

If the condition of the hosts does not improve over the next 10-15 minutes, the next step is to reset LUN access for the Nexsan LUNs. To do so, log on to the web-based administration tool at http://hio-nexsan-01/ and click "Configure Volumes" in the left-hand menu (at this point you should be prompted for a username and password, which are stored under Storage > Nexsan in the Systems KeePass database).

Next, click the "Map Volume" tab, then click the "Next" arrow for the first volume listed. On the volume mapping page, change "Default Access" from "R/W" to "Deny", scroll to the bottom of the page, and click "Apply Changes".

Wait 30-60 seconds, change "Default Access" back to "R/W", and click "Apply Changes" again. Repeat this process for every other volume that corresponds to a datastore named "NEXSAN_XXX". Do not make any changes to the "NULL0" volume, since it is not a VMFS-formatted volume.

After fully recovering from the incident, verify that the datastores and virtual machines show a multipathing status of "Full Redundancy" under "Storage Views" at the cluster level. You may need to click the "Update..." link in the upper-right corner of the page if the last update timestamp is more than 2-3 minutes old.
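The same path information can be spot-checked per host from the command line (a sketch; output formatting varies by ESX/ESXi version):

esxcfg-mpath -b
# every NEXSAN LUN should list all of its paths, with none marked dead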

 

If the hosts are still unable to access some or all of the Nexsan LUNs, a SCSI LUN reset must be sent from one of the hosts that has access to those LUNs and has SSH enabled.

Look up the NAA ID for the volume:
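One way to map a datastore name to its NAA ID from any host with SSH enabled (a sketch; esxcfg-scsidevs is present on both ESX 4.x and ESXi):

esxcfg-scsidevs -m
# lists each VMFS volume with its backing naa.* device and datastore label (e.g. NEXSAN_XXX)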

Verify connectivity to the LUN:

hexdump -C /dev/disks/"NAA ID" | less

If there is no output, send a SCSI LUN reset to the device:

vmkfstools -L lunreset /vmfs/devices/disks/"NAA ID"

vmkfstools -L lunreset /vmfs/devices/disks/naa.6000402002d841bf6515dcb900000000

vmkfstools -L lunreset /vmfs/devices/disks/naa.6000402002d841bf6515dc4f00000000

vmkfstools -L lunreset /vmfs/devices/disks/naa.6000402002d841bf6515dbe900000000

vmkfstools -L lunreset /vmfs/devices/disks/naa.6000402002d841bf7b99248900000000
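After each reset, the hexdump check from above can be repeated against the device that was just reset to confirm it is answering reads again, for example:

hexdump -C /dev/disks/naa.6000402002d841bf6515dcb900000000 | head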

 

Rescan storage from the command line:

esxcfg-rescan -A
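If a given host does not accept the -A (all adapters) flag, the FC adapters can be rescanned one at a time instead (a sketch; the vmhba names are examples only, list the host's actual HBAs first):

esxcfg-scsidevs -a
esxcfg-rescan vmhba1
esxcfg-rescan vmhba2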

Wait 2-3 minutes

Run a rescan of the HBAs and volumes in vSphere Client

Repeat if necessary
