Proactive HA with Cisco UCS

Proactive HA was introduced along with many other features in vSphere 6.5. We all know what vSphere HA is, but this feature lets us prepare for problems that have not happened yet and so avoid outages. It requires a plug-in that connects the hardware to the software at a higher level than before. With Proactive HA, virtual machines can be migrated proactively to other hosts in the cluster with vMotion, which is why DRS must be enabled.

The Health Provider uses hardware sensor data to mark hosts as Healthy, Moderate degradation, Severe degradation or Unknown. Based on that information, vSphere DRS places the affected ESXi host into Quarantine Mode or Maintenance Mode. Maintenance Mode is self-explanatory. In Quarantine Mode, VMs can stay on the host and affinity rules remain in effect, but new VMs are not placed on the host. The third option is Mixed, which uses Quarantine Mode for Moderate and Maintenance Mode for Severe failures.
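The relationship between health state and remediation mode can be sketched in a few lines of Python. This is only an illustration of the logic described above, not VMware's implementation: the names `Health`, `Remediation` and `host_action` are made up for this example, and treating Healthy and Unknown as "no action" is my assumption.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    MODERATE = "moderate"
    SEVERE = "severe"
    UNKNOWN = "unknown"


class Remediation(Enum):
    QUARANTINE = "quarantine"
    MAINTENANCE = "maintenance"
    MIXED = "mixed"


def host_action(health, remediation):
    """Return the mode a host would be placed in, or None for no action."""
    if health not in (Health.MODERATE, Health.SEVERE):
        # Assumption: Healthy and Unknown hosts are left alone in this sketch.
        return None
    if remediation is Remediation.MIXED:
        # Mixed: Quarantine for Moderate, Maintenance for Severe.
        return "quarantine" if health is Health.MODERATE else "maintenance"
    # Quarantine or Maintenance applies to every degraded state.
    return remediation.value


for health in (Health.MODERATE, Health.SEVERE):
    print(health.value, "->", host_action(health, Remediation.MIXED))
```

With Mixed selected, only a Severe host ends up in Maintenance Mode; the other two modes treat Moderate and Severe the same way.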

The mode can be changed in the UI; it is part of the cluster's HA configuration.

Proactive HA can respond to the following failures:
-Power Supply
-Fan
-Storage
-Network
-Memory

Which failures result in a Moderate or Severe state is up to the vendor and is built into the Health Provider. This automation comes with the plugin that each vendor provides.

Cisco implemented the following failure conditions in its Proactive HA plugin (listed as category, fault code, description):

Network,F0539,IO controller temperature is outside the upper or lower critical threshold
Memory,F1706,ADDDC Memory RAS Problem
Power,F0391,Equipment PSU Voltage Threshold Non Recoverable
Fan,F0373,Equipment Fan Inoperable
Power,F0174,processor is inoperable
Power,F0374,Equipment PSU Inoperable
Memory,F37600,Memory temperature beyond threshold
Power,F35962,Motherboard power consumption beyond threshold
Power,F0311,Compute Physical Power Problem
Power,F1004,storage controller is inaccessible
Power,F0310,Compute Board Power Error
Power,F0313,Compute Physical BIOS POST Timeout
Power,F1007,virtual drive has become inoperable
Storage,F0317,Compute Physical Inoperable
Network,F0209,network facing adapter interface is down
Memory,F0190,Memory array voltage exceeds the specified hardware voltage
Memory,F0191,Memory Array Voltage Threshold Non Recoverable
Power,F0181,Local disk has become inoperable
Fan,F0382,Equipment Fan Module Thermal Threshold Critical
Memory,F0185,Memory Unit Inoperable
Fan,F0384,Equipment Fan Module Thermal Threshold Non Recoverable
Power,F0383,Equipment PSU Thermal Threshold Critical
Memory,F0187,Memory Unit Thermal Threshold Critical
Network,F0540,Compute IOHub Thermal Threshold Non Recoverable
Fan,F0484,Equipment Fan Performance Threshold Lower Non Recoverable
Memory,F0188,Memory Unit Thermal Threshold Non Recoverable
Storage,F0385,Equipment PSU Thermal Threshold Non Recoverable
Power,F0389,Equipment PSU Voltage Threshold Critical
Power,F0425,Compute Board CMOS Voltage Threshold Non Recoverable
Power,F0369,Equipment PSU Power Supply Problem
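Because the list above is comma-separated, it is easy to process with a script, for example to see which fault codes belong to which failure category. The snippet below is a minimal sketch using only the Python standard library; the sample rows are copied from the list above, and the function name `group_faults_by_category` is just something I picked for this example.

```python
import csv
from collections import defaultdict
from io import StringIO

# A few rows copied from the list above: category, fault code, description.
SAMPLE_ROWS = """\
Network,F0539,IO controller temperature is outside the upper or lower critical threshold
Memory,F1706,ADDDC Memory RAS Problem
Power,F0391,Equipment PSU Voltage Threshold Non Recoverable
Fan,F0373,Equipment Fan Inoperable
Power,F0374,Equipment PSU Inoperable
Storage,F0317,Compute Physical Inoperable
"""


def group_faults_by_category(text):
    """Return {category: [(fault_code, description), ...]} from CSV text."""
    grouped = defaultdict(list)
    for row in csv.reader(StringIO(text)):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        category, code = row[0].strip(), row[1].strip()
        # Re-join the remainder in case a description contains a comma.
        description = ",".join(row[2:]).strip()
        grouped[category].append((code, description))
    return dict(grouped)


faults = group_faults_by_category(SAMPLE_ROWS)
for category in sorted(faults):
    codes = ", ".join(code for code, _ in faults[category])
    print(f"{category}: {codes}")
```

Pasting the full list into `SAMPLE_ROWS` gives the complete mapping per category.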

As I mentioned, it is up to the vendor (Cisco in this case) to classify failures as Moderate or Severe.

Configure Cisco UCS management plugin:

The plugin itself can be downloaded from Cisco (link below), but it needs registration.

Installing the plugin is a simple process, but you need a web server to host it, which can be HTTP or HTTPS. By default vSphere only allows HTTPS as the source; to enable HTTP, add “allowHttp = true” to the client configuration file(s):
FLEX: /var/lib/vmware/vsphere-client/webclient.properties
HTML5: /var/lib/vmware/vsphere-ui/webclient.properties
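If you prefer to script the change instead of editing the file by hand, the idea looks roughly like this. It is only a sketch that works on the file content as a string: the helper name `ensure_property` and the sample content are hypothetical, and you would still have to read and write the actual file (after taking a backup) yourself.

```python
def ensure_property(text, key, value):
    """Return properties text with `key` set to `value`, adding it if missing."""
    lines = text.splitlines()
    new_line = f"{key} = {value}"
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#") or "=" not in stripped:
            continue  # leave comments and non key/value lines untouched
        if stripped.split("=", 1)[0].strip() == key:
            lines[i] = new_line  # key already present: overwrite its value
            break
    else:
        lines.append(new_line)  # key not found: append it at the end
    return "\n".join(lines) + "\n"


# Hypothetical file content, only for demonstration.
example = "# sample content\nsome.existing.key = value\n"
print(ensure_property(example, "allowHttp", "true"))
```

Running the helper a second time on its own output changes nothing, so it is safe to repeat.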

During the registration procedure populate the required fields:
IP/Hostname – of the vCenter Server
Username – administrative user account
Password – of the administrative user
Plugin location – the location of the plugin on the HTTP(S) server

Once the registration has completed successfully, the Cisco UCS icon should be present on the home screen:


The next two steps are the UCS domain and Proactive HA registration. As usual, I recommend using a domain service account.
To register a UCS domain, the user needs at least read-only permission in UCS Manager. Because the plugin has no built-in permissions in vCenter Server, every user who has the right to log in will see the domain if you tick the Visible to All Users checkbox. I recommend assigning read-only plus KVM access, because this makes console access quicker during daily operations, and the console requires authentication anyway.

To register UCS domain you’ll need to fill out the form:


Once registered, the UCS domain appears in the list of Registered UCS Domains. This must be done one by one for every UCS domain you want to integrate into the vCenter Server(s).

The next step is to register the plugin for Proactive HA. Switch to the Proactive HA tab and fill in the user and password fields.


Configure Proactive HA

It's very straightforward, but the service is disabled by default, so you have to enable it first. Just go to:
home » hosts and clusters » cluster » configure » vsphere availability » edit
then select the checkbox: Turn on Proactive HA.

You can set the options under Proactive HA Failures and Responses:

There are two automation levels: Manual and Automated. They work exactly the same way as in standard DRS. In Manual mode vCenter only recommends actions and you need to accept them manually, while in Automated mode VMs are managed automatically according to the Remediation Mode.

The remediation mode can be Mixed, Quarantine or Maintenance. I have already explained how each of them works. My recommendation is Mixed mode, as it keeps a good balance between performance and stability.

If you have hosts from different vendors and their plugins are installed, each provider can be enabled, disabled and configured per cluster.

Even though we can edit the actions, we have very little freedom to fine-tune or configure the failure conditions themselves. We can only enable or disable individual failure conditions and apply them to individual hosts or at cluster level.

With this last step the Proactive HA configuration is done, but the plugin also gives us some extra details.

The first and most conspicuous is on the host Summary tab. Some basic information is shown here, such as UCS Domain, Server Location, Service Profile, Serial Number, etc. A KVM or UCSM session can also be started from here, which can be really helpful during day-to-day operations.

A new UCS tab was added under Monitor too, where the faults of the specific host are visible. You can reach it via the following path:
home » hosts and clusters » host » monitor » cisco ucs

The last pages are a bit hidden, but there you can find many details about the domain(s) and all of their components, including chassis, rack mounts – if you have any – and the FI(s). Standard pages such as the summary and faults tabs and related ‘sub’ objects can also be found here.
Path to access:
Home » Cisco UCS » Registered UCS Domains » double-click on a domain

Depending on permissions, different actions can be taken at domain level, such as:

  • Service profile configurations
  • Managing service profile templates
  • Server pools
  • Firmware package management, including uploading, removing or modifying packages

But be aware! This information could be available to every user with access to the vCenter Server. Currently no restrictions can be set for individual users or groups.

All components have sub-pages with basically the same kind of information: versions, components, faults, etc.

Now we have Proactive HA enabled and we are familiar with the Cisco UCS plugin and its features.

Thanks for reading this article. If you have any questions please feel free to comment or contact me directly.

Links:

VMware library: link
UCS Manager plugins v3.03: link
Release Notes: link