Friday, January 24, 2014

Monitoring VMware View, Desktop Pool Availability

VMware View is difficult to monitor. Basic things, such as the View Connection server/broker, are easy: the service is either up or down, running out of memory or not. Availability of desktop pools is a different matter, and VMware View has no native alarm or health check features to help the View administrator monitor it.


I was tasked with developing a health check to monitor our VMware View floating, non-persistent, linked clone desktop pools. We had a couple of instances where either provisioning was turned off (human error) or the pool's Max Number of Desktops was set too low.


An avid redditor, I turned to r/vmware first. See my thread here and the many helpful comments.  Some folks said buy vCOPS for View but that's not an option.  A nice redditor named BlowDuck gave me the initial start on a health check.  Their health check was good but I found a couple ways to improve it.


I whiteboarded the following matrix while thinking about their health check and what I was monitoring:

| Availability | Remote Sessions | Desktops Available | Max Number of Desktops |
|---|---|---|---|
| Remote Sessions | X | Composer/Storage/vCenter needs to keep up with demand as more users log in. | Pool needs to be configured to support maximum number of sessions. |
| Desktops Available | X | X | The number of Desktops Available will approach the Maximum Desktops Configured as more users are entitled to the pool. Early warning detection before the number of Remote Sessions reaches the Maximum Desktops Configured. |
| Max Number of Desktops | X | X | X |


I decided to change their monitoring script after seeing the intersection of the variables involved and what they represented.


First, BlowDuck's original monitoring script monitored a pool by looking at the number of remote sessions and checking whether it was within X of the pool's Max Number of Desktops. This is a static way to monitor the pool: the health check would need to be rewritten whenever the number of users or the size of the pool changed.
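
To see why a fixed margin does not scale, here is a minimal Python sketch of that style of check. It is not BlowDuck's actual script; the function name `static_check` and the margin of 10 are assumptions made up for illustration.

```python
def static_check(remote_sessions, max_desktops, margin=10):
    """Alarm when sessions come within `margin` desktops of the pool maximum.

    Hypothetical helper: `margin` stands in for the fixed "X" in the text.
    """
    return max_desktops - remote_sessions <= margin


# The same margin trips at very different utilization levels:
# a 20-desktop pool alarms at 10 sessions (50% used), while a
# 500-desktop pool stays quiet until 490 sessions (98% used).
small_pool_alarm = static_check(remote_sessions=10, max_desktops=20)
large_pool_alarm = static_check(remote_sessions=400, max_desktops=500)
print(small_pool_alarm, large_pool_alarm)
```

A margin that suits one pool is too noisy or too late for another, so it has to be retuned per pool.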

Second, Max Number of Desktops is not a representation of the actual number of running and available desktops, only what could be potentially available. What if a provisioning error or the environment's capacity prevented the pool from reaching Max Number of Desktops to service an increasing number of remote sessions?

My script addresses those issues by using the number of remote sessions and the number of desktops for a pool. It calculates what percentage of desktops within the pool are currently utilized. The script is passed WarningLevel and CriticalLevel arguments as percentages, which tell it when to alarm, e.g. alarm at 75% utilization.
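
The percentage logic can be sketched in a few lines of Python. This is only an illustration of the idea, not the PowerShell script itself: the function names are hypothetical, the 0/1/2/3 values are the standard Nagios plugin exit codes, and returning UNKNOWN for an empty pool is my own assumption.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def utilization_percent(remote_sessions, desktops):
    """Percentage of the pool's desktops that currently have a session."""
    return 100.0 * remote_sessions / desktops


def check_pool(remote_sessions, desktops, warning_level, critical_level):
    """Return (exit_code, message) for one pool; levels are percentages."""
    if desktops <= 0:
        # Assumption: an empty pool is reported as UNKNOWN rather than OK.
        return UNKNOWN, "UNKNOWN - pool has no desktops"
    used = utilization_percent(remote_sessions, desktops)
    detail = f"{used:.1f}% of desktops in use ({remote_sessions}/{desktops})"
    if used >= critical_level:
        return CRITICAL, "CRITICAL - " + detail
    if used >= warning_level:
        return WARNING, "WARNING - " + detail
    return OK, "OK - " + detail


status, message = check_pool(120, 150, warning_level=75, critical_level=90)
print(status, message)
```

Because the thresholds are percentages, the same WarningLevel and CriticalLevel values apply to a pool of 20 desktops or 500 without changes.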


A couple notes on using number of remote sessions and desktops: 1) it's slow, 2) it isn't a true representation of what is available.

First, there is no extension data in View PowerCLI to make the counts readily available, so the script uses the object count method. This makes the script slow, really slow. It will take almost 3 minutes to count a pool of 150 desktops and sessions. The script is slow enough that I had to modify our Nagios configuration in several places, increasing the timeout values, to allow the script to run.

Second, the script doesn't really count available desktops. It counts the number of desktops in a pool. The state of those desktops is unknown to the script. They could be Available, Agent Unreachable, Deleting, Deleting(missing), Customizing or any other state.
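
A small Python sketch shows how far the two numbers can drift apart. The script does not do this today; the list of states below is made-up sample data (the state names come from the paragraph above), and how you would fetch per-desktop state from View is left open.

```python
from collections import Counter

# Hypothetical per-desktop states for one pool (sample data, not real output).
states = [
    "Available",
    "Available",
    "Agent Unreachable",
    "Customizing",
    "Deleting",
]


def count_by_state(desktop_states):
    """Tally how many desktops are in each state."""
    return Counter(desktop_states)


counts = count_by_state(states)
total = len(states)              # what a plain desktop count reports
available = counts["Available"]  # what users could actually log in to
print(f"{total} desktops in pool, {available} actually Available")
```

With a plain count, this pool looks like it has five desktops to offer when only two can take a session.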

You can find the script on GitHub at https://github.com/mmarseglia/view-tools/blob/master/desktopsAvailable.ps1.