Troubleshoot with VMware vRealize Operations Part 1
Updated: Apr 27, 2022
In our previous blogs about VMware vRealize Operations (vROps) we explored the first two pillars of the Home page: Optimize Performance and Optimize Capacity. Today we will explore the third: Troubleshoot.
The Troubleshoot feature has been around for several releases, but got significantly better in vROps 8.0, with the release of the Workbench. Note: all screenshots in this blog are taken from a vROps 8.2 instance.
The Workbench provides the user with a framework around which they can troubleshoot problems. Let's explore by clicking the Workbench, the first item in the Troubleshoot pillar of vROps. You'll be presented with the Workbench itself.
First, search for the object you want to troubleshoot. This can be any object vROps is aware of: a vSphere object like a VM or ESXi Host, or a vRTVS-discovered object like a NetApp FAS Aggregate or a Microsoft SQL Server Query. In this case, let's search for a VM: type its name into the search bar and hit Enter. You'll be presented with the Workbench itself.
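The same lookup can also be scripted. Here is a minimal sketch, assuming your vROps version exposes resource search in the Suite API at /suite-api/api/resources with a name query parameter (check the API documentation for your release). The hostname vrops.example.com and the helper name build_resource_search_url are placeholders of mine, and the code only builds the URL. It does not authenticate or send a request.

```python
from urllib.parse import urlencode


def build_resource_search_url(host: str, name: str) -> str:
    """Build a search URL for a vROps object by name.

    Assumption: the Suite API path and the `name` parameter match your
    vROps release. `host` is a placeholder for your own vROps node.
    """
    query = urlencode({"name": name})  # escapes spaces and special characters
    return f"https://{host}/suite-api/api/resources?{query}"


# Example: the VM used throughout this blog.
print(build_resource_search_url("vrops.example.com", "sql-prd-00"))
```

Building the URL separately from sending it keeps the example safe to run anywhere, and lets you paste the result into whatever HTTP client you already use.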
This new dashboard, the Troubleshooting Workbench, changed the game for vROps users. It provides a single pane in which object relationships are shown, events are listed, alerts are published, anomalous metrics are highlighted, and more. Let's go through this from left to right.
The left-hand pane explores the object you searched for and objects related to it. When the Selected Scope is Level 1, it shows the target object (in this case a VM called sql-prd-00) and all of its parent and child objects, such as Datastores, ESXi Hosts, Microsoft SQL Instances, and Cisco Switches/Ports. The non-vSphere objects were discovered by vRTVS adapters.
If you'd like to dig deeper, you can increase the Selected Scope (the maximum is Level 5). Let's increase it to Level 3 and click the CUSTOM button to explore relationships.
You are taken to an Advanced Object Relationship dashboard focused on the target object, VM sql-prd-00. You'll also see parent and child objects, in this case three steps beyond the target. This gives us immediate visibility into the VM, child objects that could be affecting it, and parent objects that it could potentially be affecting.
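One way to picture the Selected Scope is as a hop limit on the relationship graph. The sketch below is only an illustration of that idea, not how vROps computes scope internally: the relationship map, every object name other than sql-prd-00, and the function objects_in_scope are all made up for the example, and I am assuming a level corresponds to one parent/child hop.

```python
from collections import deque

# Hypothetical relationship map: each object lists its direct parents and children.
RELATED = {
    "sql-prd-00": ["esxi-01", "datastore-01", "mssql-instance-01"],
    "esxi-01": ["sql-prd-00", "cluster-01", "switch-port-12"],
    "datastore-01": ["sql-prd-00", "netapp-aggregate-01"],
    "mssql-instance-01": ["sql-prd-00", "mssql-database-01"],
    "cluster-01": ["esxi-01"],
    "switch-port-12": ["esxi-01", "cisco-switch-01"],
    "netapp-aggregate-01": ["datastore-01"],
    "mssql-database-01": ["mssql-instance-01"],
    "cisco-switch-01": ["switch-port-12"],
}


def objects_in_scope(target, level):
    """Return {object: hops from target} for every object within `level` hops."""
    hops = {target: 0}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        if hops[current] == level:
            continue  # already at the scope limit, do not expand further
        for neighbour in RELATED.get(current, []):
            if neighbour not in hops:
                hops[neighbour] = hops[current] + 1
                queue.append(neighbour)
    return hops


# Level 1 shows only direct parents and children; Level 3 reaches the switch.
print(sorted(objects_in_scope("sql-prd-00", 1)))
print(sorted(objects_in_scope("sql-prd-00", 3)))
```

In this toy map, Level 1 returns the VM plus its host, datastore, and SQL instance, while Level 3 reaches the physical switch behind the host's port.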
The little icons next to each object represent its health: a green square is healthy, a yellow triangle is marginally healthy, and a red circle is unhealthy. In the diagram above I've noted what each object is, giving you visibility beyond the vSphere edge. The power of this visibility is vast, allowing you to troubleshoot from application layers far above the hypervisor down into your hardware.
Back on the main Troubleshooting Workbench you'll notice five tabs along the top. Let's explore each.
The first tab is Potential Evidence. It will show information for all objects in the left-hand pane, unless you've selected one manually. Choose the time frame you'd like to explore with the Time Range dropdown. You can also Hide Consequential Evidence by checking that box. This allows you to hide evidence across objects that might be a consequence, as opposed to a root cause, of the problem you're troubleshooting. I generally start with this box unchecked so I have visibility into everything, but it's useful as a filter.
Below that we have three pillars:
Events - Major events and metrics that have breached their dynamic thresholds. Events are things like an ESXi Host not responding or a VM being powered off. They don't necessarily generate alerts, but we do want to make note of them. vROps events are described nicely here: https://docs.vmware.com/en/vRealize-Operations-Manager/8.2/com.vmware.vcom.user.doc/GUID-673D574D-6F79-4FA6-B7D0-1FC432F8BA06.html
Property Changes - Impactful property and configuration changes within the selected time frame, against all objects or the object selected at left.
Anomalous Metrics - Metrics which have changed drastically within the selected time frame. Note that anomalous metrics and metrics which have exceeded their dynamic thresholds (the Events pillar above) are not the same. John Dias explains the difference quite nicely in this blog: https://blogs.vmware.com/management/2019/11/27137.html#:~:text=Anomalous%20Metrics%20are%20statistically%20significant,a%20single%2C%20large%20spike
You can explore all the events, property changes, and anomalous metrics in more detail by clicking the Pop Out button in the top right of each widget. This has been around since vROps 8.0, but in 8.1 we were given the ability to create an Alert Definition for any of these. This is an easy way to create alert definitions for potential problems.
The Alerts tab will show you all Active alerts (default) for the target object and objects in the selected scope.
You can use the ALL FILTERS tab to filter the Alerts. The Group By dropdown allows you to see Alerts grouped several different ways. The superimposed x-box just above the list of Alerts allows you to clear all Alerts from your list. You can then select an object in the left-hand pane to see just its Alerts. You can quickly go back to all Alerts by clicking the superimposed check-box just above the list of Alerts.
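If you export alerts and want to reproduce this kind of view outside the UI, the filter-then-group logic is simple. The following is a rough sketch under my own assumptions: the alert records, their field names (object, name, criticality, status), the alert names, and the function group_active_alerts are all hypothetical and do not reflect the vROps data model.

```python
from collections import defaultdict

# Hypothetical alert records; field names and values are illustrative only.
alerts = [
    {"object": "sql-prd-00", "name": "VM CPU contention", "criticality": "Critical", "status": "Active"},
    {"object": "sql-prd-00", "name": "VM memory pressure", "criticality": "Warning", "status": "Active"},
    {"object": "esxi-01", "name": "Host memory contention", "criticality": "Critical", "status": "Active"},
    {"object": "datastore-01", "name": "Datastore latency", "criticality": "Warning", "status": "Cancelled"},
]


def group_active_alerts(alerts, key, object_name=None):
    """Group Active alerts by `key`, optionally limited to one object."""
    groups = defaultdict(list)
    for alert in alerts:
        if alert["status"] != "Active":
            continue  # mirror the default view: Active alerts only
        if object_name is not None and alert["object"] != object_name:
            continue  # mirror selecting a single object in the left-hand pane
        groups[alert[key]].append(alert["name"])
    return dict(groups)


# All objects in scope, grouped by criticality.
print(group_active_alerts(alerts, "criticality"))
# Only the target VM.
print(group_active_alerts(alerts, "criticality", object_name="sql-prd-00"))
```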
Once an Alert has been chosen, click the ACTIONS dropdown and you will be presented with several options.
If you've configured the vRealize Log Insight (vRLI) and vRealize Network Insight (vRNI) integrations in vROps, you will be able to go directly to those tools and see the object in context. The vRealize Integrations are created here; I have them for vRLI, vRNI, and vRealize Automation (vRA).
The Events tab will show all alerts and events in your environment with the flexibility to filter them based on Alert Criticality, Alert Status, or Alert Type. You can also filter events based on Event Types via the EVENT FILTERS dropdown.
You can focus in on the time frame you want, clear all events via the superimposed x-box, select an object on the left to see events for just that object, and quickly see all events again by clicking the superimposed check-box next to the FILTERS drop down.
The top pane shows the alerts and events over the selected time frame, which gives you visibility into the chronology of things. The bottom pane is a list of those same alerts and events.
The Timeline tab will show you the timeline of events over the last 24 hours, the last week, and the last month. This provides higher level visibility into when potential problems started happening.
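To make the three windows concrete, here is a small sketch that counts events per window from a list of timestamps. It is an illustration only: the timestamps are invented, count_events_per_window is a hypothetical helper, and treating "last month" as 30 days is my assumption rather than something vROps documents.

```python
from datetime import datetime, timedelta

# The three windows the Timeline tab offers; 30 days for a month is an assumption.
WINDOWS = {
    "last 24 hours": timedelta(hours=24),
    "last week": timedelta(days=7),
    "last month": timedelta(days=30),
}


def count_events_per_window(event_times, now):
    """Count how many events fall inside each look-back window ending at `now`."""
    return {
        label: sum(1 for t in event_times if now - span <= t <= now)
        for label, span in WINDOWS.items()
    }


# Invented event timestamps relative to a fixed "now".
now = datetime(2022, 4, 27, 12, 0)
event_times = [
    now - timedelta(hours=2),
    now - timedelta(days=3),
    now - timedelta(days=10),
    now - timedelta(days=20),
    now - timedelta(days=45),
]
print(count_events_per_window(event_times, now))
```

A jump in the count between two windows is a quick hint about when a problem started, which is the same question the Timeline tab helps answer visually.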
The fifth and final tab is Logs, which is basically an interface to vRLI.
Once authenticated, you'll be taken to the Interactive Analysis tab of vRLI for the selected object (the filter is already in place). This gives the user visibility into logs related to the selected object. This, of course, depends on vRLI being configured to consume logs from various targets, which will be explored in our next blog series.
Here are a couple of useful links around the Troubleshooting Workbench, including a blog from VMware Technical Marketing Manager John Dias. Few people in the world know more about the platform than he does.
1. Documentation - https://docs.vmware.com/en/VMware-vRealize-Operations-Cloud/services/user-guide/GUID-EFF1CB80-30B0-4F34-9553-16CC4B362253.html
2. John Dias Blog - https://blogs.vmware.com/management/2019/11/27137.html#:~:text=Anomalous%20Metrics%20are%20statistically%20significant,a%20single%2C%20large%20spike