
How do I set up alarms?

Introduction

The great thing about working on TrueSight Pulse is that we get to build tools that we want to use ourselves, every day. Having been developers for many years, we've been exposed to plenty of alarming and alerting tools. In fact, alarming is usually the second thing we set up after our servers are created. We can't watch our server monitors constantly, so being told when things look suspect is vital.

The usual suspects of alarming haven't really changed over the years. At the application layer, alarms typically test for slow or missing responses to a web request. At the system level, there are basic resource threshold checks. One of the biggest problems with alarming techniques is producing valid alarms. As we all know, systems fluctuate throughout the day, mostly for known reasons, and those fluctuations can cause superfluous alarms. You then have to dig through the generated notifications to find the real problems. There are various ideas around 'flapping detection', application scoring, and other somewhat arcane techniques, so we thought: "Why not use our existing analytic engine behind TrueSight Pulse to make a better alarming solution?"

TrueSight Pulse alarms can be triggered when a measurement value exceeds a threshold, when communication with a meter ceases, or when a measurement is no longer being sent.

Here is a video that covers alarms and actions as well:

TrueSight Pulse - Alarms and Actions

Setup

Firstly, you can add alarms quickly and easily from the Alarms tab in the settings dialog.

On the Alarms tab we can see our existing alarms and click to add a new one:

Next we see the alarm settings:

We tried to make alarms as easy as possible to set up.

First, give the alarm a name. This is a short, simple title that lets you quickly see what the issue is when the alarm appears.

For a threshold alarm we select Threshold from the drop down menu. A threshold alarm indicates that we want the alarm to trigger when a measurement value exceeds a threshold.

Next is the metric. This can be any one of those listed in the drop down menu.

These settings are typical of what you've seen elsewhere, but now we get to a feature that is a little different: the threshold.

What sets TrueSight Pulse alarms apart is that we can alarm not only on basic aggregates like min, max, and count, but also on an average. The aggregates work in conjunction with the time period:

 

 

The time period can range from 1 second up to 1 hour. For example, if I set a CPU threshold of, say, avg > 80% and a period of 1 minute then this translates to:

"Tell me when my CPU average is above 80% for at least 1 minute"

In practice, this technique for detecting genuinely bad states has been the most effective for us at BMC.
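To make the aggregate-plus-period idea concrete, here is a minimal Python sketch. It is not how TrueSight Pulse is implemented; the function name `is_triggered`, the `AGGREGATES` table, and the sample data are all hypothetical, and it assumes the aggregate is computed over every sample that falls inside the period. It shows why an average over a period is quieter than a simple maximum: a brief spike trips `max` but not `avg`.

```python
from statistics import mean

# Hypothetical mapping of aggregate names to functions (illustration only).
AGGREGATES = {"avg": mean, "min": min, "max": max, "count": len}


def is_triggered(samples, aggregate, threshold, period_seconds, now):
    """Return True if the aggregate of the samples inside the period exceeds the threshold.

    samples is a list of (timestamp_seconds, value) pairs.
    """
    # Keep only the samples that fall inside the period ending at `now`.
    window = [value for ts, value in samples if now - period_seconds < ts <= now]
    if not window:
        return False  # no data in the period, nothing to compare
    return AGGREGATES[aggregate](window) > threshold


# CPU percentage sampled every 10 seconds for one minute (made-up data).
spike = [(0, 50), (10, 50), (20, 99), (30, 50), (40, 50), (50, 50)]
sustained = [(0, 85), (10, 90), (20, 88), (30, 92), (40, 81), (50, 86)]

spike_avg = is_triggered(spike, "avg", 80, 60, now=50)          # False: average is about 58%
spike_max = is_triggered(spike, "max", 80, 60, now=50)          # True: one sample hit 99%
sustained_avg = is_triggered(sustained, "avg", 80, 60, now=50)  # True: average is 87%
```

With "avg > 80%" over 1 minute, the momentary spike is ignored while the sustained load triggers the alarm.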

After setting the threshold and period, you can set an associated Action that runs when the alarm threshold is exceeded. See Action Send Notification by Email for instructions on configuring an email Action. For a complete discussion of Actions, see TrueSight Pulse Actions.

The Note is passed along with the alarm notification when it is sent.

You can choose which servers to alarm on by selecting a filter from the drop down menu (see Filtering & Searching Sources - How To for additional details on creating and using filters). By default, the alarm applies to All Sources. This setting works in conjunction with the Notify for every change feature.

To illustrate how these settings work, let's take a simple example of monitoring 3 servers, A, B, and C, at a 1 minute period. Consider the following scenario:

  • 1:00pm - Alarm A triggered
  • 1:01pm - Alarms B and C triggered
  • 1:02pm - Alarms A and B resolved
  • 1:03pm - Alarm C resolved

By default, without the 'notify for every change' option you would receive just two notifications:

  • ALARM at 1:00pm because at least one of your server alarms triggered
  • RESOLVED at 1:03pm when all of your server alarms were no longer triggered

However, if you enable 'notify for every change' you would receive four notifications as follows:

  • ALARM at 1:00pm because at least one of your server alarms triggered
  • ALARM UPDATE at 1:01pm stating that B and C are now triggered, A continues
  • ALARM UPDATE at 1:02pm stating that A and B are resolved, C continues
  • RESOLVED at 1:03pm when all of your server alarms were no longer triggered
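
The two behaviours above can be sketched in a few lines of Python. This is only an illustration of the scenario as described, not TrueSight Pulse code; the `notifications` function and the event tuples are hypothetical names chosen for the example.

```python
from itertools import groupby


def notifications(events, notify_every_change=False):
    """Return the (kind, time) notifications produced by a list of alarm events.

    events is a time-ordered list of (time, server, state) tuples,
    where state is "triggered" or "resolved".
    """
    active = set()  # servers whose alarm is currently triggered
    sent = []
    # Process all events that share a timestamp as one step.
    for time, group in groupby(events, key=lambda event: event[0]):
        before = set(active)
        for _, server, state in group:
            if state == "triggered":
                active.add(server)
            else:
                active.discard(server)
        if not before and active:
            sent.append(("ALARM", time))         # first server entered the alarm state
        elif before and not active:
            sent.append(("RESOLVED", time))      # last server left the alarm state
        elif before != active and notify_every_change:
            sent.append(("ALARM UPDATE", time))  # membership changed while still alarming
    return sent


# The scenario from the text: servers A, B, and C.
scenario = [
    ("1:00pm", "A", "triggered"),
    ("1:01pm", "B", "triggered"),
    ("1:01pm", "C", "triggered"),
    ("1:02pm", "A", "resolved"),
    ("1:02pm", "B", "resolved"),
    ("1:03pm", "C", "resolved"),
]

default_result = notifications(scenario)
every_change_result = notifications(scenario, notify_every_change=True)
```

Here `default_result` holds the two notifications (ALARM at 1:00pm, RESOLVED at 1:03pm) and `every_change_result` holds all four, matching the lists above.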

Speaking of notifications, here's what a typical notification mail looks like:

Notice that you are told exactly which alarms are triggered (or resolved, as the case may be) and why they are triggered, and you are given a link to the dashboard at the point in time of the trigger.

When you go to your dashboard you will now see some differences. The first thing you'll notice is that the graph representing the metric that has triggered the alarm now has a flashing red border:

 

The next thing you'll notice is a red bell icon in the graph title bar. We can click this to see what has triggered:

 

 

You can now click on the triggered alarm to go directly to the point in time of the trigger.

That is the short and sweet approach to TrueSight Pulse alarms. We hope these alarms are as useful and pertinent to your business as they have been to ours.
