Automate your NOC World Map at scale

Managing hundreds of devices in your monitoring system can be tedious, especially with GUI-based device onboarding. So why not let your config management tool of choice take care of it? This blog post presents a declarative Ansible playbook that generates Telegraf configuration files for the inputs.ping plugin and populates a Grafana World Map.

First of all: you need a working installation of the TIG-Stack (Telegraf, InfluxDB, Grafana) and an Ansible automation host.

Geohash

Geohash is an open geocode system that represents a coordinate as a single alphanumeric string. Precision depends on the number of characters, reaching roughly +/- 20 m with eight characters. This demo is based on the worldwide AWS Regions. Just look up each location in the GeohashExplorer or a geohash translator and add the result as an additional variable to every ‘host’ in the Ansible inventory.

[aws]
ec2.us-east-1.amazonaws.com geohash=dqb0
ec2.us-east-2.amazonaws.com geohash=dpjn
ec2.us-west-1.amazonaws.com geohash=9qcp
ec2.us-west-2.amazonaws.com geohash=9rf7
ec2.ap-east-1.amazonaws.com geohash=wecp
ec2.ap-south-1.amazonaws.com geohash=te7u
ec2.ap-northeast-2.amazonaws.com geohash=wydm
ec2.ap-southeast-1.amazonaws.com geohash=w21z
ec2.ap-southeast-2.amazonaws.com geohash=r3gx
ec2.ap-northeast-1.amazonaws.com geohash=xn7t
ec2.ap-northeast-3.amazonaws.com geohash=xn0h
ec2.ca-central-1.amazonaws.com geohash=f244
ec2.eu-central-1.amazonaws.com geohash=u0yh
ec2.eu-west-1.amazonaws.com geohash=gc7r
ec2.eu-west-2.amazonaws.com geohash=gcpu
ec2.eu-west-3.amazonaws.com geohash=u09t
ec2.eu-north-1.amazonaws.com geohash=u6sc
ec2.me-south-1.amazonaws.com geohash=theu
ec2.sa-east-1.amazonaws.com geohash=6gyc

Yep, and that’s the only manual task left …
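If you would rather compute the geohash values than look them up, the standard encoding is short enough to script. The following is a minimal Python sketch and not part of the playbook; the function name geohash_encode and the sample coordinate are illustrative assumptions. It interleaves longitude and latitude bits and emits one base32 character per five bits.

```python
# Geohash base32 alphabet (digits plus lowercase letters without a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=4):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0          # number of bits collected for the current character
    value = 0         # the 5-bit value being assembled
    use_lon = True    # bits alternate, starting with longitude

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:
            chars.append(BASE32[value])
            bits = 0
            value = 0

    return "".join(chars)


if __name__ == "__main__":
    # Sample coordinate chosen for illustration only.
    print(geohash_encode(42.6, -5.6, precision=5))  # ezs42
```

The inventory above uses four characters per host, which corresponds to a cell of roughly +/- 20 km. That is coarse, but sufficient for placing a marker on a world map.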

Ansible playbook

The playbook and all additional files are available on GitHub. Just set the variable ‘telegrafhost’ according to your environment. I use a little Raspberry Pi as my TIG-Stack, so it is ‘grafanapi’, as listed in the Ansible hosts file together with its credentials (but please use Ansible Vault or another form of secure credential handling in production).

  - name: TEMPLATE OUT TELEGRAF.CONF FILES
    template:
      src: grafanawm.j2
      dest: /etc/telegraf/telegraf.d/{{inventory_hostname}}_wm.conf
    delegate_to: "{{ telegrafhost }}"
    when: geohash is defined
    notify: RELOAD TELEGRAF

This task generates one config file per host in the Telegraf configuration directory, but only if the additional variable ‘geohash’ is defined. It uses the following Jinja2 template, located in the local working directory, to activate the Telegraf ping plugin.

# {{ ansible_managed }} by {{ template_host }}

[[inputs.ping]]
  urls = ["{{ inventory_hostname }}"]
  interval = "60s"
  count = 4
  ping_interval = 1.0
  timeout = 1.0
  deadline = 10

  [inputs.ping.tags]
     geohash="{{ geohash }}"
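To make the effect of the ‘geohash is defined’ condition concrete, here is a small Python sketch (an illustration only, not how Ansible parses inventories) that reads INI-style inventory lines and lists which config files the template task would create. The function name planned_configs, the shortened inventory, and the host without a geohash are assumptions made for the example.

```python
INVENTORY = """\
[aws]
ec2.us-east-1.amazonaws.com geohash=dqb0
ec2.eu-west-1.amazonaws.com geohash=gc7r
host-without-geohash.example.com
"""


def planned_configs(inventory_text, conf_dir="/etc/telegraf/telegraf.d"):
    """Map each expected config file path to the geohash tag it will carry."""
    plans = {}
    for line in inventory_text.splitlines():
        line = line.strip()
        # Skip blank lines, group headers such as [aws], and comments.
        if not line or line.startswith(("[", "#", ";")):
            continue
        host, *pairs = line.split()
        hostvars = dict(p.split("=", 1) for p in pairs if "=" in p)
        # Mirrors "when: geohash is defined": hosts without the variable are skipped.
        if "geohash" in hostvars:
            plans[f"{conf_dir}/{host}_wm.conf"] = hostvars["geohash"]
    return plans


if __name__ == "__main__":
    for path, geohash in planned_configs(INVENTORY).items():
        print(path, geohash)
```

Only the two hosts carrying a geohash variable end up with a ‘_wm.conf’ file; the third host is ignored.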

The next two tasks are required to render this playbook declarative. Simple changes to existing config files are handled by the idempotent Ansible template module itself. But the goal is to use the Ansible inventory as a single source of truth for the world map – meaning orphaned Telegraf config files are not allowed!

  - name: READ LIST OF ALL TELEGRAF.CONF FILES IN DIRECTORY
    find:
      paths: /etc/telegraf/telegraf.d/
      file_type: file
      recurse: no
      patterns: "*wm.conf"
    delegate_to: "{{ telegrafhost }}"
    register: files_matched
    run_once: true

  - name: DELETE STALE TELEGRAF.CONF FILES
    file:
      path: "{{ item.path }}"
      state: absent
    loop: "{{ files_matched.files|flatten(levels=1) }}"
    loop_control:
      label: "{{ item.path }}"
    delegate_to: "{{ telegrafhost }}"
    when: (item.path | basename | regex_replace('_wm.conf') not in ansible_play_hosts_all)
    notify: RELOAD TELEGRAF
    run_once: true

The first task generates a list of all files in the /etc/telegraf/telegraf.d/ directory that match the suffix ‘_wm.conf’. We use a dedicated worldmap suffix because there might be other config files for the same host using other Telegraf plugins. The second task loops over this list (files_matched) and deletes every file without a corresponding entry in the Ansible inventory. All the magic hides behind the conditional ‘when’:

item.path → /etc/telegraf/telegraf.d/ec2.us-east-1.amazonaws.com_wm.conf
item.path | basename → ec2.us-east-1.amazonaws.com_wm.conf
item.path | basename | regex_replace('_wm.conf') → ec2.us-east-1.amazonaws.com
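The same decision can be sketched in plain Python to show what the conditional evaluates for each file. This is an illustration of the logic only; the function name stale_configs and the sample paths are assumptions. Note one deliberate difference: the sketch escapes the dot and anchors the pattern at the end of the name, whereas the pattern in the playbook is unanchored and its dot matches any character.

```python
import re
from pathlib import PurePosixPath


def stale_configs(paths, inventory_hosts):
    """Return the config file paths whose host is no longer in the inventory."""
    stale = []
    for path in paths:
        # basename: drop the directory part of the path.
        name = PurePosixPath(path).name
        # Strip the worldmap suffix to recover the inventory hostname.
        host = re.sub(r"_wm\.conf$", "", name)
        if host not in inventory_hosts:
            stale.append(path)
    return stale


if __name__ == "__main__":
    found = [
        "/etc/telegraf/telegraf.d/ec2.us-east-1.amazonaws.com_wm.conf",
        "/etc/telegraf/telegraf.d/removed.example.com_wm.conf",
    ]
    hosts = {"ec2.us-east-1.amazonaws.com"}
    print(stale_configs(found, hosts))
```

Only the file for the host that is missing from the inventory is reported, which is exactly the set the delete task removes.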

The handler at the bottom of the playbook reloads the telegraf service, of course only if the config files have changed. IMPORTANT: Add the telegraf reload command to /etc/sudoers with the NOPASSWD attribute for the desired user group, in my case: %pi.

%pi ALL= NOPASSWD: /bin/systemctl reload telegraf.service

This asciinema recording demonstrates the initial onboarding of all AWS regions, whereas this one shows the declarative behaviour when random regions are deleted from the Ansible hosts file.

Grafana – World Map plugin

At this point, Telegraf sends metrics for every host with a geohash tag to the InfluxDB of your TIG-Stack.

pi@grafanapi:~ $ influx
> use telegraf
Using database telegraf
> select * from ping where geohash !='' AND time > now() -2m limit 5
name: ping
time average_response_ms geohash host location maximum_response_ms minimum_response_ms packets_received packets_transmitted percent_packet_loss result_code standard_deviation_ms url
---- ------------------- ------- ---- -------- ------------------- ------------------- ---------------- ------------------- ------------------- ----------- --------------------- ---
1586155683000000000 260.195 6gyc grafanapi 260.416 259.98 4 4 0 0 0.532 ec2.sa-east-1.amazonaws.com
1586155683000000000 199.53 9qcp grafanapi 223.627 191.327 4 4 0 0 13.916 ec2.us-west-1.amazonaws.com
1586155683000000000 112.548 dpjn grafanapi 112.679 112.481 4 4 0 0 0.249 ec2.us-east-2.amazonaws.com
1586155683000000000 20.428 gcpu grafanapi 20.65 20.278 4 4 0 0 0.145 ec2.eu-west-2.amazonaws.com
1586155683000000000 261.581 wecp grafanapi 262.124 261.21 4 4 0 0 0.335 ec2.ap-east-1.amazonaws.com

Now it’s time to get the Grafana Worldmap panel installed.

grafana-cli plugins install grafana-worldmap-panel
sudo service grafana-server restart

A simple Grafana dashboard to start with can be found in the Git repo, but here are the steps to create it on your own. Just add a new panel and open the (Add) Query configuration.

Toggle to Text edit mode and enter the query.

SELECT percent_packet_loss AS "metric" FROM "ping" WHERE ("geohash" != '') AND time > now() - 2m GROUP BY "geohash", "url" LIMIT 1

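It selects one recent percent_packet_loss value per geohash/url series from the last two minutes and renames it to ‘metric’. Strictly speaking, LIMIT 1 returns the oldest point in that window, since InfluxQL sorts by ascending time by default; add ORDER BY time DESC before LIMIT 1 if you need the newest one. The table-formatted result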

name: ping
tags: geohash=6gyc, url=ec2.sa-east-1.amazonaws.com
time                  metric
2020-04-06T06:52:03Z  0

name: ping
tags: geohash=9qcp, url=ec2.us-west-1.amazonaws.com
time                  metric
2020-04-06T06:52:03Z  0
(…)

feeds into the Worldmap plugin, which can be selected under the Visualization tab. The Map Visual Options are pretty self-explanatory; only the Map Data Options need some attention. We expect the Location Data in table format with the following Field Mapping:

Give the panel a meaningful title under the ‘General’ Tab and there you go:

AWS Regions – Packet loss

Closing

If the playbook runs regularly as a cron job (every hour, for instance), you now have a fully automated World Map dashboard reflecting all the hosts in your Ansible inventory that carry an additional geohash value.
