Monitor Cisco ACI via REST-API

Modern controller based networks are quite different from a monitoring perspective: all the fancy network abstraction information hides behind this thing called an API. SNMP might still be there, but it is missing most of the interesting bits like health scores, faults and Tenant/App/Policy based metrics. And sometimes your legacy, ehm, established NMS has no clue how to query or interpret those programmable interfaces…

The TIG-Stack (Telegraf, InfluxDB, Grafana) with over 200 different input sources, a scalable time series database and a powerful dashboard front-end comes to the rescue – it has all you need for a single holistic view of the whole infrastructure stack.

Cisco ACI is one of the few ‘API-first’ network solutions on the market, meaning practically every bit of information is available via the programmable interface. You might still monitor individual fabric nodes via established SNMP processes; this blog post, though, is all about the interesting metrics exposed via REST-API on the controller, the APIC.

Prerequisites

First of all, you need a working TIG-Stack; the installation is pretty simple and not in the scope of this blog post. In addition, your Telegraf instance needs curl and, in the case of signature based authentication, OpenSSL.

sudo yum install curl
sudo yum install openssl

On the APIC side, just create a dedicated Telegraf user with the admin role and read-only rights on all security domains. BTW: Thanks to Cisco DevNet for providing an open sandbox APIC to play with! All demo files can be found in a dedicated GitHub project.

User/password based API Call

There are basically two ways to query the API of an APIC – either with user/password or signature based authentication. To authenticate via the former, you first have to log in via an HTTP POST-Request and get back a session cookie with a lifetime of 5 minutes. Subsequent API-Calls are then authorized by sending this cookie in an HTTP GET-Request. For instance:

POST https://sandboxapicdc.cisco.com/api/aaaLogin.json

{
  "aaaUser" : {
    "attributes" : {
      "name" : "telegraf",
      "pwd" : "telegraf"
    }
  }
}

RESPONSE:
{
  "imdata" : [{
      "aaaLogin" : {
        "attributes" : {
          "token" : 
             "GkZl(...)=",
          "refreshTimeoutSeconds" : "300",
          "lastName" : "tele",
          "firstName" : "graf"
        },
        "children" : [{
    (...)
}
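For a quick manual test outside of Telegraf, the same flow can be reproduced with two cURL calls (a minimal sketch using the sandbox credentials from above; the cookie jar file name is arbitrary):

# Login: the APIC returns the session token, curl stores it in a local cookie jar
curl -s -k -X POST -d '{"aaaUser":{"attributes":{"name":"telegraf","pwd":"telegraf"}}}' -c cookie.txt https://sandboxapicdc.cisco.com/api/aaaLogin.json

# Query: the stored APIC-cookie authorizes the request
curl -s -k -X GET -b cookie.txt https://sandboxapicdc.cisco.com/api/class/fabricHealthTotal.json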

As the Telegraf inputs.http plugin doesn’t handle session cookies, I decided to break out to the Linux shell and use a little cURL wrapper script around the API calls.

apic_query.sh

#!/bin/bash
#
# Invoke: sh apic_query.sh <APIC-FQDN or IP> <API-Operation> <username> <password>
# Example: sh apic_query.sh sandboxapicdc.cisco.com /api/class/fabricHealthTotal.json telegraf telegraf
#

# Pipe bash arguments to variables
apic=$1
operation=$2
user=$3
pass=$4

# Create a random cookie filename to avoid race conditions caused by multiple concurrent script executions
cookiefilename=apic_cookie_$RANDOM

# APIC Login and store the session cookie in /etc/telegraf
curl -s -k -d "<aaaUser name=$user pwd=$pass/>" -c "/etc/telegraf/$cookiefilename" -X POST "https://$apic/api/mo/aaaLogin.xml" > /dev/null

# APIC Query Operation using the session cookie
curl -s -k -X GET "https://$apic$operation" -b "/etc/telegraf/$cookiefilename"

# APIC Logout (XML endpoint to match the XML payload)
curl -s -k -d "<aaaUser name=$user/>" -X POST "https://$apic/api/mo/aaaLogout.xml" -b "/etc/telegraf/$cookiefilename" > /dev/null

# Remove session cookie
rm "/etc/telegraf/$cookiefilename"

This script can be used by the Telegraf inputs.exec plugin to process the JSON formatted result. It takes <APIC-FQDN or IP> <API-Operation> <username> <password> as inputs, queries the API and does all the cookie handling. You may decide to hard-code the user credentials inside the shell script instead and use Linux file permissions as a little barrier against credential leaks, as sketched below.
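If you go that route, a minimal sketch of such a barrier is to make the script readable only by the account Telegraf runs under (the telegraf user and the path are assumptions based on a default package install):

sudo chown telegraf:telegraf /etc/telegraf/apic_query.sh
sudo chmod 500 /etc/telegraf/apic_query.sh    # read/execute for the owner only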

Keep in mind that all user/password based API calls are rate limited by the NGINX process on the APIC. So, depending on scope and frequency, you are better off with cert based access. For this demo I chose a query interval of 60s and had no problems, but it’s definitely not suited for massive downloads of interface statistics in a large fabric environment.

Signature based API Call

Cert based authentication needs, well, an X.509 certificate, which is generated first:

$openssl req -new -newkey rsa:4096 -days 3650 -nodes -x509 -keyout telegraf.key -out telegraf.crt -subj '/CN=telegraf/O=NWMICHL/C=DE' 

No surprises here, but the CN should match the APIC username. Oh, and your compliance regulations may suggest a shorter lifetime in production ;-).
Anyway, the telegraf.key file is your private key; it should be kept secret and will be used to sign every API call for authentication. The public key has to be installed on the APIC side.

Set the User Certificate Attribute to the certificate DN (telegraf) and copy/paste the output of ‘cat telegraf.crt’ into the APIC (User Certificates, click +).

The Certificate Name should be in the form <username>.crt for the following bash script to work. After submitting, the new certificate should be active.
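As an optional sanity check (not part of the original workflow), you can verify that the uploaded certificate and the local private key actually belong together – both commands should print the same hash:

openssl x509 -noout -modulus -in telegraf.crt | openssl md5
openssl rsa -noout -modulus -in telegraf.key | openssl md5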

apic_querysig.sh

This script takes <APIC-FQDN or IP> <API-Operation> <username> <private.key filename> as inputs, generates a proper signature and executes the API-Call.

#!/bin/bash
#
# Invoke: sh apic_querysig.sh <APIC-FQDN or IP> <API-Operation> <username> <private.key filename>
# Example: sh apic_querysig.sh sandboxapicdc.cisco.com /api/class/fabricHealthTotal.json telegraf telegraf.key
#

# Variable definition from bash arguments
apic=$1
operation=$2
username=$3
privatekeyfile=$4

# Generate X.509 Signature to sign the REST-Call
#
# echo -n => print out "GET"$operation without newline
# openssl dgst -sha256 -sign $privatekeyfile => generate a sha256 signature in binary format by using the X.509 private key
# openssl enc -A -base64 => convert binary format to base64 and eliminate newlines by using the -A option
#
sig="$(/bin/echo -n "GET$operation" | openssl dgst -sha256 -sign "$privatekeyfile" | openssl enc -A -base64)"

# Build http header cookie from various variables
#
# Cookie:
# APIC-Certificate-Algorithm=v1.0;
# APIC-Certificate-DN=uni/userext/user-$username/usercert-$username.crt; # Distinguished Name of the APIC management object where the user's public key is stored
# APIC-Certificate-Fingerprint=fingerprint;
# APIC-Request-Signature=$sig # Generated one step above
#
header="Cookie: APIC-Certificate-Algorithm=v1.0; APIC-Certificate-DN=uni/userext/user-$username/usercert-$username.crt; APIC-Certificate-Fingerprint=fingerprint; APIC-Request-Signature=$sig"

# APIC Query Operation
curl -s -k -X GET "https://$apic$operation" -H "$header"

It took me quite some time to figure this out, because the Cisco Configuration Guide is incorrect regarding the URL/path, and the only other example I found out on the Interwebs (using Postman) documented the HTTP header / cookie usage incompletely …
Here is how a dry run looks in bash (JSON prettified, of course):

$sh apic_querysig.sh sandboxapicdc.cisco.com /api/class/infraWiNode.json telegraf telegraf.key

{
   "totalCount":"1",
   "imdata":[
      {
         "infraWiNode":{
            "attributes":{
               "addr":"10.0.0.1",
               "adminSt":"in-service",
               "annotation":"",
               "apicMode":"active",
               "chassis":"10220833-ea00-3bb3-93b2-ef1e7e645889",
               "childAction":"",
               "cntrlSbstState":"approved",
               "dn":"topology/pod-1/node-1/av/node-1",
               "extMngdBy":"",
               "failoverStatus":"idle",
               "health":"fully-fit",
               "id":"1",
               "lcOwn":"local",
               "mbSn":"TEP-1-1",
               "modTs":"2020-04-18T05:24:07.722+00:00",
               "monPolDn":"uni/fabric/monfab-default",
               "mutnTs":"2020-04-18T05:23:21.053+00:00",
               "name":"",
               "nameAlias":"",
               "nodeName":"apic1",
               "operSt":"available",
               "podId":"1",
               "routableIpAddr":"0.0.0.0",
               "status":"",
               "targetMbSn":"",
               "uid":"0"
            }
         }
      }
   ]
}

Telegraf Configuration

These two shell scripts can now be invoked by the Telegraf inputs.exec plugin. Every API call gets its own Telegraf configuration file in the /etc/telegraf/telegraf.d/ directory, because the API path / operation differs depending on the requested information.

Example to query the Total System Health Score (ACI_SystemHealth.conf):

[[inputs.exec]]
  name_override = "ACI_SystemHealth"
  commands = ["sh /etc/telegraf/apic_query.sh sandboxapicdc.cisco.com /api/class/fabricHealthTotal.json telegraf telegraf"]
  timeout = "10s"
  data_format = "json"

  json_query = "imdata"
  tag_keys = ["fabricHealthTotal_attributes_dn"]
  json_string_fields = ["*cur"]

  [inputs.exec.tags]
     apic = "sandboxapicdc.cisco.com"

[[processors.converter]]
  namepass = ["ACI_SystemHealth"]

  [processors.converter.fields]
    integer = ["fabricHealthTotal_attributes_cur"]

[[processors.regex]]
  namepass = ["ACI_SystemHealth"]

  [[processors.regex.tags]]
    key = "fabricHealthTotal_attributes_dn"
    pattern = "topology/"
    replacement = ""

From top to bottom:

  • name_override
    Sets a unique measurement name, otherwise it would be ‘exec’ for all queries.
  • commands
    Invokes one of the two scripts, depending on the authentication method.
  • json_query = "imdata"
    Gets rid of the first JSON level of the response.
  • tag_keys
    Identifies the tags to store in the measurement by their full JSON path.
  • json_string_fields
    Identifies string formatted JSON key/value pairs to store as measurement fields.
  • inputs.exec.tags
    Sets the apic tag to distinguish between multiple APIC clusters / fabrics, if needed.
  • processors.converter.fields
    Converts the string into an integer value to allow numeric database operations (min, max, avg, …).
  • processors.regex.tags
    Strips the redundant ‘topology/’ prefix off the fabricHealthTotal_attributes_dn tag.

Fun fact: The Telegraf JSON parser accepts only numbers by default and stores them as float metric fields! As the APIC returns all JSON values as strings, guess how long it took me to figure out why Telegraf was sending no data to InfluxDB (without any helpful error message) … nice one!

To switch between the two authentication methods, just change the shell command:

sh /etc/telegraf/apic_query.sh sandboxapicdc.cisco.com /api/class/fabricHealthTotal.json telegraf telegraf

sh /etc/telegraf/apic_querysig.sh sandboxapicdc.cisco.com /api/class/fabricHealthTotal.json telegraf telegraf.key

Before reloading Telegraf to pick up the new configuration, you can run the agent in test mode to see which lines would be written to InfluxDB:

$ telegraf --test --config-directory /etc/telegraf/telegraf.d --input-filter exec
2020-04-19T08:09:32Z I! Starting Telegraf 1.14.0
2020-04-19T08:09:32Z I! Using config file: /etc/telegraf/telegraf.conf
ACI_SystemHealth,apic=sandboxapicdc.cisco.com,fabricHealthTotal_attributes_dn=topology/health,host=localhost.localdomain fabricHealthTotal_attributes_cur="81" 1587283774000000000

Note: The Telegraf processor plugins do not run in --test mode, so the topology/ prefix is still reported and the health score is a string instead of an integer.

$sudo systemctl reload telegraf
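After the reload, a quick query against InfluxDB confirms that points are actually arriving (a sketch using the InfluxDB 1.x CLI; the database name telegraf is an assumption and depends on your outputs.influxdb configuration):

influx -database telegraf -execute 'SELECT * FROM "ACI_SystemHealth" ORDER BY time DESC LIMIT 3'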

Grafana Dashboard

The following table summarizes all the details queried by the Telegraf .conf files to populate the Grafana demo dashboard.

Filename              | Tags                                          | Fields
ACI_APIC.conf         | apic_cluster, name, address                   | health, apicMode, adminSt, operSt, failoverStatus
ACI_Faults.conf       | apic_cluster                                  | crit, maj, minor, warn
ACI_NodeHealth.conf   | apic_cluster, name, oobMgmtAddr, podId, role  | state, systemUpTime, current (health)
ACI_SystemHealth.conf | apic_cluster, dn (to separate pods)           | current (health)
ACI_TenantHealth.conf | apic_cluster, name                            | current (health)

The dashboard.json can be downloaded from GitHub and imported via Grafana (click +, choose Import). Just enter a name, select the database source and change the UID if needed.

Closing

I hope these (code) lines get you started developing your own dashboard ideas and help you integrate relevant ACI metrics into your infrastructure monitoring. Based on the idea behind ACI, namely the possibility to model app-specific network infrastructure, we are now able to provide dedicated full-stack dashboards – from the network (app health score, faults, events) up to the application telemetry itself (DB queries, web counters, logs …). You may find interesting metrics in the API documentation, the ACI ‘Visore’ object browser or by checking out the API Inspector during GUI operations.

And please don’t stop there! What about a new Grafana alert on the Node Health panel, notifying your team’s Slack channel when a switch drops below a health score of 50? The possibilities are endless …
