Monitor Cisco ACI via REST-API

Modern controller-based networks are quite different from a monitoring perspective: all the fancy network abstraction information is hiding behind this thing called an API. SNMP might still be there, but it is missing most of the interesting bits like health scores, faults and Tenant/App/Policy-based metrics. And sometimes your legacy, ehm, established NMS has no clue how to query or interpret those programmable interfaces…

The TIG-Stack (Telegraf, InfluxDB, Grafana), with over 200 different input sources, a scalable time series database and a powerful dashboard front-end, comes to the rescue: it has all you need for a single holistic view over the whole infrastructure stack.

Cisco ACI is one of the few ‘API-first’ network solutions on the market, meaning that practically every bit of information is available via the programmable interface. You might still monitor single fabric nodes via established SNMP processes, but this blog post is all about the interesting metrics exposed via the REST API on the controller, named APIC.

Prerequisites

First of all, you need a working TIG-Stack. The installation is pretty simple and not in the scope of this blog post. In addition, your Telegraf host needs curl and, in the case of signature-based authentication, OpenSSL.

sudo yum install curl
sudo yum install openssl

On the APIC side, just create a dedicated Telegraf user with admin role and read-only rights on all security domains. BTW: Thanks to Cisco DevNet for providing an open sandbox APIC to play with! All demo files can be found on a dedicated GitHub project.

User/password based API Call

There are basically two ways to query the API of an APIC: either with user/password or with signature-based authentication. To authenticate via the former, you first have to log in via an HTTP POST request and get back a session cookie with a lifetime of 5 minutes. Subsequent API calls are then authorized by sending this cookie along with the HTTP GET request. For instance:

POST https://sandboxapicdc.cisco.com/api/aaaLogin.json

{
  "aaaUser" : {
    "attributes" : {
      "name" : "telegraf",
      "pwd" : "telegraf"
    }
  }
}

RESPONSE:
{
  "imdata" : [{
      "aaaLogin" : {
        "attributes" : {
          "token" : 
             "GkZl(...)=",
          "refreshTimeoutSeconds" : "300",
          "lastName" : "tele",
          "firstName" : "graf"
        },
        "children" : [{
    (...)
}
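
The login exchange above can be sketched with plain curl. This is only a minimal sketch and not the script used later in this post: the helper names login_payload and extract_token are made up for illustration, the payload builder does no JSON escaping of the credentials, and the sed expression is a naive scrape that assumes the token itself contains no double quotes.

```shell
#!/bin/bash
# Hypothetical helpers (not part of the original scripts).

# Build the JSON body for POST /api/aaaLogin.json.
# Assumption: user and password contain no characters that need JSON escaping.
login_payload() {
  printf '{"aaaUser":{"attributes":{"name":"%s","pwd":"%s"}}}' "$1" "$2"
}

# Read a login response on stdin and print the value of the "token" attribute.
# Newlines are removed first, so a pretty-printed response works as well.
extract_token() {
  tr -d '\n' | sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage against a reachable APIC (commented out, needs network access):
#   token=$(curl -s -k -d "$(login_payload telegraf telegraf)" \
#             https://sandboxapicdc.cisco.com/api/aaaLogin.json | extract_token)
#   curl -s -k -b "APIC-cookie=$token" \
#     https://sandboxapicdc.cisco.com/api/class/fabricHealthTotal.json
```

The second curl call sends the token back as the APIC-cookie cookie, which is what curl's cookie jar options (-c / -b) do automatically in the wrapper script.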

As the Telegraf inputs.http plugin doesn’t handle session cookies, I decided to break out to the Linux shell and use a little script as a wrapper around the API calls with cURL.

apic_query.sh

#!/bin/bash
#
# Invoke: sh apic_query.sh <APIC-FQDN or IP> <API-Operation> <username> <password>
# Example: sh apic_query.sh sandboxapicdc.cisco.com /api/class/fabricHealthTotal.json telegraf telegraf
#
# Note: $RANDOM is a bash feature. If sh is not bash on your system, invoke the script with bash instead.
#

# Assign the script arguments to variables
apic=$1
operation=$2
user=$3
pass=$4

# Create a random cookie filename to avoid race conditions between multiple, concurrent script executions
cookiefilename=apic_cookie_$RANDOM

# APIC login, store the session cookie in /etc/telegraf (XML attribute values have to be quoted)
curl -s -k -d "<aaaUser name='$user' pwd='$pass'/>" -c "/etc/telegraf/$cookiefilename" -X POST "https://$apic/api/mo/aaaLogin.xml" > /dev/null

# APIC query operation using the session cookie
curl -s -k -X GET "https://$apic$operation" -b "/etc/telegraf/$cookiefilename"

# APIC logout (XML payload, so use the .xml endpoint)
curl -s -k -d "<aaaUser name='$user'/>" -X POST "https://$apic/api/mo/aaaLogout.xml" -b "/etc/telegraf/$cookiefilename" > /dev/null

# Remove the session cookie
rm -f "/etc/telegraf/$cookiefilename"

This script can be used by the Telegraf inputs.exec plugin to process the JSON-formatted result. It takes <APIC-FQDN or IP> <API-Operation> <username> <password> as input, queries the API and does all the cookie handling. You may decide to set the user credentials statically inside the shell script instead; restricting read access to the script via Linux file permissions then acts as a small barrier against credential leaks.
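
A variation on that idea is to keep the credentials in a separate file that only the telegraf user can read. A minimal sketch, assuming a hypothetical file /etc/telegraf/apic_credentials with the username on the first line and the password on the second; the file name, its layout and the function name are illustrative assumptions, not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical helper: read APIC credentials from a two-line file.
# Sets the variables apic_user and apic_pass.
# Returns non-zero if the file is unreadable or the password line is missing.
read_credentials() {
  local file=$1
  {
    IFS= read -r apic_user &&
    { IFS= read -r apic_pass || [ -n "$apic_pass" ]; }
  } < "$file"
}

# One-time setup (as root), restricting access to the telegraf user:
#   chown telegraf:telegraf /etc/telegraf/apic_credentials
#   chmod 600 /etc/telegraf/apic_credentials
#
# Inside apic_query.sh, instead of user=$3 and pass=$4:
#   read_credentials /etc/telegraf/apic_credentials || exit 1
#   user=$apic_user
#   pass=$apic_pass
```

This way the password shows up neither in the Telegraf configuration nor in the process list of the host.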

Keep in mind that all user/password-based API calls are rate limited by the NGINX process on the APIC. So, depending on scope and frequency, you are better off with cert-based access. For this demo I chose a query interval of 60s and had no problems, but it’s definitely not suited for massive downloads of interface statistics in a large fabric environment.
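
Since the session cookie is valid for 300 seconds, one way to reduce the number of logins is to reuse the cookie file between runs instead of logging in and out on every poll. The following is only a sketch of the freshness check under stated assumptions: cookie_is_fresh and the fixed cookie path are hypothetical, and whether fewer logins actually avoids the throttling depends on which calls your APIC rate-limits.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the cookie file exists and was written
# less than max_age_min minutes ago (default 4, safely below the 5 minute lifetime).
cookie_is_fresh() {
  local file=$1 max_age_min=${2:-4}
  [ -n "$(find "$file" -mmin "-$max_age_min" 2>/dev/null)" ]
}

# Sketch of how a query script could use it (fixed cookie path, no logout):
#   cookie=/etc/telegraf/apic_cookie_$apic
#   if ! cookie_is_fresh "$cookie"; then
#     curl -s -k -d "<aaaUser name='$user' pwd='$pass'/>" -c "$cookie" \
#       "https://$apic/api/mo/aaaLogin.xml" > /dev/null
#   fi
#   curl -s -k "https://$apic$operation" -b "$cookie"
```

The trade-off is that a fixed cookie path gives up the random filename, so concurrent executions may race when the cookie is being renewed.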

Signature based API Call

Cert-based authentication needs, well, an X.509 certificate, which has to be generated first:

$ openssl req -new -newkey rsa:4096 -days 3650 -nodes -x509 -keyout telegraf.key -out telegraf.crt -subj '/CN=telegraf/O=NWMICHL/C=DE'

No surprises here, but the CN should match the APIC username. Oh, and your compliance regulations may suggest choosing a shorter lifetime in production ;-).
Anyway, the telegraf.key file is your private key; it should be kept secret and is used to sign every API call in order to authenticate the request. The public key has to be installed on the APIC side.

Set the User Certificate Attribute to the certificate DN (telegraf) and copy/paste the output of ‘cat telegraf.crt’ to the APIC (User Certificates, click +).

The Certificate Name should be in the form <username>.crt for the following bash script to work. After submitting, the new certificate should be active.
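
Before uploading, it can be worth confirming that the certificate really belongs to the private key you intend to sign with. A minimal sketch using standard OpenSSL subcommands; the function name key_matches_cert is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the private key and the certificate
# contain the same public key.
key_matches_cert() {
  local keyfile=$1 certfile=$2 from_key from_cert
  # Public key derived from the private key
  from_key=$(openssl pkey -in "$keyfile" -pubout 2>/dev/null) || return 1
  # Public key embedded in the certificate
  from_cert=$(openssl x509 -in "$certfile" -noout -pubkey 2>/dev/null) || return 1
  [ -n "$from_key" ] && [ "$from_key" = "$from_cert" ]
}

# Usage:
#   key_matches_cert telegraf.key telegraf.crt && echo "key and certificate match"
```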

apic_querysig.sh

This script takes <APIC-FQDN or IP> <API-Operation> <username> <private.key filename> as inputs, generates a proper signature and executes the API-Call.

#!/bin/bash
#
# Invoke: sh apic_querysig.sh <APIC-FQDN or IP> <API-Operation> <username> <private.key filename>
# Example: sh apic_querysig.sh sandboxapicdc.cisco.com /api/class/fabricHealthTotal.json telegraf telegraf.key
#
# Note: use an absolute path to the private key file when the script is run by Telegraf.
#

# Variable definition from the script arguments
apic=$1
operation=$2
username=$3
privatekeyfile=$4

# Generate the signature to sign the REST call
#
# echo -n => print out "GET"$operation without newline
# openssl dgst -sha256 -sign $privatekeyfile => generate a SHA-256 signature in binary format by using the private key
# openssl enc -A -base64 => convert binary format to base64 and eliminate newlines by using the -A option
#
sig="$(/bin/echo -n "GET$operation" | openssl dgst -sha256 -sign "$privatekeyfile" | openssl enc -A -base64)"

# Build the HTTP cookie header from various variables
#
# Cookie:
# APIC-Certificate-Algorithm=v1.0;
# APIC-Certificate-DN=uni/userext/user-$username/usercert-$username.crt; # Distinguished Name of the APIC management object where the user's public key is stored
# APIC-Certificate-Fingerprint=fingerprint;
# APIC-Request-Signature=$sig # Generated one step above
#
header="Cookie: APIC-Certificate-Algorithm=v1.0; APIC-Certificate-DN=uni/userext/user-$username/usercert-$username.crt; APIC-Certificate-Fingerprint=fingerprint; APIC-Request-Signature=$sig"

# APIC query operation (-g keeps curl from treating [ ] in a path such as phys-[eth1/35] as a glob range)
curl -s -k -g -X GET "https://$apic$operation" -H "$header"
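
To rule out a local problem before blaming the APIC, the signature can be checked offline against the certificate. This is a sketch under assumptions: sign_request and verify_signature are hypothetical helper names, the signed payload is the HTTP method followed by the path exactly as in the script above, and a successful local check only shows that key, certificate and payload are consistent, not that the APIC will accept the request.

```shell
#!/bin/bash
# Hypothetical helper: sign "<METHOD><path>" with the private key and print
# the signature as a single base64 line (same pipeline as in apic_querysig.sh).
sign_request() {
  local method=$1 path=$2 keyfile=$3
  printf '%s%s' "$method" "$path" \
    | openssl dgst -sha256 -sign "$keyfile" \
    | openssl enc -A -base64
}

# Hypothetical helper: check a base64 signature against the public key
# stored in the certificate. Succeeds only if the signature is valid.
verify_signature() {
  local method=$1 path=$2 certfile=$3 sig=$4
  local tmpdir status
  tmpdir=$(mktemp -d) || return 1
  if openssl x509 -in "$certfile" -noout -pubkey > "$tmpdir/pub.pem" 2>/dev/null \
     && printf '%s' "$sig" | openssl enc -d -A -base64 > "$tmpdir/sig.bin" 2>/dev/null \
     && printf '%s%s' "$method" "$path" \
          | openssl dgst -sha256 -verify "$tmpdir/pub.pem" -signature "$tmpdir/sig.bin" > /dev/null 2>&1
  then
    status=0
  else
    status=1
  fi
  rm -rf "$tmpdir"
  return "$status"
}

# Usage:
#   sig=$(sign_request GET /api/class/fabricHealthTotal.json /etc/telegraf/telegraf.key)
#   verify_signature GET /api/class/fabricHealthTotal.json telegraf.crt "$sig" && echo "signature OK"
```

An empty signature usually means that openssl could not read the key file, for example because of a relative path.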

It took me quite some time to figure this out, because the Cisco Configuration Guide is incorrect regarding the URL/path, and I only found one other example out on the Interwebs, using Postman and with incomplete documentation of the HTTP header/cookie usage …
Here is what a test run looks like in bash (JSON prettified, of course):

$ sh apic_querysig.sh sandboxapicdc.cisco.com /api/class/infraWiNode.json telegraf telegraf.key

{
   "totalCount":"1",
   "imdata":[
      {
         "infraWiNode":{
            "attributes":{
               "addr":"10.0.0.1",
               "adminSt":"in-service",
               "annotation":"",
               "apicMode":"active",
               "chassis":"10220833-ea00-3bb3-93b2-ef1e7e645889",
               "childAction":"",
               "cntrlSbstState":"approved",
               "dn":"topology/pod-1/node-1/av/node-1",
               "extMngdBy":"",
               "failoverStatus":"idle",
               "health":"fully-fit",
               "id":"1",
               "lcOwn":"local",
               "mbSn":"TEP-1-1",
               "modTs":"2020-04-18T05:24:07.722+00:00",
               "monPolDn":"uni/fabric/monfab-default",
               "mutnTs":"2020-04-18T05:23:21.053+00:00",
               "name":"",
               "nameAlias":"",
               "nodeName":"apic1",
               "operSt":"available",
               "podId":"1",
               "routableIpAddr":"0.0.0.0",
               "status":"",
               "targetMbSn":"",
               "uid":"0"
            }
         }
      }
   ]
}

Telegraf Configuration

These two shell scripts are stored in the /etc/telegraf/ folder and can now be invoked by the Telegraf inputs.exec plugin. Make sure to use absolute paths in the Telegraf configuration, both for the script and for the private key file; otherwise the signature-based script fails without a helpful error message from Telegraf. Every API call gets a dedicated Telegraf configuration file in the /etc/telegraf/telegraf.d/ directory, because the API path/operation differs depending on the requested information.

Example to query the Total System Health Score (ACI_SystemHealth.conf):

[[inputs.exec]]
  name_override = "ACI_SystemHealth"
  commands = ["sh /etc/telegraf/apic_query.sh sandboxapicdc.cisco.com /api/class/fabricHealthTotal.json telegraf telegraf"]
  timeout = "10s"
  data_format = "json"

  json_query = "imdata"
  tag_keys = ["fabricHealthTotal_attributes_dn"]
  json_string_fields = ["*cur"]

  [inputs.exec.tags]
     apic = "sandboxapicdc.cisco.com"

[[processors.converter]]
  namepass = ["ACI_SystemHealth"]

  [processors.converter.fields]
    integer = ["fabricHealthTotal_attributes_cur"]

[[processors.regex]]
  namepass = ["ACI_SystemHealth"]

  [[processors.regex.tags]]
    key = "fabricHealthTotal_attributes_dn"
    pattern = "topology/"
    replacement = ""

From top to bottom:

  • name_override
    Sets a unique measurement name; otherwise it would be ‘exec’ for all queries.
  • commands
    Invokes one of the two scripts, depending on the authentication method.
  • json_query = "imdata"
    Gets rid of the first JSON level of the response.
  • tag_keys
    Identifies the tags to store in the measurement by their full JSON path.
  • json_string_fields
    Identifies string-formatted JSON key/value pairs to store as measurement fields.
  • inputs.exec.tags
    Sets the apic tag to distinguish between multiple APIC clusters / fabrics, if needed.
  • processors.converter.fields
    Converts the string to an integer value to allow numeric database operations (min, max, avg, ...).
  • processors.regex.tags
    Strips the redundant ‘topology/’ prefix from the fabricHealthTotal_attributes_dn tag.

Fun fact: The Telegraf JSON input accepts only numbers by default and stores them as float metric fields! As the APIC returns all JSON elements as strings, guess how long I tried to figure out why Telegraf is sending no data to the InfluxDB (without any helpful error message) … nice one!

To switch between the two authentication methods, just change the shell command:

sh /etc/telegraf/apic_query.sh sandboxapicdc.cisco.com /api/class/fabricHealthTotal.json telegraf telegraf

sh /etc/telegraf/apic_querysig.sh sandboxapicdc.cisco.com /api/class/fabricHealthTotal.json telegraf /etc/telegraf/telegraf.key

Before reloading Telegraf to accept the new configuration, you can test the agent to show what lines will be added to the InfluxDB:

$ telegraf --test --config-directory /etc/telegraf/telegraf.d --input-filter exec
2020-04-19T08:09:32Z I! Starting Telegraf 1.14.0
2020-04-19T08:09:32Z I! Using config file: /etc/telegraf/telegraf.conf
ACI_SystemHealth,apic=sandboxapicdc.cisco.com,fabricHealthTotal_attributes_dn=topology/health,host=localhost.localdomain fabricHealthTotal_attributes_cur="81" 1587283774000000000

Note: The Telegraf processor plugins are not applied in --test mode, so the topology/ prefix is still reported and the health score is a string instead of an integer.

$ sudo systemctl reload telegraf
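
If the test run prints no metric lines at all, Telegraf itself gives little indication of what went wrong inside the script. A small logging wrapper can help in that case. This is a sketch only: run_logged and the log file location are hypothetical, and the logged command line includes all arguments, so do not use it with a password argument on a shared host.

```shell
#!/bin/bash
# Hypothetical debugging wrapper: run a command, pass its stdout through unchanged,
# and append a timestamp, the command line, its stderr and its exit status to a log file.
run_logged() {
  local logfile=$1 status
  shift
  printf '%s running: %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$*" >> "$logfile"
  if "$@" 2>> "$logfile"; then
    status=0
  else
    status=$?
  fi
  printf 'exit status: %s\n' "$status" >> "$logfile"
  return "$status"
}

# Usage by hand:
#   run_logged /tmp/apic_debug.log sh /etc/telegraf/apic_querysig.sh \
#     sandboxapicdc.cisco.com /api/class/fabricHealthTotal.json telegraf /etc/telegraf/telegraf.key
```

Because stdout is left untouched, the JSON output can still be inspected or piped on as usual, while OpenSSL or curl error messages end up in the log.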

Grafana Dashboard

The following table summarizes all the details queried by the Telegraf .conf files to populate the Grafana demo dashboard.

Filename              | tags                                         | fields
ACI_APIC.conf         | apic_cluster, name, address                  | health, apicMode, adminSt, operSt, failoverStatus
ACI_Faults.conf       | apic_cluster                                 | crit, maj, minor, warn
ACI_NodeHealth.conf   | apic_cluster, name, oobMgmtAddr, podId, role | state, systemUpTime, current (health)
ACI_SystemHealth.conf | apic_cluster, dn (to separate pods)          | current (health)
ACI_TenantHealth.conf | apic_cluster, name                           | current (health)

The dashboard.json can be downloaded from GitHub and imported via Grafana / click + / choose Import. Just enter a name, select the database source and change the UID if needed.

Closing

I hope these (code) lines get you started on developing your own dashboard ideas and help to integrate relevant ACI metrics into your infrastructure monitoring. Based on the idea behind ACI, namely the possibility to model app-specific network infrastructure, we are now able to provide dedicated full stack dashboards, from the network (App health score, faults, events) up to the application telemetry itself (DB queries, web counter, logs …). You may find interesting metrics in the API Documentation, the ACI ‘visore’ or by checking out the API Inspector during GUI operations.

And please don’t stop there! What about a new Grafana Alert on the Node Health Panel, notifying your team’s Slack channel when a switch drops below 50? The possibilities are endless …

17 thoughts on “Monitor Cisco ACI via REST-API”

    1. Thanks for saying that, happy that it’s helpful!
      One thing I recently noticed: I need to fix the ACI_NodeHealth.conf, because it’s an old version. So stay tuned …

  1. Keep in mind that when you use the apic_querysig.sh method, you must specify absolute paths in the commands section of the Telegraf config; otherwise it won’t work,
    and Telegraf debugging is not very helpful in indicating these errors.

    1. Ah OK, so that was the problem with your Telegraf execution, Federico?
      Thanks for pointing it out, I will highlight this detail in the blog post.

  2. Yes sure, you have to specify the key file with the full path.
    This way:
    commands = ["sh /etc/telegraf/apic_querysig.sh sandboxapicdc.cisco.com /api/class/fabricHealthTotal.json telegraf /etc/telegraf/telegraf.key"]

  3. I just tried to install this and I get no data in influx.
    Basically if I do show measurements on telegraf there are no names with APIC in them.
    I am using the full path to reach apic_querysig.sh and the sample ones made by you.
    I also executed the scripts by hand to test and by hand they work.
    Telegraf reports no error.
    # telegraf --test --config-directory /etc/telegraf/telegraf.d --input-filter exec
    2020-11-14T20:10:02Z I! Starting Telegraf 1.14.3
    2020-11-14T20:10:02Z I! Using config file: /etc/telegraf/telegraf.conf
    root@grafana:/etc/telegraf/telegraf.d#

    This one says nothing.
    I am a bit lost.
    Any hints?

    Thank you.

      1. Hi,

        In the end it was way easier than I thought but I forgot to write back. I was thinking too complicated.
        I added some echo-es >> /tmp/log in the existing bash scripts and saw that I was getting an “invalid signature”.
        After that I narrowed it down to giving it the absolute path toward the telegraf.key file (I had it in /etc/telegraf folder).
        Now all works like a charm.

        Sorry for the trouble,
        Mihai

  4. I also activated debug = true in telegraf and am wondering if that option is just having a psychological effect.
    I get almost no extra logging info rather than just basic so I feel a bit cornered at tracking this down.

    1. Hi Mihai

      If you can be a bit more specific about the >> /tmp/log redirect, maybe I can solve my issue.
      My command is working but DB is not getting any entries.
      How can I check if telegraf is loading the files and telegraf is trying to write the database?

  5. I am trying this
    sh /etc/telegraf/apic_querysig.sh sandboxapicdc.cisco.com /api/node/mo/topology/pod-1/node-101/sys/phys-\[eth1/35\].json telegraf telegraf.key
    But nothing seems to be polled and there is no response to this GET request.
    Please suggest what I am missing here.

    1. I’d check this URI with plain curl first to get verbose logging.
      There are many things that can go south, quite impossible to remote troubleshoot.

      Regards,
      Michael

      1. Hi Michael,
        Simple commands look OK. Do you suggest checking the DB for a new table for writing the other content of the conf file?

        I mean tags, string_Fields, etc. All in all how do I write accurate conf file for a histogram graph?

        ashokd@Ubuntu:/etc/telegraf/telegraf.d$ sudo sh /etc/telegraf/apic_querysig.sh sandboxapicdc.cisco.com /api/node/mo/topology/pod-1/node-103/sys/phys-[eth1/8]/phys.json?query-target=children& telegraf telegraf.key
        [1] 15419
        dgst: Option -sign needs a value
        dgst: Use -help for summary.
        2020-11-18T10:50:55Z I! Starting Telegraf 1.15.3
        2020-11-18T10:50:55Z I! Using config file: /etc/telegraf/telegraf.conf
        2020-11-18T10:50:55Z I! Loaded inputs: disk diskio kernel mem processes swap system cpu
        2020-11-18T10:50:55Z I! Loaded aggregators:
        2020-11-18T10:50:55Z I! Loaded processors:
        2020-11-18T10:50:55Z I! Loaded outputs: influxdb
        2020-11-18T10:50:55Z I! Tags enabled: host=nextgenmonitoring
        2020-11-18T10:50:55Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"nextgenmonitoring", Flush Interval:10s

        ^C2020-11-18T10:53:23Z I! [agent] Hang on, flushing any cached metrics before shutdown
        [1]+ Exit 3 sudo sh /etc/telegraf/apic_querysig.sh sandboxapicdc.cisco.com /api/node/mo/topology/pod-1/node-103/sys/phys-[eth1/8]/phys.json?query-target=children

      2. Do you have telegraf.key in the folder where you are calling apic_querysig.sh from? What if you add the absolute path to telegraf.key?
        The “-sign needs a value” message seems to imply that it did not compute the signature to be sent, and that one needs the key (in my case it was not finding it in the same folder :P)

  6. Hi Michael,

    This time I ran the command from the folder where the sig file is located, but the output seems to be no different.

    ashokd@ubuntu:/etc/telegraf$ sudo sh /etc/telegraf/apic_querysig.sh sandboxapicdc.cisco.com /api/node/mo/topology/pod-1/node-103/sys/phys-[eth1/8]/phys.json?query-target=children& telegraf /etc/telegraf/telegraf.d/telegraf.key
    [2] 6963
    2020-11-18T12:15:30Z I! Starting Telegraf 1.15.3
    2020-11-18T12:15:30Z I! Using config file: /etc/telegraf/telegraf.conf
    2020-11-18T12:15:30Z I! Loaded inputs: diskio kernel mem processes swap system cpu disk
    2020-11-18T12:15:30Z I! Loaded aggregators:
    2020-11-18T12:15:30Z I! Loaded processors:
    2020-11-18T12:15:30Z I! Loaded outputs: influxdb
    2020-11-18T12:15:30Z I! Tags enabled: host=nextgenmonitoring
    2020-11-18T12:15:30Z I! [agent] Config: Interval:1m0s, Quiet:false, Hostname:"nextgenmonitoring", Flush Interval:10s
    ^C2020-11-18T12:15:39Z I! [agent] Hang on, flushing any cached metrics before shutdown

    [2]+ Stopped sudo sh /etc/telegraf/apic_querysig.sh sandboxapicdc.cisco.com /api/node/mo/topology/pod-1/node-103/sys/phys-[eth1/8]/phys.json?query-target=children

  7. Hello Guys,

    Thanks for the hard work. I’ve been using the Telegraf conf since last week (for information: I changed from sh to bash, because the cookie variable does not work with sh in my Telegraf container).
    But I’m totally disappointed with Telegraf. I use the Telegraf Docker image, and since yesterday no metrics are arriving in my Grafana anymore. After some debugging, I noticed that the problem is in Telegraf, since no metrics arrive in InfluxDB anymore.
    I added some bash redirects to validate that each inputs.exec is launched every minute, which is fine. But no output is sent to InfluxDB and I have no idea why… I also validated (with jq) that the APIC response is JSON formatted.

    It seems Telegraf is a mess to debug, I have no idea what to do now ^^ (I’ve tried influxdb 1.16 and 1.16.2)
