Monitor Cisco NX-OS/ACI via SNMP and the TIG-Stack

I know, even Cisco NX-OS has a REST-API and Streaming Telemetry these days.
But you, or established processes in your organisation, might find it helpful to handle all switch ‘Telemetry’ in the same way, using good old per-device SNMP polling. A quick poll* on the Twitters seems to validate that ~80% of production network metrics are still SNMP anyway.

*See what I did there?

This post features a Telegraf configuration to pipe common SNMP statistics from Cisco NX-OS and even ACI mode switches to an InfluxDB, as well as a basic per-device Grafana dashboard to start with. The device onboarding will of course be automated by a declarative Ansible playbook and a Jinja2 template, to get rid of this tedious task in day-to-day operations.

Prerequisites

Telegraf / InfluxDB / Grafana Installation

First of all, you need a working TIG-Stack (Telegraf, InfluxDB, Grafana). The installation is pretty simple and not in the scope of this blog post. A single Linux instance of your choice is sufficient for the first steps. In addition, an Ansible installation is needed to automate the onboarding process.

Syslog processing

As the switch dashboard expects syslog messages to be stored in the InfluxDB, please visit my blog post if you’d like to use this feature as well.

Device configuration

This post assumes that the device configuration regarding SNMPv3 and Syslog has already been done during Day-0 provisioning. But you can add the following tasks to the Ansible playbook if the config management for the TIG-Stack monitoring should be in one place.

- name: SNMPv3 USER !Not idempotent because of priv credentials!
  nxos_snmp_user:
    user: "{{ snmpv3_name }}"
    group: network-operator
    authentication: "{{ snmpv3_authprot }}"
    pwd: "{{ snmpv3_authpw }}"
    encrypt: yes
    privacy: "{{ snmpv3_privpw }}"
  when: ansible_network_os == 'nxos'

- name: SYSLOG CONFIGURATION
  nxos_config:
    lines:
      - logging server {{ telegrafhost }} 5 use-vrf management
      - logging source-interface mgmt0
      - logging message interface type ethernet description
  when: ansible_network_os == 'nxos'

Telegraf Configuration

All the relevant files can be found in this GitHub project. If you take a look at NXOS_SNMP_Grafana.j2, you will find the templated Telegraf configuration file to query one host via SNMPv3. As I don’t like to waste database space, only the relevant metrics are polled, not a full IF-MIB table for instance.

Measurement | Tags | Fields
--- | --- | ---
snmp | source (hostname via RFC1213-MIB::sysName.0) | uptime, CPU1m, MemUsed, MemFree
FRU | source | FAN_Status, PSU_Status
IF | source | ifOperStatusCause, ifHighSpeed, ifAdminStatus, ifOperStatus, ifHCInOctets, ifHCOutOctets, ifInDiscards, ifOutDiscards, ifInErrors, ifOutErrors, ifInUnknownProtos

Metrics written to the InfluxDB

The [inputs.snmp.tagpass] at the end allows only metrics of physical Interfaces to pass, not SVI, portchannel or loopback Interfaces. You might have to adjust this configuration according to your needs.

Ansible playbook

Alright, now to the real magic here. We want our network devices to onboard to the TIG-Stack automatically. The Ansible inventory makes an ideal source of truth, as we only need the management IP of every host, and this file should be up to date anyway when you provision and configure your network devices with Ansible.

[NXOS]
dcwest-core001 ansible_host=172.16.0.1
dcwest-core002 ansible_host=172.16.0.2
dcwest-sw001 ansible_host=172.16.1.1
...

The playbook provides a convenient way to generate Telegraf configuration files per device and stores them in the /etc/telegraf/telegraf.d directory of the TIG Host.

---
- name: TELEGRAF CONFIG
  hosts: all
  ignore_errors: no
  gather_facts: no
  vars:
    telegrafhost: <Telegraf Instance by FQDN/IP>
    snmpv3_name: telegraf
    snmpv3_authprot: MD5
    snmpv3_authpw: <SECRET>
    snmpv3_privprot: AES
    snmpv3_privpw: <SECRET>

  tasks:

  - name: TEMPLATE OUT TELEGRAF.CONF FILES
    template:
      src: nxos_snmp_telegraf.j2
      dest: /etc/telegraf/telegraf.d/{{inventory_hostname}}_snmp.conf
    delegate_to: "{{ telegrafhost }}"
    notify: RELOAD TELEGRAF
    when: ansible_network_os == 'nxos'

  - name: READ LIST OF ALL TELEGRAF.CONF FILES IN DIRECTORY
    find:
      paths: /etc/telegraf/telegraf.d/
      file_type: file
      recurse: no
      patterns: "*snmp.conf"
    delegate_to: "{{ telegrafhost }}"
    register: files_matched
    run_once: true

  - name: DELETE STALE TELEGRAF.CONF FILES
    file:
      path: "{{ item.path }}"
      state: absent
    loop: "{{ files_matched.files|flatten(levels=1) }}"
    loop_control:
      label: "{{ item.path }}"
    delegate_to: "{{ telegrafhost }}"
    when: (item.path | basename | regex_replace('_snmp.conf') not in ansible_play_hosts_all)
    notify: RELOAD TELEGRAF
    run_once: true

  handlers:

  - name: RELOAD TELEGRAF
    shell: sudo systemctl reload telegraf
    delegate_to: "{{ telegrafhost }}"
    run_once: true

At the beginning, a vars section specifies the remote Telegraf host and all the SNMPv3 parameters needed. This is the place to adapt to your local environment.

The first task uses the Jinja2 template to generate the Telegraf configuration file on the remote host, adding an _snmp.conf suffix to tell it apart from other files. Only devices with nxos as the specified ansible_network_os in the inventory will be used. This task is idempotent, meaning that changes only happen if a new host has to be deployed or there are changes to the central Jinja2 template.
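For that check to work, every host needs ansible_network_os defined somewhere. The INI inventory above does not show it, so here is a minimal sketch of one way to set it, assuming a group_vars file named after the [NXOS] inventory group. The file name and its location next to the inventory are assumptions, not something taken from the project.

```yaml
---
# group_vars/NXOS.yml (assumed file name, matching the [NXOS] inventory group)
# Sets the variable that the playbook's "when" conditions compare against.
ansible_network_os: nxos
```

Hosts outside this group would need their own value, otherwise the when condition fails on an undefined variable.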

The next two tasks render this playbook declarative. A list of files in the telegraf.d directory with the suffix snmp.conf will be compared with all the hosts in the current play (ansible_play_hosts_all) and stale entries will be deleted.
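If you would like to preview what the cleanup is going to remove, a debug task can apply the same comparison without deleting anything. This is a sketch and not part of the original playbook: it assumes it is placed right after the find task, so that files_matched is already registered, and that your Jinja2 version provides the in test (2.10 or later).

```yaml
  # Sketch only: place after "READ LIST OF ALL TELEGRAF.CONF FILES IN DIRECTORY"
  - name: SHOW STALE TELEGRAF.CONF FILES (NOTHING IS DELETED)
    debug:
      # Strip the suffix from each file name and keep the names that are
      # no longer among the hosts of the current play
      msg: >-
        {{ files_matched.files
           | map(attribute='path')
           | map('basename')
           | map('regex_replace', '_snmp.conf', '')
           | reject('in', ansible_play_hosts_all)
           | list }}
    run_once: true
```

An empty list means the telegraf.d directory already matches the inventory.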

Automate all the things

Of course you can just run this playbook by hand using something like this:

$ ansible-playbook -k nxos_telegraf_snmp.yml 

But there are also several ways to automate the process. If your inventory lives on your Ansible control node as /etc/ansible/hosts, for instance, you can set up a cron job that executes the playbook regularly. Changes to the local inventory will then be reflected by the files in the telegraf.d directory over time.
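Such a cron job can itself be managed with Ansible's cron module, so the schedule stays in code as well. The following is a minimal sketch under a few assumptions: the playbook path /opt/netops/nxos_telegraf_snmp.yml, the 30 minute interval and the log file are made up for illustration. Note that the -k option from the manual run is dropped, because cron cannot answer a password prompt, so SSH key authentication towards the Telegraf host is assumed.

```yaml
---
# Sketch only: paths and schedule are assumptions, adjust to your environment.
- name: SCHEDULE TELEGRAF ONBOARDING
  hosts: localhost
  connection: local
  gather_facts: no

  tasks:

  - name: RUN THE ONBOARDING PLAYBOOK EVERY 30 MINUTES
    cron:
      # The name identifies the crontab entry, so re-runs update it in place
      name: telegraf-onboarding
      minute: "*/30"
      # No -k here: SSH keys are assumed instead of a password prompt
      job: "ansible-playbook /opt/netops/nxos_telegraf_snmp.yml >> /var/log/telegraf_onboarding.log 2>&1"
```

The entry lands in the crontab of the user running this play, which should be the same user that normally runs the onboarding playbook.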

The NetDevOps style would be to host your inventory in a central repository like GitHub / GitLab. You can then use GitHub Actions or a GitLab CI pipeline to trigger a playbook run on your control node, or even in a virtualenv anywhere in your infrastructure, every time a change to the inventory is pushed to your repository.
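For the GitLab flavour, a pipeline definition could look roughly like this. It is a sketch under stated assumptions: a GitLab runner with Ansible installed and registered with the tag ansible, the inventory committed as a file called hosts in the repository root, and the playbook stored in the same repository. None of these names come from the project itself.

```yaml
# .gitlab-ci.yml - sketch only; the inventory file name "hosts", the runner
# tag "ansible" and the playbook location are assumptions.
stages:
  - deploy

telegraf_onboarding:
  stage: deploy
  tags:
    - ansible
  rules:
    # Run only when the inventory file changes on the default branch
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      changes:
        - hosts
  script:
    - ansible-playbook -i hosts nxos_telegraf_snmp.yml
```

Restricting the job to inventory changes keeps the pipeline quiet for unrelated commits, while every added or removed host still triggers a playbook run.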

Of course you might need a solution to store the credentials of the remote telegrafhost, something along the lines of SSH keys, Ansible Vault or GitHub secrets.
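The same goes for the SNMPv3 passwords in the vars section of the playbook. With Ansible Vault, one possible layout is sketched below. The file name vault.yml and the vault_ prefixed variable names are assumptions for illustration. The file would be created with ansible-vault create vault.yml and contain the two secrets.

```yaml
---
# Sketch only: vault.yml and the vault_* variable names are assumptions.
- name: TELEGRAF CONFIG
  hosts: all
  gather_facts: no
  vars_files:
    # Encrypted file holding vault_snmpv3_authpw and vault_snmpv3_privpw
    - vault.yml
  vars:
    snmpv3_authpw: "{{ vault_snmpv3_authpw }}"
    snmpv3_privpw: "{{ vault_snmpv3_privpw }}"
  # remaining vars, tasks and handlers stay as in the playbook above
```

A run then needs the vault password, for example via ansible-playbook --ask-vault-pass or a password file in an automated setup.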

Grafana Dashboard

The dashboard called Cisco_NXOS_Dashboard.json visualizes all the metrics and uses variables (Host, ifName) to switch between devices or focus on some interfaces only. That way you don’t have to touch the Grafana side of the house when new devices or line cards / FEX are added. Speaking of Fabric Extenders: they are monitored under the parent switch as well, including PSU and FAN.

I plan to add more dashboards we use in production, because a simple device dashboard can only be the start of the exciting Grafana journey.

8 thoughts on “Monitor Cisco NX-OS/ACI via SNMP and the TIG-Stack”

  1. This is great, I really appreciate your JSON telegraf configuration:
    https://grafana.com/grafana/dashboards/12432

    I was also able to use the basic one here for basic system monitoring:
    https://computingforgeeks.com/monitor-linux-system-with-grafana-and-telegraf/

    I’m struggling with the telegraf configuration spewing invalid outputs, however. I get a cpu rating of 1260i or something. Any idea what the issue could be? I already made sure that I have every MIB installed for Cisco, I can use the same credentials for snmpwalk, and all OIDs can be translated via snmptranslate. It seems to pull and graph fine but all the data is invalid. I get IPs, no hostnames (although they pull via SNMPWALK), and metrics that end with the letter “i”.

  2. Ok, so just to be sure, you try to monitor a Cisco NX-OS or ACI mode Switch, right ? Because your working example from computingforgeeks is for Linux hosts only.

    You don’t need to import Cisco MIBs; I tried to get along with the standard MIB tree and use the specific NX-OS OIDs where needed. If you use the Telegraf.conf generated by Ansible or the template from https://grafana.com/grafana/dashboards/12476, the agent_host is always the IP of the switch, so that’s correct. The letter i marks an integer value in the InfluxDB line protocol, so that might be OK as well.

    Tip: use telegraf --test --config to test your SNMP polling, without writing to the InfluxDB.

  3. Oh yeah, I’m trying your Telegraf configuration on a 7010 and 5K using NX-OS and a 93180YC-EX and 9508 running ACI. Trying to get a picture of a metrics solution across the platforms. Something that isn’t usually so easy. That is why I liked the brilliance of the TIG-solution. That I may adjust later as necessary to get meaningful metrics for either platform.

    Anyway, my point was that the TIG solution is working for two examples, both your ACI-only dashboard and for monitoring a few Linux hosts using that solution.

    As for the test on my little lab setup, I’m getting these odd results:
    [root@labserver ~]# telegraf --test --config /etc/telegraf/telegraf.d/snmp.conf

    2020-06-23T10:43:58Z I! Starting Telegraf 1.14.4
    > snmp,agent_host=10.189.14.17,host=labserver.mydomain.com,source=213073 CPU1m=213070i,MemFree=213072i,MemUsed=213071i,hostname=213068i,uptime=213069i 1592909039000000000
    > snmp,agent_host=10.189.14.32,host=labserver.mydomain.com,source=213101 CPU1m=213098i,MemFree=213100i,MemUsed=213099i,hostname=213096i,uptime=213097i 1592909039000000000
    > snmp,agent_host=10.72.68.7,host=labserver.mydomain.com,source=213151 CPU1m=213148i,MemFree=213150i,MemUsed=213149i,hostname=213146i,uptime=213147i 1592909039000000000
    > snmp,agent_host=10.72.68.8,host=labserver.mydomain.com,source=213431 CPU1m=213428i,MemFree=213430i,MemUsed=213429i,hostname=213426i,uptime=213427i 1592909039000000000
    > snmp,agent_host=10.16.5.4,host=labserver.mydomain.com,source=213795 CPU1m=213792i,MemFree=213794i,MemUsed=213793i,hostname=213790i,uptime=213791i 1592909039000000000
    > snmp,agent_host=10.16.5.5,host=labserver.mydomain.com,source=214460 CPU1m=214457i,MemFree=214459i,MemUsed=214458i,hostname=214455i,uptime=214456i 1592909039000000000

    I suppose what got me thinking about conversions is your comment on your other dashboard blog:
    “Fun fact: The Telegraf JSON input accepts only numbers by default and stores them as float metric fields! As the APIC returns all JSON elements as strings, guess how long I tried to figure out why Telegraf is sending no data to the InfluxDB (without any helpful error message) … nice one!”

    That got it in my head that perhaps I was seeing the wrong format, and other reading told me that telegraf --test does not run the converter processors.

    1. My comment about conversion is only valid for the ACI-API use case; the SNMP polling follows a different mechanism and is (usually) pretty straightforward.

      Your Telegraf output shows a metric called hostname, which is not used in my configuration templates, so I think there may be something wrong with the configuration file.
      It has to look like the example from the dashboard documentation at https://grafana.com/grafana/dashboards/12476.

      Just tested it for the snmp measurement part with an N9K:

      2020-06-23T13:24:45Z I! Starting Telegraf 1.13.4
      snmp,agent_host=10.10.10.10,host=myhost.local,source=nxos1 CPU1m=3i,MemFree=18726044i,MemUsed=5907436i,uptime=33482249i 1592918686000000000

      [[inputs.snmp]]
      agents = [ "10.10.10.10" ]
      timeout = "5s"
      retries = 3
      version = 3
      sec_name = ""
      auth_protocol = "SHA"
      auth_password = ""
      sec_level = "authPriv"
      priv_protocol = "AES"
      priv_password = ""

      [[inputs.snmp.field]]
      name = "uptime"
      oid = "DISMAN-EVENT-MIB::sysUpTimeInstance"
      [[inputs.snmp.field]]
      name = "CPU1m"
      oid = "1.3.6.1.4.1.9.9.109.1.1.1.1.7.1"
      [[inputs.snmp.field]]
      name = "MemUsed"
      oid = "1.3.6.1.4.1.9.9.109.1.1.1.1.12.1"
      [[inputs.snmp.field]]
      name = "MemFree"
      oid = "1.3.6.1.4.1.9.9.109.1.1.1.1.13.1"
      [[inputs.snmp.field]]
      oid = "RFC1213-MIB::sysName.0"
      name = "source"
      is_tag = true

  4. Ah, well that is explained by me going off on a tangent because I *felt* the “source” was supposed to be text (despite it being obfuscated in the screenshot) and so I tried to duplicate that into a hostname. With hundreds of devices I was looking to see a hostname in the dashboard drop-down instead of an IP address.

    Anyway, I took your suggestion and tried to duplicate your simpler test with the smaller config file above:

    [root@metrichost telegraf.d]# telegraf -test -debug -config /etc/telegraf/telegraf.d/snmp_blog.conf
    2020-06-23T14:46:02Z I! Starting Telegraf 1.14.4
    2020-06-23T14:46:02Z D! [agent] Initializing plugins
    2020-06-23T14:46:02Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "DISMAN-EVENT-MIB::sysUpTimeInstance"
    2020-06-23T14:46:02Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "-m" "all" "1.3.6.1.4.1.9.9.109.1.1.1.1.7.1"
    2020-06-23T14:46:02Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "-m" "all" "1.3.6.1.4.1.9.9.109.1.1.1.1.12.1"
    2020-06-23T14:46:02Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "-m" "all" "1.3.6.1.4.1.9.9.109.1.1.1.1.13.1"
    2020-06-23T14:46:02Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "RFC1213-MIB::sysName.0"
    > snmp,agent_host=10.4.6.56,host= metrichost.domain.com,source=258098 CPU1m=258095i,MemFree=258097i,MemUsed=258096i,uptime=258094i 1592923563000000000

    Those results look much more like yours. So I tried it again with the original config file and also got similar results:

    [root@metricserver telegraf.d]# telegraf -test -debug -config /etc/telegraf/telegraf.d/snmp_metrics.conf
    2020-06-23T15:05:15Z I! Starting Telegraf 1.14.4
    2020-06-23T15:05:15Z D! [agent] Initializing plugins
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "-m" "all" "1.3.6.1.4.1.9.9.117.1.4.1.1.1"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "-m" "all" "1.3.6.1.4.1.9.9.117.1.1.2.1.2"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifName"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifAlias"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "-m" "all" "1.3.6.1.4.1.9.9.276.1.1.2.1.10"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifHighSpeed"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifAdminStatus"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifOperStatus"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifHCInOctets"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifHCOutOctets"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifInDiscards"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifOutDiscards"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifInErrors"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifOutErrors"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "IF-MIB::ifInUnknownProtos"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "DISMAN-EVENT-MIB::sysUpTimeInstance"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "-m" "all" "1.3.6.1.4.1.9.9.109.1.1.1.1.7.1"
    2020-06-23T15:05:15Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "-m" "all" "1.3.6.1.4.1.9.9.109.1.1.1.1.12.1"
    2020-06-23T15:05:16Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "-m" "all" "1.3.6.1.4.1.9.9.109.1.1.1.1.13.1"
    2020-06-23T15:05:16Z D! [inputs.snmp] executing "snmptranslate" "-Td" "-Ob" "RFC1213-MIB::sysName.0"
    > snmp,agent_host=10.40.5.5,host=metricserver.domain.com,source=258423 CPU1m=258420i,MemFree=258422i,MemUsed=258421i,uptime=258419i 1592924716000000000
    > snmp,agent_host=10.40.5.6,host=metricserver.domain.com,source=255238 CPU1m=255235i,MemFree=255237i,MemUsed=255236i,uptime=255234i 1592924716000000000
    [root@ metricserver telegraf.d]#

    Now I just need to find out why my numbers look unreasonable such as 258,000% CPU and 3 days of uptime instead of the 264 shown in the CLI. I expect that I just need to update the OIDs for a Nexus 93180YC-EX switch. Thank you for the help and your blog posts with the dashboards. So much help to have a place to start.

  5. @NWMichl any idea why snmpget is failing for a few OIDs and interfaces, while it is successful for others?

    >>:/etc/telegraf/telegraf.d$ snmpget -v 3 10.11.166.177 -u telegraf -x AES -X xxxxxxxx -a SHA -A xxxxxxx -l authPriv 1.3.6.1.4.1.9.9.109.1.1.1.1.12.1
    iso.3.6.1.4.1.9.9.109.1.1.1.1.12.1 = Gauge32: 58791
    >>:/etc/telegraf/telegraf.d$ snmpget -v 3 10.11.166.177 -u telegraf -x AES -X xxxxxxxxx -a SHA -A xxxxxxxxx -l authPriv 1.3.6.1.4.1.9.9.109.1.1.1.1.13.1
    iso.3.6.1.4.1.9.9.109.1.1.1.1.13.1 = Gauge32: 5115
    >>:/etc/telegraf/telegraf.d$ snmpget -v 3 10.11.166.177 -u telegraf -x AES -X xxxxxxxxx -a SHA -A xxxxxxxxx -l authPriv 1.3.6.1.4.1.9.9.117.1.4.1.1.1
    iso.3.6.1.4.1.9.9.117.1.4.1.1.1 = No Such Instance currently exists at this OID
    >>:/etc/telegraf/telegraf.d$ snmpget -v 3 10.11.166.177 -u telegraf -x AES -X xxxxxxxxx -a SHA -A xxxxxxxxx -l authPriv 1.3.6.1.4.1.9.9.117.1.1.2.1.2
    iso.3.6.1.4.1.9.9.117.1.1.2.1.2 = No Such Instance currently exists at this OID
    >>:/etc/telegraf/telegraf.d$ snmpget -v 3 10.11.166.177 -u telegraf -x AES -X xxxxxxxxx -a SHA -A xxxxxxxxx -l authPriv IF-MIB::ifName
    IF-MIB::ifName = No Such Object available on this agent at this OID

    1. Hi,

      it might be that these particular OIDs are not supported on your platform. These little guys vary from time to time (with different software versions, for instance).
      Regarding the IF-MIB, I think you should give snmpwalk a try, because IF-MIB::ifName isn’t ONE object but a table with plenty of elements.

      Regards,
      Michael
