Scraping device traffic with snmp_exporter

Introduction

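This note records how to use snmp_exporter together with Prometheus and Grafana to monitor network device traffic in real time.
The company originally relied on LibreNMS for real-time traffic figures, but its default polling interval is 5 minutes, and each poll walks an enormous number of OIDs~
LibreNMS does offer 1-Minute Polling, but every poll is still one long walk that puts unnecessary load on the device, and it can even cause a polling storm (the previous poll has not finished when the next one comes due).
Besides, even if 1-Minute polling worked perfectly, I think its resolution is far from enough for traffic! During a DDoS attack, traffic can spike within seconds; if traffic monitoring is too coarse-grained, it loses its meaning and cannot reflect what is actually happening.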
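To put a number on the granularity argument, here is a minimal sketch with made-up traffic: 300 one-second samples at 100 Mbps with a 10-second burst at 1000 Mbps. The figures and the `window_averages` helper are assumptions for illustration only, not measurements from the devices in this post.

```python
def window_averages(samples_mbps, window):
    """Average consecutive, non-overlapping windows of per-second samples."""
    return [
        sum(samples_mbps[i:i + window]) / window
        for i in range(0, len(samples_mbps), window)
    ]

# Hypothetical traffic: 300 one-second samples at 100 Mbps,
# with a 10-second burst at 1000 Mbps starting at t=120s.
samples = [100.0] * 300
for t in range(120, 130):
    samples[t] = 1000.0

five_minute = window_averages(samples, 300)  # what a single 5-minute poll sees
five_second = window_averages(samples, 5)    # what sixty 5-second scrapes see

print(max(five_minute))  # 130.0
print(max(five_second))  # 1000.0
```

With these numbers the 5-minute average barely moves, while the 5-second windows show the full burst.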

Also, Prometheus and Grafana are both installed with Docker this time; I recommend reading my earlier post first: Linking Docker containers. Grafana + Prometheus + Blackbox_exporter

Installing snmp_exporter

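Prometheus exporters such as snmp_exporter are written in Go, and programs written in Go (the ones I have run into so far, anyway) can usually be run directly as a single binary! (For example Hugo, which works even on the Windows x86 platform.)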

Download the matching version from snmp_exporter - release, extract it, and it is ready to use.

wget https://github.com/prometheus/snmp_exporter/releases/download/v0.20.0/snmp_exporter-0.20.0.linux-amd64.tar.gz
tar -zxf snmp_exporter-0.20.0.linux-amd64.tar.gz
cd snmp_exporter-0.20.0.linux-amd64/

total 17M
-rw-r--r-- 1 3434 3434  12K Feb 12  2021 LICENSE
-rw-r--r-- 1 3434 3434   63 Feb 12  2021 NOTICE
-rwxr-xr-x 1 3434 3434  15M Feb 12  2021 snmp_exporter
-rw-r--r-- 1 3434 3434 1.3M Feb 12  2021 snmp.yml

snmp exporter ships with a default config that contains many ready-made modules. We can use yq to view them easily!

Install `yq`

yq - Install

wget https://github.com/mikefarah/yq/releases/download/v4.27.2/yq_linux_amd64 -O /usr/bin/yq && chmod +x /usr/bin/yq

yq keys snmp.yml

- apcups
- arista_sw
- cisco_wlc
- ddwrt
- if_mib
- infrapower_pdu
- keepalived
- kemp_loadmaster
- liebert_pdu
- mikrotik
- nec_ix
- paloalto_fw
- printer_mib
- raritan
- servertech_sentry3
- servertech_sentry4
- synology
- ubiquiti_airfiber
- ubiquiti_airmax
- ubiquiti_unifi
- wiener_mpod

▲ The official documentation recommends the if_mib module for scraping switches, access points, and routers.

Creating the `systemd` config

snmp_exporter.service

Running snmp exporter as a daemon under systemd makes it easier to use and manage.

useradd prometheus
mkdir -p /home/prometheus/snmp_exporter/
chown prometheus:prometheus -R /home/prometheus/snmp_exporter/
ln -s /root/snmp_exporter-0.20.0.linux-amd64/snmp_exporter /home/prometheus/snmp_exporter/snmp_exporter
ln -s /root/snmp_exporter-0.20.0.linux-amd64/snmp.yml /home/prometheus/snmp_exporter/snmp.yml

Because the default path in the systemd config is under /home/prometheus/snmp_exporter/, ln -s is used to link the files there.

wget https://raw.githubusercontent.com/prometheus/snmp_exporter/main/examples/systemd/snmp_exporter.service -O /usr/lib/systemd/system/snmp_exporter.service
chmod 644 /usr/lib/systemd/system/snmp_exporter.service

ExecStart=/home/prometheus/snmp_exporter/snmp_exporter --config.file=/home/prometheus/snmp_exporter/snmp.yml

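▲ The official systemd config does not work on CentOS 7.9 2009; the '' in --config.file= has to be removed before it runs normally!
(Ugh, this problem cost me five hours! Running the ExecStart command directly in a shell worked fine, whether as root or as prometheus.)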

Here is the error message, in the hope that Google indexes it, ~~to save the world~~

Aug 26 09:43:16 Eric_Prometheus_SNMP systemd[1]: Unit snmp_exporter.service entered failed state.                                    
Aug 26 09:43:16 Eric_Prometheus_SNMP systemd[1]: snmp_exporter.service failed.                                                       
Aug 26 09:43:16 Eric_Prometheus_SNMP systemd[1]: snmp_exporter.service holdoff time over, scheduling restart.                        
Aug 26 09:43:16 Eric_Prometheus_SNMP systemd[1]: Stopped SNMP Exporter.                                                              
Aug 26 09:43:16 Eric_Prometheus_SNMP systemd[1]: start request repeated too quickly for snmp_exporter.service                        
Aug 26 09:43:16 Eric_Prometheus_SNMP systemd[1]: Failed to start SNMP Exporter.                                                      
Aug 26 09:43:16 Eric_Prometheus_SNMP systemd[1]: Unit snmp_exporter.service entered failed state.                                    
Aug 26 09:43:16 Eric_Prometheus_SNMP systemd[1]: snmp_exporter.service failed.           

Installing Prometheus via Docker container

Prometheus Installation Document

mkdir -p /root/yml/prometheus/ && cd /root/yml/prometheus/
vim prometheus.yml

global:
  scrape_interval:     5s # Set the scrape interval to every 5 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

scrape_configs:
  - job_name: 'local_prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'snmp'
    static_configs:
      - targets:
        - 192.168.xxx.xxx  # SNMP device.
    metrics_path: /snmp
    params:
      module: [if_mib]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9116  # The SNMP exporter's real hostname:port.

▲ Change the scrape interval to 5s.
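The relabel_configs in the snmp job are easy to misread, so here is a rough sketch of what the three rules do to a target's labels. It only mimics those three rules in plain Python; it is not how Prometheus implements relabeling. The function name `apply_snmp_relabeling` and the address 192.0.2.10 are made up for illustration.

```python
def apply_snmp_relabeling(target_address, exporter_address="127.0.0.1:9116"):
    """Mimic the three relabel rules of the 'snmp' job for one target."""
    labels = {"__address__": target_address}
    # Rule 1: copy __address__ into __param_target (becomes ?target=...).
    labels["__param_target"] = labels["__address__"]
    # Rule 2: copy __param_target into instance, so the metric's instance
    # label shows the device, not the exporter.
    labels["instance"] = labels["__param_target"]
    # Rule 3: overwrite __address__ so the scrape goes to snmp_exporter.
    labels["__address__"] = exporter_address
    return labels

labels = apply_snmp_relabeling("192.0.2.10")  # hypothetical SNMP device
scrape_url = "http://{}/snmp?module=if_mib&target={}".format(
    labels["__address__"], labels["__param_target"]
)
print(labels["instance"])  # 192.0.2.10
print(scrape_url)          # http://127.0.0.1:9116/snmp?module=if_mib&target=192.0.2.10
```

In other words, Prometheus always talks to snmp_exporter on 127.0.0.1:9116, and the device address only travels as a URL parameter and as the instance label.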

docker run -itd --network host --name prometheus -v /root/yml/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml -v prometheus:/prometheus --restart always  prom/prometheus

Use host networking so that the prometheus container shares the host OS's network namespace; the purpose is to keep the snmp_exporter IP address in the prometheus config as 127.0.0.1.
Bind-mount the prometheus config so that it is easy to edit.
Use a volume named "prometheus"; otherwise a volume with a random name is created by default.

Storage-Prometheus

Storage

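Prometheus keeps 15d of data by default; to keep more, pass the --storage.tsdb.retention.time flag at startup.
With Docker it works as follows (assuming the command above has already been run and a Docker container named prometheus already exists).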

## stop and remove docker container
docker stop prometheus && docker rm prometheus

docker run -itd --network host --name prometheus -v /root/yml/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml -v prometheus:/prometheus --restart always  prom/prometheus --storage.tsdb.retention.time=1y --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus

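▲ Besides adding --storage.tsdb.retention.time=1y to change the retention to 1y, --config.file=/etc/prometheus/prometheus.yml must also be added for it to run properly.

Note: without --storage.tsdb.path=/prometheus, the default is data/ (a path relative to the working directory).

Reference; to verify the flags, open http://<ip>:9090/flags in a browser.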

live reload

Since version 2.0, using curl -vX POST http://<ip>:<port>/-/reload to make Prometheus live reload its config file requires starting Prometheus with the --web.enable-lifecycle flag; otherwise you get 403 Forbidden.

Prometheus 2.0 migration guide#prometheus-lifecycle

For ~~lazy me~~ anyone using the official container image directly, this is a bit of a hassle; luckily there is another way, described in Frequently Asked Questions#Can I reload Prometheus’s configuration?: sending SIGHUP.

docker exec prometheus killall -HUP prometheus

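▲ This command does send SIGHUP and makes prometheus live reload its config, but it runs into this Docker issue: File mount does not update with changes from host. The bind mount is tied to the inode of the config file, so when the file on the host is replaced by one with a new inode (as editors such as vim may do when saving), the container keeps seeing the old inode. In plain words, the file content inside the container is not updated! So the reload is useless QQ; just docker restart prometheus instead!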
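The inode behaviour can be reproduced outside Docker. The sketch below compares an in-place write with a "write a new file, then rename it over the old path" save. That an editor saves the second way is an assumption about the editor, not something this post verified; the temporary file name is only a stand-in for the real config.

```python
import os
import tempfile

def inode_of(path):
    """Return the inode number currently behind a path."""
    return os.stat(path).st_ino

with tempfile.TemporaryDirectory() as workdir:
    config = os.path.join(workdir, "prometheus.yml")  # stand-in file
    with open(config, "w") as f:
        f.write("scrape_interval: 15s\n")
    original_inode = inode_of(config)

    # In-place edit: the same file is truncated and rewritten.
    with open(config, "w") as f:
        f.write("scrape_interval: 5s\n")
    inode_after_in_place = inode_of(config)

    # Replace-style save: write a new file, then rename it over the path.
    replacement = config + ".tmp"
    with open(replacement, "w") as f:
        f.write("scrape_interval: 1s\n")
    os.replace(replacement, config)
    inode_after_replace = inode_of(config)

print(inode_after_in_place == original_inode)  # True
print(inode_after_replace == original_inode)   # False
```

A single-file bind mount keeps pointing at the original inode, which is why the second kind of save never shows up inside the container until it is restarted.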

Reverse Proxy Prometheus with Nginx

yum install nginx -y
systemctl enable --now nginx.service
vim /etc/nginx/conf.d/prometheus.conf

server {
    listen 80;
    server_name  example.org;

    location / {
        proxy_pass           http://localhost:9090/;
    }
}

nginx -t
systemctl reload nginx.service

PromQL

Once snmp_exporter is running, open http://<IP>:9116 in a browser to see its simple web page. You can enter the target and the module to use there, or go straight to http://<IP>:9116/snmp?target=<target_IP>&module=<module_name> with a GET request to get the data from the SNMP walk.
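If you would rather build that URL in a script than type it by hand, a small sketch follows. The helper name `snmp_exporter_url` and both addresses are placeholders; only the /snmp path and the target and module parameters come from the URL above.

```python
from urllib.parse import urlencode

def snmp_exporter_url(exporter, target, module="if_mib"):
    """Build the /snmp URL for one target and one module."""
    query = urlencode({"target": target, "module": module})
    return "http://{}/snmp?{}".format(exporter, query)

# Placeholder addresses: exporter on localhost, device at 192.0.2.10.
print(snmp_exporter_url("127.0.0.1:9116", "192.0.2.10"))
# http://127.0.0.1:9116/snmp?target=192.0.2.10&module=if_mib
```

The resulting string can be passed to curl or opened in a browser, the same as the hand-written URL.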

prometheus_ql_0

▲ Copy and paste this information into Prometheus and you can query it.

prometheus_ql_1

▲ ifHCInOctets records the cumulative amount transferred (in bytes).

prometheus_ql_2

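▲ So Prometheus's built-in irate function is needed. irate outputs a per-second rate; [5m] is the range vector selector, which irate requires as an argument (note: in my tests the irate range does not affect precision, since irate only uses the last two samples inside the range). The *8 converts bytes to bits, because the "megs" we talk about are actually Mbps.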
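As a rough illustration of why the range barely matters, here is a sketch of the arithmetic behind irate(...) * 8 on a byte counter. It is a simplification that ignores counter resets and Prometheus's staleness handling, and the sample values are invented.

```python
def irate_bits_per_second(samples):
    """samples: list of (timestamp_seconds, counter_bytes), oldest first.

    Uses only the last two samples, like irate, and converts bytes to bits.
    """
    if len(samples) < 2:
        return None  # irate needs at least two samples in the range
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0) * 8

# Invented ifHCInOctets samples scraped every 5 seconds.
samples = [(0, 1_000_000), (5, 1_500_000), (10, 2_750_000)]

print(irate_bits_per_second(samples))      # 2000000.0
print(irate_bits_per_second(samples[1:]))  # 2000000.0, shorter range, same result
```

The range only needs to be wide enough to contain two samples; anything older is ignored.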

If you are interested in instant vectors or range vectors, see:

Before Grafana

Before moving on to visualization in Grafana, I want to list the metrics first! That makes it clearer what is being monitored and which metrics exist.

Time Prometheus spends on the scrape: scrape_duration_seconds{instance="<instance_IP>", job="snmp"}
Time snmp_exporter spends scraping the target over SNMP: snmp_scrape_duration_seconds{instance="<instance_IP>", job="snmp"} (walk + processing time)
Time snmp_exporter spends on the SNMP walk: snmp_scrape_walk_duration_seconds{instance="<instance_IP>", job="snmp"}
The ifHC cumulative traffic series, recorded with 64-bit counters (avoiding the problems of 32-bit counters, which wrap after only 4GB)
(Continued) ifHC covers more than total traffic; it also counts packets by type, for example ifHCInUcastPkts for unicast packets and ifHCInBroadcastPkts for broadcast packets.
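To see why the 32-bit counters are a problem, the following sketch estimates how long a byte counter lasts before wrapping on a fully loaded link. The 1 Gbps line rate is an assumed example, not a figure from this setup.

```python
def seconds_until_wrap(counter_bits, link_bits_per_second):
    """Seconds until a byte counter of the given width wraps at line rate."""
    max_bytes = 2 ** counter_bits
    bytes_per_second = link_bits_per_second / 8
    return max_bytes / bytes_per_second

one_gbps = 1_000_000_000  # assumed link speed

print(round(seconds_until_wrap(32, one_gbps), 1))  # 34.4
seconds_per_year = 365 * 24 * 3600
print(seconds_until_wrap(64, one_gbps) / seconds_per_year > 4000)  # True
```

At 1 Gbps a 32-bit octet counter wraps roughly every 34 seconds, so slow polling can miss entire wraps, while the 64-bit ifHC counters last for millennia.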

ifhc_series

▲ The ifHC traffic-related series.

compare_with_exist_monition

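▲ Granularity compared with the existing office traffic monitoring. (Although the Port-Channel1 that Prometheus scrapes in this figure is not exactly the WAN traffic of the whole office.)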

Grafana

docker run -itd --network host --name grafana -v grafana-storage:/var/lib/grafana --restart always grafana/grafana

Because our Prometheus Docker container uses the host network type, the Grafana Docker container cannot be linked to it with --link prometheus:prometheus.
If you need to install Grafana plugins => Install official and community Grafana plugins
The default login username and password are both admin. Sign in to Grafana
It uses port 3000 (to change it to 80/tcp, modify the Dockerfile yourself)
Configure a Grafana Docker image

grafana_config_0

▲ Change the scrape interval to 5s, because our prometheus is set to 5s.

With the data source added, let's build the dashboard!

grafana_query_0

▲ To make it easier to override queries later, I recommend giving each query a name!

grafana_query_1

▲ Panel options lets you set the title and description.

grafana_query_2

▲ Legend. You can switch its style and position and choose which values are shown.

grafana_query_3

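▲ Axis controls the axes. Timezone defaults to the browser's time zone. By the way, Prometheus uses UTC by default, and changing that is not recommended.
Scale is the scale on which values are displayed on the y axis (~~hard to translate~~)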

grafana_query_4

▲ The figure above is displayed on a log base 10 scale. (~~I have given my high-school math back to the teacher~~)

grafana_query_5

▲ Unit has many built-in options to choose from; Decimals is the decimal precision (number of digits after the decimal point).

grafana_query_6

▲ Threshold settings.

grafana_query_7

▲ Switching Tooltip to All shows the values of all queries at the same data point.

grafana_query_8

▲ The effect of enabling Shared Tooltip on the dashboard.

Run Grafana behind a reverse proxy

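Grafana runs on 3000/tcp by default and we have not given it a domain name; if you do not want to wrestle with the Grafana config, put an Nginx reverse proxy in front of it!
The official docs provide a plain example and a path-rewrite example (needed for URLs such as http://example.com/grafana).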

yum install nginx -y
systemctl enable --now nginx.service
vim /etc/nginx/conf.d/grafana.conf

# this is required to proxy Grafana Live WebSocket connections.
map $http_upgrade $connection_upgrade {
  default upgrade;
  '' close;
}

upstream grafana {
  server localhost:3000;
}

server {
  listen 80;
  server_name  example.org  www.example.org;
  root /usr/share/nginx/html;
  index index.html index.htm;

  location / {
    proxy_set_header Host $http_host;
    proxy_pass http://grafana;
  }

  # Proxy Grafana Live WebSocket connections.
  location /api/live/ {
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
    proxy_set_header Host $http_host;
    proxy_pass http://grafana;
  }
}

nginx -t
systemctl reload nginx.service

Grafana alert settings

Image from [Grafana.com] Grafana Alerting

Grafana splits alerting into several parts:

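Alert rules: one or more queries/expressions define the alert rule; the evaluation interval and how long the condition must hold before the alert fires (that is, the pending duration) are both set on the rule.
An alert can be global (what the official docs call multi-dimensional) or tied to a single panel.
Labels: tag the alert rule with labels. Notification policies and silences both match on labels.
The ones I can think of so far: severity, team (which team is responsible), IDC (which data center), type (for example: VM, Network)
Notification policies: decide which alerts are sent and where they go (contact point).
Contact points: the channels alerts are delivered through.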
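The relationship between labels, notification policies, and contact points can be sketched as a small matching routine. This is a deliberately simplified model (exact-match labels, first matching policy wins); real Grafana policies form a tree with richer matchers. All policy and contact point names below are invented.

```python
def route_alert(alert_labels, policies, default_contact_point):
    """Return the contact point of the first policy whose labels all match."""
    for policy in policies:
        if all(alert_labels.get(k) == v for k, v in policy["match"].items()):
            return policy["contact_point"]
    return default_contact_point

# Invented policies, ordered from most to least specific.
policies = [
    {"match": {"team": "network", "severity": "critical"},
     "contact_point": "network-oncall"},
    {"match": {"team": "network"}, "contact_point": "network-chat"},
]

alert = {"team": "network", "severity": "critical", "IDC": "tpe", "type": "Network"}
print(route_alert(alert, policies, "default-email"))             # network-oncall
print(route_alert({"team": "vm"}, policies, "default-email"))    # default-email
```

The point is only that the labels attached to a rule decide where its notifications end up, which is why it pays to plan labels such as severity and team up front.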

grafana_alert_0

▲ Switch to the panel's alert tab to add a rule for that single panel.

grafana_alert_1

▲ Before adding a rule, remember to create a Folder first; rules cannot be placed in the default General folder.

grafana_alert_3

▲ Define the alert rule with an Expression. Click run queries to see the alert state.

grafana_alert_4

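▲ Evaluate every 5 seconds; if the alert rule matches for 20 seconds => fire alert
But as the figure shows, 5s cannot actually be set, because Grafana's global evaluation interval is 10s and a rule cannot go lower than that!
Below that, define what the alert state should be on no data / error / timeout.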
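How the evaluation interval and the pending duration interact can be sketched as a tiny state machine. This is a simplified model of the Normal / Pending / Alerting states under the assumption that the condition is checked once per interval; it ignores the no data and error handling. The 10s interval and 20s pending period mirror the values discussed above.

```python
def alert_states(condition_results, evaluation_interval, pending_for):
    """Return the alert state after each evaluation.

    condition_results: one boolean per evaluation (True = rule matched).
    evaluation_interval, pending_for: both in seconds.
    """
    states = []
    breaching_since = None
    for i, breaching in enumerate(condition_results):
        now = i * evaluation_interval
        if not breaching:
            breaching_since = None
            states.append("Normal")
            continue
        if breaching_since is None:
            breaching_since = now
        if now - breaching_since >= pending_for:
            states.append("Alerting")
        else:
            states.append("Pending")
    return states

results = [False, True, True, True, False, True]
print(alert_states(results, evaluation_interval=10, pending_for=20))
# ['Normal', 'Pending', 'Pending', 'Alerting', 'Normal', 'Pending']
```

With a 10s interval, a 20s pending period means the rule has to match on three consecutive evaluations before the alert fires.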

grafana_alert_5