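Introduction
Differences between black-box and white-box monitoring
Black-box monitoring: focuses on symptoms, that is, what is happening right now, such as a firing alert or a business API that is misbehaving. It is monitoring from the user's point of view, and its main purpose is to alert on faults that are currently occurring.
White-box monitoring: focuses on causes, that is, metrics exposed from inside the system. For example, the output of redis info may show that a Redis slave is down; that is an internal metric reported by Redis itself. The emphasis is on the cause: black-box monitoring might show that Redis is down, while the internal information shows that connections to the Redis port are being refused.
Features
Blackbox Exporter is the official black-box monitoring solution provided by the Prometheus community. It lets users probe endpoints over HTTP, HTTPS, DNS, TCP, and ICMP (the default configuration shown later also includes gRPC modules). The supported checks are listed below: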
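- HTTP test
  - Define request header information
  - Check the HTTP status, HTTP response headers, and HTTP body content
- TCP test
  - Monitor the port status of business components
  - Define and monitor application-layer protocols
- ICMP test
  - Host liveness check
- POST test
  - API connectivity
- SSL certificate expiry time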
Installing and configuring blackbox_exporter
Installation
# wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz
# tar xzvf blackbox_exporter-0.24.0.linux-amd64.tar.gz -C /usr/local
# ls /usr/local/blackbox_exporter-0.24.0.linux-amd64
blackbox_exporter blackbox.yml LICENSE NOTICE
# Configure the startup script (systemd unit)
# cat /usr/lib/systemd/system/blackbox_exporter.service
[Unit]
Description=Blackbox_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=root
ExecStart=/usr/local/blackbox_exporter-0.24.0.linux-amd64/blackbox_exporter --config.file=/usr/local/blackbox_exporter-0.24.0.linux-amd64/blackbox.yml --web.listen-address=:9115
Restart=on-failure
[Install]
WantedBy=multi-user.target
# Start the service
systemctl daemon-reload && systemctl enable --now blackbox_exporter
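# Web page
The web page is now reachable at http://<ip>:9115, where the relevant metrics can be viewed.
Configuration
The default configuration file is as follows: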
# cat /usr/local/blackbox_exporter-0.24.0.linux-amd64/blackbox.yml
modules:
  http_2xx:
    prober: http  # HTTP GET request
    http:
      preferred_ip_protocol: "ip4"
  http_post_2xx:  # HTTP POST request
    prober: http
    http:
      method: POST
  tcp_connect:  # TCP port check
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc  # gRPC health check
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
        - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp  # ICMP check
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5
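Blackbox Exporter performs different kinds of probes through differently configured modules; each module is handled by the protocol named in its prober field. Multiple modules can be defined, and several modules with different names may share the same prober while setting its parameters differently to implement different checks. The probers are as follows:
http
The HTTP probe is one of the most commonly used probes in black-box monitoring. It provides effective monitoring of websites and HTTP services, covering both availability and user-experience aspects such as response time. Besides alerting promptly when a service misbehaves, it helps administrators analyze and optimize the site experience.
modules:
  http_2xx_example:
    prober: http
    http:
The prober field selects the probe type. The http field customizes how the probe behaves; here it is left empty, so the HTTP probe's defaults apply: the probe issues an HTTP GET against the target and checks that the returned status code is 2XX. If it is, the probe succeeds; otherwise it fails.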
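Customizing the HTTP request
HTTP services are exposed in different forms: some are simple web pages, while others are REST-based API services. Probing these different kinds of services requires more control over the HTTP probe's behavior, including the HTTP request method, HTTP headers, and request parameters. Services with authentication enabled also need the corresponding auth settings on the probe, and HTTPS services may need custom certificate settings.
As shown below, method sets the request method used by the probe. For services that need request parameters, headers sets the request headers and body sets the request body:
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{}'
If the HTTP service requires authentication, Blackbox Exporter has built-in support for basic_auth, so the credentials can be set directly:
  http_basic_auth_example:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Host: "login.example.com"
      basic_auth:
        username: "username"
        password: "mysecret"
Services that use a Bearer token can specify the token string directly with bearer_token, or a token file with bearer_token_file. The parameter reference later in this document expresses the same thing through the authorization block (credentials or credentials_file).
For HTTPS services that need a custom certificate, tls_config specifies the certificate details:
  http_custom_ca_example:
    prober: http
    http:
      method: GET
      tls_config:
        ca_file: "/certs/my_cert.crt"
Customizing probe behavior
By default the HTTP probe only checks the HTTP status code. If it is 2XX (200 <= StatusCode < 300), the probe succeeds and the probe_success metric it returns has the value 1.
If specific status codes are expected, or there are requirements on the HTTP version, use valid_http_versions and valid_status_codes as shown below: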
  http_2xx_example:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # empty list: defaults to 2xx
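The status-code rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the exporter's actual code; the function name status_is_valid is hypothetical.

```python
def status_is_valid(status_code, valid_status_codes=()):
    """Return True if a probe would accept this HTTP status code.

    Mirrors the rule described in the text: an explicit list of
    codes wins; an empty list falls back to "any 2xx".
    """
    if valid_status_codes:
        return status_code in valid_status_codes
    # Default range: 200 <= StatusCode < 300
    return 200 <= status_code < 300


if __name__ == "__main__":
    print(status_is_valid(204))              # default 2xx rule
    print(status_is_valid(301))              # redirect rejected by default
    print(status_is_valid(301, [301, 302]))  # explicitly allowed
```

With an empty valid_status_codes list, only the 2xx range is accepted; once a list is given, codes outside it fail even if they are 2xx.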
By default the samples returned by Blackbox also include the probe_http_ssl metric, which indicates whether the probe used SSL:
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
If an HTTP service must, or must not, use SSL, this can be enforced with fail_if_ssl and fail_if_not_ssl. When fail_if_ssl is true, the probe fails if the site uses SSL and succeeds otherwise. fail_if_not_ssl is the exact opposite.
  http_2xx_example:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: []
      method: GET
      follow_redirects: true
      fail_if_ssl: false
      fail_if_not_ssl: false
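Besides the HTTP status code, the protocol version, and SSL usage, the probe's success can also depend on the response content. With fail_if_body_matches_regexp and fail_if_body_not_matches_regexp (the names used in the parameter reference later in this document; older releases called them fail_if_matches_regexp and fail_if_not_matches_regexp), users define sets of regular expressions that the HTTP response body must not match, or must match, respectively.
  http_2xx_example:
    prober: http
    timeout: 5s
    http:
      method: GET
      fail_if_body_matches_regexp:
        - "Could not connect to database"
      fail_if_body_not_matches_regexp:
        - "Download the latest version here"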
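How the two regular-expression lists combine can be sketched in Python as follows. This is a minimal illustration under assumptions: the function and parameter names are hypothetical, and Python's re module stands in for the Go regular expressions the exporter uses, so only simple patterns behave identically.

```python
import re


def body_check_passes(body, fail_if_matches=(), fail_if_not_matches=()):
    """Return True if the response body passes both regexp lists.

    - Any pattern in fail_if_matches that is found fails the probe.
    - Any pattern in fail_if_not_matches that is NOT found fails the probe.
    """
    for pattern in fail_if_matches:
        if re.search(pattern, body):
            return False
    for pattern in fail_if_not_matches:
        if not re.search(pattern, body):
            return False
    return True


if __name__ == "__main__":
    page = "Welcome. Download the latest version here."
    print(body_check_passes(
        page,
        fail_if_matches=["Could not connect to database"],
        fail_if_not_matches=["Download the latest version here"],
    ))
```

A page that contains the expected download text and no database error passes; an error page fails on the first list, and a page missing the download text fails on the second.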
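One last reminder: by default the HTTP probe prefers IPv6. In most cases you can set preferred_ip_protocol: ip4 to force probing over IPv4. The samples returned by Blackbox also include the probe_ip_protocol metric, which shows the protocol in use: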
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 6
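Probe output like the sample above is plain text, so it is easy to inspect in a script. The following Python sketch reads such text into a dictionary; it assumes simple samples without labels or timestamps, and the function name parse_samples is hypothetical.

```python
def parse_samples(text):
    """Parse simple 'name value' metric lines into a dict.

    Comment lines (# HELP / # TYPE) and blank lines are skipped.
    Assumes samples carry no labels and no timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


if __name__ == "__main__":
    output = """\
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 6
"""
    print(parse_samples(output))
```

A value of 6 for probe_ip_protocol means the probe went over IPv6, and 4 means IPv4.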
Detailed HTTP probe parameters
# Accepted status codes for this probe. Defaults to 2xx.
[ valid_status_codes: <int>, ... | default = 2xx ]
# Accepted HTTP versions for this probe.
[ valid_http_versions: <string>, ... ]
# The HTTP method the probe will use.
[ method: <string> | default = "GET" ]
# The HTTP headers set for the probe.
headers:
[ <string>: <string> ... ]
# The maximum uncompressed body length in bytes that will be processed. A value of 0 means no limit.
#
# If the response includes a Content-Length header, it is NOT validated against this value. This
# setting is only meant to limit the amount of data that you are willing to read from the server.
#
# Example: 10MB
[ body_size_limit: <size> | default = 0 ]
# The compression algorithm to use to decompress the response (gzip, br, deflate, identity).
#
# If an "Accept-Encoding" header is specified, it MUST be such that the compression algorithm
# indicated using this option is acceptable. For example, you can use `compression: gzip` and
# `Accept-Encoding: br, gzip` or `Accept-Encoding: br;q=1.0, gzip;q=0.9`. The fact that gzip is
# acceptable with a lower quality than br does not invalidate the configuration, as you might
# be testing that the server does not return br-encoded content even if it's requested. On the
# other hand, `compression: gzip` and `Accept-Encoding: br, identity` is NOT a valid
# configuration, because you are asking for gzip to NOT be returned, and trying to decompress
# whatever the server returns is likely going to fail.
[ compression: <string> | default = "" ]
# Whether or not the probe will follow any redirects.
[ follow_redirects: <boolean> | default = true ]
# Probe fails if SSL is present.
[ fail_if_ssl: <boolean> | default = false ]
# Probe fails if SSL is not present.
[ fail_if_not_ssl: <boolean> | default = false ]
# Probe fails if response body matches regex.
fail_if_body_matches_regexp:
[ - <regex>, ... ]
# Probe fails if response body does not match regex.
fail_if_body_not_matches_regexp:
[ - <regex>, ... ]
# Probe fails if response header matches regex. For headers with multiple values, fails if *at least one* matches.
fail_if_header_matches:
[ - <http_header_match_spec>, ... ]
# Probe fails if response header does not match regex. For headers with multiple values, fails if *none* match.
fail_if_header_not_matches:
[ - <http_header_match_spec>, ... ]
# Configuration for TLS protocol of HTTP probe.
tls_config:
[ <tls_config> ]
# The HTTP basic authentication credentials.
basic_auth:
[ username: <string> ]
[ password: <secret> ]
[ password_file: <filename> ]
# Sets the `Authorization` header on every request with
# the configured credentials.
authorization:
# Sets the authentication type of the request.
[ type: <string> | default: Bearer ]
# Sets the credentials of the request. It is mutually exclusive with
# `credentials_file`.
[ credentials: <secret> ]
# Sets the credentials of the request with the credentials read from the
# configured file. It is mutually exclusive with `credentials`.
[ credentials_file: <filename> ]
# HTTP proxy server to use to connect to the targets.
[ proxy_url: <string> ]
# Comma-separated string that can contain IPs, CIDR notation, domain names
# that should be excluded from proxying. IP and domain names can
# contain port numbers.
[ no_proxy: <string> ]
# Use proxy URL indicated by environment variables (HTTP_PROXY, https_proxy, HTTPs_PROXY, https_proxy, and no_proxy)
[ proxy_from_environment: <bool> | default: false ]
# Specifies headers to send to proxies during CONNECT requests.
[ proxy_connect_headers:
[ <string>: [<secret>, ...] ] ]
# Skip DNS resolution and URL change when an HTTP proxy (proxy_url or proxy_from_environment) is set.
[ skip_resolve_phase_with_proxy: <boolean> | default = false ]
# OAuth 2.0 configuration to use to connect to the targets.
oauth2:
[ <oauth2> ]
# Whether to enable HTTP2.
[ enable_http2: <bool> | default: true ]
# The IP protocol of the HTTP probe (ip4, ip6).
[ preferred_ip_protocol: <string> | default = "ip6" ]
[ ip_protocol_fallback: <boolean> | default = true ]
# The body of the HTTP request used in probe.
[ body: <string> ]
# Read the HTTP request body from a file.
# It is mutually exclusive with `body`.
[ body_file: <filename> ]
dns
# The IP protocol of the DNS probe (ip4, ip6).
[ preferred_ip_protocol: <string> | default = "ip6" ]
[ ip_protocol_fallback: <boolean | default = true> ]
# The source IP address.
[ source_ip_address: <string> ]
[ transport_protocol: <string> | default = "udp" ] # udp, tcp
# Whether to use DNS over TLS. This only works with TCP.
[ dns_over_tls: <boolean | default = false> ]
# Configuration for TLS protocol of DNS over TLS probe.
tls_config:
[ <tls_config> ]
query_name: <string>
[ query_type: <string> | default = "ANY" ]
[ query_class: <string> | default = "IN" ]
# Set the recursion desired (RD) flag in the request.
[ recursion_desired: <boolean> | default = true ]
# List of valid response codes.
valid_rcodes:
[ - <string> ... | default = "NOERROR" ]
validate_answer_rrs:
fail_if_matches_regexp:
[ - <regex>, ... ]
fail_if_all_match_regexp:
[ - <regex>, ... ]
fail_if_not_matches_regexp:
[ - <regex>, ... ]
fail_if_none_matches_regexp:
[ - <regex>, ... ]
validate_authority_rrs:
fail_if_matches_regexp:
[ - <regex>, ... ]
fail_if_all_match_regexp:
[ - <regex>, ... ]
fail_if_not_matches_regexp:
[ - <regex>, ... ]
fail_if_none_matches_regexp:
[ - <regex>, ... ]
validate_additional_rrs:
fail_if_matches_regexp:
[ - <regex>, ... ]
fail_if_all_match_regexp:
[ - <regex>, ... ]
fail_if_not_matches_regexp:
[ - <regex>, ... ]
fail_if_none_matches_regexp:
[ - <regex>, ... ]
tcp
# The IP protocol of the TCP probe (ip4, ip6).
[ preferred_ip_protocol: <string> | default = "ip6" ]
[ ip_protocol_fallback: <boolean | default = true> ]
# The source IP address.
[ source_ip_address: <string> ]
# The query sent in the TCP probe and the expected associated response.
# starttls upgrades TCP connection to TLS.
query_response:
[ - [ [ expect: <string> ],
[ send: <string> ],
[ starttls: <boolean | default = false> ]
], ...
]
# Whether or not TLS is used when the connection is initiated.
[ tls: <boolean | default = false> ]
# Configuration for TLS protocol of TCP probe.
tls_config:
[ <tls_config> ]
icmp
# The IP protocol of the ICMP probe (ip4, ip6).
[ preferred_ip_protocol: <string> | default = "ip6" ]
[ ip_protocol_fallback: <boolean | default = true> ]
# The source IP address.
[ source_ip_address: <string> ]
# Set the DF-bit in the IP-header. Only works with ip4, on *nix systems and
# requires raw sockets (i.e. root or CAP_NET_RAW on Linux).
[ dont_fragment: <boolean> | default = false ]
# The size of the payload.
[ payload_size: <int> ]
# TTL of outbound packets. Value must be in the range [0, 255]. Can be used
# to test reachability of a target within a given number of hops, for example,
# to determine when network routing has changed.
[ ttl: <int> ]
grpc
# The service name to query for health status.
[ service: <string> ]
# The IP protocol of the gRPC probe (ip4, ip6).
[ preferred_ip_protocol: <string> ]
[ ip_protocol_fallback: <boolean> | default = true ]
# Whether to connect to the endpoint with TLS.
[ tls: <boolean | default = false> ]
# Configuration for TLS protocol of gRPC probe.
tls_config:
[ <tls_config> ]
The references below supplement the probe configuration above and explain the nested configuration items used in it.
tls_config reference
# Disable target certificate validation.
[ insecure_skip_verify: <boolean> | default = false ]
# The CA cert to use for the targets.
[ ca_file: <filename> ]
# The client cert file for the targets.
[ cert_file: <filename> ]
# The client key file for the targets.
[ key_file: <filename> ]
# Used to verify the hostname for the targets.
[ server_name: <string> ]
# Minimum acceptable TLS version. Accepted values: TLS10 (TLS 1.0), TLS11 (TLS
# 1.1), TLS12 (TLS 1.2), TLS13 (TLS 1.3).
# If unset, Prometheus will use Go default minimum version, which is TLS 1.2.
# See MinVersion in https://pkg.go.dev/crypto/tls#Config.
[ min_version: <string> ]
http_header_match_spec reference
header: <string>,
regexp: <regex>,
[ allow_missing: <boolean> | default = false ]
oauth2 reference
client_id: <string>
[ client_secret: <secret> ]
# Read the client secret from a file.
# It is mutually exclusive with `client_secret`.
[ client_secret_file: <filename> ]
# Scopes for the token request.
scopes:
[ - <string> ... ]
# The URL to fetch the token from.
token_url: <string>
# Optional parameters to append to the token URL.
endpoint_params:
[ <string>: <string> ... ]
Official configuration examples
modules:
  http_2xx_example:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: []  # Defaults to 2xx
      method: GET
      headers:
        Host: vhost.example.com
        Accept-Language: en-US
        Origin: example.com
      follow_redirects: true
      fail_if_ssl: false
      fail_if_not_ssl: false
      fail_if_body_matches_regexp:
        - "Could not connect to database"
      fail_if_body_not_matches_regexp:
        - "Download the latest version here"
      fail_if_header_matches: # Verifies that no cookies are set
        - header: Set-Cookie
          allow_missing: true
          regexp: '.*'
      fail_if_header_not_matches:
        - header: Access-Control-Allow-Origin
          regexp: '(\*|example\.com)'
      tls_config:
        insecure_skip_verify: false
      preferred_ip_protocol: "ip4" # defaults to "ip6"
      ip_protocol_fallback: false  # no fallback to "ip6"
  http_with_proxy:
    prober: http
    http:
      proxy_url: "http://127.0.0.1:3128"
      skip_resolve_phase_with_proxy: true
  http_with_proxy_and_headers:
    prober: http
    http:
      proxy_url: "http://127.0.0.1:3128"
      proxy_connect_header:
        Proxy-Authorization:
          - Bearer token
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{}'
  http_post_body_file:
    prober: http
    timeout: 5s
    http:
      method: POST
      body_file: "/files/body.txt"
  http_basic_auth_example:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Host: "login.example.com"
      basic_auth:
        username: "username"
        password: "mysecret"
  http_2xx_oauth_client_credentials:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2"]
      follow_redirects: true
      preferred_ip_protocol: "ip4"
      valid_status_codes:
        - 200
        - 201
      oauth2:
        client_id: "client_id"
        client_secret: "client_secret"
        token_url: "https://api.example.com/token"
        endpoint_params:
          grant_type: "client_credentials"
  http_custom_ca_example:
    prober: http
    http:
      method: GET
      tls_config:
        ca_file: "/certs/my_cert.crt"
  http_gzip:
    prober: http
    http:
      method: GET
      compression: gzip
  http_gzip_with_accept_encoding:
    prober: http
    http:
      method: GET
      compression: gzip
      headers:
        Accept-Encoding: gzip
  tls_connect:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
  tcp_connect_example:
    prober: tcp
    timeout: 5s
  imap_starttls:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
        - expect: "OK.*STARTTLS"
        - send: ". STARTTLS"
        - expect: "OK"
        - starttls: true
        - send: ". capability"
        - expect: "CAPABILITY IMAP4rev1"
  smtp_starttls:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
        - expect: "^220 ([^ ]+) ESMTP (.+)$"
        - send: "EHLO prober\r"
        - expect: "^250-STARTTLS"
        - send: "STARTTLS\r"
        - expect: "^220"
        - starttls: true
        - send: "EHLO prober\r"
        - expect: "^250-AUTH"
        - send: "QUIT\r"
  irc_banner_example:
    prober: tcp
    timeout: 5s
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"
  icmp_example:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
      source_ip_address: "127.0.0.1"
  dns_udp_example:
    prober: dns
    timeout: 5s
    dns:
      query_name: "www.prometheus.io"
      query_type: "A"
      valid_rcodes:
        - NOERROR
      validate_answer_rrs:
        fail_if_matches_regexp:
          - ".*127.0.0.1"
        fail_if_all_match_regexp:
          - ".*127.0.0.1"
        fail_if_not_matches_regexp:
          - "www.prometheus.io.\t300\tIN\tA\t127.0.0.1"
        fail_if_none_matches_regexp:
          - "127.0.0.1"
      validate_authority_rrs:
        fail_if_matches_regexp:
          - ".*127.0.0.1"
      validate_additional_rrs:
        fail_if_matches_regexp:
          - ".*127.0.0.1"
  dns_soa:
    prober: dns
    dns:
      query_name: "prometheus.io"
      query_type: "SOA"
  dns_tcp_example:
    prober: dns
    dns:
      transport_protocol: "tcp" # defaults to "udp"
      preferred_ip_protocol: "ip4" # defaults to "ip6"
      query_name: "www.prometheus.io"
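How metrics are collected
- Configure the relevant modules in blackbox_exporter; each module specifies how to probe.
- After starting or restarting blackbox_exporter, a probe can be run by requesting IP:PORT/probe?module=MODULE&target=URL, where MODULE names a module defined in the Blackbox configuration. The response is a page of metrics.
- Configure Prometheus with the request details so that it accesses each URL with the corresponding module and collects the metric data.
- Prometheus stores the metric data. Once stored, it can drive alerts through rules and Alertmanager, or be displayed in Grafana.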
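The probe request from the second step can be assembled programmatically. The Python sketch below only builds the URL; the helper name probe_url is hypothetical, and the exporter address is assumed to be the 127.0.0.1:9115 instance used elsewhere in this document.

```python
from urllib.parse import urlencode


def probe_url(exporter, module, target):
    """Build the /probe URL for a Blackbox Exporter instance.

    exporter: host:port of the exporter (assumed reachable over http)
    module:   name of a module defined in blackbox.yml
    target:   the endpoint to be probed; it is URL-encoded here
    """
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"


if __name__ == "__main__":
    print(probe_url("127.0.0.1:9115", "http_2xx", "https://prometheus.io"))
```

Encoding the target matters when it is itself a URL with its own query string; otherwise its parameters would be read as parameters of the /probe request.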
Prometheus configuration
Next, simply configure a scrape job in Prometheus for the Blackbox Exporter instance. The most straightforward way is:
- job_name: baidu_http2xx_probe
  params:
    module:
    - http_2xx
    target:
    - baidu.com
  metrics_path: /probe
  static_configs:
  - targets:
    - 127.0.0.1:9115
- job_name: prometheus_http2xx_probe
  params:
    module:
    - http_2xx
    target:
    - prometheus.io
  metrics_path: /probe
  static_configs:
  - targets:
    - 127.0.0.1:9115
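This configures two scrape jobs, baidu_http2xx_probe and prometheus_http2xx_probe, and uses params to specify the probe module (module) and the probe target (target).
This raises a problem: with N target sites that each need M kinds of probes, Prometheus would contain N * M scrape jobs, which is clearly unacceptable from a configuration management perspective. Prometheus relabeling can simplify the configuration as follows: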
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - http://prometheus.io    # Target to probe with http.
        - https://prometheus.io   # Target to probe with https.
        - http://example.com:8080 # Target to probe with http on port 8080.
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
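Here one scrape job is defined per probe module (such as http_2xx), and the job's targets are the sites to be probed. Before samples are scraped, relabel_configs rewrites the job's labels dynamically.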
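- Step 1: write the address of the target instance (__address__) into the __param_target label. A label of the form __param_<name> adds a <name> parameter to the scrape request URL, which is equivalent to setting it under params;
- Step 2: take the value of __param_target and write it to the instance label;
- Step 3: overwrite the target instance's __address__ label with the address of the Blackbox Exporter instance.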
With these three relabel steps, the Prometheus job configuration becomes much simpler.
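To make the effect of the three steps concrete, here is a small Python simulation applied to one target's labels. It is only a sketch of the label rewriting, not how Prometheus implements relabeling; the function name relabel is hypothetical, and the exporter address is the 127.0.0.1:9115 value from the configuration.

```python
def relabel(labels, exporter_address="127.0.0.1:9115"):
    """Apply the three relabel steps to a copy of one target's labels."""
    out = dict(labels)
    # Step 1: the probed site becomes the ?target= request parameter.
    out["__param_target"] = out["__address__"]
    # Step 2: keep the probed site visible as the instance label.
    out["instance"] = out["__param_target"]
    # Step 3: the scrape itself goes to the Blackbox Exporter.
    out["__address__"] = exporter_address
    return out


if __name__ == "__main__":
    before = {"__address__": "https://prometheus.io"}
    for name, value in sorted(relabel(before).items()):
        print(f"{name} = {value}")
```

After relabeling, Prometheus scrapes the exporter at /probe with module=http_2xx and target=https://prometheus.io, while the stored series keep the probed site as their instance label.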
Alerting rules
See the alerting rules provided by the community awesome list: https://samber.github.io/awesome-prometheus-alerts/rules#blackbox
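As one illustration of what such a rule evaluates, the SSL certificate expiry check from the feature list can be expressed as a comparison between the probe_ssl_earliest_cert_expiry metric (a Unix timestamp) and the current time. The Python sketch below shows only that arithmetic; the 30-day threshold and the function names are assumptions, not values taken from the linked rules.

```python
SECONDS_PER_DAY = 86400


def days_until_expiry(expiry_ts, now_ts):
    """Days between now and the certificate's expiry timestamp."""
    return (expiry_ts - now_ts) / SECONDS_PER_DAY


def cert_expiring_soon(expiry_ts, now_ts, threshold_days=30):
    """True if the certificate expires within threshold_days (assumed 30)."""
    return days_until_expiry(expiry_ts, now_ts) < threshold_days


if __name__ == "__main__":
    now = 1_700_000_000  # fixed example timestamp
    print(cert_expiring_soon(now + 10 * SECONDS_PER_DAY, now))
    print(cert_expiring_soon(now + 90 * SECONDS_PER_DAY, now))
```

An alerting rule would apply the same comparison to the metric value, using the evaluation time instead of a fixed timestamp.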