kubernetes 监控告警

主机数据收集

主机数据的采集是集群监控的基础;外部模块收集各个主机采集到的数据分析就能对整个集群完成监控和告警等功能。一般主机数据采集和对外提供数据使用cAdvisor 和node-exporter等工具。

cAdvisor

概述

Kubernetes的生态中,cAdvisor是作为容器监控数据采集的Agent,其部署在每个节点上,内部代码结构大致如下:代码结构很良好,collector和storage部分基本可做到增量扩展开发。

cAdvisor.png

关于cAdvisor支持自定义指标方式能力,其自身是通过容器部署的时候设置lable标签项:io.cadvisor.metric.开头的lable,而value则为自定义指标的配置文件,形如下:

{
  "endpoint" : {
    "protocol": "https",
    "port": 8000,
    "path": "/nginx_status"
  },
  "metrics_config"  : [
    { "name" : "activeConnections",
      "metric_type" : "gauge",
      "units" : "number of active connections",
      "data_type" : "int",
      "polling_frequency" : 10,
      "regex" : "Active connections: ([0-9]+)"
    },
    { "name" : "reading",
      "metric_type" : "gauge",
      "units" : "number of reading connections",
      "data_type" : "int",
      "polling_frequency" : 10,
      "regex" : "Reading: ([0-9]+) .*"
    },
    { "name" : "writing",
      "metric_type" : "gauge",
      "data_type" : "int",
      "units" : "number of writing connections",
      "polling_frequency" : 10,
      "regex" : ".*Writing: ([0-9]+).*"
    },
    { "name" : "waiting",
      "metric_type" : "gauge",
      "units" : "number of waiting connections",
      "data_type" : "int",
      "polling_frequency" : 10,
      "regex" : ".*Waiting: ([0-9]+)"
    }
  ]
}

但kubernetes1.6开始 cAdvisor 被集成到kubernetes中,可以在安装kubernetes通过参数激活 cAdvisor

当前cAdvisor只支持http接口方式,也就是被监控容器应用必须提供http接口,所以能力较弱,如果我们在collector这一层做扩展增强,提供数据库,mq等等标准应用的监控模式是很有价值的。在此之前的另一种方案就是如上图所示搭配promethuese(其内置有非常丰富的标准应用的插件涵盖了APM所需的采集大部分插件),但是这往往会导致系统更复杂(如果应用层并非想使用promethuse)

在Kubernetes监控生态中,一般是如下的搭配使用:


cAdvisor-promethus.png

Node-exporter

概述

node-exporter 运行在节点上采集节点主机本身的cpu和内存等使用信息,并对外提供获取主机性能开销的信息。

部署

下面是node-exporter在k8s下的部署文件

apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
  labels:
    app: node-exporter
    name: node-exporter
  name: node-exporter
  namespace: kube-system
spec:
  clusterIP: None
  ports:
  - name: scrape
    port: 9100
    protocol: TCP
  selector:
    app: node-exporter
  type: ClusterIP
---
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: kube-system
spec:
  template:
    metadata:
      labels:
        app: node-exporter
      name: node-exporter
    spec:
      containers:
      - image: prom/node-exporter:latest
        name: node-exporter
        ports:
        - containerPort: 9100
          hostPort: 9100
          name: scrape
      hostNetwork: true
      hostPID: true
      restartPolicy: Always

监控

完成对kubernetes的监控, 监控收集数据一般有PULL和PUSH两种方式。PULL方式是监控平台从集群中的主机上主动拉取采集到的主机信息,而PUSH方式是主机将采集到的信息推送到监控平台。常用的监控平台是Prometheus,是采用PULL的方式采集主机信息。

Prometheus

概述

Prometheus 是源于 Google Borgmon 的一个系统监控和报警工具,用 Golang 语言开发。基本原理是通过 HTTP 协议周期性地抓取被监控组件的状态(pull 方式),这样做的好处是任意组件只要提供 HTTP 接口就可以接入监控系统,不需要任何 SDK 或者其他的集成过程。

这样做非常适合虚拟化环境比如 VM 或者 Docker ,故其为为数不多的适合 Docker、Mesos 、Kubernetes 环境的监控系统之一,被很多人称为下一代监控系统。

特性

  • 自定义多维度的数据模型
  • 非常高效的存储 平均一个采样数据占 ~3.5 bytes左右,320万的时间序列,每30秒采样,保持60天,消耗磁盘大概228G。
  • 强大的查询语句
  • 轻松实现数据可视化

优点

  • 非常少的外部依赖,安装使用超简单
  • 已经有非常多的系统集成 例如:docker HAProxy Nginx JMX等等
  • 服务自动化发现
  • 直接集成到代码
  • 设计思想是按照分布式、微服务架构来实现的

组件

  • Prometheus server
    主要负责数据采集和存储,提供 PromQL 查询语言的支持;
  • Push Gateway
    支持临时性 Job 主动推送指标的中间网关;
  • Exporters
    提供被监控组件信息的 HTTP 接口被叫做 exporter ,目前互联网公司常用的组件大部分都有 exporter 可以直接使用,比如 Varnish、Haproxy、Nginx、MySQL、Linux 系统信息 (包括磁盘、内存、CPU、网络等等);
  • PromDash
    使用 rails 开发的 dashboard,用于可视化指标数据;
  • WebUI
    9090 端口提供的图形化功能;
  • Alertmanager
    用来进行报警;
  • APIclients
    提供 HTTPAPI 接口
Prometheus

部署

  • 安装Rbac
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: kube-system
  • 安装Configmap
# cat configmap.yaml 
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
data:
  prometheus.yml: |
    global:
      scrape_interval:     15s
      evaluation_interval: 15s
    scrape_configs:

    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    - job_name: 'kubernetes-nodes'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics

    - job_name: 'kubernetes-cadvisor'
      kubernetes_sd_configs:
      - role: node
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

    - job_name: 'kubernetes-services'
      kubernetes_sd_configs:
      - role: service
      metrics_path: /probe
      params:
        module: [http_2xx]
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name

    - job_name: 'kubernetes-ingresses'
      kubernetes_sd_configs:
      - role: ingress
      relabel_configs:
      - source_labels: [__meta_kubernetes_ingress_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_ingress_scheme,__address__,__meta_kubernetes_ingress_path]
        regex: (.+);(.+);(.+)
        replacement: ${1}://${2}${3}
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter.example.com:9115
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_ingress_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_ingress_name]
        target_label: kubernetes_name

    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
  • 部署 Prometheus
---
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  labels:
    name: prometheus-deployment
  name: prometheus
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - image: prom/prometheus:v2.0.0
        name: prometheus
        command:
        - "/bin/prometheus"
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention=24h"
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: "/prometheus"
          name: data
        - mountPath: "/etc/prometheus"
          name: config-volume
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 500m
            memory: 2500Mi
      serviceAccountName: prometheus    
      volumes:
      - name: data
        emptyDir: {}
      - name: config-volume
        configMap:
          name: prometheus-config 

---
kind: Service
apiVersion: v1
metadata:
  labels:
    app: prometheus
  name: prometheus
  namespace: kube-system
spec:
  type: NodePort
  ports:
  - port: 9090
    targetPort: 9090
    nodePort: 30003
  selector:
    app: prometheus

测试

访问集群中 http://prometheus地址:30003后。在Graph页面
输入:

node_cpu

查询命令可以看到节点cpu的使用信息。prometheus监控节点信息成功。
访问targets页面可以看到prometheus采集的监控信息的来源。

告警

Prometheus的告警是使用AlertManger来一同完成的。Prometheus在监控信息超过设定阀值时就将告警信息发送给AlertManger模块,AlertManger模块负责告警。

AlertManger

概述

Alertmanager与Prometheus是相互分离的两个组件。Prometheus服务器根据报警规则将警报发送给Alertmanager,然后Alertmanager将silencing、inhibition、aggregation等消息通过电子邮件、PaperDuty和HipChat发送通知。

设置警报和通知的主要步骤:

  • 安装配置Alertmanager
  • 配置Prometheus通过-alertmanager.url标志与Alertmanager通信
  • 在Prometheus中创建告警触发规则。
  • 在Alertmanager中设置告警通知规则

告警通知规则

Alertmanager处理由例如Prometheus服务器等客户端发来的警报。它负责删除重复数据、分组,并将警报通过路由发送到正确的接收器,比如电子邮件、Slack等。Alertmanager还支持groups,silencing和警报抑制的机制。

分组

分组是指将同一类型的警报分类为单个通知。当许多系统同时宕机时,很有可能成百上千的警报会同时生成,这种机制特别有用。
例如,当数十或数百个服务的实例在运行,网络发生故障时,有可能一半的服务实例不能访问数据库。在prometheus告警规则中配置为每一个服务实例都发送警报的话,那么结果是数百警报被发送至Alertmanager。

但是作为用户只想看到单一的报警页面,同时仍然能够清楚的看到哪些实例受到影响,因此,可以通过配置Alertmanager将警报分组打包,并发送一个相对看起来紧凑的通知。

分组警报、警报时间,以及接收警报的receiver是在alertmanager配置文件中通过路由树配置的。

抑制(Inhibition)

抑制是指当警报发出后,停止重复发送由此警报引发其他错误的警报的机制。(比如网络不可达,导致其他服务连接相关警报)

例如,当整个集群网络不可达,此时警报被触发,可以事先配置Alertmanager忽略由该警报触发而产生的所有其他警报,这可以防止通知数百或数千与此问题不相关的其他警报。

抑制机制也是通过Alertmanager的配置文件来配置。

沉默(Silences)

Silences是一种简单的特定时间不告警的机制。silences警告是通过匹配器(matchers)来配置,就像路由树一样。传入的警报会匹配RE,如果匹配,将不会为此警报发送通知。
silences报警机制可以通过Alertmanager的Web页面进行配置。

接收

使用Receiver定义各种通知用户的途径,告警经过分组,过滤处理后选择匹配的通知渠道发送给接收用户。

部署

报警触发规则

定义报警规则
报警规则通过以下格式定义:

ALERT <alert name>
  IF <expression>
  [ FOR <duration> ]
  [ LABELS <label set> ]
  [ ANNOTATIONS <label set> ]
  • 可选的FOR语句,使得Prometheus在表达式输出的向量元素(例如高HTTP错误率的实例)之间等待一段时间,将警报计数作为触发此元素。如果元素是active,但是没有firing的,就处于pending状态。

  • LABELS(标签)语句允许指定一组标签附加警报上。将覆盖现有冲突的任何标签,标签值也可以被模板化。

  • ANNOTATIONS(注释)它们被用于存储更长的其他信息,例如警报描述或者链接,注释值也可以被模板化。

  • Templating(模板) 标签和注释值可以使用控制台模板进行模板化。labels变量保存警报实例的标签键/值对,value保存警报实例的评估值。

# To insert a firing element's label values:
{{ $labels.<labelname> }}
# To insert the numeric expression value of the firing element:
{{ $value }}

例子:
prometheus.yml中配置Prometheus和AlertManager通信的通信方式以及告警触发规则。

在prometheus.yml中配置Prometheus来和AlertManager通信

alerting:
  alertmanagers:
    - static_configs:
      - targets: ["alertManager域名:9093"]

在prometheus.yml中指定匹配报警规则的间隔

# How frequently to evaluate rules.
[ evaluation_interval: <duration> | default = 1m ]

在prometheus.yml中指定规则文件(可使用通配符,如rules/*.rules)

rule_files:
    - /etc/prometheus/rules.yml

其中rule_files就是用来指定报警规则的,这里我们将rules.yml用ConfigMap的形式挂载到/etc/prometheus目录下面即可:

rules.yml: |
    groups:
    - name: test-rule
      rules:
      - alert: NodeFilesystemUsage
        expr: (node_filesystem_size{device="rootfs"} - node_filesystem_free{device="rootfs"}) / node_filesystem_size{device="rootfs"} * 100 > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{$labels.instance}}: High Filesystem usage detected"
          description: "{{$labels.instance}}: Filesystem usage is above 80% (current value is: {{ $value }}"
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal - (node_memory_MemFree+node_memory_Buffers+node_memory_Cached )) / node_memory_MemTotal * 100 > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{$labels.instance}}: High Memory usage detected"
          description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }}"
      - alert: NodeCPUUsage
        expr: (100 - (avg by (instance) (irate(node_cpu{job="kubernetes-node-exporter",mode="idle"}[5m])) * 100)) > 80
        for: 2m
        labels:
          team: node
        annotations:
          summary: "{{$labels.instance}}: High CPU usage detected"
          description: "{{$labels.instance}}: CPU usage is above 80% (current value is: {{ $value }}"

配置文件设置好后,需要让prometheus重新读取,有两种方法:

告警通知规则

全局
要指定加载的配置文件,需要使用-config.file标志。该文件使用YAML来完成,通过下面的描述来定义。带括号的参数表示是可选的,对于非列表的参数的值,将被设置为指定的缺省值。

通用占位符定义解释:

<duration> : 与正则表达式匹配的持续时间值,[0-9]+(ms|[smhdwy])
<labelname>: 与正则表达式匹配的字符串,[a-zA-Z_][a-zA-Z0-9_]*
<labelvalue>: unicode字符串
<filepath>: 有效的文件路径
<boolean>: boolean类型,true或者false
<string>: 字符串
<tmpl_string>: 模板变量字符串

global全局配置文件参数在所有配置上下文生效,作为其他配置项的默认值,可被覆盖.

global:
      resolve_timeout: 30s
      smtp_smarthost: "smtp.163.com:25"
      smtp_from: 'xiyanxiyan10@163.com'
      smtp_auth_username: "xiyanxiyan10@163.com" 
      smtp_auth_password: "xiyanxiyan10" 
      smtp_require_tls: false

路由(route)
路由块定义了路由树及其子节点。如果没有设置的话,子节点的可选配置参数从其父节点继承。

每个警报都会在配置的顶级路由中进入路由树,该路由树必须匹配所有警报(即没有任何配置的匹配器)。然后遍历子节点。如果continue的值设置为false,它在第一个匹配的子节点之后就停止;如果continue的值为true,警报将继续进行后续子节点的匹配。如果警报不匹配任何节点的任何子节点(没有匹配的子节点,或不存在),该警报基于当前节点的配置处理。

    route:
      receiver: mailhook
      group_wait: 30s
      group_interval: 1m
      repeat_interval: 1m
      group_by: [NodeMemoryUsage, NodeCPUUsage, NodeFilesystemUsage]

      routes:
      - receiver: mailhook
        group_wait: 10s
        match:
          team: node

设置alertmanager.yml的的route与receivers

route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first 
  # notification.
  group_wait: 5s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 1m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 3h 

  # A default receiver
  receiver: mengyuan

receivers:
- name: 'mengyuan'
  webhook_configs:
  - url: http://192.168.0.53:8080
  email_configs:
  - to: 'xiyanxiyan10@hotmail.com'

路由配置格式

#报警接收器
[ receiver: <string> ]

#分组
[ group_by: '[' <labelname>, ... ']' ]

# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]

# A set of equality matchers an alert has to fulfill to match the node.
#根据匹配的警报,指定接收器
match:
  [ <labelname>: <labelvalue>, ... ]

# A set of regex-matchers an alert has to fulfill to match the node.
match_re:
#根据匹配正则符合的警告,指定接收器
  [ <labelname>: <regex>, ... ]

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> ]

# How long to wait before sending notification about new alerts that are
# in are added to a group of alerts for which an initial notification
# has already been sent. (Usually ~5min or more.)
[ group_interval: <duration> ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> ]

# Zero or more child routes.
routes:
  [ - <route> ... ]

例子:

// Match does a depth-first left-to-right search through the route tree
// and returns the matching routing nodes.
func (r *Route) Match(lset model.LabelSet) []*Route {
Alert
Alert是alertmanager接收到的报警,类型如下。

// Alert is a generic representation of an alert in the Prometheus eco-system.
type Alert struct {
    // Label value pairs for purpose of aggregation, matching, and disposition
    // dispatching. This must minimally include an "alertname" label.
    Labels LabelSet `json:"labels"`

    // Extra key/value information which does not define alert identity.
    Annotations LabelSet `json:"annotations"`

    // The known time range for this alert. Both ends are optional.
    StartsAt     time.Time `json:"startsAt,omitempty"`
    EndsAt       time.Time `json:"endsAt,omitempty"`
    GeneratorURL string    `json:"generatorURL"`
}

具有相同Lables的Alert(key和value都相同)才会被认为是同一种。在prometheus rules文件配置的一条规则可能会产生多种报警

抑制规则 inhibit_rule

抑制规则,是存在另一组匹配器匹配的情况下,使其他被引发警报的规则静音。这两个警报,必须有一组相同的标签。

抑制配置格式

# Matchers that have to be fulfilled in the alerts to be muted.
##必须在要需要静音的警报中履行的匹配者
target_match:
  [ <labelname>: <labelvalue>, ... ]
target_match_re:
  [ <labelname>: <regex>, ... ]

# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
#必须存在一个或多个警报以使抑制生效的匹配者。
source_match:
  [ <labelname>: <labelvalue>, ... ]
source_match_re:
  [ <labelname>: <regex>, ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
#在源和目标警报中必须具有相等值的标签才能使抑制生效
[ equal: '[' <labelname>, ... ']' ]

接收器(receiver)

顾名思义,警报接收的配置。
通用配置格式

# The unique name of the receiver.
name: <string>

# Configurations for several notification integrations.
email_configs:
  [ - <email_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
slack_config:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]
邮件接收器email_config

# Whether or not to notify about resolved alerts.
#警报被解决之后是否通知
[ send_resolved: <boolean> | default = false ]

# The email address to send notifications to.
to: <tmpl_string>
# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]
# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]

# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ] 

# Further headers email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { <string>: <tmpl_string>, ... } ]

Slcack接收器slack_config

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]

# The Slack webhook URL.
[ api_url: <string> | default = global.slack_api_url ]

# The channel or user to send notifications to.
channel: <tmpl_string>

# API request data as defined by the Slack webhook API.
[ color: <tmpl_string> | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ]
[ username: <tmpl_string> | default = '{{ template "slack.default.username" . }}'
[ title: <tmpl_string> | default = '{{ template "slack.default.title" . }}' ]
[ title_link: <tmpl_string> | default = '{{ template "slack.default.titlelink" . }}' ]
[ pretext: <tmpl_string> | default = '{{ template "slack.default.pretext" . }}' ]
[ text: <tmpl_string> | default = '{{ template "slack.default.text" . }}' ]
[ fallback: <tmpl_string> | default = '{{ template "slack.default.fallback" . }}' ]

Webhook接收器webhook_config

 # Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]

 # The endpoint to send HTTP POST requests to.
url: <string>

Alertmanager会使用以下的格式向配置端点发送HTTP POST请求:

{
  "version": "3",
  "groupKey": <number>     // key identifying the group of alerts (e.g. to deduplicate)
  "status": "<resolved|firing>",
  "receiver": <string>,
  "groupLabels": <object>,
  "commonLabels": <object>,
  "commonAnnotations": <object>,
  "externalURL": <string>,  // backling to the Alertmanager.
  "alerts": [
    {
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>"
    },
    ...
  ]
}

例子:

 receivers:
    - name: mailhook
      email_configs:
      - to: "xiyanxiyan10@hotmail.com"
        html: '{{ template "alert.html" . }}'
        headers: { Subject: "[WARN] Warn Email" }

集群信息展示

集群的整体信息已经收集汇总在Prometheus中,但Prometheus主要是对外提供数据获取接口,并不负责完成完善的图形展示,因此需要使用DashBoard工具对接Prometheus完成集群信息的图形化展示.

Grafana

概述

grafana 是一款采用 go 语言编写的开源应用,主要用于大规模指标数据的可视化展现,基于商业友好的 Apache License 2.0 开源协议。grafana有热插拔控制面板和可扩展的数据源,目前已经支持绝大部分常用的时序数据库。

目前grafana支持的数据源

支持的数据源

上文已经提到,Grafana支持很多的数据源,主要支持的有如下数据源:

  1. Graphite
  2. Elasticsearch
  3. CloudWatch
  4. InfluxDB
  5. OpenTSDB
  6. Prometheus

部署

部署服务

# cat grafana-deploy.yaml 

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana-core
  namespace: kube-system
  labels:
    app: grafana
    component: core
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: grafana
        component: core
    spec:
      containers:
      - image: grafana/grafana:4.2.0
        name: grafana-core
        imagePullPolicy: IfNotPresent
        # env:
        resources:
          # keep request = limit to keep this container in guaranteed class
          limits:
            cpu: 100m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 100Mi
        env:
          # The following env variables set up basic auth twith the default admin user and admin password.
          - name: GF_AUTH_BASIC_ENABLED
            value: "true"
          - name: GF_AUTH_ANONYMOUS_ENABLED
            value: "false"
          # - name: GF_AUTH_ANONYMOUS_ORG_ROLE
          #   value: Admin
          # does not really work, because of template variables in exported dashboards:
          # - name: GF_DASHBOARDS_JSON_ENABLED
          #   value: "true"
        readinessProbe:
          httpGet:
            path: /login
            port: 3000
          # initialDelaySeconds: 30
          # timeoutSeconds: 1
        volumeMounts:
        - name: grafana-persistent-storage
          mountPath: /var
      volumes:
      - name: grafana-persistent-storage
        emptyDir: {}
--- 
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: kube-system
  labels:
    app: grafana
    component: core
spec:
  type: NodePort
  ports:
    - port: 3000
      targetPort: 3000
      nodePort: 30009
  selector:
    app: grafana

配置grafana

访问Grafana部署的端口可以看到以下页面

image.png

默认用户名和密码都是admin

image.png

配置数据源为prometheus

image.png

导入面板,可以直接输入模板编号315在线导入,或者下载好对应的json模板文件本地导入,面板模板下载地址https://grafana.com/dashboards/315

image.png

导入面板之后就可以看到对应的监控数据了。


image.png

示例配置

配置源文件

推荐阅读更多精彩内容