Prometheus Basics

Prometheus

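What Is a TSDB (Time Series Database)?

Put simply, a TSDB is software optimized for handling time series data, that is, arrays of values indexed by time. Time series workloads typically share these characteristics:

- Most operations are writes.

- Writes are almost always sequential appends; data usually arrives ordered by time.

- Writes rarely target data from long ago, and existing data is rarely updated. Data is usually written to the database within seconds or minutes of being collected.

- Deletes are usually done in blocks: pick a starting point in history and remove the blocks that follow. Deleting a single timestamp or scattered random timestamps is rare.

- The data volume usually far exceeds memory, so caching helps little. Such systems are typically IO-bound.

- Reads are typically sequential scans in ascending or descending time order.

- Highly concurrent reads are common.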

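What Is Prometheus?

Prometheus is an open-source monitoring and alerting system with a built-in time series database (TSDB), originally developed at SoundCloud.

Prometheus joined the CNCF (Cloud Native Computing Foundation) in 2016 as the second project hosted by the foundation, after Kubernetes.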

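Features of Prometheus

- A multi-dimensional data model (a time series is identified by a metric name and a set of key/value pairs)

- PromQL, a flexible query language that takes advantage of this dimensionality

- No dependence on distributed storage; each server node works on its own

- Time series collection through a pull model over HTTP

- Pushing time series is supported through an intermediary gateway

- Targets are found through service discovery or static configuration

- Support for multiple visualization and dashboard options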

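The Prometheus Ecosystem

- The main Prometheus server, which scrapes and stores time series data

- Client libraries for instrumenting application code or writing exporters (Go, Java, Python, Ruby)

- A push gateway for supporting short-lived jobs

- A visualization dashboard (two options, PromDash and Grafana; Grafana is currently the mainstream choice)

- Special-purpose exporters (for services such as HAProxy, StatsD, and Graphite)

- An experimental alert manager (Alertmanager, which separately handles alert aggregation, routing, and silencing)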


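Deployment and Configuration

Download

URL: https://prometheus.io/download/

Deployment

Download prometheus-*.tar.gz

Extract it

Configuration

The prometheus directory contains the main configuration file, prometheus.yml. It holds most of the standard settings as well as the configuration for Prometheus to monitor itself. The file looks like this: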

    # my global config
    global:
      scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute. [interval between scrapes]
      evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute. [interval between rule evaluations]
      # scrape_timeout is set to the global default (10s).

    # Alertmanager configuration
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - '172.17.20.231:20507' # address of the Alertmanager to connect to

    # Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
      - "alert-rule.yml" # rule files can hold two kinds of rules: recording rules and alerting rules

    # A scrape configuration containing exactly one endpoint to scrape:
    # Here it's Prometheus itself.
    scrape_configs:
      # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
      - job_name: 'prometheus' # the scrape target
        # metrics_path defaults to '/metrics' (the exporter endpoint built into Prometheus)
        # scheme defaults to 'http'.
        static_configs:
          - targets: ['localhost:20504'] # the port Prometheus listens on

      - job_name: 'spring-boot'
        metrics_path: '/prometheus' # endpoint of the custom Spring Boot exporter
        static_configs:
          - targets: ['localhost:20506'] # the port the Spring Boot application listens on
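To make the defaults in this configuration concrete, here is a minimal sketch of how a scrape URL could be put together from a job's scheme, target, and metrics_path. The helper name scrape_url is hypothetical and only for illustration; it is not part of Prometheus.

```python
def scrape_url(target: str, metrics_path: str = "/metrics", scheme: str = "http") -> str:
    """Build the URL a scraper would request for one target (illustrative only)."""
    return f"{scheme}://{target}{metrics_path}"


# The two jobs from the configuration: one relies on the defaults,
# the other overrides metrics_path.
print(scrape_url("localhost:20504"))
print(scrape_url("localhost:20506", metrics_path="/prometheus"))
```

The first call relies on the documented defaults ('/metrics' and 'http'), which is why the 'prometheus' job does not need to set them.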

Startup

Write a startup script:

nohup ./prometheus --config.file=prometheus.yml --web.enable-admin-api --web.listen-address=:20504 >/dev/null 2>&1 &

This starts Prometheus silently in the background; --web.listen-address sets the listening port.

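Metric Types

- Counter: a cumulative metric whose value only ever increases (or resets to zero when the process restarts), for example a total request count.

- Gauge: an instantaneous value that can go up or down arbitrarily.

- Histogram: samples observations over time (usually request durations or response sizes) and counts them in configurable buckets, along with a total count and sum of the observations.

- Summary: very similar to a Histogram, it also samples observations over time (usually request durations or response sizes), but it stores quantiles directly instead of deriving them from bucket counts.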
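The difference between these types is easier to see in code. The following is a toy model in plain Python, not the API of any Prometheus client library; the class names mirror the metric types, and the bucket bounds and observed values are made up for illustration. Summary is left out because its quantile estimation is more involved.

```python
import bisect


class Counter:
    """Cumulative value that only goes up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Instantaneous value that may go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations in cumulative buckets and tracks their sum and count."""

    def __init__(self, upper_bounds):
        # An implicit +Inf bucket catches every observation.
        self.upper_bounds = sorted(upper_bounds) + [float("inf")]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Buckets are cumulative: an observation counts toward every bucket
        # whose upper bound is greater than or equal to the observed value.
        first = bisect.bisect_left(self.upper_bounds, value)
        for i in range(first, len(self.upper_bounds)):
            self.bucket_counts[i] += 1
        self.sum += value
        self.count += 1


latency = Histogram([0.1, 0.5, 1.0])  # hypothetical bucket bounds, in seconds
for seconds in (0.05, 0.3, 0.7, 2.0):
    latency.observe(seconds)
print(latency.bucket_counts)  # [1, 2, 3, 4]
```

Dividing the histogram's sum by its count gives the average observed value, which is the idea behind the `_sum` / `_count` latency query used later in this document.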

Time Series Data: Instrumentation and Querying

Each time series sample consists of a metric name, one or more labels, and a float64 value.

The standard format is <metric name>{<label name>=<label value>,...}

For example:

rpc_invoke_cnt_c{code="0",method="Session.GenToken",job="Center"} 5

rpc_invoke_cnt_c{code="0",method="Relation.GetUserInfo",job="Center"} 12

rpc_invoke_cnt_c{code="0",method="Message.SendGroupMsg",job="Center"} 12

rpc_invoke_cnt_c{code="4",method="Message.SendGroupMsg",job="Center"} 3

rpc_invoke_cnt_c{code="0",method="Tracker.Tracker.Get",job="Center"} 70

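This is a set of monitoring data that counts how many times each RPC interface was handled.

Here rpc_invoke_cnt_c is the metric name, and each sample carries three labels: code is the error code, job is the service the metric belongs to, and method is the method the metric belongs to. The number at the end is the sample value.

In this example there are four dimensions (one metric name and three labels), which lets us run quite sophisticated queries with PromQL, Prometheus's query language.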
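As a rough illustration of this format, the sketch below splits one sample line into its metric name, labels, and value. It is a simplified parser written for this document, not the official exposition-format parser: it assumes one sample per line, leaves escape sequences in label values untouched, and ignores optional timestamps. The name parse_sample is hypothetical.

```python
import re

# <metric name>{<label name>="<label value>",...} <value>
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Return (metric_name, labels_dict, float_value) for one sample line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


name, labels, value = parse_sample(
    'rpc_invoke_cnt_c{code="0",method="Session.GenToken",job="Center"} 5'
)
print(name, labels, value)
```

Each distinct combination of metric name and label values identifies a separate time series, which is what makes label-based filtering and aggregation possible.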

PromQL

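PromQL (Prometheus Query Language) is Prometheus's own DSL for querying data. It is highly expressive, supports conditional selection and operators, and ships with a large set of built-in functions for querying monitoring data along any dimension.

To measure the call rate of Relation.GetUserInfo on the Center component, use the following query: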

rate(rpc_invoke_cnt_c{method="Relation.GetUserInfo",job="Center"}[1m])

Or, to get Center's overall RPC error rate broken down by method and error code:

sum by (method, code)(rate(rpc_invoke_cnt_c{job="Center",code!="0"}[1m]))
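The `sum by (method, code)` part of this query can be pictured as a group-and-add step over the per-series rates. The sketch below assumes the rates have already been computed; the `instance` label and the numeric values are made up for illustration, and sum_by is a hypothetical helper, not a Prometheus API.

```python
from collections import defaultdict


def sum_by(samples, by):
    """Sum the values of samples that share the same values for the `by` labels."""
    totals = defaultdict(float)
    for labels, value in samples:
        # Labels not listed in `by` (here: job, instance) are dropped from the result.
        key = tuple(labels.get(name, "") for name in by)
        totals[key] += value
    return dict(totals)


# Hypothetical per-series error rates (requests per second).
rates = [
    ({"code": "4", "method": "Message.SendGroupMsg", "job": "Center", "instance": "a"}, 0.5),
    ({"code": "4", "method": "Message.SendGroupMsg", "job": "Center", "instance": "b"}, 0.25),
    ({"code": "7", "method": "Session.GenToken", "job": "Center", "instance": "a"}, 0.125),
]
print(sum_by(rates, by=("method", "code")))
```

Series that differ only in labels outside the `by` list collapse into one result series, so the first two entries end up in a single total.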

To measure the average latency of each Center method, use the following query:

rate(rpc_invoke_time_h_sum{job="Center"}[1m]) / rate(rpc_invoke_time_h_count{job="Center"}[1m])

rate(http_requests_total[5m])

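Returns the per-second rate of HTTP requests, measured over the last 5 minutes, for each time series in the range vector.

increase(http_requests_total[5m])

Returns the increase in the number of HTTP requests, measured over the last 5 minutes, for each time series in the range vector.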
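The arithmetic behind these two functions can be sketched in a few lines. This is a simplified model: it handles counter resets, but it does not extrapolate to the edges of the time window the way Prometheus does, so real results can differ slightly. The sample data is made up.

```python
def increase(samples):
    """Total growth of a counter over `samples`, a time-ordered list of (timestamp, value)."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # A drop means the counter was reset (for example, a process restart):
            # count the new value as growth since zero.
            total += curr
    return total


def rate(samples):
    """Average per-second growth between the first and last sample."""
    if len(samples) < 2:
        return None  # not enough samples to compute a rate
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# Hypothetical scrapes every 15 seconds; the counter resets before t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(increase(samples))  # 100.0
print(rate(samples))
```

Because a reset is treated as a restart from zero, both functions only make sense on Counter metrics, not on Gauges.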

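Official function reference: https://prometheus.io/docs/querying/functions/

In addition, to make querying easier, some care should go into naming metrics and labels when instrumenting.

rpc_invoke_cnt_c: RPC call statistics

api_req_num_cv: HTTP API call statistics

msg_queue_cnt_c: queue length statistics

Official naming guidelines: https://prometheus.io/docs/practices/naming/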

Alerting

Installation

Download URL: https://prometheus.io/download/

Create a startup script

nohup ./alertmanager --web.listen-address=:20507 >/dev/null 2>&1 &

Adjust the configuration file

The alertmanager.yml file

Defining Alerting Rules

First, define the alerting rules by configuring the rule files in Prometheus:

    rule_files:
      # - "first_rules.yml"
      # - "second_rules.yml"
      - "alert-rule.yml" # rule files can hold two kinds of rules: recording rules and alerting rules

Then write the corresponding alerting rule:

    groups:
      - name: example
        interval: 1s
        rules:
          # Alert for any instance that is unreachable for more than 1 second
          # (the upstream example uses `for: 5m`).
          - alert: InstanceDown
            expr: up == 0
            for: 1s
            labels:
              severity: page
            annotations:
              summary: "Instance {{ $labels.instance }} down"
              description: "{{ $labels.instance }} of job {{ $labels.job }} has been down"

The above is the alerting rule for a down instance.
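The role of the `for` field is easiest to see as a small state machine: the alert is pending while the expression has been true for less than the `for` duration, and firing once it has held that long. The sketch below is a simplified model of that behaviour, assuming evenly spaced evaluations; the function name and the sample values are hypothetical.

```python
def alert_states(up_values, interval_s, for_s):
    """Return the alert state after each evaluation of `up == 0`."""
    states = []
    active_for = None  # seconds the condition has held; None while inactive
    for up in up_values:
        if up != 0:
            # The target is reachable again: the alert is cleared.
            active_for = None
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + interval_s
        states.append("firing" if active_for >= for_s else "pending")
    return states


# Evaluations every 15 seconds with a 30-second `for` duration.
print(alert_states([1, 0, 0, 0, 1], interval_s=15, for_s=30))
# ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

With a very short `for` such as the 1s used in the rule, the alert moves to firing almost immediately, so brief scrape failures will also trigger notifications.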

Configuring Alert Notifications

A simple configuration follows:

    global:
      smtp_smarthost: 'smtp.exmail.qq.com:25' # SMTP server used to send mail
      smtp_from: 'xxx@ulopay.com'
      smtp_auth_username: 'xxx@ulopay.com'
      smtp_auth_password: 'xxx'

    # The directory from which notification templates are read.
    templates:
      - '/etc/alertmanager/template/*.tmpl'

    # The root route on which each incoming alert enters.
    route:
      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      group_by: ['alertname', 'cluster', 'service'] # grouping labels, used by the settings below

      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This way ensures that you get multiple alerts for the same group that start
      # firing shortly after another are batched together on the first
      # notification. (How long a newly created group waits before sending, so that alerts can "board together".)
      group_wait: 5s

      # When the first notification was sent, wait 'group_interval' to send a batch
      # of new alerts that started firing for that group.
      group_interval: 1m # interval between sends for one group

      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend them.
      repeat_interval: 3h # interval before resending

      # A default receiver
      receiver: zhangm # default recipient

    receivers: # all recipients
      - name: 'zhangm'
        email_configs:
          - to: 'zhangm@ulopay.com'
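The effect of group_by can be sketched as bucketing alerts by the values of the listed labels, so that each bucket produces one combined notification. This is only a model of the grouping step, with made-up alerts and a hypothetical helper name; the timing settings (group_wait, group_interval, repeat_interval) are not simulated.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts (dicts of labels) by their values for the `group_by` labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # Alerts missing a grouping label are grouped under an empty value.
        key = tuple((name, alert.get(name, "")) for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical incoming alerts.
alerts = [
    {"alertname": "LatencyHigh", "cluster": "A", "service": "api", "instance": "a1"},
    {"alertname": "LatencyHigh", "cluster": "A", "service": "api", "instance": "a2"},
    {"alertname": "InstanceDown", "cluster": "A", "service": "api", "instance": "a3"},
]
groups = group_alerts(alerts, ["alertname", "cluster", "service"])
for key, members in groups.items():
    print(key, len(members))
```

Here the two LatencyHigh alerts land in one group and would be sent together, while InstanceDown forms a group of its own.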

Graphing and Dashboards

Startup

Install Grafana: https://grafana.com/

Download the grafana.tar.gz package

Extract it

Go to the bin directory

nohup ./grafana-server >/dev/null 2>&1 &

This starts Grafana in the background.

Configuration

To change the port, edit the http_port parameter in defaults.ini under the conf directory.

Interface

[Screenshot: Grafana interface]

Account and Password

Default account: admin; password: admin

Adding a Data Source

[Screenshots: steps for adding a data source in Grafana]

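Integration

For integration, see [Official Prometheus examples], [Integrating Prometheus with Play], and [Integrating Prometheus with Spring].

References

Prometheus official website

[Getting Started with Prometheus] (http://www.10tiao.com/html/357/201705/2247485232/1.html)

[Advanced Prometheus] (http://www.10tiao.com/html/357/201705/2247485249/1.html)

Online service monitoring in practice at 360, based on Prometheus

Official Prometheus examples

Integrating Prometheus with Play

Integrating Prometheus with Spring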
