Monitoring Metrics

An introduction to the monitoring metrics in Pigsty.

Metrics

Pigsty exposes a large number of metrics.

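So how many metrics does Pigsty include in total? Here is a pie chart showing the share contributed by each metric source. The blue, green, and yellow parts on the right correspond to metrics exposed by the database and database-related components, the red and orange parts at the lower left correspond to machine node metrics, and the purple part at the upper left corresponds to load balancer metrics.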

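Among the database metrics, about 230 raw metrics relate to postgres itself and about 50 raw metrics relate to middleware. On top of these raw metrics, Pigsty derives about 350 DB-related metrics through hierarchical aggregation and precomputation. So for each database cluster, the monitoring metrics for the database and its companion components alone number 621. Machine nodes contribute 281 raw metrics and 83 derived metrics, 364 in total. Adding the 170 load balancer metrics gives close to 1200 kinds of metrics overall.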
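
The totals quoted above can be checked with a few lines of arithmetic. The sketch below uses only the counts given in this section; the variable names are illustrative and not part of Pigsty.

```python
# Metric counts quoted in the text (kinds of metrics, not time series).
db_metrics = 621                   # database and its companion components
node_raw, node_derived = 281, 83   # machine node metrics
lb_metrics = 170                   # load balancer metrics

node_metrics = node_raw + node_derived
total = db_metrics + node_metrics + lb_metrics

print(node_metrics)  # 364
print(total)         # 1155, i.e. close to 1200
```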

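Note that a metric and a time series are not the same thing. The counts above refer to kinds of metrics rather than individual series, because one metric may correspond to multiple time series. For example, if a database has 20 tables, a metric such as pg_table_index_scan corresponds to 20 time series.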
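
To make the distinction concrete, here is a minimal sketch that counts metrics and time series separately. It assumes a series is identified by its metric name plus its label set; the table names and the `relname` label are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# One metric name, many time series: each distinct label set is its own series.
# The table names and the "relname" label are hypothetical.
samples = [
    ("pg_table_index_scan", {"relname": f"table_{i:02d}"}, 0.0)
    for i in range(20)
]

series_per_metric = defaultdict(set)
for name, labels, _value in samples:
    # A series is identified by the metric name plus its sorted label pairs.
    series_per_metric[name].add(tuple(sorted(labels.items())))

metric_count = len(series_per_metric)
series_count = sum(len(s) for s in series_per_metric.values())
print(metric_count, series_count)  # 1 20
```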

Source

Metrics are collected from exporters.

  • Node metrics (2000+ per instance)
  • Postgres database metrics and pgbouncer connection pooler metrics (1000+ per instance)
  • HAProxy load balancer metrics (400+ per instance)

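Pigsty's monitoring data comes from four main sources: the database itself, middleware, the operating system, and the load balancer, each exposed through a corresponding exporter. All of these metrics are then processed further, for example by aggregating them at different levels.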
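
Aggregating at different levels can be pictured as repeatedly summing series while dropping labels. The following is a rough sketch of that idea in plain Python, not Pigsty's implementation: the `sum_by` helper, the `ins` label, and all label values are assumptions made for illustration, while the level names (database, instance, service, cluster, all) follow the `db`, `ins`, `svc`, `cls`, and `all` prefixes used in the recording rules of this section.

```python
from collections import defaultdict

def sum_by(series, keep):
    """Sum values over series, grouping by the label names in `keep`."""
    out = defaultdict(float)
    for labels, value in series:
        key = tuple((k, labels[k]) for k in keep)
        out[key] += value
    return [(dict(key), value) for key, value in out.items()]

# Hypothetical per-database samples; label values are made up.
db_level = [
    ({"cls": "pg-test", "ins": "pg-test-1", "role": "primary", "datname": "app"}, 100.0),
    ({"cls": "pg-test", "ins": "pg-test-1", "role": "primary", "datname": "postgres"}, 5.0),
    ({"cls": "pg-test", "ins": "pg-test-2", "role": "replica", "datname": "app"}, 40.0),
]

ins_level = sum_by(db_level, ["cls", "ins", "role"])  # drop datname
svc_level = sum_by(ins_level, ["cls", "role"])        # one value per cluster and role
cls_level = sum_by(ins_level, ["cls"])                # one value per cluster
all_level = sum_by(cls_level, [])                     # single global value

print(all_level)
```

Each level is computed from the one below it, so the per-database samples are only scanned once.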

Category

Metrics can be categorized into four major groups: Errors, Saturation, Traffic, and Latency.

  • Errors
    • Config Errors: NUMA, Checksum, THP, Sync Commit, etc…
    • Hardware errors: EDAC Mem Error
    • Software errors: TCP Listen Overflow, NTP time shift.
    • Service Aliveness: node, postgres, pgbouncer, haproxy, exporters, etc…
    • Client Queuing, Idle In Transaction, Sage, Deadlock, Replication break, Rollbacks, etc…
  • Saturation
    • PG Load, Node Load
    • CPU Usage, Mem Usage, Disk Space Usage, Disk I/O Usage, Connection Usage, XID Usage
    • Cache Hit Rate / Buffer Hit Rate
  • Traffic
    • QPS, TPS, Xacts, Rollbacks, Seasonality
    • In/Out Bytes of NIC/Pgbouncer, WAL Rate, Tuple CRUD Rate, Block/Buffer Access
    • Disk I/O, Network I/O, Mem Swap I/O
  • Latency
    • Transaction Response Time (Xact RT)
    • Query Response Time (Query RT)
    • Statement Response Time (Statement RT)
    • Disk Read/Write Latency
    • Replication Lag (in bytes or seconds)

These are just a small portion of the metrics.

Derived Metrics

In addition to the metrics above, there are a large number of derived metrics. For example, QPS from pgbouncer has the following derived metrics:

################################################################
#                     QPS (Pgbouncer)                          #
################################################################
# QPS realtime (irate1m)
- record: pg:db:qps_realtime
  expr: irate(pgbouncer_stat_total_query_count{}[1m])
- record: pg:ins:qps_realtime
  expr: sum without(datname) (pg:db:qps_realtime{})
- record: pg:svc:qps_realtime
  expr: sum by(cls, role) (pg:ins:qps_realtime{})
- record: pg:cls:qps_realtime
  expr: sum by(cls) (pg:ins:qps_realtime{})
- record: pg:all:qps_realtime
  expr: sum(pg:cls:qps_realtime{})

# qps (rate1m)
- record: pg:db:qps
  expr: pgbouncer_stat_avg_query_count{datname!="pgbouncer"}
- record: pg:ins:qps
  expr: sum without(datname) (pg:db:qps)
- record: pg:svc:qps
  expr: sum by (cls, role) (pg:ins:qps)
- record: pg:cls:qps
  expr: sum by(cls) (pg:ins:qps)
- record: pg:all:qps
  expr: sum(pg:cls:qps)
# qps avg30m
- record: pg:db:qps_avg30m
  expr: avg_over_time(pg:db:qps[30m])
- record: pg:ins:qps_avg30m
  expr: avg_over_time(pg:ins:qps[30m])
- record: pg:svc:qps_avg30m
  expr: avg_over_time(pg:svc:qps[30m])
- record: pg:cls:qps_avg30m
  expr: avg_over_time(pg:cls:qps[30m])
- record: pg:all:qps_avg30m
  expr: avg_over_time(pg:all:qps[30m])
# qps µ
- record: pg:db:qps_mu
  expr: avg_over_time(pg:db:qps_avg30m[30m])
- record: pg:ins:qps_mu
  expr: avg_over_time(pg:ins:qps_avg30m[30m])
- record: pg:svc:qps_mu
  expr: avg_over_time(pg:svc:qps_avg30m[30m])
- record: pg:cls:qps_mu
  expr: avg_over_time(pg:cls:qps_avg30m[30m])
- record: pg:all:qps_mu
  expr: avg_over_time(pg:all:qps_avg30m[30m])
# qps σ: stddev30m qps
- record: pg:db:qps_sigma
  expr: stddev_over_time(pg:db:qps[30m])
- record: pg:ins:qps_sigma
  expr: stddev_over_time(pg:ins:qps[30m])
- record: pg:svc:qps_sigma
  expr: stddev_over_time(pg:svc:qps[30m])
- record: pg:cls:qps_sigma
  expr: stddev_over_time(pg:cls:qps[30m])
- record: pg:all:qps_sigma
  expr: stddev_over_time(pg:all:qps[30m])

There are hundreds of rules defining extra metrics based on primitive metrics.
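
As a rough illustration of what the `avg30m` and `sigma` rules above compute, the sketch below applies the same two statistics to a single window of samples in plain Python. The sample values are hypothetical, and the code only mirrors the arithmetic of `avg_over_time` and `stddev_over_time` (which uses the population standard deviation); it does not query Prometheus. The `qps_mu` rules apply `avg_over_time` a second time, to the already averaged `avg30m` series.

```python
from statistics import fmean, pstdev

# Hypothetical QPS samples inside one 30-minute window.
window = [100.0, 110.0, 90.0, 100.0]

avg30m = fmean(window)  # analogue of avg_over_time(...[30m])
sigma = pstdev(window)  # analogue of stddev_over_time(...[30m]), population form

print(avg30m, round(sigma, 3))
```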
