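High Availability Drills
Simulate several failures that are common in production to test the self-healing capability of Pigsty's highly available database clusters.
Patroni Quick Start
Use patronictl to control the database cluster. Pigsty already provides the shortcut pt: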
alias pt='patronictl -c /pg/bin/patroni.yml'
alias pt-up='sudo systemctl start patroni' # start Patroni
alias pt-dw='sudo systemctl stop patroni' # stop Patroni
alias pt-st='systemctl status patroni' # report Patroni status
alias pt-ps='ps aux | grep patroni' # show Patroni processes
alias pt-log='tail -f /pg/log/patroni.log' # follow the Patroni log
Patroni commands must be run as the database superuser (dbsu = postgres).
$ pt --help
Usage: patronictl [OPTIONS] COMMAND [ARGS]...
Options:
-c, --config-file TEXT Configuration file
-d, --dcs TEXT Use this DCS
-k, --insecure Allow connections to SSL sites without certs
--help Show this message and exit.
Commands:
configure Create configuration file
dsn Generate a dsn for the provided member,...
edit-config Edit cluster configuration
failover Failover to a replica
flush Discard scheduled events
history Show the history of failovers/switchovers
list List the Patroni members for a given Patroni
pause Disable auto failover
query Query a Patroni PostgreSQL member
reinit Reinitialize cluster member
reload Reload cluster member configuration
remove Remove cluster from DCS
restart Restart cluster member
resume Resume auto failover
scaffold Create a structure for the cluster in DCS
show-config Show cluster configuration
switchover Switchover to a replica
topology Prints ASCII topology for given cluster
version Output version of patronictl command or a...
Scenario 1: Switchover
A switchover is a planned, operator-initiated change of the cluster leader.
$ pt switchover
Master [pg-test-3]: pg-test-3
Candidate ['pg-test-1', 'pg-test-2'] []: pg-test-1
When should the switchover take place (e.g. 2020-10-23T17:06 ) [now]: now
Current cluster topology
+ Cluster: pg-test (6886641621295638555) -----+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Tags |
+-----------+-------------+---------+---------+----+-----------+-----------------+
| pg-test-1 | 10.10.10.11 | Replica | running | 2 | 0 | clonefrom: true |
| pg-test-2 | 10.10.10.12 | Replica | running | 2 | 0 | clonefrom: true |
| pg-test-3 | 10.10.10.13 | Leader | running | 2 | | clonefrom: true |
+-----------+-------------+---------+---------+----+-----------+-----------------+
Are you sure you want to switchover cluster pg-test, demoting current master pg-test-3? [y/N]: y
2020-10-23 16:06:11.76252 Successfully switched over to "pg-test-1"
Scenario 2: Failover
# run as postgres @ any member of cluster `pg-test`
$ pt failover
Candidate ['pg-test-2', 'pg-test-3'] []: pg-test-3
Current cluster topology
+ Cluster: pg-test (6886641621295638555) -----+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Tags |
+-----------+-------------+---------+---------+----+-----------+-----------------+
| pg-test-1 | 10.10.10.11 | Leader | running | 1 | | clonefrom: true |
| pg-test-2 | 10.10.10.12 | Replica | running | 1 | 0 | clonefrom: true |
| pg-test-3 | 10.10.10.13 | Replica | running | 1 | 0 | clonefrom: true |
+-----------+-------------+---------+---------+----+-----------+-----------------+
Are you sure you want to failover cluster pg-test, demoting current master pg-test-1? [y/N]: y
+ Cluster: pg-test (6886641621295638555) -----+----+-----------+-----------------+
| Member | Host | Role | State | TL | Lag in MB | Tags |
+-----------+-------------+---------+---------+----+-----------+-----------------+
| pg-test-1 | 10.10.10.11 | Replica | running | 2 | 0 | clonefrom: true |
| pg-test-2 | 10.10.10.12 | Replica | running | 2 | 0 | clonefrom: true |
| pg-test-3 | 10.10.10.13 | Leader | running | 2 | | clonefrom: true |
+-----------+-------------+---------+---------+----+-----------+-----------------+
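Scenario 3: Patroni/Postgres Down on a Replica
Scenario 4: Patroni/Postgres Down on the Primary
Scenario 5: DCS Unavailable
Scenario 6: Maintenance Mode
Discussion
Key question: how is the DCS SLA guaranteed?
==In automatic failover mode, if the DCS goes down, the current primary demotes itself to a replica after retry_timeout, leaving every cluster that depends on that DCS unwritable==.
As distributed consensus stores, Consul and Etcd are quite robust, but the DCS SLA must still be higher than the database SLA.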
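Mitigation: configure a sufficiently large retry_timeout, and handle the problem operationally in one of the following ways:
- The SLA guarantees that DCS downtime per year is shorter than that duration.
- Operations staff can guarantee that a DCS service outage is resolved within retry_timeout.
- DBAs can guarantee that automatic failover is turned off (maintenance mode enabled) within retry_timeout.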
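The first option in the list is a matter of arithmetic: an availability SLA translates into a yearly downtime budget, which can be compared with retry_timeout. The following is a minimal sketch, assuming a 365-day year and treating the whole budget as one worst-case contiguous outage. The function names and sample values are illustrative, not Pigsty or Patroni defaults.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: non-leap year


def annual_downtime_budget(sla_percent: float) -> float:
    """Seconds per year that a service with the given availability SLA may be down."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return SECONDS_PER_YEAR * (1 - sla_percent / 100)


def budget_fits_retry_timeout(sla_percent: float, retry_timeout_s: float) -> bool:
    """True if even a single outage that consumes the whole yearly budget
    would end before the primary gives up on the DCS."""
    return annual_downtime_budget(sla_percent) < retry_timeout_s
```

For example, a 99.99% SLA leaves about 3153.6 seconds (roughly 53 minutes) per year, which fits under a hypothetical one-hour retry_timeout, while a 99.9% SLA (about 8.76 hours) does not.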
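A possible optimization: add peer-to-peer checks that bypass the DCS, so that a primary that can tell it is still in the majority partition does not demote itself.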
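This idea can be sketched as a quorum test: the primary counts itself plus the peers it can reach directly, and keeps its role only if that is a strict majority of the cluster. The sketch is a hypothetical illustration of the proposal, not a description of what Patroni does. How peers are probed is left out, and the function name is made up.

```python
def in_majority_partition(reachable_peers: int, cluster_size: int) -> bool:
    """Return True if this node plus the peers it can reach directly
    form a strict majority of the cluster."""
    if cluster_size < 1 or not 0 <= reachable_peers < cluster_size:
        raise ValueError("reachable_peers must be in [0, cluster_size - 1]")
    # Count this node itself (+1) and require more than half of all members.
    return (reachable_peers + 1) * 2 > cluster_size
```

In a three-node cluster, a primary that still reaches one replica holds 2 of 3 votes and would stay up. In a two-node cluster, an isolated primary holds exactly half and would still have to demote.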
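Key question: HA strategy, RPO first or RTO first?
Which takes priority, availability or consistency? For example, ordinary databases put RTO first, while financial and payment databases put RPO first.
Ordinary databases may lose a very small amount of data during an emergency failover (the threshold is configurable, for example the most recent 1 MB of writes).
Databases that handle money must not lose data, so a failover needs more, and more careful, checks or manual intervention.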
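An RTO-first policy with a bounded data-loss threshold can be sketched as a filter over failover candidates: a replica qualifies only if its replication lag is within the configured limit. Patroni exposes a setting of this kind, maximum_lag_on_failover (in bytes). The code below is an illustrative model, not Patroni's implementation, and the Member type and the sample lags are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    name: str
    lag_bytes: int  # replication lag behind the last known primary position


def eligible_candidates(members: List[Member], max_lag_bytes: int) -> List[Member]:
    """Replicas whose lag is within the allowed data-loss threshold."""
    return [m for m in members if m.lag_bytes <= max_lag_bytes]


def pick_candidate(members: List[Member], max_lag_bytes: int) -> Optional[Member]:
    """Pick the least-lagged eligible replica, or None if every promotion
    would lose more data than allowed (manual intervention needed)."""
    candidates = eligible_candidates(members, max_lag_bytes)
    return min(candidates, key=lambda m: m.lag_bytes, default=None)
```

Setting the threshold to 0 models the RPO-first case: only a replica that reports no lag qualifies, and if there is none the function returns None, which stands for escalating to a human.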
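Key question: the fencing mechanism, is powering off the machine acceptable?
Under normal conditions, when the leader changes, Patroni first fences the old primary by killing its PG processes.
In some extreme cases, such as a paused VM, a software bug, or extremely high load, this may not succeed, and the machine then has to be rebooted to settle the matter once and for all. Is that acceptable? How does the system behave under such extreme conditions?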
Key operation: after leader election
Remember to persist state once a new leader has been elected: run a manual checkpoint to be safe.
Key question: how should traffic be switched: layer 2, layer 4, or layer 7?
- Layer 2: VIP failover
- Layer 4: HAProxy distribution
- Layer 7: DNS resolution
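Whichever layer does the switching, something has to find out which member is currently the primary. A common approach with HAProxy is an HTTP health check against Patroni's REST API, which answers 200 on its primary endpoint only on the leader. The following is a minimal sketch of that check, assuming the REST API listens on port 8008 and exposes /primary. Older Patroni releases use /master, so the path is a parameter. The helper names are illustrative.

```python
from typing import Callable, Iterable, Optional
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def probe(url: str, timeout: float = 2.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if the node is unreachable."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status
    except HTTPError as err:      # non-2xx answers, such as 503 from a replica
        return err.code
    except (URLError, OSError):   # connection refused, timeout, DNS failure
        return None


def find_primary(
    hosts: Iterable[str],
    port: int = 8008,             # assumption: Patroni REST API port
    path: str = "/primary",       # assumption: use "/master" on older releases
    probe_fn: Callable[[str], Optional[int]] = probe,
) -> Optional[str]:
    """Return the first host whose health check answers 200, else None."""
    for host in hosts:
        if probe_fn(f"http://{host}:{port}{path}") == 200:
            return host
    return None
```

Calling find_primary(["10.10.10.11", "10.10.10.12", "10.10.10.13"]) against the pg-test cluster would return the address of the current leader, or None while no member holds the leader lock. The probe function is injectable so the selection logic can be exercised without a running cluster.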
Key question: the special case of one primary and one replica
- Layer 2: VIP failover
- Layer 4: HAProxy distribution
- Layer 7: DNS resolution
HA Procedure
Failure Detection
https://patroni.readthedocs.io/en/latest/SETTINGS.html#dynamic-configuration-settings
Fencing
Configure Watchdog
https://patroni.readthedocs.io/en/latest/watchdog.html