跳到主要内容
版本:1.0.13

手工主从切换

下面我们来模拟主库故障,需要将从库提升为主库的场景。

模拟主库故障

使用kill命令,将主库进程杀死。例如:

$ kill -9 18049

主库故障后,从库系统日志中会出现错误:

FATAL: could not connect to the primary server: could not connect to server: Connection refused

​ Is the server running on host "node1" (192.168.56.101) and accepting

​ TCP/IP connections on port 1921?

从库提示无法连接到主库。


从库提升

使用pg_ctl promote命令来提升从库。

pg_ctl  promote
LOG: received promote request
LOG: redo done at 0/6000110
LOG: selected new timeline ID: 2
LOG: archive recovery complete
LOG: database system is ready to accept connections
done
server promoted

一旦pg_ctl promote命令正确完成,此时从库即备提升为主库,可以正常对外提供服务。


当原主库故障后,通常这些故障并不会导致数据丢失,例如:宕机、机器重启故障。如果原主库故障解决之后,通常我们全把原主库转换成新主库的Standby备库。正常一般是需要重新搭建备库,因为原主库的一些数据还没有同步过去就把备库激活了,备库相当于丢失了一些数据。从而重新搭建备库的话,如果数据库很大,基础备份执行时间会很长,解决这个问题的方法可以使用pg_rewind命令。不需要复制太多的数据就可以把原主库转换成新主库的备库。

使用pg_rewind命令要求原主库必须把wal_log_hints参数设置成on,这样配置的主库在出现故障时才能使用pg_rewind命令。

如果没有把参数wal_log_hints设置成on,运行pg_rewind时会报错:

[halo@node1 halo]$ pg_rewind -D $PGDATA --source-server='host=10.16.16.165 user=pgrewind password=123456 dbname=halo0root' -P
pg_rewind: connected to server
pg_rewind: fatal: target server needs to use either data checksums or "wal_log_hints = on"

下面演示pg_rewind的使用方法。

主库:node1 备库:node2

当node1故障了,现将node2升级为主库。

[halo@node2 ~]$ pg_ctl promtoe
server promoted

之后把node1启动,然后执行pg_rewind命令:

[halo@node1 ~]$ pg_rewind -D $PGDATA --source-server='host=node2 user=pgrewind password=123456 dbname=halo0root' -P
pg_rewind: fatal: target server must be shut down cleanly

报错需要把node1启动后,再正常关闭。

[halo@node1]$ pg_ctl start
done
server started

[halo@node1]$ pg_ctl stop
done
server stopped

然后再执行pg_rewind:

[halo@node1]$ pg_rewind -D $PGDATA --source-server='host=node2 user=pgrewind password=123456 dbname=halo0root' -P
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 0/30000D8 on timeline 1
pg_rewind: rewinding from last common checkpoint at 0/3000060 on timeline 1
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
pg_rewind: need to copy 51 MB (total source directory size is 72 MB)
52873/52873 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!

注意,上面的“-D”参数指向本地的目录,“node2”可以是IP地址,user需要是超级用户。

pg_rewind执行完成之后,需要手动创建文件“standby.signal”:

touch $PGDATA/standby.signal

并将postgresql.auto.conf文件中的信息删掉。然后配置postgresql.conf中添加如下内容:

primary_conninfo = 'user=replica password=123456 channel_binding=prefer host=node2 port=1921 sslmode=prefer sslcompression=0 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=halo target_session_attrs=any'

这样原主库node1才能变成新主库node2的备库。

现在启动node1数据库,就变成了node2的备库:

[halo@node1]$ pg_ctl start
waiting for server to start....2022-04-21 15:08:08.127 CST [23501] LOG: starting Halo 13.6 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.3.1 20190507 (NFS Server 8.3.1-5), 64-bit
2022-04-21 15:08:08.127 CST [23501] LOG: listening on IPv4 address "0.0.0.0", port 1921
2022-04-21 15:08:08.127 CST [23501] LOG: listening on IPv6 address "::", port 1921
2022-04-21 15:08:08.127 CST [23501] LOG: listening on Unix socket "/var/run/halo/.s.PGSQL.1921"
2022-04-21 15:08:08.129 CST [23502] LOG: database system was interrupted while in recovery at log time 2022-04-21 15:05:24 CST
2022-04-21 15:08:08.129 CST [23502] HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2022-04-21 15:08:08.160 CST [23502] LOG: entering standby mode
2022-04-21 15:08:08.161 CST [23502] LOG: redo starts at 0/30000D8
2022-04-21 15:08:08.162 CST [23502] LOG: invalid record length at 0/30190E8: wanted 24, got 0
2022-04-21 15:08:08.166 CST [23505] LOG: started streaming WAL from primary at 0/3000000 on timeline 2
2022-04-21 15:08:08.167 CST [23502] LOG: consistent recovery state reached at 0/3019120
2022-04-21 15:08:08.167 CST [23501] LOG: database system is ready to accept read only connections
done
server started

注意:一定要先建好standby.signal文件,再启动数据库。