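[Production] nebula does not use the snapshot feature, but queries report no leader and the service is unavailable; storage logs are about snapshot recovery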

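  • nebula version: 3.4.1
  • Deployment: distributed
  • Installation method: RPM
  • In production: Y
  • Hardware
    • Disk: HDD (mechanical)
    • CPU and memory info
  • Detailed description of the problem
    Queries fail with the error: Not the leader of 107. Please retry later.
    Looking at the storage side, two nodes in the cluster show high CPU usage, all of it from snapshot-worker threads, and one of those nodes also shows disk writes. Is the cluster using snapshots to sync committed data? Was this caused by a write that stopped abnormally? How can I restore service as quickly as possible? This is urgent.
    On one of the nodes
    iotop: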
0692 be/4 root 0.00 B/s 1942.24 K/s  0.00 %  0.00 % nebula-storaged --flagfile /root /nebula-graph-3.4.1/etc/nebula-storaged.conf [snapshot-worker]
40694 be/4 root 0.00 B/s    2.03 M/s  0.00 %  0.00 % nebula-storaged --flagfile /root /nebula-graph-3.4.1/etc/nebula-storaged.conf [snapshot-worker]
40693 be/4 root 0.00 B/s 1800.58 K/s  0.00 %  0.00 % nebula-storaged --flagfile /root /nebula-graph-3.4.1/etc/nebula-storaged.conf [snapshot-worker]

top -H

 90839 root  20   0   24.6g  11.4g  12324 R 14.7  2.3 150:59.72 snapshot-worker
 90840 root  20   0   24.6g  11.4g  12324 S 14.1  2.3 150:08.58 snapshot-worker
 90841 root  20   0   24.6g  11.4g  12324 S 14.1  2.3 150:08.31 snapshot-worker
 90839 root  20   0   24.6g  11.4g  12324 R 13.7  2.3 151:02.78 snapshot-worker
 90842 root  20   0   24.6g  11.4g  12324 S 13.7  2.3 150:05.51 snapshot-worker
  • Relevant meta / storage / graph info logs (text form preferred so they are searchable)
I20230718 08:54:14.984810 90839 NebulaSnapshotManager.cpp:67] Space 168 Part 135 start send snapshot of commitLogId 2283319 commitLogTerm 3, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.984802 90842 NebulaSnapshotManager.cpp:67] Space 165 Part 79 start send snapshot of commitLogId 2493751 commitLogTerm 3, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.984823 90840 NebulaSnapshotManager.cpp:67] Space 168 Part 135 start send snapshot of commitLogId 2283319 commitLogTerm 3, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.984841 90841 NebulaSnapshotManager.cpp:67] Space 168 Part 107 start send snapshot of commitLogId 2288331 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994158 90842 NebulaSnapshotManager.cpp:67] Space 162 Part 149 start send snapshot of commitLogId 2079143 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994171 90840 NebulaSnapshotManager.cpp:67] Space 165 Part 65 start send snapshot of commitLogId 2491622 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994163 90839 NebulaSnapshotManager.cpp:67] Space 162 Part 149 start send snapshot of commitLogId 2079143 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994181 90842 NebulaSnapshotManager.cpp:67] Space 168 Part 51 start send snapshot of commitLogId 2288670 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994186 90840 NebulaSnapshotManager.cpp:67] Space 167 Part 93 start send snapshot of commitLogId 825674 commitLogTerm 3, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994194 90839 NebulaSnapshotManager.cpp:67] Space 166 Part 93 start send snapshot of commitLogId 1073235 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994203 90840 NebulaSnapshotManager.cpp:67] Space 164 Part 177 start send snapshot of commitLogId 4437966 commitLogTerm 3, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994199 90842 NebulaSnapshotManager.cpp:67] Space 162 Part 107 start send snapshot of commitLogId 2079891 commitLogTerm 3, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994215 90839 NebulaSnapshotManager.cpp:67] Space 164 Part 177 start send snapshot of commitLogId 4437966 commitLogTerm 3, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994222 90840 NebulaSnapshotManager.cpp:67] Space 167 Part 51 start send snapshot of commitLogId 826229 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994227 90842 NebulaSnapshotManager.cpp:67] Space 168 Part 51 start send snapshot of commitLogId 2288670 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994185 90841 NebulaSnapshotManager.cpp:67] Space 166 Part 93 start send snapshot of commitLogId 1073235 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994238 90839 NebulaSnapshotManager.cpp:67] Space 162 Part 107 start send snapshot of commitLogId 2079891 commitLogTerm 3, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994248 90841 NebulaSnapshotManager.cpp:67] Space 167 Part 93 start send snapshot of commitLogId 825674 commitLogTerm 3, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994241 90842 NebulaSnapshotManager.cpp:67] Space 164 Part 135 start send snapshot of commitLogId 4440109 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994261 90839 NebulaSnapshotManager.cpp:67] Space 164 Part 135 start send snapshot of commitLogId 4440109 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994268 90841 NebulaSnapshotManager.cpp:67] Space 165 Part 65 start send snapshot of commitLogId 2491622 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994283 90839 NebulaSnapshotManager.cpp:67] Space 165 Part 79 start send snapshot of commitLogId 2493751 commitLogTerm 3, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994293 90841 NebulaSnapshotManager.cpp:67] Space 167 Part 51 start send snapshot of commitLogId 826229 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994239 90840 NebulaSnapshotManager.cpp:67] Space 167 Part 107 start send snapshot of commitLogId 821256 commitLogTerm 3, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994311 90841 NebulaSnapshotManager.cpp:67] Space 167 Part 107 start send snapshot of commitLogId 821256 commitLogTerm 3, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994274 90842 NebulaSnapshotManager.cpp:67] Space 168 Part 107 start send snapshot of commitLogId 2288331 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994436 90840 NebulaSnapshotManager.cpp:67] Space 164 Part 135 start send snapshot of commitLogId 4440109 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994447 90841 NebulaSnapshotManager.cpp:67] Space 164 Part 135 start send snapshot of commitLogId 4440109 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20230718 08:54:14.994457 90842 NebulaSnapshotManager.cpp:67] Space 168 Part 135 start send snapshot of commitLogId 2283319 commitLogTerm 3, rate limited to 10485760, batch size is 1048576
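
An excerpt like the one above is easier to read once it is grouped by partition. The following is a minimal Python sketch, not part of NebulaGraph, that counts how many "start send snapshot" lines appear for each (Space, Part) pair. The function name `summarize_snapshot_sends` and the regular expression are assumptions based only on the log format shown in this post.

```python
import re
from collections import Counter

# Pattern derived from the "start send snapshot" lines quoted in this post.
# It is an assumption about the log format, not an official specification.
SNAPSHOT_RE = re.compile(
    r"Space (?P<space>\d+) Part (?P<part>\d+) start send snapshot of "
    r"commitLogId (?P<log_id>\d+) commitLogTerm (?P<term>\d+)"
)


def summarize_snapshot_sends(lines):
    """Count "start send snapshot" lines per (space, part) pair.

    Lines that do not match the pattern are ignored.
    """
    counts = Counter()
    for line in lines:
        match = SNAPSHOT_RE.search(line)
        if match:
            counts[(int(match["space"]), int(match["part"]))] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the excerpt above, used as sample input.
    sample = [
        "I20230718 08:54:14.984810 90839 NebulaSnapshotManager.cpp:67] Space 168 Part 135 "
        "start send snapshot of commitLogId 2283319 commitLogTerm 3, "
        "rate limited to 10485760, batch size is 1048576",
        "I20230718 08:54:14.984823 90840 NebulaSnapshotManager.cpp:67] Space 168 Part 135 "
        "start send snapshot of commitLogId 2283319 commitLogTerm 3, "
        "rate limited to 10485760, batch size is 1048576",
    ]
    for (space, part), n in sorted(summarize_snapshot_sends(sample).items()):
        print(f"space={space} part={part} snapshot_sends={n}")
```

Feeding the full storage log through this function shows which partitions are being re-sent and whether the same partition keeps restarting its snapshot.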

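My initial guess is that a failed write job left the parts of some spaces out of sync across the cluster, and the snapshot mechanism was then used to sync the data. Is that a reasonable explanation? Could someone help confirm?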

Check whether the storage service on any node has gone down at some point.

I did restart it manually once...

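This usually happens when the storage service on one node goes down and, in the meantime, the WAL logs on the other nodes exceed the configured wal_ttl and get cleaned up. The data can then no longer be synced to that node through the WAL, so a snapshot has to be sent again to sync it.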
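
The reasoning in the reply above can be sketched as a small time-based model. This is a simplification for illustration only: it assumes WAL entries become unavailable once they are older than wal_ttl, and it assumes the "rate limited to 10485760" value in the logs is in bytes per second. The function names `catch_up_method` and `estimate_snapshot_seconds` are hypothetical, and the real decision is made inside the storage service's Raft implementation, which this thread does not show.

```python
def catch_up_method(downtime_seconds: float, wal_ttl_seconds: float) -> str:
    """Return "wal" or "snapshot" under a simplified, time-based model.

    Assumption: if a node was down for less than wal_ttl, its peers still
    hold the WAL entries it missed; otherwise those entries were cleaned up
    and a full snapshot is needed.
    """
    if downtime_seconds < 0 or wal_ttl_seconds <= 0:
        raise ValueError("downtime must be >= 0 and wal_ttl must be > 0")
    return "wal" if downtime_seconds < wal_ttl_seconds else "snapshot"


def estimate_snapshot_seconds(part_bytes: int,
                              rate_limit_bytes_per_s: int = 10485760) -> float:
    """Lower bound on the time to send one part at the logged rate limit.

    Assumption: the rate limit is in bytes per second and applies per part.
    """
    if part_bytes < 0 or rate_limit_bytes_per_s <= 0:
        raise ValueError("part_bytes must be >= 0 and rate limit must be > 0")
    return part_bytes / rate_limit_bytes_per_s


if __name__ == "__main__":
    # Hypothetical numbers: a 4-hour wal_ttl, a 6-hour outage, a 1 GiB part.
    print(catch_up_method(downtime_seconds=6 * 3600, wal_ttl_seconds=4 * 3600))
    print(estimate_snapshot_seconds(part_bytes=1024 ** 3))
```

Under these assumptions, a node that stays down longer than wal_ttl will trigger snapshot transfers on restart, and the rate limit puts a floor on how long each partition takes to catch up.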
