In a 3-node cluster, metad fails to start after one node's server goes down

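  • nebula version: 3.0.1
  • Deployment: distributed
  • Installation method: compiled from source
  • In production: Y
  • Hardware information
    • Disk (SSD recommended)
    • CPU and memory information
  • Detailed description of the problem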

In a 3-node cluster, the server hosting one of the nodes went down, and since then metad on that node will not start. The metad log shows:

I20240514 10:14:08.385361 85891 MetaDaemon.cpp:166] localhost = “110.2.46.153”:9559
I20240514 10:14:08.424789 85891 NebulaStore.cpp:51] Start the raft service…
I20240514 10:14:08.431066 85891 NebulaSnapshotManager.cpp:25] Send snapshot is rate limited to 10485760 for each part by default
I20240514 10:14:08.435412 85891 RaftexService.cpp:63] Init thrift server for raft service, port: 9560
I20240514 10:14:08.436688 85955 RaftexService.cpp:94] Starting the Raftex Service
I20240514 10:14:08.547808 85955 RaftexService.cpp:84] Starting the Raftex Service on 9560
I20240514 10:14:08.547865 85955 RaftexService.cpp:106] Start the Raftex Service successfully
I20240514 10:14:08.548048 85891 NebulaStore.cpp:83] Scan the local path, and init the spaces_
I20240514 10:14:08.548200 85891 NebulaStore.cpp:89] Scan path “/opt/share/Product/DMEGraphService/graph-data/meta/nebula/0”
I20240514 10:14:08.617885 85891 RocksEngine.cpp:142] open rocksdb on /opt/share/Product/DMEGraphService/graph-data/meta/nebula/0/data
I20240514 10:14:08.617933 85891 NebulaStore.cpp:113] Load space 0 from disk
I20240514 10:14:08.617969 85891 NebulaStore.cpp:141] Need to open 1 parts of space 0
I20240514 10:14:08.618456 85940 Part.cpp:53] [Port: 9560, Space: 0, Part: 0] Cannot fetch the last committed log id from the storage engine
I20240514 10:14:08.618475 85940 RaftPart.cpp:299] [Port: 9560, Space: 0, Part: 0] There are 2 peer hosts, and total 3 copies. The quorum is 2, as learner 0, lastLogId 0, lastLogTerm 0, committedLogId 0, committedLogTerm 0, term 0
I20240514 10:14:08.618484 85940 RaftPart.cpp:307] [Port: 9560, Space: 0, Part: 0] Add peer “110.2.46.151”:9560
I20240514 10:14:08.618505 85940 RaftPart.cpp:307] [Port: 9560, Space: 0, Part: 0] Add peer “110.2.46.152”:9560
I20240514 10:14:08.618680 85940 NebulaStore.cpp:145] Load part 0, 0 from disk
I20240514 10:14:08.618719 85891 NebulaStore.cpp:160] Load space 0 complete
I20240514 10:14:08.618762 85891 NebulaStore.cpp:169] Init data from partManager for “110.2.46.153”:9559
I20240514 10:14:08.618780 85891 NebulaStore.cpp:264] Data space 0 has existed!
I20240514 10:14:08.618799 85891 NebulaStore.cpp:304] [Space: 0, Part: 0] has existed!
I20240514 10:14:08.618839 85891 NebulaStore.cpp:76] Register handler…
I20240514 10:14:08.618849 85891 MetaDaemonInit.cpp:104] Waiting for the leader elected…
I20240514 10:14:08.618857 85891 MetaDaemonInit.cpp:116] Leader has not been elected, sleep 1s
I20240514 10:14:09.187132 85941 RaftPart.cpp:1018] [Port: 9560, Space: 0, Part: 0] Start leader election, reason: lastMsgDur 569, term 0
I20240514 10:14:09.187196 85941 RaftPart.cpp:1184] [Port: 9560, Space: 0, Part: 0] Sending out an election request (space = 0, part = 0, term = 1, lastLogId = 0, lastLogTerm = 0, candidateIP = 110.2.46.153, candidatePort = 9560), isPreVote = 1
W20240514 10:14:09.198154 85936 RaftPart.cpp:1122] [Port: 9560, Space: 0, Part: 0] Receive response about askForVote from “110.2.46.152”:9560, error code is E_RAFT_TERM_OUT_OF_DATE, isPreVote = 1
W20240514 10:14:09.198611 85936 RaftPart.cpp:1122] [Port: 9560, Space: 0, Part: 0] Receive response about askForVote from “110.2.46.151”:9560, error code is E_RAFT_TERM_OUT_OF_DATE, isPreVote = 1
I20240514 10:14:09.532050 85936 RaftPart.cpp:1683] [Port: 9560, Space: 0, Part: 0] The current role is Follower. Will follow the new leader “110.2.46.151”:9560 on term 13
I20240514 10:14:09.532181 85944 Part.cpp:206] [Port: 9560, Space: 0, Part: 0] Find the new leader “110.2.46.151”:9560
I20240514 10:14:09.619016 85891 KVBasedClusterIdMan.h:83] There is no clusterId existed in kvstore!
I20240514 10:14:09.619061 85891 MetaDaemonInit.cpp:132] I am follower, wait for the leader’s clusterId
I20240514 10:14:09.619071 85891 MetaDaemonInit.cpp:134] Waiting for the leader’s clusterId
I20240514 10:14:10.619274 85891 KVBasedClusterIdMan.h:83] There is no clusterId existed in kvstore!
I20240514 10:14:10.619324 85891 MetaDaemonInit.cpp:134] Waiting for the leader’s clusterId
I20240514 10:14:11.619508 85891 KVBasedClusterIdMan.h:83] There is no clusterId existed in kvstore!
I20240514 10:14:11.619555 85891 MetaDaemonInit.cpp:134] Waiting for the leader’s clusterId
I20240514 10:14:12.619711 85891 KVBasedClusterIdMan.h:83] There is no clusterId existed in kvstore!
I20240514 10:14:12.619755 85891 MetaDaemonInit.cpp:134] Waiting for the leader’s clusterId
I20240514 10:14:13.619913 85891 KVBasedClusterIdMan.h:83] There is no clusterId existed in kvstore!
I20240514 10:14:13.619968 85891 MetaDaemonInit.cpp:134] Waiting for the leader’s clusterId
I20240514 10:14:14.620128 85891 KVBasedClusterIdMan.h:83] There is no clusterId existed in kvstore!
I20240514 10:14:14.620190 85891 MetaDaemonInit.cpp:134] Waiting for the leader’s clusterId
I20240514 10:14:15.620666 85891 KVBasedClusterIdMan.h:83] There is no clusterId existed in kvstore!
I20240514 10:14:15.620735 85891 MetaDaemonInit.cpp:134] Waiting for the leader’s clusterId
I20240514 10:14:16.620930 85891 KVBasedClusterIdMan.h:83] There is no clusterId existed in kvstore!
I20240514 10:14:16.621003 85891 MetaDaemonInit.cpp:134] Waiting for the leader’s clusterId

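Were services previously deployed and started on these three machines? If so, you can delete the cluster.id file under the installation directory and restart the service.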

I have already deleted both cluster.id and the pids on the failed node, and it still does not work.

Could someone more experienced take a look? It seems to be stuck in an endless loop:
Waiting for the leader’s clusterId
I20240514 10:14:13.619913 85891 KVBasedClusterIdMan.h:83] There is no clusterId existed in kvstore!
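
To check whether the node really is looping on the same step, the log can be summarized with a short script. This is only a sketch: `summarize_metad_log` is a hypothetical helper (not part of NebulaGraph), and the regular expressions assume the glog line layout and message wording seen in the log posted above, including the typographic quotes the forum rendered.

```python
import re
from collections import Counter

# Assumed glog layout: level+date, time, thread id, file:line], message.
GLOG_LINE = re.compile(
    r'^[IWEF]\d{8} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ ([\w.]+):(\d+)\] (.*)$'
)
# The host may be wrapped in straight or typographic quotes.
NEW_LEADER = re.compile(
    r'Will follow the new leader ["“]?([\d.]+)["”]?:(\d+) on term (\d+)'
)


def summarize_metad_log(lines):
    """Return (leader, repeated): the last leader this node followed as
    (ip, port, term) or None, and every message seen more than once."""
    messages = Counter()
    leader = None
    for line in lines:
        match = GLOG_LINE.match(line.strip())
        if not match:
            continue  # not a glog line, e.g. forum prose
        message = match.group(3)
        messages[message] += 1
        found = NEW_LEADER.search(message)
        if found:
            leader = (found.group(1), int(found.group(2)), int(found.group(3)))
    repeated = {msg: n for msg, n in messages.items() if n > 1}
    return leader, repeated
```

Run against the log above, this would report that the node has already found a leader (110.2.46.151:9560, term 13) and that the only repeating messages are the two clusterId lines, which suggests Raft election itself succeeded and the node is waiting on the clusterId step.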

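If the logs of the other two meta nodes look normal, you could consider clearing this meta node's data directory and restarting it. This is the directory:
/opt/share/Product/DMEGraphService/graph-data/meta/nebula/0
You can mv it to 0.bak, which also keeps a backup.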
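
The rename suggested above can be done a little more defensively with a script. A minimal sketch, assuming metad on the faulty node has been stopped first and the other two meta nodes are healthy; `backup_meta_dir` is a hypothetical helper name, not a NebulaGraph tool.

```python
import os


def backup_meta_dir(data_dir, suffix=".bak"):
    """Rename data_dir to data_dir + suffix and return the backup path.

    Refuses to run if the directory is missing or a backup already
    exists, so an earlier backup is never overwritten.
    """
    data_dir = os.path.normpath(data_dir)
    backup = data_dir + suffix
    if not os.path.isdir(data_dir):
        raise FileNotFoundError(data_dir)
    if os.path.exists(backup):
        raise FileExistsError(backup)
    os.rename(data_dir, backup)  # same effect as: mv 0 0.bak
    return backup


if __name__ == "__main__":
    # Path taken from the reply above; run only with metad stopped.
    print(backup_meta_dir(
        "/opt/share/Product/DMEGraphService/graph-data/meta/nebula/0"
    ))
```

Because the old directory is renamed rather than deleted, it can be moved back if the restart does not help.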