storaged reports "error code is E_RAFT_UNKNOWN_PART" and "Did not get enough votes"

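  • nebula version: 3.6.0
  • Deployment: distributed
  • Installation method: RPM
  • In production: Y
  • Detailed description of the problem
    storaged would not stop when I shut down the services, so I stopped it with kill -9 and then restarted the services. Since the restart, the storaged process has stayed above 100% CPU and never becomes ready. The log shows "error code is E_RAFT_UNKNOWN_PART" and "Did not get enough votes", so I suspect Raft consistency is broken. How can this be resolved? Many graph spaces report this error, so the impact is wide.
  • Relevant meta / storage / graph info logs (text form preferred, so they are searchable)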
I20260304 17:12:18.562439 1528010 RaftPart.cpp:1261] [Port: 9780, Space: 16299, Part: 3] Receive response about askForVote from "7.227.5.1":9780, error code is E_RAFT_UNKNOWN_PART, isPreVote = 1
I20260304 17:12:18.562467 1528010 RaftPart.cpp:1288] [Port: 9780, Space: 16299, Part: 3] Did not get enough votes from election of term 8, isPreVote = 1
I20260304 17:12:18.562502 1528091 RaftPart.cpp:1294] [Port: 9780, Space: 15712, Part: 5] Start leader election...
I20260304 17:12:18.562522 1528091 RaftPart.cpp:1322] [Port: 9780, Space: 15712, Part: 5] Sending out an election request (space = 15712, part = 5, term = 9, lastLogId = 171982, lastLogTerm = 8, candidateIP = 7.227.56.193, candidatePort = 9780), isPreVote = 1
I20260304 17:12:18.562556 1528036 ThriftClientManager-inl.h:47] Getting a client to "10.28.80.26":9780
I20260304 17:12:18.562587 1528036 ThriftClientManager-inl.h:47] Getting a client to "7.227.5.1":9780
I20260304 17:12:18.570798 1528036 CollectNSucceeded-inl.h:59] Set Value [completed=2, total=2, Result list size=2]
I20260304 17:12:18.570847 1528010 RaftPart.cpp:1261] [Port: 9780, Space: 15712, Part: 5] Receive response about askForVote from "7.227.5.1":9780, error code is E_RAFT_UNKNOWN_PART, isPreVote = 1
I20260304 17:12:18.570873 1528010 RaftPart.cpp:1261] [Port: 9780, Space: 15712, Part: 5] Receive response about askForVote from "10.28.80.26":9780, error code is E_RAFT_UNKNOWN_PART, isPreVote = 1
I20260304 17:12:18.570880 1528010 RaftPart.cpp:1288] [Port: 9780, Space: 15712, Part: 5] Did not get enough votes from election of term 9, isPreVote = 1
I20260304 17:12:18.570914 1528091 RaftPart.cpp:1294] [Port: 9780, Space: 15841, Part: 3] Start leader election...
I20260304 17:12:18.570933 1528091 RaftPart.cpp:1322] [Port: 9780, Space: 15841, Part: 3] Sending out an election request (space = 15841, part = 3, term = 15, lastLogId = 165710, lastLogTerm = 14, candidateIP = 7.227.56.193, candidatePort = 9780), isPreVote = 1
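
Since many graph spaces report this error, it can help to summarize which spaces, parts, and peers are involved before deciding what to do. The following is a minimal Python sketch, not an official tool: the regular expression is written only against the "Receive response about askForVote" lines in the excerpt above, and the function names (`count_unknown_part`, `affected_spaces`) are made up for illustration.

```python
import re
from collections import Counter

# Pattern derived from the askForVote response lines in the excerpt above.
# Other NebulaGraph versions may format these lines differently.
VOTE_RESPONSE = re.compile(
    r'\[Port: (?P<port>\d+), Space: (?P<space>\d+), Part: (?P<part>\d+)\] '
    r'Receive response about askForVote from "(?P<peer>[^"]+)":(?P<peer_port>\d+), '
    r'error code is (?P<code>\w+)'
)


def count_unknown_part(lines):
    """Count E_RAFT_UNKNOWN_PART vote responses per (space, part, peer)."""
    counts = Counter()
    for line in lines:
        match = VOTE_RESPONSE.search(line)
        if match and match.group("code") == "E_RAFT_UNKNOWN_PART":
            key = (int(match.group("space")), int(match.group("part")), match.group("peer"))
            counts[key] += 1
    return counts


def affected_spaces(counts):
    """Return the sorted, de-duplicated space IDs that appear in the counts."""
    return sorted({space for space, _part, _peer in counts})


# Three lines copied from the log excerpt above, used as sample input.
SAMPLE = [
    'I20260304 17:12:18.562439 1528010 RaftPart.cpp:1261] [Port: 9780, Space: 16299, Part: 3] '
    'Receive response about askForVote from "7.227.5.1":9780, '
    'error code is E_RAFT_UNKNOWN_PART, isPreVote = 1',
    'I20260304 17:12:18.570847 1528010 RaftPart.cpp:1261] [Port: 9780, Space: 15712, Part: 5] '
    'Receive response about askForVote from "7.227.5.1":9780, '
    'error code is E_RAFT_UNKNOWN_PART, isPreVote = 1',
    'I20260304 17:12:18.570873 1528010 RaftPart.cpp:1261] [Port: 9780, Space: 15712, Part: 5] '
    'Receive response about askForVote from "10.28.80.26":9780, '
    'error code is E_RAFT_UNKNOWN_PART, isPreVote = 1',
]

sample_counts = count_unknown_part(SAMPLE)
print(affected_spaces(sample_counts))
for (space, part, peer), n in sorted(sample_counts.items()):
    print(f"space={space} part={part} peer={peer} responses={n}")
```

To run it against a real log, pass an open file object instead of `SAMPLE`, for example `count_unknown_part(open(path, errors="replace"))`, where `path` is the storaged INFO log on the affected host.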


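Deleting cluster.id and restarting fixed the original problem, and CPU usage is back to normal. However, the storaged log now reports "Snapshot send failed, the leader changed?". I am not sure whether this is normal synchronization between nodes or a sign of another problem. show hosts lists every host as online, so I plan to wait a while and see whether the cluster recovers.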


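Solved. The graph spaces reporting "the leader changed?" can no longer be found via show hosts. Most likely metad had already deleted them, but storaged was in an abnormal state at the time and did not respond correctly, which left stale data behind. I stopped the cluster, deleted the few graph spaces that were reporting the error, and restarted the cluster; it is back to normal.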
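
For anyone hitting the same situation, here is a hedged Python sketch for spotting such leftover spaces on a storaged host. It rests on two assumptions that are not confirmed in this thread: that storaged keeps one directory per space, named by the numeric space ID, under a root such as `<data_path>/nebula/`, and that you have collected the IDs of the spaces metad still knows about yourself (for example from the ID column of `DESCRIBE SPACE`). The function name `orphaned_space_dirs` is made up for illustration. The script only lists candidates and deletes nothing.

```python
from pathlib import Path


def orphaned_space_dirs(storage_root, known_space_ids):
    """List per-space directories whose numeric name is not a known space ID.

    storage_root: directory assumed to contain one sub-directory per space ID.
    known_space_ids: set of integer space IDs that metad still reports.
    """
    root = Path(storage_root)
    orphans = []
    for child in sorted(root.iterdir()):
        # Only consider directories with purely numeric names; skip everything else.
        if child.is_dir() and child.name.isdigit():
            if int(child.name) not in known_space_ids:
                orphans.append(child)
    return orphans
```

A call might look like `orphaned_space_dirs("/path/to/data/storage/nebula", {1, 2, 3})`, with the path and the ID set replaced by your own values. Review the result by hand and stop the cluster before removing anything, as was done above.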

