Storaged process fails to start

Question template:

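  • nebula version: 2.5.1

  • Deployment: distributed

  • Installation method: RPM

  • Production environment: N

  • Hardware

    • Disk: 8TB SSD * 4
    • CPU and memory: 80 cores, 198GB
  • Problem description
    The Storaged process fails to start on several machines, and the logs report E_UNKNOWN_PART and E_TERM_OUT_OF_DATE errors. Before the problem appeared, the NIC drivers on three machines had been updated one machine at a time. After the update, running the balance leader command had no effect, and running it again printed that a balance was already in progress. After waiting for a while with no result, we stopped the services and restarted the cluster. After the restart, storaged on several machines reports E_UNKNOWN_PART and fails to start. The Storaged log is as follows: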

E1027 01:12:27.860843 372934 RaftPart.cpp:1118] [Port: 44501, Space: 945, Part: 6] Receive response about askForVote from "10.201.18.58":44501, error code is E_UNKNOWN_PART
E1027 01:13:23.174355 372936 RaftPart.cpp:1118] [Port: 44501, Space: 33293, Part: 146] Receive response about askForVote from "10.201.18.58":44501, error code is E_UNKNOWN_PART
E1027 01:13:46.014721 372936 RaftPart.cpp:1118] [Port: 44501, Space: 18, Part: 6] Receive response about askForVote from "10.201.18.58":44501, error code is E_UNKNOWN_PART
E1027 01:15:22.326761 372937 RaftPart.cpp:1118] [Port: 44501, Space: 1, Part: 17] Receive response about askForVote from "10.201.18.58":44501, error code is E_UNKNOWN_PART
E1027 01:15:27.116305 372935 RaftPart.cpp:1118] [Port: 44501, Space: 945, Part: 111] Receive response about askForVote from "10.201.18.58":44501, error code is E_UNKNOWN_PART
E1027 01:15:47.248643 372937 RaftPart.cpp:1118] [Port: 44501, Space: 1, Part: 136] Receive response about askForVote from "10.201.18.58":44501, error code is E_UNKNOWN_PART
E1027 01:15:58.143913 372935 RaftPart.cpp:1118] [Port: 44501, Space: 38562, Part: 7] Receive response about askForVote from "10.201.18.58":44501, error code is E_UNKNOWN_PART
E1027 01:16:00.925247 372936 RaftPart.cpp:1118] [Port: 44501, Space: 33293, Part: 192] Receive response about askForVote from "10.201.18.58":44501, error code is E_UNKNOWN_PART
E1027 01:16:28.149758 372935 RaftPart.cpp:1118] [Port: 44501, Space: 945, Part: 32] Receive response about askForVote from "10.201.18.58":44501, error code is E_TERM_OUT_OF_DATE
E1027 01:16:44.338373 372934 RaftPart.cpp:1118] [Port: 44501, Space: 1, Part: 7] Receive response about askForVote from "10.201.18.58":44501, error code is E_TERM_OUT_OF_DATE
E1027 01:16:48.799746 372937 RaftPart.cpp:1118] [Port: 44501, Space: 945, Part: 11] Receive response about askForVote from "10.201.18.58":44501, error code is E_UNKNOWN_PART
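
To see at a glance which peer and which error code dominate, the ERROR log can be summarized with a short script. This is only a sketch: the regular expression is written against the exact line format pasted above, and `summarize` and `SAMPLE` are names made up for illustration, not part of any Nebula tooling.

```python
import re
from collections import Counter

# Pattern for the askForVote error lines pasted above (format assumed from those lines).
LINE_RE = re.compile(
    r'Space: (?P<space>\d+), Part: (?P<part>\d+)\] '
    r'Receive response about askForVote from '
    r'"(?P<host>[^"]+)":(?P<port>\d+), error code is (?P<code>\w+)'
)


def summarize(lines):
    """Count askForVote errors grouped by (peer address, error code)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            peer = "{}:{}".format(m.group("host"), m.group("port"))
            counts[(peer, m.group("code"))] += 1
    return counts


# Three lines copied from the log above, used as sample input.
SAMPLE = [
    'E1027 01:12:27.860843 372934 RaftPart.cpp:1118] [Port: 44501, Space: 945, Part: 6] Receive response about askForVote from "10.201.18.58":44501, error code is E_UNKNOWN_PART',
    'E1027 01:16:28.149758 372935 RaftPart.cpp:1118] [Port: 44501, Space: 945, Part: 32] Receive response about askForVote from "10.201.18.58":44501, error code is E_TERM_OUT_OF_DATE',
    'E1027 01:16:48.799746 372937 RaftPart.cpp:1118] [Port: 44501, Space: 945, Part: 11] Receive response about askForVote from "10.201.18.58":44501, error code is E_UNKNOWN_PART',
]

if __name__ == "__main__":
    for (peer, code), n in sorted(summarize(SAMPLE).items()):
        print(peer, code, n)
```

To run it on a real log, pass an open file object to `summarize` instead of `SAMPLE`.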

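Please post the INFO log. Also check the network and the firewall, and confirm that the TCP connections are actually established, for example with netstat -antp.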
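
Besides netstat, a quick reachability probe can be scripted. The sketch below assumes the three peer IPs and the two ports (44500 and 44501) that appear in this thread; `can_connect`, `PEERS` and `PORTS` are illustrative names. It only tests the outgoing direction from the machine it runs on, so it would need to be run on every storage host to confirm connectivity in both directions.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


# Addresses and ports taken from the logs in this thread; adjust for your cluster.
PEERS = ["10.201.16.58", "10.201.17.58", "10.201.18.58"]
PORTS = [44500, 44501]

if __name__ == "__main__":
    for host in PEERS:
        for port in PORTS:
            state = "open" if can_connect(host, port) else "unreachable"
            print("{}:{} {}".format(host, port, state))
```

A port reported as unreachable can mean either a network or firewall problem or simply that storaged is not listening on that host, so the result has to be read together with the process status.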

Output of the status all command:

Output of netstat -antp:

INFO log, taking one part as an example:

I1026 23:09:22.943861 372936 FileBasedWal.cpp:67] [Port: 44501, Space: 945, Part: 11] lastLogId in wal is 274229, lastLogTerm is 55, path is /data/hdfs2/nebula_graph_v2/storage/nebula/945/wal/11/0000000000000000001.wal
I1026 23:09:22.943974 372936 RaftPart.cpp:297] [Port: 44501, Space: 945, Part: 11] There are 2 peer hosts, and total 3 copies. The quorum is 2, as learner 0, lastLogId 274229, lastLogTerm 55, committedLogId 274228, term 55
I1026 23:09:22.943984 372936 RaftPart.cpp:310] [Port: 44501, Space: 945, Part: 11] Add peer "10.201.16.58":44501
I1026 23:09:22.943997 372936 RaftPart.cpp:310] [Port: 44501, Space: 945, Part: 11] Add peer "10.201.18.58":44501


I1026 23:17:21.113795 372900 RaftPart.cpp:1337] [Port: 44501, Space: 945, Part: 11] Recieved a VOTING request: space = 945, partition = 11, candidateAddr = 10.201.16.58:44501, term = 45, lastLogId = 274199, lastLogTerm = 35
I1026 23:17:21.113802 372900 RaftPart.cpp:1370] [Port: 44501, Space: 945, Part: 11] The partition currently is a Follower, lastLogId 274229, lastLogTerm 55, committedLogId 274228, term 55
I1026 23:17:21.113807 372900 RaftPart.cpp:1384] [Port: 44501, Space: 945, Part: 11] The partition currently is on term 55. The term proposed by the candidate is no greater, so it will be rejected
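
The rejection in the three lines above follows the usual Raft voting rule: a vote request carrying a term that is not greater than the receiver's current term is refused. The following is a simplified sketch of that rule, not the actual RaftPart.cpp logic. The class names and the return labels (`REJECT_TERM`, `REJECT_LOG`, `GRANT`) are made up for illustration, and the sketch ignores details such as whether the receiver has already voted in the term. That the term rejection is what the candidate sees as E_TERM_OUT_OF_DATE is an assumption based on the logs here.

```python
from dataclasses import dataclass


@dataclass
class PartState:
    """Local state of one partition replica (illustrative type)."""
    term: int
    last_log_id: int
    last_log_term: int


@dataclass
class VoteRequest:
    """Fields of an incoming vote request (illustrative type)."""
    term: int
    last_log_id: int
    last_log_term: int


def check_vote(local, req):
    """Simplified Raft vote check; returns a hypothetical label."""
    # The candidate's term must be strictly greater than the local term.
    if req.term <= local.term:
        return "REJECT_TERM"
    # The candidate's log must be at least as up to date as the local log.
    if (req.last_log_term, req.last_log_id) < (local.last_log_term, local.last_log_id):
        return "REJECT_LOG"
    return "GRANT"


if __name__ == "__main__":
    # Values from the INFO log above: local term 55, candidate term 45.
    local = PartState(term=55, last_log_id=274229, last_log_term=55)
    stale = VoteRequest(term=45, last_log_id=274199, last_log_term=35)
    print(check_vote(local, stale))
```

With the values from the log (candidate 10.201.16.58 at term 45 against a local term of 55), the check fails at the first step, which matches the "no greater, so it will be rejected" message.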


I1026 23:48:57.538766 372934 RaftPart.cpp:1018] [Port: 44501, Space: 945, Part: 11] Start leader election, reason: lastMsgDur 2375805, term 55
I1026 23:48:57.538782 372934 RaftPart.cpp:1168] [Port: 44501, Space: 945, Part: 11] Sending out an election request (space = 945, part = 11, term = 56, lastLogId = 274229, lastLogTerm = 55, candidateIP = 10.201.17.58, candidatePort = 44501)

I1026 23:49:27.542834 372934 RaftPart.cpp:1251] [Port: 44501, Space: 945, Part: 11] No one is elected, continue the election

I1027 00:27:20.427911 372935 RaftPart.cpp:1168] [Port: 44501, Space: 945, Part: 11] Sending out an election request (space = 945, part = 11, term = 57, lastLogId = 274229, lastLogTerm = 55, candidateIP = 10.201.17.58, candidatePort = 44501)

I1027 00:27:50.432433 372935 RaftPart.cpp:1251] [Port: 44501, Space: 945, Part: 11] No one is elected, continue the election

Error log:

E1027 01:16:48.799746 372937 RaftPart.cpp:1118] [Port: 44501, Space: 945, Part: 11] Receive response about askForVote from "10.201.18.58":44501, error code is E_UNKNOWN_PART

Port 44500 does not look healthy. Check the network; connectivity has to work in both directions.

The cluster has 5 machines in total, and storaged has not come up on 4 of them. status all shows the same state on each of them:
image

Right now the storaged process is not running on several machines, which is why port 44500 looks abnormal.

Here are the INFO and ERROR logs from one machine. Please help take a look at how this can be fixed.
nebula-storaged.tz-bd-hadoop016058.zeus.lianjia.com.root.log.ERROR.20211027-133010.135211 (465.5 KB)
nebula-storaged.tz-bd-hadoop016058.zeus.lianjia.com.root.log.INFO.20211027-133010.135211 (3.1 MB)

Kill this process manually, then restart it.
