Nebula cluster fails after running for a while and can no longer be connected to

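  • Nebula version: 2.5.1
  • Deployment: distributed 5-machine cluster; metad and graphd deployed on 3 machines, storaged deployed on all 5 servers
  • Installation method: RPM
  • Production environment: Y
  • Hardware
    • Mechanical hard disks
    • 32-core CPU, 256 GB
  • Data volume: roughly four or five vertex types with tens of millions of records; 40+ edge types with 200-300 million records
  • After running for a while the system becomes unstable: neither the API nor Studio can connect to the graphd service, and data can be neither written nor read
  • When the problem occurs, the backend logs are flooded with entries like these:
    [storaged.ERROR]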
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0607 09:15:52.634805 13196 MetaClient.cpp:635] Send request to "x.x.x.x5":9559, exceed retry limit
E0607 09:15:52.635146 13154 MetaClient.cpp:65] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0607 09:16:05.642508 13199 MetaClient.cpp:635] Send request to "x.x.x.x4":9559, exceed retry limit
E0607 09:16:05.642588 13154 MetaClient.cpp:65] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0607 09:16:41.334895 13454 RaftPart.cpp:1118] [Port: 9780, Space: 3, Part: 10] Receive response about askForVote from "x.x.x.x2":9780, error code is E_TERM_OUT_OF_DATE
E0607 09:16:41.334940 13454 RaftPart.cpp:1118] [Port: 9780, Space: 3, Part: 10] Receive response about askForVote from "x.x.x.x3":9780, error code is E_TERM_OUT_OF_DATE
E0607 09:16:41.335492 13454 RaftPart.cpp:1118] [Port: 9780, Space: 3, Part: 15] Receive response about askForVote from "x.x.x.x2":9780, error code is E_TERM_OUT_OF_DATE
E0607 09:16:41.335503 13454 RaftPart.cpp:1118] [Port: 9780, Space: 3, Part: 15] Receive response about askForVote from "x.x.x.x3":9780, error code is E_TERM_OUT_OF_DATE
E0607 09:16:41.336407 13454 RaftPart.cpp:1118] [Port: 9780, Space: 3, Part: 2] Receive response about askForVote from "x.x.x.x6":9780, error code is E_TERM_OUT_OF_DATE
E0607 09:16:41.336426 13454 RaftPart.cpp:1118] [Port: 9780, Space: 3, Part: 2] Receive response about askForVote from "x.x.x.x5":9780, error code is E_TERM_OUT_OF_DATE
E0607 09:16:41.340544 13454 RaftPart.cpp:1118] [Port: 9780, Space: 3, Part: 52] Receive response about askForVote from "x.x.x.x6":9780, error code is E_TERM_OUT_OF_DATE
E0607 09:16:41.340564 13454 RaftPart.cpp:1118] [Port: 9780, Space: 3, Part: 52] Receive response about askForVote from "x.x.x.x5":9780, error code is E_TERM_OUT_OF_DATE
E0607 09:16:41.340998 13454 RaftPart.cpp:1118] [Port: 9780, Space: 3, Part: 40] Receive response about askForVote from "x.x.x.x2":9780, error code is E_TERM_OUT_OF_DATE
E0

[metad.ERROR]

E0607 09:16:04.458173 13091 RaftPart.cpp:1118] [Port: 9560, Space: 0, Part: 0] Receive response about askForVote from "x.x.x.x":9560, error code is E_UNKNOWN_PART
E0607 09:16:06.166172 13092 RaftPart.cpp:1118] [Port: 9560, Space: 0, Part: 0] Receive response about askForVote from "x.x.x.x":9560, error code is E_UNKNOWN_PART
E0607 09:16:15.475127 13354 ActiveHostsMan.cpp:256] Get last update time failed, error: E_LEADER_CHANGED
E0607 09:16:18.703186 13354 ActiveHostsMan.cpp:256] Get last update time failed, error: E_LEADER_CHANGED
E0607 09:16:18.703187 13353 ActiveHostsMan.cpp:256] Get last update time failed, error: E_LEADER_CHANGED
E0607 09:16:18.703197 13350 ActiveHostsMan.cpp:256] Get last update time failed, error: E_LEADER_CHANGED

[graphd.ERROR]

E0607 09:19:40.356612 13534 MetaClient.cpp:635] Send request to "53.80.6.95":9559, exceed retry limit
E0607 09:19:40.356714 13439 MetaClient.cpp:131] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: Timed Out
E0607 09:19:40.821357 13531 GraphSessionManager.cpp:108] Create session failed:LeaderChanged: Leader changed!
E0607 09:19:40.821442 13531 GraphService.cpp:89] Create session for userName: root, ip: ::ffff:x.x.x.x failed: Create session failed: LeaderChanged: Leader changed!
E0607 09:19:55.825130 13531 GraphSessionManager.cpp:108] Create session failed:LeaderChanged: Leader changed!
E0607 09:19:55.825212 13531 GraphService.cpp:89] Create session for userName: root, ip: ::ffff:x.x.x.x failed: Create session failed: LeaderChanged: Leader changed!
E0607 09:20:10.826726 13537 MetaClient.cpp:635] Send request to "53.80.6.95":9559, exceed retry limit
E0607 09:20:10.826838 13531 GraphSessionManager.cpp:108] Create session failed:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: Timed Out
E0607 09:20:10.826887 13531 GraphService.cpp:89] Create session for userName: root, ip: ::ffff:x.x.x.x failed: Create session failed: RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: Timed Out
E0607 09:20:10.827867 13542 MetaClient.cpp:635] Send request to "53.80.6.95":9559, exceed retry limit
E0607 09:20:10.827927 13531 GraphSessionManager.cpp:108] Create session failed:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: Timed Out
E0607 09:20:10.827952 13531 GraphService.cpp:89] Create session for userName: root, ip: ::ffff:x.x.x.x failed: Create session failed: RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: Timed Out
E0607 09:20:18.313988 13531 GraphSessionManager.cpp:108] Create session failed:LeaderChanged: Leader changed!
E0607 09:20:18.314088 13531 GraphService.cpp:89] Create session for userName: root, ip: ::ffff:x.x.x.x failed: Create session failed: LeaderChanged: Leader changed!

There are too many logs to paste here in full. During normal use we also see many entries like these:

E0607 04:49:25.925011  2854 AddEdgesProcessor.cpp:164] edge locked : src RY-001-xxxxxxxxx, type -58, rank 0, dst RY-001-YYYYYYYYYY
E0607 04:49:26.935583  2858 AddEdgesProcessor.cpp:164] edge locked : src RY-001-xxxxxxxxx, type -58, rank 0, dst RY-001-YYYYYYYYYY
E0607 04:49:26.935657  2832 AddEdgesProcessor.cpp:164] edge locked : src RY-001-xxxxxxxxx, type 58, rank 0, dst RY-001-YYYYYYYYYY
E0607 04:49:26.935684  2832 AddEdgesProcessor.cpp:164] edge locked : src RY-001-xxxxxxxxx, type -58, rank 0, dst RY-001-YYYYYYYYYY
E0607 04:49:27.949560  2855 AddEdgesProcessor.cpp:164] edge locked : src RY-001-xxxxxxxxx, type -58, rank 0, dst RY-001-YYYYYYYYYY
E0607 04:49:27.949683  2862 AddEdgesProcessor.cpp:164] edge locked : src RY-001-xxxxxxxxx, type -58, rank 0, dst RY-001-YYYYYYYYYY
E0607 04:49:28.959389  2862 AddEdgesProcessor.cpp:164] edge locked : src RY-001-xxxxxxxxx, type -58, rank 0, dst RY-001-YYYYYYYYYY
E0607 04:50:12.937350  2850 AddEdgesProcessor.cpp:164] edge locked : src RY-001-xxxxxxxxx, type 58, rank 0, dst RY-001-YYYYYYYYYY
E0607 04:50:12.937407  2850 AddEdgesProcessor.cpp:164] edge locked : src RY-001-xxxxxxxxx, type 58, rank 0, dst RY-001-YYYYYYYYYY
E0607 04:50:12.937497  2845 AddEdgesProcessor.cpp:164] edge locked : src RY-001-xxxxxxxxx, type 58, rank 0, dst RY-001-YYYYYYYYYY
E0607 04:50:12.937737  2850 AddEdgesProcessor.cpp:164] edge locked : src RY-001-xxxxxxxxx, type -58, rank 0, dst RY-001-YYYYYYYYYY

I am not sure whether these entries are related to the failure.
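With this many log lines, a quick tally by source location can help judge whether the "edge locked" entries and the MetaClient/Raft errors come in comparable volumes. The following is only a rough sketch, assuming the glog header format quoted above (`[IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg`); the function names `parse_line` and `tally` are made up for illustration and are not part of Nebula.

```python
import re
from collections import Counter

# glog header: severity letter, mmdd, time, thread id, file:line, then the message.
GLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[\w.]+):(?P<line>\d+)\] (?P<msg>.*)$"
)

def parse_line(line):
    """Return the header fields of one glog line as a dict, or None if it does not match."""
    m = GLOG_RE.match(line.strip())
    return m.groupdict() if m else None

def tally(lines, severity="E"):
    """Count lines of the given severity, grouped by 'file:line'."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["sev"] == severity:
            counts[f'{rec["src"]}:{rec["line"]}'] += 1
    return counts

# Three lines copied from the logs in this thread.
sample = [
    'E0607 09:15:52.634805 13196 MetaClient.cpp:635] Send request to "x.x.x.x5":9559, exceed retry limit',
    'E0607 04:49:25.925011  2854 AddEdgesProcessor.cpp:164] edge locked : src RY-001-xxxxxxxxx, type -58, rank 0, dst RY-001-YYYYYYYYYY',
    'E0607 04:49:26.935583  2858 AddEdgesProcessor.cpp:164] edge locked : src RY-001-xxxxxxxxx, type -58, rank 0, dst RY-001-YYYYYYYYYY',
]

top = tally(sample).most_common()
# top == [("AddEdgesProcessor.cpp:164", 2), ("MetaClient.cpp:635", 1)]
```

Running `tally` over each of the storaged, metad and graphd ERROR files separately, and comparing the `time` field of the first hit per source location, would show which component started failing first. That ordering is only a hint, not proof of cause.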

Can you confirm that Meta is still running normally? From the symptoms, it looks like the MetaServer can no longer serve requests.

The process is still alive, but it just keeps printing error logs and cannot serve requests.

Please post more of the Meta logs. There is too little information to determine the cause.

The INFO log also contains a large number of these:

I0607 09:11:07.145823 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610734, lastLogTerm 1, lastLogIdSent 22610733, lastLogTermSent 1
I0607 09:11:07.768496 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610734, lastLogTerm 1, lastLogIdSent 22610733, lastLogTermSent 1
E0607 09:11:07.808775 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
I0607 09:11:08.384696 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610734, lastLogTerm 1, lastLogIdSent 22610733, lastLogTermSent 1
E0607 09:11:08.810176 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
I0607 09:11:09.007625 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610734, lastLogTerm 1, lastLogIdSent 22610733, lastLogTermSent 1
I0607 09:11:09.624696 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610734, lastLogTerm 1, lastLogIdSent 22610733, lastLogTermSent 1
E0607 09:11:09.811523 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
I0607 09:11:10.247066 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610734, lastLogTerm 1, lastLogIdSent 22610733, lastLogTermSent 1
I0607 09:11:10.871258 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610734, lastLogTerm 1, lastLogIdSent 22610733, lastLogTermSent 1
I0607 09:11:11.483938 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610734, lastLogTerm 1, lastLogIdSent 22610733, lastLogTermSent 1
E0607 09:11:11.755985 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
I0607 09:11:12.103453 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610734, lastLogTerm 1, lastLogIdSent 22610733, lastLogTermSent 1
I0607 09:11:12.730909 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610734, lastLogTerm 1, lastLogIdSent 22610733, lastLogTermSent 1
I0607 09:11:13.343681 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610734, lastLogTerm 1, lastLogIdSent 22610733, lastLogTermSent 1
I0607 09:11:13.902513 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610734, lastLogTerm 1, lastLogIdSent 22610733, lastLogTermSent 1
I0607 09:11:15.322993 12887 RaftPart.cpp:1567] [Port: 9560, Space: 0, Part: 0] Stale log! The log 22610733, term 1 i had committed yet. My committedLogId is 22610734, term is 1
I0607 09:11:16.436875 12920 EventListener.h:18] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I0607 09:11:16.682651 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610736, lastLogTerm 1, lastLogIdSent 22610735, lastLogTermSent 1
I0607 09:11:16.994772 12920 EventListener.h:30] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I0607 09:11:17.311611 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610736, lastLogTerm 1, lastLogIdSent 22610735, lastLogTermSent 1
I0607 09:11:17.925006 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610736, lastLogTerm 1, lastLogIdSent 22610735, lastLogTermSent 1
I0607 09:11:18.546994 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610736, lastLogTerm 1, lastLogIdSent 22610735, lastLogTermSent 1
I0607 09:11:19.167898 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610736, lastLogTerm 1, lastLogIdSent 22610735, lastLogTermSent 1
E0607 09:11:19.704610 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
I0607 09:11:19.784976 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610736, lastLogTerm 1, lastLogIdSent 22610735, lastLogTermSent 1
I0607 09:11:20.408069 12887 RaftPart.cpp:1623] [Port: 9560, Space: 0, Part: 0] Stale log! Local lastLogId 22610736, lastLogTerm 1, lastLogIdSent 22610735, lastLogTermSent 1

It is almost entirely this entry; there is hardly anything else.

From the WARNING log:

W0607 09:07:25.751772 13215 SessionManagerProcessor.cpp:59] Session id `1654339279724035' not found
E0607 09:07:25.821975 13215 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:07:30.607702 13215 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:07:34.271054 13215 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:07:55.825500 13215 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:08:08.598156 13215 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:08:13.311334 13215 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:08:14.077035 13215 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:08:16.306660 13215 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:08:20.394398 13215 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:08:32.438047 13215 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:08:39.127771 13215 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:08:44.691421 13215 ActiveHostsMan.cpp:256] Get last update time failed, error: E_LEADER_CHANGED
E0607 09:09:09.817876 13215 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:09:14.284373 13215 ActiveHostsMan.cpp:256] Get last update time failed, error: E_LEADER_CHANGED
E0607 09:09:22.339228 13215 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
W0607 09:09:34.449908 13215 SessionManagerProcessor.cpp:59] Session id `1653501062283264' not found
E0607 09:09:41.351663 13215 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:09:41.351755 13209 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:09:41.351857 13217 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:09:41.351909 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:09:43.077225 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:10:16.090349 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:10:17.695904 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:10:23.422399 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:10:25.284683 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:10:25.363314 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:10:33.141467 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:10:41.915434 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
W0607 09:10:46.867976 13218 SessionManagerProcessor.cpp:59] Session id `1654339279727875' not found
E0607 09:10:47.521354 13218 ActiveHostsMan.cpp:256] Get last update time failed, error: E_LEADER_CHANGED
E0607 09:10:48.558354 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:11:07.808775 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:11:08.810176 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:11:09.811523 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:11:11.755985 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:11:19.704610 13218 SessionManagerProcessor.cpp:17] User does not exist, errorCode: E_LEADER_CHANGED
E0607 09:11:20.904713 13197 JobDescription.cpp:188] Loading Job Description FailedE_LEADER_CHANGED

This looks like a Raft deadlock. Just restart the MetaServer.

Also, version 2.5 is too old and has various problems. We recommend upgrading.


Restarting meta does not seem to help either.

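Also, will nGQL change if we upgrade? The client API would have to be upgraded too, and since we have some code that calls the API to create vertices and edges, I expect that would take some work.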

Some statements differ slightly; you can check the release notes in the documentation.

Was storage restarted as well?

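All components were restarted and it still did not work. In the end the only fix was to move the data directory away, recreate the namespace, and re-import the data.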

Could running a compact job during data import cause a deadlock?
How often does compact usually need to run? What happens if compact is never run? Do queries get slower?

It runs once a day.

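Yes. Without compaction, duplicate KV entries are not merged in time, so the amount of data a query has to search never shrinks and queries get slower (the data just keeps accumulating).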

It will not cause a deadlock.
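To make the point about duplicate KV entries concrete, here is a toy model, not RocksDB's actual algorithm: each "file" is a plain dict, newer files shadow older ones, and a read has to check files newest-first until it finds the key. The names `lookup` and `compact` are hypothetical and exist only for this illustration.

```python
def lookup(files, key):
    """Search newest-first; return (value, number_of_files_checked)."""
    for checked, table in enumerate(reversed(files), start=1):
        if key in table:
            return table[key], checked
    return None, len(files)

def compact(files):
    """Merge all files into one, keeping only the newest value per key."""
    merged = {}
    for table in files:  # oldest first, so later writes overwrite earlier ones
        merged.update(table)
    return [merged]

# Oldest file first. Key "a" was written twice, so one copy is stale.
files = [{"a": 1, "b": 2}, {"a": 3}, {"c": 4}]

before = lookup(files, "b")           # (2, 3): found only in the oldest file
compacted = compact(files)            # [{"a": 3, "b": 2, "c": 4}]
after = lookup(compacted, "b")        # (2, 1): a single file to check
```

In this toy model the stale copy of `"a"` disappears and every read touches one file instead of up to three, which is the effect described above: without the merge step, both the stored data and the work per query keep growing.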

So it runs automatically once a day without any configuration?

It is enabled by default.

I assumed it was not, so I set up my own scheduled job that also runs compact. Could the two have collided? I will stop mine.
