NebulaGraph 3.4: how do I troubleshoot a graphd crash?

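NebulaGraph 3.4.0, deployed on three physical machines running CentOS.
Last night (2024-04-11) at 22:17:22 we were alerted that graphd had gone down. Checking the logs afterwards, nebula-graphd.ERROR contains some errors about killed queries, but not at that time: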

QueryInstance.cpp:151] Execution had been killed, query

nebula-storaged.ERROR does have an error whose timestamp matches:

E20240411 22:17:22.655017 28201 AddVerticesProcessor.cpp:329] Error! ret = E_LEADER_LEASE_FAILED, spaceId 7

What does this error mean, and how can I fix it?

You picked the wrong category; I've changed it for you.

Possibly some large query brought the service down. Take a look at this post: 图库报错E_LEADER_LEASE_FAILED - #16, by codelone

Checking the logs on the cluster nodes, I found these errors around the same time:

E20240411 21:55:41.876806 26647 QueryInstance.cpp:151] Execution had been killed, query:


E20240411 22:17:09.297816 26646 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.304505 26646 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.304510 26645 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.304514 26644 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.333422 26645 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.363343 26645 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.373229 26644 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.373732 26645 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.424813 26645 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.424891 26646 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.443437 26644 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.494361 26644 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.564332 26646 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.564335 26645 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.564335 26644 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.564392 26646 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:09.564426 26645 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:10.059537 26645 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:10.225809 26646 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 43
E20240411 22:17:10.377080 26644 QueryInstance.cpp:151] Storage Error: Not the leader of 43. Please retry later.,

E20240411 22:17:22.652688 26644 StorageAccessExecutor.h:40] InsertVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 12
E20240411 22:17:22.652741 26644 StorageAccessExecutor.h:144] Storage Error: part: 12, error: E_LEADER_LEASE_FAILED(-3531).
E20240411 22:17:22.652769 26645 QueryInstance.cpp:151] Storage Error: part: 12, error: E_LEADER_LEASE_FAILED(-3531)., query:
E20240411 22:17:22.653806 26647 StorageAccessExecutor.h:40] InsertVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 12
E20240411 22:17:22.653827 26647 StorageAccessExecutor.h:144] Storage Error: part: 12, error: E_LEADER_LEASE_FAILED(-3531).
E20240411 22:17:22.653846 26647 QueryInstance.cpp:151] Storage Error: part: 12, error: E_LEADER_LEASE_FAILED(-3531)., query: 
E20240411 22:17:22.654491 26644 StorageAccessExecutor.h:40] InsertVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 12
E20240411 22:17:22.654511 26644 StorageAccessExecutor.h:144] Storage Error: part: 12, error: E_LEADER_LEASE_FAILED(-3531).
E20240411 22:17:22.654531 26646 QueryInstance.cpp:151] Storage Error: part: 12, error: E_LEADER_LEASE_FAILED(-3531).
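Lining up timestamps across nebula-graphd.ERROR and nebula-storaged.ERROR by eye gets tedious with this many repeated lines. The following is a minimal sketch, not an official NebulaGraph tool, that counts errors per second and per error code. It assumes the glog-style line layout seen in the excerpts above; the function name `summarize` and the `OTHER` bucket are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout, taken from the log excerpts in this thread:
#   E20240411 22:17:22.655017 28201 AddVerticesProcessor.cpp:329] message
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2})\.\d{6}\s+"
    r"(?P<tid>\d+) (?P<src>[^\s:]+):(?P<line>\d+)\] (?P<msg>.*)$"
)
# Error codes such as E_LEADER_CHANGED or E_LEADER_LEASE_FAILED.
ERROR_CODE = re.compile(r"\bE_[A-Z_]+\b")


def summarize(lines):
    """Count log lines per (date, second, error code).

    Lines that do not match the assumed layout are skipped; matching lines
    without an E_* code are counted under the hypothetical bucket "OTHER".
    """
    counts = Counter()
    for raw in lines:
        m = GLOG_LINE.match(raw.strip())
        if m is None:
            continue
        code = ERROR_CODE.search(m["msg"])
        key = (m["date"], m["time"], code.group(0) if code else "OTHER")
        counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python summarize_logs.py nebula-graphd.ERROR nebula-storaged.ERROR
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as fh:
            for (date, time, code), n in sorted(summarize(fh).items()):
                print(f"{path}\t{date} {time}\t{code}\t{n}")
```

Running it over both files makes it easier to see which error code appears first in the seconds before graphd went down.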

Most likely your statement is what brought it down. Could you post the query, along with the approximate data volume and your machine specs?

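Judging from the logs, the statement is indeed what brought it down; the statement printed in the log is very long. A single complete statement UNION ALLs 32 MATCH clauses, about 300 lines in total. I can't share the query text.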

E20240418 07:10:14.983500 19784 StorageAccessExecutor.h:40] Traverse failed, error E_LEADER_CHANGED, part 57
E20240418 07:10:14.984452 19782 QueryInstance.cpp:151] Storage Error: Not the leader of 18. Please retry later., query: match xxx.........

Machine specs: 4 CPU cores, 32 GB RAM; three physical machines, installed from the tar package.
Data volume:
| "Space" | "vertices" | 327189820 |
| "Space" | "edges" | 534089521 |

From that description alone it doesn't sound good. Can the statement be optimized?

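OK, splitting it up will probably help. The result of each query is independent; I UNIONed them together only to reduce the number of requests to Nebula.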

First check whether the error still occurs after splitting it up, then gradually work out how many queries are reasonable to UNION together.
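One way to experiment with the UNION size is to keep the 32 MATCH statements in a list and join them in configurable batches. The sketch below is only an illustration under assumptions: `batch_union_all` and `run_batches` are hypothetical helpers, and `execute` stands for whatever function your client uses to send one statement to graphd. It is not a NebulaGraph API.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def batch_union_all(match_statements: Sequence[str], batch_size: int) -> List[str]:
    """Join MATCH statements into UNION ALL queries of at most batch_size each.

    Every statement must already return the same column names,
    as they did in the original single UNION ALL query.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Drop empty entries and trailing semicolons so the pieces join cleanly.
    stmts = [s.strip().rstrip(";") for s in match_statements if s.strip()]
    return [
        " UNION ALL ".join(stmts[i:i + batch_size])
        for i in range(0, len(stmts), batch_size)
    ]


def run_batches(
    execute: Callable[[str], T],
    match_statements: Sequence[str],
    batch_size: int,
) -> List[T]:
    """Send each batch through the caller-supplied execute() and collect results."""
    return [execute(query) for query in batch_union_all(match_statements, batch_size)]


if __name__ == "__main__":
    # Hypothetical statements; the real ones were not shared in this thread.
    statements = [f"MATCH (v:tag_{i}) RETURN id(v) AS id" for i in range(32)]
    for size in (1, 4, 8):
        print(size, "per batch ->", len(batch_union_all(statements, size)), "requests")
```

Starting with `batch_size=1` confirms whether the individual statements run cleanly; raising it step by step while watching the storaged logs for E_LEADER_CHANGED or E_LEADER_LEASE_FAILED shows where the cluster starts to struggle.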