Nebula query fails with Storage Error: part: xxx, error: E_RPC_FAILURE(-3).

tony.chang.new · 2022-10-19 02:22

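nebula version: 3.1.0
Deployment: distributed
Installation method: k8s
Production environment: N
Hardware information
- Disk (SSD recommended): SSD
- CPU and memory information
Problem description: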

After the cluster has been running for a few minutes, queries executed in the nebula-console terminal fail with the following error:
[ERROR (-1005)]: Storage Error: part: xxx, error: E_RPC_FAILURE(-3).

At that point, the graphd log contains the following errors:

E20221019 01:47:55.835167 59 StorageClientBase-inl.h:206] Request to "storaged1":9779 failed: AsyncSocketException: recv() failed (peer=10.247.185.66:9779, local=172.17.1.88:34026), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
E20221019 01:47:55.843434 25 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: AsyncSocketException: recv() failed (peer=10.247.185.66:9779, local=172.17.1.88:34026), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
E20221019 01:47:55.843541 33 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_RPC_FAILURE, part 9
E20221019 01:47:55.843565 33 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_RPC_FAILURE, part 6
E20221019 01:47:55.843573 33 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_RPC_FAILURE, part 3
E20221019 01:47:55.843583 33 StorageAccessExecutor.h:136] Storage Error: part: 9, error: E_RPC_FAILURE(-3).
E20221019 01:47:55.843672 23 QueryInstance.cpp:137] Storage Error: part: 9, error: E_RPC_FAILURE(-3).
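
To see which storaged peer and which parts are affected, log lines like the ones above can be tallied with a short script. This is only a minimal sketch, not a Nebula tool: the function name summarize_rpc_failures and both regular expressions are assumptions based solely on the line formats quoted in this post.

```python
import re
from collections import Counter

# Both patterns are assumptions derived from the graphd log lines quoted above.
PEER_RE = re.compile(
    r"recv\(\) failed \(peer=([\d.]+:\d+), local=[\d.]+:\d+\).*?errno = (\d+)"
)
PART_RE = re.compile(r"(\w+) failed, error (E_\w+), part (\d+)")


def summarize_rpc_failures(lines):
    """Count socket errors per (peer, errno) and executor failures
    per (executor, error, part)."""
    peers = Counter()
    parts = Counter()
    for line in lines:
        m = PEER_RE.search(line)
        if m:
            peers[(m.group(1), int(m.group(2)))] += 1
        m = PART_RE.search(line)
        if m:
            parts[(m.group(1), m.group(2), int(m.group(3)))] += 1
    return peers, parts
```

It can be fed any iterable of lines, for example an open graphd error log file (the path depends on the deployment). If the same peer address dominates the first counter, that storaged instance or the network path to it is the place to look.
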
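Calls from our Java application also fail intermittently; refreshing the page a few times makes it work again, so it feels quite unstable.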

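After the error appears, restarting the graphd service makes queries from both the application and the terminal work again. Could you take a look at what the problem is and how to fix it?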

Nebula architecture in k8s: 3 graphd, 3 metad, 3 storaged

show hosts graph;

+-----------+------+----------+---------+--------------+---------+
| Host      | Port | Status   | Role    | Git Info Sha | Version |
+-----------+------+----------+---------+--------------+---------+
| "graphd"  | 9669 | "ONLINE" | "GRAPH" | "33fd35e"    | "3.1.0" |
| "graphd1" | 9669 | "ONLINE" | "GRAPH" | "33fd35e"    | "3.1.0" |
| "graphd2" | 9669 | "ONLINE" | "GRAPH" | "33fd35e"    | "3.1.0" |
+-----------+------+----------+---------+--------------+---------+
Got 3 rows (time spent 897/1686 us)

show hosts meta;

+----------+------+----------+--------+--------------+---------+
| Host     | Port | Status   | Role   | Git Info Sha | Version |
+----------+------+----------+--------+--------------+---------+
| "metad0" | 9559 | "ONLINE" | "META" | "33fd35e"    | "3.1.0" |
| "metad1" | 9659 | "ONLINE" | "META" | "33fd35e"    | "3.1.0" |
| "metad2" | 9759 | "ONLINE" | "META" | "33fd35e"    | "3.1.0" |
+----------+------+----------+--------+--------------+---------+
Got 3 rows (time spent 710/1623 us)

show hosts storage;

+-------------+------+----------+-----------+--------------+---------+
| Host        | Port | Status   | Role      | Git Info Sha | Version |
+-------------+------+----------+-----------+--------------+---------+
| "storaged0" | 9779 | "ONLINE" | "STORAGE" | "33fd35e"    | "3.1.0" |
| "storaged1" | 9779 | "ONLINE" | "STORAGE" | "33fd35e"    | "3.1.0" |
| "storaged2" | 9779 | "ONLINE" | "STORAGE" | "33fd35e"    | "3.1.0" |
+-------------+------+----------+-----------+--------------+---------+
Got 3 rows (time spent 844/1660 us)

wenhaocs · 2022-10-19 17:13

This is a common error. The possible causes are listed here: FAQ - Nebula Graph Database Manual

Could you confirm which query you executed, how long it takes to return a result, and whether storaged hit an OOM?

tony.chang.new · 2022-10-20 01:35

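I had already followed the manual: FAQ - Nebula Graph Database Manual

I adjusted the parameter from its default of 60000 ms to 180000. Checking the configuration with curl http://graphd:19669/flags confirms that the change took effect.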

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3435  100  3435    0     0  1691k      0 --:--:-- --:--:-- --:--:-- 3354k
storage_client_timeout_ms=180000
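
The same check can be scripted so that it is easy to repeat against each graphd instance. This is a minimal sketch, assuming the /flags endpoint returns plain name=value lines as in the output above; parse_flags, fetch_flags and flag_matches are hypothetical helper names, not part of any Nebula client.

```python
from urllib.request import urlopen


def parse_flags(text):
    """Parse 'name=value' lines (as printed by the /flags endpoint) into a dict."""
    flags = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and name:  # skip lines that are not name=value pairs
            flags[name] = value
    return flags


def fetch_flags(url, timeout=5.0):
    """Download the flags page from a graphd HTTP port and parse it."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_flags(resp.read().decode("utf-8", errors="replace"))


def flag_matches(flags, name, expected):
    """True when the flag is present and its value equals str(expected)."""
    return flags.get(name) == str(expected)
```

For example, flag_matches(fetch_flags("http://graphd:19669/flags"), "storage_client_timeout_ms", 180000) reports whether that instance picked up the new value. Values are compared as strings because the endpoint output is plain text.
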

However, changing this parameter had no effect; the problem is the same.

The query I executed is: use 487335144958763008; GO 0 TO 5 STEPS FROM '487335617019289600' OVER * BIDIRECT YIELD DISTINCT src(edge) as srcId, dst(edge) as dstId | LIMIT 0, 1000;

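When I run the query in the nebula-console terminal, the error comes back immediately after I press Enter, and retrying several times gives the same result.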

I checked the storaged nodes and there was no OOM.

graphd starts failing after it has been running for a while; once that happens, restarting graphd makes queries work again.