Nebula query fails with Storage Error: part: xxx, error: E_RPC_FAILURE(-3).

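  • nebula version: 3.1.0
  • Deployment: distributed
  • Installation method: k8s
  • Production environment: N
  • Hardware information
    • Disk (SSD recommended): SSD
    • CPU and memory information
  • Detailed description of the problem: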

After the cluster has been running for a few minutes, queries executed in the nebula-console terminal fail with the following error:
[ERROR (-1005)]: Storage Error: part: xxx, error: E_RPC_FAILURE(-3).

At that point, the graphd log shows the following errors:

E20221019 01:47:55.835167 59 StorageClientBase-inl.h:206] Request to "storaged1":9779 failed: AsyncSocketException: recv() failed (peer=10.247.185.66:9779, local=172.17.1.88:34026), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
E20221019 01:47:55.843434 25 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: AsyncSocketException: recv() failed (peer=10.247.185.66:9779, local=172.17.1.88:34026), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
E20221019 01:47:55.843541 33 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_RPC_FAILURE, part 9
E20221019 01:47:55.843565 33 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_RPC_FAILURE, part 6
E20221019 01:47:55.843573 33 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_RPC_FAILURE, part 3
E20221019 01:47:55.843583 33 StorageAccessExecutor.h:136] Storage Error: part: 9, error: E_RPC_FAILURE(-3).
E20221019 01:47:55.843672 23 QueryInstance.cpp:137] Storage Error: part: 9, error: E_RPC_FAILURE(-3).
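
To see whether the failures cluster on one storaged peer or on particular parts, log lines like the ones above can be tallied with a small script. This is only a sketch: the regular expressions are assumptions derived from the sample lines in this post, not an official log format, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns inferred from the sample graphd log lines in this post (assumption).
PART_RE = re.compile(r"(\w+Executor) failed, error (\w+), part (\d+)")
PEER_RE = re.compile(r"peer=([\d.]+:\d+).*?errno = (\d+)")

def summarize(lines):
    """Count executor failures per part and socket errors per (peer, errno)."""
    parts, peers = Counter(), Counter()
    for line in lines:
        m = PART_RE.search(line)
        if m:
            parts[int(m.group(3))] += 1
        m = PEER_RE.search(line)
        if m:
            peers[(m.group(1), int(m.group(2)))] += 1
    return parts, peers

# Three lines copied from the log excerpt above.
sample = [
    'E20221019 01:47:55.835167 59 StorageClientBase-inl.h:206] Request to "storaged1":9779 failed: '
    "AsyncSocketException: recv() failed (peer=10.247.185.66:9779, local=172.17.1.88:34026), "
    "type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer",
    "E20221019 01:47:55.843541 33 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_RPC_FAILURE, part 9",
    "E20221019 01:47:55.843565 33 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_RPC_FAILURE, part 6",
]
parts, peers = summarize(sample)
print(dict(parts), dict(peers))
```

If every socket error points at the same peer address, that storaged instance (or the network path to it) is the first place to look.
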
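Calls from our Java application also fail intermittently; refreshing the page a few times makes it work again, so it feels very unstable.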

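After the error appears, restarting the graphd service makes queries from both the application and the terminal work again. Could you take a look at what the problem is and how to fix it?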
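
Since refreshing the page a few times gets past the error, an application could retry the query automatically while the root cause is being investigated. The sketch below is a workaround, not a fix, and it is written in Python rather than Java purely for illustration. `run_with_retry` and `flaky_query` are hypothetical names, and it assumes the client code raises `RuntimeError` when the result carries E_RPC_FAILURE.

```python
import time

def run_with_retry(execute, attempts=3, base_delay=0.2, sleep=time.sleep):
    """Call execute() until it succeeds or the attempts are used up.

    execute is a hypothetical zero-argument callable that raises
    RuntimeError when the query fails with E_RPC_FAILURE (assumption).
    """
    for attempt in range(1, attempts + 1):
        try:
            return execute()
        except RuntimeError:
            if attempt == attempts:
                raise  # give up and surface the last error
            # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for a real query call: fails twice, then succeeds.
_state = {"calls": 0}

def flaky_query():
    _state["calls"] += 1
    if _state["calls"] < 3:
        raise RuntimeError("Storage Error: E_RPC_FAILURE(-3)")
    return "rows"

print(run_with_retry(flaky_query, base_delay=0.01))
```
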

Nebula architecture in k8s: 3 graphd, 3 metad, 3 storaged

show hosts graph;

+-----------+------+----------+---------+--------------+---------+
| Host      | Port | Status   | Role    | Git Info Sha | Version |
+-----------+------+----------+---------+--------------+---------+
| "graphd"  | 9669 | "ONLINE" | "GRAPH" | "33fd35e"    | "3.1.0" |
| "graphd1" | 9669 | "ONLINE" | "GRAPH" | "33fd35e"    | "3.1.0" |
| "graphd2" | 9669 | "ONLINE" | "GRAPH" | "33fd35e"    | "3.1.0" |
+-----------+------+----------+---------+--------------+---------+
Got 3 rows (time spent 897/1686 us)

show hosts meta;

+----------+------+----------+--------+--------------+---------+
| Host     | Port | Status   | Role   | Git Info Sha | Version |
+----------+------+----------+--------+--------------+---------+
| "metad0" | 9559 | "ONLINE" | "META" | "33fd35e"    | "3.1.0" |
| "metad1" | 9659 | "ONLINE" | "META" | "33fd35e"    | "3.1.0" |
| "metad2" | 9759 | "ONLINE" | "META" | "33fd35e"    | "3.1.0" |
+----------+------+----------+--------+--------------+---------+
Got 3 rows (time spent 710/1623 us)

show hosts storage;

+-------------+------+----------+-----------+--------------+---------+
| Host        | Port | Status   | Role      | Git Info Sha | Version |
+-------------+------+----------+-----------+--------------+---------+
| "storaged0" | 9779 | "ONLINE" | "STORAGE" | "33fd35e"    | "3.1.0" |
| "storaged1" | 9779 | "ONLINE" | "STORAGE" | "33fd35e"    | "3.1.0" |
| "storaged2" | 9779 | "ONLINE" | "STORAGE" | "33fd35e"    | "3.1.0" |
+-------------+------+----------+-----------+--------------+---------+
Got 3 rows (time spent 844/1660 us)

This is a common error; the possible causes are listed here: FAQ - Nebula Graph Database Manual

Could you confirm which query was executed, how long it took to return, and whether storaged hit an OOM?

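I had already adjusted the parameter as described in the manual (FAQ - Nebula Graph Database Manual): the default is 60000 ms and I raised it to 180000. Checking with curl http://graphd:19669/flags confirms that the new value is in effect.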

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  3435  100  3435    0     0  1691k      0 --:--:-- --:--:-- --:--:-- 3354k
storage_client_timeout_ms=180000

But changing this parameter had no effect; the problem is the same.
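
The same check against the /flags endpoint can be scripted so it is easy to repeat on all three graphd instances. This is a minimal sketch that assumes the endpoint returns one `name=value` pair per line, as the curl output above suggests; `parse_flags` and `fetch_flags` are hypothetical helper names.

```python
from urllib.request import urlopen

def parse_flags(text):
    """Turn 'name=value' lines into a dict; lines without '=' are skipped."""
    flags = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and name:
            flags[name] = value
    return flags

def fetch_flags(url="http://graphd:19669/flags", timeout=5):
    """Fetch and parse the flags page (URL taken from the post above)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_flags(resp.read().decode("utf-8", errors="replace"))

# Offline example using the line shown in the curl output.
sample = "storage_client_timeout_ms=180000\n"
print(parse_flags(sample)["storage_client_timeout_ms"])
```

Calling `fetch_flags` once per graphd address would show whether all instances picked up the new timeout, not only the one behind the `graphd` name.
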

The query executed is: use 487335144958763008; GO 0 TO 5 STEPS FROM '487335617019289600' OVER * BIDIRECT YIELD DISTINCT src(edge) as srcId, dst(edge) as dstId | LIMIT 0, 1000;

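When the query is run in the nebula-console terminal, the error is returned immediately after pressing Enter, and retrying several times gives the same result.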

I checked on the storaged nodes: there was no OOM.

graphd starts failing after running for a while; once it fails, restarting graphd makes queries work again.
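
One way to narrow this down is to test, from inside a graphd pod while the error is occurring, whether a fresh TCP connection to the storaged port can still be opened. If it can, while graphd's existing connections keep getting reset, that would point towards stale long-lived connections rather than storaged being down. This is only a hypothesis, not a confirmed diagnosis. A minimal sketch, where `can_connect` is a hypothetical helper:

```python
import socket

def can_connect(host, port, timeout=3.0):
    """Return True if a new TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False

# Example call, using the host name and port from the log above:
#   can_connect("storaged1", 9779)
```
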