Error: Storage Error: RPC failure, probably timeout; only a restart fixes it

  • NebulaGraph version: 3.8.0
  • Deployment: distributed (cluster)
  • Installation method: RPM
  • In production: N
  • Hardware info
    • Disk: mechanical hard drives (HDD)
    • Three servers, each with 48 cores and 256 GB of RAM
  • Problem description
  • Relevant meta / storage / graph INFO log entries
  • After a heavy query (for example FIND PATH with 2000 source vertices and 2000 destination vertices), the result comes back empty and the error Storage Error: RPC failure, probably timeout is raised. All service processes are still alive, and the only way to recover is to restart graphd or storaged.
  • Timeout setting: --storage_client_timeout_ms=1200000, yet the RPC failure seems to appear well before this timeout is actually reached.
  • Two questions:
  • 1. I understand this query is computationally heavy, but is there any way to apply circuit breaking so that the query simply fails instead of hanging in an RPC-timeout, half-dead state? (See the sketch after this list.)
  • 2. When this kind of RPC timeout happens, is there any way to recover automatically?
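
On question 1: as far as I know there is no built-in per-query circuit breaker in 3.x, but a runaway statement can at least be cut off manually from another session using the slow-query statements that do exist in 3.x. A minimal sketch; the session and plan IDs are placeholders you would copy from the SHOW QUERIES output:

```ngql
-- List queries currently running on the graph service, including how long
-- each one has been executing.
SHOW QUERIES;

-- Terminate one long-running query. The session and plan IDs below are
-- placeholders; use the values reported by SHOW QUERIES.
KILL QUERY (session=1625553545984255, plan=163);
```

This is manual intervention rather than true circuit breaking, so it only helps while graphd is still responsive enough to accept new sessions.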

Additional logs:
Errors in the graphd log:
E20241118 20:09:02.444499 791807 StorageClientBase-inl.h:227] Request to "10.45.151.221":9779 failed: AsyncSocketException: recv() failed (peer=10.45.151.221:9779, local=10.45.151.222:55062), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
E20241118 20:09:02.454157 791402 StorageClientBase-inl.h:143] There some RPC errors: RPC failure in StorageClient: AsyncSocketException: recv() failed (peer=10.45.151.221:9779, local=10.45.151.222:55062), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
E20241118 20:09:02.454239 791402 StorageAccessExecutor.h:47] Traverse failed, error E_RPC_FAILURE, part 20

Did you configure the memory tracker?

Both graphd and storaged use the configuration below:

# trackable memory ratio (trackable_memory / (total_memory - untracked_reserved_memory) )
--memory_tracker_limit_ratio=2

# untracked reserved memory in Mib
--memory_tracker_untracked_reserved_memory_mb=14000
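
To double-check that these values (and storage_client_timeout_ms) are actually in effect at runtime rather than only present in the conf files, the running services can be queried; a minimal sketch, assuming a console session against this cluster. Note that SHOW CONFIGS only lists parameters registered as configurable, so some of these flags may or may not appear depending on the build; if they are missing, inspecting the services' HTTP /flags endpoint is the usual fallback.

```ngql
-- Show the runtime configuration parameters reported for graphd and storaged;
-- look for memory_tracker_limit_ratio and storage_client_timeout_ms in the output.
SHOW CONFIGS GRAPH;
SHOW CONFIGS STORAGE;
```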

A follow-up note: after one of the storaged instances hits RPC_FAIL, SHOW HOSTS still shows it as the leader of some partitions.

Hmm, my guess is that the RPC communication overhead is too high, which causes the timeout, but that does not affect the partition: it stays the leader.
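
Related to the leader observation above: the leader distribution can be inspected and a rebalance triggered manually once the storaged instances are reachable again; a minimal sketch, assuming a 3.x console session (SUBMIT JOB BALANCE LEADER is the 3.x form of the old BALANCE LEADER statement; please confirm against the 3.8.0 docs before running it):

```ngql
-- Check which storaged instance currently holds the leader for which partitions.
SHOW HOSTS;

-- Start a job that redistributes partition leadership evenly across
-- the storaged instances.
SUBMIT JOB BALANCE LEADER;
```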