- NebulaGraph version: 3.8.0
- Deployment: distributed cluster
- Installation: RPM
- In production: N
- Hardware
  - Disk: mechanical hard drives (HDD)
  - Three servers, each with 48 cores and 256 GB of RAM
- Problem description
- Relevant meta / storage / graph log messages
- After a heavy query, e.g. FIND PATH with 2,000 source vertices and 2,000 destination vertices, the result is empty and the query fails with `Storage Error: RPC failure, probably timeout`. All service processes stay alive, but the only way to recover is to restart graphd or storaged.
- The timeout is configured as --storage_client_timeout_ms=1200000, yet the RPC failure seems to be reported well before that timeout is reached.
- Two questions:
- 1. I know this kind of query is computationally heavy, but is there a way to apply circuit breaking so that the query simply fails, instead of the RPC timing out and the service appearing hung? (See the sketch after this list.)
- 2. After such an RPC timeout occurs, is there any way to recover automatically?
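For question 1: I'm not aware of a built-in per-query circuit breaker, but as a manual stop-gap you can usually locate and kill a runaway query from another session before it drags the storaged RPCs into timeout. A minimal sketch, assuming the query is still registered on the graphd that accepted it; the session/plan IDs are placeholders you would take from the SHOW QUERIES output:

```
-- list currently running queries across graphd instances
SHOW QUERIES;
-- kill the long-running one using the session/plan ids from the output
-- (101 and 202 are placeholder values)
KILL QUERY (session=101, plan=202);
```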
Supplementary logs:
Errors in the graphd log:
E20241118 20:09:02.444499 791807 StorageClientBase-inl.h:227] Request to "10.45.151.221":9779 failed: AsyncSocketException: recv() failed (peer=10.45.151.221:9779, local=10.45.151.222:55062), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
E20241118 20:09:02.454157 791402 StorageClientBase-inl.h:143] There some RPC errors: RPC failure in StorageClient: AsyncSocketException: recv() failed (peer=10.45.151.221:9779, local=10.45.151.222:55062), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
E20241118 20:09:02.454239 791402 StorageAccessExecutor.h:47] Traverse failed, error E_RPC_FAILURE, part 20
Have you configured the memory tracker?
Both graphd and storaged use the following configuration:
# trackable memory ratio (trackable_memory / (total_memory - untracked_reserved_memory))
--memory_tracker_limit_ratio=2
# untracked reserved memory in MiB
--memory_tracker_untracked_reserved_memory_mb=14000
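For reference, a sketch of how the memory tracker block could look in nebula-graphd.conf / nebula-storaged.conf. If I read the docs correctly, `--memory_tracker_limit_ratio=2` does not mean 200% but selects the dynamic self-adaptive mode, while a value in (0, 1] is used as a fixed ratio; the detail-log flags below are only my assumption about how to verify the tracker is actually kicking in, and the values are illustrative:

```
# trackable memory ratio: (0, 1] = fixed ratio, 2 = dynamic self-adaptive, 3 = disabled
--memory_tracker_limit_ratio=0.8
# untracked reserved memory in MiB
--memory_tracker_untracked_reserved_memory_mb=14000
# periodically log memory tracker statistics
--memory_tracker_detail_log=true
--memory_tracker_detail_log_interval_ms=60000
```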
One more note: after one of the storaged instances hit RPC_FAIL, SHOW HOSTS still showed it as the leader of some partitions.
Hmm, my guess is that the RPC communication overhead is too large and causes the timeout, but that does not prevent those parts from keeping this host as their leader.
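If you want the recovered cluster to redistribute leadership, a rough way to check and rebalance, assuming the community 3.x job syntax also applies to 3.8.0:

```
-- check leader distribution per storaged host
SHOW HOSTS;
-- submit a job that rebalances Raft leaders across storaged instances
SUBMIT JOB BALANCE LEADER;
-- follow the job status
SHOW JOBS;
```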