Error: "Storage Error: RPC failure, probably timeout"; only a restart fixes it

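  • nebula version: 3.8.0
  • Deployment: distributed
  • Installation method: RPM
  • In production: N
  • Hardware
    • Disk: mechanical hard drives (HDD)
    • Three servers, each with 48 cores and 256 GB of memory
  • Detailed description of the problem
  • Relevant meta / storage / graph info logs
  • After a complex query, for example find path with 2000 source vertices and 2000 destination vertices, the query returns 0 results and reports "Storage Error: RPC failure, probably timeout". All service processes are still alive, and the only way to recover is to restart graphd or storaged.
  • Timeout setting: --storage_client_timeout_ms=1200000. The RPC failure seems to be reported well before this timeout has elapsed.
  • I have two questions:
  • 1. I understand that this is a heavy computation, but is there a way to apply circuit breaking, so that the query fails without the RPC timeout leaving the service hung?
  • 2. After an RPC timeout like this, is there any way to recover automatically?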
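For the circuit-breaking idea in question 1, here is a minimal client-side sketch. It is not a NebulaGraph feature and does not fix anything on the server; it only shows how an application could stop sending heavy queries after repeated failures and try again after a cool-down. The class name, the thresholds, and the wrapped query function are all hypothetical.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without running it."""


class CircuitBreaker:
    """Hypothetical client-side breaker; thresholds are illustrative only."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Still cooling down: fail fast instead of hitting the cluster.
                raise CircuitOpenError("circuit open, query rejected")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

In use, the application would wrap whatever function submits the find path statement, for example `breaker.call(run_find_path, src_ids, dst_ids)`, where `run_find_path` is a placeholder for your own client code.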

Additional logs:
Errors in the graph log:
E20241118 20:09:02.444499 791807 StorageClientBase-inl.h:227] Request to "10.45.151.221":9779 failed: AsyncSocketException: recv() failed (peer=10.45.151.221:9779, local=10.45.151.222:55062), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
E20241118 20:09:02.454157 791402 StorageClientBase-inl.h:143] There some RPC errors: RPC failure in StorageClient: AsyncSocketException: recv() failed (peer=10.45.151.221:9779, local=10.45.151.222:55062), type = Internal error, errno = 104 (Connection reset by peer): Connection reset by peer
E20241118 20:09:02.454239 791402 StorageAccessExecutor.h:47] Traverse failed, error E_RPC_FAILURE, part 20
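
When there are many such lines, it can help to pull out the peer address and errno to see which storaged is dropping connections. Below is a small sketch that assumes the line format shown above; the function name and the regular expression are my own, not part of any NebulaGraph tool. On Linux, errno 104 is ECONNRESET, meaning the peer reset the connection, which is a different condition from the client-side timeout expiring.

```python
import re

# Matches the 'Request to "<host>":<port> failed: ... errno = N (reason)' shape
# seen in the graphd log above. Curly quotes are tolerated because forum
# software sometimes rewrites straight quotes.
RPC_FAILURE = re.compile(
    r'Request to ["\u201c\u201d]?(?P<host>[\d.]+)["\u201c\u201d]?:(?P<port>\d+) failed: .*?'
    r'errno = (?P<errno>\d+) \((?P<reason>[^)]*)\)'
)


def parse_rpc_failure(line):
    """Return host, port, errno and reason from one log line, or None."""
    m = RPC_FAILURE.search(line)
    if m is None:
        return None
    return {
        "host": m["host"],
        "port": int(m["port"]),
        "errno": int(m["errno"]),
        "reason": m["reason"],
    }
```

Counting the results per host (for instance with `collections.Counter`) would show whether the resets come from one storaged or from all of them.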

Have you configured the memory tracker?

Both graph and storage use the following configuration:

# trackable memory ratio (trackable_memory / (total_memory - untracked_reserved_memory) )
--memory_tracker_limit_ratio=2
# untracked reserved memory in Mib
--memory_tracker_untracked_reserved_memory_mb=14000
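
To make the ratio in that config comment concrete, here is a rough arithmetic sketch. It only rearranges the formula from the comment (trackable_memory = ratio × (total_memory − untracked_reserved_memory)) and only handles ratios in (0, 1]. The ratio 0.8 is a made-up illustrative value, and I am assuming "256G" means 256 GiB. The posted value of 2 lies outside that range and is not modelled here.

```python
def trackable_limit_mb(total_mb, untracked_reserved_mb, limit_ratio):
    """Trackable memory in MiB, per the formula in the config comment."""
    if not 0 < limit_ratio <= 1:
        # Values outside (0, 1] are not a plain ratio; not modelled here.
        raise ValueError("this sketch only models ratios in (0, 1]")
    return limit_ratio * (total_mb - untracked_reserved_mb)


total_mb = 256 * 1024   # assumption: 256 GiB per server
reserved_mb = 14000     # value from the posted config
print(trackable_limit_mb(total_mb, reserved_mb, 0.8))  # 0.8 is illustrative
```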

One more note: after a storaged hits the RPC failure, show hosts still lists it as the leader of some partitions.

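Right. My guess is that the RPC communication overhead is too high, which causes the timeout, but that does not stop this part from remaining the leader.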

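The docs say this parameter only needs to be added to the graph configuration, but when I check the storaged configuration with curl ip:9779/flags, the output also contains this parameter, with a value of 60000. Do I need to add it to the storaged configuration as well?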

No need. This is the storage client's timeout.

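I used three completely idle servers, each with 256 GB of memory and 48 CPU cores, and only a few records in the database. Even so, graphd and storaged produced many core files. Below is the gdb analysis of the core files:
graphd:
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/nebula/nebula-graph-3.8.0.el7/bin/nebula-graphd --flagfile /home/nebula/n'.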
Program terminated with signal 11, Segmentation fault.
#0 0x0000000002282b5d in folly::EventBase::runImmediatelyOrRunInEventBaseThreadAndWait(folly::Function<void ()>) ()
Missing separate debuginfos, use: debuginfo-install glibc-2.17-317.el7.x86_64
(gdb) bt
#0 0x0000000002282b5d in folly::EventBase::runImmediatelyOrRunInEventBaseThreadAndWait(folly::Function<void ()>) ()
#1 0x000000000178a792 in ?? ()
#2 0x00000000018f4b4c in std::_Hashtable<std::pair<nebula::HostAddr, folly::EventBase*>, std::pair<std::pair<nebula::HostAddr, folly::EventBase*> const, std::shared_ptr<nebula::meta::cpp2::MetaServiceAsyncClient> >, std::allocator<std::pair<std::pair<nebula::HostAddr, folly::EventBase*> const, std::shared_ptr<nebula::meta::cpp2::MetaServiceAsyncClient> > >, std::__detail::_Select1st, std::equal_to<std::pair<nebula::HostAddr, folly::EventBase*> >, std::hash<std::pair<nebula::HostAddr, folly::EventBase*> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() ()
#3 0x00000000018f4bde in void folly::threadlocal_detail::ElementWrapper::set<std::unordered_map<std::pair<nebula::HostAddr, folly::EventBase*>, std::shared_ptr<nebula::meta::cpp2::MetaServiceAsyncClient>, std::hash<std::pair<nebula::HostAddr, folly::EventBase*> >, std::equal_to<std::pair<nebula::HostAddr, folly::EventBase*> >, std::allocator<std::pair<std::pair<nebula::HostAddr, folly::EventBase*> const, std::shared_ptr<nebula::meta::cpp2::MetaServiceAsyncClient> > > >*>(std::unordered_map<std::pair<nebula::HostAddr, folly::EventBase*>, std::shared_ptr<nebula::meta::cpp2::MetaServiceAsyncClient>, std::hash<std::pair<nebula::HostAddr, folly::EventBase*> >, std::equal_to<std::pair<nebula::HostAddr, folly::EventBase*> >, std::allocator<std::pair<std::pair<nebula::HostAddr, folly::EventBase*> const, std::shared_ptr<nebula::meta::cpp2::MetaServiceAsyncClient> > > >*)::{lambda(void*, folly::TLPDestructionMode)#2}::_FUN(void*, folly::TLPDestructionMode) ()
#4 0x00000000021fa082 in folly::threadlocal_detail::StaticMetaBase::onThreadExit(void*) ()
#5 0x00007f7482bf1ca2 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0
#6 0x00007f7482bf1eb3 in start_thread () from /lib64/libpthread.so.0
#7 0x00007f748291a96d in clone () from /lib64/libc.so.6

storaged:
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/home/nebula/nebula-graph-3.8.0.el7/bin/nebula-storaged --flagfile /home/nebula'.
Program terminated with signal 11, Segmentation fault.
#0 0x00000000026a6afd in folly::EventBase::runImmediatelyOrRunInEventBaseThreadAndWait(folly::Function<void ()>) ()
Missing separate debuginfos, use: debuginfo-install glibc-2.17-317.el7.x86_64
#0 0x00000000026a6afd in folly::EventBase::runImmediatelyOrRunInEventBaseThreadAndWait(folly::Function<void ()>) ()
#1 0x00000000013fcef2 in ?? ()
#2 0x000000000156b2ec in std::_Hashtable<std::pair<nebula::HostAddr, folly::EventBase*>, std::pair<std::pair<nebula::HostAddr, folly::EventBase*> const, std::shared_ptr<nebula::meta::cpp2::MetaServiceAsyncClient> >, std::allocator<std::pair<std::pair<nebula::HostAddr, folly::EventBase*> const, std::shared_ptr<nebula::meta::cpp2::MetaServiceAsyncClient> > >, std::__detail::_Select1st, std::equal_to<std::pair<nebula::HostAddr, folly::EventBase*> >, std::hash<std::pair<nebula::HostAddr, folly::EventBase*> >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::~_Hashtable() ()
#3 0x000000000156b37e in void folly::threadlocal_detail::ElementWrapper::set<std::unordered_map<std::pair<nebula::HostAddr, folly::EventBase*>, std::shared_ptr<nebula::meta::cpp2::MetaServiceAsyncClient>, std::hash<std::pair<nebula::HostAddr, folly::EventBase*> >, std::equal_to<std::pair<nebula::HostAddr, folly::EventBase*> >, std::allocator<std::pair<std::pair<nebula::HostAddr, folly::EventBase*> const, std::shared_ptr<nebula::meta::cpp2::MetaServiceAsyncClient> > > >*>(std::unordered_map<std::pair<nebula::HostAddr, folly::EventBase*>, std::shared_ptr<nebula::meta::cpp2::MetaServiceAsyncClient>, std::hash<std::pair<nebula::HostAddr, folly::EventBase*> >, std::equal_to<std::pair<nebula::HostAddr, folly::EventBase*> >, std::allocator<std::pair<std::pair<nebula::HostAddr, folly::EventBase*> const, std::shared_ptr<nebula::meta::cpp2::MetaServiceAsyncClient> > > >*)::{lambda(void*, folly::TLPDestructionMode)#2}::_FUN(void*, folly::TLPDestructionMode) ()
#4 0x0000000002614142 in folly::threadlocal_detail::StaticMetaBase::onThreadExit(void*) ()
#5 0x00007f44d5e64ca2 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0
#6 0x00007f44d5e64eb3 in start_thread () from /lib64/libpthread.so.0
#7 0x00007f44d5b8d96d in clone () from /lib64/libc.so.6