NebulaGraph v3.2: graphd nodes occasionally crash when concurrent read/write requests from nebula-java trigger timeouts

Question reference template:

  • NebulaGraph version: (to save responders the time of double-checking version info, first-time posters should show the version as a screenshot)
    v3.2.1 baseline, compiled from source ourselves

  • Deployment: distributed cluster

  • Installation: built from source

  • In production: Y

  • OS: uname -a
Linux ncn4a-wisemlopsdppservice-32-203-100 4.18.0-147.5.1.6.h934.eulerosv2r9.x86_64 #1 SMP Sat Feb 4 09:00:27 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Compiler: g++ --version or clang++ --version
g++ (Ubuntu 10.5.0-1ubuntu1~20.04) 10.5.0
  • CPU: lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   42 bits physical, 48 bits virtual
CPU(s):                          16
On-line CPU(s) list:             0-15
Thread(s) per core:              2
Core(s) per socket:              8
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 6266C CPU @ 3.00GHz
Stepping:                        7
CPU MHz:                         3000.000
BogoMIPS:                        6000.00
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       256 KiB
L1i cache:                       256 KiB
L2 cache:                        8 MiB
L3 cache:                        30.3 MiB
NUMA node0 CPU(s):               0-15
Vulnerability Itlb multihit:     Processor vulnerable
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Vulnerable
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat avx512_vnni md_clear flush_l1d arch_capabilities

  • Detailed description of the problem

We used nebula-java to run a concurrent read/write stress test against a 3-node NebulaGraph cluster, with the client timeout set to 10 ms. Timeouts are triggered frequently (the timeout was deliberately set that short in order to trigger them).
The graphd server then frequently coredumps and restarts abnormally.
The call stack is as follows:

Thread 79 "graph-netio25" received signal SIGSEGV, Segmentation fault.
0x0000000005e86ba9 in apache::thrift::Cpp2Connection::stop() ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
1183	/install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h: No such file or directory.
(gdb) bt
#0  0x0000000005e86ba9 in apache::thrift::Cpp2Connection::stop() ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#1  0x0000000005e88889 in apache::thrift::Cpp2Connection::channelClosed(folly::exception_wrapper&&) ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#2  0x0000000005e9c567 in apache::thrift::HeaderServerChannel::messageChannelEOF() ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#3  0x0000000005e401d6 in apache::thrift::Cpp2Channel::processReadEOF() ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#4  0x0000000005e47132 in non-virtual thunk to wangle::ContextImpl<apache::thrift::Cpp2Channel>::readEOF() ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#5  0x0000000005e4568d in non-virtual thunk to wangle::ContextImpl<apache::thrift::FramingHandler>::readEOF() ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#6  0x0000000005e4550a in wangle::ContextImpl<apache::thrift::TAsyncTransportHandler>::fireReadEOF() ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#7  0x0000000005e43570 in non-virtual thunk to apache::thrift::TAsyncTransportHandler::readEOF() ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#8  0x0000000006013bae in folly::AsyncSocket::handleRead() ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#9  0x000000000600898a in folly::AsyncSocket::ioReady(unsigned short) ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#10 0x00000000060e2315 in ?? () at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#11 0x00000000060e297f in event_base_loop ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#12 0x000000000602180e in folly::EventBase::loopBody(int, bool) ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#13 0x00000000060220fe in folly::EventBase::loop() ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#14 0x0000000006024d38 in folly::EventBase::loopForever() ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#15 0x0000000005faf549 in folly::IOThreadPoolExecutor::threadRun(std::shared_ptr<folly::ThreadPoolExecutor::Thread>) ()
    at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#16 0x0000000005fbdab7 in void folly::detail::function::FunctionTraits<void ()>::callBig<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&) () at /install_temp/gcc-10.3.0/gcc-10.3.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/shared_ptr_base.h:1183
#17 0x0000000003eb1430 in folly::detail::function::FunctionTraits<void ()>::operator()() (this=0x7f9f43212410)
    at ../../../third-party/third-party/include/folly/Function.h:400
#18 0x0000000003f3a70a in folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}::operator()() (
    __closure=0x7f9f43212410) at ../../../../third-party/third-party/include/folly/executors/thread_factory/NamedThreadFactory.h:40
#19 0x0000000003f7a23b in std::__invoke_impl<void, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(std::__invoke_other, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&) (__f=...)
    at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/bits/invoke.h:60
#20 0x0000000003f79d05 in std::__invoke<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(std::__invoke_result&&, (folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&)...) (__fn=...)
    at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/bits/invoke.h:95
#21 0x0000000003f79a00 in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x7f9f43212410) at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/thread:264
#22 0x0000000003f7963c in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::operator()() (this=0x7f9f43212410) at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/thread:271
#23 0x0000000003f78f7e in std::thread::_State_impl<std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> > >::_M_run() (this=0x7f9f43212400) at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/thread:215
#24 0x00000000065f1e7e in std::execute_native_thread_routine (__p=0x7f9f43212400) at ../../../.././libstdc++-v3/src/c++11/thread.cc:78
#25 0x00007f9f43884f3b in ?? () from /usr/lib64/libpthread.so.0
#26 0x00007f9f437bc840 in clone () from /usr/lib64/libc.so.6

Can anyone help take a look?

I can't tell much from this little bit of log.
I suspect it is related to your query.

I suggest you upgrade to the latest 3.6; that version is much more stable.
If the problem persists, try running the statements one by one first, instead of concurrently, to see whether anything goes wrong.

Thread 79 "graph-netio25" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 393143]
0x000000000673b4c8 in apache::thrift::transport::THeader::getSequenceNumber() const ()
(gdb) bt
#0  0x000000000673b4c8 in apache::thrift::transport::THeader::getSequenceNumber() const ()
#1  0x000000000673ba84 in apache::thrift::HeaderServerChannel::HeaderRequest::isOneway() const ()
#2  0x000000000673c002 in apache::thrift::Cpp2Connection::Cpp2Request::isOneway() const ()
#3  0x0000000006735785 in apache::thrift::Cpp2Connection::stop() ()
#4  0x0000000006738dfb in ?? ()
#5  0x000000000673afc5 in ?? ()
#6  0x000000000673a91e in ?? ()
#7  0x0000000006738fb9 in apache::thrift::Cpp2Connection::channelClosed(folly::exception_wrapper&&) ()
#8  0x000000000675c3a7 in apache::thrift::HeaderServerChannel::messageChannelEOF() ()
#9  0x00000000066a4f9e in apache::thrift::Cpp2Channel::processReadEOF() ()
#10 0x00000000066a4966 in apache::thrift::Cpp2Channel::readEOF(wangle::HandlerContext<int, std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, apache::thrift::transport::THeader*> >*) ()
#11 0x00000000066b91b0 in wangle::ContextImpl<apache::thrift::Cpp2Channel>::readEOF() ()
#12 0x00000000066b78f6 in wangle::ContextImpl<apache::thrift::FramingHandler>::fireReadEOF() ()
#13 0x00000000066bb047 in wangle::Handler<folly::IOBufQueue&, std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::unique_ptr<apache::thrift::transport::THeader, std::default_delete<apache::thrift::transport::THeader> > >, std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, apache::thrift::transport::THeader*>, std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> > >::readEOF(wangle::HandlerContext<std::pair<std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> >, std::unique_ptr<apache::thrift::transport::THeader, std::default_delete<apache::thrift::transport::THeader> > >, std::unique_ptr<folly::IOBuf, std::default_delete<folly::IOBuf> > >*) ()
#14 0x00000000066b80bc in wangle::ContextImpl<apache::thrift::FramingHandler>::readEOF() ()
#15 0x00000000066b5e18 in wangle::ContextImpl<apache::thrift::TAsyncTransportHandler>::fireReadEOF() ()
#16 0x00000000066a7355 in apache::thrift::TAsyncTransportHandler::readEOF() ()
#17 0x00000000069d30a9 in folly::AsyncSocket::handleRead() ()
#18 0x00000000069c7e30 in folly::AsyncSocket::ioReady(unsigned short) ()
#19 0x0000000006a9f3e4 in ?? ()
#20 0x0000000006a9fc9f in event_base_loop ()
#21 0x00000000069e1d95 in folly::EventBase::loopBody(int, bool) ()
#22 0x00000000069e267e in folly::EventBase::loop() ()
#23 0x00000000069e4f08 in folly::EventBase::loopForever() ()
#24 0x000000000696c799 in folly::IOThreadPoolExecutor::threadRun(std::shared_ptr<folly::ThreadPoolExecutor::Thread>) ()
#25 0x000000000697b0c5 in void folly::detail::function::FunctionTraits<void ()>::callSmall<std::_Bind<void (folly::ThreadPoolExecutor::*(folly::ThreadPoolExecutor*, std::shared_ptr<folly::ThreadPoolExecutor::Thread>))(std::shared_ptr<folly::ThreadPoolExecutor::Thread>)> >(folly::detail::function::Data&) ()
#26 0x0000000004618188 in folly::detail::function::FunctionTraits<void ()>::operator()() (this=0x7f34168112c0)
    at ../../../third_party_build/install/include/folly/Function.h:400
#27 0x00000000046a0446 in folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}::operator()() (
    __closure=0x7f34168112c0) at ../../../../third_party_build/install/include/folly/executors/thread_factory/NamedThreadFactory.h:40
#28 0x00000000046dffc7 in std::__invoke_impl<void, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(std::__invoke_other, folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&) (__f=...)
    at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/bits/invoke.h:60
#29 0x00000000046dfa91 in std::__invoke<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}>(std::__invoke_result&&, (folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}&&)...) (__fn=...)
    at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/bits/invoke.h:95
#30 0x00000000046df78c in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x7f34168112c0) at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/thread:264
#31 0x00000000046df3c8 in std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> >::operator()() (this=0x7f34168112c0) at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/thread:271
#32 0x00000000046ded0a in std::thread::_State_impl<std::thread::_Invoker<std::tuple<folly::NamedThreadFactory::newThread(folly::Function<void ()>&&)::{lambda()#1}> > >::_M_run() (this=0x7f34168112b0) at /opt/buildtools/gcc-10.3.0/include/c++/10.3.0/thread:215
#33 0x0000000006fd915e in std::execute_native_thread_routine (__p=0x7f34168112b0) at ../../../.././libstdc++-v3/src/c++11/thread.cc:78
#34 0x00007f3416fb9f3b in ?? () from /usr/lib64/libpthread.so.0
#35 0x00007f3416ef1840 in clone () from /usr/lib64/libc.so.6

Here is a more detailed call stack. At this point it looks unrelated to the statements themselves; it is instead related to nebula-java's timeout handling. It seems that for some timed-out request, nebula-java closed the connection after the timeout while graphd was still processing the response, leading to some memory-stomping (use-after-free) operations.

Actually, I have already pinpointed this problem myself: one spot in the thrift library is missing a null-pointer check, namely the location in the call stack above.


thrift/lib/cpp2/async/HeaderServerChannel.h

  class HeaderRequest final : public ResponseChannelRequest {
   public:
    HeaderRequest(
        HeaderServerChannel* channel,
        std::unique_ptr<folly::IOBuf>&& buf,
        std::unique_ptr<apache::thrift::transport::THeader>&& header,
        const server::TServerObserver::SamplingStatus& samplingStatus);

    bool isActive() const override {
      DCHECK(false);
      return true;
    }

    // Note: a null-pointer check should be added in this function
    bool isOneway() const override {
      return header_->getSequenceNumber() == ONEWAY_REQUEST_ID;
    }

    bool includeEnvelope() const override { return true; }

    void setInOrderRecvSequenceId(uint32_t seqId) { InOrderRecvSeqId_ = seqId; }

Awesome :tulip: @flymysql independently discovered a problem that has already been fixed upstream in the third party; using the new third-party (v3) resolves it
(discussed in https://github.com/vesoft-inc/nebula/issues/5750)

Also, big thanks to @flymysql for an optimization PR, which will be merged soon as well :+1: https://github.com/vesoft-inc/nebula/pull/5754

Thanks for the recognition. Also, the community PR doesn't seem to have any committer reviewing it; could you @ two committers to help review it?


Yep, @critical27 is our storage lead; he has already reviewed/approved it for you, so it should be merged soon :tada:

Sorry, we don't have a well-oiled review process. Two storage-side colleagues are now reviewing the PR; once both approve, it can be merged. Sorry to keep you waiting~