Storage nodes crash and fail to restart

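  • Nebula version: 3.1.0.el7.x86_64
  • Deployment: distributed
  • Installation method: RPM
  • Production environment: Y
  • Hardware
    • Disk: 2 TB SSD
    • CPU and memory: 32 cores, 128 GB RAM
  • Detailed description of the problem
    I deployed graphd, metad, and storaged separately on 3 machines. I then added one storage node and ran balance data, and everything worked. When I added a 3rd storage node, the first two storage nodes both crashed and produced several core files.
    The storage error log is as follows: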
    (safe mode, symbolizer not available)
    *** Aborted at 1665649186 (Unix time, try 'date -d @1665649186') ***
    *** Signal 11 (SIGSEGV) (0x20) received by PID 13428 (pthread TID 0x7f3fd7178700) (linux TID 13628) (code: address not mapped to object), stack trace: ***
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly10symbolizer17getStackTraceSafeEPmm+0x31)[0x253bde1]
    *** Aborted at 1665649341 (Unix time, try 'date -d @1665649341') ***
    *** Signal 11 (SIGSEGV) (0x20) received by PID 13949 (pthread TID 0x7fe4459ff700) (linux TID 14147) (code: address not mapped to object), stack trace: ***
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly10symbolizer17getStackTraceSafeEPmm+0x31)[0x253bde1]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly10symbolizer21SafeStackTracePrinter15printStackTraceEb+0x26)[0x25332e6]
    /usr/local/nebula/bin/nebula-storaged[0x2531277]
    /lib64/libpthread.so.0(+0xf5cf)[0x7fe4846e15cf]
    /usr/local/nebula/bin/nebula-storaged(_ZN6nebula7storage18TransactionManager18getTermFromKVStoreEii+0xd)[0x1269d7d]
    /usr/local/nebula/bin/nebula-storaged(_ZN6nebula7storage28ChainResumeAddPrimeProcessor12processLocalENS_4cpp29ErrorCodeE+0x80)[0x12961a0]
    /usr/local/nebula/bin/nebula-storaged[0x12695ba]
    /usr/local/nebula/bin/nebula-storaged[0x126b006]
    /usr/local/nebula/bin/nebula-storaged[0x249c60b]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly18ThreadPoolExecutor7runTaskERKSt10shared_ptrINS0_6ThreadEEONS0_4TaskE+0x137)[0x247b037]
    /usr/local/nebula/bin/nebula-storaged[0x246d84e]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly23AtomicNotificationQueueINS_8FunctionIFvvEEEE5driveIRNS_9EventBase10FuncRunnerEEEbOT_+0xd0)[0x24e6610]
    /usr/local/nebula/bin/nebula-storaged(_ZThn40_N5folly32EventBaseAtomicNotificationQueueINS_8FunctionIFvvEEENS_9EventBase10FuncRunnerEE12handlerReadyEt+0x2c)[0x24e763c]
    /usr/local/nebula/bin/nebula-storaged[0x25a1f24]
    /usr/local/nebula/bin/nebula-storaged(event_base_loop+0x36e)[0x25a258e]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly9EventBase8loopBodyEib+0x46d)[0x24e0bbd]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly9EventBase4loopEv+0x3d)[0x24e14ad]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly9EventBase11loopForeverEv+0x17)[0x24e40e7]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly20IOThreadPoolExecutor9threadRunESt10shared_ptrINS_18ThreadPoolExecutor6ThreadEE+0x338)[0x246e238]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly6detail8function14FunctionTraitsIFvvEE7callBigISt5_BindIFMNS_18ThreadPoolExecutorEFvSt10shared_ptrINS7_6ThreadEEEPS7_SA_EEEEvRNS1_4DataE+0x46)[0x247c7a6]
    /usr/local/nebula/bin/nebula-storaged[0x2aa508f]
    /lib64/libpthread.so.0(+0x7dd4)[0x7fe4846d9dd4]
    /lib64/libc.so.6(clone+0x6c)[0x7fe48440302c]
    (safe mode, symbolizer not available)
    *** Aborted at 1665649341 (Unix time, try 'date -d @1665649341') ***
    *** Signal 11 (SIGSEGV) (0x20) received by PID 13949 (pthread TID 0x7fe43e9ff700) (linux TID 14154) (code: address not mapped to object), stack trace: ***
    *** Aborted at 1665649425 (Unix time, try 'date -d @1665649425') ***
    *** Signal 11 (SIGSEGV) (0x20) received by PID 14328 (pthread TID 0x7f8cc65ff700) (linux TID 14558) (code: address not mapped to object), stack trace: ***
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly10symbolizer17getStackTraceSafeEPmm+0x31)[0x253bde1]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly10symbolizer21SafeStackTracePrinter15printStackTraceEb+0x26)[0x25332e6]
    /usr/local/nebula/bin/nebula-storaged[0x2531277]
    /lib64/libpthread.so.0(+0xf5cf)[0x7f8cfd47b5cf]
    /usr/local/nebula/bin/nebula-storaged(_ZN6nebula7storage18TransactionManager18getTermFromKVStoreEii+0xd)[0x1269d7d]
    /usr/local/nebula/bin/nebula-storaged(_ZN6nebula7storage28ChainResumeAddPrimeProcessor12processLocalENS_4cpp29ErrorCodeE+0x80)[0x12961a0]
    /usr/local/nebula/bin/nebula-storaged[0x12695ba]
    /usr/local/nebula/bin/nebula-storaged[0x126b006]
    /usr/local/nebula/bin/nebula-storaged[0x249c60b]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly18ThreadPoolExecutor7runTaskERKSt10shared_ptrINS0_6ThreadEEONS0_4TaskE+0x137)[0x247b037]
    /usr/local/nebula/bin/nebula-storaged[0x246d84e]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly23AtomicNotificationQueueINS_8FunctionIFvvEEEE5driveIRNS_9EventBase10FuncRunnerEEEbOT_+0xd0)[0x24e6610]
    /usr/local/nebula/bin/nebula-storaged(_ZThn40_N5folly32EventBaseAtomicNotificationQueueINS_8FunctionIFvvEEENS_9EventBase10FuncRunnerEE12handlerReadyEt+0x2c)[0x24e763c]
    /usr/local/nebula/bin/nebula-storaged[0x25a1f24]
    /usr/local/nebula/bin/nebula-storaged(event_base_loop+0x36e)[0x25a258e]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly9EventBase8loopBodyEib+0x46d)[0x24e0bbd]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly9EventBase4loopEv+0x3d)[0x24e14ad]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly9EventBase11loopForeverEv+0x17)[0x24e40e7]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly20IOThreadPoolExecutor9threadRunESt10shared_ptrINS_18ThreadPoolExecutor6ThreadEE+0x338)[0x246e238]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly6detail8function14FunctionTraitsIFvvEE7callBigISt5_BindIFMNS_18ThreadPoolExecutorEFvSt10shared_ptrINS7_6ThreadEEEPS7_SA_EEEEvRNS1_4DataE+0x46)[0x247c7a6]
    /usr/local/nebula/bin/nebula-storaged[0x2aa508f]
    /lib64/libpthread.so.0(+0x7dd4)[0x7f8cfd473dd4]
    /lib64/libc.so.6(clone+0x6c)[0x7f8cfd19d02c]
    (safe mode, symbolizer not available)
    *** Aborted at 1665649425 (Unix time, try 'date -d @1665649425') ***
    *** Signal 11 (SIGSEGV) (0x20) received by PID 14328 (pthread TID 0x7f8cbcdfb700) (linux TID 14562) (code: address not mapped to object), stack trace: ***
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly10symbolizer17getStackTraceSafeEPmm+0x31)[0x253bde1]
    *** Aborted at 1665649457 (Unix time, try 'date -d @1665649457') ***
    *** Signal 11 (SIGSEGV) (0x20) received by PID 14714 (pthread TID 0x7f1b229ff700) (linux TID 14948) (code: address not mapped to object), stack trace: ***
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly10symbolizer17getStackTraceSafeEPmm+0x31)[0x253bde1]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly10symbolizer21SafeStackTracePrinter15printStackTraceEb+0x26)[0x25332e6]
    /usr/local/nebula/bin/nebula-storaged[0x2531277]
    /lib64/libpthread.so.0(+0xf5cf)[0x7f1b60e015cf]
    /usr/local/nebula/bin/nebula-storaged(_ZN6nebula7storage18TransactionManager18getTermFromKVStoreEii+0xd)[0x1269d7d]
    /usr/local/nebula/bin/nebula-storaged(_ZN6nebula7storage28ChainResumeAddPrimeProcessor12processLocalENS_4cpp29ErrorCodeE+0x80)[0x12961a0]
    /usr/local/nebula/bin/nebula-storaged[0x12695ba]
    /usr/local/nebula/bin/nebula-storaged[0x126b006]
    /usr/local/nebula/bin/nebula-storaged[0x249c60b]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly18ThreadPoolExecutor7runTaskERKSt10shared_ptrINS0_6ThreadEEONS0_4TaskE+0x137)[0x247b037]
    /usr/local/nebula/bin/nebula-storaged[0x246d84e]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly23AtomicNotificationQueueINS_8FunctionIFvvEEEE5driveIRNS_9EventBase10FuncRunnerEEEbOT_+0xd0)[0x24e6610]
    /usr/local/nebula/bin/nebula-storaged(_ZThn40_N5folly32EventBaseAtomicNotificationQueueINS_8FunctionIFvvEEENS_9EventBase10FuncRunnerEE12handlerReadyEt+0x2c)[0x24e763c]
    /usr/local/nebula/bin/nebula-storaged[0x25a1f24]
    /usr/local/nebula/bin/nebula-storaged(event_base_loop+0x36e)[0x25a258e]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly9EventBase8loopBodyEib+0x46d)[0x24e0bbd]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly9EventBase4loopEv+0x3d)[0x24e14ad]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly9EventBase11loopForeverEv+0x17)[0x24e40e7]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly20IOThreadPoolExecutor9threadRunESt10shared_ptrINS_18ThreadPoolExecutor6ThreadEE+0x338)[0x246e238]
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly6detail8function14FunctionTraitsIFvvEE7callBigISt5_BindIFMNS_18ThreadPoolExecutorEFvSt10shared_ptrINS7_6ThreadEEEPS7_SA_EEEEvRNS1_4DataE+0x46)[0x247c7a6]
    /usr/local/nebula/bin/nebula-storaged[0x2aa508f]
    /lib64/libpthread.so.0(+0x7dd4)[0x7f1b60df9dd4]
    /lib64/libc.so.6(clone+0x6c)[0x7f1b60b2302c]
    (safe mode, symbolizer not available)
    *** Aborted at 1665649457 (Unix time, try 'date -d @1665649457') ***
    *** Signal 11 (SIGSEGV) (0x20) received by PID 14714 (pthread TID 0x7f1b1b3ff700) (linux TID 14956) (code: address not mapped to object), stack trace: ***
    /usr/local/nebula/bin/nebula-storaged(_ZN5folly10symbolizer17getStackTraceSafeEPmm+0x31)[0x253bde1]
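
The log above interleaves several restart attempts, so it can help to reduce it to one line per crashed process. The following is a minimal sketch, assuming the log format shown above; the function name `summarize_crashes` and the regular expressions are illustrative and not part of nebula. It records the PID, the signal name, the abort time in UTC, and the first frame in the `nebula` namespace, which is still mangled (a tool such as `c++filt` can demangle it).

```python
import re
from datetime import datetime, timezone

# Patterns fitted to the log lines shown in this thread (an assumption,
# not a documented format).
ABORT_RE = re.compile(r"\*\*\* Aborted at (\d+) \(Unix time")
SIGNAL_RE = re.compile(r"\*\*\* Signal (\d+) \((\w+)\).*received by PID (\d+)")
FRAME_RE = re.compile(r"nebula-storaged\((_ZN6nebula\w+)\+0x[0-9a-f]+\)")


def summarize_crashes(log_text):
    """Return one dict per crashed PID, in order of first appearance."""
    crashes = {}
    current = None
    last_time = None
    for line in log_text.splitlines():
        m = ABORT_RE.search(line)
        if m:
            # Convert the Unix timestamp to an aware UTC datetime.
            last_time = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
            continue
        m = SIGNAL_RE.search(line)
        if m:
            pid = int(m.group(3))
            # Several threads of one process may report; keep the first.
            current = crashes.setdefault(
                pid,
                {"pid": pid, "signal": m.group(2), "time": last_time, "frame": None},
            )
            continue
        m = FRAME_RE.search(line)
        if m and current is not None and current["frame"] is None:
            # First frame inside the nebula namespace for this process.
            current["frame"] = m.group(1)
    return list(crashes.values())
```

Calling `summarize_crashes` on the text of the stderr log should show whether every restart attempt dies in the same nebula frame, which is what the traces above suggest.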

When I added a 3rd storage node, the first two storage nodes both crashed and produced several core files.

Sorry, I don't quite follow. You originally had 3 storage, 3 meta, and 3 graph, then added 1 storage node and ran balance data, and then what? Which node is the third one?

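Originally there was 1 storage, 1 graph, and 1 meta. I then added one storage node and ran balance data, and everything worked. Then I added another storage node and ran balance data; at that point the first two storage nodes both crashed, and neither can be restarted.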

I can set up a Tencent Meeting and share my screen so we can look at the logs together.

A system that suddenly goes down and then won't come back up is pretty scary.

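One moment, I'll first check with someone who knows balance well. The stack trace looks like it could be a null pointer, but I don't know whether this is a known issue.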

It should be a null pointer. Core files were generated, but they are very large, so I can't send them to you.

Could you post the logs from around the time of the crash? Mainly the storaged ones.

storage-error.log (833 bytes)
storage-info.log (17.7 KB)

What is port 9777 used for? Internal RPC communication between storage nodes?

storaged-stderr.log (2.8 KB)

Is anyone able to help look at this issue?

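I backed up all of the storage data files to another location and then emptied the data directory, after which storage started normally. So it seems the data files are the problem and are what prevents startup.
Is there a way to reload the backed-up data files?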

One more suspicious point: after adding the third storage node I ran balance data, it failed, and then two storage nodes crashed immediately.

(root@nebula) [sp_pt_10219769125f453ebcb6d62a95503179]> show jobs
+--------+------------------+------------+----------------------------+----------------------------+
| Job Id | Command          | Status     | Start Time                 | Stop Time                  |
+--------+------------------+------------+----------------------------+----------------------------+
| 75     | "ZONE_BALANCE"   | "FAILED"   | 2022-10-14T10:41:11.000000 | 2022-10-14T10:41:11.000000 |
| 74     | "COMPACT"        | "FINISHED" | 2022-10-13T15:13:14.000000 | 2022-10-13T15:13:50.000000 |
| 72     | "ZONE_BALANCE"   | "FINISHED" | 2022-10-13T12:14:07.000000 | 2022-10-13T12:15:37.000000 |
| 71     | "LEADER_BALANCE" | "FINISHED" | 2022-10-13T12:09:23.000000 | 2022-10-13T12:09:23.000000 |
+--------+------------------+------------+----------------------------+----------------------------+
Got 4 rows (time spent 1601/2197 us)
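
When the job list grows longer, the failed entries can be picked out of the console output with a few lines of code. This is only a sketch, assuming the table layout printed above; `parse_table` and `failed_jobs` are hypothetical helper names, not part of any nebula client.

```python
def parse_table(text):
    """Parse a console table drawn with '|' and '+---+' borders into dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip borders, prompts, and the "Got N rows" footer.
        if not line.startswith("|"):
            continue
        # Split on the column separators and drop the quotes around strings.
        cells = [cell.strip().strip('"') for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first '|' line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def failed_jobs(text):
    """Return the rows whose Status column is FAILED."""
    return [row for row in parse_table(text) if row.get("Status") == "FAILED"]
```

Applied to the output above, `failed_jobs` would return only the row for job 75, the ZONE_BALANCE job that failed on 2022-10-14.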

@SuperYoko It looks like there is corruption in the data?

While "balance data" was still running in one space, I also ran "balance data" in another space. Could that have had an effect?

The experimental features in the Community Edition are unstable, and toss was enabled as well… I'm really not very familiar with this…

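Oh, come on. If we can't run balance data after adding storage nodes, who would dare use your product? Do we have to buy the Enterprise Edition to get stability?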

What is toss? What effect does enabling it have? Can it be turned off?

Could someone on the team help push this issue toward a resolution? It is seriously affecting our confidence in nebula.

balance data is an Enterprise Edition feature and has not been maintained for a long time.