A strange operation in version 2.0 triggers a coredump

  • nebula version: 2.0.0

  • Deployment: distributed / standalone, both show this problem

  • Installation: built from source

  • Description of the problem:
    1. One thread keeps calling the RPC interface to run (go from xxx over edge1 where xxx) union/minus/intersect (go from xxx over edge1 where xxx) while, at the same time, a console session runs drop edge edge1. graphd coredumps reliably.
    2. Alternatively, run the query while edge1 does not exist and concurrently create edge edge1; that also cores.
    3. It only cores when the edge being dropped is the same edge named in the go statement; dropping/creating an unrelated edge causes no problem.

- Logs:
Too long to paste. In short: after the drop statement completes, other commands still execute normally, and the core happens roughly 1~3 s after the drop.

- gdb debugging:
In gdb, the segfault shows up in a different place each time. So far I have seen three distinct crash sites.
1.

#0  0x00007ff708d74428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ff708d7602a in __GI_abort () at abort.c:89
#2  0x0000000002e26e64 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] ()
#3  0x00000000047f5a66 in __cxxabiv1::__terminate(void (*)()) ()
#4  0x00000000047f5ab1 in std::terminate() ()
#5  0x00000000047f610f in __cxa_pure_virtual ()
#6  0x0000000003250287 in nebula::graph::Scheduler::execute (this=0x7ff6d5226500, executor=0x7ff6fe287f40)
    at /home/nebula/src/scheduler/Scheduler.cpp:171
#7  0x000000000324f1a7 in nebula::graph::Scheduler::<lambda(nebula::Status)>::operator()(nebula::Status) const (__closure=0x7ff6d3a67648, 
    stats=...) at /home/nebula/src/scheduler/Scheduler.cpp:126
#8  0x0000000003257212 in nebula::graph::Scheduler::ExecTask<nebula::graph::Scheduler::doSchedule(nebula::graph::Executor*)::<lambda(nebula::Status)> >::operator()(<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ac4c>) (this=0x7ff6d3a67640, 
    arg=<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ac4c>) at /home/nebula/src/scheduler/Scheduler.h:56
#9  0x0000000003252f82 in folly::futures::detail::CoreCallbackState<nebula::Status, nebula::graph::Scheduler::ExecTask<nebula::graph::Scheduler::doSchedule(nebula::graph::Executor*)::<lambda(nebula::Status)> > >::invoke<nebula::Status>(<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ed1a>) (this=0x7ff6d3a67640, 
    args#0=<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ed1a>)
    at /opt/vesoft/third-party/include/folly/futures/Future-inl.h:91

2.

Thread 84 "executor-pri3-2" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f61835fe700 (LWP 17162)]
0x000000000309094e in nebula::graph::ExecutionPlan::addProfileStats(long, nebula::ProfilingStats&&) (this=0x0, planNodeId=11, 
    profilingStats=<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x3bb18f0, DIE 0x3c54314>)
    at /home/nebula/src/planner/ExecutionPlan.cpp:98
98      /home/nebula/src/planner/ExecutionPlan.cpp: No such file or directory.
(gdb) where
#0  0x000000000309094e in nebula::graph::ExecutionPlan::addProfileStats(long, nebula::ProfilingStats&&) (this=0x0, planNodeId=11, 
    profilingStats=<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x3bb18f0, DIE 0x3c54314>)
    at /home/nebula/src/planner/ExecutionPlan.cpp:98
#1  0x00000000030dc299 in nebula::graph::Executor::close (this=0x7f616c09d800) at /home/nebula/src/executor/Executor.cpp:549
#2  0x0000000003125c3b in nebula::graph::GetNeighborsExecutor::close (this=0x7f616c09d800)
    at /home/nebula/src/executor/query/GetNeighborsExecutor.cpp:39
#3  0x00000000032501ab in nebula::graph::Scheduler::<lambda(nebula::Status)>::operator()(nebula::Status) const (__closure=0x7f616c4a03c0, 
    s=...) at /home/nebula/src/scheduler/Scheduler.cpp:173
#4  0x0000000003254f23 in folly::futures::detail::CoreCallbackState<nebula::Status, nebula::graph::Scheduler::execute(nebula::graph::Executor*)::<lambda(nebula::Status)> >::invoke<nebula::Status>(<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881dbcc>) (
    this=0x7f616c4a03c0, args#0=<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881dbcc>)
    at /opt/vesoft/third-party/include/folly/futures/Future-inl.h:91
#5  0x0000000003254f90 in folly::futures::detail::detail_msvc_15_7_workaround::invoke<false, folly::futures::detail::CoreCallbackState<nebula::Status, nebula::graph::Scheduler::execute(nebula::graph::Executor*)::<lambda(nebula::Status)> >, nebula::Status, nebula::Status&&>(folly::futures::detail::CoreCallbackState<nebula::Status, nebula::graph::Scheduler::execute(nebula::graph::Executor*)::<lambda(nebula::Status)> > &, folly::Try<nebula::Status> &) (state=..., t=...) at /opt/vesoft/third-party/include/folly/futures/Future-inl.h:288

3.

#0  0x00000000030e17ac in nebula::graph::RequestContext<nebula::ExecutionResponse>::runner (this=0x2000000000000)
    at /home/nebula/src/service/RequestContext.h:68
#1  0x00000000030dc618 in nebula::graph::Executor::runner (this=0x7fdd6068e400) at /home/nebula/src/executor/Executor.cpp:573
#2  0x00000000030dc3d9 in nebula::graph::Executor::error (this=0x7fdd6068e400, status=...) at /home/nebula/src/executor/Executor.cpp:559
#3  0x000000000324f177 in nebula::graph::Scheduler::<lambda(nebula::Status)>::operator()(nebula::Status) const (__closure=0x7fdd31e84b48, 
    stats=...) at /home/nebula/src/scheduler/Scheduler.cpp:125
#4  0x0000000003257212 in nebula::graph::Scheduler::ExecTask<nebula::graph::Scheduler::doSchedule(nebula::graph::Executor*)::<lambda(nebula::Status)> >::operator()(<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ac4c>) (this=0x7fdd31e84b40, 
    arg=<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ac4c>) at /home/nebula/src/scheduler/Scheduler.h:56
#5  0x0000000003252f82 in folly::futures::detail::CoreCallbackState<nebula::Status, nebula::graph::Scheduler::ExecTask<nebula::graph::Scheduler::doSchedule(nebula::graph::Executor*)::<lambda(nebula::Status)> > >::invoke<nebula::Status>(<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ed1a>) (this=0x7fdd31e84b40, 
    args#0=<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ed1a>)
    at /opt/vesoft/third-party/include/folly/futures/Future-inl.h:91
#6  0x0000000003252fc6 in folly::futures::detail::CoreCallbackState<nebula::Status, nebula::graph::Scheduler::ExecTask<nebula::graph::Scheduler::doSchedule(nebula::graph::Executor*)::<lambda(nebula::Status)> > >::<lambda()>::operator()(void) const (this=0x7fdd31e84b40)
    at /opt/vesoft/third-party/include/folly/futures/Future-inl.h:96

- Guess:
It looks like a memory issue.

- Related observations:
(lookup xxx intersect/minus lookup xxx) does not seem to hit this problem (I simulated the same scenario dozens of times without a core). Comparing the execution plans, the difference is that the go statement has an extra Project execution node.

- What I'm hoping for:
Has this been fixed anywhere after 2.0? :rofl: I'd like to find where the problem is... it's proving hard to track down.

Bumping this thread; the dev folks are looking into it.

I think I've roughly found the problem. The scheduler's handling of parallel branches in the execution plan seems off: if one sub-branch returns an error early, the whole statement is finished early, the query is deleted and its resources are released; but at that moment the other branch has only just finished executing, and when it touches the runner and other state in the context, it blows up.
That's my guess.
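A minimal, self-contained sketch of the suspected pattern (this is not NebulaGraph code; QueryContext and the two threads are just stand-ins): the branch that fails first frees the per-statement context, and the branch that finishes later dereferences it.

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Stand-in for the per-statement state (runner, variables, plan, ...).
struct QueryContext {
    std::string runner = "executor-pri3";
};

int main() {
    QueryContext* ctx = new QueryContext();  // owned by the "statement"

    // Branch A hits an error (e.g. edge1 was just dropped) and finishes first;
    // the statement is treated as done, so its context is released immediately.
    std::thread failingBranch([&ctx] {
        delete ctx;
        ctx = nullptr;
    });

    // Branch B is still running; when it completes it touches the context,
    // which may already be gone. This deliberately demonstrates the
    // use-after-free, so the program may crash or print garbage.
    std::thread slowBranch([&ctx] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        std::cout << ctx->runner << "\n";  // undefined behaviour if ctx was freed
    });

    failingBranch.join();
    slowBranch.join();
}
```

If this is what happens, it would also explain why the backtraces above land somewhere different each time: the crash site just depends on which piece of freed state the late branch touches first.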

The reason the lookup statement doesn't hit this is that the intersect node collects the futures of the two indexscan branches, so it only returns after both branches have finished.
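For contrast, a sketch of that wait-for-both behaviour (plain std::future here purely for illustration; the real code uses folly futures, as the stack traces show): the downstream node cannot run, and the statement cannot be torn down, until every input branch has produced its result.

```cpp
#include <future>
#include <iostream>
#include <vector>

int main() {
    // Two "index scan" branches running concurrently.
    auto left  = std::async(std::launch::async, [] { return std::vector<int>{1, 2, 3}; });
    auto right = std::async(std::launch::async, [] { return std::vector<int>{2, 3, 4}; });

    // An intersect-style node that collects both futures only proceeds once
    // both branches are done, so neither branch can outlive the statement.
    std::vector<int> l = left.get();
    std::vector<int> r = right.get();
    std::cout << "both inputs ready: " << l.size() << " and " << r.size() << " rows\n";
}
```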

The go statement's execution plan, however, does not wait like that. :thinking:

It does look like this is the problem: after I removed the Project node from the go statement's plan, the crash no longer happens. But if it really were this, I'd have expected it to be found much earlier...

:thinking: Taking a look, 3.0 may have the same problem: when one branch hits makeFailure, ends early, and deletes itself, and the other branch only finishes afterwards, the late branch may still use variables from that statement's context.

OK, I've forwarded your findings to the dev team :thinking: they're looking into it.

:thinking: Any word from the devs? I patched the 2.0 scheduler so that when an error occurs inside a branch, the error is propagated upward instead of being thrown as an exception, and the problem went away. But I'm still not certain that the core really is triggered by a branch returning early :rofl:

I'll ask them to reply here once they reach a conclusion.

:+1: Sounds good

This has been fixed since 2.5; you can take a look at the latest code. All executor exceptions are now caught early instead of being thrown straight up to the caller.
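Roughly that shape of change, as a hedged sketch (not the actual NebulaGraph code; Status, runExecutor and the callback below are simplified stand-ins): the exception is caught inside the task and converted into an error result that flows through the normal completion path, so a failing branch can no longer skip the scheduler's bookkeeping by unwinding the stack.

```cpp
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Simplified stand-in for nebula::Status.
struct Status {
    bool ok;
    std::string msg;
    static Status OK() { return {true, ""}; }
    static Status Error(std::string m) { return {false, std::move(m)}; }
};

// The executor body may throw, e.g. if the schema for edge1 vanished mid-query.
Status runExecutor(bool schemaStillThere) {
    if (!schemaStillThere) throw std::runtime_error("edge `edge1' not found");
    return Status::OK();
}

// Catch early: turn any exception into an error Status and hand it to the
// scheduler through the normal callback, instead of letting it propagate as
// an exception past the task boundary.
void scheduleTask(bool schemaStillThere, const std::function<void(Status)>& onFinish) {
    Status s = Status::OK();
    try {
        s = runExecutor(schemaStillThere);
    } catch (const std::exception& e) {
        s = Status::Error(e.what());
    }
    onFinish(s);  // every branch reports back, successful or not
}

int main() {
    scheduleTask(false, [](const Status& s) {
        std::cout << (s.ok ? "finished" : "failed: " + s.msg) << "\n";
    });
}
```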

:pleading_face: Got it. If the error is propagated straight upward, there's also no need to check whether you're inside a branch.

Do you mean the master branch?

No, I mean a branch inside the execution plan.

After the scheduler refactor, did you add tests for it? I went through the file's change history and couldn't seem to find any :rofl:

2.0 ~ 2.5 lives in the nebula-graph repo; 2.6 ~ 3.0 is in nebula.
