A strange operation in version 2.0 triggers a coredump

  • nebula version: 2.0.0

  • Deployment: distributed / standalone, both show this problem

  • Installation: built from source

  • Description of the problem:
    1. One thread keeps calling the RPC interface to run (go from xxx over edge1 where xxx) union/minus/intersect (go from xxx over edge1 where xxx) while, at the same time, a console session runs drop edge edge1. graphd coredumps reliably.
    2. Alternatively, run the query while edge1 does not exist and concurrently create edge edge1; that also cores.
    3. It only cores when the edge being dropped is the same edge named in the go statement; dropping/creating an unrelated edge causes no problem.

- Logs:
Too long to paste. In short: after the drop statement completes, other commands still execute normally, and the core happens roughly 1~3 s after the drop.

- gdb debugging:
In gdb, the segfault shows up in a different place each time. So far I have seen three distinct crash sites.
1.

#0  0x00007ff708d74428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ff708d7602a in __GI_abort () at abort.c:89
#2  0x0000000002e26e64 in __gnu_cxx::__verbose_terminate_handler() [clone .cold] ()
#3  0x00000000047f5a66 in __cxxabiv1::__terminate(void (*)()) ()
#4  0x00000000047f5ab1 in std::terminate() ()
#5  0x00000000047f610f in __cxa_pure_virtual ()
#6  0x0000000003250287 in nebula::graph::Scheduler::execute (this=0x7ff6d5226500, executor=0x7ff6fe287f40)
    at /home/nebula/src/scheduler/Scheduler.cpp:171
#7  0x000000000324f1a7 in nebula::graph::Scheduler::<lambda(nebula::Status)>::operator()(nebula::Status) const (__closure=0x7ff6d3a67648, 
    stats=...) at /home/nebula/src/scheduler/Scheduler.cpp:126
#8  0x0000000003257212 in nebula::graph::Scheduler::ExecTask<nebula::graph::Scheduler::doSchedule(nebula::graph::Executor*)::<lambda(nebula::Status)> >::operator()(<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ac4c>) (this=0x7ff6d3a67640, 
    arg=<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ac4c>) at /home/nebula/src/scheduler/Scheduler.h:56
#9  0x0000000003252f82 in folly::futures::detail::CoreCallbackState<nebula::Status, nebula::graph::Scheduler::ExecTask<nebula::graph::Scheduler::doSchedule(nebula::graph::Executor*)::<lambda(nebula::Status)> > >::invoke<nebula::Status>(<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ed1a>) (this=0x7ff6d3a67640, 
    args#0=<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ed1a>)
    at /opt/vesoft/third-party/include/folly/futures/Future-inl.h:91

2.

Thread 84 "executor-pri3-2" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f61835fe700 (LWP 17162)]
0x000000000309094e in nebula::graph::ExecutionPlan::addProfileStats(long, nebula::ProfilingStats&&) (this=0x0, planNodeId=11, 
    profilingStats=<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x3bb18f0, DIE 0x3c54314>)
    at /home/nebula/src/planner/ExecutionPlan.cpp:98
98      /home/nebula/src/planner/ExecutionPlan.cpp: No such file or directory.
(gdb) where
#0  0x000000000309094e in nebula::graph::ExecutionPlan::addProfileStats(long, nebula::ProfilingStats&&) (this=0x0, planNodeId=11, 
    profilingStats=<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x3bb18f0, DIE 0x3c54314>)
    at /home/nebula/src/planner/ExecutionPlan.cpp:98
#1  0x00000000030dc299 in nebula::graph::Executor::close (this=0x7f616c09d800) at /home/nebula/src/executor/Executor.cpp:549
#2  0x0000000003125c3b in nebula::graph::GetNeighborsExecutor::close (this=0x7f616c09d800)
    at /home/nebula/src/executor/query/GetNeighborsExecutor.cpp:39
#3  0x00000000032501ab in nebula::graph::Scheduler::<lambda(nebula::Status)>::operator()(nebula::Status) const (__closure=0x7f616c4a03c0, 
    s=...) at /home/nebula/src/scheduler/Scheduler.cpp:173
#4  0x0000000003254f23 in folly::futures::detail::CoreCallbackState<nebula::Status, nebula::graph::Scheduler::execute(nebula::graph::Executor*)::<lambda(nebula::Status)> >::invoke<nebula::Status>(<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881dbcc>) (
    this=0x7f616c4a03c0, args#0=<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881dbcc>)
    at /opt/vesoft/third-party/include/folly/futures/Future-inl.h:91
#5  0x0000000003254f90 in folly::futures::detail::detail_msvc_15_7_workaround::invoke<false, folly::futures::detail::CoreCallbackState<nebula::Status, nebula::graph::Scheduler::execute(nebula::graph::Executor*)::<lambda(nebula::Status)> >, nebula::Status, nebula::Status&&>(folly::futures::detail::CoreCallbackState<nebula::Status, nebula::graph::Scheduler::execute(nebula::graph::Executor*)::<lambda(nebula::Status)> > &, folly::Try<nebula::Status> &) (state=..., t=...) at /opt/vesoft/third-party/include/folly/futures/Future-inl.h:288

3.

#0  0x00000000030e17ac in nebula::graph::RequestContext<nebula::ExecutionResponse>::runner (this=0x2000000000000)
    at /home/nebula/src/service/RequestContext.h:68
#1  0x00000000030dc618 in nebula::graph::Executor::runner (this=0x7fdd6068e400) at /home/nebula/src/executor/Executor.cpp:573
#2  0x00000000030dc3d9 in nebula::graph::Executor::error (this=0x7fdd6068e400, status=...) at /home/nebula/src/executor/Executor.cpp:559
#3  0x000000000324f177 in nebula::graph::Scheduler::<lambda(nebula::Status)>::operator()(nebula::Status) const (__closure=0x7fdd31e84b48, 
    stats=...) at /home/nebula/src/scheduler/Scheduler.cpp:125
#4  0x0000000003257212 in nebula::graph::Scheduler::ExecTask<nebula::graph::Scheduler::doSchedule(nebula::graph::Executor*)::<lambda(nebula::Status)> >::operator()(<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ac4c>) (this=0x7fdd31e84b40, 
    arg=<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ac4c>) at /home/nebula/src/scheduler/Scheduler.h:56
#5  0x0000000003252f82 in folly::futures::detail::CoreCallbackState<nebula::Status, nebula::graph::Scheduler::ExecTask<nebula::graph::Scheduler::doSchedule(nebula::graph::Executor*)::<lambda(nebula::Status)> > >::invoke<nebula::Status>(<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ed1a>) (this=0x7fdd31e84b40, 
    args#0=<unknown type in /usr/local/nebula/bin/nebula-graphd, CU 0x874ed3a, DIE 0x881ed1a>)
    at /opt/vesoft/third-party/include/folly/futures/Future-inl.h:91
#6  0x0000000003252fc6 in folly::futures::detail::CoreCallbackState<nebula::Status, nebula::graph::Scheduler::ExecTask<nebula::graph::Scheduler::doSchedule(nebula::graph::Executor*)::<lambda(nebula::Status)> > >::<lambda()>::operator()(void) const (this=0x7fdd31e84b40)
    at /opt/vesoft/third-party/include/folly/futures/Future-inl.h:96

- Guess:
It looks like a memory issue.

- Related observations:
(lookup xxx intersect/minus lookup xxx) does not seem to hit this problem (I simulated the same scenario dozens of times without a core). Comparing the execution plans, the difference is that the go statement has an extra Project execution node.

- What I'm hoping for:
Has this been fixed anywhere after 2.0? :rofl: I'd like to find where the problem is... it's proving hard to track down.

Bumping this thread; the dev folks are looking into it.

I think I've roughly found the problem. The scheduler's handling of parallel branches in the execution plan seems off: if one sub-branch returns an error early, the whole statement is finished early, the query is deleted and its resources are released; but at that moment the other branch has only just finished executing, and when it touches the runner and other state in the context, it blows up.
That's my guess.
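A minimal, self-contained sketch of the suspected pattern (this is not NebulaGraph code; QueryContext and the two threads are just stand-ins): the branch that fails first frees the per-statement context, and the branch that finishes later dereferences it.

```cpp
#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Stand-in for the per-statement state (runner, variables, plan, ...).
struct QueryContext {
    std::string runner = "executor-pri3";
};

int main() {
    QueryContext* ctx = new QueryContext();  // owned by the "statement"

    // Branch A hits an error (e.g. edge1 was just dropped) and finishes first;
    // the statement is treated as done, so its context is released immediately.
    std::thread failingBranch([&ctx] {
        delete ctx;
        ctx = nullptr;
    });

    // Branch B is still running; when it completes it touches the context,
    // which may already be gone. This deliberately demonstrates the
    // use-after-free, so the program may crash or print garbage.
    std::thread slowBranch([&ctx] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        std::cout << ctx->runner << "\n";  // undefined behaviour if ctx was freed
    });

    failingBranch.join();
    slowBranch.join();
}
```

If this is what happens, it would also explain why the backtraces above land somewhere different each time: the crash site just depends on which piece of freed state the late branch touches first.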

The reason the lookup statement doesn't hit this is that the intersect node collects the futures of the two indexscan branches, so it only returns after both branches have finished.
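For contrast, a sketch of that wait-for-both behaviour (plain std::future here purely for illustration; the real code uses folly futures, as the stack traces show): the downstream node cannot run, and the statement cannot be torn down, until every input branch has produced its result.

```cpp
#include <future>
#include <iostream>
#include <vector>

int main() {
    // Two "index scan" branches running concurrently.
    auto left  = std::async(std::launch::async, [] { return std::vector<int>{1, 2, 3}; });
    auto right = std::async(std::launch::async, [] { return std::vector<int>{2, 3, 4}; });

    // An intersect-style node that collects both futures only proceeds once
    // both branches are done, so neither branch can outlive the statement.
    std::vector<int> l = left.get();
    std::vector<int> r = right.get();
    std::cout << "both inputs ready: " << l.size() << " and " << r.size() << " rows\n";
}
```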

The go statement's execution plan, however, does not wait like that. :thinking:

It does look like this is the problem: after I removed the Project node from the go statement's plan, the crash no longer happens. But if it really were this, I'd have expected it to be found much earlier...

:thinking: Taking a look, 3.0 may have the same problem: when one branch hits makeFailure, ends early, and deletes itself, and the other branch only finishes afterwards, the late branch may still use variables from that statement's context.

OK, I've forwarded your findings to the dev team :thinking: they're looking into it.

:thinking: Any word from the devs? I patched the 2.0 scheduler so that when an error occurs inside a branch, the error is propagated upward instead of being thrown as an exception, and the problem went away. But I'm still not certain that the core really is triggered by a branch returning early :rofl:

I'll ask them to reply here once they reach a conclusion.

:+1: Sounds good

This has been fixed since 2.5; you can take a look at the latest code. All executor exceptions are now caught early instead of being thrown straight up to the caller.
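Roughly that shape of change, as a hedged sketch (not the actual NebulaGraph code; Status, runExecutor and the callback below are simplified stand-ins): the exception is caught inside the task and converted into an error result that flows through the normal completion path, so a failing branch can no longer skip the scheduler's bookkeeping by unwinding the stack.

```cpp
#include <functional>
#include <iostream>
#include <stdexcept>
#include <string>

// Simplified stand-in for nebula::Status.
struct Status {
    bool ok;
    std::string msg;
    static Status OK() { return {true, ""}; }
    static Status Error(std::string m) { return {false, std::move(m)}; }
};

// The executor body may throw, e.g. if the schema for edge1 vanished mid-query.
Status runExecutor(bool schemaStillThere) {
    if (!schemaStillThere) throw std::runtime_error("edge `edge1' not found");
    return Status::OK();
}

// Catch early: turn any exception into an error Status and hand it to the
// scheduler through the normal callback, instead of letting it propagate as
// an exception past the task boundary.
void scheduleTask(bool schemaStillThere, const std::function<void(Status)>& onFinish) {
    Status s = Status::OK();
    try {
        s = runExecutor(schemaStillThere);
    } catch (const std::exception& e) {
        s = Status::Error(e.what());
    }
    onFinish(s);  // every branch reports back, successful or not
}

int main() {
    scheduleTask(false, [](const Status& s) {
        std::cout << (s.ok ? "finished" : "failed: " + s.msg) << "\n";
    });
}
```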

:pleading_face: Got it. If the error is propagated straight upward, there's also no need to check whether you're inside a branch.

Do you mean the master branch?

No, I mean a branch inside the execution plan.

After the scheduler refactor, did you add tests for it? I went through the file's change history and couldn't seem to find any :rofl:

2.0 ~ 2.5 lives in the nebula-graph repo; 2.6 ~ 3.0 is in nebula.
