An unknown statement crashes graphd:

  • nebula version: 3.0.1
  • deployment: distributed (cluster)
  • installation: built from source
  • production deployment: yes
  • hardware
    • disk: HDD
(gdb) bt
#0  0x00007ff9b1d0f410 in ?? ()
#1  0x00000000025f44da in std::char_traits<char>::compare (__n=<optimized out>, __s2=0x2659a6f <nebula::kTag> "_tag", __s1=<optimized out>)
    at /install_temp/gcc-10.2.0/gcc-10.2.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h:347
#2  std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::compare (this=0x7ff9a7b59340, __s=0x2659a6f <nebula::kTag> "_tag")
    at /install_temp/gcc-10.2.0/gcc-10.2.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/basic_string.tcc:1435
#3  0x000000000125f8f3 in nebula::graph::GetNeighborsIter::getVertex(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const ()
    at /install_temp/gcc-10.2.0/gcc-10.2.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h:395
#4  0x000000000125fdbf in nebula::graph::GetNeighborsIter::getVertices() () at /install_temp/gcc-10.2.0/gcc-10.2.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h:395
#5  0x0000000001195a6a in nebula::graph::TraverseExecutor::buildInterimPath(nebula::graph::GetNeighborsIter*) ()
    at /install_temp/gcc-10.2.0/gcc-10.2.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h:395
#6  0x0000000001198147 in nebula::graph::TraverseExecutor::handleResponse(nebula::storage::StorageRpcResponse<nebula::storage::cpp2::GetNeighborsResponse>&&) ()
    at /install_temp/gcc-10.2.0/gcc-10.2.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h:395
#7  0x0000000001198fcc in nebula::graph::TraverseExecutor::getNeighbors()::{lambda(nebula::storage::StorageRpcResponse<nebula::storage::cpp2::GetNeighborsResponse>&&)#1}::operator()(nebula::storage::StorageRpcResponse<nebula::storage::cpp2::GetNeighborsResponse>&&) () at /install_temp/gcc-10.2.0/gcc-10.2.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h:395
#8  0x0000000001199447 in _ZN5folly6detail8function14FunctionTraitsIFvRNS_7futures6detail8CoreBaseEONS_8Executor9KeepAliveIS7_EEPNS_17exception_wrapperEEE9callSmallIZNS4_4CoreIN6nebula7storage18StorageRpcResponseINSI_4cpp220GetNeighborsResponseEEEE11setCallbackIZNS4_10FutureBaseISM_E18thenImplementationIZNOS_6FutureISM_E9thenValueIZNSH_5graph16TraverseExecutor12getNeighborsEvEUlOSM_E_EENSS_INS4_19valueCallableResultISM_T_E10value_typeEEEOS10_EUlSA_ONS_3TryISM_EEE_NS4_25tryExecutorCallableResultISM_S18_vEEEENSt9enable_ifIXsrNT0_13ReturnsFutureE5valueENS1C_6ReturnEE4typeES14_S1C_NS4_18InlineContinuationEEUlSA_S17_E_EEvS14_OSt10shared_ptrINS_14RequestContextEES1H_EUlS6_SA_SC_E_EEvS6_SA_SC_RNS1_4DataE () at /install_temp/gcc-10.2.0/gcc-10.2.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h:395
#9  0x0000000002004f2c in ?? () at /install_temp/gcc-10.2.0/gcc-10.2.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h:395
#10 0x0000000001daf8f7 in virtual thunk to apache::thrift::concurrency::FunctionRunner::run() () at /install_temp/gcc-10.2.0/gcc-10.2.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h:395
#11 0x0000000001ef2468 in apache::thrift::concurrency::ThreadManager::Impl::Worker::run() () at /install_temp/gcc-10.2.0/gcc-10.2.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h:395
#12 0x0000000001ef456e in apache::thrift::concurrency::PthreadThread::threadMain(void*) () at /install_temp/gcc-10.2.0/gcc-10.2.0/x86_64-pc-linux-gnu/libstdc++-v3/include/bits/char_traits.h:395
#13 0x00007ff9b1d77f4b in ?? ()
#14 0x0000000000000000 in ?? ()
(gdb)

graphd coredump (backtrace above):


Could you provide a statement that makes this easy to reproduce?

Judging from the stack, the problem occurs while TraverseExecutor::handleResponse is executing, eventually reaching the tagPropNameList[i] == nebula::kTag comparison in GetNeighborsIter::getVertex.
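
A minimal self-contained sketch of that pattern, for illustration only (the names tagPropNameList, propList and kTag come from the trace; the surrounding structure is simplified and assumed, not the actual GetNeighborsIter code). If the name list and the value list ever differ in length and the guard below is missing, the index runs past tagPropNameList and the comparison against kTag ("_tag") reads garbage, which matches the char_traits::compare frame in the backtrace:

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

namespace nebula { const std::string kTag = "_tag"; }  // internal tag-name column

// Zip tag property names with their values, skipping the internal "_tag" column.
std::unordered_map<std::string, std::string> zipTagProps(
    const std::vector<std::string>& tagPropNameList,
    const std::vector<std::string>& propValues) {
  std::unordered_map<std::string, std::string> props;
  if (tagPropNameList.size() != propValues.size()) {
    std::cerr << "prop name/value size mismatch: " << tagPropNameList.size()
              << " vs " << propValues.size() << "\n";
    return props;  // refuse to zip instead of indexing out of bounds
  }
  for (size_t i = 0; i < propValues.size(); ++i) {
    if (tagPropNameList[i] == nebula::kTag) {
      continue;  // "_tag" carries no user-visible property value
    }
    props.emplace(tagPropNameList[i], propValues[i]);
  }
  return props;
}

int main() {
  std::vector<std::string> names = {"name", nebula::kTag};  // 2 names
  std::vector<std::string> values(8, "some value");         // 8 values
  auto props = zipTagProps(names, values);                   // rejected by the guard
  std::cout << "zipped " << props.size() << " properties\n";
}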

Is it an array index out of bounds? Debug it in the coredump's bt stack; it should be a fairly simple bug.

The main thing is that I still haven't pinned down which statement triggers it. I do suspect that (an index out of bounds), but I'm not certain.

Can you reproduce it? Was the log level set to the lowest at the time, which is why you couldn't find the statement? :thinking:

I'm trying to reproduce it now.

We run things as a batch, a whole test suite, so it will take some time. It runs concurrently, and I see bad_alloc exceptions in the log. But going through the code, there is a top-level exception catch-all, so it really feels like it shouldn't have crashed.

MATCH (v:Tag1)-[*0..0] where (properties(v).name contains "xxx") RETURN DISTINCT vs;

Something like this (but it seems to require a concurrent scenario; running the nGQL on its own does not trigger the crash).

Did you hit OOM? The memory allocation failed but execution kept going, and when some memory was accessed it was expected to hold a value that wasn't actually there. :sweat_smile:


As for OOM, I didn't find a corresponding kill record in the system messages log. Every backtrace is identical to the one I pasted above, and a single statement still doesn't trigger the crash.


Can that actually happen? If the allocation fails and execution keeps going, you end up with an out-of-bounds memory access? (With OOM the system hasn't killed the process yet, but every backtrace is identical, which seems odd to me and I don't fully understand it. Could it be that in this scenario the tagPropNameList vector got reallocated as it grew, causing the memory access to fail?)

:sweat_smile: For an unknown problem like this, with so little information provided, I don't think anyone can help you pinpoint and fix it unless it's a variant of a known bug or the team has seen something similar. From your description you can already reproduce the bug reliably, so add more logging to the code, recompile, and spend some time on it; that should get you there. Or post the exact sequence of statements and the usage scenario that triggers it, and maybe the team will take a look. There are many possible causes: too many statements in the batch, a hidden bug when executing multiple statements separated by semicolons, or a bug that only shows up under concurrency. With this little information it's hard to help unless the team has solved a similar problem before.
Also, you're on HDD? Maybe switch to SSD and the bug disappears (just kidding).


OK, I'll keep trying on my own. Thanks a lot.

F20220606 14:21:02.080574 1227884 Iterator.cpp:428] Check failed: tagPropNameList.size() == propList.values.size() (2 vs. 8)

By the way, could you tell me exactly what data tagPropNameList and propList.values hold? Thanks. (My guess is that this may be caused by creating schema concurrently and then querying, leading to the count mismatch.) I'm not very clear on what these two store; if you know, please point me in the right direction so I can dig further. Thanks.

After further log-based debugging: tagPropNameList stores each tag's schema (its property names), and propList.values stores the corresponding values.


Hi, I added the logging, and it looks like the cause is that the schema and the values don't line up.
Below is the output after printing both tagPropNameList and propList.values.
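
The logging I added was roughly of this shape (a sketch using glog, which graphd already links; in the real Iterator.cpp the values are nebula::Value rather than plain strings, so the signature below is an assumption):

#include <string>
#include <vector>
#include <glog/logging.h>

// Dump both containers side by side right before they are zipped together.
void dumpForCrash(const std::vector<std::string>& tagPropNameList,
                  const std::vector<std::string>& propValues) {
  LOG(INFO) << "Print for crash xxx tagPropNameList size: " << tagPropNameList.size();
  for (const auto& name : tagPropNameList) {
    LOG(INFO) << "Print for crash xxx tagPropNameList value: " << name;
  }
  LOG(INFO) << "Print for crash xxx propList size: " << propValues.size();
  for (const auto& value : propValues) {
    LOG(INFO) << "Print for crash xxx propList value: " << value;
  }
}

int main() {
  // Example call; without InitGoogleLogging the output goes to stderr.
  dumpForCrash({"JZXPFFHYJXPXUPJZWJJC", "_tag"}, {"v1", "v2"});
}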

Print for crash xxx tagPropNameList size: 2
Print for crash xxx tagPropNameList value: JZXPFFHYJXPXUPJZWJJC
Print for crash xxx tagPropNameList value: _tag
Print for crash xxx propList size: 8
Print for crash xxx propList value: "EoGxE3QlrScIeHSPamsFIDauw5ZBHW3s80b5REA2uWfXBJExRlSM1e2NWYFM2zMtJmWK0895F99P4bzFtlEaFLBdqODB2RDvCsRo"
Print for crash xxx propList value: "908318010425663232"
Print for crash xxx propList value: "1"
Print for crash xxx propList value: "SEYJVNWKJUTPTIEAMLMJA,SEYJVNWKJUTPTIEAMLMJB,"
Print for crash xxx propList value: "SEYJVNWKJUTPTIEAMLMJ"
Print for crash xxx propList value: "graphServiceEntity_GRAPH_875146890205947392"
Print for crash xxx propList value: NULL
Print for crash xxx propList value: 96310

Print for crash xxx tagPropNameList size: 2
Print for crash xxx tagPropNameList value: BJAGLASCDJIFBCSZFHYH
Print for crash xxx tagPropNameList value: _tag
Print for crash xxx propList size: 2
Print for crash xxx propList value: "cqwvfewxpfvkxewqdi"
Print for crash xxx propList value: 164179

Print for crash xxx tagPropNameList size: 2
Print for crash xxx tagPropNameList value: QYQCYTIOVJAIBZURBPEV
Print for crash xxx tagPropNameList value: _tag
Print for crash xxx propList size: 2
Print for crash xxx propList value: "LZDLCKKLMYFAZRRVLUIS"
Print for crash xxx propList value: 164185

Print for crash xxx tagPropNameList size: 2
Print for crash xxx tagPropNameList value: JZXPFFHYJXPXUPJZWJJC
Print for crash xxx tagPropNameList value: _tag
Print for crash xxx propList size: 8
Print for crash xxx propList value: "EoGxE3QlrScIeHSPamsFIDauw5ZBHW3s80b5REA2uWfXBJExRlSM1e2NWYFM2zMtJmWK0895F99P4bzFtlEaFLBdqODB2RDvCsRo"
Print for crash xxx propList value: "908318010425663232"
Print for crash xxx propList value: "1"
Print for crash xxx propList value: "SEYJVNWKJUTPTIEAMLMJA,SEYJVNWKJUTPTIEAMLMJB,"
Print for crash xxx propList value: "SEYJVNWKJUTPTIEAMLMJ"
Print for crash xxx propList value: "graphServiceEntity_GRAPH_875146890205947392"
Print for crash xxx propList value: NULL
Print for crash xxx propList value: 96310

Print for crash xxx tagPropNameList size: 2
Print for crash xxx tagPropNameList value: BJAGLASCDJIFBCSZFHYH
Print for crash xxx tagPropNameList value: _tag
Print for crash xxx propList size: 2
Print for crash xxx propList value: "cqwvfewxpfvkxewqdi"
Print for crash xxx propList value: 164179

Print for crash xxx tagPropNameList size: 2
Print for crash xxx tagPropNameList value: QYQCYTIOVJAIBZURBPEV
Print for crash xxx tagPropNameList value: _tag
Print for crash xxx propList size: 2
Print for crash xxx propList value: "LZDLCKKLMYFAZRRVLUIS"
Print for crash xxx propList value: 164185

vertex: ("908318010425663232") Tag: DCIJPMUWZPRLJXAUFUOW1, JZXPFFHYJXPXUPJZWJJC:"EoGxE3QlrScIeHSPamsFIDauw5ZBHW3s80b5REA2uWfXBJExRlSM1e2NWYFM2zMtJmWK0895F99P4bzFtlEaFLBdqODB2RDvCsRo"Tag: MWVJTWTLCWQDVJKELCBP1, BJAGLASCDJIFBCSZFHYH:"cqwvfewxpfvkxewqdi"Tag: IOOHWGWJWUROGATHQQGD1, QYQCYTIOVJAIBZURBPEV:"LZDLCKKLMYFAZRRVLUIS" size: 1

The eight values at the end got matched to a different Tag. I'm not sure which step failed to handle this correctly. If you have any ideas, please point me in the right direction. Thanks!

desc tag GRAPH_LABEL_DEFAULT;
+----------------------+----------+-------+---------+---------+
| Field                | Type     | Null  | Default | Comment |
+----------------------+----------+-------+---------+---------+
| "Description"        | "string" | "YES" |         |         |
| "SourceId"           | "string" | "YES" |         |         |
| "Schema"             | "string" | "YES" |         |         |
| "ParentOntologyTags" | "string" | "YES" |         |         |
| "Tag"                | "string" | "YES" |         |         |
| "GraphLabel"         | "string" | "YES" |         |         |
| "AllUpOntologyLabel" | "string" | "YES" |         |         |
+----------------------+----------+-------+---------+---------+

It may be that in currentDs_'s tagPropsMap, the currentRow index recorded via colIdx in the corresponding PropIndex is wrong, so the index falls back to the default of 0. Since every vertex carries the tag shown above, the index matching then goes wrong. Where is this logic implemented? A sketch of what I mean is below.
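
To illustrate the suspicion (the names tagPropsMap, PropIndex and colIdx come from the code I was reading; the layout and the fallback-to-0 behaviour below are assumptions for illustration, not the actual implementation):

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical per-tag index: which column of the current row holds this
// tag's values, plus the tag's declared property names.
struct PropIndex {
  size_t colIdx;                       // column of this tag's values in the row
  std::vector<std::string> propList;   // property names declared for this tag
};

int main() {
  // Row layout: column 0 holds GRAPH_LABEL_DEFAULT's 8 values, column 1 holds Tag1's 2 values.
  std::vector<std::vector<std::string>> currentRow = {
      {"desc", "908318010425663232", "1", "a,b,", "t", "g", "", "96310"},  // 8 values
      {"cqwvfewxpfvkxewqdi", "164179"}};                                   // 2 values

  std::unordered_map<std::string, PropIndex> tagPropsMap = {
      {"GRAPH_LABEL_DEFAULT", {0, std::vector<std::string>(8, "prop")}},
      {"Tag1", {1, {"name", "_tag"}}}};

  auto it = tagPropsMap.find("Tag1");
  // If this lookup missed (say, because the schema changed while the query ran)
  // and the code fell back to column 0, Tag1's 2-name propList would be zipped
  // with GRAPH_LABEL_DEFAULT's 8 values: exactly the 2-vs-8 mismatch in the CHECK above.
  size_t colIdx = (it != tagPropsMap.end()) ? it->second.colIdx : 0;
  std::cout << "Tag1 values come from column " << colIdx << " ("
            << currentRow[colIdx].size() << " values)\n";
}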
I'm still experimenting. If you have any good ideas, please comment. Thanks for the team's support.


I think it's done in rctx or one of the other ctx objects; take a look at the code. This feels like a non-trivial bug, so it's worth filing an issue with the team. :partying_face:
