graphd node crashes intermittently on NebulaGraph v3.2

Docker image: vesoft/nebula-graphd:v3.2.0

I can't tell what's wrong yet, so I'll file an issue for now.

Do you have any leads you can share? We can help investigate :grin:

No, but I've decoded the minidump. Pasting it here:

Operating system: Linux
                  0.0.0 Linux 5.4.0-117-generic #132-Ubuntu SMP Thu Jun 2 00:39:06 UTC 2022 x86_64
CPU: amd64
     family 6 model 85 stepping 7
     1 CPU

GPU: UNKNOWN

Crash reason:  SIGSEGV /SEGV_MAPERR
Crash address: 0x1c
Process uptime: not available

Thread 30 (crashed)
 0  nebula-graphd!apache::thrift::Cpp2Connection::stop() + 0x59
    rax = 0x0000000000000000   rdx = 0x0000000002e123c0
    rcx = 0x0000000002cef3f8   rbx = 0x00007f39c4f60260
    rsi = 0x0000000000000000   rdi = 0x00007f39bcbff600
    rbp = 0x00007f39bcbf4f80   rsp = 0x00007f39bcbf4f10
     r8 = 0x0000000000000000    r9 = 0x00007f39bcbf4638
    r10 = 0x00007f39bcbf4630   r11 = 0x0000000000000206
    r12 = 0x00007f39bbe073e8   r13 = 0x00007f39bcbf5001
    r14 = 0x00007f39bbe09110   r15 = 0x00007f39bcbf4f20
    rip = 0x0000000001f2f679
    Found by: given as instruction pointer in context
 1  nebula-graphd!apache::thrift::Cpp2Connection::channelClosed(folly::exception_wrapper&&) + 0x39
    rbx = 0x00007f39c4f10901   rbp = 0x00007f39bcbf5020
    rsp = 0x00007f39bcbf4f90   r12 = 0x00007f39bbe09110
    r13 = 0x00007f39bcbf5030   r14 = 0x00007f39bcbf4fa0
    r15 = 0x00007f39bcbf4fb0   rip = 0x0000000001f31359
    Found by: call frame info

What was the cluster doing when it went down? Is there a way to reproduce the problem reliably?

Normal business requests. I tried every statement and could not reproduce it.

It's precisely because it can't be reproduced that it's hard to pin down.

I checked the error logs printed by nebula-go. There are only errors like io timeout, and no other error logs.

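We haven't found a reliable way to reproduce it yet. The problem shows up roughly once every few days. The current data volume is very small and the service is accessed infrequently.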

If needed, we'll work with you on whatever tests are necessary.

It just crashed again. The key part of the graphd log is below:

E20220916 09:18:42.187779 27 Serializer.h:43] Thrift serialization is only defined for structs and unions, not containers thereof. Attemping to serialize a value of type `nebula::Value`.
I20220916 09:18:42.188601 41 ThriftClientManager-inl.h:67] resolve "nebula-aio-0":9779 as "10.42.1.103":9779
I20220916 09:18:42.194344 40 ThriftClientManager-inl.h:67] resolve "nebula-aio-0":9779 as "10.42.1.103":9779
I20220916 09:18:42.695487 28 GraphService.cpp:76] Authenticating user root from 10.42.0.172:42418
I20220916 09:18:43.207806 31 GraphService.cpp:76] Authenticating user root from 10.42.0.172:42424
E20220916 09:19:30.228691 31 IndexScanRule.cpp:440] No valid index found
E20220916 09:19:39.081180 25 IndexScanRule.cpp:440] No valid index found
E20220916 09:19:51.894157 30 IndexScanRule.cpp:440] No valid index found

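Hello, could you provide some more details, such as:

  1. Cluster configuration: how many storage, graph, and meta nodes
  2. Schema and data size
  3. What operations the cluster was performing (queries or writes), and at what QPS?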

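1. Single-machine deployment; storage, graph, and meta each run as one node.

2. For the test schema and data size, this file can be used to reproduce: https://docs.nebula-graph.io/2.0/basketballplayer-2.X.ngql

3. Running insert, delete, update, and query statements concurrently crashes graphd (reproduces every time).
The error log printed is as follows: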

2022-09-20 15:41:19  file=pool/pool.go:175 level=error session.Execute by sid[7] with NGQL:[UPDATE VERTEX ON player 'player105' SET age = age + 2], err: -1005:Storage Error: More than one request trying to add/update/delete one edge/vertex at the same time.
2022-09-20 15:41:19  file=rbac-crash/main.go:48 level=error -1005:Storage Error: More than one request trying to add/update/delete one edge/vertex at the same time.
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[4] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: read tcp 10.55.23.52:63978->10.55.16.144:9669: i/o timeout
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[10] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: read tcp 10.55.23.52:63984->10.55.16.144:9669: i/o timeout
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[0] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: read tcp 10.55.23.52:63974->10.55.16.144:9669: i/o timeout
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[2] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: read tcp 10.55.23.52:63976->10.55.16.144:9669: i/o timeout
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[15] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: read tcp 10.55.23.52:63989->10.55.16.144:9669: i/o timeout
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[11] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: read tcp 10.55.23.52:63985->10.55.16.144:9669: i/o timeout
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[12] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: EOF
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[9] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: EOF
2022-09-20 15:41:19  file=pool/pool.go:175 level=error session.Execute by sid[3] with NGQL:[UPDATE VERTEX ON player 'player105' SET age = age + 2], err: EOF
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[6] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: EOF
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[13] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: EOF
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[5] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: EOF
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[1] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: EOF
2022-09-20 15:41:19  file=rbac-crash/main.go:48 level=error EOF
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[14] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: write tcp 10.55.23.52:63988->10.55.16.144:9669: wsasend: An existing connection was forcibly closed by the remote host.
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[8] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: read tcp 10.55.23.52:63982->10.55.16.144:9669: wsarecv: An existing connection was forcibly closed by the remote host.
2022-09-20 15:41:20  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:21  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:21  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:22  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:22  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:23  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:23  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:24  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:24  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:25  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:25  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:26  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:26  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:27  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:27  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:28  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:28  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:29  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:30  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:30  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:31  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:31  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:32  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:32  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:33  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:33  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:34  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:34  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:35  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:35  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:36  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:36  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:37  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:37  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:38  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:38  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:39  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:39  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:40  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:40  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:41  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:41  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:42  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:42  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:43  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:43  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:44  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:44  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:45  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:45  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:46  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:46  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:47  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:47  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:48  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
2022-09-20 15:41:48  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout
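
To see the failure sequence in a client log like the one above at a glance, the lines can be tallied by error kind. Below is a minimal Go sketch, not part of the original post: `classify`, `tally`, and the category names are made up for illustration, and the match strings are simply copied from the log above.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// classify maps one client log line to a coarse error kind.
// The categories are illustrative; the match strings come from the log above.
func classify(line string) string {
	switch {
	case strings.Contains(line, "-1005:Storage Error"):
		return "write-conflict"
	case strings.Contains(line, "dial tcp"):
		// Checked before the generic timeout: dial failures also end in "i/o timeout".
		return "dial-timeout"
	case strings.Contains(line, "i/o timeout"):
		return "read-timeout"
	case strings.Contains(line, "forcibly closed"):
		return "conn-reset"
	case strings.HasSuffix(strings.TrimSpace(line), "EOF"):
		return "eof"
	default:
		return "other"
	}
}

// tally counts error kinds across a multi-line log, skipping blank lines.
func tally(log string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		if line := sc.Text(); strings.TrimSpace(line) != "" {
			counts[classify(line)]++
		}
	}
	return counts
}

func main() {
	// A few lines taken from the log above.
	sample := `2022-09-20 15:41:19  file=rbac-crash/main.go:48 level=error -1005:Storage Error: More than one request trying to add/update/delete one edge/vertex at the same time.
2022-09-20 15:41:19  file=pool/pool.go:195 level=error session.ExecuteJson by sid[4] with NGQL:[LOOKUP ON player YIELD id(vertex) AS VertexID;], err: read tcp 10.55.23.52:63978->10.55.16.144:9669: i/o timeout
2022-09-20 15:41:19  file=rbac-crash/main.go:48 level=error EOF
2022-09-20 15:41:20  file=pool/pool.go:54 level=warn open graph conn failed with host: 10.55.16.144, port: 9669, dial tcp 10.55.16.144:9669: i/o timeout`
	for kind, n := range tally(sample) {
		fmt.Println(kind, n)
	}
}
```

Run over the full log, this makes the order of events easier to read: one write conflict, then read timeouts and EOFs on the open sessions, then repeated dial timeouts once graphd stops accepting connections.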

Problem solved.
It was caused by our choice of Go driver. After I switched to nebula-go, the problem went away.
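
For anyone who wants to check whether their own driver behaves the same way, the concurrent workload described earlier in the thread can be approximated with a small harness. This is only a sketch under assumptions: `Executor` is a hypothetical interface, not the nebula-go API, and `fakeExec` is a stand-in so the code runs without a cluster. To test a real driver, wrap its session's execute call in a type that satisfies `Executor`. The two statements and the 16 workers mirror the client log above (sid[0] through sid[15]).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"sync"
)

// Executor is a hypothetical stand-in for a driver session.
// It is not part of any real driver; adapt your session type to it.
type Executor interface {
	Execute(stmt string) error
}

// runConcurrent starts `workers` goroutines, each with its own Executor,
// runs every statement on each of them, and returns failures per error message.
func runConcurrent(newExec func(id int) Executor, workers int, stmts []string) map[string]int {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		failed = make(map[string]int)
	)
	for id := 0; id < workers; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			exec := newExec(id)
			for _, s := range stmts {
				if err := exec.Execute(s); err != nil {
					mu.Lock()
					failed[err.Error()]++
					mu.Unlock()
				}
			}
		}(id)
	}
	wg.Wait()
	return failed
}

// fakeExec rejects UPDATE statements so the harness can run without a cluster.
type fakeExec struct{}

func (fakeExec) Execute(stmt string) error {
	if strings.HasPrefix(stmt, "UPDATE") {
		return errors.New("simulated conflict")
	}
	return nil
}

func main() {
	stmts := []string{
		"LOOKUP ON player YIELD id(vertex) AS VertexID;",
		"UPDATE VERTEX ON player 'player105' SET age = age + 2",
	}
	failed := runConcurrent(func(int) Executor { return fakeExec{} }, 16, stmts)
	fmt.Println(failed)
}
```

Giving each goroutine its own `Executor` keeps one session per worker, which matches how the session IDs appear in the log. Whether a given driver is safe to share across goroutines is something to check in that driver's documentation.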

thanks
