A MATCH query fails with Storage Error: part: 20, error: E_RPC_FAILURE(-3).

Here is part of the storaged.INFO log. After the restart I ran BALANCE LEADER first; the query was executed at roughly 14:25.

I0702 14:23:45.162096 69540 Host.cpp:149] [Port: 9780, Space: 60, Part: 10] [Host: 10.142.158.75:9780] This is the first time to send the logs to this host, lastLogIdSent = 15907614, lastLogTermSent = 31
I0702 14:23:45.162003 69587 RaftPart.cpp:1248] [Port: 9780, Space: 60, Part: 6] The partition is elected as the leader
I0702 14:23:45.162073 69581 RaftPart.cpp:1152] [Port: 9780, Space: 60, Part: 17] Partition is elected as the new leader for term 49
I0702 14:23:45.162195 69541 Host.cpp:149] [Port: 9780, Space: 60, Part: 9] [Host: 10.142.158.76:9780] This is the first time to send the logs to this host, lastLogIdSent = 15838164, lastLogTermSent = 41
I0702 14:23:45.162263 69581 RaftPart.cpp:1248] [Port: 9780, Space: 60, Part: 17] The partition is elected as the leader
I0702 14:23:45.162279 69542 Host.cpp:149] [Port: 9780, Space: 60, Part: 6] [Host: 10.142.158.77:9780] This is the first time to send the logs to this host, lastLogIdSent = 15776524, lastLogTermSent = 44
I0702 14:23:45.162279 69587 Part.cpp:191] [Port: 9780, Space: 60, Part: 1] Find the new leader "10.142.158.76":9780
I0702 14:23:45.162333 69561 Host.cpp:149] [Port: 9780, Space: 60, Part: 17] [Host: 10.142.158.76:9780] This is the first time to send the logs to this host, lastLogIdSent = 15707326, lastLogTermSent = 48
I0702 14:23:45.162320 69541 Host.cpp:149] [Port: 9780, Space: 60, Part: 9] [Host: 10.142.158.77:9780] This is the first time to send the logs to this host, lastLogIdSent = 15838164, lastLogTermSent = 41
I0702 14:23:45.162345 69482 RaftPart.cpp:422] [Port: 9780, Space: 32, Part: 2] Commit transfer leader to "10.142.158.78":9780
I0702 14:23:45.162400 69482 RaftPart.cpp:436] [Port: 9780, Space: 32, Part: 2] I am already the leader!
I0702 14:23:45.162351 69542 Host.cpp:149] [Port: 9780, Space: 60, Part: 6] [Host: 10.142.158.75:9780] This is the first time to send the logs to this host, lastLogIdSent = 15776524, lastLogTermSent = 44
I0702 14:23:45.162423 69561 Host.cpp:149] [Port: 9780, Space: 60, Part: 17] [Host: 10.142.158.77:9780] This is the first time to send the logs to this host, lastLogIdSent = 15707326, lastLogTermSent = 48
I0702 14:23:45.162468 69482 RaftPart.cpp:422] [Port: 9780, Space: 60, Part: 10] Commit transfer leader to "10.142.158.78":9780
I0702 14:23:45.162483 69482 RaftPart.cpp:436] [Port: 9780, Space: 60, Part: 10] I am already the leader!
I0702 14:23:45.162629 69482 RaftPart.cpp:422] [Port: 9780, Space: 60, Part: 17] Commit transfer leader to "10.142.158.78":9780
I0702 14:23:45.162639 69482 RaftPart.cpp:436] [Port: 9780, Space: 60, Part: 17] I am already the leader!
I0702 14:23:45.162648 69483 RaftPart.cpp:422] [Port: 9780, Space: 60, Part: 9] Commit transfer leader to "10.142.158.78":9780
I0702 14:23:45.162667 69483 RaftPart.cpp:436] [Port: 9780, Space: 60, Part: 9] I am already the leader!
I0702 14:23:45.162659 69481 RaftPart.cpp:422] [Port: 9780, Space: 60, Part: 6] Commit transfer leader to "10.142.158.78":9780
I0702 14:23:45.162694 69481 RaftPart.cpp:436] [Port: 9780, Space: 60, Part: 6] I am already the leader!
I0702 14:23:58.640892 69481 RaftPart.cpp:422] [Port: 9780, Space: 32, Part: 1] Commit transfer leader to "10.142.158.76":9780
I0702 14:23:58.640935 69481 RaftPart.cpp:442] [Port: 9780, Space: 32, Part: 1] I am Follower, just wait for the new leader!
I0702 14:23:58.955364 69481 RaftPart.cpp:422] [Port: 9780, Space: 60, Part: 1] Commit transfer leader to "10.142.158.76":9780
I0702 14:23:58.955391 69481 RaftPart.cpp:442] [Port: 9780, Space: 60, Part: 1] I am Follower, just wait for the new leader!
I0702 14:23:59.421893 69481 RaftPart.cpp:422] [Port: 9780, Space: 32, Part: 3] Commit transfer leader to "10.142.158.76":9780
I0702 14:23:59.421919 69481 RaftPart.cpp:442] [Port: 9780, Space: 32, Part: 3] I am Follower, just wait for the new leader!
I0702 14:33:06.375084 69585 FileBasedWal.cpp:738] [Port: 9780, Space: 60, Part: 15] Clean wals number 1
I0702 14:33:06.381273 69585 FileBasedWal.cpp:738] [Port: 9780, Space: 60, Part: 7] Clean wals number 1
I0702 14:33:06.386870 69585 FileBasedWal.cpp:738] [Port: 9780, Space: 60, Part: 2] Clean wals number 1
I0702 14:33:06.389310 69585 FileBasedWal.cpp:738] [Port: 9780, Space: 60, Part: 3] Clean wals number 1
I0702 14:33:06.395366 69585 FileBasedWal.cpp:738] [Port: 9780, Space: 60, Part: 10] Clean wals number 1
I0702 14:33:06.399804 69585 FileBasedWal.cpp:738] [Port: 9780, Space: 60, Part: 17] Clean wals number 1

There is another symptom after the network error appears: the stop command reports success, but when starting the services again, it turns out graphd and storaged are still running.

[root@A5-306-HW-2488HV5-2019-011 ~]# /usr/local/nebula/scripts/nebula.service stop all
[INFO] Stopping nebula-metad...
[INFO] Done
[INFO] Stopping nebula-graphd...
[INFO] Done
[INFO] Stopping nebula-storaged...
[INFO] Done
[root@A5-306-HW-2488HV5-2019-011 ~]# /usr/local/nebula/scripts/nebula.service start all
[INFO] Starting nebula-metad...
[INFO] Done
[ERROR] nebula-graphd already running: 69400
[ERROR] nebula-storaged already running: 69436
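
When stop reports Done but start then complains the daemons are already running, the processes most likely never actually exited. Before restarting, it is worth checking for survivors directly; a minimal sketch using standard Linux tools (not part of the nebula.service script):

# List any nebula daemons that survived "stop all"
pgrep -af 'nebula-(metad|graphd|storaged)'
# If stale processes remain, terminate them explicitly before "start all"
pkill -TERM -f 'nebula-(graphd|storaged)'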

Does the "network error" refer to the "closed network connection" error?

No, it is literally just this:
[screenshot of the error]

Are you using Studio? No wonder I couldn't find this error message in the server-side code :stuck_out_tongue_closed_eyes:

Can you reproduce the error in the console?

How do I reproduce it in the console? Is there a manual I can try? :joy:

Found the manual; let me study it first…

It is probably the data volume: with a large result set, Studio has some trouble transferring the data. For the console, refer to the manual; it can run from any machine that can reach graphd.

@jerry.liang is this network error something that shows up when the data volume is large? Is there any trick to work around it?

Console manual: "Connecting to Nebula Graph" in the Nebula Graph Database manual.
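
For reference, a minimal connection from the v2 nebula-console looks roughly like this (the address is taken from this deployment; user and password are placeholders):

# Connect nebula-console to a graphd instance (sketch; credentials are placeholders)
./nebula-console -addr 10.142.158.78 -port 9669 -u root -p nebula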

A new error appeared.
With the console, the query clearly ran much longer than in Studio, and the Studio-style network error did not show up.
This time it is [ERROR (-8)]: Storage Error: The leader has changed. Try again later
storaged did not crash.

The graphd.ERROR log:

I0702 19:57:00.853134 66602 SwitchSpaceExecutor.cpp:43] Graph switched to `mobile_imei_all', space id: 60
I0702 20:05:06.939954 66446 StorageClientBase.inl:147] Update the leader for [60, 6] to "10.142.158.75":9779
I0702 20:08:23.539999 66447 StorageClientBase.inl:147] Update the leader for [60, 17] to "10.142.158.76":9779
E0702 20:08:23.540454 66603 StorageAccessExecutor.h:35] IndexScanExecutor failed, error E_LEADER_CHANGED, part 17
E0702 20:08:23.540549 66603 StorageAccessExecutor.h:35] IndexScanExecutor failed, error E_LEADER_CHANGED, part 6
E0702 20:08:23.553347 66602 QueryInstance.cpp:103] Storage Error: The leader has changed. Try again later
I0702 20:08:25.569489 66464 StorageClientBase.inl:147] Update the leader for [60, 6] to "10.142.158.75":9779
I0702 20:16:20.541244 66463 StorageClientBase.inl:147] Update the leader for [60, 14] to "10.142.158.77":9779
I0702 20:16:20.541317 66463 StorageClientBase.inl:147] Update the leader for [60, 17] to "10.142.158.76":9779
E0702 20:16:43.967430 66603 StorageAccessExecutor.h:35] GetVerticesExecutor failed, error E_LEADER_CHANGED, part 17
E0702 20:16:43.967499 66603 StorageAccessExecutor.h:35] GetVerticesExecutor failed, error E_LEADER_CHANGED, part 6
E0702 20:16:43.967511 66603 StorageAccessExecutor.h:35] GetVerticesExecutor failed, error E_LEADER_CHANGED, part 14
E0702 20:16:43.967749 66602 QueryInstance.cpp:103] Storage Error: The leader has changed. Try again later

Does the choice of host matter much? I tried 78 as the host and re-ran the query; it ran for more than 50 minutes before the error appeared.

This is my deployment:

This is the log:


I0702 21:01:01.425348  3747 MetaClient.cpp:3053] Load leader of "10.142.158.75":9779 in 5 space
I0702 21:01:01.425400  3747 MetaClient.cpp:3053] Load leader of "10.142.158.76":9779 in 5 space
I0702 21:01:01.425431  3747 MetaClient.cpp:3053] Load leader of "10.142.158.77":9779 in 5 space
I0702 21:01:01.425451  3747 MetaClient.cpp:3053] Load leader of "10.142.158.78":9779 in 5 space
I0702 21:01:01.425457  3747 MetaClient.cpp:3056] Load leader ok
I0702 21:53:53.966553  3592 StorageClientBase.inl:147] Update the leader for [60, 20] to "10.142.158.76":9779
E0702 21:57:50.173939  3746 StorageAccessExecutor.h:35] GetVerticesExecutor failed, error E_LEADER_CHANGED, part 20
E0702 21:57:50.174208  3745 QueryInstance.cpp:103] Storage Error: The leader has changed. Try again later

Found the problem. This time there was a crash: the OOM killer fired (nebula-storaged invoked it, and nebula-graphd was the process killed):

[root@A5-306-HW-2488HV5-2019-011 logs]# dmesg |grep nebula
[1922276.128930] nebula-storaged invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[1922276.128936] nebula-storaged cpuset=/ mems_allowed=0-3
[1922276.128940] CPU: 28 PID: 89277 Comm: nebula-storaged Kdump: loaded Tainted: P           OE  ------------   3.10.0-1127.el7.x86_64 #1
[1922276.129758] [101619]     0 101619   205172     2678      25        8             0 nebula-importer
[1922276.129777] [88273]     0 88273   161795    59560     229     6174             0 nebula-metad
[1922276.129779] [88283]     0 88283 153616529 115912318  258347    43013             0 nebula-graphd
[1922276.129781] [88293]     0 88293 29178434 10089095   48529    16525             0 nebula-storaged
[1922276.129789] Out of memory: Kill process 88283 (nebula-graphd) score 852 or sacrifice child
[1922276.129836] Killed process 88283 (nebula-graphd), UID 0, total-vm:614466116kB, anon-rss:463649272kB, file-rss:0kB, shmem-rss:0kB

For a full-scan OLAP query, running graphd and storaged on different hosts effectively gives each of them more available memory.

We are still working on memory optimizations, including pushing LIMIT down to the storage layer; that work is in progress, and once it lands, adding LIMIT to such a query will help.
In the meantime, can you narrow the query with a WHERE condition in your scenario? A sketch of what that could look like follows below.
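
Only the space name mobile_imei_all comes from the logs above; the tag `device` and property `city` are invented for illustration, and property-reference syntax differs across Nebula versions:

# Hypothetical example: constrain the scan with WHERE and cap the result with LIMIT
./nebula-console -addr 10.142.158.78 -port 9669 -u root -p nebula \
  -e 'USE mobile_imei_all; MATCH (v:device) WHERE v.city == "beijing" RETURN v LIMIT 1000;'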

A question: here too, queries hit Storage Error: The leader has changed. Try again later after importing data.
How is this resolved: run BALANCE LEADER manually, or will Nebula rebalance automatically? And how long does "Try again later" mean in practice?

Please don't post the same question in multiple places; it wastes other people's effort. Follow up in the thread "用spark 导入数据后,查询数据出现 Storage Error: The leader has changed. Try again later" instead.

WHERE can be used to restrict it; let me limit the data volume and see. Does a full scan consume that much memory? Is there any parameter to cap memory usage? It felt like the machine's memory was not even full when the OOM happened.

I ran BALANCE LEADER manually. Does "the leader has changed" appear as soon as you query? That state usually lasts at most one or two heartbeat cycles. In my case, the query ran for about an hour before this error was reported.
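
For reference, rebalancing and then checking the per-host leader distribution can be done from the console in one go (a sketch; run against any reachable graphd):

# Rebalance raft leaders, then inspect the "Leader count" column per host
./nebula-console -addr 10.142.158.78 -port 9669 -u root -p nebula \
  -e 'BALANCE LEADER; SHOW HOSTS;'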

For a full scan, until the optimization is finished, all vertex data is fetched and returned to the query engine.
By comparison, LOOKUP does not return all the properties, so it saves some memory, but your query does not seem expressible with LOOKUP. A sketch of the LOOKUP form follows below.
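
A hypothetical LOOKUP sketch for comparison (the tag `device` and property `city` are invented, and an index on the filtered property must exist):

# Hypothetical LOOKUP: goes through an index and returns only the yielded property
./nebula-console -addr 10.142.158.78 -port 9669 -u root -p nebula \
  -e 'USE mobile_imei_all; LOOKUP ON device WHERE device.city == "beijing" YIELD device.city;'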

For storage reads, the memory knobs you can tune are mainly the block cache and the bloom filter; see the RocksDB docs: https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
Related thread:
nebula-storage 内存与性能的问题 - #2, by valseek
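
A sketch of what tuning the block cache can look like in nebula-storaged.conf (the value is illustrative, not a recommendation):

# nebula-storaged.conf (illustrative value)
# RocksDB block cache size in MB; trades memory for read throughput
--rocksdb_block_cache=10240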

I set --rocksdb_block_cache to nearly all of the machine's memory and hit the following error. None of the three service logs recorded anything, and memory was indeed almost exhausted. Is this also memory-related?

2021/07/09 15:36:22 [ERROR] Failed to reconnect, Failed to open transport, error: dial tcp 10.142.158.78:9669: connect: connection refused
2021/07/09 15:36:22 Loop error, Failed to open transport, error: dial tcp 10.142.158.78:9669: connect: connection refused

Bye root!
Fri, 09 Jul 2021 15:36:22 CST

panic: Loop error, Failed to open transport, error: dial tcp 10.142.158.78:9669: connect: connection refused

goroutine 1 [running]:
log.Panicf(0x646161, 0xe, 0xc00014de78, 0x1, 0x1)
        /opt/hostedtoolcache/go/1.16.4/x64/src/log/log.go:361 +0xc5
main.main()
        /home/runner/work/nebula-console/nebula-console/main.go:419 +0x4f3
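
"connection refused" means nothing is listening on graphd's port anymore, i.e. the graphd process itself is most likely gone rather than the query merely failing. A quick sketch to confirm with standard tools:

# Is anything still listening on the graphd port?
ss -ltnp | grep 9669
# Was a nebula process OOM-killed again?
dmesg | grep -iE 'oom|nebula'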
