MATCH query fails with "Failed to open transport"

nebula version: 2.0.1
Deployment: distributed, 4 nodes
Production environment:
Hardware
Disk: HDD
CPU, memory: 512 GB

Data size: 1 billion+ vertices

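I ran a MATCH query through the console. After about 60 minutes it failed with the error below. storaged did not crash, and none of the three logs contain anything related.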

2021/07/12 10:29:06 [ERROR] Failed to reconnect, Failed to open transport, error: dial tcp 10.142.158.78:9669: connect: connection refused
2021/07/12 10:29:06 Loop error, Failed to open transport, error: dial tcp 10.142.158.78:9669: connect: connection refused

Bye root!
Mon, 12 Jul 2021 10:29:06 CST

panic: Loop error, Failed to open transport, error: dial tcp 10.142.158.78:9669: connect: connection refused

goroutine 1 [running]:
log.Panicf(0x646161, 0xe, 0xc00018de78, 0x1, 0x1)
        /opt/hostedtoolcache/go/1.16.4/x64/src/log/log.go:361 +0xc5
main.main()
        /home/runner/work/nebula-console/nebula-console/main.go:419 +0x4f3

Is graphd still running after the query? From the error message it looks like it went down.

How do I check that?

Our front-end bastion host drops sessions after a while, so the job was run from crontab.

Run ./scripts/nebula.service status all from the Nebula installation directory.

Yes, it is indeed gone.

[root@A5-306-HW-2488HV5-2019-011 nebula]# ./scripts/nebula.service status all
[INFO] nebula-metad: Running as 2069, Listening on 9559
[INFO] nebula-graphd: Exited
[INFO] nebula-storaged: Running as 2207, Listening on 9779
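
Since the query runs unattended from crontab, a small check like the one below could flag a dead service automatically. This is only a sketch: the line format is taken from the status output pasted above and may differ in other Nebula versions, and exited_services is a hypothetical helper name, not part of Nebula.

```python
import re

# Matches status lines as they appear in this thread, for example:
#   [INFO] nebula-graphd: Exited
#   [INFO] nebula-metad: Running as 2069, Listening on 9559
# Assumption: other Nebula versions may format these lines differently.
STATUS_LINE = re.compile(r"^\[INFO\]\s+(nebula-\w+):\s+(\w+)")


def exited_services(status_output):
    """Return the names of services whose reported state is not 'Running'."""
    down = []
    for line in status_output.splitlines():
        m = STATUS_LINE.match(line.strip())
        if m and m.group(2) != "Running":
            down.append(m.group(1))
    return down


sample = """\
[INFO] nebula-metad: Running as 2069, Listening on 9559
[INFO] nebula-graphd: Exited
[INFO] nebula-storaged: Running as 2207, Listening on 9779
"""
print(exited_services(sample))  # ['nebula-graphd']
```

In a cron job, the text passed to exited_services would be the captured output of ./scripts/nebula.service status all.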

Did none of the graph, meta, or storage logs record anything? Take a look at logs/nebula-graphd.ERROR.


Nothing new in graphd.ERROR; it still only has the entries from last time (07/09).

[root@A5-306-HW-2488HV5-2019-011 logs]# more nebula-graphd.ERROR
Log file created at: 2021/07/09 17:34:22
Running on machine: A5-306-HW-2488HV5-2019-011
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0709 17:34:22.828738 56852 StorageAccessExecutor.h:35] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 14
E0709 17:34:22.844535 56851 QueryInstance.cpp:103] Storage Error: The leader has changed. Try again later

The storage log does have entries:

[root@A5-306-HW-2488HV5-2019-011 logs]# tail -5  nebula-storaged.ERROR
E0712 08:56:49.182061  2539 RaftPart.cpp:1143] [Port: 9780, Space: 60, Part: 7] Receive response about askForVote from "10.142.158.76":9780, error code is -6
E0712 08:56:50.597052  2540 RaftPart.cpp:1143] [Port: 9780, Space: 60, Part: 7] Receive response about askForVote from "10.142.158.75":9780, error code is -6
E0712 08:56:50.597088  2540 RaftPart.cpp:1143] [Port: 9780, Space: 60, Part: 7] Receive response about askForVote from "10.142.158.76":9780, error code is -6
E0712 08:56:51.272409  2539 RaftPart.cpp:1143] [Port: 9780, Space: 60, Part: 7] Receive response about askForVote from "10.142.158.76":9780, error code is -6
E0712 08:56:51.272452  2539 RaftPart.cpp:1143] [Port: 9780, Space: 60, Part: 7] Receive response about askForVote from "10.142.158.75":9780, error code is -6

The meta log:

[root@A5-306-HW-2488HV5-2019-011 logs]# tail -5 nebula-metad.ERROR
E0712 17:27:53.744719  2123 RaftPart.cpp:1143] [Port: 9560, Space: 0, Part: 0] Receive response about askForVote from "10.142.158.75":9560, error code is -6
E0712 17:27:53.744740  2123 RaftPart.cpp:1143] [Port: 9560, Space: 0, Part: 0] Receive response about askForVote from "10.142.158.76":9560, error code is -6
E0712 17:27:55.284801  2124 RaftPart.cpp:1143] [Port: 9560, Space: 0, Part: 0] Receive response about askForVote from "10.142.158.76":9560, error code is -6
E0712 17:27:55.284832  2124 RaftPart.cpp:1143] [Port: 9560, Space: 0, Part: 0] Receive response about askForVote from "10.142.158.77":9560, error code is -6
E0712 17:27:55.284852  2124 RaftPart.cpp:1143] [Port: 9560, Space: 0, Part: 0] Receive response about askForVote from "10.142.158.75":9560, error code is -6

Was a core file generated after graphd went down? If so, please post the backtrace. If not, it was probably an OOM caused by the large data volume.

There is no core file, so it looks like an OOM after all…

Run dmesg | grep nebula to confirm whether it was an OOM.
You can also upload the complete logs as files.

It was indeed an OOM. Are the log entries above from a core dump?

[2933480.112862] [90061]     0 90061   205172        0      26     2688             0 nebula-importer
[2933480.112865] [90062]     0 90062   986430    12593     156     1715             0 nebula-http-gat
[2933480.112879] [70047]     0 70047   161795    64628     227        0             0 nebula-metad
[2933480.112881] [70137]     0 70137 139689871 123725387  241964      181             0 nebula-graphd
[2933480.112884] [70158]     0 70158  2904793  2209397    4625        0             0 nebula-storaged
[2933480.112896] [83148]     0 83148   176710     2145      14        0             0 nebula-console
[2933480.112899] Out of memory: Kill process 70137 (nebula-graphd) score 909 or sacrifice child
[2933480.113239] Killed process 70137 (nebula-graphd), UID 0, total-vm:558759484kB, anon-rss:494901808kB, file-rss:0kB, shmem-rss:0kB
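
To put the numbers in the kill line into perspective, the sketch below converts them to GiB. It assumes 4 KiB pages for the per-process table (consistent here, since graphd's total_vm of 139689871 pages times 4 equals the reported total-vm of 558759484 kB); parse_kill_line is a hypothetical helper name, and the regex only targets the line format pasted above.

```python
import re

PAGE_KB = 4  # assumption: 4 KiB pages, the usual x86-64 default

# Targets the "Killed process" line exactly as pasted in this thread.
KILL_LINE = re.compile(
    r"Killed process (\d+) \(([^)]+)\).*?total-vm:(\d+)kB, anon-rss:(\d+)kB"
)


def parse_kill_line(line):
    """Extract pid, process name, and memory sizes (GiB) from an OOM kill line."""
    m = KILL_LINE.search(line)
    if m is None:
        return None
    pid, name, total_vm_kb, anon_rss_kb = m.groups()
    return {
        "pid": int(pid),
        "name": name,
        "total_vm_gib": int(total_vm_kb) / 1024**2,
        "anon_rss_gib": int(anon_rss_kb) / 1024**2,
    }


line = (
    "[2933480.113239] Killed process 70137 (nebula-graphd), UID 0, "
    "total-vm:558759484kB, anon-rss:494901808kB, file-rss:0kB, shmem-rss:0kB"
)
info = parse_kill_line(line)
print(info["name"], round(info["anon_rss_gib"]), "GiB resident")
# nebula-graphd 472 GiB resident
```

So graphd alone was holding roughly 472 GiB of resident memory on a machine reported as having 512 GB, which matches the kernel's decision to kill it.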

Those logs show Raft error messages, not a core dump. Finding the specific cause requires the complete logs.

Normally, when graphd cores, a core file is generated in the build directory, and the core information is also written to the .ERROR log.
