storaged nodes crash frequently

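  • nebula version: 2.0.0
  • Deployment: distributed
  • Production environment: Y / N
  • Hardware
    • Disk (SSD recommended)
    • CPU and memory
  • Problem description
    The storaged nodes in my nebula cluster have been crashing frequently. So far I have only started the cluster and have not imported any data.
    A node usually goes down some time after startup; the longest one stayed up was 2 days.
    I checked the dmesg log and found no OOM.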

Errors from nebula-storaged.ERROR:

E0802 11:38:29.493543  8862 ThriftClientManager.inl:39] Transport is closed by peers 0x7fed55f27a90 for host: "172.19.143.226":9780
E0802 11:38:29.493543  8849 ThriftClientManager.inl:39] Transport is closed by peers 0x7fed57b27510 for host: "172.19.143.226":9780
E0802 11:38:29.493569  8892 ThriftClientManager.inl:39] Transport is closed by peers 0x7fed53c29710 for host: "172.19.143.226":9780
E0802 11:38:29.493593  8835 ThriftClientManager.inl:39] Transport is closed by peers 0x7fed58a27e10 for host: "172.19.143.226":9780
E0802 11:38:29.493367  8870 ThriftClientManager.inl:39] Transport is closed by peers 0x7fed54727710 for host: "172.19.143.226":9780
E0802 11:38:39.502755  8835 ThriftClientManager.inl:33] Invalid Channel: 0x7fed58a0b000 for host: "172.19.143.226":9780
E0802 11:38:39.502797  8839 ThriftClientManager.inl:33] Invalid Channel: 0x7fed5880c500 for host: "172.19.143.226":9780
E0802 11:38:39.503342  8842 ThriftClientManager.inl:39] Transport is closed by peers 0x7fed58228190 for host: "172.19.143.226":9780
E0802 11:38:39.503338  8849 ThriftClientManager.inl:33] Invalid Channel: 0x7fed57b0b200 for host: "172.19.143.226":9780
E0802 11:38:39.503346  8858 ThriftClientManager.inl:33] Invalid Channel: 0x7fed5690af00 for host: "172.19.143.226":9780
E0802 11:38:39.503351  8855 ThriftClientManager.inl:33] Invalid Channel: 0x7fed56b0b200 for host: "172.19.143.226":9780
E0802 11:38:39.503338  8852 ThriftClientManager.inl:33] Invalid Channel: 0x7fed5730b200 for host: "172.19.143.226":9780
E0802 11:38:39.503407  8845 ThriftClientManager.inl:33] Invalid Channel: 0x7fed57e0c200 for host: "172.19.143.226":9780

Could you take a look?
The core file is too large to upload.

Is there an IP conflict in the system? The log shows RPC connection errors.

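I captured part of the core file output, but I can't post it. It looks like the core output is being recognized as URLs,
and I'm told a post can contain at most 2 links.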

I just found a different error in the log:
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0802 18:45:26.468462 20323 Configuration.cpp:75] json parse error on line 0 near `true"","max_byte': expected '}'
My initial guess is that this is the same issue as the one in another, linked topic.
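
For anyone sorting through a large nebula-storaged.ERROR file, here is a minimal Python sketch that parses the log line format quoted above and counts entries per source location. The function names are my own, and the sample lines are copied from this thread; treat it as a starting point, not an official tool.

```python
import re
from collections import Counter

# Header layout: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<mmdd>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] "
    r"(?P<msg>.*)$"
)


def parse_glog_line(line):
    """Return the header fields of one log line as a dict, or None if it does not match."""
    m = GLOG_LINE.match(line.strip())
    return m.groupdict() if m else None


def count_by_source(lines):
    """Count log entries per (file, line) source location."""
    counts = Counter()
    for line in lines:
        rec = parse_glog_line(line)
        if rec:
            counts[(rec["file"], int(rec["line"]))] += 1
    return counts


# Sample lines taken from the first post of this thread.
sample = [
    'E0802 11:38:29.493543  8862 ThriftClientManager.inl:39] Transport is closed by peers 0x7fed55f27a90 for host: "172.19.143.226":9780',
    'E0802 11:38:39.502755  8835 ThriftClientManager.inl:33] Invalid Channel: 0x7fed58a0b000 for host: "172.19.143.226":9780',
    'E0802 11:38:39.502797  8839 ThriftClientManager.inl:33] Invalid Channel: 0x7fed5880c500 for host: "172.19.143.226":9780',
]

for (source_file, source_line), n in count_by_source(sample).most_common():
    print(source_file, source_line, n)
```

Grouping by source location makes it easier to see whether a file holds only a couple of distinct errors repeated many times, as it does here.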

Could the configuration file be malformed somewhere, causing a URL parsing error?

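You can post logs with markdown code block syntax: wrap them in ``` at the start and end, and they won't be recognized as URLs. See how I edited post #1 for you.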

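It is most likely a JSON error. The configuration values shown in studio are wrapped in quotes; when you update the configuration, you need to remove those quotes.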
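
To illustrate why the extra quotes matter, here is a small Python sketch of the general mechanism. It assumes the value gets quoted a second time when the JSON text is assembled; that is my assumption, not something confirmed from the nebula source. The key name `some_option` and both helper functions are hypothetical.

```python
import json


def strip_wrapping_quotes(value):
    """Remove one pair of surrounding double quotes, if present."""
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return value[1:-1]
    return value


def build_json(options):
    """Assemble a JSON object by plain concatenation, quoting every value once."""
    body = ",".join('"%s":"%s"' % (k, v) for k, v in options.items())
    return "{" + body + "}"


# Value copied together with its display quotes (hypothetical key name).
displayed = {"some_option": '"true"'}

broken = build_json(displayed)  # the value ends up quoted twice
try:
    json.loads(broken)
    broken_ok = True
except json.JSONDecodeError as err:
    broken_ok = False
    print("parse failed:", err)

# Removing the display quotes first gives valid JSON.
cleaned = {k: strip_wrapping_quotes(v) for k, v in displayed.items()}
fixed = json.loads(build_json(cleaned))
print(fixed)
```

The doubly quoted value closes the string early, so the parser hits unexpected text where it wants a `,` or `}`, which is the same kind of complaint as the error above.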

Could you tell me where to look up what each error code means?
Another storaged service went down this morning, so I looked at the error log:

E0803 10:23:41.481458 22047 RaftPart.cpp:1143] [Port: 9780, Space: 61, Part: 10] Receive response about askForVote from "172.19.143.224":9780, error code is -11
E0803 10:23:41.483978 22047 RaftPart.cpp:1143] [Port: 9780, Space: 61, Part: 10] Receive response about askForVote from "172.19.143.227":9780, error code is -5
E0803 10:23:41.673928 22046 RaftPart.cpp:1143] [Port: 9780, Space: 61, Part: 14] Receive response about askForVote from "172.19.143.224":9780, error code is -11
E0803 10:23:41.673982 22046 RaftPart.cpp:1143] [Port: 9780, Space: 61, Part: 14] Receive response about askForVote from "172.19.143.227":9780, error code is -5

I don't know what error codes -5 and -11 mean.
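
While waiting for the code definitions, it can help to see which peer returns which code. Below is a rough Python sketch that tallies the askForVote responses; the regular expression is written against the lines pasted above (it accepts straight or curly quotes around the IP), and the function name is my own.

```python
import re
from collections import Counter

# Matches the "[... Space: S, Part: P] Receive response about askForVote ..." lines.
VOTE_LINE = re.compile(
    r'Space: (?P<space>\d+), Part: (?P<part>\d+)\] '
    r'Receive response about askForVote from '
    r'["“”](?P<peer>[\d.]+)["“”]:(?P<port>\d+), '
    r'error code is (?P<code>-?\d+)'
)


def tally_vote_errors(lines):
    """Count askForVote error responses per (peer IP, error code)."""
    counts = Counter()
    for line in lines:
        m = VOTE_LINE.search(line)
        if m:
            counts[(m.group("peer"), int(m.group("code")))] += 1
    return counts


# Sample lines copied from the log above.
sample = [
    'E0803 10:23:41.481458 22047 RaftPart.cpp:1143] [Port: 9780, Space: 61, Part: 10] Receive response about askForVote from "172.19.143.224":9780, error code is -11',
    'E0803 10:23:41.483978 22047 RaftPart.cpp:1143] [Port: 9780, Space: 61, Part: 10] Receive response about askForVote from "172.19.143.227":9780, error code is -5',
    'E0803 10:23:41.673928 22046 RaftPart.cpp:1143] [Port: 9780, Space: 61, Part: 14] Receive response about askForVote from "172.19.143.224":9780, error code is -11',
    'E0803 10:23:41.673982 22046 RaftPart.cpp:1143] [Port: 9780, Space: 61, Part: 14] Receive response about askForVote from "172.19.143.227":9780, error code is -5',
]

counts = tally_vote_errors(sample)
for (peer, code), n in sorted(counts.items()):
    print(peer, code, n)
```

In this sample each peer consistently returns a single code, which suggests the cause is tied to the state of that node and not to a particular part.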

Sorry about that. We are working on making the errors in the log more human readable. For now, you can refer to the definitions in this file in common.

Also, after my storaged crashed, I dropped the entire graph space. I checked the disk: of the 900 GB of data, 500 GB is still there.
What is this 500 GB of data?

OK, I'll take a look. Thanks.

The data is only physically removed from disk after a compaction.
update: see pandasheeps' reply below

After dropping the graph space,
I also ran the SUBMIT JOB COMPACT; command,

but 500 GB is still in use.

You don't need to run compact.
Set the auto_remove_invalid_space parameter to true, then restart storage.
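
If the change has to be applied on several machines, something like the following Python sketch could set the flag in the configuration text. It assumes the storaged configuration file uses one `--name=value` line per flag; the helper `set_flag` and the sample contents are hypothetical, and the actual file path is not given in this thread.

```python
def set_flag(conf_text, name, value):
    """Return conf_text with --name=value set, replacing an existing line or appending one."""
    prefix = "--" + name + "="
    new_line = prefix + value
    out = []
    found = False
    for line in conf_text.splitlines():
        if line.strip().startswith(prefix):
            out.append(new_line)  # overwrite the existing setting
            found = True
        else:
            out.append(line)  # keep comments and other flags untouched
    if not found:
        out.append(new_line)
    return "\n".join(out) + "\n"


# Made-up file contents, for illustration only.
conf = "# storaged settings\n--auto_remove_invalid_space=false\n"
updated = set_flag(conf, "auto_remove_invalid_space", "true")
print(updated)
```

As noted above, storage has to be restarted for the change to take effect.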

Thanks.

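Hello, it seems there is still a problem.
I have 4 machines, each with 16 cores and 32 GB of memory.
I imported 2 TB of data into the nebula cluster, and everything was fine at that point.
Then I ran a compaction. It ran all morning without finishing, and when I checked, every storaged machine had gone down.
I looked at the storaged error logs and found only the following two kinds of entries, so I can't tell why they went down:
Transport is closed by peers 0x7fed54727710 for host: "172.19.143.226":9780
Invalid Channel: 0x7fed58a0b000 for host: "172.19.143.226":9780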

Check dmesg; it was probably OOM.
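
A quick way to check is to search the dmesg output for the kernel's "Killed process" lines. Here is a minimal Python sketch of that check; the sample text is made up for illustration (it is not from this cluster), and the exact wording of kernel OOM messages varies between kernel versions, so the pattern may need adjusting.

```python
import re

# The kernel logs "Killed process <pid> (<name>)" when the OOM killer ends a process.
KILLED = re.compile(r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)")


def find_oom_kills(dmesg_text):
    """Return a list of (pid, process name) for every OOM kill found in dmesg output."""
    return [(int(m.group("pid")), m.group("name")) for m in KILLED.finditer(dmesg_text)]


# Illustrative dmesg excerpt (made up, not real output from this thread).
sample = (
    "[12345.678901] Out of memory: Kill process 8187 (nebula-storaged) score 900 or sacrifice child\n"
    "[12345.678950] Killed process 8187 (nebula-storaged) total-vm:30000000kB, anon-rss:29000000kB, file-rss:0kB\n"
)

kills = find_oom_kills(sample)
print(kills)
```

On a real machine the input could come from `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`, which may require root privileges depending on the system settings.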

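Thanks. I checked, and it was indeed OOM. Is adding machine capacity the only option in this case?
Also, I restarted the service and checked the log right away, and found a large number of the errors below.
It looks like "space not found". What causes this?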

E0809 11:22:20.834218 8187 RaftPart.cpp:1143] [Port: 9780, Space: 90, Part: 6] Receive response about askForVote from "172.19.143.227":9780, error code is -5
E0809 11:22:20.883253 8190 RaftPart.cpp:1143] [Port: 9780, Space: 90, Part: 18] Receive response about askForVote from "172.19.143.226":9780, error code is -5
E0809 11:22:20.883293 8190 RaftPart.cpp:1143] [Port: 9780, Space: 90, Part: 18] Receive response about askForVote from "172.19.143.227":9780, error code is -5
E0809 11:22:21.589761 8190 RaftPart.cpp:1143] [Port: 9780, Space: 90, Part: 14] Receive response about askForVote from "172.19.143.226":9780, error code is -5
E0809 11:22:21.589812 8190 RaftPart.cpp:1143] [Port: 9780, Space: 90, Part: 14] Receive response about askForVote from "172.19.143.227":9780, error code is -5

See the sections on the storaged configuration file and on resource estimation in https://docs.nebula-graph.com.cn/site/pdf/NebulaGraph-book.pdf. You can trade performance for memory.

OK, thanks.