Cluster fails to start after upgrading from 2.5.1 to 2.6.1

Question: After upgrading from 2.5.1 to 2.6.1, the cluster fails to start. What could be the cause?

Error logs:
First node:
[root@node4 logs]# cat graphd-stderr.log
E1118 11:40:39.266746 3695104 MetaClient.cpp:636] Send request to "10.210.39.139":9559, exceed retry limit
E1118 11:40:39.267132 3695095 MetaClient.cpp:94] RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E1118 11:40:39.267174 3695095 GraphService.cpp:43] Failed to wait for meta service ready synchronously.
E1118 11:40:39.267195 3695095 GraphDaemon.cpp:158] Failed to wait for meta service ready synchronously.

[root@node4 logs]# cat metad-stderr.log
*** Aborted at 1637206866 (Unix time, try 'date -d @1637206866') ***
*** Signal 15 (SIGTERM) (0x386418) received by PID 3694912 (pthread TID 0x7fea01a61980) (linux TID 3694912) (maybe from PID 3695640, UID 0) (code: 0), stack trace: ***
/opt/nebula/graph/bin/nebula-metad(_ZN5folly10symbolizer17getStackTraceSafeEPmm+0x31)[0x20ed031]
/opt/nebula/graph/bin/nebula-metad(_ZN5folly10symbolizer21SafeStackTracePrinter15printStackTraceEb+0x1b)[0x20e4e9b]
/opt/nebula/graph/bin/nebula-metad[0x20e31e7]
/lib64/libpthread.so.0(+0xf62f)[0x7fea00f2462f]
/lib64/libc.so.6(nanosleep+0x2d)[0x7fea00c0c85d]
/lib64/libc.so.6(sleep+0xd3)[0x7fea00c0c6f3]
/opt/nebula/graph/bin/nebula-metad(_Z6initKVSt6vectorIN6nebula8HostAddrESaIS1_EES1_+0xa04)[0xf4ac64]
/opt/nebula/graph/bin/nebula-metad(main+0x79c)[0xf0e6ec]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x7fea00b69554]
/opt/nebula/graph/bin/nebula-metad[0xf4998d]
(safe mode, symbolizer not available)

[root@node4 logs]# cat nebula-storaged.node4.root.log.ERROR.20211118-114038.3695026
Log file created at: 2021/11/18 11:40:38
Running on machine: node4
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E1118 11:40:38.997750 3695069 MetaClient.cpp:636] Send request to "10.210.38.69":9559, exceed retry limit
E1118 11:40:38.998262 3695026 MetaClient.cpp:94] RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E1118 11:40:38.998296 3695026 StorageServer.cpp:163] waitForMetadReady error!
E1118 11:40:38.998307 3695026 StorageDaemon.cpp:161] Storage server start failed

Other nodes:
[root@node2 logs]# cat graphd-stderr.log
E1118 11:40:46.985862 4089498 GraphDaemon.cpp:111] 10.210.38.69 is not a valid ip in current host, candidates: 172.17.0.1,10.254.131.20,127.0.0.1,10.210.38.70

[root@node2 logs]# cat nebula-metad.node2.root.log.ERROR.20211118-114045.4089444
Log file created at: 2021/11/18 11:40:45
Running on machine: node2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E1118 11:40:45.642781 4089444 MetaDaemon.cpp:256] 10.210.38.69 is not a valid ip in current host, candidates: 172.17.0.1,10.254.131.20,127.0.0.1,10.210.38.70

[root@node2 logs]# cat nebula-storaged.node2.root.log.ERROR.20211118-114046.4089471
Log file created at: 2021/11/18 11:40:46
Running on machine: node2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E1118 11:40:46.737658 4089471 StorageDaemon.cpp:119] 10.210.38.69 is not a valid ip in current host, candidates: 172.17.0.1,10.254.131.20,127.0.0.1,10.210.38.70

The meta log only shows that the process was killed. Did you stop meta before the upgrade and never start it again afterwards?

$ date -d @1637206866
Thu Nov 18 03:41:06 UTC 2021
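
To line the abort timestamp up with the other logs, the same conversion can be sketched in Python. The UTC+8 offset below is an assumption (the thread does not state the hosts' time zone); under that assumption the SIGTERM arrives at 11:41:06 local time, shortly after the graphd and storaged errors logged at 11:40:38 and 11:40:39.

```python
from datetime import datetime, timedelta, timezone

# Value taken from "*** Aborted at 1637206866 ..." in metad-stderr.log
abort_ts = 1637206866

# Same result as `date -d @1637206866` on a host whose clock is set to UTC
utc_time = datetime.fromtimestamp(abort_ts, tz=timezone.utc)

# Assumption: the hosts write their logs in UTC+8
local_time = utc_time.astimezone(timezone(timedelta(hours=8)))

print(utc_time.isoformat())    # 2021-11-18T03:41:06+00:00
print(local_time.isoformat())  # 2021-11-18T11:41:06+08:00
```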

Yes. I stopped all services before the upgrade, and none of them would start after it.

Could you start meta on its own again and check its log?

After starting meta on the first node:
There was no log output at first; later the following was printed:
[root@node4 logs]# cat metad-stderr.log
*** Aborted at 1637215722 (Unix time, try 'date -d @1637215722') ***
*** Signal 15 (SIGTERM) (0x6442) received by PID 24947 (pthread TID 0x7f92c2906980) (linux TID 24947) (maybe from PID 25666, UID 0) (code: 0), stack trace: ***
/opt/nebula/graph/bin/nebula-metad(_ZN5folly10symbolizer17getStackTraceSafeEPmm+0x31)[0x20ed031]
/opt/nebula/graph/bin/nebula-metad(_ZN5folly10symbolizer21SafeStackTracePrinter15printStackTraceEb+0x1b)[0x20e4e9b]
/opt/nebula/graph/bin/nebula-metad[0x20e31e7]
/lib64/libpthread.so.0(+0xf62f)[0x7f92c1dc962f]
/lib64/libc.so.6(nanosleep+0x2d)[0x7f92c1ab185d]
/lib64/libc.so.6(sleep+0xd3)[0x7f92c1ab16f3]
/opt/nebula/graph/bin/nebula-metad(_Z6initKVSt6vectorIN6nebula8HostAddrESaIS1_EES1_+0xa04)[0xf4ac64]
/opt/nebula/graph/bin/nebula-metad(main+0x79c)[0xf0e6ec]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x7f92c1a0e554]
/opt/nebula/graph/bin/nebula-metad[0xf4998d]
(safe mode, symbolizer not available)

On the other nodes, meta crashes immediately after starting. The error log is:
[root@node2 logs]# cat nebula-metad.node2.root.log.ERROR.20211118-165644.35357
Log file created at: 2021/11/18 16:56:44
Running on machine: node2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E1118 16:56:44.078560 35357 MetaDaemon.cpp:256] 10.210.38.69 is not a valid ip in current host, candidates: 10.210.38.70,127.0.0.1

Did the IP of any machine change? Does the local IP in the configuration match the machine's actual IP?
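
As a rough aid for answering that question from the logs alone, here is a minimal sketch that pulls the configured address and the candidate list out of the "is not a valid ip in current host" line. The function name `parse_invalid_ip_error` and the regular expression are hypothetical helpers written for this thread, not part of Nebula Graph.

```python
import re

# Matches the daemon error quoted in this thread, e.g.
# "10.210.38.69 is not a valid ip in current host, candidates: 10.210.38.70,127.0.0.1"
ERROR_RE = re.compile(
    r"(?P<configured>\d{1,3}(?:\.\d{1,3}){3}) is not a valid ip in current host, "
    r"candidates: (?P<candidates>[\d.,]+)"
)


def parse_invalid_ip_error(line):
    """Return (configured_ip, [candidate_ips]), or None if the line does not match."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    candidates = [c for c in match.group("candidates").split(",") if c]
    return match.group("configured"), candidates


# Log line copied from node2's nebula-metad ERROR log above
line = (
    "E1118 16:56:44.078560 35357 MetaDaemon.cpp:256] 10.210.38.69 is not a valid ip "
    "in current host, candidates: 10.210.38.70,127.0.0.1"
)
configured, candidates = parse_invalid_ip_error(line)
print(configured)                # 10.210.38.69
print(candidates)                # ['10.210.38.70', '127.0.0.1']
print(configured in candidates)  # False
```

For node2 the configured address is absent from the candidates the daemon reports, which is exactly the mismatch the error message describes.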

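It looks like the meta service on the other nodes cannot start, which keeps meta on the first node from working properly. The other nodes all log the same error: 10.210.38.69 is not a valid ip. Why would it not be a valid IP?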

Nothing changed. I just started the services right after the upgrade.

is not a valid ip in current host, candidates: 172.27.0.1,127.0.0.1,172.17.0.1,172.23.52.151
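When I deployed, I replaced every IP in the configuration files with the public IP and got this same "not a valid ip" error, so none of the services came up. Setting it to 127.0.0.1 works, but I need to configure the real public IP.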

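Those are the candidate IPs it suggests. You could try 172.23.52.151. That IP should also point to the server you are working on, even though the ifconfig command does not show it.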

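I deployed 2.6.1 directly and set the IP to the one shown by ifconfig. After startup the log reported an invalid IP; switching to the suggested IP fixed it. Some servers do not hit this error, though.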

After testing: the cluster cannot start after upgrading from 2.5.1 to 2.6.1, and the log shows:

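10.210.38.69 is visible with the ifconfig command, while the candidate IP 10.210.38.70 is not. However, 10.210.38.70 and 10.210.38.69 point to the same server. After changing the IP in the configuration to 10.210.38.70, the services start normally after the upgrade.
This looks like a bug in 2.6.1 @wey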

Thanks. So .70 is not an IP newly assigned to this host, and both IPs were already on this host back in 2.5, right?

I've filed an issue; feel free to add information there.

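Right, the IPs were never reassigned. I did not previously know that 10.210.38.70 pointed to this server either, since ifconfig does not show it. I sent a file from another machine to 10.210.38.70 and it arrived on 10.210.38.69, which is how I confirmed that both IPs point to the same server.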
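
A complementary check that can be run on the host itself is to ask the kernel whether it will bind a socket to each address. The sketch below is only an illustration: `is_local_address` is a hypothetical helper, and this is not necessarily the check the Nebula daemons perform. Note also that on Linux hosts with the `net.ipv4.ip_nonlocal_bind` sysctl enabled, binding to a non-local address succeeds, so a True result is not conclusive there.

```python
import errno
import socket


def is_local_address(ip):
    """Return True if a TCP socket can be bound to `ip` in the current network namespace."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, 0))  # port 0 lets the OS pick any free port
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                return False  # the address is not assigned in this namespace
            raise  # anything else (e.g. permissions) is a different problem
        return True


if __name__ == "__main__":
    # The two addresses discussed in this thread, plus loopback as a sanity check
    for ip in ("127.0.0.1", "10.210.38.69", "10.210.38.70"):
        print(ip, is_local_address(ip))
```

Running it on the affected node would show which of the two addresses the kernel actually considers local to the namespace the services run in.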

OK. Your case isn't the multi-IP setup I had in mind (multiple interfaces, or multiple addresses on one interface). Do you have multiple network namespaces?

ip netns list

Or could you ask the colleague responsible for the host network?

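Thanks for the reply! But what I want to configure is the real public IP. This one is probably still a local IP, which would be hard for other machines to reach :joy:. Switching to version 2.5.0 fixed it, though, so the new version seems to have a problem.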
