Services fail to restart after machines went down (the error reports an invalid IP, but the configuration was not changed)

shixingr · September 30, 2022 01:39

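Question template:

nebula version: 3.1.0
Deployment: distributed
Installation method: RPM
Production environment: Y
Hardware information
- Disk: SATA
- CPU and memory: all three physical machines are identical: 10C * 2 CPU, 8 * 16 GB memory
Detailed description of the problem
After machines 1 and 2 went down, the Nebula database could still be queried. Restarting the services on machines 1 and 2 then failed, with all three services in the Exited state. I then manually stopped the healthy services on machine 3, and they could not be restarted either, so the whole cluster can no longer be queried.
Relevant meta / storage / graph info logs (text form preferred so that it is searchable)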
Error logs of the three services on machine 1. Every time I run the start command [sudo /home/q/module/nebula/scripts/nebula.service start all], these logs appear:

[shixingr@l-nebula1.tj.cn6 /home/q/module/nebula/logs]$ cat nebula-metad.ERROR 
Log file created at: 2022/09/29 21:22:59
Running on machine: l-nebula1.tj.cn6
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
E20220929 21:22:59.933522 43178 MetaDaemon.cpp:129] 10.66.140.34 is not a valid ip in current host, candidates: 10.66.140.41,127.0.0.1

[shixingr@l-nebula1.tj.cn6 /home/q/module/nebula/logs]$ cat nebula-storaged.ERROR 
Log file created at: 2022/09/29 21:22:59
Running on machine: l-nebula1.tj.cn6
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
E20220929 21:22:59.994889 43198 StorageDaemon.cpp:123] 10.66.140.34 is not a valid ip in current host, candidates: 10.66.140.41,127.0.0.1

[shixingr@l-nebula1.tj.cn6 /home/q/module/nebula/logs]$ cat nebula-graphd.ERROR 
Log file created at: 2022/09/29 21:22:59
Running on machine: l-nebula1.tj.cn6
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
E20220929 21:22:59.963265 43188 GraphDaemon.cpp:110] 10.66.140.34 is not a valid ip in current host, candidates: 10.66.140.41,127.0.0.1
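
A minimal Python sketch for pulling the two sides of that error apart: the address the daemon was configured with, and the addresses it found on the host. The function name `parse_invalid_ip_error` and the regular expression are illustrative assumptions based only on the log lines quoted above; they are not part of NebulaGraph.

```python
import re

# Pattern for the startup error quoted in the logs above (an assumption
# derived from those lines, not an official log format specification).
ERROR_RE = re.compile(
    r"(?P<configured>\d+\.\d+\.\d+\.\d+) is not a valid ip in current host, "
    r"candidates: (?P<candidates>[\d.,]+)"
)


def parse_invalid_ip_error(line):
    """Return (configured_ip, candidate_ips), or None if the line does not match."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    candidates = [ip for ip in match.group("candidates").split(",") if ip]
    return match.group("configured"), candidates


if __name__ == "__main__":
    sample = (
        "E20220929 21:22:59.933522 43178 MetaDaemon.cpp:129] 10.66.140.34 is not "
        "a valid ip in current host, candidates: 10.66.140.41,127.0.0.1"
    )
    configured, candidates = parse_invalid_ip_error(sample)
    print(configured, "in candidates:", configured in candidates)
```

For the lines quoted in this post, the configured address is 10.66.140.34 while the host reports 10.66.140.41 and 127.0.0.1, which suggests comparing local_ip with the addresses currently assigned to the machine.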

kyle · September 30, 2022 02:10

Check local_ip in the configuration files, and check the firewall status.

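shixingr · September 30, 2022 02:20

local_ip is set to the machine's own IP; each of the three machines uses its own address.
Here is the metad configuration of one of the machines: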

########## basics ##########
# Whether to run as a daemon process
--daemonize=true
# The file to host the process id
--pid_file=pids/nebula-metad.pid

########## logging ##########
# The directory to host logging files
--log_dir=logs
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=0
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=0
# Maximum seconds to buffer the log messages
--logbufsecs=0
# Whether to redirect stdout and stderr to separate output files
--redirect_stdout=true
# Destination filename of stdout and stderr, which will also reside in log_dir.
--stdout_log_file=metad-stdout.log
--stderr_log_file=metad-stderr.log
# Copy log messages at or above this level to stderr in addition to logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are 0, 1, 2, and 3, respectively.
--stderrthreshold=2
# Whether the log file names contain a timestamp. If using logrotate to rotate log files, then set it to true.
--timestamp_in_logfile_name=true

########## networking ##########
# Comma separated Meta Server addresses
--meta_server_addrs=10.66.140.34:9559,10.66.140.39:9559
# Local IP used to identify the nebula-metad process.
# Change it to an address other than loopback if the service is distributed or
# will be accessed remotely.
--local_ip=10.66.140.34
# Meta daemon listening port
--port=9559
# HTTP service ip
--ws_ip=0.0.0.0
# HTTP service port
--ws_http_port=19559
# Port to listen on Storage with HTTP protocol, it corresponds to ws_http_port in storage's configuration file
--ws_storage_http_port=19779

########## storage ##########
# Root data path, here should be only single path for metad
--data_path=data/meta

########## Misc #########
# The default number of parts when a space is created
--default_parts_num=100
# The default replica factor when a space is created
--default_replica_factor=1

--heartbeat_interval_secs=10
--agent_heartbeat_interval_secs=60

The firewall is disabled.
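
To cross-check the networking flags in a configuration like the one above, here is a small Python sketch that reads `--key=value` lines and compares local_ip with the hosts in meta_server_addrs. The helper names `parse_flags` and `meta_hosts` are hypothetical, and the parser is a simplification that only handles the flag-file syntax visible in this post.

```python
def parse_flags(text):
    """Parse '--key=value' lines, skipping blank lines and '#' comments."""
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("--") or "=" not in line:
            continue
        key, value = line[2:].split("=", 1)
        flags[key.strip()] = value.strip()
    return flags


def meta_hosts(flags):
    """Return the host part of every entry in meta_server_addrs."""
    addrs = flags.get("meta_server_addrs", "")
    return [a.strip().rsplit(":", 1)[0] for a in addrs.split(",") if a.strip()]


if __name__ == "__main__":
    # Values copied from the metad configuration quoted in this post.
    sample = """
    # Comma separated Meta Server addresses
    --meta_server_addrs=10.66.140.34:9559,10.66.140.39:9559
    --local_ip=10.66.140.34
    --port=9559
    """
    flags = parse_flags(sample)
    hosts = meta_hosts(flags)
    print("local_ip:", flags["local_ip"])
    print("metad hosts:", hosts)
    print("local_ip listed as metad:", flags["local_ip"] in hosts)
```

This only shows that the flags are consistent with each other. It says nothing about whether the address in local_ip is actually assigned to an interface on the host.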

kyle · September 30, 2022 02:44

Can the machines all ping each other?

kyle · September 30, 2022 02:46

It could also be a port conflict. Check whether any of the ports in the configuration are already in use.
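
One way to check for a port conflict, sketched in Python: try to bind a TCP socket to each port and see whether the kernel refuses. The function name `port_is_free` is hypothetical, the port numbers are the ones from the metad configuration quoted earlier in the thread, and the check should be run while the Nebula services are stopped.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port) right now.

    Any OSError is treated as "not free"; besides a port already in use,
    this can also mean missing permissions for ports below 1024.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    # port, ws_http_port and ws_storage_http_port from the metad config.
    for port in (9559, 19559, 19779):
        print(port, "free" if port_is_free(port) else "in use or not bindable")
```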

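shixingr · September 30, 2022 02:57

If one of the machines is down, does that mean the services cannot start? One machine went down yesterday.
Then, because of other issues, two machines are down now, and the only one left is the machine that is not listed as a metad address. [meta_server_addrs lists two machines.]
In other words, does a distributed deployment need all machines to be alive in order to work?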
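
On whether every machine must be alive: assuming the metad processes listed in meta_server_addrs form a Raft group that needs a strict majority (this is how NebulaGraph describes its Meta service, but treat it as an assumption for this cluster), the arithmetic can be sketched in Python as follows. The function names are hypothetical.

```python
def quorum(replicas):
    """Smallest strict majority of a replica group."""
    return replicas // 2 + 1


def tolerated_failures(replicas):
    """How many replicas can be down while a majority is still reachable."""
    return replicas - quorum(replicas)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(n, "replicas: quorum", quorum(n),
              "tolerates", tolerated_failures(n), "down")
```

Under that assumption, the two metad addresses in the configuration above give a quorum of 2, so losing either metad machine would leave the Meta service without a majority, whereas three metad replicas would tolerate one failure.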

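shixingr · September 30, 2022 06:49

Are you still busy? I have now rebooted machines 1 and 2. All three services started successfully on both machines and external queries are no longer affected, but machine 3 is offline and cannot start. The error is still "is not a valid ip in current host". The IP is definitely fine; it has been in use all along.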

kyle · September 30, 2022 10:37

At startup, the service checks the status of all nodes listed in the configuration.

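shixingr · September 30, 2022 11:05

Of my three machines, only the second one can start now, and the external read/write service is down.
Why is that? Machines 1 and 3 show the same log: not a valid IP, and the candidate list also contains two IPs that I never configured.
Specifically, these two IPs [192.160.0.0, 127.0.0.1] do not appear anywhere in my configuration files.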

[shixingr@l-nebula1.tj.cn6 /home/q/module/nebula]$ cat logs/nebula-metad.ERROR 
Log file created at: 2022/09/30 18:59:58
Running on machine: l-nebula1.tj.cn6
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
E20220930 18:59:58.606060 115728 MetaDaemon.cpp:129] 10.66.140.34 is not a valid ip in current host, candidates: 192.163.16.64,192.160.0.0,127.0.0.1,10.66.140.41

kyle · September 30, 2022 11:52

The error looks quite clear... and the log is emitted while the daemon is starting.

wey · September 30, 2022 23:28

This IP is not the IP of a NIC on this machine, is it?

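shixingr · October 8, 2022 01:59

Is it really an IP problem? I have used this IP for about half a year without any issue; the problem only appeared after the machine crashed and came back up.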

wey · October 8, 2022 02:18

My guess is that the bond was not in the right state at startup.


cat /proc/net/bonding/bond0

shixingr · October 8, 2022 02:33

Uploading the image keeps failing, so I have pasted the output as text instead:

[shixingr@l-nebula1.tj.cn6 ~]$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: em1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 5000
Down Delay (ms): 0

Slave Interface: em1
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 78:ac:44:27:81:1c
Slave queue ID: 0

Slave Interface: em2
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 78:ac:44:27:81:1d
Slave queue ID: 0
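
To compare this output across the three machines without reading it by eye, a rough Python sketch can condense it into the fields that matter here (mode, active slave, link state per slave). `parse_bonding` is a hypothetical helper that assumes the `Key: value` layout shown above; it is not a complete parser for every bonding driver version.

```python
from pathlib import Path


def parse_bonding(text):
    """Parse the text of /proc/net/bonding/<bond> into a small summary dict."""
    summary = {"mode": None, "active_slave": None,
               "bond_mii_status": None, "slaves": {}}
    current = None  # name of the slave section being read, if any
    for raw in text.splitlines():
        if ":" not in raw:
            continue
        key, value = (part.strip() for part in raw.split(":", 1))
        if key == "Bonding Mode":
            summary["mode"] = value
        elif key == "Currently Active Slave":
            summary["active_slave"] = value
        elif key == "Slave Interface":
            current = value
            summary["slaves"][current] = {}
        elif key == "MII Status":
            # Before the first slave section this is the bond's own status.
            if current is None:
                summary["bond_mii_status"] = value
            else:
                summary["slaves"][current]["mii_status"] = value
        elif key == "Link Failure Count" and current is not None:
            summary["slaves"][current]["link_failures"] = int(value)
    return summary


if __name__ == "__main__":
    path = Path("/proc/net/bonding/bond0")
    if path.exists():
        print(parse_bonding(path.read_text()))
```

Note that this file describes link state only. It does not list the IP addresses assigned to bond0; those come from a command such as `ip addr show bond0`.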

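wey · October 8, 2022 04:36

This looks like bond mode 1 (active-backup). Could something be misconfigured? For example, does the switch configuration match what this mode needs? Mode 1 should not require any switch-side support, but could it cause trouble if the switch was mistakenly configured for a different mode?
If the same configuration used to work, could it be that the active slave used to be em2, and the problem appeared now that it has switched to em1?

If all of that is ruled out, we need to check whether the way the service detects valid candidates is incompatible with bond interfaces (a bug).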

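shixingr · October 8, 2022 06:08

Nothing was touched. Could it have switched over automatically after the crash? If this cannot be fixed, I will just have to redeploy.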

wey · October 8, 2022 08:03

https://github.com/vesoft-inc/nebula/blob/2f3259de4673ff3d5c6f2281a6c606375b0afebe/src/common/network/NetworkUtils.cpp#L62

Judging from this code, getifaddrs() should be able to return bond0. Does restarting the services still fail now?
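
As an independent check that does not go through getifaddrs(), one can ask the kernel directly whether it treats an address as local by trying to bind a socket to it. This is a different technique from the one in the linked NebulaGraph source, sketched here in Python under the assumption that `net.ipv4.ip_nonlocal_bind` is left at its default of 0; `ip_is_local` is a hypothetical name.

```python
import errno
import socket


def ip_is_local(ip):
    """Return True if the kernel lets us bind a UDP socket to `ip` on this host.

    With net.ipv4.ip_nonlocal_bind=1 the bind succeeds for any address,
    so the result is only meaningful with the default setting.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((ip, 0))  # port 0: let the kernel pick any free port
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                return False  # address is not assigned to any local interface
            raise
        return True


if __name__ == "__main__":
    # 10.66.140.34 is the local_ip from the configuration quoted in this thread.
    for ip in ("127.0.0.1", "10.66.140.34"):
        print(ip, "local" if ip_is_local(ip) else "not assigned on this host")
```

If this reports the configured local_ip as not assigned, the mismatch is at the operating-system level (interface addressing) rather than inside the Nebula daemons.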

shixingr · October 8, 2022 08:17

Right, it still fails. I will redeploy when I find the time; I am stuck here and cannot make progress.

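wey · October 8, 2022 08:20

Would it be possible for you to redeploy on other machines that have a bonded NIC and see whether the problem can be reproduced? This feels like a real issue.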

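shixingr · October 9, 2022 06:59

Well, there are no spare machines, so I redeployed on the same three. The problem is still the same: only one machine can start, and the console cannot connect.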