nebula service managed by systemd fails to start, but starts fine when launched manually

Berserker · December 27, 2024 07:12

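Question template:

nebula version: 3.8.0
Deployment: distributed
Installation method: RPM
In production: N
Hardware
- Disk: HDD
- CPU and memory: 8C16G

Problem description:
We recently switched our nebula deployment from K8s to binaries running directly on the host, so the services now need a supervisor that brings a process back up if it exits. I decided to set up a systemd service for this, configured as follows: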

[Unit]
Description=Nebula Metad
After=network.target
AssertPathExists=/usr/local/nebula/

[Service]
Type=simple
WorkingDirectory=/usr/local/nebula/
ExecStart=/usr/local/nebula/bin/nebula-metad --flagfile /usr/local/nebula/etc/nebula-metad.conf
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target

With that in place, running systemctl start nebula-metad.service does not bring the service up. systemctl status nebula-metad.service shows:

● nebula-metad.service - Nebula Metad
   Loaded: loaded (/etc/systemd/system/nebula-metad.service; enabled; vendor preset: disabled)
   Active: activating (auto-restart) since 五 2024-12-27 14:29:38 CST; 1s ago
  Process: 24121 ExecStart=/usr/local/nebula/bin/nebula-metad --flagfile /usr/local/nebula/etc/nebula-metad.conf (code=exited, status=0/SUCCESS)
 Main PID: 24121 (code=exited, status=0/SUCCESS)
[root@nebula-0 nebula]# ps -ef | grep nebula
root     24129  1231  0 14:29 pts/0    00:00:00 grep --color=auto nebula

The system log from journalctl -u nebula-metad.service --since "5 minutes ago" is:

12月 27 14:30:08 nebula-0 systemd[1]: Started Nebula Metad.
12月 27 14:30:08 nebula-0 systemd[1]: Starting Nebula Metad...
12月 27 14:30:18 nebula-0 systemd[1]: nebula-metad.service holdoff time over, scheduling restart.
12月 27 14:30:18 nebula-0 systemd[1]: Started Nebula Metad.
12月 27 14:30:18 nebula-0 systemd[1]: Starting Nebula Metad...
12月 27 14:30:29 nebula-0 systemd[1]: nebula-metad.service holdoff time over, scheduling restart.
12月 27 14:30:29 nebula-0 systemd[1]: Started Nebula Metad.
12月 27 14:30:29 nebula-0 systemd[1]: Starting Nebula Metad...
12月 27 14:30:39 nebula-0 systemd[1]: nebula-metad.service holdoff time over, scheduling restart.
12月 27 14:30:39 nebula-0 systemd[1]: Started Nebula Metad.
12月 27 14:30:39 nebula-0 systemd[1]: Starting Nebula Metad...
12月 27 14:30:49 nebula-0 systemd[1]: nebula-metad.service holdoff time over, scheduling restart.
12月 27 14:30:49 nebula-0 systemd[1]: Started Nebula Metad.
12月 27 14:30:49 nebula-0 systemd[1]: Starting Nebula Metad...
12月 27 14:30:59 nebula-0 systemd[1]: nebula-metad.service holdoff time over, scheduling restart.
12月 27 14:30:59 nebula-0 systemd[1]: Started Nebula Metad.
12月 27 14:30:59 nebula-0 systemd[1]: Starting Nebula Metad...

Running /usr/local/nebula/bin/nebula-metad --flagfile /usr/local/nebula/etc/nebula-metad.conf by hand, however, starts the service normally:

[root@nebula-0 nebula]# /usr/local/nebula/bin/nebula-metad --flagfile /usr/local/nebula/etc/nebula-metad.conf
[root@nebula-0 nebula]# ps -ef | grep nebula
root     24198     1  1 14:33 ?        00:00:00 /usr/local/nebula/bin/nebula-metad --flagfile /usr/local/nebula/etc/nebula-metad.conf
root     24324  1231  0 14:33 pts/0    00:00:00 grep --color=auto nebula

The metad log confirms that the manual start succeeded, with no errors:

Log file created at: 2024/12/27 14:33:11
Running on machine: nebula-0
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20241227 14:33:11.424259 24198 MetaDaemon.cpp:137] localhost = "nebula-0":9559
I20241227 14:33:11.428797 24198 NebulaStore.cpp:48] Start the raft service...
I20241227 14:33:11.429234 24198 NebulaSnapshotManager.cpp:25] Send snapshot is rate limited to 10485760 for each part by default
I20241227 14:33:11.433812 24198 RaftexService.cpp:46] Start raft service on 9560
I20241227 14:33:11.433888 24198 NebulaStore.cpp:82] Scan the local path, and init the spaces_
I20241227 14:33:11.433924 24198 NebulaStore.cpp:90] Scan path "/usr/local/nebula/data/meta/nebula/0"
I20241227 14:33:11.433954 24198 NebulaStore.cpp:292] Init data from partManager for "nebula-0":9559
I20241227 14:33:11.433976 24198 NebulaStore.cpp:417] Create data space 0
I20241227 14:33:11.470616 24267 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20241227 14:33:11.470781 24266 RocksEngine.cpp:107] open rocksdb on /usr/local/nebula/data/meta/nebula/0/data
I20241227 14:33:11.478994 24267 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20241227 14:33:11.492691 24198 NebulaStore.cpp:480] Space 0, part 0 has been added, asLearner 0
I20241227 14:33:11.492724 24198 NebulaStore.cpp:75] Register handler...
I20241227 14:33:11.492733 24198 MetaDaemonInit.cpp:106] Waiting for the leader elected...
I20241227 14:33:11.492738 24198 MetaDaemonInit.cpp:118] Leader has not been elected, sleep 1s
I20241227 14:33:12.492858 24198 MetaDaemonInit.cpp:153] Get meta version is 4
I20241227 14:33:12.492911 24198 MetaDaemonInit.cpp:169] Nebula store init succeeded, clusterId 3549281905671881576
I20241227 14:33:12.492919 24198 MetaDaemon.cpp:150] Start http service
I20241227 14:33:12.493121 24198 MetaDaemonInit.cpp:226] Starting Meta HTTP Service
I20241227 14:33:12.494405 24289 WebService.cpp:124] Web service started on HTTP[19559]
I20241227 14:33:12.494446 24198 MetaDaemonInit.cpp:192] Check root user
I20241227 14:33:12.494510 24198 RootUserMan.h:35] God user exists
I20241227 14:33:12.498164 24198 MetaDaemon.cpp:193] The meta daemon start on "nebula-0":9559
I20241227 14:33:12.498199 24198 JobManager.cpp:88] Not leader, skip reading remaining jobs
I20241227 14:33:12.498262 24198 JobManager.cpp:64] JobManager initialized
I20241227 14:33:12.498277 24295 JobManager.cpp:150] JobManager::scheduleThread enter

Could anyone help me figure out what is causing this? So many people in the community deploy from binaries; has nobody run into this problem?

MuYi-方扬 · December 28, 2024 06:34

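Your configuration doesn't look quite right; at the very least it is missing the stop/restart settings.
I suggest following the configuration on the official site.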
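
For reference, here is a minimal sketch of how the unit from the question might be adjusted. It is not the official unit file, and it rests on two assumptions the thread does not confirm: that nebula-metad.conf keeps --daemonize=true, so the launched process forks into the background and exits with status 0 (which would match both the status=0/SUCCESS in the systemctl output and the manual run returning to the prompt with the daemon's parent PID being 1), and that --pid_file resolves to /usr/local/nebula/pids/nebula-metad.pid. If both hold, Type=forking is probably a closer fit than Type=simple:

```ini
[Unit]
Description=Nebula Metad
After=network.target
AssertPathExists=/usr/local/nebula/

[Service]
# Assumption: the flagfile sets --daemonize=true, so the process started by
# ExecStart forks and exits. Type=forking makes systemd track the forked
# daemon instead of treating the launcher's exit as the service stopping.
Type=forking
# Assumption: --pid_file in nebula-metad.conf resolves to this path.
PIDFile=/usr/local/nebula/pids/nebula-metad.pid
WorkingDirectory=/usr/local/nebula/
ExecStart=/usr/local/nebula/bin/nebula-metad --flagfile /usr/local/nebula/etc/nebula-metad.conf
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target
```

After editing the unit, systemctl daemon-reload followed by systemctl restart nebula-metad.service applies the change. No ExecStop line is strictly needed, because systemctl stop sends SIGTERM to the service's processes by default. Under the same assumption, an alternative is to keep Type=simple and set --daemonize=false in the flagfile so that the process stays in the foreground.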

MuYi-方扬 · December 28, 2024 06:36

You can run systemctl status nebula-metad.service to see what you have misconfigured.

MuYi-方扬 · December 28, 2024 06:42

It works fine on my end when I follow that configuration.

system · January 27, 2025 06:42

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.