nebula-graphd: Exited

  • NebulaGraph version: 3.4.1
  • Deployment: single machine
  • Installation method: RPM
  • In production: N
  • Hardware
    • High-performance cloud disk, 50 GB
    • 2 cores, 4 GB RAM
  • I edited the nebula-graphd.conf config file and changed the --enable_authorize parameter to true. After nebula.service stop all and nebula.service start all, nebula-graphd has stayed in the Exited state:
    [INFO] nebula-metad(a6dc673): Running as 32513, Listening on 9559
    [INFO] nebula-graphd(a6dc673): Exited
    [INFO] nebula-storaged(a6dc673): Running as 32636, Listening on 9779
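
For reference, the change amounts to one line in the config file plus a restart (paths below assume the default RPM layout under /usr/local/nebula):

    # /usr/local/nebula/etc/nebula-graphd.conf
    # authentication is disabled (false) by default
    --enable_authorize=true

    # then restart everything
    /usr/local/nebula/scripts/nebula.service stop all
    /usr/local/nebula/scripts/nebula.service start all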

Checking shows that ports 9669 and 19669 are already occupied:
nebula-gr 32587 root 67u IPv6 235797 0t0 TCP *:9669 (LISTEN)
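
(The port check can be reproduced with lsof, the same way it is done further down in this thread:)

    # which process is listening on each graphd port?
    lsof -i :9669
    lsof -i :19669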

Running nebula.service stop all and nebula.service start all again gives the same result; after rebooting the machine, nebula.service start all gives the same result; and after kill 32587, nebula.service start graphd still gives the same result.

E20230518 14:43:42.762362  5710 WebService.cpp:152] Failed to start web service: failed to bind to async server socket: 0.0.0.0:19669: Address already in use

E20230518 16:52:18.535353  3385 MetaClient.cpp:772] Send request to "127.0.0.1":9559, exceed retry limit
E20230518 16:52:18.535396  3385 MetaClient.cpp:773] RpcResponse exception: apache::thrift::transport::TTransportException: Dropping unsent request. Connection closed after: apache::thrift::transport::TTransportException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E20230518 16:52:18.535437  3459 MetaClient.cpp:192] Heartbeat failed, status:RPC failure in MetaClient: apache::thrift::transport::TTransportException: Dropping unsent request. Connection closed after: apache::thrift::transport::TTransportException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connect

This means graphd cannot connect to metad. Is metad configured to listen locally on 9559? Check whether the metad log looks normal, and whether the firewall allows port 9559.
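
Both checks in one minimal sketch (assuming CentOS with firewalld, and nc from the nmap-ncat package; adjust if your setup differs):

    # is anything listening on the metad port, and does it accept connections?
    lsof -i :9559
    nc -zv 127.0.0.1 9559

    # is the firewall active at all, and which ports does it allow?
    systemctl status firewalld
    firewall-cmd --list-ports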

Yes, it's 9559, and metad's ERROR log looks normal:
"""
Log file created at: 2023/05/18 10:17:32
Running on machine: VM-0-14-centos
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
E20230518 10:17:32.890949 4035 AddHostsProcessor.cpp:72] The host "127.0.0.1":9779 have existed!
"""
Checking metad's INFO log, it is already receiving heartbeats from the graph service on 9669:
"""
I20230519 10:29:23.545069 21645 MetaDaemon.cpp:137] localhost = "127.0.0.1":9559
I20230519 10:29:25.385288 21762 HBProcessor.cpp:33] Receive heartbeat from "127.0.0.1":9669, role = GRAPH
I20230519 10:29:35.388867 21762 HBProcessor.cpp:33] Receive heartbeat from "127.0.0.1":9669, role = GRAPH
I20230519 10:29:45.398845 21762 HBProcessor.cpp:33] Receive heartbeat from "127.0.0.1":9669, role = GRAPH
"""
But graphd is still Exited.

The firewall is disabled.

Is the port already occupied?

Ah, so that's the cause (I was reading on my phone and missed it).

The port is indeed occupied by the nebula-gr service, but graphd still shows Exited when I check the status.
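
That combination (a listening nebula-gr process while status reports Exited) suggests the running graphd is no longer the one the management script tracks: nebula.service status goes by the pid file rather than the port. A sketch of the comparison, assuming the default --pid_file=pids/nebula-graphd.pid setting:

    # the pid the management script has on record
    cat /usr/local/nebula/pids/nebula-graphd.pid

    # the pid actually holding the client port (-t prints bare pids)
    lsof -t -i :9669

If the two pids differ, status will report Exited even though a graphd is serving.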

I killed the service occupying port 9669 and restarted all services, but graphd is still Exited.

What's occupied is 19669; graphd needs to listen on more than just 9669 (19669 is its web service port, per the WebService.cpp error above).

Then how should I solve this?

Check which process is occupying the port and, if possible, kill it.

I killed the process before, but after restarting it still doesn't work.

It might be a non-NebulaGraph process occupying it.
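
Note also that lsof truncates the COMMAND column, so "nebula-gr" may simply be a cut-off "nebula-graphd". A sketch for pinning down what the process really is (11083 stands for whatever pid lsof reports):

    # the full command line behind the truncated COMMAND name
    ps -fp 11083

    # every network endpoint that pid holds (-a ANDs -i with -p)
    lsof -nP -i -a -p 11083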

[root@VM-0-14-centos ~]# kill 32017
[root@VM-0-14-centos ~]# /usr/local/nebula/scripts/nebula.service stop all
[INFO] Stopping nebula-metad...
[INFO] Done
[INFO] Stopping nebula-graphd...
[INFO] Done
[INFO] Stopping nebula-storaged...
[INFO] Done
[root@VM-0-14-centos ~]# /usr/local/nebula/scripts/nebula.service start all
[INFO] Starting nebula-metad...
[INFO] Done
[INFO] Starting nebula-graphd...
[INFO] Done
[root@VM-0-14-centos ~]# /usr/local/nebula/scripts/nebula.service status all
[INFO] nebula-metad(a6dc673): Running as 14347, Listening on 9559
[INFO] nebula-graphd(a6dc673): Exited
[INFO] nebula-storaged(a6dc673): Exited
[root@VM-0-14-centos ~]# lsof -i :19669
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
nebula-gr 11083 root 29u IPv4 795253 0t0 TCP *:19669 (LISTEN)
[root@VM-0-14-centos ~]# lsof -i :9669
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
nebula-gr 11083 root 67u IPv4 795323 0t0 TCP *:9669 (LISTEN)
[root@VM-0-14-centos ~]# kill 11083
[root@VM-0-14-centos ~]# lsof -i :9669
[root@VM-0-14-centos ~]# lsof -i :19669
[root@VM-0-14-centos ~]# /usr/local/nebula/scripts/nebula.service stop all
[INFO] Stopping nebula-metad...
[INFO] Done
[INFO] Stopping nebula-graphd...
[INFO] Done
[INFO] Stopping nebula-storaged...
[INFO] Done
[root@VM-0-14-centos ~]# /usr/local/nebula/scripts/nebula.service start all
[INFO] Starting nebula-metad...
[INFO] Done
[INFO] Starting nebula-graphd...
[INFO] Done
[INFO] Starting nebula-storaged...
[INFO] Done
[root@VM-0-14-centos ~]# /usr/local/nebula/scripts/nebula.service status all
[INFO] nebula-metad(a6dc673): Running as 14867, Listening on 9559
[INFO] nebula-graphd(a6dc673): Exited
[WARN] nebula-storaged after v3.0.0 will not start service until it is added to cluster.
[WARN] See Manage Storage hosts:ADD HOSTS in https://docs.nebula-graph.io/
[INFO] nebula-storaged(a6dc673): Running as 14997, Listening on 9779
[root@VM-0-14-centos ~]# /usr/local/nebula/scripts/nebula.service status all
[INFO] nebula-metad(a6dc673): Running as 14867, Listening on 9559
[INFO] nebula-graphd(a6dc673): Exited
[INFO] nebula-storaged(a6dc673): Running as 14997, Listening on 9779
[root@VM-0-14-centos ~]#

I killed both processes occupying 9669 and 19669 (32017 and 11083), and after restarting it is still Exited.

First check the logs to see whether it's a port conflict. Are you starting the service from multiple accounts? Which account does nebula-gr belong to?
You could also write a simple script to test whether you can bind to 19669 and 9669.
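
For example, a minimal bind test with nc (one port at a time: if the port is free, nc stays listening until you Ctrl-C; if not, it fails immediately with "Address already in use"):

    nc -l 19669
    nc -l 9669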

Only the nebula-gr process is LISTENing on those ports, only one account (root) is operating, and nebula-gr runs as root.

Either kill that nebula-gr process, or change the graphd config to sidestep the conflict by using different ports.

If startup gets past the port-binding stage and graphd is still Exited, check the logs again to see what else is going on.
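
Changing the ports is two lines in nebula-graphd.conf plus a restart (9679/19679 below are just examples of presumably free ports):

    # /usr/local/nebula/etc/nebula-graphd.conf
    --port=9679
    --ws_http_port=19679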

I killed everything occupying 9669 (the server, graphd, and a java process), then changed graphd's ports to 9679 and 19679. After restarting, it is still Exited.

The graphd log is as follows:

The metad log is as follows:

The storaged log is as follows:

Could someone please help analyze this? Many thanks.

The logs are incomplete.

nebula-graphd.VM-0-14-centos.root.log.INFO.20230522-154445.16940 (1.4 KB)
nebula-metad.VM-0-14-centos.root.log.INFO.20230522-154445.16881 (5.0 KB)
nebula-storaged.VM-0-14-centos.root.log.INFO.20230522-154445.17004 (11.2 KB)

The detailed logs after the restart are attached.