Nebula cluster crashed: graph will not start, one node's meta service keeps flooding its log, and the meta services on the other nodes are all waiting

曲先生 · August 27, 2024, 00:35

nebula version: 3.6.0

Deployment: Huawei Cloud ECS
Installation method: RPM
In production: N
Hardware: 3 nodes, each configured as follows:
- Disk: 2 TB HDD
- CPU: 8
- Memory: 32 GB

The nebula cluster crashed and stopped; it is unusable.
1. On each node, run: /usr/local/nebula/scripts/nebula.service start all
2. After a while, run: /usr/local/nebula/scripts/nebula.service status all
3. The result is as follows:

**The logs are as follows**

-meta

> I20240826 17:53:13.732348 19551 NebulaSnapshotManager.cpp:67] Space 0 Part 0 start send snapshot of commitLogId 220604815 commitLogTerm 7, rate limited to 10485760, batch size is 524288
> I20240826 17:53:14.121416 19552 NebulaSnapshotManager.cpp:67] Space 0 Part 0 start send snapshot of commitLogId 220604815 commitLogTerm 7, rate limited to 10485760, batch size is 524288
> I20240826 17:53:14.506019 19553 NebulaSnapshotManager.cpp:67] Space 0 Part 0 start send snapshot of commitLogId 220604815 commitLogTerm 7, rate limited to 10485760, batch size is 524288

-storage

E20240826 17:46:42.199613 18130 MetaClient.cpp:112] Heartbeat failed, status:RPC failure in MetaClient: apache::thrift::transport::TTransportException: TTransportException: Timed out
I20240826 17:46:42.199652 18130 MetaClient.cpp:137] Waiting for the metad to be ready!
E20240826 17:49:55.407919 18266 MetaClient.cpp:772] Send request to "10.56.11.242":9559, exceed retry limit
E20240826 17:49:55.407992 18266 MetaClient.cpp:773] RpcResponse exception: apache::thrift::transport::TTransportException: TTransportException: Timed out
E20240826 17:49:55.408043 18130 MetaClient.cpp:112] Heartbeat failed, status:RPC failure in MetaClient: apache::thrift::transport::TTransportException: TTransportException: Timed out
I20240826 17:49:55.408082 18130 MetaClient.cpp:137] Waiting for the metad to be ready!
E20240826 17:52:08.427695 18130 MetaClient.cpp:112] Heartbeat failed, status:Machine not existed!
I20240826 17:52:08.427763 18130 MetaClient.cpp:137] Waiting for the metad to be ready!

-graph

E20240826 17:32:23.058907 20294 MetaClient.cpp:773] RpcResponse exception: apache::thrift::transport::TTransportException: TTransportException: Timed out
E20240826 17:32:23.058969 20114 MetaClient.cpp:157] RPC failure in MetaClient: apache::thrift::transport::TTransportException: TTransportException: Timed out
E20240826 17:32:23.059001 20114 GraphService.cpp:49] Failed to wait for meta service ready synchronously.
E20240826 17:32:23.059024 20114 GraphServer.cpp:39] Failed to wait for meta service ready synchronously.
E20240826 17:32:23.060482 20114 GraphDaemon.cpp:156] The graph server start failed

It looks like all the nodes are waiting for the meta service on one node to start. That doesn't seem normal, does it?

曲先生 · August 27, 2024, 02:06

Every one of these log lines reports the same commitLogId. Should I restart this meta service?
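
To check this observation over a whole log file rather than a few lines, the snapshot messages can be counted per commitLogId. The following is a minimal sketch that assumes the metad log lines have exactly the format quoted above; `count_snapshot_sends` is a hypothetical helper name, not part of NebulaGraph.

```python
import re
from collections import Counter

# Matches the "start send snapshot" message format quoted in the metad log above.
SNAPSHOT_RE = re.compile(
    r"Space (\d+) Part (\d+) start send snapshot of "
    r"commitLogId (\d+) commitLogTerm (\d+)"
)


def count_snapshot_sends(lines):
    """Count snapshot-send lines per (space, part, commitLogId, commitLogTerm)."""
    counts = Counter()
    for line in lines:
        match = SNAPSHOT_RE.search(line)
        if match:
            counts[tuple(int(group) for group in match.groups())] += 1
    return counts


# The three metad lines from this thread, used as sample input.
sample = [
    "I20240826 17:53:13.732348 19551 NebulaSnapshotManager.cpp:67] Space 0 Part 0 start send snapshot of commitLogId 220604815 commitLogTerm 7, rate limited to 10485760, batch size is 524288",
    "I20240826 17:53:14.121416 19552 NebulaSnapshotManager.cpp:67] Space 0 Part 0 start send snapshot of commitLogId 220604815 commitLogTerm 7, rate limited to 10485760, batch size is 524288",
    "I20240826 17:53:14.506019 19553 NebulaSnapshotManager.cpp:67] Space 0 Part 0 start send snapshot of commitLogId 220604815 commitLogTerm 7, rate limited to 10485760, batch size is 524288",
]

for key, count in count_snapshot_sends(sample).items():
    print(key, count)
```

Passing an open file object instead of `sample` works the same way, since the function only iterates over lines. A count that keeps growing for a single commitLogId may suggest that the snapshot transfer is being restarted repeatedly instead of completing, although the log lines alone do not prove that.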

曲先生 · August 27, 2024, 02:34

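After the restart it still didn't work; it was still transferring. The other nodes were all fine. Later I found that the log disk on one of the nodes was full.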
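
A full log disk like this can be caught earlier with a periodic check on each node. Below is a minimal sketch using only the Python standard library; the function names and the 90% threshold are assumptions chosen for illustration.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Return the used percentage of the filesystem that contains path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


def is_nearly_full(path: str, threshold_percent: float = 90.0) -> bool:
    """Return True when the filesystem holding path is at or above the threshold."""
    return disk_usage_percent(path) >= threshold_percent
```

For example, `is_nearly_full("/data/logs")` would check the filesystem behind the log directory used in this thread. The percentage can differ slightly from what `df` reports, because `df` accounts for reserved blocks differently.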

########## logging ##########

The directory to host logging files

--log_dir=/data/logs

Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively

--minloglevel=0

Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging

--v=0

Maximum seconds to buffer the log messages

--logbufsecs=0

Whether to redirect stdout and stderr to separate output files

--redirect_stdout=true

Destination filename of stdout and stderr, which will also reside in log_dir.

--stdout_log_file=metad-stdout.log
--stderr_log_file=metad-stderr.log

Copy log messages at or above this level to stderr in addition to logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are 0, 1, 2, and 3, respectively.

--stderrthreshold=3

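Whether logging files' names contain a timestamp. If using logrotate to rotate logging files, then this should be set to true.

--timestamp_in_logfile_name=true

After cleaning up the logs and restarting, the cluster recovered.
This configuration doesn't seem to include any log rotation settings, such as how many files to keep or for how long. Is there such an option?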
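
The excerpt above shows no retention setting, and its only hint is the comment on `timestamp_in_logfile_name`, which points to logrotate as an external tool. As a rough alternative sketch, assuming that deleting log files by age is acceptable, a script run periodically (for example from cron) could look like the following. `find_old_logs`, `prune_logs`, and `max_age_days` are hypothetical names for illustration, not NebulaGraph options.

```python
import time
from pathlib import Path


def find_old_logs(log_dir, max_age_days, now=None):
    """Return regular files in log_dir last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    old = []
    for entry in Path(log_dir).iterdir():
        # Skip symlinks and directories so only regular files are considered.
        if entry.is_symlink() or not entry.is_file():
            continue
        if entry.stat().st_mtime < cutoff:
            old.append(entry)
    return sorted(old)


def prune_logs(log_dir, max_age_days, dry_run=True):
    """Delete old log files; with dry_run=True only report what would be deleted."""
    old = find_old_logs(log_dir, max_age_days)
    if not dry_run:
        for path in old:
            path.unlink()
    return old
```

Calling `prune_logs("/data/logs", max_age_days=7)` only lists candidates; passing `dry_run=False` actually deletes them. Note that deleting a file that a running process still holds open does not free its disk space until the process closes it.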

system · September 3, 2024, 02:35

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.