nebula add host: adding storage nodes fails

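Nebula version 3.8.0. The cluster runs on 2 servers. After an unexpected power outage the cluster would not start: the meta service on one machine kept failing to start and kept reporting errors.

I then deleted the meta directory under data on both nebula machines, and nebula started successfully. But add host for storage fails: both storage nodes stay offline and the data is gone, even though the storage directory under data is still there. The storage log is below.

Could anyone help take a look: can my nebula data still be recovered? The storage data is still on disk.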

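A 2-machine cluster fails easily, because losing even one node means there is no majority left.
I'm not sure whether it can be recovered... marking this thread to follow.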
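
For reference, the majority arithmetic behind this can be sketched as below. This is only a minimal sketch, assuming the meta service replicates with a Raft-style strict-majority quorum; the helper names `quorum` and `tolerated_failures` are made up for illustration and are not part of nebula.

```python
def quorum(n: int) -> int:
    """Smallest number of replicas that forms a strict majority of n."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """How many replicas can be lost while a majority is still alive."""
    return n - quorum(n)


# Compare a few cluster sizes side by side.
for n in (1, 2, 3, 5):
    print(f"{n} replicas: quorum={quorum(n)}, can lose {tolerated_failures(n)}")
```

Under that assumption, 2 replicas need both nodes alive, so a 2-machine cluster tolerates no more failures than a single machine, and 3 is the smallest size that survives losing one node.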

Got it. I've reinstalled nebula on the servers and am reloading the data now. Later I'll see whether adding more servers helps.

Same here, also nebula 3.8.0

Windows 11 (WSL 2, Ubuntu 22 and 24)

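Unlike the original poster, my case was not caused by a power outage. I run the nebula services with docker-compose in a development environment, and I often shut the machine down directly without bringing the containers down first.

Most of the time shutting down without bringing the containers down causes no problem, but yesterday, when I tried to bring the services up:

  • All 3 storaged services were unhealthy
  • The meta service reported that it could not communicate with Storage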

Error 1

meta errors:

E20250105 07:35:19.738706   146 SaveGraphVersionProcessor.cpp:25] Failed to save graph version, errorCode: E_LEADER_CHANGED
E20250105 07:35:20.744900   146 SaveGraphVersionProcessor.cpp:25] Failed to save graph version, errorCode: E_LEADER_CHANGED
E20250105 07:35:21.746572   146 SaveGraphVersionProcessor.cpp:25] Failed to save graph version, errorCode: E_LEADER_CHANGED
E20250105 07:35:28.478281   146 SaveGraphVersionProcessor.cpp:25] Failed to save graph version, errorCode: E_LEADER_CHANGED
E20250105 07:38:08.194293   146 SaveGraphVersionProcessor.cpp:25] Failed to save graph version, errorCode: E_LEADER_CHANGED
E20250105 07:38:38.511175   146 SaveGraphVersionProcessor.cpp:25] Failed to save graph version, errorCode: E_LEADER_CHANGED

Storage errors:

E20250105 07:34:55.404174    93 MetaClient.cpp:773] RpcResponse exception: apache::thrift::transport::TTransportException: Dropping unsent request. Connection closed after: apache::thrift::transport::TTransportException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E20250105 07:34:55.404657     1 MetaClient.cpp:112] Heartbeat failed, status:RPC failure in MetaClient: apache::thrift::transport::TTransportException: Dropping unsent request. Connection closed after: apache::thrift::transport::TTransportException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connect
E20250105 07:35:08.476162    94 MetaClient.cpp:772] Send request to "metad2":9559, exceed retry limit
E20250105 07:35:08.476723    94 MetaClient.cpp:773] RpcResponse exception: apache::thrift::transport::TTransportException: Dropping unsent request. Connection closed after: apache::thrift::transport::TTransportException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E20250105 07:35:08.477655     1 MetaClient.cpp:112] Heartbeat failed, status:RPC failure in MetaClient: apache::thrift::transport::TTransportException: Dropping unsent request. Connection closed after: apache::thrift::transport::TTransportException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connect
E20250105 07:35:21.496620    95 MetaClient.cpp:772] Send request to "metad1":9559, exceed retry limit
E20250105 07:35:21.497597    95 MetaClient.cpp:773] RpcResponse exception: apache::thrift::transport::TTransportException: Dropping unsent request. Connection closed after: apache::thrift::transport::TTransportException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E20250105 07:35:21.499795     1 MetaClient.cpp:112] Heartbeat failed, status:RPC failure in MetaClient: apache::thrift::transport::TTransportException: Dropping unsent request. Connection closed after: apache::thrift::transport::TTransportException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connect
E20250105 07:35:34.517278    96 MetaClient.cpp:772] Send request to "metad1":9559, exceed retry limit
E20250105 07:35:34.517863    96 MetaClient.cpp:773] RpcResponse exception: apache::thrift::transport::TTransportException: Dropping unsent request. Connection closed after: apache::thrift::transport::TTransportException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E20250105 07:35:34.518396     1 MetaClient.cpp:112] Heartbeat failed, status:RPC failure in MetaClient: apache::thrift::transport::TTransportException: Dropping unsent request. Connection closed after: apache::thrift::transport::TTransportException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connect
E20250105 07:35:47.599869    48 MetaClient.cpp:772] Send request to "metad1":9559, exceed retry limit
E20250105 07:35:47.601718    48 MetaClient.cpp:773] RpcResponse exception: apache::thrift::transport::TTransportException: Dropping unsent request. Connection closed after: apache::thrift::transport::TTransportException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E20250105 07:35:47.604538     1 MetaClient.cpp:112] Heartbeat failed, status:RPC failure in MetaClient: apache::thrift::transport::TTransportException: Dropping unsent request. Connection closed after: apache::thrift::transport::TTransportException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connect
E20250105 07:38:06.199206    54 MetaClient.cpp:772] Send request to "metad2":9559, exceed retry limit
E20250105 07:38:06.199898    54 MetaClient.cpp:773] RpcResponse exception: apache::thrift::transport::TTransportException: TTransportException: Timed out
E20250105 07:38:06.200489     1 MetaClient.cpp:112] Heartbeat failed, status:RPC failure in MetaClient: apache::thrift::transport::TTransportException: TTransportException: Timed out
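
To make logs like the ones above easier to skim, the lines can be grouped by severity and source location. The following is a rough sketch, assuming the lines follow the usual glog prefix seen above (severity letter, date, time, thread id, `file:line]`); `GLOG_RE` and `summarize` are hypothetical names, and the sample lines are copied from this post.

```python
import re
from collections import Counter

# glog prefix: severity letter, YYYYMMDD, HH:MM:SS.micros, thread id, file:line]
GLOG_RE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^:\s]+:\d+)\] (?P<msg>.*)$"
)


def summarize(lines):
    """Count log lines per (severity, source location); skip unparseable lines."""
    counts = Counter()
    for line in lines:
        m = GLOG_RE.match(line.strip())
        if m:
            counts[(m["level"], m["src"])] += 1
    return counts


sample = [
    'E20250105 07:35:19.738706   146 SaveGraphVersionProcessor.cpp:25] Failed to save graph version, errorCode: E_LEADER_CHANGED',
    'E20250105 07:35:08.476162    94 MetaClient.cpp:772] Send request to "metad2":9559, exceed retry limit',
    'E20250105 07:38:06.200489     1 MetaClient.cpp:112] Heartbeat failed, status:RPC failure in MetaClient: apache::thrift::transport::TTransportException: TTransportException: Timed out',
    'I20250106 03:19:23.378823     1 StorageDaemon.cpp:147] data path= /data/storage',
]

# Print the most frequent (severity, location) pairs first.
for (level, src), n in summarize(sample).most_common():
    print(level, src, n)
```

Passing an open log file to `summarize` instead of `sample` gives the same kind of summary for a full log, which helps show whether the heartbeat failures or the leader-change errors dominate.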

Error 2:

I20250106 03:19:23.378823     1 StorageDaemon.cpp:147] data path= /data/storage
I20250106 03:19:23.396061     1 MetaClient.cpp:80] Create meta client to "metad0":9559
I20250106 03:19:23.396488     1 MetaClient.cpp:81] root path: /usr/local/nebula, data path size: 1
W20250106 03:19:23.396960     1 FileBasedClusterIdMan.cpp:43] Open file failed, error No such file or directory
I20250106 03:19:23.413105    49 ThriftClientManager-inl.h:67] resolve "metad1":9559 as "172.18.0.2":9559
I20250106 03:19:24.415612    49 ThriftClientManager-inl.h:67] resolve "metad2":9559 as "172.18.0.4":9559
I20250106 03:19:25.418324    49 ThriftClientManager-inl.h:67] resolve "metad1":9559 as "172.18.0.2":9559
I20250106 03:19:26.436275    49 ThriftClientManager-inl.h:67] resolve "metad0":9559 as "172.18.0.3":9559
E20250106 03:19:26.437835     1 MetaClient.cpp:112] Heartbeat failed, status:Machine not existed!
I20250106 03:19:26.442168     1 MetaClient.cpp:137] Waiting for the metad to be ready!

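Current workarounds

  1. Option 1: mount docker to a brand-new (empty) data folder
  2. Option 2: mount docker to an earlier backup of the data folder

Both options let the services start normally.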

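My question: is this caused by some critical data being corrupted, so that the storaged services cannot start? Admittedly, apart from a power outage like the original poster's, servers are not normally forced to shut down.

I haven't hit a similar problem in production so far, but running into it in the development environment makes me a bit worried.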

Ubuntu 24.04 LTS

docker-compose is not recommended for production use

Understood