During restore, the storaged or metad service fails to start. Error log: [db/db_impl/db_impl_open.cc:2112] DB::Open() failed: Corruption: Corruption: IO error: No such file or directory: While open a file for random read: /xxx/000024.ldb: No such file or directory in file /xxx/MANIFEST-000020

cccxgit · November 18, 2024, 02:14

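nebula version: 3.6
Deployment: distributed on k8s
Installation method: compiled from source
In production: Y
Hardware
- Mechanical disks (HDD)
- 6U15G
Problem description
I run a nebula-based distributed cluster on k8s in production, with 15 graph spaces created and 500 million vertices and edges imported. While the services were running normally, I executed create snapshot to back up the data. When restoring the metad and storaged services from that backup, storaged or metad intermittently fails to start, with the following error: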

2024/11/11-21:38:38.656812 140578633020992 [WARN] [db/db_impl/db_impl_open.cc:2112] DB::Open() failed: Corruption: Corruption: IO error: No such file or directory: While open a file for random read: /xxx/000024.ldb: No such file or directory in file /xxx/MANIFEST-000020
2024/11/11-21:38:38.656835 140578633020992 [db/db_impl/db_impl.cc:477] Shutdown: canceling all background work
2024/11/11-21:38:38.656893 140578633020992 [db/db_impl/db_impl.cc:677] Shutdown complete
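
For triage, the two paths named in the DB::Open() failure line can be extracted and checked on disk. The following is a minimal Python sketch, assuming the log line has the same shape as the one quoted above. The names `parse_open_failure` and `describe` are hypothetical helpers for illustration and are not part of nebula or RocksDB.

```python
import re
from pathlib import Path

# Matches the tail of the "DB::Open() failed" line quoted above.
# Assumption: neither path contains whitespace.
OPEN_FAILED = re.compile(
    r"While open a file for random read: (?P<missing>\S+): "
    r"No such file or directory in file (?P<manifest>\S+)"
)


def parse_open_failure(line):
    """Return (missing_file, manifest) from a log line, or None if no match."""
    match = OPEN_FAILED.search(line)
    if match is None:
        return None
    return match.group("missing"), match.group("manifest")


def describe(line):
    """Report whether the referenced files currently exist on this machine."""
    parsed = parse_open_failure(line)
    if parsed is None:
        return None
    missing, manifest = (Path(p) for p in parsed)
    return {
        "missing": str(missing),
        "manifest": str(manifest),
        "missing_exists": missing.exists(),
        "manifest_exists": manifest.exists(),
        # Both files are expected to sit in the same RocksDB directory.
        "same_directory": missing.parent == manifest.parent,
    }
```

If `manifest_exists` is true while `missing_exists` is false, the MANIFEST on disk references a data file that is not present in the restored directory, which is the condition the error reports.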

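Other information:
(1) Across repeated restore tests, storaged fails to start more often than metad.
(2) This cluster does not use bragent for backup and restore; we use a solution developed in-house. The restore procedure is: 1. Download the compressed snapshot backup files from the remote storage machine into the storaged and metad containers. 2. Stop the storaged and metad services with nebula.service stop, then extract the snapshot files into the configured data directory of storaged or metad (storaged holds multiple graph spaces, so there are multiple snapshot archives, which are extracted in parallel by multiple threads). 3. Once extraction finishes, start the services with nebula.service start (services are started per node: when both storaged and metad on a node have finished extracting, they are started together).
(3) Node performance differs across machines, so the services start at different times, with a gap of possibly 10 minutes.
(4) The snapshot files downloaded from the remote storage machine have been verified as intact (MD5 check), and no extraction has failed.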
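
The verification and parallel extraction steps described in (2) and (4) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the actual in-house tooling: it assumes the snapshot archives are tar files (the archive format is not given in the post), and `md5_of`, `extract_one`, and `restore_all` are hypothetical names. Stopping and starting the services is left out.

```python
import hashlib
import tarfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def extract_one(archive, expected_md5, data_dir):
    """Verify one archive, extract it, and return the file names it contained."""
    if md5_of(archive) != expected_md5:
        raise ValueError(f"checksum mismatch: {archive}")
    # Use the safer "data" extraction filter where this Python version has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive) as tar:
        names = [m.name for m in tar.getmembers() if m.isfile()]
        tar.extractall(data_dir, **kwargs)
    return names


def restore_all(jobs, max_workers=4):
    """jobs: list of (archive, expected_md5, data_dir) tuples.

    Extracts all archives in parallel, waits for every worker to finish,
    then returns the paths of any archived files missing from disk.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(extract_one, *job) for job in jobs]
        extracted = [f.result() for f in futures]  # re-raises worker errors
    missing = []
    for (_, _, data_dir), names in zip(jobs, extracted):
        for name in names:
            target = Path(data_dir) / name
            if not target.is_file():
                missing.append(str(target))
    return missing
```

The point of the design is that the caller only proceeds to start a service once `restore_all` has returned an empty list, that is, after every extraction thread has finished and every archived file is confirmed present in the data directory.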
Related issues found:
(1) The rocksdb GitHub community has several issues describing similar problems, all still open:

https://github.com/facebook/rocksdb/issues/10258
https://github.com/facebook/rocksdb/issues/10357

(2) At the bottom of the thread https://github.com/facebook/rocksdb/issues/10357 there appears to be a solution.

Please help analyze this. Thank you.

system · December 18, 2024, 02:15

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.