k8s deployment: storaged cannot recover after pod scaling and restart

  • Nebula version: vesoft/nebula-storaged:v2-nightly
  • Deployment (distributed / standalone / Docker / DBaaS): k8s

I scaled the pods and restarted them, but in the end none of the hosts came back in show hosts:
show hosts;
[ERROR (-8)]: No hosts!
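
For reference only (these are not necessarily the exact commands used here), "scaling and restarting the pods" on k8s usually means scaling the storaged StatefulSet and letting it recreate the pods; the name nebula-storaged is inferred from the log file names later in this thread:

# Illustrative sketch; adjust names and namespace to your deployment.
kubectl get statefulset nebula-storaged
kubectl scale statefulset nebula-storaged --replicas=1
kubectl delete pod nebula-storaged-0   # the StatefulSet recreates the pod
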

I0413 13:43:33.326176    68 RaftPart.cpp:1043] [Port: 9780, Space: 179, Part: 22] Start leader election, reason: lastMsgDur 30930, term 40
I0413 13:43:33.326208    68 RaftPart.cpp:1164] [Port: 9780, Space: 179, Part: 22] Start leader election...
I0413 13:43:33.326227    68 RaftPart.cpp:1206] [Port: 9780, Space: 179, Part: 22] No peer found, I will be the leader
I0413 13:43:33.326234    68 RaftPart.cpp:1152] [Port: 9780, Space: 179, Part: 22] Partition is elected as the new leader for term 41
I0413 13:43:33.326241    68 RaftPart.cpp:1247] [Port: 9780, Space: 179, Part: 22] The partition is elected as the leader
I0413 13:43:33.326261    65 Part.cpp:187] Being elected as the leader for the term 41
I0413 13:43:33.326450    68 RaftPart.cpp:1043] [Port: 9780, Space: 158, Part: 22] Start leader election, reason: lastMsgDur 33239, term 40
I0413 13:43:33.326488    68 RaftPart.cpp:1164] [Port: 9780, Space: 158, Part: 22] Start leader election...
I0413 13:43:33.326517    68 RaftPart.cpp:1206] [Port: 9780, Space: 158, Part: 22] No peer found, I will be the leader
I0413 13:43:33.326548    68 RaftPart.cpp:1152] [Port: 9780, Space: 158, Part: 22] Partition is elected as the new leader for term 41
I0413 13:43:33.326584    68 RaftPart.cpp:1247] [Port: 9780, Space: 158, Part: 22] The partition is elected as the leader
I0413 13:43:33.326609    67 Part.cpp:187] Being elected as the leader for the term 41
I0413 13:43:33.326877    68 RaftPart.cpp:1043] [Port: 9780, Space: 158, Part: 46] Start leader election, reason: lastMsgDur 32493, term 40
I0413 13:43:33.326885    68 RaftPart.cpp:1164] [Port: 9780, Space: 158, Part: 46] Start leader election...
I0413 13:43:33.326900    68 RaftPart.cpp:1206] [Port: 9780, Space: 158, Part: 46] No peer found, I will be the leader
I0413 13:43:33.326906    68 RaftPart.cpp:1152] [Port: 9780, Space: 158, Part: 46] Partition is elected as the new leader for term 41
I0413 13:43:33.326913    68 RaftPart.cpp:1247] [Port: 9780, Space: 158, Part: 46] The partition is elected as the leader
I0413 13:43:33.326930    65 Part.cpp:187] Being elected as the leader for the term 41
I0413 13:43:33.327782    68 RaftPart.cpp:1043] [Port: 9780, Space: 158, Part: 52] Start leader election, reason: lastMsgDur 33492, term 40
I0413 13:43:33.327790    68 RaftPart.cpp:1164] [Port: 9780, Space: 158, Part: 52] Start leader election...
I0413 13:43:33.327805    68 RaftPart.cpp:1206] [Port: 9780, Space: 158, Part: 52] No peer found, I will be the leader
I0413 13:43:33.327811    68 RaftPart.cpp:1152] [Port: 9780, Space: 158, Part: 52] Partition is elected as the new leader for term 41
I0413 13:43:33.327818    68 RaftPart.cpp:1247] [Port: 9780, Space: 158, Part: 52] The partition is elected as the leader
I0413 13:43:33.327834    67 Part.cpp:187] Being elected as the leader for the term 41
I0413 13:43:33.328028    68 RaftPart.cpp:1043] [Port: 9780, Space: 158, Part: 58] Start leader election, reason: lastMsgDur 32135, term 40
I0413 13:43:33.328064    68 RaftPart.cpp:1164] [Port: 9780, Space: 158, Part: 58] Start leader election...
I0413 13:43:33.328132    68 RaftPart.cpp:1206] [Port: 9780, Space: 158, Part: 58] No peer found, I will be the leader
I0413 13:43:33.328140    68 RaftPart.cpp:1152] [Port: 9780, Space: 158, Part: 58] Partition is elected as the new leader for term 41
I0413 13:43:33.328147    68 RaftPart.cpp:1247] [Port: 9780, Space: 158, Part: 58] The partition is elected as the leader
I0413 13:43:33.328168    65 Part.cpp:187] Being elected as the leader for the term 41
I0413 13:43:33.328435    68 RaftPart.cpp:1043] [Port: 9780, Space: 179, Part: 76] Start leader election, reason: lastMsgDur 31153, term 40
I0413 13:43:33.328464    68 RaftPart.cpp:1164] [Port: 9780, Space: 179, Part: 76] Start leader election...
I0413 13:43:33.328480    68 RaftPart.cpp:1206] [Port: 9780, Space: 179, Part: 76] No peer found, I will be the leader
I0413 13:43:33.328487    68 RaftPart.cpp:1152] [Port: 9780, Space: 179, Part: 76] Partition is elected as the new leader for term 41
I0413 13:43:33.328495    68 RaftPart.cpp:1247] [Port: 9780, Space: 179, Part: 76] The partition is elected as the leader
I0413 13:43:33.328516    67 Part.cpp:187] Being elected as the leader for the term 41
I0413 13:43:33.329351    68 Part.cpp:187] Being elected as the leader for the term 41
I0413 13:43:33.959661    65 RaftPart.cpp:1043] [Port: 9780, Space: 179, Part: 46] Start leader election, reason: lastMsgDur 30837, term 40
I0413 13:43:33.959689    65 RaftPart.cpp:1164] [Port: 9780, Space: 179, Part: 46] Start leader election...
I0413 13:43:33.959707    65 RaftPart.cpp:1206] [Port: 9780, Space: 179, Part: 46] No peer found, I will be the leader
I0413 13:43:33.959715    65 RaftPart.cpp:1152] [Port: 9780, Space: 179, Part: 46] Partition is elected as the new leader for term 41
I0413 13:43:33.959728    65 RaftPart.cpp:1247] [Port: 9780, Space: 179, Part: 46] The partition is elected as the leader

@nextflow Hi, could you describe exactly what you did?
How many replicas did you install with? What did you do afterwards, and how many replicas did you scale to each time?

It was 3 replicas. The logs grew so large that the disk filled up, so I redeployed, but the cluster never recovered.
Later I changed it to a single replica, and that did not work either.

Right now the pods keep restarting.
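
(For diagnosis, a common way to see why the pods keep restarting; the pod name below is an assumption, adjust to your deployment:)

kubectl describe pod nebula-storaged-0      # check the Events section and the last termination reason / exit code
kubectl logs nebula-storaged-0 --previous   # logs from the previous (crashed) container
kubectl get pods -w                         # watch the RESTARTS column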

@nextflow Is the metad service healthy? Does show hosts meta return normal output?

show hosts meta;
[ERROR (-8)]: No hosts!

There is one entry in the error log; I am not sure whether it is related:
W0413 13:55:29.247143 1 FileBasedClusterIdMan.cpp:46] Open file failed, error No such file or directory
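
That warning means storaged could not open its cluster ID file. As a hedged check (the pod name is an assumption, and the file location depends on the cluster_id_path flag, which by default points at a cluster.id file relative to the working directory):

kubectl exec nebula-storaged-0 -- sh -c 'find / -name cluster.id 2>/dev/null'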

Which service is that log from?

It is from the storaged service.

Is there any way to migrate the data quickly, i.e. copy it over to a new cluster?

Did you touch the data when you redeployed?

Could you paste storage1's log, starting from startup?

total 50G
lrwxrwxrwx 1 root root   65 Apr 13 14:39 nebula-storaged.INFO -> nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-143934.1
lrwxrwxrwx 1 root root   68 Apr 13 13:55 nebula-storaged.WARNING -> nebula-storaged.nebula-storaged-0.root.log.WARNING.20210413-135529.1
-rw-r--r-- 1 root root 745M Apr 13 13:18 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-131659.1
-rw-r--r-- 1 root root 731M Apr 13 13:19 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-131820.1
-rw-r--r-- 1 root root 599M Apr 13 13:21 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-131940.1
-rw-r--r-- 1 root root 553M Apr 13 13:22 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-132100.1
-rw-r--r-- 1 root root 441M Apr 13 13:23 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-132220.1
-rw-r--r-- 1 root root 424M Apr 13 13:25 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-132340.1
-rw-r--r-- 1 root root 434M Apr 13 13:26 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-132500.1
-rw-r--r-- 1 root root 1.8G Apr 13 13:41 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-132915.1
-rw-r--r-- 1 root root 1.8G Apr 13 13:44 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-134128.1
-rw-r--r-- 1 root root 1.8G Apr 13 13:51 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-134421.1
-rw-r--r-- 1 root root 1.5G Apr 13 13:53 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-135125.1
-rw-r--r-- 1 root root 717M Apr 13 13:55 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-135407.1
-rw-r--r-- 1 root root 1.8G Apr 13 13:58 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-135529.1
-rw-r--r-- 1 root root 1.8G Apr 13 14:07 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-135836.1
-rw-r--r-- 1 root root 1.8G Apr 13 14:10 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-140706.1
-rw-r--r-- 1 root root 1.8G Apr 13 14:16 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-141051.1
-rw-r--r-- 1 root root 1.8G Apr 13 14:20 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-141655.1
-rw-r--r-- 1 root root 1.8G Apr 13 14:26 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-142034.1
-rw-r--r-- 1 root root 1.8G Apr 13 14:30 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-142653.1
-rw-r--r-- 1 root root 1.8G Apr 13 14:39 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-143033.1
-rw-r--r-- 1 root root 1.7G Apr 13 14:43 nebula-storaged.nebula-storaged-0.root.log.INFO.20210413-143934.1
-rw-r--r-- 1 root root  255 Apr 13 13:16 nebula-storaged.nebula-storaged-0.root.log.WARNING.20210413-131659.1
-rw-r--r-- 1 root root  255 Apr 13 13:18 nebula-storaged.nebula-storaged-0.root.log.WARNING.20210413-131820.1
-rw-r--r-- 1 root root  255 Apr 13 13:19 nebula-storaged.nebula-storaged-0.root.log.WARNING.20210413-131940.1
-rw-r--r-- 1 root root  255 Apr 13 13:21 nebula-storaged.nebula-storaged-0.root.log.WARNING.20210413-132100.1
-rw-r--r-- 1 root root  255 Apr 13 13:22 nebula-storaged.nebula-storaged-0.root.log.WARNING.20210413-132220.1
-rw-r--r-- 1 root root  255 Apr 13 13:23 nebula-storaged.nebula-storaged-0.root.log.WARNING.20210413-132340.1
-rw-r--r-- 1 root root  255 Apr 13 13:25 nebula-storaged.nebula-storaged-0.root.log.WARNING.20210413-132500.1
-rw-r--r-- 1 root root  255 Apr 13 13:29 nebula-storaged.nebula-storaged-0.root.log.WARNING.20210413-132915.1
-rw-r--r-- 1 root root  255 Apr 13 13:54 nebula-storaged.nebula-storaged-0.root.log.WARNING.20210413-135407.1
-rw-r--r-- 1 root root  255 Apr 13 13:55 nebula-storaged.nebula-storaged-0.root.log.WARNING.20210413-135529.1
-rw-r--r-- 1 root root  23G Apr 13 14:43 stderr.log
-rw-r--r-- 1 root root    0 Apr 13 13:16 stdout.log

Which log file do you need? They are too large.

If I change from 3 replicas to 1 replica, is there any way to recover?

  1. Isn't your log level set too high… (see the sketch after this list)
  2. We need the INFO log from after a restart (restart the service manually).
  3. stderr.log, just the part after the restart, is enough.
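
For item 1, Nebula services read glog-style flags from their .conf files; a minimal sketch of dialing the storaged log volume back down (the values are suggestions, and the flag set shipped in the k8s image's nebula-storaged.conf may differ slightly):

# nebula-storaged.conf (gflags file)
--minloglevel=0       # 0 keeps INFO and above; raise to 1 to drop INFO logs entirely
--v=0                 # verbose (VLOG) level; higher values produce the huge INFO files listed above
--stderr_threshold=2  # copy only ERROR and above to stderr, so stderr.log stops growing to tens of GB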

Q: If I change from 3 replicas to 1 replica, is there any way to recover?

There is no such feature, I'm afraid…

First, this should be the storaged log from after you shrank three nodes down to one, right?
When you went from three storaged instances to one, did you also change the meta address to a single metad? If so, the cluster ID has probably changed and the other storaged instances can no longer connect; you need to delete the cluster.id file so that storaged can rejoin.
Second, meta still records three storaged hosts, but only one is left now, which is why the storaged log keeps reporting that no peer can be found.
Scaling directly from multiple storaged instances down to one is not currently supported; if you have already done that, the data is probably no longer readable.
If you only need to restore the service, you can delete meta's data directory and restart all the services; after that it should work normally.
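
A rough sketch of the steps just described, assuming a typical single-replica layout; the pod names, the /usr/local/nebula install path, and the data/meta location are assumptions, and step 2 throws away all metadata (spaces, schema), so only use it if losing the data is acceptable:

# 1. Remove the stale cluster.id so storaged can register with the current metad.
kubectl exec nebula-storaged-0 -- sh -c 'find / -name cluster.id 2>/dev/null'   # locate it first
kubectl exec nebula-storaged-0 -- rm /usr/local/nebula/cluster.id               # path is an assumption

# 2. Only to get a working service again (metadata is lost!): wipe metad's data directory.
kubectl exec nebula-metad-0 -- rm -rf /usr/local/nebula/data/meta

# 3. Restart everything so the services re-register with meta.
kubectl delete pod nebula-metad-0 nebula-storaged-0 nebula-graphd-0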

Both storaged and metad were scaled from 3 down to 1, so the cluster ID has probably changed.
Right now my main goal is to recover the data; even a single node's data would be fine, and partial data loss is acceptable.
What is the fastest way to do that in this situation?