balance data: data was not migrated

  • nebula version: 2.5.1

  • Deployment: distributed

  • Installation: compiled from source

  • Production environment: Y

  • Hardware

    • SSD
    • 20 * CPU 16c, memory 64G
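  • Problem description:
    After adding 7 storage nodes, I ran balance data. Within 1 minute every subtask showed success, and then all data became unqueryable. Logging in to the newly added storage nodes and checking the data and wal directories, I found no data files for any partition.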

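  • Additional information:
    · All data was imported via SST files
    · The problem occurs with both a single replica and three replicas
    · Because of the capacity limit of a single disk, each storage node is configured with multiple paths: --data_path=/data/graphdb/storage/disk1,/data/graphdb/storage/disk2,/data/graphdb/storage/disk3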

  • meta log

I0115 01:06:52.248580  1567 Balancer.cpp:43] Start to invoke balance plan 1642179863
I0115 01:06:52.259410 854836 BalanceTask.cpp:216] [1642179863, 1:98, 11.145.4.123:22509->30.42.50.151:22509] Part has been moved successfully!
I0115 01:06:52.261832  1567 HBProcessor.cpp:35] Receive heartbeat from "11.145.4.199":22509, role = STORAGE
I0115 01:06:52.265578 854836 BalanceTask.cpp:46] [1642179863, 1:98, 11.145.5.72:22509->30.42.50.211:22509] Start to move part, check the peers firstly!
I0115 01:06:52.279281 854836 BalanceTask.cpp:216] [1642179863, 1:93, 11.145.5.171:22509->30.42.50.211:22509] Part has been moved successfully!
I0115 01:06:52.281384  7197 BalanceTask.cpp:63] [1642179863, 1:98, 11.145.5.72:22509->30.42.50.211:22509] Ask the src to give up the leadership.
I0115 01:06:52.285619 854836 BalanceTask.cpp:216] [1642179863, 1:93, 30.42.41.101:22509->30.42.74.36:22509] Part has been moved successfully!
I0115 01:06:52.293191  7197 BalanceTask.cpp:90] [1642179863, 1:98, 11.145.5.72:22509->30.42.50.211:22509] Open the part as learner on dst.
I0115 01:06:52.298393 854836 BalanceTask.cpp:46] [1642179863, 1:93, 30.42.54.173:22509->30.42.74.26:22509] Start to move part, check the peers firstly!
I0115 01:06:52.307433  7197 BalanceTask.cpp:104] [1642179863, 1:98, 11.145.5.72:22509->30.42.50.211:22509] Add learner dst.
I0115 01:06:52.311659 854836 BalanceTask.cpp:216] [1642179863, 1:184, 11.145.4.123:22509->30.42.50.151:22509] Part has been moved successfully!
I0115 01:06:52.313736  7200 BalanceTask.cpp:63] [1642179863, 1:93, 30.42.54.173:22509->30.42.74.26:22509] Ask the src to give up the leadership.
I0115 01:06:52.318099  7197 AdminClient.cpp:452] Return leader change from "11.145.5.72":22508, new leader is "11.145.5.96":22508, retry 0, limit 30
I0115 01:06:52.320493  7197 BalanceTask.cpp:118] [1642179863, 1:98, 11.145.5.72:22509->30.42.50.211:22509] Waiting for the data catch up.
I0115 01:06:52.326460 854836 BalanceTask.cpp:216] [1642179863, 1:184, 11.145.4.151:22509->30.42.50.170:22509] Part has been moved successfully!
I0115 01:06:52.327986  7200 BalanceTask.cpp:90] [1642179863, 1:93, 30.42.54.173:22509->30.42.74.26:22509] Open the part as learner on dst.
I0115 01:06:52.333217  7197 AdminClient.cpp:452] Return leader change from "11.145.5.72":22508, new leader is "11.145.5.96":22508, retry 0, limit 3
I0115 01:06:52.334256  7197 BalanceTask.cpp:132] [1642179863, 1:98, 11.145.5.72:22509->30.42.50.211:22509] Send member change request to the leader, it will add the new member on dst host
I0115 01:06:52.338730 854836 BalanceTask.cpp:216] [1642179863, 1:184, 11.145.4.199:22509->30.42.74.36:22509] Part has been moved successfully!
I0115 01:06:52.340914  7200 BalanceTask.cpp:104] [1642179863, 1:93, 30.42.54.173:22509->30.42.74.26:22509] Add learner dst.
I0115 01:06:52.345783  7197 AdminClient.cpp:452] Return leader change from "11.145.5.72":22508, new leader is "11.145.5.96":22508, retry 0, limit 30
I0115 01:06:52.350668  7197 BalanceTask.cpp:147] [1642179863, 1:98, 11.145.5.72:22509->30.42.50.211:22509] Send member change request to the leader, it will remove the old member on src host
I0115 01:06:52.351603 854836 BalanceTask.cpp:216] [1642179863, 1:175, 11.145.4.200:22509->30.42.50.85:22509] Part has been moved successfully!
I0115 01:06:52.352754  7200 AdminClient.cpp:452] Return leader change from "30.42.54.173":22508, new leader is "30.42.74.36":22508, retry 0, limit 30
I0115 01:06:52.355341  7200 BalanceTask.cpp:118] [1642179863, 1:93, 30.42.54.173:22509->30.42.74.26:22509] Waiting for the data catch up.
I0115 01:06:52.358749  7197 AdminClient.cpp:452] Return leader change from "11.145.5.72":22508, new leader is "11.145.5.96":22508, retry 0, limit 30
I0115 01:06:52.362947  7197 BalanceTask.cpp:163] [1642179863, 1:98, 11.145.5.72:22509->30.42.50.211:22509] Update meta for part.
I0115 01:06:52.363605 854836 BalanceTask.cpp:216] [1642179863, 1:175, 11.145.5.121:22509->30.42.50.84:22509] Part has been moved successfully!
I0115 01:06:52.364440  7200 AdminClient.cpp:452] Return leader change from "30.42.54.173":22508, new leader is "30.42.74.36":22508, retry 0, limit 3
I0115 01:06:52.366465  7200 BalanceTask.cpp:132] [1642179863, 1:93, 30.42.54.173:22509->30.42.74.26:22509] Send member change request to the leader, it will add the new member on dst host
I0115 01:06:52.370198  7197 AdminClient.cpp:197] [space:1, part:98] Update original peers "11.145.5.72":22509,"11.145.5.96":22509,"30.42.50.151":22509, remove "11.145.5.72":22509, add "30.42.50.211":22509
I0115 01:06:52.370225  7197 AdminClient.cpp:203] remove [1, 98] from "11.145.5.72":22509
I0115 01:06:52.370234  7197 AdminClient.cpp:210] add [1, 98] to "30.42.50.211":22509
I0115 01:06:52.376374 854836 BalanceTask.cpp:46] [1642179863, 1:175, 11.145.5.171:22509->30.42.50.211:22509] Start to move part, check the peers firstly!
I0115 01:06:52.377463  7200 AdminClient.cpp:452] Return leader change from "30.42.54.173":22508, new leader is "30.42.74.36":22508, retry 0, limit 30
I0115 01:06:52.379523  1567 HBProcessor.cpp:35] Receive heartbeat from "11.145.4.151":22509, role = STORAGE
I0115 01:06:52.381835  7200 BalanceTask.cpp:147] [1642179863, 1:93, 30.42.54.173:22509->30.42.74.26:22509] Send member change request to the leader, it will remove the old member on src host
I0115 01:06:52.382894  7201 BalanceTask.cpp:173] [1642179863, 1:98, 11.145.5.72:22509->30.42.50.211:22509] Update meta succeeded!
I0115 01:06:52.383102  7201 BalanceTask.cpp:182] [1642179863, 1:98, 11.145.5.72:22509->30.42.50.211:22509] Close part on src host, srcLived.
I0115 01:06:52.389657 854836 BalanceTask.cpp:216] [1642179863, 1:181, 11.145.5.33:22509->30.42.50.170:22509] Part has been moved successfully!
I0115 01:06:52.390666  7200 AdminClient.cpp:452] Return leader change from "30.42.54.173":22508, new leader is "30.42.74.36":22508, retry 0, limit 30
I0115 01:06:52.391752  7204 BalanceTask.cpp:63] [1642179863, 1:175, 11.145.5.171:22509->30.42.50.211:22509] Ask the src to give up the leadership.
I0115 01:06:52.394785  7200 BalanceTask.cpp:163] [1642179863, 1:93, 30.42.54.173:22509->30.42.74.26:22509] Update meta for part.
I0115 01:06:52.401362  7201 BalanceTask.cpp:202] [1642179863, 1:98, 11.145.5.72:22509->30.42.50.211:22509] Check the peers...
I0115 01:06:52.401628  7200 AdminClient.cpp:197] [space:1, part:93] Update original peers "30.42.54.173":22509,"30.42.50.211":22509,"30.42.74.36":22509, remove "30.42.54.173":22509, add "30.42.74.26":22509
I0115 01:06:52.401660  7200 AdminClient.cpp:203] remove [1, 93] from "30.42.54.173":22509
I0115 01:06:52.401672  7200 AdminClient.cpp:210] add [1, 93] to "30.42.74.26":22509
I0115 01:06:52.401980 854836 BalanceTask.cpp:216] [1642179863, 1:181, 11.145.5.72:22509->30.42.50.151:22509] Part has been moved successfully!
I0115 01:06:52.402979  7204 BalanceTask.cpp:90] [1642179863, 1:175, 11.145.5.171:22509->30.42.50.211:22509] Open the part as learner on dst.
I0115 01:06:52.411836  7201 BalanceTask.cpp:216] [1642179863, 1:98, 11.145.5.72:22509->30.42.50.211:22509] Part has been moved successfully!
I0115 01:06:52.415293  7205 BalanceTask.cpp:173] [1642179863, 1:93, 30.42.54.173:22509->30.42.74.26:22509] Update meta succeeded!
I0115 01:06:52.415303 854836 BalanceTask.cpp:216] [1642179863, 1:181, 11.145.5.96:22509->30.42.74.26:22509] Part has been moved successfully!
I0115 01:06:52.415431  7205 BalanceTask.cpp:182] [1642179863, 1:93, 30.42.54.173:22509->30.42.74.26:22509] Close part on src host, srcLived.
I0115 01:06:52.417254  7204 BalanceTask.cpp:104] [1642179863, 1:175, 11.145.5.171:22509->30.42.50.211:22509] Add learner dst.
I0115 01:06:52.421459  7201 BalanceTask.cpp:46] [1642179863, 1:98, 11.145.5.96:22509->30.42.74.26:22509] Start to move part, check the peers firstly!
I0115 01:06:52.427121 854836 BalanceTask.cpp:216] [1642179863, 1:164, 11.145.5.121:22509->30.42.50.84:22509] Part has been moved successfully!
I0115 01:06:52.428076  1567 HBProcessor.cpp:35] Receive heartbeat from "11.134.210.102":13708, role = GRAPH
I0115 01:06:52.428192  7204 AdminClient.cpp:452] Return leader change from "11.145.5.171":22508, new leader is "30.42.50.84":22508, retry 0, limit 30
I0115 01:06:52.430778  7204 BalanceTask.cpp:118] [1642179863, 1:175, 11.145.5.171:22509->30.42.50.211:22509] Waiting for the data catch up.
I0115 01:06:52.433650  7205 BalanceTask.cpp:202] [1642179863, 1:93, 30.42.54.173:22509->30.42.74.26:22509] Check the peers...
I0115 01:06:52.435528  7201 BalanceTask.cpp:63] [1642179863, 1:98, 11.145.5.96:22509->30.42.74.26:22509] Ask the src to give up the leadership.
I0115 01:06:52.439270 854836 BalanceTask.cpp:216] [1642179863, 1:164, 11.145.5.171:22509->30.42.50.211:22509] Part has been moved successfully!
I0115 01:06:52.440263  7204 AdminClient.cpp:452] Return leader change from "11.145.5.171":22508, new leader is "30.42.50.84":22508, retry 0, limit 3
I0115 01:06:52.442173  7204 BalanceTask.cpp:132] [1642179863, 1:175, 11.145.5.171:22509->30.42.50.211:22509] Send member change request to the leader, it will add the new member on dst host
I0115 01:06:52.448174  7205 BalanceTask.cpp:216] [1642179863, 1:93, 30.42.54.173:22509->30.42.74.26:22509] Part has been moved successfully!
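
To see how quickly each task in the meta log above went from start to success, the BalanceTask.cpp lines can be paired up per task. The following is a minimal sketch, assuming the line layout shown in this excerpt; `TASK_LINE` and `task_durations` are hypothetical names and not part of Nebula. It does not handle logs that cross midnight.

```python
import re
from datetime import datetime

# Matches BalanceTask.cpp lines as they appear in the excerpt above, e.g.
# "I0115 01:06:52.265578 854836 BalanceTask.cpp:46] [plan, space:part, src->dst] message"
TASK_LINE = re.compile(
    r"^I\d{4} (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ BalanceTask\.cpp:\d+\] "
    r"\[(\d+), (\d+:\d+), (\S+)->(\S+)\] (.*)$"
)


def task_durations(lines):
    """Map (space:part, src, dst) to seconds between the start and success lines."""
    started = {}
    durations = {}
    for line in lines:
        match = TASK_LINE.match(line.strip())
        if not match:
            continue  # heartbeat, AdminClient, and other lines are ignored
        timestamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        key = (match.group(3), match.group(4), match.group(5))
        message = match.group(6)
        if message.startswith("Start to move part"):
            started[key] = timestamp
        elif message.startswith("Part has been moved successfully") and key in started:
            durations[key] = (timestamp - started[key]).total_seconds()
    return durations
```

Applied to the excerpt, the task for part 1:98 from 11.145.5.72:22509 to 30.42.50.211:22509 starts at 01:06:52.265578 and reports success at 01:06:52.411836, about 0.146 seconds later. Tasks whose start line falls outside the pasted excerpt are skipped.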
  • storage log (newly added node)
I0115 01:07:01.143492  1218 RaftPart.cpp:494] [Port: 22510, Space: 1, Part: 141] The host "30.42.50.170":22510 has been existed as learner, promote it!
I0115 01:07:01.155961  1218 Part.cpp:426] [Port: 22510, Space: 1, Part: 141] preprocess remove peer "30.42.54.173":22510
I0115 01:07:01.155979  1218 Host.h:32] [Port: 22510, Space: 1, Part: 141] [Host: 30.42.54.173:22510]  The host has been destroyed!
I0115 01:07:01.155987  1218 RaftPart.cpp:524] [Port: 22510, Space: 1, Part: 141] Remove peer "30.42.54.173":22510
I0115 01:07:01.196606  1218 AdminProcessor.h:329] Check peers for space 1, part 141
I0115 01:07:01.196630  1218 RaftPart.cpp:2064] [Port: 22510, Space: 1, Part: 141] Check host "30.42.50.211":22510
I0115 01:07:01.196637  1218 RaftPart.cpp:2064] [Port: 22510, Space: 1, Part: 141] Check host "30.42.50.170":22510
I0115 01:07:01.196645  1218 RaftPart.cpp:2072] [Port: 22510, Space: 1, Part: 141] Add peer "30.42.50.211":22510 if not exist!
I0115 01:07:01.196651  1218 RaftPart.cpp:499] [Port: 22510, Space: 1, Part: 141] The host "30.42.50.211":22510 has been existed as follower!
I0115 01:07:01.196658  1218 RaftPart.cpp:2072] [Port: 22510, Space: 1, Part: 141] Add peer "30.42.50.151":22510 if not exist!
I0115 01:07:01.196666  1218 RaftPart.cpp:481] [Port: 22510, Space: 1, Part: 141] I am already in the raft group!
I0115 01:07:01.196671  1218 RaftPart.cpp:2072] [Port: 22510, Space: 1, Part: 141] Add peer "30.42.50.170":22510 if not exist!
I0115 01:07:01.196677  1218 RaftPart.cpp:499] [Port: 22510, Space: 1, Part: 141] The host "30.42.50.170":22510 has been existed as follower!
I0115 01:07:01.596065  1214 SlowOpTracker.h:33] [Port: 22510, Space: 1, Part: 14] total time:59ms, Total send logs: 2
I0115 01:07:02.235462  1214 Part.cpp:390] [Port: 22510, Space: 1, Part: 144] Skip stale add learner "11.145.5.96":22510, the part is opened at 1642180020696, but the log timestamp is 1642179856700
I0115 01:07:02.235505  1214 Part.cpp:416] [Port: 22510, Space: 1, Part: 144] Skip stale add peer "11.145.5.96":22510, the part is opened at 1642180020696, but the log timestamp is 1642179856726
I0115 01:07:02.235517  1214 Part.cpp:429] [Port: 22510, Space: 1, Part: 144] Skip stale remove peer "11.145.5.14":22510, the part is opened at 1642180020696, but the log timestamp is 1642179856844
I0115 01:07:02.235524  1214 Part.cpp:403] [Port: 22510, Space: 1, Part: 144] Skip stale transfer leader "11.145.5.72":22510, the part is opened at 1642180020696, but the log timestamp is 1642179857048
I0115 01:07:02.235534  1214 Part.cpp:307] [Port: 22510, Space: 1, Part: 144] Skip commit stale remove peer "11.145.5.14":22510, the part is opened at 1642180020696, but the log timestamp is 1642179856844
I0115 01:07:02.235541  1214 Part.cpp:295] [Port: 22510, Space: 1, Part: 144] Skip commit stale transfer leader "11.145.5.72":22510, the part is opened at 1642180020696, but the log timestamp is 1642179857048
I0115 01:07:02.237572  1214 Part.cpp:387] [Port: 22510, Space: 1, Part: 144] preprocess add learner "30.42.50.170":22510
I0115 01:07:02.237583  1214 RaftPart.cpp:394] [Port: 22510, Space: 1, Part: 144] The host "30.42.50.170":22510 has been existed as  group member
I0115 01:07:02.237592  1214 Part.cpp:413] [Port: 22510, Space: 1, Part: 144] preprocess add peer "30.42.50.170":22510
I0115 01:07:02.237601  1214 RaftPart.cpp:499] [Port: 22510, Space: 1, Part: 144] The host "30.42.50.170":22510 has been existed as follower!
I0115 01:07:02.237614  1214 Part.cpp:426] [Port: 22510, Space: 1, Part: 144] preprocess remove peer "11.145.5.33":22510
I0115 01:07:02.237622  1214 RaftPart.cpp:515] [Port: 22510, Space: 1, Part: 144] The peer "11.145.5.33":22510 not exist!
I0115 01:07:02.237632  1214 Part.cpp:400] [Port: 22510, Space: 1, Part: 144] preprocess trans leader "11.145.5.96":22510
I0115 01:07:02.237639  1214 RaftPart.cpp:401] [Port: 22510, Space: 1, Part: 144] Pre process transfer leader to "11.145.5.96":22510
I0115 01:07:02.237645  1214 RaftPart.cpp:405] [Port: 22510, Space: 1, Part: 144] I am follower, just wait for the new leader.
I0115 01:07:02.237654  1214 RaftPart.cpp:595] [Port: 22510, Space: 1, Part: 144] I am Follower, skip remove peer in commit
I0115 01:07:02.237660  1214 RaftPart.cpp:429] [Port: 22510, Space: 1, Part: 144] Commit transfer leader to "11.145.5.96":22510
I0115 01:07:02.237668  1214 RaftPart.cpp:449] [Port: 22510, Space: 1, Part: 144] I am Follower, just wait for the new leader!
I0115 01:07:02.239614  1214 Part.cpp:387] [Port: 22510, Space: 1, Part: 144] preprocess add learner "30.42.50.151":22510
I0115 01:07:02.239627  1214 RaftPart.cpp:384] [Port: 22510, Space: 1, Part: 144] I am learner!
I0115 01:07:02.241616  1214 Part.cpp:413] [Port: 22510, Space: 1, Part: 144] preprocess add peer "30.42.50.151":22510
I0115 01:07:02.241628  1214 RaftPart.cpp:481] [Port: 22510, Space: 1, Part: 144] I am already in the raft group!
I0115 01:07:02.241636  1214 Part.cpp:426] [Port: 22510, Space: 1, Part: 144] preprocess remove peer "11.145.5.72":22510
I0115 01:07:02.241642  1214 RaftPart.cpp:515] [Port: 22510, Space: 1, Part: 144] The peer "11.145.5.72":22510 not exist!
I0115 01:07:04.225697  1230 Host.h:32] [Port: 22510, Space: 1, Part: 50] [Host: 11.145.5.96:22510]  The host has been destroyed!
I0115 01:07:05.423733  1214 RaftPart.cpp:595] [Port: 22510, Space: 1, Part: 141] I am Follower, skip remove peer in commit
I0115 01:07:05.576086  1214 RaftPart.cpp:595] [Port: 22510, Space: 1, Part: 144] I am Follower, skip remove peer in commit
I0115 01:07:07.365478  1218 SlowOpTracker.h:33] [Port: 22510, Space: 1, Part: 40] total time:61ms, Total send logs: 2

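Hello, were there any other operations during the balance, and what is the exact status of this balance job? If all tasks succeeded, pick any one task and post the logs for the related part from storage so we can analyze it.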

No other operations were performed during the balance, and all tasks show success. By logs on storage, do you mean the service log, the WAL, or the RocksDB log?

Lines like the ones you posted above, tagged [Port: 22510, Space: 1, Part: 144]. Ideally, post the logs from one task's srcHost and dstHost as well as from the leader of that part.
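
One way to pull out just those lines from a storage service log is to filter on the bracketed tag. This is a minimal sketch, assuming the tag format seen in the storage log quoted in this thread; `STORAGE_TAG` and `lines_for_part` are hypothetical names, not a Nebula tool.

```python
import re

# Tag as it appears in the storage log quoted in this thread,
# e.g. "[Port: 22510, Space: 1, Part: 144]".
STORAGE_TAG = re.compile(r"\[Port: \d+, Space: (\d+), Part: (\d+)\]")


def lines_for_part(lines, space_id, part_id):
    """Return the log lines whose tag matches the given space and part."""
    selected = []
    for line in lines:
        match = STORAGE_TAG.search(line)
        if match and (int(match.group(1)), int(match.group(2))) == (space_id, part_id):
            selected.append(line)
    return selected


if __name__ == "__main__":
    sample = [
        "I0115 01:07:01.596065  1214 SlowOpTracker.h:33] [Port: 22510, Space: 1, Part: 14] total time:59ms, Total send logs: 2",
        "I0115 01:07:02.237645  1214 RaftPart.cpp:405] [Port: 22510, Space: 1, Part: 144] I am follower, just wait for the new leader.",
    ]
    # Only the Part: 144 line is printed; Part: 14 is not matched as a prefix.
    for line in lines_for_part(sample, 1, 144):
        print(line)
```

Running the same filter over the log files on the srcHost, the dstHost, and the part's leader gives the three sets of lines requested here.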
