Stopping the meta leader during data balancing leaves the balance job stuck in RUNNING, even with a small dataset (1 million records)

DROP · August 16, 2024, 07:12

Question template:

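nebula version: 3.6.0 (confirmed)
Deployment: distributed
Installation: built from source / Docker
In production: N
Hardware information
- HDD
- CPU and memory: 3U3.5G
Problem description
With 3 metad instances and 3 storaged instances, the meta leader went offline after balance data was executed, and another metad became the new leader. From then on, showing the balance job always reports the job status as RUNNING. The behavior is the same whether the dataset holds millions or tens of millions of records.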

Below are the logs from before the meta leader went offline:
I20240815 21:05:15.835616   558 MetaDaemon.cpp:241] Signal 15(Terminated) received, stopping this server
I20240815 21:05:15.912626   558 JobManager.cpp:138] JobManager::shutDown() begin
I20240815 21:05:15.916101   685 JobManager.cpp:155] Detect shutdown called, exit
I20240815 21:05:15.916178   685 JobDescription.cpp:113] Loading job description failed, error: E_JOB_NOT_IN_SPACE
I20240815 21:05:15.916237   685 JobManager.cpp:175] Load an invalid job from space 0 jodId 0
I20240815 21:05:15.916390   558 JobManager.cpp:146] JobManager::shutDown() end
I20240815 21:05:15.916424   558 NebulaStore.cpp:343] Stop the raft service...
I20240815 21:05:15.916437   558 RaftexService.cpp:69] Stopping the raftex service on port 26743
I20240815 21:05:15.916457   558 RaftexService.cpp:79] All partitions have stopped
I20240815 21:05:15.916472   558 NebulaStore.cpp:346] Stop kv engine...
I20240815 21:05:15.916718   558 NebulaStore.cpp:343] Stop the raft service...
I20240815 21:05:15.916741   558 RaftexService.cpp:69] Stopping the raftex service on port 26743
I20240815 21:05:15.916749   558 RaftexService.cpp:79] All partitions have stopped
I20240815 21:05:15.916754   558 NebulaStore.cpp:346] Stop kv engine...
I20240815 21:05:15.916769   558 NebulaStore.cpp:36] Cut off the relationship with meta client
I20240815 21:05:15.917281   558 Part.h:59] [Port: 26743, Space: 0, Part: 0] ~Part()
I20240815 21:05:15.919994   558 RocksEngine.h:247] Release rocksdb on xxx/data/meta/nebula/0
I20240815 21:05:15.920468   558 NebulaStore.cpp:44] ~NebulaStore()
I20240815 21:05:15.929644   558 MetaDaemon.cpp:226] The meta Daemon stopped
I20240815 21:05:44.487430  7108 BalanceTask.cpp:43] 13, 12:2,storaged-0:26746->storaged-2:26746 still in processing
I20240815 21:05:44.487555  7108 BalanceTask.cpp:137] 13, 12:2,storaged-0:26746->storaged-2:26746 Send member change request to the leader, it will add the new member on dst host
I20240815 21:05:44.487592  7108 BalanceTask.cpp:251] 13, 12:2 Can't persist task!
I20240815 21:05:44.487603  7108 BalancePlan.cpp:98] Balance 13 has completed 1 task
I20240815 21:05:44.492920  7109 BalanceTask.cpp:43] 13, 12:1,storaged-1->storaged-2 still in processing
I20240815 21:05:44.492985  7109 BalanceTask.cpp:137] 13, 12:1,storaged-1:26746->storaged-2:26746 Send member change request to the leader, it will add the new member on dst host
I20240815 21:05:44.493006  7109 BalanceTask.cpp:251] 13, 12:1 Can't persist task!
I20240815 21:05:44.493041  7109 BalancePlan.cpp:98] Balance 13 has completed 2 task
I20240815 21:05:44.493054  7109 BalancePlan.cpp:102] Balance 13 failed!
I20240815 21:05:44.493068  7109 ZoneBalanceJobExecutor.cpp:67] Balance plan 13 update meta failed
I20240815 21:05:44.493094  7109 JobManager.cpp:311] Trying to end job, spaceId=12, jobId=13, target phase status=FAILED
I20240815 21:05:44.493144  7109 JobDescription.cpp:113] Loading job description failed, error: E_SPACE_NOT_FOUND
I20240815 21:05:44.493156  7109 JobManager.cpp:327] Load job failed, spaceId=12 jobId=13
I20240815 21:05:44.497642   558 JobManager.cpp:138] JobManager::shutDown() begin
I20240815 21:05:44.497684   558 JobManager.cpp:141] JobManager not running, exit
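
For readers who want to pull the balance-related entries out of a longer metad log, the following is a minimal sketch, not part of the original post. It assumes the lines follow the glog layout seen above (level, date, time, thread ID, `file:line]`, message). The names `parse_glog`, `balance_events`, and `SAMPLE` are hypothetical helpers introduced for illustration; `SAMPLE` copies a few lines from the log above.

```python
import re

# Assumed glog layout: <level><yyyymmdd> <hh:mm:ss.uuuuuu> <thread> <file>:<line>] <message>
GLOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) (?P<file>[\w.]+):(?P<line>\d+)\] (?P<message>.*)$"
)


def parse_glog(text):
    """Return one dict per line that matches the glog layout; skip other lines."""
    records = []
    for raw in text.splitlines():
        match = GLOG_LINE.match(raw.strip())
        if match:
            records.append(match.groupdict())
    return records


def balance_events(records):
    """Keep only entries emitted by source files whose name starts with 'Balance'."""
    return [r for r in records if r["file"].startswith("Balance")]


# A few lines copied from the log in this post (hypothetical variable name).
SAMPLE = """\
I20240815 21:05:15.929644   558 MetaDaemon.cpp:226] The meta Daemon stopped
I20240815 21:05:44.487592  7108 BalanceTask.cpp:251] 13, 12:2 Can't persist task!
I20240815 21:05:44.493006  7109 BalanceTask.cpp:251] 13, 12:1 Can't persist task!
I20240815 21:05:44.493054  7109 BalancePlan.cpp:102] Balance 13 failed!
I20240815 21:05:44.493094  7109 JobManager.cpp:311] Trying to end job, spaceId=12, jobId=13, target phase status=FAILED
I20240815 21:05:44.493144  7109 JobDescription.cpp:113] Loading job description failed, error: E_SPACE_NOT_FOUND
"""

if __name__ == "__main__":
    for event in balance_events(parse_glog(SAMPLE)):
        print(event["time"], event["thread"], event["file"], event["message"])
```

Filtering this way makes the sequence easier to follow: both tasks report `Can't persist task!`, the plan is marked failed, and the attempt to end the job with status FAILED is followed by `E_SPACE_NOT_FOUND`.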

MuYi-方扬 · August 17, 2024, 08:54

balance data is a feature that is not recommended for use in the community edition.

DROP · August 19, 2024, 02:45

But now that this has happened, would it be possible to change this job's status to FAILED directly at the source-code level?
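
To make the idea in this question concrete, here is a toy model only, not NebulaGraph source code and not a confirmed fix. It assumes a new leader could compare persisted job records against the set of jobs it is actually executing and mark any orphaned RUNNING job as FAILED. `JobStatus`, `JobRecord`, and `fail_orphaned_jobs` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class JobStatus(Enum):
    # Hypothetical status set for this toy model.
    QUEUE = "QUEUE"
    RUNNING = "RUNNING"
    FINISHED = "FINISHED"
    FAILED = "FAILED"
    STOPPED = "STOPPED"


@dataclass
class JobRecord:
    space_id: int
    job_id: int
    status: JobStatus


def fail_orphaned_jobs(jobs, live_job_ids):
    """Mark RUNNING jobs that no executor is working on as FAILED.

    Returns the IDs of the jobs whose status was changed.
    """
    changed = []
    for job in jobs:
        if job.status is JobStatus.RUNNING and job.job_id not in live_job_ids:
            job.status = JobStatus.FAILED
            changed.append(job.job_id)
    return changed


if __name__ == "__main__":
    # Job 13 in space 12 mirrors the IDs in the log; the new leader runs nothing.
    persisted = [JobRecord(space_id=12, job_id=13, status=JobStatus.RUNNING)]
    print(fail_orphaned_jobs(persisted, live_job_ids=set()))
    print(persisted[0].status)
```

Whether such a reconciliation step is safe in the real system depends on how job state is persisted and replicated across metad instances, which this sketch does not model.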

system · September 18, 2024, 02:45

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.