During data balancing, stopping the meta leader leaves the balance job stuck in RUNNING status, even with a small data set (1 million records)

Question template:

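  • nebula version: 3.6.0 (confirmed)
  • Deployment: distributed
  • Installation: built from source / Docker
  • In production: N
  • Hardware information
    • HDD
    • CPU and memory: 3U3.5G
  • Detailed description of the problem
    With 3 metad instances and 3 storaged instances, the meta leader went offline after balance data was executed, and another metad became the new leader. From that point on, showing the balance job reports its status as RUNNING indefinitely. The behavior is the same whether the data set holds millions or tens of millions of records.
The following is the log from the meta leader before it went offline: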
I20240815 21:05:15.835616   558 MetaDaemon.cpp:241] Signal 15(Terminated) received, stopping this server
I20240815 21:05:15.912626   558 JobManager.cpp:138] JobManager::shutDown() begin
I20240815 21:05:15.916101   685 JobManager.cpp:155] Detect shutdown called, exit
I20240815 21:05:15.916178   685 JobDescription.cpp:113] Loading job description failed, error: E_JOB_NOT_IN_SPACE
I20240815 21:05:15.916237   685 JobManager.cpp:175] Load an invalid job from space 0 jodId 0
I20240815 21:05:15.916390   558 JobManager.cpp:146] JobManager::shutDown() end
I20240815 21:05:15.916424   558 NebulaStore.cpp:343] Stop the raft service...
I20240815 21:05:15.916437   558 RaftexService.cpp:69] Stopping the raftex service on port 26743
I20240815 21:05:15.916457   558 RaftexService.cpp:79] All partitions have stopped
I20240815 21:05:15.916472   558 NebulaStore.cpp:346] Stop kv engine...
I20240815 21:05:15.916718   558 NebulaStore.cpp:343] Stop the raft service...
I20240815 21:05:15.916741   558 RaftexService.cpp:69] Stopping the raftex service on port 26743
I20240815 21:05:15.916749   558 RaftexService.cpp:79] All partitions have stopped
I20240815 21:05:15.916754   558 NebulaStore.cpp:346] Stop kv engine...
I20240815 21:05:15.916769   558 NebulaStore.cpp:36] Cut off the relationship with meta client
I20240815 21:05:15.917281   558 Part.h:59] [Port: 26743, Space: 0, Part: 0] ~Part()
I20240815 21:05:15.919994   558 RocksEngine.h:247] Release rocksdb on xxx/data/meta/nebula/0
I20240815 21:05:15.920468   558 NebulaStore.cpp:44] ~NebulaStore()
I20240815 21:05:15.929644   558 MetaDaemon.cpp:226] The meta Daemon stopped
I20240815 21:05:44.487430  7108 BalanceTask.cpp:43] 13, 12:2,storaged-0:26746->storaged-2:26746 still in processing
I20240815 21:05:44.487555  7108 BalanceTask.cpp:137] 13, 12:2,storaged-0:26746->storaged-2:26746 Send member change request to the leader, it will add the new member on dst host
I20240815 21:05:44.487592  7108 BalanceTask.cpp:251] 13, 12:2 Can't persist task!
I20240815 21:05:44.487603  7108 BalancePlan.cpp:98] Balance 13 has completed 1 task
I20240815 21:05:44.492920  7109 BalanceTask.cpp:43] 13, 12:1,storaged-1->storaged-2 still in processing
I20240815 21:05:44.492985  7109 BalanceTask.cpp:137] 13, 12:1,storaged-1:26746->storaged-2:26746 Send member change request to the leader, it will add the new member on dst host
I20240815 21:05:44.493006  7109 BalanceTask.cpp:251] 13, 12:1 Can't persist task!
I20240815 21:05:44.493041  7109 BalancePlan.cpp:98] Balance 13 has completed 2 task
I20240815 21:05:44.493054  7109 BalancePlan.cpp:102] Balance 13 failed!
I20240815 21:05:44.493068  7109 ZoneBalanceJobExecutor.cpp:67] Balance plan 13 update meta failed
I20240815 21:05:44.493094  7109 JobManager.cpp:311] Trying to end job, spaceId=12, jobId=13, target phase status=FAILED
I20240815 21:05:44.493144  7109 JobDescription.cpp:113] Loading job description failed, error: E_SPACE_NOT_FOUND
I20240815 21:05:44.493156  7109 JobManager.cpp:327] Load job failed, spaceId=12 jobId=13
I20240815 21:05:44.497642   558 JobManager.cpp:138] JobManager::shutDown() begin
I20240815 21:05:44.497684   558 JobManager.cpp:141] JobManager not running, exit
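
To make the failure easier to spot in a longer metad log, the lines above can be reduced to a short summary. The following is only a minimal sketch, not part of NebulaGraph: the function name `summarize` and the regular expressions are made up for illustration, and reading the prefix `13, 12:2` as "job id, space id:part id" is an assumption inferred from the later `spaceId=12, jobId=13` line.

```python
import re

# glog prefix: severity + date, time, thread id, source file:line] message
GLOG_RE = re.compile(
    r"^[IWEF]\d{8} (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ "
    r"(?P<src>[\w.]+):\d+\] (?P<msg>.*)$"
)
# Assumption: "13, 12:2" means job 13, space 12, part 2.
PERSIST_RE = re.compile(r"^(\d+), (\d+):(\d+) Can't persist task!$")
PLAN_FAILED_RE = re.compile(r"^Balance (\d+) failed!$")
LOAD_FAILED_RE = re.compile(r"^Load job failed, spaceId=(\d+) jobId=(\d+)$")


def summarize(lines):
    """Collect the balance-related failures found in metad log lines."""
    summary = {"unpersisted": [], "failed_plans": [], "unloadable_jobs": []}
    for raw in lines:
        m = GLOG_RE.match(raw.strip())
        if not m:
            continue  # skip anything that is not a glog-formatted line
        msg = m.group("msg")
        hit = PERSIST_RE.match(msg)
        if hit:
            job, space, part = map(int, hit.groups())
            summary["unpersisted"].append((job, space, part))
            continue
        hit = PLAN_FAILED_RE.match(msg)
        if hit:
            summary["failed_plans"].append(int(hit.group(1)))
            continue
        hit = LOAD_FAILED_RE.match(msg)
        if hit:
            space, job = map(int, hit.groups())
            summary["unloadable_jobs"].append((space, job))
    return summary


# A few lines copied from the log in this post.
SAMPLE = """\
I20240815 21:05:15.929644   558 MetaDaemon.cpp:226] The meta Daemon stopped
I20240815 21:05:44.487592  7108 BalanceTask.cpp:251] 13, 12:2 Can't persist task!
I20240815 21:05:44.493006  7109 BalanceTask.cpp:251] 13, 12:1 Can't persist task!
I20240815 21:05:44.493054  7109 BalancePlan.cpp:102] Balance 13 failed!
I20240815 21:05:44.493156  7109 JobManager.cpp:327] Load job failed, spaceId=12 jobId=13
"""

if __name__ == "__main__":
    print(summarize(SAMPLE.splitlines()))
```

Read this way, the log suggests that both balance tasks and the final FAILED status were written only after the kv store had already been stopped, which would explain why the new leader still sees the job as RUNNING.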

balance data is a feature that is not recommended for use in the Community Edition.

But now that this has happened, would it be possible to change this job's status to FAILED directly at the source-code level?
