Data balancing plan

No other jobs were running, and reads and writes were also turned off during the balance.

@spw @SuperYoko Hi, do you need any more information from me?

It looks like error -3035. @SuperYoko @spw The retries went past the limit of 30.

X(E_RETRY_EXHAUSTED, -3035)

Is this the place?

I20230313 20:23:19.163095 22758 AdminClient.cpp:592] Unknown code -3035 from "10.152.2.95":9778, retry 27, limit 30
I20230313 20:23:19.163563 22758 AdminClient.cpp:574] Return leader change from "10.151.1.223":9778, new leader is "10.152.2.95":9778, retry 28, limit 30
I20230313 20:23:19.165951 22751 AdminClient.cpp:592] Unknown code -3035 from "10.152.2.95":9778, retry 27, limit 30
I20230313 20:23:19.166437 22751 AdminClient.cpp:574] Return leader change from "10.151.1.223":9778, new leader is "10.152.2.95":9778, retry 28, limit 30
I20230313 20:37:23.668624 16473 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230313 20:37:23.693392 16473 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230313 20:38:14.175863 22760 AdminClient.cpp:592] Unknown code -3035 from "10.151.1.223":9778, retry 29, limit 30
I20230313 20:38:14.176486 22760 BalanceTask.cpp:127] 96, 47:76,10.152.2.95:9779->10.152.10.78:9779 Catchup data failed, status Leader changed!
I20230313 20:38:14.177793 22760 BalanceTask.cpp:38] 96, 47:76,10.152.2.95:9779->10.152.10.78:9779 Task failed, status 5
I20230313 20:38:14.177805 22760 BalancePlan.cpp:98] Balance 96 has completed 29 task
I20230313 20:38:14.177812 22760 BalancePlan.cpp:113] Skip the task for the same partId 76
I20230313 20:38:14.178561 22760 BalanceTask.cpp:38] 96, 47:76,10.151.1.223:9779->10.152.3.161:9779 Task failed, status 1
I20230313 20:38:14.178570 22760 BalancePlan.cpp:98] Balance 96 has completed 30 task
I20230313 20:38:14.178572 22760 BalancePlan.cpp:113] Skip the task for the same partId 76
I20230313 20:38:14.179280 22760 BalanceTask.cpp:38] 96, 47:76,10.128.5.244:9779->10.140.50.10:9779 Task failed, status 1
I20230313 20:38:14.179288 22760 BalancePlan.cpp:98] Balance 96 has completed 31 task
I20230313 20:38:19.167425 22758 AdminClient.cpp:592] Unknown code -3035 from "10.152.2.95":9778, retry 29, limit 30
I20230313 20:38:19.167869 22758 BalanceTask.cpp:127] 96, 47:86,10.128.5.244:9779->10.152.10.78:9779 Catchup data failed, status Leader changed!
I20230313 20:38:19.168772 22758 BalanceTask.cpp:38] 96, 47:86,10.128.5.244:9779->10.152.10.78:9779 Task failed, status 5
I20230313 20:38:19.168792 22758 BalancePlan.cpp:98] Balance 96 has completed 32 task
I20230313 20:38:19.168797 22758 BalancePlan.cpp:113] Skip the task for the same partId 86
I20230313 20:38:19.169384 22751 AdminClient.cpp:592] Unknown code -3035 from "10.152.2.95":9778, retry 29, limit 30
I20230313 20:38:19.169580 22758 BalanceTask.cpp:38] 96, 47:86,10.151.1.223:9779->10.140.50.10:9779 Task failed, status 1
I20230313 20:38:19.169589 22758 BalancePlan.cpp:98] Balance 96 has completed 33 task
I20230313 20:38:19.169591 22758 BalancePlan.cpp:113] Skip the task for the same partId 86
I20230313 20:38:19.169827 22751 BalanceTask.cpp:127] 96, 47:73,10.128.5.244:9779->10.152.3.161:9779 Catchup data failed, status Leader changed!
I20230313 20:38:19.170343 22758 BalanceTask.cpp:38] 96, 47:86,10.152.2.95:9779->10.152.3.161:9779 Task failed, status 1
I20230313 20:38:19.170351 22758 BalancePlan.cpp:98] Balance 96 has completed 34 task
I20230313 20:38:19.171085 22751 BalanceTask.cpp:38] 96, 47:73,10.128.5.244:9779->10.152.3.161:9779 Task failed, status 5
I20230313 20:38:19.171095 22751 BalancePlan.cpp:98] Balance 96 has completed 35 task
I20230313 20:38:19.171100 22751 BalancePlan.cpp:113] Skip the task for the same partId 73
I20230313 20:38:19.171757 22751 BalanceTask.cpp:38] 96, 47:73,10.151.1.223:9779->10.152.10.78:9779 Task failed, status 1
I20230313 20:38:19.171766 22751 BalancePlan.cpp:98] Balance 96 has completed 36 task
I20230313 20:38:19.171768 22751 BalancePlan.cpp:113] Skip the task for the same partId 73
I20230313 20:38:19.172451 22751 BalanceTask.cpp:38] 96, 47:73,10.152.2.95:9779->10.140.50.10:9779 Task failed, status 1
I20230313 20:38:19.172457 22751 BalancePlan.cpp:98] Balance 96 has completed 37 task
I20230313 20:38:19.172461 22751 BalancePlan.cpp:102] Balance 96 failed!
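
To make a log like the one above easier to read, the failed tasks can be grouped by partition. This is only a minimal sketch: the regular expression is written against the line format shown in this thread, and `failed_tasks_by_part` is a hypothetical helper name, not part of NebulaGraph. The status numbers are reported as logged, without interpreting them.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   ... BalanceTask.cpp:38] 96, 47:76,10.152.2.95:9779->10.152.10.78:9779 Task failed, status 5
# The format is taken from the log in this thread and may differ in other versions.
TASK_FAILED = re.compile(
    r"\] (?P<plan>\d+), (?P<space>\d+):(?P<part>\d+),"
    r"(?P<src>\S+)->(?P<dst>\S+) Task failed, status (?P<status>\d+)"
)


def failed_tasks_by_part(lines):
    """Group 'Task failed' lines by (space, part), keeping (src, dst, status)."""
    by_part = defaultdict(list)
    for line in lines:
        m = TASK_FAILED.search(line)
        if m:
            key = (int(m["space"]), int(m["part"]))
            by_part[key].append((m["src"], m["dst"], int(m["status"])))
    return dict(by_part)


# Three lines copied from the log above.
sample = [
    "I20230313 20:38:14.177793 22760 BalanceTask.cpp:38] 96, 47:76,10.152.2.95:9779->10.152.10.78:9779 Task failed, status 5",
    "I20230313 20:38:14.178561 22760 BalanceTask.cpp:38] 96, 47:76,10.151.1.223:9779->10.152.3.161:9779 Task failed, status 1",
    "I20230313 20:38:19.168772 22758 BalanceTask.cpp:38] 96, 47:86,10.128.5.244:9779->10.152.10.78:9779 Task failed, status 5",
]

for part, tasks in failed_tasks_by_part(sample).items():
    print(part, tasks)
```

In the log above, each affected partition (76, 86, 73) has one task that fails with status 5 right after "Catchup data failed", and its remaining tasks fail with status 1 after "Skip the task for the same partId".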

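Also, it seems compaction was interleaved with the balancing. Is that related?

@wey That's right. According to the log, E_RETRY_EXHAUSTED comes from retrying many times while raft was still sending the snapshot. The error hit in WaitingForCatchUpDataProcessor::process is E_RAFT_SENDING_SNAPSHOT. You can also see this in storaged.INFO: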

I20230310 15:33:14.584909 19844 AdminProcessor.h:355] Waiting for catching up data, peer "10.152.10.78":9780, space 47, part 21, remaining 21 retry times, result -3512
I20230310 15:33:14.584933 19844 AdminProcessor.h:372] Space 47, partId 21 is still sending snapshot, please wait...
I20230310 15:33:14.661196 19859 AdminProcessor.h:355] Waiting for catching up data, peer "10.152.3.161":9780, space 47, part 33, remaining 1 retry times, result -3512
I20230310 15:33:14.661223 19859 AdminProcessor.h:372] Space 47, partId 33 is still sending snapshot, please wait...
I20230310 15:33:39.683490 19846 AdminProcessor.h:355] Waiting for catching up data, peer "10.140.50.10":9780, space 47, part 68, remaining 24 retry times, result -3512
I20230310 15:33:39.683521 19846 AdminProcessor.h:372] Space 47, partId 68 is still sending snapshot, please wait...
I20230310 15:33:44.563724 19858 AdminProcessor.h:355] Waiting for catching up data, peer "10.152.10.78":9780, space 47, part 35, remaining 9 retry times, result -3512
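
To see which partitions are closest to running out of retries, the "Waiting for catching up data" lines can be summarized. This is a rough sketch that assumes the storaged.INFO line format shown above; `closest_to_exhaustion` is a hypothetical helper name, and -3512 is simply the result code that the log pairs with "still sending snapshot".

```python
import re

# Matches lines such as:
#   ... Waiting for catching up data, peer "10.152.10.78":9780, space 47, part 21,
#   remaining 21 retry times, result -3512
CATCH_UP = re.compile(
    r'peer "(?P<host>[^"]+)":(?P<port>\d+), space (?P<space>\d+), '
    r'part (?P<part>\d+), remaining (?P<remaining>\d+) retry times, '
    r'result (?P<result>-?\d+)'
)


def closest_to_exhaustion(lines, result_code=-3512):
    """Return [((space, part), lowest_remaining), ...] sorted by remaining retries."""
    lowest = {}
    for line in lines:
        m = CATCH_UP.search(line)
        if m is None or int(m["result"]) != result_code:
            continue
        key = (int(m["space"]), int(m["part"]))
        remaining = int(m["remaining"])
        lowest[key] = min(remaining, lowest.get(key, remaining))
    return sorted(lowest.items(), key=lambda kv: kv[1])


# Three lines copied from the storaged.INFO excerpt above.
sample = [
    'I20230310 15:33:14.584909 19844 AdminProcessor.h:355] Waiting for catching up data, peer "10.152.10.78":9780, space 47, part 21, remaining 21 retry times, result -3512',
    'I20230310 15:33:14.661196 19859 AdminProcessor.h:355] Waiting for catching up data, peer "10.152.3.161":9780, space 47, part 33, remaining 1 retry times, result -3512',
    'I20230310 15:33:39.683490 19846 AdminProcessor.h:355] Waiting for catching up data, peer "10.140.50.10":9780, space 47, part 68, remaining 24 retry times, result -3512',
]

for (space, part), remaining in closest_to_exhaustion(sample):
    print(f"space {space} part {part}: {remaining} retries left")
```

On the sample, part 33 comes first because it has only 1 retry left while its snapshot is still being sent.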

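Is it because the snapshot is too large? Do we need to raise the retry count? Or did the earlier partial rebalance make some progress, so that running balance data once more would get further?

Raise this one?

Is this a storaged parameter? Do I need to adjust it in the configuration on my side?

Yes, it is a storage-side configuration, for the case where balancing cannot catch up at the current bandwidth.

waiting_catch_up_interval_in_secs. I think you can raise it and retry.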

Until others offer different suggestions, try what @wey said first. For now it looks like the network or the disk may be too slow.

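So I just add --waiting_catch_up_interval_in_secs=xx to nebula-storaged.conf? I see the default is 30. What value would you recommend?

I don't have an empirical value either; this is the first time I have seen a catch-up that can't keep up. To make sure it succeeds, you could go large this time, say 300? You can lower it again afterwards.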
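
As a rough sanity check on that number, here is a small sketch of the wait budget. It rests on two assumptions that the thread does not confirm: that storaged retries the catch-up check 30 times per request (the excerpt only shows "remaining 24 retry times" at most), and that the total wait is roughly retries times waiting_catch_up_interval_in_secs. The second assumption is at least consistent with the metad log, where consecutive retries against the same host are 15 minutes apart (20:23:19 and 20:38:19).

```python
from datetime import datetime


def catch_up_budget_secs(retry_times: int, interval_secs: int) -> int:
    """Approximate total wait for one catch-up request (assumed: retries * interval)."""
    return retry_times * interval_secs


def gap_secs(start: str, end: str) -> int:
    """Seconds between two same-day HH:MM:SS timestamps taken from the log."""
    fmt = "%H:%M:%S"
    return int((datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds())


ASSUMED_RETRY_TIMES = 30  # assumption, not stated in the thread

default_budget = catch_up_budget_secs(ASSUMED_RETRY_TIMES, 30)   # 900 s, i.e. 15 min
raised_budget = catch_up_budget_secs(ASSUMED_RETRY_TIMES, 300)   # 9000 s, i.e. 150 min
observed_gap = gap_secs("20:23:19", "20:38:19")                  # 900 s

print(default_budget, raised_budget, observed_gap)
```

Under these assumptions, going from 30 to 300 stretches the wait per request from about 15 minutes to about 150 minutes, so a snapshot that needs longer than that would still time out.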
