Compaction has been running for two days and still hasn't finished

Question reference template:

  • NebulaGraph version: 3.3.0

  • Deployment: distributed

  • Installation method: RPM

  • Production environment: N

  • Hardware info

    • Disk (SSD recommended): HDD

  • Detailed description of the problem

  • Relevant meta / storage / graph info logs (preferably in text form for easy searching)


+----------------+------------------+------------+----------------------------+----------------------------+-------------------+
| Job Id(TaskId) | Command(Dest)    | Status     | Start Time                 | Stop Time                  | Error Code        |
+----------------+------------------+------------+----------------------------+----------------------------+-------------------+
| 6              | "COMPACT"        | "RUNNING"  | 2023-01-14T00:18:32.000000 |                            | "E_JOB_SUBMITTED" |
| 0              | "172.18.103.120" | "FAILED"   | 2023-01-14T00:18:32.000000 | 2023-01-14T06:22:42.000000 | ""                |
| 1              | "172.18.103.125" | "FAILED"   | 2023-01-14T00:18:32.000000 | 2023-01-14T17:22:08.000000 | ""                |
| 2              | "172.18.103.113" | "RUNNING"  | 2023-01-14T00:18:32.000000 |                            | ""                |
| "Total:3"      | "Succeeded:0"    | "Failed:2" | "In Progress:1"            | ""                         | ""                |
+----------------+------------------+------------+----------------------------+----------------------------+-------------------+

The compaction has been running for several days. The jobs on the other two hosts failed, but the one on the meta leader is still running and never finishes. Is there any way to resolve this?
PS: this cluster runs on HDDs, so I/O performance is poor.

  1. You can stop the job (see the sketch after this list); if it cannot be stopped, you can try restarting storaged directly.
  2. Check the storaged logs on the two failed hosts to see why they failed. Out of disk space?
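
For reference, a minimal console sketch of stopping the job, assuming the job id is 6 as shown above. STOP JOB is the generic job-management statement in nGQL 3.x, though per the compaction docs an already-running compaction may not actually be interruptible:

(root@nebula) [Unified]> SHOW JOBS;
(root@nebula) [Unified]> STOP JOB 6;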

  1. I have already restarted storaged. Should I try restarting it again?
  2. Yes, it was out of space. Disk usage was at 90% when I started the full compaction; at one point it hit 100% and I could not even connect with the console. A day later disk usage came back down, but one host still never finishes.

This node has been offline for a long time, but the compact job still has not finished :open_mouth:

(root@nebula) [(none)]> show hosts
+------------------+------+-----------+-----------+--------------+----------------------+------------------------+---------+
| Host             | Port | HTTP port | Status    | Leader count | Leader distribution  | Partition distribution | Version |
+------------------+------+-----------+-----------+--------------+----------------------+------------------------+---------+
| "172.18.103.113" | 9779 | 19779     | "OFFLINE" | 0            | "No valid partition" | "Unified:30"           | "3.3.0" |
| "172.18.103.120" | 9779 | 19779     | "ONLINE"  | 13           | "Unified:13"         | "Unified:30"           | "3.3.0" |
| "172.18.103.125" | 9779 | 19779     | "ONLINE"  | 17           | "Unified:17"         | "Unified:30"           | "3.3.0" |
+------------------+------+-----------+-----------+--------------+----------------------+------------------------+---------+
Got 3 rows (time spent 5.88ms/8.067512ms)

Tue, 17 Jan 2023 17:53:29 CST

(root@nebula) [(none)]> use Unified
Execution succeeded (time spent 2.488ms/3.123339ms)

Tue, 17 Jan 2023 17:53:35 CST

(root@nebula) [Unified]> show jobs
+--------+------------------+------------+----------------------------+----------------------------+
| Job Id | Command          | Status     | Start Time                 | Stop Time                  |
+--------+------------------+------------+----------------------------+----------------------------+
| 7      | "LEADER_BALANCE" | "QUEUE"    |                            |                            |
| 6      | "COMPACT"        | "RUNNING"  | 2023-01-14T00:18:32.000000 |                            |
| 5      | "COMPACT"        | "FINISHED" | 2023-01-13T03:10:26.000000 | 2023-01-13T07:11:16.000000 |
| 4      | "COMPACT"        | "FINISHED" | 2023-01-11T11:51:45.000000 | 2023-01-11T16:37:11.000000 |
| 3      | "LEADER_BALANCE" | "FINISHED" | 2023-01-11T01:50:48.000000 | 2023-01-11T01:50:54.000000 |
+--------+------------------+------------+----------------------------+----------------------------+
Got 5 rows (time spent 23.884ms/24.97749ms)

Tue, 17 Jan 2023 17:53:36 CST

Once the host goes offline the job indeed cannot complete; it only completes after storaged reports completion back to meta.


If the offline storaged comes back later, can the job eventually finish on its own? The node I am looking at now shows ONLINE, but the job is still RUNNING.

After a restart, basically all you can do is wait for it to fail. Is it still in the RUNNING state? You can run SHOW JOB 6 to take a look.

Here is the information. This is node 113; as far as I can tell, disk space is sufficient now.
I restarted the storaged service on 113 this morning:
Logs

I20230128 09:50:07.366894  5015 NebulaStore.cpp:92] Scan path "/data/nebula330storage/nebula/1"
I20230128 09:50:07.367002  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_background_jobs=8
I20230128 09:50:07.367030  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_subcompactions=8
I20230128 09:50:07.367213  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_bytes_for_level_base=268435456
I20230128 09:50:07.367237  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_write_buffer_number=4
I20230128 09:50:07.367271  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option write_buffer_size=67108864
I20230128 09:50:07.367287  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option disable_auto_compactions=true
I20230128 09:50:07.376287  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option block_size=8192
I20230128 09:50:07.487393  5015 RocksEngine.cpp:97] open rocksdb on /data/nebula330storage/nebula/1/data
I20230128 09:50:07.487450  5015 RocksEngine.h:196] Release rocksdb on /data/nebula330storage/nebula/1
I20230128 09:50:07.489181  5015 NebulaStore.cpp:271] Init data from partManager for "172.18.103.113":9779
I20230128 09:50:07.489264  5015 NebulaStore.cpp:369] Data space 1 has existed!
I20230128 09:50:07.489316  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_background_jobs=8
I20230128 09:50:07.489328  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_subcompactions=8
I20230128 09:50:07.489435  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_bytes_for_level_base=268435456
I20230128 09:50:07.489446  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_write_buffer_number=4
I20230128 09:50:07.489456  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option write_buffer_size=67108864
I20230128 09:50:07.489466  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option disable_auto_compactions=true
I20230128 09:50:07.489630  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option block_size=8192
I20230128 09:50:07.553908  5015 RocksEngine.cpp:97] open rocksdb on /data/nebula330storage/nebula/1/data
I20230128 09:50:07.554031  5015 NebulaStore.cpp:430] [Space: 1, Part: 1] has existed!
I20230128 09:50:07.554078  5015 NebulaStore.cpp:430] [Space: 1, Part: 2] has existed!
I20230128 09:50:07.554093  5015 NebulaStore.cpp:430] [Space: 1, Part: 3] has existed!
I20230128 09:50:07.554107  5015 NebulaStore.cpp:430] [Space: 1, Part: 4] has existed!
I20230128 09:50:07.554121  5015 NebulaStore.cpp:430] [Space: 1, Part: 5] has existed!
I20230128 09:50:07.554133  5015 NebulaStore.cpp:430] [Space: 1, Part: 6] has existed!
I20230128 09:50:07.554147  5015 NebulaStore.cpp:430] [Space: 1, Part: 7] has existed!
I20230128 09:50:07.554160  5015 NebulaStore.cpp:430] [Space: 1, Part: 8] has existed!
I20230128 09:50:07.554183  5015 NebulaStore.cpp:430] [Space: 1, Part: 9] has existed!
I20230128 09:50:07.554195  5015 NebulaStore.cpp:430] [Space: 1, Part: 10] has existed!
I20230128 09:50:07.554208  5015 NebulaStore.cpp:430] [Space: 1, Part: 11] has existed!
I20230128 09:50:07.554222  5015 NebulaStore.cpp:430] [Space: 1, Part: 12] has existed!
I20230128 09:50:07.554235  5015 NebulaStore.cpp:430] [Space: 1, Part: 13] has existed!
I20230128 09:50:07.554250  5015 NebulaStore.cpp:430] [Space: 1, Part: 14] has existed!
I20230128 09:50:07.554262  5015 NebulaStore.cpp:430] [Space: 1, Part: 15] has existed!
I20230128 09:50:07.554275  5015 NebulaStore.cpp:430] [Space: 1, Part: 16] has existed!
I20230128 09:50:07.554289  5015 NebulaStore.cpp:430] [Space: 1, Part: 17] has existed!
I20230128 09:50:07.554302  5015 NebulaStore.cpp:430] [Space: 1, Part: 18] has existed!
I20230128 09:50:07.554315  5015 NebulaStore.cpp:430] [Space: 1, Part: 19] has existed!
I20230128 09:50:07.554328  5015 NebulaStore.cpp:430] [Space: 1, Part: 20] has existed!
I20230128 09:50:07.554347  5015 NebulaStore.cpp:430] [Space: 1, Part: 21] has existed!
I20230128 09:50:07.554360  5015 NebulaStore.cpp:430] [Space: 1, Part: 22] has existed!
I20230128 09:50:07.554378  5015 NebulaStore.cpp:430] [Space: 1, Part: 23] has existed!
I20230128 09:50:07.554397  5015 NebulaStore.cpp:430] [Space: 1, Part: 24] has existed!
I20230128 09:50:07.554414  5015 NebulaStore.cpp:430] [Space: 1, Part: 25] has existed!
I20230128 09:50:07.554430  5015 NebulaStore.cpp:430] [Space: 1, Part: 26] has existed!
I20230128 09:50:07.554442  5015 NebulaStore.cpp:430] [Space: 1, Part: 27] has existed!
I20230128 09:50:07.554455  5015 NebulaStore.cpp:430] [Space: 1, Part: 28] has existed!
I20230128 09:50:07.554468  5015 NebulaStore.cpp:430] [Space: 1, Part: 29] has existed!
I20230128 09:50:07.554481  5015 NebulaStore.cpp:430] [Space: 1, Part: 30] has existed!
I20230128 09:50:07.554563  5015 NebulaStore.cpp:78] Register handler...
I20230128 09:50:07.554576  5015 StorageServer.cpp:228] Init LogMonitor
I20230128 09:50:07.554723  5015 StorageServer.cpp:96] Starting Storage HTTP Service
I20230128 09:50:07.555509  5015 StorageServer.cpp:100] Http Thread Pool started
I20230128 09:50:07.560770  5179 WebService.cpp:124] Web service started on HTTP[19779]
I20230128 09:50:07.560891  5015 TransactionManager.cpp:24] TransactionManager ctor()
I20230128 09:50:07.561961  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_background_jobs=8
I20230128 09:50:07.561991  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_subcompactions=8
I20230128 09:50:07.562110  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_bytes_for_level_base=268435456
I20230128 09:50:07.562124  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_write_buffer_number=4
I20230128 09:50:07.562136  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option write_buffer_size=67108864
I20230128 09:50:07.562150  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option disable_auto_compactions=true
I20230128 09:50:07.562343  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option block_size=8192
I20230128 09:50:07.641964  5015 RocksEngine.cpp:97] open rocksdb on /mnt/nebula330/data/storage/nebula/0/data
I20230128 09:50:07.642197  5015 AdminTaskManager.cpp:22] max concurrent subtasks: 10
I20230128 09:50:07.642505  5015 AdminTaskManager.cpp:40] exit AdminTaskManager::init()
I20230128 09:50:07.642700  5200 AdminTaskManager.cpp:227] waiting for incoming task
I20230128 09:50:28.812856  5081 MetaClient.cpp:3108] Load leader of "172.18.103.113":9779 in 0 space
I20230128 09:50:28.812965  5081 MetaClient.cpp:3108] Load leader of "172.18.103.120":9779 in 1 space
I20230128 09:50:28.812988  5081 MetaClient.cpp:3108] Load leader of "172.18.103.125":9779 in 1 space
I20230128 09:50:28.812997  5081 MetaClient.cpp:3114] Load leader ok
(root@nebula) [Unified]> show job 6
+----------------+------------------+------------+----------------------------+----------------------------+-------------------+
| Job Id(TaskId) | Command(Dest)    | Status     | Start Time                 | Stop Time                  | Error Code        |
+----------------+------------------+------------+----------------------------+----------------------------+-------------------+
| 6              | "COMPACT"        | "RUNNING"  | 2023-01-14T00:18:32.000000 |                            | "E_JOB_SUBMITTED" |
| 0              | "172.18.103.120" | "FAILED"   | 2023-01-14T00:18:32.000000 | 2023-01-14T06:22:42.000000 | ""                |
| 1              | "172.18.103.125" | "FAILED"   | 2023-01-14T00:18:32.000000 | 2023-01-14T17:22:08.000000 | ""                |
| 2              | "172.18.103.113" | "RUNNING"  | 2023-01-14T00:18:32.000000 |                            | ""                |
| "Total:3"      | "Succeeded:0"    | "Failed:2" | "In Progress:1"            | ""                         | ""                |
+----------------+------------------+------------+----------------------------+----------------------------+-------------------+

(root@nebula) [(none)]> show hosts
+------------------+------+-----------+----------+--------------+----------------------+------------------------+---------+
| Host             | Port | HTTP port | Status   | Leader count | Leader distribution  | Partition distribution | Version |
+------------------+------+-----------+----------+--------------+----------------------+------------------------+---------+
| "172.18.103.113" | 9779 | 19779     | "ONLINE" | 0            | "No valid partition" | "Unified:30"           | "3.3.0" |
| "172.18.103.120" | 9779 | 19779     | "ONLINE" | 13           | "Unified:13"         | "Unified:30"           | "3.3.0" |
| "172.18.103.125" | 9779 | 19779     | "ONLINE" | 17           | "Unified:17"         | "Unified:30"           | "3.3.0" |
+------------------+------+-----------+----------+--------------+----------------------+------------------------+---------+

After 172.18.103.113 was restarted, are there any log lines containing reportTaskFinish()? If so, please paste them.


I searched all of today's logs; there is nothing related.

That is a bit odd. Please confirm the following:
1. Check all the storaged logs on the .113 machine.
2. Has the data directory on that machine been touched?
3. You can stop the job; from the code's point of view, this compact job has already failed anyway.

  1. I believe I have gone through them all; there is no such message.
  2. The data directory has not been touched. However, before the holiday I did modify storage.conf and added a path on another disk to data_path.
  3. I do want to stop the job, but I don't know how to do it; from the docs it seems a compaction cannot be stopped.
    Compaction - NebulaGraph Database Manual

I checked with my colleagues, and this is indeed a bug. It was fixed in this PR: https://github.com/vesoft-inc/nebula/pull/5195

You can just leave the remains of this job there and ignore it. If it interferes with submitting jobs later on, you can restart metad.


More than an hour has passed since restarting metad. The subsequent BALANCE LEADER job does go into RUNNING, but the old compact job still never finishes.

I checked the relevant meta logs; just normal heartbeats, nothing else.

(root@nebula) [(none)]> show hosts
+------------------+------+-----------+----------+--------------+----------------------+------------------------+---------+
| Host             | Port | HTTP port | Status   | Leader count | Leader distribution  | Partition distribution | Version |
+------------------+------+-----------+----------+--------------+----------------------+------------------------+---------+
| "172.18.103.113" | 9779 | 19779     | "ONLINE" | 0            | "No valid partition" | "Unified:30"           | "3.3.0" |
| "172.18.103.120" | 9779 | 19779     | "ONLINE" | 13           | "Unified:13"         | "Unified:30"           | "3.3.0" |
| "172.18.103.125" | 9779 | 19779     | "ONLINE" | 17           | "Unified:17"         | "Unified:30"           | "3.3.0" |
+------------------+------+-----------+----------+--------------+----------------------+------------------------+---------+
Got 3 rows (time spent 4.422ms/6.104329ms)

Sun, 29 Jan 2023 15:57:48 CST

(root@nebula) [(none)]> use Unified
Execution succeeded (time spent 2.106ms/3.05725ms)

Sun, 29 Jan 2023 15:57:52 CST

(root@nebula) [Unified]> show jobs
+--------+------------------+-----------+----------------------------+-----------+
| Job Id | Command          | Status    | Start Time                 | Stop Time |
+--------+------------------+-----------+----------------------------+-----------+
| 7      | "LEADER_BALANCE" | "RUNNING" | 2023-01-29T05:47:43.000000 |           |
| 6      | "COMPACT"        | "RUNNING" | 2023-01-14T00:18:32.000000 |           |
+--------+------------------+-----------+----------------------------+-----------+
Got 2 rows (time spent 23.079ms/24.300296ms)

Sun, 29 Jan 2023 15:57:53 CST

Right. Because of the bug above, that status can no longer be changed. You can just ignore it; it does not affect other jobs.

One more question:
Since I added a new data_path in the storage config, I want to use data balance to rebalance the data across the two directories.
I found that data balance is now disabled.
I saw this PR; how do I actually get the data rebalanced across the two directories?

That PR only moves the data balance switch under the experimental features, so that it is decoupled from the TOSS feature. You just need to enable both
--enable_experimental_feature=true
--enable_data_balance=true
at the same time and then you can use it (see the sketch below).
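
A minimal end-to-end sketch of this, under a couple of assumptions: that both switches are graphd flags set in nebula-graphd.conf (followed by a graphd restart), and that the 3.x statement for data balance is SUBMIT JOB BALANCE DATA; please verify both against the 3.3.0 manual:

# nebula-graphd.conf (assumption: both switches are graphd flags)
--enable_experimental_feature=true
--enable_data_balance=true

# after restarting graphd, from the console in the target graph space
(root@nebula) [Unified]> SUBMIT JOB BALANCE DATA;
(root@nebula) [Unified]> SHOW JOBS;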

The --data_path in the storaged config originally had the default value, --data_path=data/storage.
Later, when disk space ran low, I added a new disk and changed it to --data_path=data/storage,/data/nebula330storage.

After enabling those flags, I found that balance data does not actually rebalance the existing data into the new directory. How can I achieve that?

The data balance feature balances partitions between hosts, so it won't achieve what you want this way. You can either change the data path (and move the data) so that one storaged's data lives on the newly added disk, or add a new host backed by the new disk and then run data balance.

  • Adding a new host is not very convenient; this is currently a three-node test cluster.
  • How do I go about the first option?

You can either change the data path (and move the data) so that one storaged's data lives on the newly added disk

  • In that case, when importing data, will new data preferentially go to the original directory (the old disk that is short on space)?