Compaction has been running for two days and still hasn't finished

Question reference template:

  • NebulaGraph version: 3.3.0

  • Deployment: distributed

  • Installation method: RPM

  • Production environment: N

  • Hardware info

    • Disk (SSD recommended): HDD

  • Detailed description of the problem

  • Relevant meta / storage / graph info logs (preferably in text form for easy searching)


+----------------+------------------+------------+----------------------------+----------------------------+-------------------+
| Job Id(TaskId) | Command(Dest)    | Status     | Start Time                 | Stop Time                  | Error Code        |
+----------------+------------------+------------+----------------------------+----------------------------+-------------------+
| 6              | "COMPACT"        | "RUNNING"  | 2023-01-14T00:18:32.000000 |                            | "E_JOB_SUBMITTED" |
| 0              | "172.18.103.120" | "FAILED"   | 2023-01-14T00:18:32.000000 | 2023-01-14T06:22:42.000000 | ""                |
| 1              | "172.18.103.125" | "FAILED"   | 2023-01-14T00:18:32.000000 | 2023-01-14T17:22:08.000000 | ""                |
| 2              | "172.18.103.113" | "RUNNING"  | 2023-01-14T00:18:32.000000 |                            | ""                |
| "Total:3"      | "Succeeded:0"    | "Failed:2" | "In Progress:1"            | ""                         | ""                |
+----------------+------------------+------------+----------------------------+----------------------------+-------------------+

The compaction has been running for several days. The jobs on the other two hosts failed, but the one on the meta leader is still running and never finishes. Is there any way to resolve this?
PS: this cluster runs on HDDs, so I/O performance is poor.

  1. You can stop the job (see the sketch after this list); if it cannot be stopped, you can try restarting storaged directly.
  2. Check the storaged logs on the two failed hosts to see why they failed. Out of disk space?
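
For reference, a minimal console sketch of stopping the job, assuming the job id is 6 as shown above. STOP JOB is the generic job-management statement in nGQL 3.x, though per the compaction docs an already-running compaction may not actually be interruptible:

(root@nebula) [Unified]> SHOW JOBS;
(root@nebula) [Unified]> STOP JOB 6;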

  1. I have already restarted storaged. Should I try restarting it again?
  2. Yes, it was out of space. Disk usage was at 90% when I started the full compaction; at one point it hit 100% and I could not even connect with the console. A day later disk usage came back down, but one host still never finishes.

This node has been offline for a long time, but the compact job still has not finished :open_mouth:

(root@nebula) [(none)]> show hosts
+------------------+------+-----------+-----------+--------------+----------------------+------------------------+---------+
| Host             | Port | HTTP port | Status    | Leader count | Leader distribution  | Partition distribution | Version |
+------------------+------+-----------+-----------+--------------+----------------------+------------------------+---------+
| "172.18.103.113" | 9779 | 19779     | "OFFLINE" | 0            | "No valid partition" | "Unified:30"           | "3.3.0" |
| "172.18.103.120" | 9779 | 19779     | "ONLINE"  | 13           | "Unified:13"         | "Unified:30"           | "3.3.0" |
| "172.18.103.125" | 9779 | 19779     | "ONLINE"  | 17           | "Unified:17"         | "Unified:30"           | "3.3.0" |
+------------------+------+-----------+-----------+--------------+----------------------+------------------------+---------+
Got 3 rows (time spent 5.88ms/8.067512ms)

Tue, 17 Jan 2023 17:53:29 CST

(root@nebula) [(none)]> use Unified
Execution succeeded (time spent 2.488ms/3.123339ms)

Tue, 17 Jan 2023 17:53:35 CST

(root@nebula) [Unified]> show jobs
+--------+------------------+------------+----------------------------+----------------------------+
| Job Id | Command          | Status     | Start Time                 | Stop Time                  |
+--------+------------------+------------+----------------------------+----------------------------+
| 7      | "LEADER_BALANCE" | "QUEUE"    |                            |                            |
| 6      | "COMPACT"        | "RUNNING"  | 2023-01-14T00:18:32.000000 |                            |
| 5      | "COMPACT"        | "FINISHED" | 2023-01-13T03:10:26.000000 | 2023-01-13T07:11:16.000000 |
| 4      | "COMPACT"        | "FINISHED" | 2023-01-11T11:51:45.000000 | 2023-01-11T16:37:11.000000 |
| 3      | "LEADER_BALANCE" | "FINISHED" | 2023-01-11T01:50:48.000000 | 2023-01-11T01:50:54.000000 |
+--------+------------------+------------+----------------------------+----------------------------+
Got 5 rows (time spent 23.884ms/24.97749ms)

Tue, 17 Jan 2023 17:53:36 CST

Once the host goes offline the job indeed cannot complete; it only completes after storaged reports completion back to meta.


If the offline storaged comes back later, can the job eventually finish on its own? The node I am looking at now shows ONLINE, but the job is still RUNNING.

After a restart, basically all you can do is wait for it to fail. Is it still in the RUNNING state? You can run SHOW JOB 6 to take a look.

Here is the information. This is node 113; as far as I can tell, disk space is sufficient now.
I restarted the storaged service on 113 this morning:
Logs

I20230128 09:50:07.366894  5015 NebulaStore.cpp:92] Scan path "/data/nebula330storage/nebula/1"
I20230128 09:50:07.367002  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_background_jobs=8
I20230128 09:50:07.367030  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_subcompactions=8
I20230128 09:50:07.367213  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_bytes_for_level_base=268435456
I20230128 09:50:07.367237  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_write_buffer_number=4
I20230128 09:50:07.367271  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option write_buffer_size=67108864
I20230128 09:50:07.367287  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option disable_auto_compactions=true
I20230128 09:50:07.376287  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option block_size=8192
I20230128 09:50:07.487393  5015 RocksEngine.cpp:97] open rocksdb on /data/nebula330storage/nebula/1/data
I20230128 09:50:07.487450  5015 RocksEngine.h:196] Release rocksdb on /data/nebula330storage/nebula/1
I20230128 09:50:07.489181  5015 NebulaStore.cpp:271] Init data from partManager for "172.18.103.113":9779
I20230128 09:50:07.489264  5015 NebulaStore.cpp:369] Data space 1 has existed!
I20230128 09:50:07.489316  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_background_jobs=8
I20230128 09:50:07.489328  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_subcompactions=8
I20230128 09:50:07.489435  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_bytes_for_level_base=268435456
I20230128 09:50:07.489446  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_write_buffer_number=4
I20230128 09:50:07.489456  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option write_buffer_size=67108864
I20230128 09:50:07.489466  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option disable_auto_compactions=true
I20230128 09:50:07.489630  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option block_size=8192
I20230128 09:50:07.553908  5015 RocksEngine.cpp:97] open rocksdb on /data/nebula330storage/nebula/1/data
I20230128 09:50:07.554031  5015 NebulaStore.cpp:430] [Space: 1, Part: 1] has existed!
I20230128 09:50:07.554078  5015 NebulaStore.cpp:430] [Space: 1, Part: 2] has existed!
I20230128 09:50:07.554093  5015 NebulaStore.cpp:430] [Space: 1, Part: 3] has existed!
I20230128 09:50:07.554107  5015 NebulaStore.cpp:430] [Space: 1, Part: 4] has existed!
I20230128 09:50:07.554121  5015 NebulaStore.cpp:430] [Space: 1, Part: 5] has existed!
I20230128 09:50:07.554133  5015 NebulaStore.cpp:430] [Space: 1, Part: 6] has existed!
I20230128 09:50:07.554147  5015 NebulaStore.cpp:430] [Space: 1, Part: 7] has existed!
I20230128 09:50:07.554160  5015 NebulaStore.cpp:430] [Space: 1, Part: 8] has existed!
I20230128 09:50:07.554183  5015 NebulaStore.cpp:430] [Space: 1, Part: 9] has existed!
I20230128 09:50:07.554195  5015 NebulaStore.cpp:430] [Space: 1, Part: 10] has existed!
I20230128 09:50:07.554208  5015 NebulaStore.cpp:430] [Space: 1, Part: 11] has existed!
I20230128 09:50:07.554222  5015 NebulaStore.cpp:430] [Space: 1, Part: 12] has existed!
I20230128 09:50:07.554235  5015 NebulaStore.cpp:430] [Space: 1, Part: 13] has existed!
I20230128 09:50:07.554250  5015 NebulaStore.cpp:430] [Space: 1, Part: 14] has existed!
I20230128 09:50:07.554262  5015 NebulaStore.cpp:430] [Space: 1, Part: 15] has existed!
I20230128 09:50:07.554275  5015 NebulaStore.cpp:430] [Space: 1, Part: 16] has existed!
I20230128 09:50:07.554289  5015 NebulaStore.cpp:430] [Space: 1, Part: 17] has existed!
I20230128 09:50:07.554302  5015 NebulaStore.cpp:430] [Space: 1, Part: 18] has existed!
I20230128 09:50:07.554315  5015 NebulaStore.cpp:430] [Space: 1, Part: 19] has existed!
I20230128 09:50:07.554328  5015 NebulaStore.cpp:430] [Space: 1, Part: 20] has existed!
I20230128 09:50:07.554347  5015 NebulaStore.cpp:430] [Space: 1, Part: 21] has existed!
I20230128 09:50:07.554360  5015 NebulaStore.cpp:430] [Space: 1, Part: 22] has existed!
I20230128 09:50:07.554378  5015 NebulaStore.cpp:430] [Space: 1, Part: 23] has existed!
I20230128 09:50:07.554397  5015 NebulaStore.cpp:430] [Space: 1, Part: 24] has existed!
I20230128 09:50:07.554414  5015 NebulaStore.cpp:430] [Space: 1, Part: 25] has existed!
I20230128 09:50:07.554430  5015 NebulaStore.cpp:430] [Space: 1, Part: 26] has existed!
I20230128 09:50:07.554442  5015 NebulaStore.cpp:430] [Space: 1, Part: 27] has existed!
I20230128 09:50:07.554455  5015 NebulaStore.cpp:430] [Space: 1, Part: 28] has existed!
I20230128 09:50:07.554468  5015 NebulaStore.cpp:430] [Space: 1, Part: 29] has existed!
I20230128 09:50:07.554481  5015 NebulaStore.cpp:430] [Space: 1, Part: 30] has existed!
I20230128 09:50:07.554563  5015 NebulaStore.cpp:78] Register handler...
I20230128 09:50:07.554576  5015 StorageServer.cpp:228] Init LogMonitor
I20230128 09:50:07.554723  5015 StorageServer.cpp:96] Starting Storage HTTP Service
I20230128 09:50:07.555509  5015 StorageServer.cpp:100] Http Thread Pool started
I20230128 09:50:07.560770  5179 WebService.cpp:124] Web service started on HTTP[19779]
I20230128 09:50:07.560891  5015 TransactionManager.cpp:24] TransactionManager ctor()
I20230128 09:50:07.561961  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_background_jobs=8
I20230128 09:50:07.561991  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_subcompactions=8
I20230128 09:50:07.562110  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_bytes_for_level_base=268435456
I20230128 09:50:07.562124  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option max_write_buffer_number=4
I20230128 09:50:07.562136  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option write_buffer_size=67108864
I20230128 09:50:07.562150  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option disable_auto_compactions=true
I20230128 09:50:07.562343  5015 RocksEngineConfig.cpp:366] Emplace rocksdb option block_size=8192
I20230128 09:50:07.641964  5015 RocksEngine.cpp:97] open rocksdb on /mnt/nebula330/data/storage/nebula/0/data
I20230128 09:50:07.642197  5015 AdminTaskManager.cpp:22] max concurrent subtasks: 10
I20230128 09:50:07.642505  5015 AdminTaskManager.cpp:40] exit AdminTaskManager::init()
I20230128 09:50:07.642700  5200 AdminTaskManager.cpp:227] waiting for incoming task
I20230128 09:50:28.812856  5081 MetaClient.cpp:3108] Load leader of "172.18.103.113":9779 in 0 space
I20230128 09:50:28.812965  5081 MetaClient.cpp:3108] Load leader of "172.18.103.120":9779 in 1 space
I20230128 09:50:28.812988  5081 MetaClient.cpp:3108] Load leader of "172.18.103.125":9779 in 1 space
I20230128 09:50:28.812997  5081 MetaClient.cpp:3114] Load leader ok
(root@nebula) [Unified]> show job 6
+----------------+------------------+------------+----------------------------+----------------------------+-------------------+
| Job Id(TaskId) | Command(Dest)    | Status     | Start Time                 | Stop Time                  | Error Code        |
+----------------+------------------+------------+----------------------------+----------------------------+-------------------+
| 6              | "COMPACT"        | "RUNNING"  | 2023-01-14T00:18:32.000000 |                            | "E_JOB_SUBMITTED" |
| 0              | "172.18.103.120" | "FAILED"   | 2023-01-14T00:18:32.000000 | 2023-01-14T06:22:42.000000 | ""                |
| 1              | "172.18.103.125" | "FAILED"   | 2023-01-14T00:18:32.000000 | 2023-01-14T17:22:08.000000 | ""                |
| 2              | "172.18.103.113" | "RUNNING"  | 2023-01-14T00:18:32.000000 |                            | ""                |
| "Total:3"      | "Succeeded:0"    | "Failed:2" | "In Progress:1"            | ""                         | ""                |
+----------------+------------------+------------+----------------------------+----------------------------+-------------------+

(root@nebula) [(none)]> show hosts
+------------------+------+-----------+----------+--------------+----------------------+------------------------+---------+
| Host             | Port | HTTP port | Status   | Leader count | Leader distribution  | Partition distribution | Version |
+------------------+------+-----------+----------+--------------+----------------------+------------------------+---------+
| "172.18.103.113" | 9779 | 19779     | "ONLINE" | 0            | "No valid partition" | "Unified:30"           | "3.3.0" |
| "172.18.103.120" | 9779 | 19779     | "ONLINE" | 13           | "Unified:13"         | "Unified:30"           | "3.3.0" |
| "172.18.103.125" | 9779 | 19779     | "ONLINE" | 17           | "Unified:17"         | "Unified:30"           | "3.3.0" |
+------------------+------+-----------+----------+--------------+----------------------+------------------------+---------+

After 172.18.103.113 was restarted, are there any log lines containing reportTaskFinish()? If so, please paste them.


I searched all of today's logs; there is nothing related.

That is a bit odd. Please confirm the following:
1. Check all the storaged logs on the .113 machine.
2. Has the data directory on that machine been touched?
3. You can stop the job; from the code's point of view, this compact job has already failed anyway.

  1. I believe I have gone through them all; there is no such message.
  2. The data directory has not been touched. However, before the holiday I did modify storage.conf and added a path on another disk to data_path.
  3. I do want to stop the job, but I don't know how to do it; from the docs it seems a compaction cannot be stopped.
    Compaction - NebulaGraph Database Manual

I checked with my colleagues, and this is indeed a bug. It was fixed in this PR: https://github.com/vesoft-inc/nebula/pull/5195

You can just leave the remains of this job there and ignore it. If it interferes with submitting jobs later on, you can restart metad.


More than an hour has passed since restarting metad. The subsequent BALANCE LEADER job does go into RUNNING, but the old compact job still never finishes.

I checked the relevant meta logs; just normal heartbeats, nothing else.

(root@nebula) [(none)]> show hosts
+------------------+------+-----------+----------+--------------+----------------------+------------------------+---------+
| Host             | Port | HTTP port | Status   | Leader count | Leader distribution  | Partition distribution | Version |
+------------------+------+-----------+----------+--------------+----------------------+------------------------+---------+
| "172.18.103.113" | 9779 | 19779     | "ONLINE" | 0            | "No valid partition" | "Unified:30"           | "3.3.0" |
| "172.18.103.120" | 9779 | 19779     | "ONLINE" | 13           | "Unified:13"         | "Unified:30"           | "3.3.0" |
| "172.18.103.125" | 9779 | 19779     | "ONLINE" | 17           | "Unified:17"         | "Unified:30"           | "3.3.0" |
+------------------+------+-----------+----------+--------------+----------------------+------------------------+---------+
Got 3 rows (time spent 4.422ms/6.104329ms)

Sun, 29 Jan 2023 15:57:48 CST

(root@nebula) [(none)]> use Unified
Execution succeeded (time spent 2.106ms/3.05725ms)

Sun, 29 Jan 2023 15:57:52 CST

(root@nebula) [Unified]> show jobs
+--------+------------------+-----------+----------------------------+-----------+
| Job Id | Command          | Status    | Start Time                 | Stop Time |
+--------+------------------+-----------+----------------------------+-----------+
| 7      | "LEADER_BALANCE" | "RUNNING" | 2023-01-29T05:47:43.000000 |           |
| 6      | "COMPACT"        | "RUNNING" | 2023-01-14T00:18:32.000000 |           |
+--------+------------------+-----------+----------------------------+-----------+
Got 2 rows (time spent 23.079ms/24.300296ms)

Sun, 29 Jan 2023 15:57:53 CST

Right. Because of the bug above, that status can no longer be changed. You can just ignore it; it does not affect other jobs.

One more question:
Since I added a new data_path in the storage config, I want to use data balance to rebalance the data across the two directories.
I found that data balance is now disabled.
I saw this PR; how do I actually get the data rebalanced across the two directories?

That PR only moves the data balance switch under the experimental features, so that it is decoupled from the TOSS feature. You just need to enable both
--enable_experimental_feature=true
--enable_data_balance=true
at the same time and then you can use it (see the sketch below).
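
A minimal end-to-end sketch of this, under a couple of assumptions: that both switches are graphd flags set in nebula-graphd.conf (followed by a graphd restart), and that the 3.x statement for data balance is SUBMIT JOB BALANCE DATA; please verify both against the 3.3.0 manual:

# nebula-graphd.conf (assumption: both switches are graphd flags)
--enable_experimental_feature=true
--enable_data_balance=true

# after restarting graphd, from the console in the target graph space
(root@nebula) [Unified]> SUBMIT JOB BALANCE DATA;
(root@nebula) [Unified]> SHOW JOBS;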

The --data_path in the storaged config originally had the default value, --data_path=data/storage.
Later, when disk space ran low, I added a new disk and changed it to --data_path=data/storage,/data/nebula330storage.

After enabling those flags, I found that balance data does not actually rebalance the existing data into the new directory. How can I achieve that?

The data balance feature balances partitions between hosts, so it won't achieve what you want this way. You can either change the data path (and move the data) so that one storaged's data lives on the newly added disk, or add a new host backed by the new disk and then run data balance.

  • Adding a new host is not very convenient; this is currently a three-node test cluster.
  • How do I go about the first option?

You can either change the data path (and move the data) so that one storaged's data lives on the newly added disk

  • In that case, when importing data, will new data preferentially go to the original directory (the old disk that is short on space)?