Toggling disable_auto_compactions: all storaged go OFFLINE

  • nebula version: v2-nightly (2021-03-19 build)
  • deployment (distributed / standalone / Docker / DBaaS): docker swarm × 5 nodes
  • hardware
    • disk (SSD recommended): SSD
    • CPU / memory: 96 cores, 256 GB

I first set disable_auto_compactions=true and imported the data.
Then I set disable_auto_compactions=false. The compaction covered roughly 5000+ files, and every storaged instance went OFFLINE.
Part of the log is below — how should I handle this?

(u@nebula) [merchant_graph]> show hosts storage
+-----------------+------+-----------+-----------+--------------+
| Host            | Port | Status    | Role      | Git Info Sha |
+-----------------+------+-----------+-----------+--------------+
| "n1"   | 9779 | "OFFLINE" | "STORAGE" | "269f606"    |
+-----------------+------+-----------+-----------+--------------+
| "n2"  | 9779 | "OFFLINE" | "STORAGE" | "269f606"    |
+-----------------+------+-----------+-----------+--------------+
| "n3"  | 9779 | "OFFLINE" | "STORAGE" | "269f606"    |
+-----------------+------+-----------+-----------+--------------+
| "n4"  | 9779 | "OFFLINE" | "STORAGE" | "269f606"    |
+-----------------+------+-----------+-----------+--------------+
| "n5" | 9779 | "OFFLINE" | "STORAGE" | "269f606"    |
+-----------------+------+-----------+-----------+--------------+

storage1

I0319 16:54:32.744014   171 EventListner.h:18] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5268 files into 0, base level is 0, output level is 1
I0319 16:54:32.745740   171 CompactionFilter.h:66] Do default minor compaction!
I0319 16:55:02.913374    40 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 65] , total time:30016ms, Total send logs: 2
W0319 16:55:02.913594    40 RaftPart.cpp:1015] [Port: 9780, Space: 1, Part: 65] Only 0 hosts succeeded, Need to try again
I0319 16:55:02.920315    40 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 55] , total time:30023ms, Total send logs: 2

storage2

I0319 17:06:36.928706    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 100] , total time:30016ms, Total send logs: 2
I0319 17:06:38.435526    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 52] , total time:30027ms, Total send logs: 2
I0319 17:06:38.502779    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 50] , total time:30024ms, Total send logs: 2
I0319 17:06:51.297026    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 5] , total time:30022ms, Total send logs: 2
I0319 17:07:04.609537    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 70] , total time:30019ms, Total send logs: 2
I0319 17:07:04.615329    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 92] , total time:30027ms, Total send logs: 2
I0319 17:07:04.642742    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 2] , total time:30011ms, Total send logs: 2
I0319 17:07:04.656733    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 27] , total time:30022ms, Total send logs: 2
I0319 17:07:06.937398    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 100] , total time:30021ms, Total send logs: 2
I0319 17:07:08.443768    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 52] , total time:30020ms, Total send logs: 2
I0319 17:07:08.511430    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 50] , total time:30021ms, Total send logs: 2
I0319 17:07:21.305799    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 5] , total time:30021ms, Total send logs: 2
I0319 17:07:34.611845    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 92] , total time:30008ms, Total send logs: 2
W0319 17:07:34.612061    12 RaftPart.cpp:1015] [Port: 9780, Space: 1, Part: 92] Only 0 hosts succeeded, Need to try again
I0319 17:07:34.620101    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 70] , total time:30023ms, Total send logs: 2
I0319 17:07:34.650038    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 2] , total time:30020ms, Total send logs: 2
I0319 17:07:34.671274    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 27] , total time:30027ms, Total send logs: 2
I0319 17:07:36.939249    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 100] , total time:30014ms, Total send logs: 2
I0319 17:07:38.448431    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 52] , total time:30017ms, Total send logs: 2
I0319 17:07:38.515970    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 50] , total time:30016ms, Total send logs: 2
I0319 17:07:51.316084    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 5] , total time:30022ms, Total send logs: 2
I0319 17:08:04.622735    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 92] , total time:30022ms, Total send logs: 2
I0319 17:08:04.630560    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 70] , total time:30022ms, Total send logs: 2
I0319 17:08:04.656745    12 SlowOpTracker.h:33] [Port: 9780, Space: 1, Part: 2] , total time:30018ms, Total send logs: 2

The heartbeats have probably all timed out because IO is blocked. This is a known old issue.

  1. Increase heartbeat_interval_secs everywhere, e.g. to 100.
  2. Reduce RocksDB's concurrent compaction jobs: max_subcompactions = 3, max_background_jobs = 3.
  3. Throttle compaction with --rate_limit=20.
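Put together, the three suggestions might look roughly like this in nebula-storaged.conf (a sketch only — the exact flag names and the JSON form of --rocksdb_db_options should be verified against your build's default config file):

```ini
########## sketch of the three suggestions above ##########
# 1. Larger heartbeat, so an IO stall during compaction does not
#    immediately mark the host OFFLINE (set the same value on metad
#    and graphd as well).
--heartbeat_interval_secs=100

# 2. Fewer concurrent compaction jobs inside RocksDB.
--rocksdb_db_options={"max_subcompactions":"3","max_background_jobs":"3"}

# 3. Cap compaction throughput (MB/s).
--rate_limit=20
```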

OK, I'll give it a try.

Setting --rate_limit also slows down normal writes — how do I balance that?

Hmm — try a larger value then? RocksDB tuning is half black magic anyway.

Of course, spending money on faster disks is the simpler fix.

Emm, I'll go fiddle with the black magic then.

A primary-school math puzzle:
one tap fills the pool, another tap drains it.
What do you do when the inflow is too fast? What do you do when the pool is full?

I get the general picture. What I want to know is how you balance write throughput against compaction.

RocksDB has a stall-write mechanism: when L0 accumulates too many files, incoming writes get stalled for a while — that's the inflow tap and the pool.
rate_limit caps how fast L0 is compacted up into higher levels — that's the outflow tap.
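The two-taps analogy can be made concrete with a toy model (all numbers and names here are made up for illustration; the real RocksDB knobs behind it are `level0_slowdown_writes_trigger` and `level0_stop_writes_trigger`):

```python
# Toy model of RocksDB write stalls vs. a rate-limited compaction.
# Inflow tap: flushes add L0 files. Outflow tap: compaction removes them.
SLOWDOWN_TRIGGER = 20   # L0 count where writes are throttled (pool nearly full)
STOP_TRIGGER = 36       # L0 count where writes stall entirely (pool full)

def simulate(ticks, inflow_per_tick, compact_per_tick):
    """Return (slowed_ticks, stalled_ticks) for given inflow/outflow rates."""
    l0 = 0.0
    slowed = stalled = 0
    for _ in range(ticks):
        if l0 >= STOP_TRIGGER:
            stalled += 1                      # pool full: close the inflow tap
        elif l0 >= SLOWDOWN_TRIGGER:
            slowed += 1                       # pool nearly full: half-close it
            l0 += inflow_per_tick / 2
        else:
            l0 += inflow_per_tick             # inflow tap fully open
        l0 = max(0.0, l0 - compact_per_tick)  # outflow tap: rate-limited compaction
    return slowed, stalled

# Outflow slower than inflow -> the writer ends up permanently throttled;
# outflow faster than inflow -> L0 never builds up and writes run free.
print(simulate(1000, 4, 2))
print(simulate(1000, 4, 5))
```

The point of the model: a strict rate_limit (a small `compact_per_tick`) doesn't just slow compaction, it eventually pushes the writer itself into the slowdown/stall region, which is exactly the trade-off being discussed.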


Veteran move — that analogy is unbeatable.

It still fails in my tests: with rate_limit=20 the storaged instances still go OFFLINE. At that rate the disk IO is nowhere near its limit, and rate_limit also throttles normal flushes to disk, so I can't keep auto compaction disabled either.
There isn't much headroom to upgrade the disks — they're NVMe SSDs averaging 300 MB/s+ on writes.
Could compaction be run one node at a time, so the cluster keeps serving during compaction instead of every node going OFFLINE?

Did you increase heartbeat_interval_secs?

All three items you mentioned are changed. This is part of the docker swarm config; checking from the console, the heartbeat interval shows the value I set.
This is metad's:


This is storaged's:

Did you change this flag on storaged and graphd as well?
That said, the code has been changing a lot these days — maybe wait for the GA release and try again.

OK, will do.

Did parameter tuning ever solve this? Can I just wait for it to recover on its own?

> Toggling disable_auto_compactions: all storaged go OFFLINE

See the replies above — this is a bug, and it will not recover on its own.