storaged node CPU stays high and its leader count is 0

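  • nebula version: 3.8.0
  • Deployment: distributed, 5 machines in total: 5 graphd services, 5 storaged services, and 3 metad services
  • Installation method: RPM
  • In production: Y
  • Hardware
    • Disk: SSD; each machine has 12 SSDs of 800 GB each
    • CPU and memory: 48 cores and 250 GB of memory per machine
  • Problem description:
  • On one machine, storaged has used far more CPU than on the other machines for a long time, as shown in the figure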

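    This machine's load has also been persistently high.

    Its leader count also stays at 0. After running leader balance it picks up some leaders, but after a while the count drops back to 0.

    Its network usage is also consistently higher than that of the other machines: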

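This machine's disk once filled up. While troubleshooting, I followed advice from the group chat and manually deleted a large number of logs under ndata/wal. The machine has been in this state ever since, and restarting does not fix it.
While a full compact is running, its load also differs from that of the other machines, as shown in the figure below: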


A full compact currently takes more than 15 hours, which is extremely slow, and I don't know why.

Please post the storaged.INFO log.

The log is all repetition, so I am posting one excerpt of each type.
Log captured on March 4, while no full compact was running

I20260304 11:00:08.656034 28027 CompactionFilter.h:85] Do automatic or periodic compaction!
I20260304 11:00:08.656054   786 CompactionFilter.h:85] Do automatic or periodic compaction!
I20260304 11:00:09.365978 28035 EventListener.h:158] Stall conditions changed column family: default, current condition: Delayed, previous condition: Normal
I20260304 11:00:11.395750 28021 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelMaxLevelSize, status: OK, compacted 4 files into 3, base level is 3, output level is 4
I20260304 11:00:11.397199 28021 EventListener.h:158] Stall conditions changed column family: default, current condition: Normal, previous condition: Delayed
I20260304 11:00:11.398168 28021 EventListener.h:21] Rocksdb start compaction column family: default because of LevelMaxLevelSize, status: OK, compacted 1 files into 0, base level is 1, output level is 2
I20260304 11:00:11.420118 28021 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelMaxLevelSize, status: OK, compacted 1 files into 1, base level is 1, output level is 2
I20260304 11:00:11.421952 28021 EventListener.h:21] Rocksdb start compaction column family: default because of LevelMaxLevelSize, status: OK, compacted 1 files into 0, base level is 3, output level is 4
I20260304 11:00:11.449342 28021 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelMaxLevelSize, status: OK, compacted 1 files into 1, base level is 3, output level is 4
I20260304 11:00:11.451213 28021 EventListener.h:21] Rocksdb start compaction column family: default because of LevelMaxLevelSize, status: OK, compacted 4 files into 0, base level is 3, output level is 4
I20260304 11:00:11.451274 28021 CompactionFilter.h:85] Do automatic or periodic compaction!
I20260304 11:00:11.694576 28022 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelMaxLevelSize, status: OK, compacted 9 files into 4, base level is 1, output level is 2
I20260304 11:00:11.696604 28022 EventListener.h:21] Rocksdb start compaction column family: default because of LevelMaxLevelSize, status: OK, compacted 1 files into 0, base level is 2, output level is 3
I20260304 11:00:11.711776 28022 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelMaxLevelSize, status: OK, compacted 1 files into 1, base level is 2, output level is 3
I20260304 11:00:11.713809 28022 EventListener.h:21] Rocksdb start compaction column family: default because of Ttl, status: OK, compacted 1 files into 0, base level is 4, output level is 5
I20260304 11:00:11.731420 28022 EventListener.h:35] Rocksdb compaction completed column family: default because of Ttl, status: OK, compacted 1 files into 1, base level is 4, output level is 5
I20260304 11:00:11.733223 28022 EventListener.h:21] Rocksdb start compaction column family: default because of Ttl, status: OK, compacted 1 files into 0, base level is 4, output level is 5
I20260304 11:00:11.750456 28022 EventListener.h:35] Rocksdb compaction completed column family: default because of Ttl, status: OK, compacted 1 files into 1, base level is 4, output level is 5
I20260304 11:00:11.752414 28022 EventListener.h:21] Rocksdb start compaction column family: default because of Ttl, status: OK, compacted 1 files into 0, base level is 4, output level is 5
I20260304 11:00:11.773178 28022 EventListener.h:35] Rocksdb compaction completed column family: default because of Ttl, status: OK, compacted 1 files into 1, base level is 4, output level is 5
I20260304 11:00:11.775159 28022 EventListener.h:21] Rocksdb start compaction column family: default because of Ttl, status: OK, compacted 1 files into 0, base level is 4, output level is 5
I20260304 11:00:11.794433 28022 EventListener.h:35] Rocksdb compaction completed column family: default because of Ttl, status: OK, compacted 1 files into 1, base level is 4, output level is 5
I20260304 11:00:11.796516 28022 EventListener.h:21] Rocksdb start compaction column family: default because of LevelMaxLevelSize, status: OK, compacted 7 files into 0, base level is 3, output level is 4
I20260304 11:00:11.796592 28022 CompactionFilter.h:85] Do automatic or periodic compaction!
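
Because the excerpt is so repetitive, a tally may be easier to read than the raw lines. The following is a minimal Python sketch, not an official tool: it assumes the glog prefix described by the "Log line format" header of storaged.INFO, and the function name `summarize` and the regular expressions are illustrative choices based only on the message texts shown above. It counts completed compactions by reason and counts stall transitions.

```python
import re
from collections import Counter

# glog prefix as described in the log header:
# [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
LINE_RE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^:\s]+:\d+)\] (?P<msg>.*)$"
)
# Message patterns copied from the excerpt above (an assumption that they are stable).
REASON_RE = re.compile(r"Rocksdb compaction completed .* because of (\w+),")
STALL_RE = re.compile(r"current condition: (\w+), previous condition: (\w+)")


def summarize(lines):
    """Return (completed compactions by reason, stall transitions as (previous, current))."""
    reasons, stalls = Counter(), Counter()
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # header lines and wrapped continuation fragments are skipped
        msg = m.group("msg")
        reason = REASON_RE.search(msg)
        if reason:
            reasons[reason.group(1)] += 1
        stall = STALL_RE.search(msg)
        if stall:
            stalls[(stall.group(2), stall.group(1))] += 1
    return reasons, stalls


# Usage (the path is a placeholder, not taken from the thread):
# with open("/path/to/storaged.INFO", errors="replace") as fh:
#     reasons, stalls = summarize(fh)
#     print(reasons.most_common(), stalls.most_common())
```

Comparing these counts between the affected node and a healthy one over the same time window would show whether the affected node compacts or stalls noticeably more often.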

Log captured on the morning of March 5, while a full compact was running

I20260305 08:58:58.705614 28031 EventListener.h:35] Rocksdb compaction completed column family: default because of ManualCompaction, status: OK, compacted 14 files into 14, base level is 5, output level is 5
I20260305 08:58:58.893121 28022 EventListener.h:21] Rocksdb start compaction column family: default because of ManualCompaction, status: OK, compacted 13 files into 0, base level is 5, output level is 5
I20260305 08:58:58.893332  8580 CompactionFilter.h:81] Do full/manual compaction!
I20260305 08:58:58.893357  8581 CompactionFilter.h:81] Do full/manual compaction!
I20260305 08:58:58.893378  8582 CompactionFilter.h:81] Do full/manual compaction!
I20260305 08:58:58.893409  8583 CompactionFilter.h:81] Do full/manual compaction!
I20260305 08:58:58.893586  8584 CompactionFilter.h:81] Do full/manual compaction!
I20260305 08:58:58.893702  8585 CompactionFilter.h:81] Do full/manual compaction!
I20260305 08:58:58.893827  8586 CompactionFilter.h:81] Do full/manual compaction!
I20260305 08:58:58.893934  8587 CompactionFilter.h:81] Do full/manual compaction!
I20260305 08:58:58.894059  8588 CompactionFilter.h:81] Do full/manual compaction!
I20260305 08:58:58.894207  8589 CompactionFilter.h:81] Do full/manual compaction!
I20260305 08:58:58.894203 28022 CompactionFilter.h:81] Do full/manual compaction!
I20260305 08:59:01.481825 28025 EventListener.h:35] Rocksdb compaction completed column family: default because of ManualCompaction, status: OK, compacted 13 files into 13, base level is 5, output level is 5
I20260305 08:59:01.714690 28019 EventListener.h:21] Rocksdb start compaction column family: default because of ManualCompaction, status: OK, compacted 13 files into 0, base level is 5, output level is 5

First look at the log on 156 and search for snapshot-related entries.
Don't run manual compact again until it has recovered.

OK, I have stopped the scheduled manual full compact.

Here is the log containing the keyword snapshot. I found only this one snapshot entry in storaged.INFO:

Log file created at: 2026/01/22 14:19:23
Running on machine: dn26156
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
I20260122 14:19:23.640971 27916 StorageDaemon.cpp:132] localhost = "10.177.26.156":9779
I20260122 14:19:23.641609 27916 StorageDaemon.cpp:147] data path= /data/disk01/nebula/ndata
I20260122 14:19:23.641636 27916 StorageDaemon.cpp:147] data path= /data/disk02/nebula/ndata
I20260122 14:19:23.641659 27916 StorageDaemon.cpp:147] data path= /data/disk03/nebula/ndata
I20260122 14:19:23.641680 27916 StorageDaemon.cpp:147] data path= /data/disk04/nebula/ndata
I20260122 14:19:23.641702 27916 StorageDaemon.cpp:147] data path= /data/disk05/nebula/ndata
I20260122 14:19:23.641724 27916 StorageDaemon.cpp:147] data path= /data/disk06/nebula/ndata
I20260122 14:19:23.641746 27916 StorageDaemon.cpp:147] data path= /data/disk07/nebula/ndata
I20260122 14:19:23.641768 27916 StorageDaemon.cpp:147] data path= /data/disk08/nebula/ndata
I20260122 14:19:23.641790 27916 StorageDaemon.cpp:147] data path= /data/disk09/nebula/ndata
I20260122 14:19:23.641812 27916 StorageDaemon.cpp:147] data path= /data/disk10/nebula/ndata
I20260122 14:19:23.641834 27916 StorageDaemon.cpp:147] data path= /data/disk11/nebula/ndata
I20260122 14:19:23.641855 27916 StorageDaemon.cpp:147] data path= /data/disk12/nebula/ndata
I20260122 14:19:23.710889 27916 MetaClient.cpp:80] Create meta client to "10.177.26.157":9559
I20260122 14:19:23.710969 27916 MetaClient.cpp:81] root path: /data/disk12/nebula/installation, data path size: 12
I20260122 14:19:23.711225 27916 FileBasedClusterIdMan.cpp:53] Get clusterId: 8726630418431019117
I20260122 14:19:24.356597 27916 MetaClient.cpp:3263] Load leader of "10.177.26.155":9779 in 3 space
I20260122 14:19:24.356662 27916 MetaClient.cpp:3263] Load leader of "10.177.26.156":9779 in 0 space
I20260122 14:19:24.356690 27916 MetaClient.cpp:3263] Load leader of "10.177.26.157":9779 in 1 space
I20260122 14:19:24.359793 27916 MetaClient.cpp:3263] Load leader of "10.177.26.158":9779 in 3 space
I20260122 14:19:24.360019 27916 MetaClient.cpp:3263] Load leader of "10.177.26.159":9779 in 2 space
I20260122 14:19:24.360044 27916 MetaClient.cpp:3269] Load leader ok
I20260122 14:19:24.371644 27916 MetaClient.cpp:162] Register time task for heartbeat!
I20260122 14:19:24.371765 27916 StorageServer.cpp:250] Init schema manager
I20260122 14:19:24.371796 27916 StorageServer.cpp:253] Init index manager
I20260122 14:19:24.371815 27916 StorageServer.cpp:256] Init kvstore
I20260122 14:19:24.371924 27916 NebulaStore.cpp:48] Start the raft service...
I20260122 14:19:24.379045 27916 NebulaSnapshotManager.cpp:25] Send snapshot is rate limited to 10485760 for each part by default
I20260122 14:19:24.391628 27916 RaftexService.cpp:46] Start raft service on 9780
I20260122 14:19:24.392472 27916 NebulaStore.cpp:82] Scan the local path, and init the spaces_
I20260122 14:19:24.392805 27916 NebulaStore.cpp:90] Scan path "/data/disk01/nebula/ndata/nebula/0"
I20260122 14:19:24.392848 27916 NebulaStore.cpp:90] Scan path "/data/disk01/nebula/ndata/nebula/305"
I20260122 14:19:24.395208 27916 NebulaStore.cpp:90] Scan path "/data/disk01/nebula/ndata/nebula/20"
I20260122 14:19:24.395697 28014 RocksEngineConfig.cpp:371] Emplace rocksdb option max_background_jobs=20
I20260122 14:19:24.395748 28014 RocksEngineConfig.cpp:371] Emplace rocksdb option max_subcompactions=20
I20260122 14:19:24.396466 28014 RocksEngineConfig.cpp:371] Emplace rocksdb option disable_auto_compactions=0
I20260122 14:19:24.396533 28014 RocksEngineConfig.cpp:371] Emplace rocksdb option max_bytes_for_level_base=268435456
I20260122 14:19:24.396554 28014 RocksEngineConfig.cpp:371] Emplace rocksdb option max_write_buffer_number=12
I20260122 14:19:24.396574 28014 RocksEngineConfig.cpp:371] Emplace rocksdb option write_buffer_size=67108864
I20260122 14:19:24.397262 28014 RocksEngineConfig.cpp:371] Emplace rocksdb option block_size=8192
I20260122 14:19:24.397557 27916 NebulaStore.cpp:90] Scan path "/data/disk01/nebula/ndata/nebula/37"
I20260122 14:19:24.397616 28015 RocksEngineConfig.cpp:371] Emplace rocksdb option max_background_jobs=20
I20260122 14:19:24.397647 28015 RocksEngineConfig.cpp:371] Emplace rocksdb option max_subcompactions=20
I20260122 14:19:24.397785 28015 RocksEngineConfig.cpp:371] Emplace rocksdb option disable_auto_compactions=0
I20260122 14:19:24.397796 28015 RocksEngineConfig.cpp:371] Emplace rocksdb option max_bytes_for_level_base=268435456
I20260122 14:19:24.397804 28015 RocksEngineConfig.cpp:371] Emplace rocksdb option max_write_buffer_number=12
I20260122 14:19:24.397814 28015 RocksEngineConfig.cpp:371] Emplace rocksdb option write_buffer_size=67108864
I20260122 14:19:24.398042 28015 RocksEngineConfig.cpp:371] Emplace rocksdb option block_size=8192
I20260122 14:19:24.402616 28016 RocksEngineConfig.cpp:371] Emplace rocksdb option max_background_jobs=20
I20260122 14:19:24.402670 28016 RocksEngineConfig.cpp:371] Emplace rocksdb option max_subcompactions=20
I20260122 14:19:24.403610 27916 NebulaStore.cpp:90] Scan path "/data/disk02/nebula/ndata/nebula/37"
I20260122 14:19:24.403757 28016 RocksEngineConfig.cpp:371] Emplace rocksdb option disable_auto_compactions=0
I20260122 14:19:24.403806 28016 RocksEngineConfig.cpp:371] Emplace rocksdb option max_bytes_for_level_base=268435456
I20260122 14:19:24.403827 28016 RocksEngineConfig.cpp:371] Emplace rocksdb option max_write_buffer_number=12
I20260122 14:19:24.403846 28016 RocksEngineConfig.cpp:371] Emplace rocksdb option write_buffer_size=67108864
I20260122 14:19:24.405481 28016 RocksEngineConfig.cpp:371] Emplace rocksdb option block_size=8192
I20260122 14:19:24.408442 27916 NebulaStore.cpp:90] Scan path "/data/disk02/nebula/ndata/nebula/305"
I20260122 14:19:24.409063 28023 RocksEngineConfig.cpp:371] Emplace rocksdb option max_background_jobs=20
I20260122 14:19:24.409121 28023 RocksEngineConfig.cpp:371] Emplace rocksdb option max_subcompactions=20
I20260122 14:19:24.409271 28023 RocksEngineConfig.cpp:371] Emplace rocksdb option disable_auto_compactions=0
I20260122 14:19:24.409282 28023 RocksEngineConfig.cpp:371] Emplace rocksdb option max_bytes_for_level_base=268435456
I20260122 14:19:24.409291 28023 RocksEngineConfig.cpp:371] Emplace rocksdb option max_write_buffer_number=12
I20260122 14:19:24.409300 28023 RocksEngineConfig.cpp:371] Emplace rocksdb option write_buffer_size=67108864
I20260122 14:19:24.409899 28023 RocksEngineConfig.cpp:371] Emplace rocksdb option block_size=8192
I20260122 14:19:24.415474 27916 NebulaStore.cpp:90] Scan path "/data/disk02/nebula/ndata/nebula/20"
I20260122 14:19:24.416445 28029 RocksEngineConfig.cpp:371] Emplace rocksdb option max_background_jobs=20
I20260122 14:19:24.416574 28029 RocksEngineConfig.cpp:371] Emplace rocksdb option max_subcompactions=20
I20260122 14:19:24.417732 28029 RocksEngineConfig.cpp:371] Emplace rocksdb option disable_auto_compactions=0
I20260122 14:19:24.417760 28029 RocksEngineConfig.cpp:371] Emplace rocksdb option max_bytes_for_level_base=268435456
I20260122 14:19:24.417780 28029 RocksEngineConfig.cpp:371] Emplace rocksdb option max_write_buffer_number=12
I20260122 14:19:24.417800 28029 RocksEngineConfig.cpp:371] Emplace rocksdb option write_buffer_size=67108864
I20260122 14:19:24.418622 28029 RocksEngineConfig.cpp:371] Emplace rocksdb option block_size=8192
I20260122 14:19:24.421654 28037 RocksEngineConfig.cpp:371] Emplace rocksdb option max_background_jobs=20
I20260122 14:19:24.421715 28037 RocksEngineConfig.cpp:371] Emplace rocksdb option max_subcompactions=20
I20260122 14:19:24.421840 27916 NebulaStore.cpp:90] Scan path "/data/disk03/nebula/ndata/nebula/37"
I20260122 14:19:24.422013 28037 RocksEngineConfig.cpp:371] Emplace rocksdb option disable_auto_compactions=0
I20260122 14:19:24.422040 28037 RocksEngineConfig.cpp:371] Emplace rocksdb option max_bytes_for_level_base=268435456
I20260122 14:19:24.422060 28037 RocksEngineConfig.cpp:371] Emplace rocksdb option max_write_buffer_number=12
I20260122 14:19:24.422096 28037 RocksEngineConfig.cpp:371] Emplace rocksdb option write_buffer_size=67108864
I20260122 14:19:24.422557 28037 RocksEngineConfig.cpp:371] Emplace rocksdb option block_size=8192
I20260122 14:19:24.423208 27916 NebulaStore.cpp:90] Scan path "/data/disk03/nebula/ndata/nebula/20"
I20260122 14:19:24.423380 28040 RocksEngineConfig.cpp:371] Emplace rocksdb option max_background_jobs=20
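
The only snapshot-related line in this startup log is the NebulaSnapshotManager message giving a per-part limit of 10485760. As a rough aid for judging how long a snapshot transfer could take, here is a small Python sketch. It assumes the value is in bytes per second (the log does not state the unit), and the 50 GiB part size is purely hypothetical, not a figure from this cluster.

```python
# Value taken from the NebulaSnapshotManager log line; the unit (bytes/second) is an assumption.
RATE_LIMIT = 10485760


def snapshot_seconds(part_bytes, rate=RATE_LIMIT):
    """Lower bound on the transfer time of one part's snapshot at the given per-part rate."""
    return part_bytes / rate


# Hypothetical part size for illustration only: 50 GiB
hours = snapshot_seconds(50 * 1024**3) / 3600
print(f"{hours:.2f} h per part at {RATE_LIMIT / 1024**2:.0f} MiB/s")
```

If the real parts are large, a single transfer under this limit could take hours, which would be consistent with network traffic staying high for a long time. That is only a plausibility check, not a diagnosis.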

After that, the log moves on to auto compact; right up to today, everything is auto compact or full compact entries.

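How about searching the logs on the other nodes for snapshot as well? Since the WAL was deleted and network-in traffic is high, the node is probably sending and receiving snapshots but, for some reason, never finishes receiving them. Is the data volume large?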
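
One way to run that search on each node is a small script like the Python sketch below. It is only an illustration: the log directory and the `*storaged*` file-name pattern are assumptions about the installation, and the helper names `snapshot_lines` and `scan_dir` are made up for this example.

```python
import re
from pathlib import Path

SNAPSHOT_RE = re.compile(r"snapshot", re.IGNORECASE)


def snapshot_lines(lines):
    """Yield (line number, text) for every line that mentions snapshot, ignoring case."""
    for lineno, line in enumerate(lines, start=1):
        if SNAPSHOT_RE.search(line):
            yield lineno, line.rstrip("\n")


def scan_dir(log_dir, pattern="*storaged*"):
    """Yield (file name, line number, text) for matching lines in every matching log file."""
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        with path.open(errors="replace") as fh:
            for lineno, text in snapshot_lines(fh):
                yield path.name, lineno, text


# Usage (the directory is a placeholder; point it at the node's actual log directory):
# for name, lineno, text in scan_dir("/path/to/nebula/logs"):
#     print(f"{name}:{lineno}: {text}")
```

Checking the WARNING and ERROR log files as well as INFO may be worthwhile, since a failed snapshot transfer would not necessarily be logged at INFO level.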