v1.0版本 storage 节点宕机

  • nebula 版本:v1.0
  • 部署方式 :分布式
  • 硬件信息
    • 磁盘:SSD
    • CPU、内存信息:mem 64g
  • 问题的具体描述, 昨晚十一点50多的时候,集群中一台 storage 不可用。这个时间点没有写入操作。storage 日志如下
Log file created at: 2021/01/26 23:53:51
Running on machine: nebula-server-2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0126 23:53:51.229916 14114 RocksEngine.cpp:103] Check failed: status.ok() IO error: While open directory: /mnt/data/data/storage/nebula/415/data: Too many open files

看了日志,发现好像是自动compact 的时候报的错,但是不知道为什么compact 时会报这个错?

compact的日志在哪里?
从上边的日志来看,是操作系统的句柄数不足了,ulimit -n 是多少?

一直是 65535。

日志

I0126 23:53:39.672562 14183 RaftexService.cpp:111] Start the Raftex Service successfully
I0126 23:53:39.672652 14114 NebulaStore.cpp:54] Scan the local path, and init the spaces_
I0126 23:53:39.674708 14114 NebulaStore.cpp:60] Scan path "/mnt/data/data/storage/283"
I0126 23:53:39.674811 14114 RocksEngineConfig.cpp:119] Emplace rocksdb option max_subcompactions=4
I0126 23:53:39.674818 14114 RocksEngineConfig.cpp:119] Emplace rocksdb option max_background_jobs=10
I0126 23:53:39.674844 14114 RocksEngineConfig.cpp:119] Emplace rocksdb option disable_auto_compactions=false
I0126 23:53:39.674886 14114 RocksEngineConfig.cpp:119] Emplace rocksdb option block_size=8192
I0126 23:53:39.720609 14114 RocksEngine.cpp:106] open rocksdb on /mnt/data/data/storage/nebula/283/data
I0126 23:53:39.720628 14114 NebulaStore.cpp:88] Load space 283 from disk
I0126 23:53:39.720649 14114 NebulaStore.cpp:122] Need to open 4 parts of space 283
I0126 23:53:40.402359 14181 FileBasedWal.cpp:58] [Port: 44501, Space: 283, Part: 7] lastLogId in wal is 932265, lastLogTerm is 24, path is /mnt/data/data/storage/nebula/283/wal/7/0000000000000570689.wal
I0126 23:53:40.402458 14181 RaftPart.cpp:293] [Port: 44501, Space: 283, Part: 7] There are 0 peer hosts, and total 1 copies. The quorum is 1, as learner 0, lastLogId 932265, lastLogTerm 24, committedLogId 932261, term 24
I0126 23:53:40.402578 14181 NebulaStore.cpp:152] Load part 283, 7 from disk
I0126 23:53:40.406361 14180 FileBasedWal.cpp:58] [Port: 44501, Space: 283, Part: 4] lastLogId in wal is 933195, lastLogTerm is 24, path is /mnt/data/data/storage/nebula/283/wal/4/0000000000000568419.wal
I0126 23:53:40.406432 14180 RaftPart.cpp:293] [Port: 44501, Space: 283, Part: 4] There are 0 peer hosts, and total 1 copies. The quorum is 1, as learner 0, lastLogId 933195, lastLogTerm 24, committedLogId 933191, term 24
I0126 23:53:40.406531 14180 NebulaStore.cpp:152] Load part 283, 4 from disk
I0126 23:53:40.407023 14182 FileBasedWal.cpp:58] [Port: 44501, Space: 283, Part: 10] lastLogId in wal is 931923, lastLogTerm is 24, path is /mnt/data/data/storage/nebula/283/wal/10/0000000000000570979.wal
I0126 23:53:40.407070 14182 RaftPart.cpp:293] [Port: 44501, Space: 283, Part: 10] There are 0 peer hosts, and total 1 copies. The quorum is 1, as learner 0, lastLogId 931923, lastLogTerm 24, committedLogId 931919, term 24
I0126 23:53:40.407166 14182 NebulaStore.cpp:152] Load part 283, 10 from disk
I0126 23:53:40.408329 14179 FileBasedWal.cpp:58] [Port: 44501, Space: 283, Part: 1] lastLogId in wal is 931823, lastLogTerm is 24, path is /mnt/data/data/storage/nebula/283/wal/1/0000000000000568523.wal
I0126 23:53:40.408396 14179 RaftPart.cpp:293] [Port: 44501, Space: 283, Part: 1] There are 0 peer hosts, and total 1 copies. The quorum is 1, as learner 0, lastLogId 931823, lastLogTerm 24, committedLogId 931819, term 24
I0126 23:53:40.408497 14179 NebulaStore.cpp:152] Load part 283, 1 from disk
I0126 23:53:40.408514 14114 NebulaStore.cpp:167] Load space 283 complete
I0126 23:53:40.408525 14114 NebulaStore.cpp:60] Scan path "/mnt/data/data/storage/424"
I0126 23:53:40.408555 14114 RocksEngineConfig.cpp:119] Emplace rocksdb option max_subcompactions=4
I0126 23:53:40.408561 14114 RocksEngineConfig.cpp:119] Emplace rocksdb option max_background_jobs=10
I0126 23:53:40.408572 14114 RocksEngineConfig.cpp:119] Emplace rocksdb option disable_auto_compactions=false
I0126 23:53:40.408592 14114 RocksEngineConfig.cpp:119] Emplace rocksdb option block_size=8192
I0126 23:53:40.567490 14114 RocksEngine.cpp:106] open rocksdb on /mnt/data/data/storage/nebula/424/data
I0126 23:53:40.567546 14114 NebulaStore.cpp:88] Load space 424 from disk
I0126 23:53:40.567581 14114 NebulaStore.cpp:122] Need to open 4 parts of space 424
I0126 23:53:40.692576 14180 FileBasedWal.cpp:58] [Port: 44501, Space: 424, Part: 7] lastLogId in wal is 1398885, lastLogTerm is 6, path is /mnt/data/data/storage/nebula/424/wal/7/0000000000001334637.wal
I0126 23:53:40.693524 14180 RaftPart.cpp:293] [Port: 44501, Space: 424, Part: 7] There are 0 peer hosts, and total 1 copies. The quorum is 1, as learner 0, lastLogId 1398885, lastLogTerm 6, committedLogId 1398881, term 6
I0126 23:53:40.697499 14180 NebulaStore.cpp:152] Load part 424, 7 from disk
I0126 23:53:40.853278 14181 FileBasedWal.cpp:58] [Port: 44501, Space: 424, Part: 10] lastLogId in wal is 1398886, lastLogTerm is 6, path is /mnt/data/data/storage/nebula/424/wal/10/0000000000001250328.wal
I0126 23:53:40.853521 14181 RaftPart.cpp:293] [Port: 44501, Space: 424, Part: 10] There are 0 peer hosts, and total 1 copies. The quorum is 1, as learner 0, lastLogId 1398886, lastLogTerm 6, committedLogId 1398881, term 6
I0126 23:53:40.859114 14179 FileBasedWal.cpp:58] [Port: 44501, Space: 424, Part: 4] lastLogId in wal is 1398944, lastLogTerm is 6, path is /mnt/data/data/storage/nebula/424/wal/4/0000000000001247239.wal
I0126 23:53:40.859491 14181 NebulaStore.cpp:152] Load part 424, 10 from disk
I0126 23:53:40.865520 14179 RaftPart.cpp:293] [Port: 44501, Space: 424, Part: 4] There are 0 peer hosts, and total 1 copies. The quorum is 1, as learner 0, lastLogId 1398944, lastLogTerm 6, committedLogId 1398939, term 6
I0126 23:53:40.875488 14179 NebulaStore.cpp:152] Load part 424, 4 from disk
I0126 23:53:41.111096 14182 FileBasedWal.cpp:58] [Port: 44501, Space: 424, Part: 1] lastLogId in wal is 1398653, lastLogTerm is 6, path is /mnt/data/data/storage/nebula/424/wal/1/0000000000001110899.wal
I0126 23:53:41.111519 14182 RaftPart.cpp:293] [Port: 44501, Space: 424, Part: 1] There are 0 peer hosts, and total 1 copies. The quorum is 1, as learner 0, lastLogId 1398653, lastLogTerm 6, committedLogId 1398649, term 6
I0126 23:53:41.114487 14182 NebulaStore.cpp:152] Load part 424, 1 from disk
I0126 23:53:41.114513 14114 NebulaStore.cpp:167] Load space 424 complete
I0126 23:53:41.119488 14114 NebulaStore.cpp:60] Scan path "/mnt/data/data/storage/415"
I0126 23:53:41.122515 14114 RocksEngineConfig.cpp:119] Emplace rocksdb option max_subcompactions=4
I0126 23:53:41.123481 14114 RocksEngineConfig.cpp:119] Emplace rocksdb option max_background_jobs=10
I0126 23:53:41.126490 14114 RocksEngineConfig.cpp:119] Emplace rocksdb option disable_auto_compactions=false
I0126 23:53:41.129500 14114 RocksEngineConfig.cpp:119] Emplace rocksdb option block_size=8192
F0126 23:53:51.229916 14114 RocksEngine.cpp:103] Check failed: status.ok() IO error: While open directory: /mnt/data/data/storage/nebula/415/data: Too many open files
                                                                                                                                                                                            

当出现系统问题时,目前nebula会停止服务,也是为了保证数据安全。

好的,明白了

为什么系统会把句柄数占满,需要分析一下。
尝试通过lsof命令分析一下系统中进程打开的文件数?

这块我们这边先对文件句柄数加个监控,然后排查一下。

看了系统的文件句柄数监控,那个时间点只有15k的数量,没有达到65536这个值。

nebula-storage使用的最大文件描述符如下:

问下,对于系统级别的句柄数和进程级别的句柄数是怎么监控的?有分析过程吗?

系统句柄数:prometheus 中的 node_filefd_allocated 这条命令,然后找对应的时间点
进程:查看/proc/pid/limits

刚才又出现了一次这种情况,一台storaged 宕机,报错信息和之前一样。

附上日志

Log file created at: 2021/01/28 17:22:57
Running on machine: nebula-server-1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F0128 17:22:57.146863 11034 RocksEngine.cpp:103] Check failed: status.ok() IO error: While open directory: /mnt/data/data/storage/nebula/357/data: Too many open files

使用lsof 命令查看

[root@nebula-server-1 logs]# lsof -p 13075  | wc -l
4307
[root@nebula-server-1 logs]# ps aux | grep storaged
root     13075  186 57.7 59934344 37428728 ?   Ssl  17:29  55:00 /usr/local/nebula/bin/nebula-storaged --flagfile /usr/local/nebula/etc/nebula-storaged.conf
root     15073  0.0  0.0 112712   964 pts/2    S+   17:59   0:00 grep --color=auto storaged
[root@nebula-server-1 logs]# lsof -n |awk '{print $2}'|sort|uniq -c |sort -nr | head
 555474 13075
  47840 25718
  23256 12992
   4280 13551
   4108 7859
   1652 1160
   1495 6760
    630 1157
    624 1570
    455 1118


这是系统今天下午的文件描述符数统计

先把ulimit 改了吧,挂进程还是小心些。

这个问题的根本原因还是因为 rocksdb::DB::Open 时系统fd不足导致 (RocksEngine.cpp:103).

有什么解决办法吗?

请问一下ulimit 是改大还是该小?现在ulimit -n 是65535这个值

改大一点

行,先把storage进程的最大打开文件描述符调大监测看看。