The Storage service fails to start, and when it occasionally does start, it crashes. Please help!

NebulaGraph version: v3.2.0
Deployment: distributed
Installation: RPM
Production environment: yes
Hardware info
Disk (SSD recommended): SSD


CPU and memory
32 cores / 128 GB RAM each, five servers in total:
202 metad:9559 graphd:9669 storaged:9779
203 metad:9559 graphd:9669 storaged:9779
204 metad:9559 graphd:9669 storaged:9779
205 graphd:9669 storaged:9779
206 graphd:9669 storaged:9779
Problem description: the Storage service fails to start, and when it occasionally does start, it crashes.
storaged-stderr.log

*** Signal 6 (SIGABRT) (0xa2e) received by PID 2606 (pthread TID 0x7feece3ff700) (linux TID 2636) (maybe from PID 2606, UID 0) (code: -6), stack trace: ***
(error retrieving stack trace)
ExceptionHandler::GenerateDump sys_pipe failed:Too many open files
ExceptionHandler::WaitForContinueSignal sys_read failed:ExceptionHandler::SendContinueSignalToChild sys_write failed:Bad file descriptor
Bad file descriptor
*** Aborted at 1666921888 (Unix time, try 'date -d @1666921888') ***
*** Signal 6 (SIGABRT) (0xa2e) received by PID 2606 (pthread TID 0x7feecd7ff700) (linux TID 2638) (maybe from PID 2606, UID 0) (code: -6), stack trace: ***
(error retrieving stack trace)
*** Aborted at 1666921888 (Unix time, try 'date -d @1666921888') ***
*** Signal 6 (SIGABRT) (0xa2e) received by PID 2606 (pthread TID 0x7feed17ff700) (linux TID 2628) (maybe from PID 2606, UID 0) (code: -6), stack trace: ***
(error retrieving stack trace)
Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20221028-102847.23727!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20221028-102847.23727!F20221028 10:28:47.280041 23744 RaftPart.cpp:1073] [Port: 9780, Space: 120, Part: 32] Failed to commit logs
*** Check failure stack trace: ***
*** Aborted at 1666924127 (Unix time, try 'date -d @1666924127') ***
*** Signal 6 (SIGABRT) (0x5caf) received by PID 23727 (pthread TID 0x7fb910bff700) (linux TID 23744) (maybe from PID 23727, UID 0) (code: -6), stack trace: ***
(error retrieving stack trace)
ExceptionHandler::GenerateDump sys_pipe failed:Too many open files
ExceptionHandler::WaitForContinueSignal sys_read failed:ExceptionHandler::SendContinueSignalToChild sys_write failed:Bad file descriptor
Bad file descriptor
Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20221101-033954.16674!F20221101 03:39:54.263427 16687 RaftPart.cpp:1073] [Port: 9780, Space: 120, Part: 4] Failed to commit logs
*** Check failure stack trace: ***
F20221101 03:39:54.265478 16710 RaftPart.cpp:1073] [Port: 9780, Space: 120, Part: 2] Failed to commit logsF20221101 03:39:54.265703 16704 RaftPart.cpp:1073] [Port: 9780, Space: 120, Part: 17] Failed to commit logs
*** Check failure stack trace: ***
*** Aborted at 1667245194 (Unix time, try 'date -d @1667245194') ***
F20221101 03:39:54.265478 16710 RaftPart.cpp:1073] [Port: 9780, Space: 120, Part: 2] Failed to commit logsF20221101 03:39:54.265703 16704 RaftPart.cpp:1073] [Port: 9780, Space: 120, Part: 17] Failed to commit logs
*** Check failure stack trace: ***
*** Signal 6 (SIGABRT) (0x4122) received by PID 16674 (pthread TID 0x7f952acfe700) (linux TID 16687) (maybe from PID 16674, UID 0) (code: -6), stack trace: ***
(error retrieving stack trace)
ExceptionHandler::GenerateDump sys_pipe failed:Too many open files
ExceptionHandler::WaitForContinueSignal sys_read failed:ExceptionHandler::SendContinueSignalToChild sys_write failed:Bad file descriptorBad file descriptor

*** Aborted at 1667245194 (Unix time, try 'date -d @1667245194') ***
*** Signal 6 (SIGABRT) (0x4122) received by PID 16674 (pthread TID 0x7f9521dff700) (linux TID 16710) (maybe from PID 16674, UID 0) (code: -6), stack trace: ***
(error retrieving stack trace)
*** Aborted at 1667245194 (Unix time, try 'date -d @1667245194') ***
*** Signal 6 (SIGABRT) (0x4122) received by PID 16674 (pthread TID 0x7f95241ff700) (linux TID 16704) (maybe from PID 16674, UID 0) (code: -6), stack trace: ***
(error retrieving stack trace)
Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20221101-103402.9782!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20221101-103402.9782!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20221101-103402.9782!F20221101 10:34:02.414801  9797 RaftPart.cpp:1073] [Port: 9780, Space: 120, Part: 4] Failed to commit logs
*** Check failure stack trace: ***
*** Aborted at 1667270042 (Unix time, try 'date -d @1667270042') ***
*** Signal 6 (SIGABRT) (0x2636) received by PID 9782 (pthread TID 0x7f7c135ff700) (linux TID 9797) (maybe from PID 9782, UID 0) (code: -6), stack trace: ***
(error retrieving stack trace)
ExceptionHandler::GenerateDump sys_pipe failed:Too many open files
ExceptionHandler::WaitForContinueSignal sys_read failed:ExceptionHandler::SendContinueSignalToChild sys_write failed:Bad file descriptorBad file descriptor

F20221101 10:34:02.691797  9818 RaftPart.cpp:1073] [Port: 9780, Space: 120, Part: 2] Failed to commit logs
*** Check failure stack trace: ***
Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20221101-105248.15484!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20221101-105248.15484!Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20221101-105249.15484!F20221101 10:52:49.521068 15507 RaftPart.cpp:1073] [Port: 9780, Space: 120, Part: 2] Failed to commit logs
*** Check failure stack trace: ***
F20221101 10:52:49.521129 15517 RaftPart.cpp:1073] [Port: 9780, Space: 120, Part: 4] Failed to commit logs
*** Check failure stack trace: ***
*** Aborted at 1667271169 (Unix time, try 'date -d @1667271169') ***
*** Signal 6 (SIGABRT) (0x3c7c) received by PID 15484 (pthread TID 0x7f03c4dff700) (linux TID 15507) (maybe from PID 15484, UID 0) (code: -6), stack trace: ***
(error retrieving stack trace)
ExceptionHandler::GenerateDump sys_pipe failed:Too many open files
ExceptionHandler::WaitForContinueSignal sys_read failed:ExceptionHandler::SendContinueSignalToChild sys_write failed:Bad file descriptorBad file descriptor

*** Aborted at 1667271169 (Unix time, try 'date -d @1667271169') ***
*** Signal 6 (SIGABRT) (0x3c7c) received by PID 15484 (pthread TID 0x7f03c12ff700) (linux TID 15517) (maybe from PID 15484, UID 0) (code: -6), stack trace: ***
(error retrieving stack trace)

nebula-storaged.INFO

I20221104 11:51:12.150789 18255 NebulaStore.cpp:430] [Space: 120, Part: 34] has existed!
I20221104 11:51:12.150794 18255 NebulaStore.cpp:430] [Space: 120, Part: 37] has existed!
I20221104 11:51:12.150804 18255 NebulaStore.cpp:430] [Space: 120, Part: 39] has existed!
I20221104 11:51:12.150810 18255 NebulaStore.cpp:430] [Space: 120, Part: 42] has existed!
I20221104 11:51:12.150815 18255 NebulaStore.cpp:430] [Space: 120, Part: 44] has existed!
I20221104 11:51:12.150818 18255 NebulaStore.cpp:430] [Space: 120, Part: 47] has existed!
I20221104 11:51:12.150823 18255 NebulaStore.cpp:430] [Space: 120, Part: 49] has existed!
I20221104 11:51:12.150827 18255 NebulaStore.cpp:430] [Space: 120, Part: 52] has existed!
I20221104 11:51:12.150832 18255 NebulaStore.cpp:430] [Space: 120, Part: 54] has existed!
I20221104 11:51:12.150836 18255 NebulaStore.cpp:430] [Space: 120, Part: 57] has existed!
I20221104 11:51:12.150840 18255 NebulaStore.cpp:430] [Space: 120, Part: 59] has existed!
I20221104 11:51:12.150859 18255 NebulaStore.cpp:78] Register handler...
I20221104 11:51:12.150868 18255 StorageServer.cpp:228] Init LogMonitor
I20221104 11:51:12.172487 18255 StorageServer.cpp:96] Starting Storage HTTP Service
I20221104 11:51:12.174451 18255 StorageServer.cpp:100] Http Thread Pool started
I20221104 11:51:12.194942 22044 WebService.cpp:124] Web service started on HTTP[19779]
I20221104 11:51:12.194989 18255 TransactionManager.cpp:24] TransactionManager ctor()
I20221104 11:51:17.803148 18322 MetaClient.cpp:3094] Load leader of "172.17.126.202":9779 in 1 space
I20221104 11:51:17.803186 18322 MetaClient.cpp:3094] Load leader of "172.17.126.203":9779 in 1 space
I20221104 11:51:17.803193 18322 MetaClient.cpp:3094] Load leader of "172.17.126.204":9779 in 1 space
I20221104 11:51:17.803201 18322 MetaClient.cpp:3094] Load leader of "172.17.126.205":9779 in 1 space
I20221104 11:51:17.803210 18322 MetaClient.cpp:3094] Load leader of "172.17.126.206":9779 in 1 space
I20221104 11:51:17.803215 18322 MetaClient.cpp:3100] Load leader ok
I20221104 11:51:27.813603 18322 MetaClient.cpp:3094] Load leader of "172.17.126.202":9779 in 1 space
I20221104 11:51:27.813668 18322 MetaClient.cpp:3094] Load leader of "172.17.126.203":9779 in 1 space
I20221104 11:51:27.813681 18322 MetaClient.cpp:3094] Load leader of "172.17.126.204":9779 in 1 space
I20221104 11:51:27.813694 18322 MetaClient.cpp:3094] Load leader of "172.17.126.205":9779 in 1 space
I20221104 11:51:27.813707 18322 MetaClient.cpp:3094] Load leader of "172.17.126.206":9779 in 1 space
I20221104 11:51:27.813715 18322 MetaClient.cpp:3100] Load leader ok
I20221104 11:51:48.197736 18322 MetaClient.cpp:3094] Load leader of "172.17.126.202":9779 in 1 space
I20221104 11:51:48.198194 18322 MetaClient.cpp:3094] Load leader of "172.17.126.203":9779 in 1 space
I20221104 11:51:48.198215 18322 MetaClient.cpp:3094] Load leader of "172.17.126.204":9779 in 1 space
I20221104 11:51:48.198225 18322 MetaClient.cpp:3094] Load leader of "172.17.126.205":9779 in 1 space
I20221104 11:51:48.198231 18322 MetaClient.cpp:3094] Load leader of "172.17.126.206":9779 in 1 space
I20221104 11:51:48.198236 18322 MetaClient.cpp:3100] Load leader ok

I tried raising the ulimit to 500000, but it still fails to start.

Too many files are open. Adjust the system limit, e.g. raise the maximum number of open files:
ulimit -n 1000000
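Note that `ulimit -n` only changes the current shell session. To make the limit persistent and apply it to the service itself, the system config also needs updating. A sketch, assuming a typical Linux host; the unit name and paths are illustrative, adjust to your distro and to how storaged is actually launched:

```shell
# Persist the limit for login sessions (PAM reads /etc/security/limits.conf):
#   *  soft  nofile  1000000
#   *  hard  nofile  1000000
#
# If storaged runs under systemd, the unit needs its own limit
# (illustrative unit name):
#   systemctl edit nebula-storaged   # add: [Service] LimitNOFILE=1000000
#
# Verify what the running process actually got:
pid=$(pgrep -f nebula-storaged | head -n1)
[ -n "$pid" ] && grep 'open files' "/proc/$pid/limits" || echo "nebula-storaged not running"
```

The `/proc/<pid>/limits` check is the important part: a raised shell ulimit does not retroactively apply to an already-running daemon.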

OK, I'll try that. Right now RocksDB's L0 has around 100k files, and since storaged won't start, compaction can't run.

Then raise it further, to 1,000,000 for example.


Here's what I found: during startup, storaged gets killed by the system because its memory usage reaches 110 GB, and the machine only has 128 GB.

I've already raised the limit to 1,000,000, but the process still gets killed by the system and won't start.
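The kill can be confirmed as a kernel OOM kill (rather than a crash) from the kernel log, since storaged's own logs stop at the moment of death. A sketch; the `journalctl` variant applies to systemd hosts:

```shell
# Look for kernel OOM-killer records naming the storaged process.
dmesg 2>/dev/null | grep -iE 'out of memory|oom.kill' | tail -n 20 || true
# On systemd hosts, the kernel journal keeps the same records:
# journalctl -k | grep -i 'out of memory'
```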

########## basics ##########
# Whether to run as a daemon process
--daemonize=true
# The file to host the process id
--pid_file=pids/nebula-storaged.pid
# Whether to use the configuration obtained from the configuration file
--local_config=true

########## logging ##########
# The directory to host logging files
--log_dir=logs
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=0
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=3
# Maximum seconds to buffer the log messages
--logbufsecs=0
# Whether to redirect stdout and stderr to separate output files
--redirect_stdout=true
# Destination filename of stdout and stderr, which will also reside in log_dir.
--stdout_log_file=storaged-stdout.log
--stderr_log_file=storaged-stderr.log
# Copy log messages at or above this level to stderr in addition to logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are 0, 1, 2, and 3, respectively.
--stderrthreshold=2

########## networking ##########
# Comma separated Meta server addresses
--meta_server_addrs=172.17.126.202:9559,172.17.126.203:9559,172.17.126.204:9559
# Local IP used to identify the nebula-storaged process.
# Change it to an address other than loopback if the service is distributed or
# will be accessed remotely.
--local_ip=172.17.126.202
# Storage daemon listening port
--port=9779
# HTTP service ip
--ws_ip=0.0.0.0
# HTTP service port
--ws_http_port=19779
# HTTP2 service port
--ws_h2_port=19780
# heartbeat with meta service
--heartbeat_interval_secs=10

######### Raft #########
# Raft election timeout
--raft_heartbeat_interval_secs=30
# RPC timeout for raft client (ms)
--raft_rpc_timeout_ms=500
## recycle Raft WAL
--wal_ttl=14400

--storage_client_timeout_ms=6000000

########## Disk ##########
# Root data path. Split by comma. e.g. --data_path=/disk1/path1/,/disk2/path2/
# One path per Rocksdb instance.
--data_path=/data/nebula/data/storage/,/datanebula02/nebula/data/storage/,/datanebula03/nebula/data/storage/

# Minimum reserved bytes of each data path
--minimum_reserved_bytes=268435456

# The default reserved bytes for one batch operation
--rocksdb_batch_size=4096
--auto_remove_invalid_space=true
# The default block cache size used in BlockBasedTable.
# The unit is MB.
--rocksdb_block_cache=4
# The type of storage engine, `rocksdb', `memory', etc.
--engine_type=rocksdb
--rocksdb_compression=lz4

# Set different compressions for different levels
# For example, if --rocksdb_compression is snappy,
# "no:no:lz4:lz4::zstd" is identical to "no:no:lz4:lz4:snappy:zstd:snappy"
# In order to disable compression for level 0/1, set it to "no:no"
--rocksdb_compression_per_level=

# Whether or not to enable rocksdb's statistics, disabled by default
--enable_rocksdb_statistics=false

# Statslevel used by rocksdb to collection statistics, optional values are
#   * kExceptHistogramOrTimers, disable timer stats, and skip histogram stats
#   * kExceptTimers, Skip timer stats
#   * kExceptDetailedTimers, Collect all stats except time inside mutex lock AND time spent on compression.
#   * kExceptTimeForMutex, Collect all stats except the counters requiring to get time inside the mutex lock.
#   * kAll, Collect all stats
--rocksdb_stats_level=kExceptHistogramOrTimers

# Whether or not to enable rocksdb's prefix bloom filter, enabled by default.
--enable_rocksdb_prefix_filtering=true
# Whether or not to enable rocksdb's whole key bloom filter, disabled by default.
--enable_rocksdb_whole_key_filtering=false

############## rocksdb Options ##############
# rocksdb DBOptions in json, each name and value of option is a string, given as "option_name":"option_value" separated by comma
--rocksdb_db_options={"max_subcompactions":"16","max_background_jobs":"16"}
# rocksdb ColumnFamilyOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_column_family_options={"write_buffer_size":"134217728","max_write_buffer_number":"10","target_file_size_base":"67108864","level0_file_num_compaction_trigger":"16","disable_auto_compactions":"true","level0_slowdown_writes_trigger":"40","level0_stop_writes_trigger":"50","max_bytes_for_level_base":"1073741824","max_bytes_for_level_multiplier":"16"}
# rocksdb BlockBasedTableOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_block_based_table_options={"block_size":"32768","cache_index_and_filter_blocks":"false"}
#--rocksdb_column_family_options={"write_buffer_size":"67108864","max_write_buffer_number":"4","max_bytes_for_level_base":"268435456"}
#--rocksdb_block_based_table_options={"block_size":"8192"}

--timezone_name=UTC+08:00
#--num_io_threads=32
#--num_worker_threads=32
#--max_concurrent_subtasks=32
#--snapshot_part_rate_limit=52428800
#--snapshot_batch_size=10485760
#--rebuild_index_part_rate_limit=20971520
#--rebuild_index_batch_size=5242880
#--max_edge_returned_per_vertex=100000000
#--rocksdb_rate_limit=50
--enable_partitioned_index_filter=true

This is my storaged configuration.
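For reference, a back-of-envelope bound on the memtable memory this config implies. This is only a sketch: actual usage also includes block cache, index/filter blocks, and compaction working memory, none of which are counted here.

```shell
# From --rocksdb_column_family_options and the three --data_path entries above:
write_buffer_size=134217728      # 128 MiB per memtable
max_write_buffer_number=10       # up to 10 memtables per instance
instances=3                      # one RocksDB instance per data path
memtable_bytes=$((write_buffer_size * max_write_buffer_number * instances))
echo "worst-case memtable memory: ${memtable_bytes} bytes (~$((memtable_bytes / 1073741824)) GiB)"
```

Memtables alone come to only a few GiB here, so a 110 GB footprint must be coming from elsewhere (e.g. index/filter blocks for a very large number of SST files).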

It's probably an OOM during startup. Reduce those parameters in your conf, or just set disable_auto_compactions to true.

disable_auto_compactions is already true.
I'll also lower the memory-related parameters. Do you think 256 GB would be enough?

You shouldn't need that much memory.

Could you take a look for me? Even with 256 GB now, it still OOMs during major compaction, and the storaged process dies during startup.
Please check whether something in my configuration is wrong.


Please post your storaged conf.

I20221106 23:34:10.244091 32146 RocksEngineConfig.cpp:366] Emplace rocksdb option max_bytes_for_level_multiplier=16
I20221106 23:34:10.244108 32146 RocksEngineConfig.cpp:366] Emplace rocksdb option max_bytes_for_level_base=1073741824

Those values are set far too large. Do you know what they mean? Why set them that high?

Change these two variables back to the defaults.

max_bytes_for_level_multiplier=8
max_bytes_for_level_base=536870912
Would these values be OK?
The second parameter seems to be the total size of L1.

I suggest using the defaults unless you know exactly what these mean. I suspect your problem comes from exactly these settings.

OK, thanks. Any other advice?

After changing those parameters, how did it go?

Unfortunately, still no luck. Each RocksDB instance had around 100k L0 files, and each storaged runs three RocksDB instances, so in the end I had no choice but to rebuild the space, this time with automatic minor compaction enabled while writing.
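For anyone hitting the same problem, the SST backlog can be sized from disk before attempting a restart. A sketch that simply counts SST files per data path; a per-level breakdown would need RocksDB's `ldb` tool, and the paths below are the ones from the --data_path setting above, so adjust them to your deployment:

```shell
count_sst() {
  # Count SST files under one data path (rough proxy for compaction backlog)
  find "$1" -name '*.sst' 2>/dev/null | wc -l
}

# Data paths taken from --data_path in the config above; adjust as needed.
for p in /data/nebula/data/storage /datanebula02/nebula/data/storage /datanebula03/nebula/data/storage; do
  echo "$p: $(count_sst "$p") sst files"
done
```

With level0_stop_writes_trigger at 50, a count in the tens of thousands per instance means compaction has an enormous backlog to clear before writes can proceed normally.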