- NebulaGraph version: 3.8.0
- Deployment: distributed, 6 machines in total, running 6 graphd and 6 storaged processes; metad is deployed on 3 of the machines
- Installation method: RPM
- In production: Y
- Hardware
  - Disk: HDD, 12 HDDs per machine, 8 TB each
  - CPU / memory: 48 cores and 376 GB RAM per machine
- Problem description:
Full compaction causes data-import failures:
If a full compaction runs at the same time as a nebula-exchange import, there is a very high probability of read timeouts and the import job fails.
I currently have two ideas:
1. Tune parameters to speed up the full compaction so that it finishes during the night.
2. Tune parameters to throttle the full compaction so that it does not interfere with data processing, e.g. --rocksdb_rate_limit=10 (see the config sketch after this list).
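A minimal sketch of what idea 2 might look like in storaged.conf. The values are untested guesses, and I am assuming --rocksdb_rate_limit is expressed in MB/s per RocksDB instance; please correct me if that is wrong:

```
# Sketch only: untested values for throttling the full compaction (idea 2).
# Assumption: --rocksdb_rate_limit limits flush/compaction writes in MB/s.
--rocksdb_rate_limit=10
# Keep compaction concurrency moderate instead of maxing it out.
--rocksdb_db_options={"max_subcompactions":"4","max_background_jobs":"8"}
```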
**I tried the first idea and increased the compaction thread counts to speed up compaction: "max_subcompactions": 3 -> 64, "max_background_jobs": 5 -> 64, "max_background_flushes": 2.
However, I found that increasing the compaction thread counts did not noticeably speed up the full compaction; instead, it increased the probability of import failures.**
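For what it is worth, this is how I check that the running storaged actually picked up the enlarged thread pools. As far as I know the ws HTTP port exposes a /flags endpoint (19779 in my configuration below), but treat the exact output format as an assumption:

```
# Check the RocksDB-related gflags the running storaged has loaded.
# 19779 is the ws_http_port from the storaged.conf further down.
curl -s "http://127.0.0.1:19779/flags" | grep -E "rocksdb_db_options|rocksdb_rate_limit"
```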
So I would like to ask everyone for advice and to exchange experiences on full compaction and on tuning RocksDB parameters.
Current situation:
The daily nebula-exchange import takes 10~17 hours (the data volume is huge). If I speed the import up (the Spark job currently imports with 32 partitions, i.e. a concurrency of 32; if I raised it to 200), storaged's Raft replication cannot keep up, the Raft buffer fills up, and the import fails with errors.
During a full compaction, iostat -x 1 on all machines shows that every machine has one HDD at 100% I/O utilization.
A scheduled production import starts at 5 a.m. every day; if a full compaction is still running at that time, there is a very high probability of read timeouts and the import job fails.
If a full compaction is run every day, it is expected to take about 10 hours to complete.
If it is run only once every 4 days, it is expected to take about 37 hours.
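For context, this is roughly how the full compaction is triggered and monitored today, using nGQL job statements (the space name is a placeholder):

```
# Run inside the target graph space; `my_space` is a placeholder name.
USE my_space;
SUBMIT JOB COMPACT;   # start a full compaction for this space
SHOW JOBS;            # list jobs and their status
SHOW JOB <job_id>;    # check the progress of one job by its id
```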
- graphd log output when the error occurred:
E20250609 15:44:57.242136 45492 StorageClientBase-inl.h:224] Request to "[IP redacted]":9779 time out: TTransportException: Timed out
E20250609 15:44:57.242307 45425 StorageClientBase-inl.h:143] There some RPC errors: RPC failure in StorageClient with timeout: TTransportException: Timed out
E20250609 15:44:57.242854 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 111
E20250609 15:44:57.242880 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 77
E20250609 15:44:57.242892 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 65
E20250609 15:44:57.242902 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 31
E20250609 15:44:57.242913 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 123
E20250609 15:44:57.242923 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 39
E20250609 15:44:57.242933 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 29
E20250609 15:44:57.242944 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 59
E20250609 15:44:57.242954 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 85
E20250609 15:44:57.242964 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 119
E20250609 15:44:57.242972 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 95
E20250609 15:44:57.242982 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 27
E20250609 15:44:57.242992 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 15
E20250609 15:44:57.243010 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 71
E20250609 15:44:57.243021 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 91
E20250609 15:44:57.243031 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 1
E20250609 15:44:57.243039 45425 StorageAccessExecutor.h:47] InsertVerticesExecutor failed, error E_RPC_FAILURE, part 35
E20250609 15:44:57.243108 45427 QueryInstance.cpp:151] Storage Error: RPC failure, probably timeout., query: INSERT VERTEX xxxx...
- Spark log output when the error occurred:
25/06/09 15:42:25 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) ([IP redacted]:38820) with ID 31
25/06/09 15:42:25 INFO TaskSetManager: Starting task 22.3 in stage 1.0 (TID 152, dn65142.hadoop65.unicom, executor 31, partition 22, PROCESS_LOCAL, 7780 bytes)
25/06/09 15:42:25 INFO BlockManagerMasterEndpoint: Registering block manager dn65142.hadoop65.unicom:43889 with 10.5 GB RAM, BlockManagerId(31, dn65142.hadoop65.unicom, 43889, None)
25/06/09 15:42:26 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on dn65142.hadoop65.unicom:43889 (size: 17.5 KB, free: 10.5 GB)
25/06/09 15:42:27 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to [IP redacted]:38820
25/06/09 15:42:29 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) ([IP redacted]:33408) with ID 30
25/06/09 15:42:29 INFO TaskSetManager: Starting task 30.0 in stage 1.0 (TID 153, dn6614.hadoop66.unicom, executor 30, partition 30, PROCESS_LOCAL, 7780 bytes)
25/06/09 15:42:30 INFO BlockManagerMasterEndpoint: Registering block manager dn6614.hadoop66.unicom:37738 with 10.5 GB RAM, BlockManagerId(30, dn6614.hadoop66.unicom, 37738, None)
25/06/09 15:42:31 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on dn6614.hadoop66.unicom:37738 (size: 17.5 KB, free: 10.5 GB)
25/06/09 15:42:32 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to [IP redacted].66.14:33408
25/06/09 15:42:37 INFO YarnSchedulerBackend$YarnDriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) ([IP redacted].66.8:52956) with ID 32
25/06/09 15:42:37 INFO TaskSetManager: Starting task 31.0 in stage 1.0 (TID 154, dn6608.hadoop66.unicom, executor 32, partition 31, PROCESS_LOCAL, 7780 bytes)
25/06/09 15:42:37 INFO BlockManagerMasterEndpoint: Registering block manager dn6608.hadoop66.unicom:36078 with 10.5 GB RAM, BlockManagerId(32, dn6608.hadoop66.unicom, 36078, None)
25/06/09 15:42:38 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on dn6608.hadoop66.unicom:36078 (size: 17.5 KB, free: 10.5 GB)
25/06/09 15:42:39 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to [IP redacted].66.8:52956
25/06/09 15:42:48 WARN TaskSetManager: Lost task 21.3 in stage 1.0 (TID 128, dn6617.hadoop66.unicom, executor 6): com.vesoft.nebula.client.graph.exception.IOErrorException: java.net.SocketTimeoutException: Read timed out
at com.vesoft.nebula.client.graph.net.SyncConnection.executeWithParameter(SyncConnection.java:189)
at com.vesoft.nebula.client.graph.net.Session.executeWithParameter(Session.java:113)
at com.vesoft.nebula.client.graph.net.Session.execute(Session.java:78)
at com.vesoft.exchange.common.GraphProvider.submit(GraphProvider.scala:78)
at com.vesoft.exchange.common.writer.NebulaGraphClientWriter.writeVertices(ServerBaseWriter.scala:138)
at com.vesoft.nebula.exchange.processor.VerticesProcessor$$anonfun$com$vesoft$nebula$exchange$processor$VerticesProcessor$$processEachPartition$1.apply(VerticesProcessor.scala:79)
at com.vesoft.nebula.exchange.processor.VerticesProcessor$$anonfun$com$vesoft$nebula$exchange$processor$VerticesProcessor$$processEachPartition$1.apply(VerticesProcessor.scala:77)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at com.vesoft.nebula.exchange.processor.VerticesProcessor.com$vesoft$nebula$exchange$processor$VerticesProcessor$$processEachPartition(VerticesProcessor.scala:77)
at com.vesoft.nebula.exchange.processor.VerticesProcessor$$anonfun$process$3.apply(VerticesProcessor.scala:180)
at com.vesoft.nebula.exchange.processor.VerticesProcessor$$anonfun$process$3.apply(VerticesProcessor.scala:180)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
25/06/09 15:42:48 ERROR TaskSetManager: Task 21 in stage 1.0 failed 4 times; aborting job
25/06/09 15:42:48 INFO YarnScheduler: Cancelling stage 1
25/06/09 15:42:48 INFO YarnScheduler: Killing all running tasks in stage 1: Stage cancelled
25/06/09 15:42:48 INFO YarnScheduler: Stage 1 was cancelled
25/06/09 15:42:48 INFO DAGScheduler: ResultStage 1 (foreachPartition at VerticesProcessor.scala:180) failed in 1435.179 s due to Job aborted due to stage failure: Task 21 in stage 1.0 failed 4 times, most recent failure: Lost task 21.3 in stage 1.0 (TID 128, dn6617.hadoop66.unicom, executor 6): com.vesoft.nebula.client.graph.exception.IOErrorException: java.net.SocketTimeoutException: Read timed out
storaged.conf configuration file (suggestions for changing it are very welcome!):
--timezone_name=UTC+08:00
########## basics ##########
# Whether to run as a daemon process
--daemonize=true
# The file to host the process id
--pid_file=pids/nebula-storaged-listener.pid
# Whether to use the configuration obtained from the configuration file
--local_config=true
########## logging ##########
# The directory to host logging files
--log_dir=logs
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=0
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=0
# Maximum seconds to buffer the log messages
--logbufsecs=0
# Whether to redirect stdout and stderr to separate output files
--redirect_stdout=true
# Destination filename of stdout and stderr, which will also reside in log_dir.
--stdout_log_file=storaged-listener-stdout.log
--stderr_log_file=storaged-listener-stderr.log
# Copy log messages at or above this level to stderr in addition to logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are 0, 1, 2, and 3, respectively.
--stderrthreshold=3
# Whether logging files' names contain a timestamp.
--timestamp_in_logfile_name=true
########## networking ##########
# Comma separated Meta server addresses
--meta_server_addrs=10.177.67.39:9559,10.177.67.40:9559,10.177.67.41:9559
# Local IP used to identify the nebula-storaged process.
# Change it to an address other than loopback if the service is distributed or
# will be accessed remotely.
--local_ip=10.177.57.45
# Storage daemon listening port
--port=9779
# HTTP service ip
--ws_ip=0.0.0.0
# HTTP service port
--ws_http_port=19779
# heartbeat with meta service
--heartbeat_interval_secs=10
######### Raft #########
# Raft election timeout
--raft_heartbeat_interval_secs=30
# RPC timeout for raft client (ms)
--raft_rpc_timeout_ms=500
## recycle Raft WAL
--wal_ttl=14400
########## Disk ##########
# Root data path. split by comma. e.g. --data_path=/disk1/path1/,/disk2/path2/
# One path per Rocksdb instance.
#--data_path=/data/disk01/nebula/ndata_v300/,/data/disk02/nebula/ndata_v300/,/data/disk03/nebula/ndata_v300/,/data/disk04/nebula/ndata_v300/,/data/disk05/nebula/ndata_v300/,/data/disk06/nebula/ndata_v300/,/data/disk07/nebula/ndata_v300/,/data/disk08/nebula/ndata_v300/,/data/disk09/nebula/ndata_v300/,/data/disk10/nebula/ndata_v300/,/data/disk11/nebula/ndata_v300/,/data/disk12/nebula/ndata_v300/
--data_path=/data/disk01/nebula/ndata,/data/disk02/nebula/ndata,/data/disk03/nebula/ndata,/data/disk04/nebula/ndata,/data/disk05/nebula/ndata,/data/disk06/nebula/ndata,/data/disk07/nebula/ndata,/data/disk08/nebula/ndata,/data/disk09/nebula/ndata,/data/disk10/nebula/ndata,/data/disk11/nebula/ndata,/data/disk12/nebula/ndata
#--data_path=/data/disk01/nebula/ndata,/data/disk02/nebula/ndata,/data/disk03/nebula/ndata,/data/disk04/nebula/ndata,/data/disk05/nebula/ndata,/data/disk06/nebula/ndata,/data/disk07/nebula/ndata,/data/disk08/nebula/ndata,/data/disk09/nebula/ndata,/data/disk10/nebula/ndata,/data/disk11/nebula/ndata,/data/disk12/nebula/ndata
# Minimum reserved bytes of each data path
--minimum_reserved_bytes=268435456
# The default reserved bytes for one batch operation
--rocksdb_batch_size=4096
# The default block cache size used in BlockBasedTable. (MB)
# recommend: 1/3 of all memory
--rocksdb_block_cache=128341
# Disable page cache to better control memory used by rocksdb.
# Caution: Make sure to allocate enough block cache if disabling page cache!
--disable_page_cache=false
# Compression algorithm, options: no,snappy,lz4,lz4hc,zlib,bzip2,zstd
# For the sake of binary compatibility, the default value is snappy.
# Recommend to use:
# * lz4 to gain more CPU performance, with the same compression ratio with snappy
# * zstd to occupy less disk space
# * lz4hc for the read-heavy write-light scenario
--rocksdb_compression=lz4
# Set different compressions for different levels
# For example, if --rocksdb_compression is snappy,
# "no:no:lz4:lz4::zstd" is identical to "no:no:lz4:lz4:snappy:zstd:snappy"
# In order to disable compression for level 0/1, set it to "no:no"
--rocksdb_compression_per_level=
############## rocksdb Options ##############
# rocksdb DBOptions in json, each name and value of option is a string, given as "option_name":"option_value" separated by comma
--rocksdb_db_options={"max_subcompactions":"64","max_background_jobs":"64","max_background_flushes":"2","skip_checking_sst_file_sizes_on_db_open":"true"}
# rocksdb ColumnFamilyOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_column_family_options={"disable_auto_compactions":"false","write_buffer_size":"67108864","max_write_buffer_number":"5","max_bytes_for_level_base":"268435456"}
# rocksdb BlockBasedTableOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_block_based_table_options={"block_size":"8192"}
# Whether or not to enable rocksdb's statistics, disabled by default
--enable_rocksdb_statistics=false
# Statslevel used by rocksdb to collection statistics, optional values are
# * kExceptHistogramOrTimers, disable timer stats, and skip histogram stats
# * kExceptTimers, Skip timer stats
# * kExceptDetailedTimers, Collect all stats except time inside mutex lock AND time spent on compression.
# * kExceptTimeForMutex, Collect all stats except the counters requiring to get time inside the mutex lock.
# * kAll, Collect all stats
--rocksdb_stats_level=kExceptHistogramOrTimers
# Whether or not to enable rocksdb's prefix bloom filter, enabled by default.
--enable_rocksdb_prefix_filtering=true
# Whether or not to enable rocksdb's whole key bloom filter, disabled by default.
--enable_rocksdb_whole_key_filtering=false
############### misc ####################
# Whether turn on query in multiple thread
--query_concurrently=true
# Whether remove outdated space data
--auto_remove_invalid_space=true
# Network IO threads number
--num_io_threads=16
# Max active connections for all networking threads. 0 means no limit.
# Max connections for each networking thread = num_max_connections / num_netio_threads
--num_max_connections=0
# Worker threads number to handle request
--num_worker_threads=64
# Maximum subtasks to run admin jobs concurrently
--max_concurrent_subtasks=10
# The rate limit in bytes when leader synchronizes snapshot data
--snapshot_part_rate_limit=10485760
# The amount of data sent in each batch when leader synchronizes snapshot data
--snapshot_batch_size=1048576
# The rate limit in bytes when leader synchronizes rebuilding index
--rebuild_index_part_rate_limit=4194304
# The amount of data sent in each batch when leader synchronizes rebuilding index
--rebuild_index_batch_size=1048576
########## memory tracker ##########
# trackable memory ratio (trackable_memory / (total_memory - untracked_reserved_memory) )
--memory_tracker_limit_ratio=0.8
# untracked reserved memory in Mib
--memory_tracker_untracked_reserved_memory_mb=50
# enable log memory tracker stats periodically
--memory_tracker_detail_log=false
# log memory tacker stats interval in milliseconds
--memory_tracker_detail_log_interval_ms=60000
# enable memory background purge (if jemalloc is used)
--memory_purge_enabled=true
# memory background purge interval in seconds
--memory_purge_interval_seconds=10