
Uneven leader distribution, and balance has no effect

After the cluster (30 nodes) was first set up, we created the first space. Initially it was fine: both the leader distribution and the partition distribution were even and normal.
Recently we suddenly noticed that the first space's leader distribution is uneven, and running SHOW HOSTS several times shows a different result each time.
Later we created a second space and found its leader distribution is also uneven and keeps changing, but with a pattern: for example, with the partition count set to 1200, the leader count slowly climbs from 0 (200, 400, 600, and so on), and once it reaches 1000+ it drops back and starts climbing again from around 200. It looks like it keeps retrying.
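A minimal way to pin down the fluctuation described above is to diff successive SHOW HOSTS snapshots. The sketch below assumes each snapshot has already been parsed into a `{host: leader_count}` dict; the parsing step and the host names are made up for illustration:

```python
# Sketch: track leader-count fluctuation across successive SHOW HOSTS polls.
# The snapshot dicts ({host: leader_count}) are an assumed format; in practice
# you would fill them by parsing the console output of SHOW HOSTS.

def diff_snapshots(prev, curr):
    """Return hosts whose leader count changed between two polls."""
    changed = {}
    for host in set(prev) | set(curr):
        before, after = prev.get(host, 0), curr.get(host, 0)
        if before != after:
            changed[host] = (before, after)
    return changed

# Example: totals climbing (200 -> 600) mid-election, as observed above.
poll1 = {"hostA": 200, "hostB": 0}
poll2 = {"hostA": 400, "hostB": 200}
print(diff_snapshots(poll1, poll2))  # both hosts changed
```

If consecutive polls keep returning a non-empty diff, leaders are still moving, which matches the "keeps retrying" pattern described.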

Currently observed in the storage logs:
error

E0817 14:49:34.054718 32564 Host.cpp:389] [Port: 44501, Space: 18, Part: 619] [Host: x.x.x.x:44501] Failed to append logs to the host (Err: -5): Resource temporarily unavailable [11]

warning

W0817 14:49:25.228298 32532 RaftexService.cpp:180] Cannot find the part 651 in the graph space 18
W0817 14:49:25.262503 32532 RaftexService.cpp:180] Cannot find the part 951 in the graph space 18
W0817 14:49:25.299392 32532 RaftexService.cpp:180] Cannot find the part 351 in the graph space 18
W0817 14:49:25.299605 32532 RaftexService.cpp:180] Cannot find the part 261 in the graph space 18

Could you help take a look at the cause? Thanks~

Could you paste the meta and storage conf?

Hi, the configuration is below (the cluster uses 10 GbE networking and SSD disks):

meta

########## basics ##########
# Whether to run as a daemon process
--daemonize=true
# The file to host the process id
--pid_file=pids/nebula-metad.pid

########## logging ##########
# The directory to host logging files, which must already exist
--log_dir=logs
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=1
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=1
# Maximum seconds to buffer the log messages
--logbufsecs=0

########## networking ##########
# Meta Server Address
--meta_server_addrs=x.x.x.1:45500,x.x.x.2:45500,x.x.x.3:45500
# Local ip
--local_ip=x.x.x.1
# Meta daemon listening port
--port=45500
# HTTP service ip
--ws_ip=0.0.0.0
# HTTP service port
--ws_http_port=11000
# HTTP2 service port
--ws_h2_port=11002

--heartbeat_interval_secs=10

########## storage ##########
# Root data path, here should be only single path for metad
--data_path=/data1/data/meta

storage

########## basics ##########
# Whether to run as a daemon process
--daemonize=true
# The file to host the process id
--pid_file=pids/nebula-storaged.pid

########## logging ##########
# The directory to host logging files, which must already exist
--log_dir=logs
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=1
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=1
# Maximum seconds to buffer the log messages
--logbufsecs=0

########## networking ##########
# Meta server address
--meta_server_addrs=x.x.x.1:45500,x.x.x.2:45500,x.x.x.3:45500
# Local ip
--local_ip=x.x.x.1
# Storage daemon listening port
--port=44500
# HTTP service ip
--ws_ip=0.0.0.0
# HTTP service port
--ws_http_port=12000
# HTTP2 service port
--ws_h2_port=12002
# heartbeat with meta service
--heartbeat_interval_secs=10

######### Raft #########
# Raft election timeout
--raft_heartbeat_interval_secs=30
# RPC timeout for raft client (ms)
--raft_rpc_timeout_ms=500
## recycle Raft WAL
--wal_ttl=3600

########## Disk ##########
# Root data path. Split by comma. e.g. --data_path=/disk1/path1/,/disk2/path2/
# One path per Rocksdb instance.
--data_path=/data1/data/storage,/data2/data/storage,/data3/data/storage,/data4/data/storage,/data5/data/storage,/data6/data/storage,/data7/data/storage,/data8/data/storage
############## Rocksdb Options ##############
# The default reserved bytes for one batch operation
--rocksdb_batch_size=4096

# The default block cache size used in BlockBasedTable. (MB)
# recommend: 1/3 of all memory

# Compression algorithm, options: no,snappy,lz4,lz4hc,zlib,bzip2,zstd
# For the sake of binary compatibility, the default value is snappy.
# Recommend to use:
#   * lz4 to gain more CPU performance, with the same compression ratio with snappy
#   * zstd to occupy less disk space
#   * lz4hc for the read-heavy write-light scenario
--rocksdb_compression=snappy

# Set different compressions for different levels
# For example, if --rocksdb_compression is snappy,
# "no:no:lz4:lz4::zstd" is identical to "no:no:lz4:lz4:snappy:zstd:snappy"
# In order to disable compression for level 0/1, set it to "no:no"
--rocksdb_compression_per_level=


############## rocksdb Options ##############
--rocksdb_disable_wal=true


--rocksdb_block_cache=61440
--num_io_threads=24
--num_worker_threads=18
--max_handlers_per_req=256
--min_vertices_per_bucket=100
--reader_handlers=28
--vertex_cache_bucket_exp=8



# rocksdb DBOptions in json, each name and value of option is a string, given as "option_name":"option_value" separated by comma
--rocksdb_db_options={"max_subcompactions":"4","max_background_jobs":"4","stats_dump_period_sec":"200", "write_thread_max_yield_usec":"600"}
# rocksdb ColumnFamilyOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_column_family_options={"disable_auto_compactions":"false","write_buffer_size":"67108864","max_write_buffer_number":"4","max_bytes_for_level_base":"268435456","min_write_buffer_number_to_merge":"2", "max_write_buffer_number_to_maintain":"1"}
# rocksdb BlockBasedTableOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_block_based_table_options={"block_size":"8192","block_restart_interval":"2"}

############# edge samplings ##############
# --enable_reservoir_sampling=false
# --max_edge_returned_per_vertex=2147483647
#--local_config=false

Is the "Cannot find the part" log printed continuously, or does it stop a while after the space is created? Also, could you check whether the meta log shows heartbeats arriving from all 30 machines?

Took a look:
it stops shortly after the space is created.

The meta log level was raised to warning, but neither warning nor error entries appeared. I'll go adjust the meta log level.

So far it's still jumping around... very noticeably... the Leader count changes on every SHOW HOSTS.

I've turned on the info log. How do I check whether heartbeats from all 30 machines are received? Is there a keyword to grep for?~

Also... in theory they should all be connected: SHOW HOSTS shows every host as online, but the Leader distribution column keeps changing. One moment machine A holds a space's leaders, the next moment machine B does. Leaders pop up randomly across all 30 machines, and no machine is permanently without leaders.

Most leader-count fluctuation is caused by storage and meta having inconsistent heartbeat_interval_secs settings. They are consistent in the config you pasted, but could some machines have a different config? The meta log prints lines like "Receive heartbeat from ..."; check the interval between heartbeats from the same machine, and whether logs from all machines arrive.
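The heartbeat check suggested above can be scripted. This is a rough sketch that pulls "Receive heartbeat from" lines out of the meta INFO log (glog line format, as in the snippets below) and reports the gaps between consecutive heartbeats per host; it assumes all lines fall within the same day:

```python
# Sketch: verify heartbeat regularity from the meta INFO log.
# Parses "Receive heartbeat from [ip:port]" lines (HBProcessor.cpp) and
# reports, per host, the seconds between consecutive heartbeats. The line
# shape follows glog's "Immdd HH:MM:SS.ffffff tid file:line] msg" format.
import re
from collections import defaultdict

LINE = re.compile(
    r"^I\d{4} (\d{2}):(\d{2}):(\d{2})\.\d+ .*Receive heartbeat from \[(.+?)\]"
)

def heartbeat_gaps(log_lines):
    """Map each host to the list of seconds between its heartbeats."""
    last_seen = {}
    gaps = defaultdict(list)
    for line in log_lines:
        m = LINE.match(line)
        if not m:
            continue
        hh, mm, ss, host = m.groups()
        t = int(hh) * 3600 + int(mm) * 60 + int(ss)
        if host in last_seen:
            gaps[host].append(t - last_seen[host])
        last_seen[host] = t
    return dict(gaps)

log = [
    "I0817 16:22:41.102599 10886 HBProcessor.cpp:31] Receive heartbeat from [x.x.x.1:44500]",
    "I0817 16:22:51.116571 10877 HBProcessor.cpp:31] Receive heartbeat from [x.x.x.1:44500]",
]
print(heartbeat_gaps(log))  # {'x.x.x.1:44500': [10]}
```

With heartbeat_interval_secs=10, every host's gaps should hover around 10; missing hosts or erratic gaps point at a config or connectivity problem.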

Hi~
I took a look here.
The situation is roughly as follows (trimmed for length, but the pattern is consistent):

I0817 16:22:41.102599 10886 HBProcessor.cpp:31] Receive heartbeat from [x.x.x.1:44500]
I0817 16:22:41.574378 10886 HBProcessor.cpp:31] Receive heartbeat from [x.x.x.2:44500]
I0817 16:22:44.522013 10876 HBProcessor.cpp:31] Receive heartbeat from [x.x.x.3:44500]
...
I0817 16:22:51.116571 10877 HBProcessor.cpp:31] Receive heartbeat from [x.x.x.1:44500]
I0817 16:22:51.588479 10877 HBProcessor.cpp:31] Receive heartbeat from [x.x.x.2:44500]
I0817 16:22:54.536151 10886 HBProcessor.cpp:31] Receive heartbeat from [x.x.x.3:44500]

1. Confirmed that logs from all 30 machines are received.
2. The interval for the same machine is a stable 10 seconds.

It now looks as if these 30 machines are split into several batches, with a certain time gap between each batch.

OK, then the heartbeats should be fine. Is the storage side not printing any abnormal logs at all?

....How embarrassing.... It has now fixed itself....
Every space is evenly distributed....
I still have a few questions... please advise:

1. Can the leader distribution become uneven for no obvious reason? (That is, without me adding or removing nodes, could some other cause make it uneven?)
2. If 1 is true, does that mean we need to periodically check the balance and run balance manually? (Is balance leader resource-intensive? Could it be automated in the future, e.g. a configurable self-check and self-heal every day or week in the small hours...)
3. Is balance leader asynchronous? Could today's problem be that a balance was in progress, which is why the distribution kept jumping around?~

Normally leaders don't switch arbitrarily (as long as all machines are alive). Balance leader is asynchronous: meta returns right after sending the transfer-leader requests, and whether they succeed depends on the state of each Raft group.
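Given that balance leader is asynchronous, the practical way to tell it has finished is to poll the leader distribution until it settles. A small sketch, where `is_balanced` and its `slack` tolerance are hypothetical helpers rather than anything Nebula provides:

```python
# Sketch: BALANCE LEADER is asynchronous, so "done" can only be observed
# by polling. This hypothetical helper treats a distribution as settled
# when every host is within `slack` leaders of the ideal even share.

def is_balanced(leader_counts, slack=1):
    """True if every host's leader count is within `slack` of the mean."""
    if not leader_counts:
        return True
    ideal = sum(leader_counts.values()) / len(leader_counts)
    return all(abs(c - ideal) <= slack for c in leader_counts.values())

# 1200 partitions over 3 hosts: an even 400/400/400 split is settled,
# while 600/400/200 is still mid-transfer.
print(is_balanced({"a": 400, "b": 400, "c": 400}))  # True
print(is_balanced({"a": 600, "b": 400, "c": 200}))  # False
```

In a monitoring loop you would re-run SHOW HOSTS every few seconds and stop once this check passes for a couple of consecutive polls.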

Great, thanks for the explanation~!

Later you could try restarting all the storaged processes and then running balance leader to see how long it takes to balance. Normally it should settle in about a minute; this is taking too long.

OK~
