RaftPart leader changes automatically on the 2.0 master build

  • nebula version: latest version on GitHub
  • Deployment (distributed): cluster deployment, 3 machines
  • Hardware
    • Disk: NVMe SSD 3.2TB
    • Memory: 256GB
(root@nebula) [(none)]> describe space push_new
+----+------------+------------------+----------------+---------+------------+--------------------+-----------+
| ID | Name       | Partition Number | Replica Factor | Charset | Collate    | Vid Type           | Group     |
+----+------------+------------------+----------------+---------+------------+--------------------+-----------+
| 2  | "push_new" | 30               | 3              | "utf8"  | "utf8_bin" | "FIXED_STRING(24)" | "default" |
+----+------------+------------------+----------------+---------+------------+--------------------+-----------+
Got 1 rows (time spent 1050/1298 us)
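  • Problem description:
    During a stress test I noticed that the RaftPart leaders changed on their own (none of the processes core-dumped). Is there a parameter that controls this? Yesterday the leader count on every machine was 20; today it has changed.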
(root@nebula) [(none)]> show hosts
+------------------+-------+----------+--------------+------------------------------+------------------------------+
| Host             | Port  | Status   | Leader count | Leader distribution          | Partition distribution       |
+------------------+-------+----------+--------------+------------------------------+------------------------------+
| "xxx.xxx.xxx.xxx"  | 44500 | "ONLINE" | 16           | "push_new:4, push_space:12"  | "push_new:30, push_space:30" |
+------------------+-------+----------+--------------+------------------------------+------------------------------+
| "xxx.xxx.xxx.xxx" | 44500 | "ONLINE" | 15           | "push_new:10, push_space:5"  | "push_new:30, push_space:30" |
+------------------+-------+----------+--------------+------------------------------+------------------------------+
| "xxx.xxx.xxx.xxx" | 44500 | "ONLINE" | 29           | "push_new:16, push_space:13" | "push_new:30, push_space:30" |
+------------------+-------+----------+--------------+------------------------------+------------------------------+
| "Total"          |       |          | 60           | "push_new:30, push_space:30" | "push_new:90, push_space:90" |
+------------------+-------+----------+--------------+------------------------------+------------------------------+
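
To make the drift from an even 20 leaders per host easier to see, here is a minimal Python sketch that parses the "Leader distribution" column from the output above and reports how far each host is from an even share. The strings are copied from the table; the helper names `parse_distribution` and `leader_skew` are made up for illustration and are not part of any nebula tool.

```python
# Leader distribution strings copied from the SHOW HOSTS output above,
# one entry per storage host (host addresses are masked in the post).
leader_distribution = [
    "push_new:4, push_space:12",
    "push_new:10, push_space:5",
    "push_new:16, push_space:13",
]


def parse_distribution(cell):
    """Turn 'push_new:4, push_space:12' into {'push_new': 4, 'push_space': 12}."""
    result = {}
    for item in cell.split(","):
        space, count = item.strip().split(":")
        result[space] = int(count)
    return result


def leader_skew(cells):
    """Return per-host leader totals and each host's offset from an even share."""
    totals = [sum(parse_distribution(cell).values()) for cell in cells]
    even_share = sum(totals) / len(totals)
    return totals, [total - even_share for total in totals]


totals, offsets = leader_skew(leader_distribution)
print(totals)   # per-host leader counts
print(offsets)  # positive means the host holds more than its even share
```

With the numbers above, the third host holds 9 leaders more than an even share of 20, and the other two hold 4 and 5 fewer.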

The storaged logs also show errors, which I assume are caused by the leader change.

E0127 22:02:59.256729 31565 RaftPart.cpp:367] [Port: 44501, Space: 2, Part: 20] The partition is not a leader
E0127 22:02:59.256877 31565 RaftPart.cpp:687] [Port: 44501, Space: 2, Part: 20] Cannot append logs, clean the buffer
E0127 22:02:59.679719 31568 RaftPart.cpp:367] [Port: 44501, Space: 2, Part: 19] The partition is not a leader
E0127 22:02:59.679809 31568 RaftPart.cpp:687] [Port: 44501, Space: 2, Part: 19] Cannot append logs, clean the buffer
E0127 22:02:59.680357 31581 RaftPart.cpp:687] [Port: 44501, Space: 2, Part: 19] Cannot append logs, clean the buffer
E0127 22:02:59.680359 31552 RaftPart.cpp:367] [Port: 44501, Space: 2, Part: 18] The partition is not a leader
E0127 22:02:59.680780 31552 RaftPart.cpp:687] [Port: 44501, Space: 2, Part: 6] Cannot append logs, clean the buffer
E0127 22:02:59.680786 31561 RaftPart.cpp:367] [Port: 44501, Space: 2, Part: 1] The partition is not a leader
E0127 22:02:59.681671 31545 RaftPart.cpp:367] [Port: 44501, Space: 2, Part: 18] The partition is not a leader
E0127 22:02:59.681681 31570 RaftPart.cpp:687] [Port: 44501, Space: 2, Part: 19] Cannot append logs, clean the buffer
E0127 22:02:59.681684 31575 RaftPart.cpp:687] [Port: 44501, Space: 2, Part: 6] Cannot append logs, clean the buffer

Leader changes are normal. Why do you want the leader to stay fixed? After storaged logged these errors, did you run into any actual problem in use?
https://docs.nebula-graph.com.cn/manual-CN/1.overview/3.design-and-architecture/2.storage-design/#transfer_leadership

No problems so far. So does the raft client switch to the new leader and retry after a leader change?

Yes. This log is normal and expected; warning would probably be a more appropriate level for it.
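
For anyone wondering what "switch leader and retry" looks like in general, here is a rough Python sketch of the pattern. It is not nebula's actual client code: `NotLeaderError`, `FakeReplica`, `write_with_retry`, the leader hint, and the host names are all hypothetical and only illustrate the idea that a write rejected with "not a leader" is resent to another replica.

```python
class NotLeaderError(Exception):
    """Raised by a replica that is not the leader (hypothetical error type)."""

    def __init__(self, leader_hint=None):
        super().__init__("The partition is not a leader")
        self.leader_hint = leader_hint


class FakeReplica:
    """In-memory stand-in for one replica of a partition (hypothetical)."""

    def __init__(self, name, cluster):
        self.name = name
        self.cluster = cluster  # shared dict recording who the leader is
        self.log = []

    def append(self, entry):
        if self.cluster["leader"] != self.name:
            # Assumption: the replica can tell the client who the leader is.
            raise NotLeaderError(leader_hint=self.cluster["leader"])
        self.log.append(entry)
        return self.name


def write_with_retry(replicas, cached_leader, entry, max_retries=3):
    """Send to the cached leader; on NotLeaderError pick a new target and retry."""
    target = cached_leader
    for _ in range(max_retries):
        try:
            return replicas[target].append(entry)
        except NotLeaderError as err:
            # Follow the hint if present, otherwise move on to the next replica.
            names = sorted(replicas)
            target = err.leader_hint or names[(names.index(target) + 1) % len(names)]
    raise RuntimeError("no leader found after %d attempts" % max_retries)


cluster = {"leader": "host_c"}  # leadership has already moved to host_c
replicas = {name: FakeReplica(name, cluster) for name in ("host_a", "host_b", "host_c")}
served_by = write_with_retry(replicas, cached_leader="host_a", entry="vertex-1")
print(served_by)
```

In this toy setup the client still believes host_a is the leader, gets rejected once, and the retry lands on host_c. The rejected first attempt corresponds to the kind of "not a leader" error line shown in the storaged log.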

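Are you writing a large amount of data? Currently, under heavy concurrent writes the leader may switch, which causes these logs to be printed.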

Yes, I'm importing a large amount of data.

We have run into this too; running BALANCE LEADER again fixed it.

@steam