In the 2.0 master version, RaftPart leaders change automatically

bupt_guojun · January 28, 2021, 03:13

nebula version: latest version on GitHub
Deployment (distributed): cluster deployment, 3 machines
Hardware
- Disk: NVMe SSD, 3.2 TB
- Memory: 256 GB

(root@nebula) [(none)]> describe space push_new
+----+------------+------------------+----------------+---------+------------+--------------------+-----------+
| ID | Name       | Partition Number | Replica Factor | Charset | Collate    | Vid Type           | Group     |
+----+------------+------------------+----------------+---------+------------+--------------------+-----------+
| 2  | "push_new" | 30               | 3              | "utf8"  | "utf8_bin" | "FIXED_STRING(24)" | "default" |
+----+------------+------------------+----------------+---------+------------+--------------------+-----------+
Got 1 rows (time spent 1050/1298 us)

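Problem description:
During a stress test I noticed that the RaftPart leaders changed on their own (none of the processes core dumped). Is there a parameter that controls this? Yesterday the leader count on each machine was 20; today it has changed.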

(root@nebula) [(none)]> show hosts
+------------------+-------+----------+--------------+------------------------------+------------------------------+
| Host             | Port  | Status   | Leader count | Leader distribution          | Partition distribution       |
+------------------+-------+----------+--------------+------------------------------+------------------------------+
| "xxx.xxx.xxx.xxx" | 44500 | "ONLINE" | 16           | "push_new:4, push_space:12"  | "push_new:30, push_space:30" |
+------------------+-------+----------+--------------+------------------------------+------------------------------+
| "xxx.xxx.xxx.xxx" | 44500 | "ONLINE" | 15           | "push_new:10, push_space:5"  | "push_new:30, push_space:30" |
+------------------+-------+----------+--------------+------------------------------+------------------------------+
| "xxx.xxx.xxx.xxx" | 44500 | "ONLINE" | 29           | "push_new:16, push_space:13" | "push_new:30, push_space:30" |
+------------------+-------+----------+--------------+------------------------------+------------------------------+
| "Total"          |       |          | 60           | "push_new:30, push_space:30" | "push_new:90, push_space:90" |
+------------------+-------+----------+--------------+------------------------------+------------------------------+

The storaged logs also show errors, which are probably caused by the leader changes.

E0127 22:02:59.256729 31565 RaftPart.cpp:367] [Port: 44501, Space: 2, Part: 20] The partition is not a leader
E0127 22:02:59.256877 31565 RaftPart.cpp:687] [Port: 44501, Space: 2, Part: 20] Cannot append logs, clean the buffer
E0127 22:02:59.679719 31568 RaftPart.cpp:367] [Port: 44501, Space: 2, Part: 19] The partition is not a leader
E0127 22:02:59.679809 31568 RaftPart.cpp:687] [Port: 44501, Space: 2, Part: 19] Cannot append logs, clean the buffer
E0127 22:02:59.680357 31581 RaftPart.cpp:687] [Port: 44501, Space: 2, Part: 19] Cannot append logs, clean the buffer
E0127 22:02:59.680359 31552 RaftPart.cpp:367] [Port: 44501, Space: 2, Part: 18] The partition is not a leader
E0127 22:02:59.680780 31552 RaftPart.cpp:687] [Port: 44501, Space: 2, Part: 6] Cannot append logs, clean the buffer
E0127 22:02:59.680786 31561 RaftPart.cpp:367] [Port: 44501, Space: 2, Part: 1] The partition is not a leader
E0127 22:02:59.681671 31545 RaftPart.cpp:367] [Port: 44501, Space: 2, Part: 18] The partition is not a leader
E0127 22:02:59.681681 31570 RaftPart.cpp:687] [Port: 44501, Space: 2, Part: 19] Cannot append logs, clean the buffer
E0127 22:02:59.681684 31575 RaftPart.cpp:687] [Port: 44501, Space: 2, Part: 6] Cannot append logs, clean the buffer

jievince · January 28, 2021, 07:51

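Leader changes are normal. Why do you want the leaders to stay fixed? After the errors appeared in the storaged log, did you run into any actual problem in use?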
https://docs.nebula-graph.com.cn/manual-CN/1.overview/3.design-and-architecture/2.storage-design/#transfer_leadership

bupt_guojun · January 28, 2021, 07:53

No problems so far. So does that mean the raft client switches to the new leader and retries after a leader change?

jievince · January 28, 2021, 08:01

Yes. This log is normal and expected; the log level should perhaps be warning instead.
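
To make the retry behavior concrete, here is a minimal sketch of how a client could react to a "not a leader" response. It is not Nebula's actual client code: `NotLeaderError`, `write_with_leader_retry`, the `leader_hint` field, and the host names are all hypothetical names chosen for illustration.

```python
class NotLeaderError(Exception):
    """Hypothetical error: the contacted replica is not the partition leader."""

    def __init__(self, leader_hint=None):
        super().__init__("partition is not a leader")
        self.leader_hint = leader_hint  # optional host the replica believes is leader


def write_with_leader_retry(send, replicas, cached_leader, max_attempts=3):
    """Call send(host), switching hosts whenever NotLeaderError is raised.

    Returns (result, leader) so the caller can refresh its leader cache.
    """
    host = cached_leader
    tried = []
    for _ in range(max_attempts):
        tried.append(host)
        try:
            return send(host), host
        except NotLeaderError as err:
            if err.leader_hint is not None:
                # Follow the hint given by the replica that rejected the write.
                host = err.leader_hint
            else:
                # No hint: fall back to a replica that has not been tried yet.
                untried = [r for r in replicas if r not in tried]
                if not untried:
                    break
                host = untried[0]
    raise RuntimeError(f"no leader found after trying {tried}")


def fake_send(host):
    """Stand-in for a storage write; only the made-up host 's3' is the leader."""
    if host != "s3":
        raise NotLeaderError(leader_hint="s3" if host == "s2" else None)
    return "ok"


result, leader = write_with_leader_retry(fake_send, ["s1", "s2", "s3"], "s1")
```

In this sketch the stale leader `s1` rejects the write, `s2` rejects it but points at `s3`, and the third attempt succeeds, so the caller would update its cached leader to `s3`.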

critical27 · January 28, 2021, 08:06

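Are you writing a large amount of data? At the moment, leaders may switch during heavy concurrent writes, which causes these log lines to be printed.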

bupt_guojun · January 28, 2021, 08:38

Yes, I am importing a large amount of data.

min.wu · January 29, 2021, 05:32

We have run into this too; running balance leader again fixed it.
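
Assuming "balance leader" here means the nGQL `BALANCE LEADER` statement, one way to decide when to run it is to measure how uneven the leaders are. The sketch below parses the "Leader distribution" strings from the `show hosts` output earlier in the thread; `parse_distribution` and `leader_skew` are hypothetical helper names, and it assumes every host lists every space.

```python
def parse_distribution(text):
    """Parse 'push_new:4, push_space:12' into {'push_new': 4, 'push_space': 12}."""
    result = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        space, count = item.rsplit(":", 1)
        result[space.strip()] = int(count)
    return result


def leader_skew(rows):
    """Return {space: (min_leaders, max_leaders)} across all hosts."""
    per_space = {}
    for row in rows:
        for space, count in parse_distribution(row).items():
            per_space.setdefault(space, []).append(count)
    return {space: (min(counts), max(counts)) for space, counts in per_space.items()}


# "Leader distribution" column of the three hosts in the original post.
rows = [
    "push_new:4, push_space:12",
    "push_new:10, push_space:5",
    "push_new:16, push_space:13",
]
skew = leader_skew(rows)
for space, (low, high) in sorted(skew.items()):
    print(f"{space}: min={low} max={high}")
```

With 30 partitions over 3 hosts, an even spread would be 10 leaders per host per space, so a large gap between min and max, such as 4 versus 16 for push_new, suggests that rebalancing is worthwhile.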

jamieliu1023 · February 1, 2021, 08:07

@steam