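Judging from the log you just posted, at least one machine in the raft group of the Part 52 partition has probably lost its persisted data.
The grep results on the current machine basically all look like this, and there seem to be quite a lot of them: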
I0301 19:18:35.383262 15 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261001, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.383766 24 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261002, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.398375 24 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261006, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.431094 20 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261007, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.442268 40 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261009, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.448040 38 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261012, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.461788 39 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261013, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.465445 20 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261015, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.467043 19 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261017, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.469553 41 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261020, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.470352 41 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261024, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.471130 41 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261025, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.471912 41 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261026, term 1 i had committed yet. My committedLogId is 393043, term is 1
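You can grep for "Part: 52" and then grep that output for "There are".
I'd guess it's all of them; basically most of the inserts failed.
After adding "There are" there are no results at all.
Were the logs deleted, then? That line is always printed. Can you try on the other two machines?
Here are the logs from the current machine and the other two machines: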
I0301 17:32:32.087620 1 RaftPart.cpp:295] [Port: 9780, Space: 15, Part: 52] There are 2 peer hosts, and total 3 copies. The quorum is 2, as learner 0, lastLogId 393044, lastLogTerm 1, committedLogId 393043, term 1
I0301 17:32:25.236040 1 RaftPart.cpp:295] [Port: 9780, Space: 15, Part: 52] There are 2 peer hosts, and total 3 copies. The quorum is 2, as learner 0, lastLogId 0, lastLogTerm 0, committedLogId 0, term 0
I0301 17:32:25.377434 1 RaftPart.cpp:295] [Port: 9780, Space: 15, Part: 52] There are 2 peer hosts, and total 3 copies. The quorum is 2, as learner 0, lastLogId 0, lastLogTerm 0, committedLogId 0, term 0
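OK, then it all makes sense. Let's call your current machine A and the other two B and C.
From the logs, B and C held an election between themselves (B became leader).
B sends requests to A, but A refuses them because it already has locally committed logs, so it reports the error.
The problem here is this:
since A has locally committed logs, at least one of B and C must have them too (with 3 copies and a quorum of 2, an entry is only committed once two replicas hold it),
but the "There are" search shows their IDs are 0, which means
the local logs on one or both of B and C have been lost (an operational mistake?).
The logs haven't been touched; they are stored in a separate directory.
All the docker directories are configured like this: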
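The comparison described above can be automated. The following is a minimal sketch, not part of NebulaGraph: it parses the "There are ... peer hosts" startup lines quoted in this thread and flags replicas that start with `lastLogId 0` while another replica of the same partition reports a non-zero `committedLogId`. The function names and the host labels A/B/C are hypothetical, the regex is written only against the line format shown here, and the mapping of the three quoted lines to A, B and C is assumed.

```python
import re

# Written against the startup line format quoted in this thread (RaftPart.cpp:295).
STARTUP_RE = re.compile(
    r"\[Port: (?P<port>\d+), Space: (?P<space>\d+), Part: (?P<part>\d+)\] "
    r"There are \d+ peer hosts.*?"
    r"lastLogId (?P<last_log_id>\d+), lastLogTerm (?P<last_log_term>\d+), "
    r"committedLogId (?P<committed_log_id>\d+), term (?P<term>\d+)"
)


def parse_startup_line(line):
    """Return the numeric fields of a startup line, or None if it does not match."""
    m = STARTUP_RE.search(line)
    if m is None:
        return None
    return {key: int(value) for key, value in m.groupdict().items()}


def find_empty_replicas(lines_by_host):
    """Given host label -> startup line for one partition, list hosts whose WAL
    looks empty even though some replica has committed entries."""
    states = {}
    for host, line in lines_by_host.items():
        parsed = parse_startup_line(line)
        if parsed is not None:
            states[host] = parsed
    max_committed = max((s["committed_log_id"] for s in states.values()), default=0)
    if max_committed == 0:
        return []  # nothing was ever committed, so an empty WAL is not suspicious
    return sorted(h for h, s in states.items() if s["last_log_id"] == 0)


PREFIX = ("[Port: 9780, Space: 15, Part: 52] There are 2 peer hosts, and total 3 "
          "copies. The quorum is 2, as learner 0, ")
# Host labels are assumed; the thread does not say which line came from which machine.
example = {
    "A": PREFIX + "lastLogId 393044, lastLogTerm 1, committedLogId 393043, term 1",
    "B": PREFIX + "lastLogId 0, lastLogTerm 0, committedLogId 0, term 0",
    "C": PREFIX + "lastLogId 0, lastLogTerm 0, committedLogId 0, term 0",
}
print(find_empty_replicas(example))  # ['B', 'C']
```

A flagged host only means its WAL was empty at startup; the sketch cannot tell why the data is gone.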
- --data_path=/data1/storaged,/data2/storaged,/data3/storaged,/data4/storaged,/data5/storaged
volumes:
- /data1/nebula/data/storaged:/data1/storaged
- /data2/nebula/data/storaged:/data2/storaged
- /data3/nebula/data/storaged:/data3/storaged
- /data4/nebula/data/storaged:/data4/storaged
- /data5/nebula/data/storaged:/data5/storaged
- /data/nebula/nebula/logs/storaged:/logs
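So should I create a separate logs directory for each storaged?
Oh, the log I mean is the raft WAL (stored under data). The data on one of your machines must have been touched.
Could it be a permissions problem? I start docker as the nebula user, not as root. I checked the WAL and everything there is owned by root. The upper-level directories are still owned by nebula, but further down, under storaged, the ownership becomes root.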
drwxrwxr-x 3 nebula nebula 4096 Feb 24 14:34 storaged
[nebula@node1 /data1/nebula/data]$ pwd
/data1/nebula/data
drwxr-xr-x 5 root root 4096 Feb 26 16:45 nebula
[nebula@node1 /data1/nebula/data/storaged]$ pwd
/data1/nebula/data/storaged
drwxr-xr-x 2 root root 4096 Mar 1 16:49 10
drwxr-xr-x 2 root root 4096 Mar 2 11:32 12
drwxr-xr-x 2 root root 4096 Mar 1 16:47 18
drwxr-xr-x 2 root root 4096 Mar 2 11:32 23
drwxr-xr-x 2 root root 4096 Mar 2 11:32 34
drwxr-xr-x 2 root root 4096 Mar 2 11:32 4
drwxr-xr-x 2 root root 4096 Mar 2 11:32 42
drwxr-xr-x 2 root root 4096 Mar 1 16:49 47
drwxr-xr-x 2 root root 4096 Mar 1 16:47 5
drwxr-xr-x 2 root root 4096 Mar 2 11:32 53
drwxr-xr-x 2 root root 4096 Mar 1 16:47 6
drwxr-xr-x 2 root root 4096 Mar 2 11:32 64
drwxr-xr-x 2 root root 4096 Mar 1 16:47 70
drwxr-xr-x 2 root root 4096 Mar 1 16:48 71
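Checking ownership by hand across five data paths is tedious, so here is a minimal sketch that walks a directory tree and reports every entry not owned by an expected uid. The function name `find_foreign_owned` is hypothetical, and the example path and user in the comments are taken from the listing above purely for illustration.

```python
import os


def find_foreign_owned(root, expected_uid):
    """Walk `root` and return (path, uid) for every entry below it whose
    owner differs from `expected_uid`. Symlinks are not followed."""
    mismatched = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            uid = os.lstat(path).st_uid
            if uid != expected_uid:
                mismatched.append((path, uid))
    return mismatched


# Example usage (assumes a local "nebula" user exists on the host):
#   import pwd
#   uid = pwd.getpwnam("nebula").pw_uid
#   for path, owner in find_foreign_owned("/data1/nebula/data/storaged", uid):
#       print(owner, path)
```

This only reports mismatches. Whether they matter depends on which user the storaged process runs as inside the container.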
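For B and C as I described them, that's possible. But you still have machine A, which can read its WAL. Is there anything different about it?
The local machine can read it?
Is there a way to make the directories that NebulaGraph creates itself be owned by the nebula user, or by a user I specify?
On physical machines it simply creates them itself. With docker you may need to map the paths? Or how about setting the directory permissions to 777?
It doesn't look like a directory permission problem. I redeployed as root and the problem is still there; all the directories are now owned by root.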
I0302 13:34:38.738606 61 Part.cpp:191] [Port: 9780, Space: 1, Part: 41] Find the new leader [9.198.129.139:9780]
I0302 13:34:38.738612 27 RaftPart.cpp:422] [Port: 9780, Space: 1, Part: 41] Commit transfer leader to [9.198.129.139:9780]
I0302 13:34:38.738948 27 RaftPart.cpp:442] [Port: 9780, Space: 1, Part: 41] I am Follower, just wait for the new leader!
I0302 13:34:40.078891 27 RaftPart.cpp:1747] [Port: 9780, Space: 1, Part: 78] The current role is Follower. Will follow the new leader 9.186.21.144:9780 [Term: 2]
I0302 13:34:40.079088 62 Part.cpp:191] [Port: 9780, Space: 1, Part: 78] Find the new leader [9.186.21.144:9780]
I0302 13:34:40.079095 27 RaftPart.cpp:422] [Port: 9780, Space: 1, Part: 78] Commit transfer leader to [9.186.21.144:9780]
I0302 13:34:40.079403 27 RaftPart.cpp:442] [Port: 9780, Space: 1, Part: 78] I am Follower, just wait for the new leader!
E0302 13:36:17.146997 27 RaftPart.cpp:367] [Port: 9780, Space: 1, Part: 17] The partition is not a leader
E0302 13:36:17.147003 25 RaftPart.cpp:687] [Port: 9780, Space: 1, Part: 90] Cannot append logs, clean the buffer
E0302 13:36:20.185653 33 RaftPart.cpp:367] [Port: 9780, Space: 1, Part: 41] The partition is not a leader
E0302 13:36:20.186385 33 RaftPart.cpp:687] [Port: 9780, Space: 1, Part: 41] Cannot append logs, clean the buffer
E0302 13:36:21.258088 21 RaftPart.cpp:367] [Port: 9780, Space: 1, Part: 90] The partition is not a leader
E0302 13:36:21.258584 21 RaftPart.cpp:687] [Port: 9780, Space: 1, Part: 90] Cannot append logs, clean the buffer
E0302 13:36:22.202852 43 RaftPart.cpp:367] [Port: 9780, Space: 1, Part: 5] The partition is not a leader
E0302 13:36:22.203528 43 RaftPart.cpp:687] [Port: 9780, Space: 1, Part: 5] Cannot append logs, clean the buffer
E0302 13:36:22.267526 24 RaftPart.cpp:367] [Port: 9780, Space: 1, Part: 30] The partition is not a leader
E0302 13:36:22.268087 24 RaftPart.cpp:687] [Port: 9780, Space: 1, Part: 30] Cannot append logs, clean the buffer
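Does this new space print the "Stale log" line too?
Not yet. I've stopped the job for now, but the leader still keeps changing.
You could restart, do nothing else, and then take a look at the storage log.
After the restart it does print Stale log: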
I0302 13:47:14.495133 43 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 59] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 170198, term is 3
I0302 13:47:14.574668 43 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 17] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 171397, term is 3
I0302 13:47:14.613726 43 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 95] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 169519, term is 3
I0302 13:47:14.639617 43 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 88] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 170475, term is 1
I0302 13:47:15.251533 42 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 52] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 171330, term is 4
I0302 13:47:15.378458 42 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 46] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 171035, term is 4
I0302 13:47:17.097889 42 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 24] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 170503, term is 5
I0302 13:47:17.158674 42 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 30] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 170979, term is 4
I0302 13:47:17.524771 43 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 66] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 169454, term is 1
I0302 13:47:17.571445 43 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 60] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 169503, term is 5
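With this many "Stale log" lines across partitions, a per-partition tally is easier to read than raw grep output. The following is a minimal sketch that parses only the message format quoted in this thread. The function name `summarize_stale` and the shape of the summary dict are hypothetical.

```python
import re

# Written against the "Stale log!" message format quoted in this thread.
STALE_RE = re.compile(
    r"\[Port: \d+, Space: (?P<space>\d+), Part: (?P<part>\d+)\] Stale log! "
    r"The log (?P<log_id>\d+), term (?P<log_term>\d+) i had committed yet\. "
    r"My committedLogId is (?P<committed>\d+), term is (?P<term>\d+)"
)


def summarize_stale(lines):
    """Group Stale log lines by (space, part) and record how many were seen,
    the range of rejected log ids, and the local committedLogId."""
    summary = {}
    for line in lines:
        m = STALE_RE.search(line)
        if m is None:
            continue
        key = (int(m["space"]), int(m["part"]))
        log_id = int(m["log_id"])
        entry = summary.setdefault(
            key, {"count": 0, "min_log_id": log_id, "max_log_id": log_id,
                  "committed": 0})
        entry["count"] += 1
        entry["min_log_id"] = min(entry["min_log_id"], log_id)
        entry["max_log_id"] = max(entry["max_log_id"], log_id)
        entry["committed"] = max(entry["committed"], int(m["committed"]))
    return summary


sample = [
    "I0301 19:18:35.383262 15 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] "
    "Stale log! The log 261001, term 1 i had committed yet. "
    "My committedLogId is 393043, term is 1",
    "I0301 19:18:35.383766 24 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] "
    "Stale log! The log 261002, term 1 i had committed yet. "
    "My committedLogId is 393043, term is 1",
    "I0302 13:47:14.495133 43 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 59] "
    "Stale log! The log 0, term 0 i had committed yet. "
    "My committedLogId is 170198, term is 3",
]
for key, entry in sorted(summarize_stale(sample).items()):
    print(key, entry)
```

A partition whose rejected log ids are all 0 while its committedLogId is large, as in the post-restart output above, fits the same pattern as Part 52: the sender's log is far behind what this replica has already committed.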