nebula 2.0.0-rc1: leader errors during bulk data import

Judging from the log you just posted, at least one machine in the raft group of the Part 52 partition seems to have lost its persisted data.

The grep results on the current machine basically all look like this, and there seem to be quite a lot of them:

I0301 19:18:35.383262    15 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261001, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.383766    24 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261002, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.398375    24 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261006, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.431094    20 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261007, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.442268    40 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261009, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.448040    38 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261012, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.461788    39 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261013, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.465445    20 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261015, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.467043    19 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261017, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.469553    41 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261020, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.470352    41 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261024, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.471130    41 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261025, term 1 i had committed yet. My committedLogId is 393043, term is 1
I0301 19:18:35.471912    41 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] Stale log! The log 261026, term 1 i had committed yet. My committedLogId is 393043, term is 1

You can grep for "Part: 52" and then grep that output for "There are".
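For anyone who prefers a script over chained greps, here is a minimal Python sketch of the same two-step filter. The helper name `filter_lines` is made up for illustration, and the sample lines are copied from this thread; reading your real storaged log file instead is left out because its path depends on your deployment.

```python
def filter_lines(lines, *needles):
    """Keep only the lines that contain every needle, like chained grep calls."""
    return [line for line in lines if all(n in line for n in needles)]


# Sample lines copied from this thread.
sample = [
    "I0301 17:32:32.087620     1 RaftPart.cpp:295] [Port: 9780, Space: 15, Part: 52] "
    "There are 2 peer hosts, and total 3 copies. The quorum is 2, as learner 0, "
    "lastLogId 393044, lastLogTerm 1, committedLogId 393043, term 1",
    "I0301 19:18:35.383262    15 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] "
    "Stale log! The log 261001, term 1 i had committed yet. "
    "My committedLogId is 393043, term is 1",
    "E0302 13:36:22.202852    43 RaftPart.cpp:367] [Port: 9780, Space: 1, Part: 5] "
    "The partition is not a leader",
]

# "Part: 52]" with the closing bracket avoids also matching Part: 520, 521, ...
matches = filter_lines(sample, "Part: 52]", "There are")
for line in matches:
    print(line)
```

Including the closing bracket in the first needle is a small design choice: a plain "Part: 52" would also match partitions such as 520.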

I'd guess it's all of them; basically most of the inserts failed.

After adding "There are" there is no output at all.

Then were the logs deleted? That line is always printed. Could you try on the other two machines?

Logs from the current machine and from the other two machines:

I0301 17:32:32.087620     1 RaftPart.cpp:295] [Port: 9780, Space: 15, Part: 52] There are 2 peer hosts, and total 3 copies. The quorum is 2, as learner 0, lastLogId 393044, lastLogTerm 1, committedLogId 393043, term 1
I0301 17:32:25.236040     1 RaftPart.cpp:295] [Port: 9780, Space: 15, Part: 52] There are 2 peer hosts, and total 3 copies. The quorum is 2, as learner 0, lastLogId 0, lastLogTerm 0, committedLogId 0, term 0
I0301 17:32:25.377434     1 RaftPart.cpp:295] [Port: 9780, Space: 15, Part: 52] There are 2 peer hosts, and total 3 copies. The quorum is 2, as learner 0, lastLogId 0, lastLogTerm 0, committedLogId 0, term 0

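OK, then it makes sense. Let's call your current machine A and the other two B and C.

From the log, B and C held an election between themselves (B became leader),

B sends requests to A, but A does not accept them because it already has locally committed logs, so it reports an error.

The problem here is:

since A has locally committed logs, at least one of B and C must have them as well,

but the "There are" search shows an id of 0, which means

that one or both of B and C have lost their local log. (An operational mistake?)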
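The reasoning above can be sketched as a small Python check over the "There are" startup lines. This is only an illustration, not anything shipped with Nebula: the regex is written against the three lines pasted in this thread, the helper names `parse_startup` and `suspect_replicas` are made up, and the mapping of the three lines to A, B and C (current machine first) is an assumption.

```python
import re

# Pattern written against the "There are ..." lines pasted in this thread.
STARTUP_RE = re.compile(
    r"Part: (?P<part>\d+)\] There are .*?"
    r"lastLogId (?P<last_log_id>\d+), lastLogTerm (?P<last_log_term>\d+), "
    r"committedLogId (?P<committed_log_id>\d+), term (?P<term>\d+)"
)


def parse_startup(line):
    """Extract the raft state fields from one startup line, or None if no match."""
    m = STARTUP_RE.search(line)
    if m is None:
        return None
    return {key: int(value) for key, value in m.groupdict().items()}


def suspect_replicas(states):
    """Name replicas whose WAL looks empty although some replica has committed logs."""
    max_committed = max(s["committed_log_id"] for s in states.values())
    if max_committed == 0:
        return []  # nothing was ever committed, so empty logs are expected
    return sorted(name for name, s in states.items() if s["last_log_id"] == 0)


prefix = ("I0301 17:32:25.236040     1 RaftPart.cpp:295] [Port: 9780, Space: 15, "
          "Part: 52] There are 2 peer hosts, and total 3 copies. "
          "The quorum is 2, as learner 0, ")
lines = {
    # Assumed order: A is the current machine, B and C are the other two.
    "A": prefix + "lastLogId 393044, lastLogTerm 1, committedLogId 393043, term 1",
    "B": prefix + "lastLogId 0, lastLogTerm 0, committedLogId 0, term 0",
    "C": prefix + "lastLogId 0, lastLogTerm 0, committedLogId 0, term 0",
}
states = {name: parse_startup(line) for name, line in lines.items()}
suspects = suspect_replicas(states)
print(suspects)
```

With the values from this thread, both B and C come back as suspects, which matches the case where two of the three replicas start from an empty log and can still form a quorum of 2 without A.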

The logs have not been touched; they are stored in a separate directory.
All the docker directory settings are configured like this:

- --data_path=/data1/storaged,/data2/storaged,/data3/storaged,/data4/storaged,/data5/storaged
 volumes:
      - /data1/nebula/data/storaged:/data1/storaged
      - /data2/nebula/data/storaged:/data2/storaged
      - /data3/nebula/data/storaged:/data3/storaged
      - /data4/nebula/data/storaged:/data4/storaged
      - /data5/nebula/data/storaged:/data5/storaged
      - /data/nebula/nebula/logs/storaged:/logs

So should I create a separate logs directory for each storaged?

Oh, the log I mean is raft's WAL (stored under data). The data on one of your machines has probably been touched.

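Could it be a permissions problem? I did not start docker as root but as the nebula user. When I checked the WAL, everything is owned by root: the upper-level directories are still owned by nebula, but from the storaged level down the owner becomes root.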

drwxrwxr-x 3 nebula nebula 4096 Feb 24 14:34 storaged
[nebula@node1 /data1/nebula/data]$ pwd
/data1/nebula/data

drwxr-xr-x 5 root root 4096 Feb 26 16:45 nebula
[nebula@node1 /data1/nebula/data/storaged]$ pwd
/data1/nebula/data/storaged
drwxr-xr-x 2 root root 4096 Mar  1 16:49 10
drwxr-xr-x 2 root root 4096 Mar  2 11:32 12
drwxr-xr-x 2 root root 4096 Mar  1 16:47 18
drwxr-xr-x 2 root root 4096 Mar  2 11:32 23
drwxr-xr-x 2 root root 4096 Mar  2 11:32 34
drwxr-xr-x 2 root root 4096 Mar  2 11:32 4
drwxr-xr-x 2 root root 4096 Mar  2 11:32 42
drwxr-xr-x 2 root root 4096 Mar  1 16:49 47
drwxr-xr-x 2 root root 4096 Mar  1 16:47 5
drwxr-xr-x 2 root root 4096 Mar  2 11:32 53
drwxr-xr-x 2 root root 4096 Mar  1 16:47 6
drwxr-xr-x 2 root root 4096 Mar  2 11:32 64
drwxr-xr-x 2 root root 4096 Mar  1 16:47 70
drwxr-xr-x 2 root root 4096 Mar  1 16:48 71
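To check this across all five data paths without reading `ls -l` output by hand, a rough Python sketch like the one below could list every entry whose owner is not the expected user. The function name `find_foreign_owned` is made up, and the demo runs on a throwaway temp directory; on a real host you would pass something like `/data1/nebula/data/storaged` and the uid of the nebula user (for example from `pwd.getpwnam("nebula").pw_uid`), which is an assumption about your setup.

```python
import os
import tempfile


def find_foreign_owned(root, expected_uid):
    """Return root and every path below it that is not owned by expected_uid."""
    foreign = []
    if os.lstat(root).st_uid != expected_uid:
        foreign.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so that symlinks are reported by their own owner
            if os.lstat(path).st_uid != expected_uid:
                foreign.append(path)
    return sorted(foreign)


# Demo on a throwaway directory tree owned by the current user.
with tempfile.TemporaryDirectory() as demo_root:
    os.makedirs(os.path.join(demo_root, "level1", "level2"))
    me = os.getuid()
    owned_by_me = find_foreign_owned(demo_root, me)         # expect no findings
    owned_by_other = find_foreign_owned(demo_root, me + 1)  # expect every path

print(owned_by_me)
print(len(owned_by_other))
```

This only reports ownership; it does not say whether the storaged process inside the container can actually read or write those paths, since that also depends on which user the container runs as.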

For B and C mentioned above, that is quite possible. But you still have machine A, which can read its WAL. Is there anything different about that one?

The local machine can read it?
Is there a way to make the directories that nebulagraph creates itself be owned by the nebula user, or by a user I specify?

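On physical machines it just creates them itself. With docker you may need to set up the path mapping? Or how about simply setting the dir permissions to 777?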

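It doesn't look like a directory permissions problem. I redeployed as the root user and the problem is still there; the directory ownership has all been changed to root.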

I0302 13:34:38.738606    61 Part.cpp:191] [Port: 9780, Space: 1, Part: 41] Find the new leader [9.198.129.139:9780]
I0302 13:34:38.738612    27 RaftPart.cpp:422] [Port: 9780, Space: 1, Part: 41] Commit transfer leader to [9.198.129.139:9780]
I0302 13:34:38.738948    27 RaftPart.cpp:442] [Port: 9780, Space: 1, Part: 41] I am Follower, just wait for the new leader!
I0302 13:34:40.078891    27 RaftPart.cpp:1747] [Port: 9780, Space: 1, Part: 78] The current role is Follower. Will follow the new leader 9.186.21.144:9780 [Term: 2]
I0302 13:34:40.079088    62 Part.cpp:191] [Port: 9780, Space: 1, Part: 78] Find the new leader [9.186.21.144:9780]
I0302 13:34:40.079095    27 RaftPart.cpp:422] [Port: 9780, Space: 1, Part: 78] Commit transfer leader to [9.186.21.144:9780]
I0302 13:34:40.079403    27 RaftPart.cpp:442] [Port: 9780, Space: 1, Part: 78] I am Follower, just wait for the new leader!
E0302 13:36:17.146997    27 RaftPart.cpp:367] [Port: 9780, Space: 1, Part: 17] The partition is not a leader
E0302 13:36:17.147003    25 RaftPart.cpp:687] [Port: 9780, Space: 1, Part: 90] Cannot append logs, clean the buffer
E0302 13:36:20.185653    33 RaftPart.cpp:367] [Port: 9780, Space: 1, Part: 41] The partition is not a leader
E0302 13:36:20.186385    33 RaftPart.cpp:687] [Port: 9780, Space: 1, Part: 41] Cannot append logs, clean the buffer
E0302 13:36:21.258088    21 RaftPart.cpp:367] [Port: 9780, Space: 1, Part: 90] The partition is not a leader
E0302 13:36:21.258584    21 RaftPart.cpp:687] [Port: 9780, Space: 1, Part: 90] Cannot append logs, clean the buffer
E0302 13:36:22.202852    43 RaftPart.cpp:367] [Port: 9780, Space: 1, Part: 5] The partition is not a leader
E0302 13:36:22.203528    43 RaftPart.cpp:687] [Port: 9780, Space: 1, Part: 5] Cannot append logs, clean the buffer
E0302 13:36:22.267526    24 RaftPart.cpp:367] [Port: 9780, Space: 1, Part: 30] The partition is not a leader
E0302 13:36:22.268087    24 RaftPart.cpp:687] [Port: 9780, Space: 1, Part: 30] Cannot append logs, clean the buffer

Does this new space also print the Stale log line?

Not yet. I have stopped the job for now, but the leader still keeps changing.

You could restart, then do nothing at all, and take a look at the storage log.

After the restart it does print Stale log:

I0302 13:47:14.495133    43 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 59] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 170198, term is 3
I0302 13:47:14.574668    43 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 17] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 171397, term is 3
I0302 13:47:14.613726    43 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 95] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 169519, term is 3
I0302 13:47:14.639617    43 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 88] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 170475, term is 1
I0302 13:47:15.251533    42 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 52] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 171330, term is 4
I0302 13:47:15.378458    42 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 46] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 171035, term is 4
I0302 13:47:17.097889    42 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 24] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 170503, term is 5
I0302 13:47:17.158674    42 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 30] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 170979, term is 4
I0302 13:47:17.524771    43 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 66] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 169454, term is 1
I0302 13:47:17.571445    43 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 60] Stale log! The log 0, term 0 i had committed yet. My committedLogId is 169503, term is 5
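With this many Stale log lines it can help to collapse them per partition. Below is a small Python sketch that does so; the regex is written against the lines pasted in this thread, `summarize_stale` is a made-up helper, and reading "The log N, term T" as the id and term of the log entry being offered to this replica is my interpretation of the message, not something confirmed from the Nebula source.

```python
import re

# Pattern written against the "Stale log!" lines pasted in this thread.
STALE_RE = re.compile(
    r"Space: (?P<space>\d+), Part: (?P<part>\d+)\] Stale log! "
    r"The log (?P<log_id>\d+), term (?P<log_term>\d+) i had committed yet\. "
    r"My committedLogId is (?P<committed_log_id>\d+), term is (?P<term>\d+)"
)


def summarize_stale(lines):
    """Map (space, part) to a count, the lowest offered log id and the local committedLogId."""
    summary = {}
    for line in lines:
        m = STALE_RE.search(line)
        if m is None:
            continue
        f = {key: int(value) for key, value in m.groupdict().items()}
        entry = summary.setdefault(
            (f["space"], f["part"]),
            {"count": 0, "min_log_id": f["log_id"],
             "committed_log_id": f["committed_log_id"]},
        )
        entry["count"] += 1
        entry["min_log_id"] = min(entry["min_log_id"], f["log_id"])
        entry["committed_log_id"] = max(entry["committed_log_id"],
                                        f["committed_log_id"])
    return summary


# Sample lines copied from this thread.
sample = [
    "I0301 19:18:35.383262    15 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] "
    "Stale log! The log 261001, term 1 i had committed yet. "
    "My committedLogId is 393043, term is 1",
    "I0301 19:18:35.383766    24 RaftPart.cpp:1583] [Port: 9780, Space: 15, Part: 52] "
    "Stale log! The log 261002, term 1 i had committed yet. "
    "My committedLogId is 393043, term is 1",
    "I0302 13:47:14.495133    43 RaftPart.cpp:1583] [Port: 9780, Space: 1, Part: 59] "
    "Stale log! The log 0, term 0 i had committed yet. "
    "My committedLogId is 170198, term is 3",
]
summary = summarize_stale(sample)
for key, entry in sorted(summary.items()):
    print(key, entry)
```

A partition where `min_log_id` is 0 while `committed_log_id` is in the hundreds of thousands is the same pattern as before: this replica is far ahead of whatever is being sent to it.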