nebula 2.0.1 metad and storaged unhealthy

  • Nebula version: 2.0.1
  • Deployment: Docker
  • Production environment: Yes
  • Hardware:
    Mac with M1 chip, 16 GB RAM
  • Description of the issue:

Log excerpts:

      metas-stderr.log

I0610 19:38:43.396545    55 ThriftClientManager.inl:62] resolve "metad2":9560 as "172.27.0.2":9560
I0610 19:38:43.860219     1 MetaDaemon.cpp:112] Leader has not been elected, sleep 1s
I0610 19:38:44.865262     1 MetaDaemon.cpp:112] Leader has not been elected, sleep 1s
I0610 19:38:45.873646     1 MetaDaemon.cpp:112] Leader has not been elected, sleep 1s
I0610 19:38:46.879828     1 MetaDaemon.cpp:112] Leader has not been elected, sleep 1s
I0610 19:38:47.886328     1 MetaDaemon.cpp:112] Leader has not been elected, sleep 1s
I0610 19:38:48.894210     1 MetaDaemon.cpp:112] Leader has not been elected, sleep 1s
I0610 19:38:49.898396     1 MetaDaemon.cpp:112] Leader has not been elected, sleep 1s
I0610 19:38:50.673851    49 RaftPart.cpp:1275] [Port: 9560, Space: 0, Part: 0] No one is elected, continue the election
I0610 19:38:50.903517     1 MetaDaemon.cpp:112] Leader has not been elected, sleep 1s
I0610 19:38:51.207816    50 RaftPart.cpp:1193] [Port: 9560, Space: 0, Part: 0] Sending out an election request (space = 0, part = 0, term = 84, lastLogId = 0, lastLogTerm = 0, 
  storages-stderr.log 

E0610 18:44:04.943058    49 MetaClient.cpp:597] Send request to "metad1":9559, exceed retry limit
*** Aborted at 1623324101 (unix time) try "date -d @1623324101" if you are using GNU date ***
qemu: uncaught target signal 11 (Segmentation fault) - core dumped
I0610 19:25:06.851692     1 StorageDaemon.cpp:91] host = "storaged1":9779
I0610 19:25:06.907989     1 MetaClient.cpp:50] Create meta client to "metad1":9559
E0610 19:25:06.914433     1 GflagsManager.cpp:70] Load gflags json failed
I0610 19:25:06.941390     1 GflagsManager.cpp:140] Prepare to register 12 gflags to meta
W0610 19:25:06.943078     1 FileBasedClusterIdMan.cpp:46] Open file failed, error No such file or directory
I0610 19:25:12.065560    49 ThriftClientManager.inl:62] resolve "metad2":9559 as "172.27.0.2":9559
I0610 19:25:19.356051    49 ThriftClientManager.inl:62] resolve "metad1":9559 as "172.27.0.4":9559
I0610 19:25:26.638190    49 ThriftClientManager.inl:62] resolve "metad0":9559 as "172.27.0.3":9559
I0610 19:25:34.454142    49 ThriftClientManager.inl:62] resolve "metad2":9559 as "172.27.0.2":9559
E0610 19:25:34.463639    49 MetaClient.cpp:597] Send request to "metad2":9559, exceed retry limit

The three storaged/metad/graphd containers can all ping each other over the network.

The network is a manually created bridge.
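For reference, a manually created bridge network is typically set up with commands along these lines. This is only a sketch: the network and container names below are illustrative, not taken from this thread.

```shell
# Create a user-defined bridge network (name is hypothetical)
docker network create --driver bridge nebula-net

# Attach each service container to it (container names are illustrative)
docker network connect nebula-net metad0
docker network connect nebula-net storaged0
docker network connect nebula-net graphd
```

On a user-defined bridge, containers can resolve each other by container name, which is what the hostnames like "metad2" in the logs rely on.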


Please post the INFO logs — upload the complete INFO log files as attachments.

"Please post the INFO logs — upload the complete INFO log files as attachments."

nebula-metad.INFO (114.5 KB) nebula-storaged.INFO (1.0 KB)

Below is a single-line command similar to ss/netstat. Could you run it inside the meta container? (It's roughly equivalent to `ss -plunt | grep 9560`.)

awk 'function hextodec(str,ret,n,i,k,c){
    ret = 0
    n = length(str)
    for (i = 1; i <= n; i++) {
        c = tolower(substr(str, i, 1))
        k = index("123456789abcdef", c)
        ret = ret * 16 + k
    }
    return ret
}
function getIP(str,ret){
    ret=hextodec(substr(str,index(str,":")-2,2)); 
    for (i=5; i>0; i-=2) {
        ret = ret"."hextodec(substr(str,i,2))
    }
    ret = ret":"hextodec(substr(str,index(str,":")+1,4))
    return ret
} 
NR > 1 {{if(NR==2)print "Local - Remote";local=getIP($2);remote=getIP($3)}{print local" - "remote}}' /proc/net/tcp | grep "9559\|9560"
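The awk one-liner above decodes the hex-encoded address fields in /proc/net/tcp, where the kernel stores the IPv4 address as little-endian hex and the port as plain hex. As a quick sanity check of that decoding, here is a minimal bash sketch that converts one hypothetical address field by hand:

```shell
# /proc/net/tcp stores IPv4 addresses as little-endian hex and ports as hex.
# Sample field (hypothetical): 0100007F:2557 should decode to 127.0.0.1:9559
addr="0100007F:2557"
hex_ip=${addr%%:*}            # "0100007F"
hex_port=${addr##*:}          # "2557"
# IP bytes come last-first: 7F.00.00.01 -> 127.0.0.1
ip="$((16#${hex_ip:6:2})).$((16#${hex_ip:4:2})).$((16#${hex_ip:2:2})).$((16#${hex_ip:0:2}))"
port=$((16#$hex_port))
echo "$ip:$port"              # prints 127.0.0.1:9559
```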

In a working meta container brought up with compose, I get results like this:

172.18.0.4:19559 - 0.0.0.0:0
172.18.0.4:19560 - 0.0.0.0:0
0.0.0.0:9559 - 0.0.0.0:0
0.0.0.0:9560 - 0.0.0.0:0
172.18.0.4:9560 - 172.18.0.3:38204
172.18.0.4:9560 - 172.18.0.3:38208
172.18.0.4:9560 - 172.18.0.3:38272
172.18.0.4:9560 - 172.18.0.3:38302
172.18.0.4:9560 - 172.18.0.3:38342
172.18.0.4:57594 - 172.18.0.4:19559
172.18.0.4:9560 - 172.18.0.3:38256
172.18.0.4:9560 - 172.18.0.3:38234
172.18.0.4:58340 - 172.18.0.4:19559
172.18.0.4:9560 - 172.18.0.3:38294
172.18.0.4:59086 - 172.18.0.4:19559
172.18.0.4:9560 - 172.18.0.3:38260
172.18.0.4:9560 - 172.18.0.3:38226
172.18.0.4:9560 - 172.18.0.3:38230
172.18.0.4:9560 - 172.18.0.3:38242
172.18.0.4:9560 - 172.18.0.3:38290
172.18.0.4:9560 - 172.18.0.3:38264
172.18.0.4:9560 - 172.18.0.3:38286
172.18.0.4:9560 - 172.18.0.3:38322

The output above is grepped for port 9560; below is the output without the grep —
a comparison of the results under Docker on my M1 versus on Linux.

awk 'function hextodec(str,ret,n,i,k,c){
    ret = 0
    n = length(str)
    for (i = 1; i <= n; i++) {
        c = tolower(substr(str, i, 1))
        k = index("123456789abcdef", c)
        ret = ret * 16 + k
    }
    return ret
}
function getIP(str,ret){
    ret=hextodec(substr(str,index(str,":")-2,2)); 
    for (i=5; i>0; i-=2) {
        ret = ret"."hextodec(substr(str,i,2))
    }
    ret = ret":"hextodec(substr(str,index(str,":")+1,4))
    return ret
} 
NR > 1 {{if(NR==2)print "Local - Remote";local=getIP($2);remote=getIP($3)}{print local" - "remote}}' /proc/net/tcp 

M1 (meta is unhealthy on my M1 as well)

Local - Remote
127.0.0.11:45463 - 0.0.0.0:0
0.0.0.0:9560 - 0.0.0.0:0
172.19.0.4:35274 - 172.19.0.3:9560
172.19.0.4:9560 - 172.19.0.2:60320
172.19.0.4:9560 - 172.19.0.2:60386
172.19.0.4:9560 - 172.19.0.3:60064
172.19.0.4:53230 - 172.19.0.2:9560
172.19.0.4:35234 - 172.19.0.3:9560
172.19.0.4:9560 - 172.19.0.3:60028
172.19.0.4:9560 - 172.19.0.3:60118
172.19.0.4:9560 - 172.19.0.2:60400
172.19.0.4:9560 - 172.19.0.3:60098
172.19.0.4:35204 - 172.19.0.3:9560
172.19.0.4:9560 - 172.19.0.3:60016
172.19.0.4:35320 - 172.19.0.3:9560
172.19.0.4:9560 - 172.19.0.2:60376
172.19.0.4:9560 - 172.19.0.2:60294
172.19.0.4:35286 - 172.19.0.3:9560
172.19.0.4:9560 - 172.19.0.2:60360
172.19.0.4:9560 - 172.19.0.3:60046
172.19.0.4:53330 - 172.19.0.2:9560
172.19.0.4:35302 - 172.19.0.3:9560
172.19.0.4:53296 - 172.19.0.2:9560
172.19.0.4:9560 - 172.19.0.2:60306
172.19.0.4:35220 - 172.19.0.3:9560
172.19.0.4:53214 - 172.19.0.2:9560
172.19.0.4:53284 - 172.19.0.2:9560
172.19.0.4:53264 - 172.19.0.2:9560
172.19.0.4:9560 - 172.19.0.3:60082
172.19.0.4:9560 - 172.19.0.2:60418
172.19.0.4:9560 - 172.19.0.3:60002
172.19.0.4:9560 - 172.19.0.2:60340
172.19.0.4:53244 - 172.19.0.2:9560
172.19.0.4:53312 - 172.19.0.2:9560
172.19.0.4:35254 - 172.19.0.3:9560

Linux

Local - Remote
172.18.0.4:19559 - 0.0.0.0:0
172.18.0.4:19560 - 0.0.0.0:0
0.0.0.0:9559 - 0.0.0.0:0
0.0.0.0:9560 - 0.0.0.0:0
127.0.0.11:38074 - 0.0.0.0:0
172.18.0.4:9560 - 172.18.0.3:38204
172.18.0.4:9560 - 172.18.0.3:38208
172.18.0.4:9560 - 172.18.0.3:38272
172.18.0.4:9560 - 172.18.0.3:38302
172.18.0.4:9560 - 172.18.0.3:38342
172.18.0.4:9560 - 172.18.0.3:38256
172.18.0.4:9560 - 172.18.0.3:38234
172.18.0.4:9560 - 172.18.0.3:38294
172.18.0.4:59086 - 172.18.0.4:19559
172.18.0.4:60568 - 172.18.0.4:19559
172.18.0.4:9560 - 172.18.0.3:38260
172.18.0.4:9560 - 172.18.0.3:38226
172.18.0.4:9560 - 172.18.0.3:38230
172.18.0.4:59830 - 172.18.0.4:19559
172.18.0.4:33086 - 172.18.0.4:19559
172.18.0.4:9560 - 172.18.0.3:38242
172.18.0.4:9560 - 172.18.0.3:38290
172.18.0.4:9560 - 172.18.0.3:38264
172.18.0.4:9560 - 172.18.0.3:38286
172.18.0.4:9560 - 172.18.0.3:38322

Docker's support on the M1 is still incomplete: although ping succeeds, the response packets are not received, so communication between the containers does not work properly, and the services cannot start. We haven't yet looked into whether there is another way to solve this on the M1, so we suggest you switch to a different environment for testing for now, or set up a virtual machine.
