metad container reports errors

Question template:

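  • nebula version: 2.0.0 rc
  • Deployment: distributed
  • Production environment: Y
  • Storage: longhorn
  • Detailed description of the problem
    I deployed 3 metad and 3 storaged instances on different nodes in a distributed production setup. Since this morning, the metad1 container has been restarting continuously, and it shows the following error: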

What could be causing this?

Hello, could you run the following command?

kubectl describe pod nebula-metad-0
Name:         nebula-metad-1
Namespace:    nebula-database
Priority:     0
Node:         k3s-beijing-agent-03/10.1.125.190
Start Time:   Mon, 28 Jun 2021 02:08:44 +0000
Labels:       app.kubernetes.io/component=nebula-metad
              controller-revision-hash=nebula-metad-5b87c5c6df
              io.cattle.field/appId=nebula-database
              statefulset.kubernetes.io/pod-name=nebula-metad-1
Annotations:  <none>
Status:       Running
IP:           10.42.5.221
IPs:
  IP:           10.42.5.221
Controlled By:  StatefulSet/nebula-metad
Containers:
  nebula-metad:
    Container ID:  docker://f90e69c65ca769ca7514a59dca0322a8bd0cec07a24681394ec77beef50d6c92
    Image:         vesoft/nebula-metad:v2-nightly
    Image ID:      docker-pullable://vesoft/nebula-metad@sha256:cf383d3e59f925977f50c2d1cae5635722f0546c38cace8ec5cabb036851017a
    Ports:         9559/TCP, 19559/TCP, 19560/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/bash
      -ecx
    Args:
      exec /usr/local/nebula/bin/nebula-metad --flagfile=/usr/local/nebula/etc/nebula-metad.conf --v=3 --minloglevel=4 --daemonize=false --local_ip=$(hostname).nebula-metad.nebula-database.svc.cluster.local
    State:          Running
      Started:      Mon, 28 Jun 2021 02:10:58 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    143
      Started:      Mon, 28 Jun 2021 02:09:58 +0000
      Finished:     Mon, 28 Jun 2021 02:10:57 +0000
    Ready:          True
    Restart Count:  2
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     1
      memory:  1Gi
    Liveness:  http-get http://:19559/status delay=30s timeout=10s period=10s #success=1 #failure=3
    Environment:
      USER:  root
    Mounts:
      /etc/localtime from timezone (rw)
      /usr/local/nebula/data from metad (rw,path="data")
      /usr/local/nebula/etc/ from config (rw)
      /usr/local/nebula/logs from metad (rw,path="logs")
      /var/run/secrets/kubernetes.io/serviceaccount from nebula-token-k6q55 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  metad:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  metad-nebula-metad-1
    ReadOnly:   false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nebula-metad
    Optional:  false
  timezone:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/localtime
    HostPathType:  
  nebula-token-k6q55:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nebula-token-k6q55
    Optional:    false
QoS Class:       Guaranteed
Node-Selectors:  nebula=yes
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
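
A note on reading the output above: by the usual shell convention, an exit code above 128 means the process was ended by signal number (code - 128), so `Exit Code: 143` would correspond to signal 15 (SIGTERM). That suggests the container was asked to stop rather than crashing by itself. The last run also lasted 59 seconds (02:09:58 to 02:10:57), which is roughly what the liveness probe settings (delay=30s, period=10s, #failure=3) would give if the probe never succeeded, although the output alone does not confirm this. A minimal sketch of the decoding, where `signal_for_exit_code` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: map a container exit code to the signal that,
# by the usual 128 + N convention, terminated the process.
signal_for_exit_code() {
    code=$1
    if [ "$code" -gt 128 ]; then
        # kill -l <number> prints the signal name without the SIG prefix
        kill -l "$((code - 128))"
    else
        # 128 or below: treated here as a normal exit status, not a signal
        echo "none"
    fi
}

signal_for_exit_code 143
```

For the value in this thread the helper prints `TERM`.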

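OK, this Pod appears to be in the Running state. Which Pod is reporting errors now?
Could you share the corresponding error logs? Are they the ones in the screenshot above? Which Pod do they belong to?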

This wasn't deployed with Nebula Operator, was it?

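nebula-metad-1 is the one reporting errors now, and the logs are the ones I posted above. nebula-metad-0 and nebula-metad-2 are both fine. I deployed it from the Rancher app catalog.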

Do you have more detailed logs? We can't tell from this why it went down.
Or a core dump file?
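
For a container that has already restarted, the logs of the previous container instance and the Pod's recent events are usually the first things to collect. A minimal sketch, assuming the Pod name and namespace from the describe output above (`nebula-metad-1` in `nebula-database`) and a working kubectl context; `collect_metad_diagnostics` is a hypothetical helper name, not part of Nebula or kubectl:

```shell
#!/bin/sh
# Hypothetical helper: gather basic restart diagnostics for one Pod.
# Usage: collect_metad_diagnostics <pod> <namespace>
collect_metad_diagnostics() {
    pod=$1
    ns=$2
    # stdout/stderr of the container instance that ran before the last restart
    kubectl logs "$pod" -n "$ns" --previous
    # Recent events recorded for this Pod (probe failures, kills, scheduling)
    kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod"
    # Log files on the mounted volume; the path is taken from the Mounts section
    kubectl exec "$pod" -n "$ns" -- ls -l /usr/local/nebula/logs
}

# Example (requires cluster access):
# collect_metad_diagnostics nebula-metad-1 nebula-database
```

The Args in the describe output pass `--minloglevel=4`. If that flag follows glog's severity numbering (0 for INFO through 3 for FATAL), the log files may contain very little, so lowering it before the next occurrence might help. This is an assumption about the flag, not something the thread confirms.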

Nothing else, I'm afraid. But there are no events anymore, so it seems to have recovered.

OK, thanks!
These logs aren't enough to analyze the problem. If it happens again, please let us know :grinning:
