graphd readiness probe keeps failing when deploying a Nebula Cluster with Nebula Operator

  • graphd keeps reporting the following:
Readiness probe failed: Get http://172.20.2.92:19669/status: dial tcp 172.20.2.92:19669: connect: connection refused
  • Even after I set the initial probe delay (initialDelaySeconds) to 200 seconds, it was the same.
  • After removing the readiness probe, the pod showed as normal, but once I got inside the pod, none of the relevant ports were actually listening (see the checks sketched below).
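A minimal sketch of those checks (the pod name nebula-graphd-0 appears later in this thread; whether ss or netstat exists in the vesoft image is an assumption):

kubectl exec -n nebula nebula-graphd-0 -- ss -lntp        # if ss is present in the image
kubectl exec -n nebula nebula-graphd-0 -- netstat -lntp   # fallback if netstat is present
# from a cluster node, the same request the kubelet probe performs:
curl -v http://172.20.2.92:19669/status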

Could you paste your deployment steps? Which tutorial did you follow, again?

I followed this document: nebula-operator/nebula_cluster_helm_guide.md at master · vesoft-inc/nebula-operator · GitHub

I made no other changes.

Hi, could you paste your nebula-cluster YAML configuration? Generally speaking, as long as the graphd process starts normally, the probe will pass.

helm install nebula nebula-operator/nebula-cluster \
  --namespace nebula --create-namespace --version 0.1.0 \
  --set nameOverride=nebula \
  --set nebula.storageClassName=managed-nfs-storage

I deployed it directly like this.
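A quick way to double-check what the chart created (a sketch, following the release name and namespace in the command above):

helm status nebula --namespace nebula
kubectl get pods -n nebula -l app.kubernetes.io/cluster=nebula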

  • nebula-graphd
apiVersion: apps/v1
kind: StatefulSet
metadata:
  annotations:
    nebula-graph.io/last-applied-configuration: '{"podManagementPolicy":"Parallel","replicas":2,"selector":{"matchLabels":{"app.kubernetes.io/cluster":"nebula","app.kubernetes.io/component":"graphd","app.kubernetes.io/managed-by":"nebula-operator","app.kubernetes.io/name":"nebula-graph"}},"serviceName":"nebula-graphd-svc","template":{"metadata":{"annotations":{"nebula-graph.io/cm-hash":"94ea457be88fae25"},"creationTimestamp":null,"labels":{"app.kubernetes.io/cluster":"nebula","app.kubernetes.io/component":"graphd","app.kubernetes.io/managed-by":"nebula-operator","app.kubernetes.io/name":"nebula-graph"}},"spec":{"containers":[{"command":["/bin/bash","-ecx","exec
      /usr/local/nebula/bin/nebula-graphd --flagfile=/usr/local/nebula/etc/nebula-graphd.conf
      --meta_server_addrs=nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local:9559,nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local:9559,nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local:9559
      --local_ip=$(hostname).nebula-graphd-svc.nebula.svc.cluster.local --ws_ip=$(hostname).nebula-graphd-svc.nebula.svc.cluster.local
      --minloglevel=1 --v=0 --daemonize=false"],"image":"vesoft/nebula-graphd:v2.0.0","imagePullPolicy":"IfNotPresent","name":"graphd","ports":[{"containerPort":9669,"name":"thrift"},{"containerPort":19669,"name":"http"},{"containerPort":19670,"name":"http2"}],"readinessProbe":{"httpGet":{"path":"/status","port":19669,"scheme":"HTTP"},"initialDelaySeconds":20,"periodSeconds":10,"timeoutSeconds":5},"resources":{"limits":{"cpu":"1","memory":"1Gi"},"requests":{"cpu":"500m","memory":"500Mi"}},"volumeMounts":[{"mountPath":"/usr/local/nebula/logs","name":"graphd","subPath":"logs"},{"mountPath":"/usr/local/nebula/etc","name":"nebula-graphd"}]}],"schedulerName":"default-scheduler","topologySpreadConstraints":[{"labelSelector":{"matchLabels":{"app.kubernetes.io/cluster":"nebula","app.kubernetes.io/component":"graphd","app.kubernetes.io/managed-by":"nebula-operator","app.kubernetes.io/name":"nebula-graph"}},"maxSkew":1,"topologyKey":"kubernetes.io/hostname","whenUnsatisfiable":"ScheduleAnyway"}],"volumes":[{"name":"graphd","persistentVolumeClaim":{"claimName":"graphd"}},{"configMap":{"items":[{"key":"nebula-graphd.conf","path":"nebula-graphd.conf"}],"name":"nebula-graphd"},"name":"nebula-graphd"}]}},"updateStrategy":{"rollingUpdate":{"partition":2},"type":"RollingUpdate"},"volumeClaimTemplates":[{"metadata":{"creationTimestamp":null,"name":"graphd"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"1Gi"}},"storageClassName":"managed-nfs-storage"},"status":{}}]}'
  labels:
    app.kubernetes.io/cluster: nebula
    app.kubernetes.io/component: graphd
    app.kubernetes.io/managed-by: nebula-operator
    app.kubernetes.io/name: nebula-graph
  name: nebula-graphd
  namespace: nebula
spec:
  podManagementPolicy: Parallel
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/cluster: nebula
      app.kubernetes.io/component: graphd
      app.kubernetes.io/managed-by: nebula-operator
      app.kubernetes.io/name: nebula-graph
  serviceName: nebula-graphd-svc
  template:
    metadata:
      annotations:
        nebula-graph.io/cm-hash: 94ea457be88fae25
      creationTimestamp: null
      labels:
        app.kubernetes.io/cluster: nebula
        app.kubernetes.io/component: graphd
        app.kubernetes.io/managed-by: nebula-operator
        app.kubernetes.io/name: nebula-graph
    spec:
      containers:
      - command:
        - /bin/bash
        - -ecx
        - exec /usr/local/nebula/bin/nebula-graphd --flagfile=/usr/local/nebula/etc/nebula-graphd.conf
          --meta_server_addrs=nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local:9559,nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local:9559,nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local:9559
          --local_ip=$(hostname).nebula-graphd-svc.nebula.svc.cluster.local --ws_ip=$(hostname).nebula-graphd-svc.nebula.svc.cluster.local
          --minloglevel=1 --v=0 --daemonize=false
        image: vesoft/nebula-graphd:v2.0.0
        imagePullPolicy: IfNotPresent
        name: graphd
        ports:
        - containerPort: 9669
          name: thrift
          protocol: TCP
        - containerPort: 19669
          name: http
          protocol: TCP
        - containerPort: 19670
          name: http2
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /status
            port: 19669
            scheme: HTTP
          initialDelaySeconds: 20
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 500m
            memory: 500Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/local/nebula/logs
          name: graphd
          subPath: logs
        - mountPath: /usr/local/nebula/etc
          name: nebula-graphd
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      terminationGracePeriodSeconds: 30
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/cluster: nebula
            app.kubernetes.io/component: graphd
            app.kubernetes.io/managed-by: nebula-operator
            app.kubernetes.io/name: nebula-graph
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
      volumes:
      - name: graphd
        persistentVolumeClaim:
          claimName: graphd
      - configMap:
          defaultMode: 420
          items:
          - key: nebula-graphd.conf
            path: nebula-graphd.conf
          name: nebula-graphd
        name: nebula-graphd
  updateStrategy:
    rollingUpdate:
      partition: 2
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      creationTimestamp: null
      name: graphd
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      storageClassName: managed-nfs-storage
      volumeMode: Filesystem
    status:
      phase: Pending

This is what -o yaml printed. The other two components came up fine; only this one keeps failing the readiness probe.

Could you run kubectl describe on the failing pod? This shouldn't be a problem with the YAML configuration.

[root@k8s-77-64 etc]# kubectl describe pods -n nebula nebula-graphd-0
Name:         nebula-graphd-0
Namespace:    nebula
Priority:     0
Node:         172.16.77.189/172.16.77.189
Start Time:   Thu, 15 Jul 2021 15:50:01 +0800
Labels:       app.kubernetes.io/cluster=nebula
              app.kubernetes.io/component=graphd
              app.kubernetes.io/managed-by=nebula-operator
              app.kubernetes.io/name=nebula-graph
              controller-revision-hash=nebula-graphd-ffddc8f75
              statefulset.kubernetes.io/pod-name=nebula-graphd-0
Annotations:  nebula-graph.io/cm-hash: 94ea457be88fae25
Status:       Running
IP:           172.20.4.100
IPs:
  IP:           172.20.4.100
Controlled By:  StatefulSet/nebula-graphd
Containers:
  graphd:
    Container ID:  docker://53a5b9d1474c1790f2b85557d28efd40223ea58a15f07be4f8d9ea536b7cf849
    Image:         vesoft/nebula-graphd:v2.0.0
    Image ID:      docker-pullable://vesoft/nebula-graphd@sha256:9033aa72f0ec1d8c0a7aaf3dc3db6b9089dcfde7257487f2ba9f4dacda135f52
    Ports:         9669/TCP, 19669/TCP, 19670/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Command:
      /bin/bash
      -ecx
      exec /usr/local/nebula/bin/nebula-graphd --flagfile=/usr/local/nebula/etc/nebula-graphd.conf --meta_server_addrs=nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local:9559,nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local:9559,nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local:9559 --local_ip=$(hostname).nebula-graphd-svc.nebula.svc.cluster.local --ws_ip=$(hostname).nebula-graphd-svc.nebula.svc.cluster.local --minloglevel=1 --v=0 --daemonize=false
    State:          Running
      Started:      Thu, 15 Jul 2021 15:50:02 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:        500m
      memory:     500Mi
    Readiness:    http-get http://:19669/status delay=20s timeout=5s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /usr/local/nebula/etc from nebula-graphd (rw)
      /usr/local/nebula/logs from graphd (rw,path="logs")
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-r6pqt (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  graphd:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  graphd-nebula-graphd-0
    ReadOnly:   false
  nebula-graphd:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      nebula-graphd
    Optional:  false
  default-token-r6pqt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-r6pqt
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  42m                   default-scheduler  running "VolumeBinding" filter plugin for pod "nebula-graphd-0": pod has unbound immediate PersistentVolumeClaims
  Warning  FailedScheduling  42m                   default-scheduler  running "VolumeBinding" filter plugin for pod "nebula-graphd-0": pod has unbound immediate PersistentVolumeClaims
  Normal   Scheduled         42m                   default-scheduler  Successfully assigned nebula/nebula-graphd-0 to 172.16.77.189
  Normal   Pulled            42m                   kubelet            Container image "vesoft/nebula-graphd:v2.0.0" already present on machine
  Normal   Created           42m                   kubelet            Created container graphd
  Normal   Started           42m                   kubelet            Started container graphd
  Warning  Unhealthy         2m6s (x239 over 41m)  kubelet            Readiness probe failed: Get http://172.20.4.100:19669/status: dial tcp 172.20.4.100:19669: connect: connection refused

Now it's fairly certain the problem is graphd failing to start. You can kubectl exec into the container and check the logs directory; graphd's startup logs are saved there.
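For example (a sketch; the exact log file names are an assumption based on glog conventions):

kubectl exec -it nebula-graphd-0 -n nebula -- /bin/bash
# inside the container:
ls /usr/local/nebula/logs
tail -n 100 /usr/local/nebula/logs/nebula-graphd.ERROR   # file name assumed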

terminate called after throwing an instance of 'std::system_error'
  what():  Failed to resolve address for 'nebula-graphd-0.nebula-graphd-svc.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
*** Aborted at 1626335403 (unix time) try "date -d @1626335403" if you are using GNU date ***
PC: @     0x7f5c679e5387 __GI_raise
*** SIGABRT (@0x1) received by PID 1 (TID 0x7f5c688d28c0) from PID 1; stack trace: ***
    @          0x1e5f9c1 (unknown)
    @     0x7f5c67d8c62f (unknown)
    @     0x7f5c679e5387 __GI_raise
    @     0x7f5c679e6a77 __GI_abort
    @          0x107f647 _ZN9__gnu_cxx27__verbose_terminate_handlerEv.cold
    @          0x2219b85 __cxxabiv1::__terminate()
    @          0x2219bd0 std::terminate()
    @          0x2219d03 __cxa_throw
    @          0x1063e8b (unknown)
    @          0x1d12292 folly::SocketAddress::getAddrInfo()
    @          0x1d122b3 folly::SocketAddress::setFromHostPort()
    @          0x19fe77e nebula::WebService::start()
    @          0x1080872 main
    @     0x7f5c679d1554 __libc_start_main
    @          0x1096b4d (unknown)

These are the logs; it cannot resolve the internal domain name.
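One way to reproduce the lookup outside the graphd binary (a sketch, assuming getent is available in the image):

# goes through resolv.conf search/ndots handling:
kubectl exec -n nebula nebula-graphd-0 -- getent hosts nebula-graphd-0.nebula-graphd-svc.nebula.svc.cluster.local
# the trailing dot makes the name absolute and bypasses the search list:
kubectl exec -n nebula nebula-graphd-0 -- getent hosts nebula-graphd-0.nebula-graphd-svc.nebula.svc.cluster.local.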

I think I've roughly found the cause. The DNS in my cluster is configured as search nebula.svc.cluster.local. svc.cluster.local. cluster.local., with a trailing dot on each entry, while nebula-graph by default resolves names without the trailing dot. That should be what causes this. Is there any way to configure for this situation in advance?
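You can inspect the search list a pod actually receives like this (on a default cluster these entries typically carry no trailing dots):

kubectl exec -n nebula nebula-graphd-0 -- cat /etc/resolv.conf
# typical default output:
#   nameserver <cluster-dns-ip>
#   search nebula.svc.cluster.local svc.cluster.local cluster.local
#   options ndots:5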

Hi, the pod domain names are currently concatenated by the operator, so the only thing you can change is the external DNS. I think this may be some other network issue, because storaged and metad also communicate internally via domain names; if graphd were restricted by DNS, the other two components should be restricted as well. You can run kubectl get ep nebula-graphd-svc -n nebula to see whether the endpoint counts are consistent.
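For example (nebula-metad-headless comes from the meta_server_addrs flag above; the storaged service name is an assumption following the same naming convention):

kubectl get ep nebula-graphd-svc -n nebula
kubectl get ep nebula-metad-headless -n nebula
kubectl get ep nebula-storaged-headless -n nebula   # name assumed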

    spec:
      containers:
      - command:
        - /bin/bash
        - -ecx
        - exec /usr/local/nebula/bin/nebula-graphd --flagfile=/usr/local/nebula/etc/nebula-graphd.conf
          --meta_server_addrs=nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local.:9559,nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local.:9559,nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local.:9559
          --local_ip=$(hostname).nebula-graphd-svc.nebula.svc.cluster.local. --ws_ip=$(hostname).nebula-graphd-svc.nebula.svc.cluster.local.
          --minloglevel=1 --v=0 --daemonize=false
        image: vesoft/nebula-graphd:v2.0.0

I changed the names to the trailing-dot form above and graphd was then able to start. I plan to change all of these components the same way. When I have time, I'll try the setup again on a cluster with default DNS.
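If you patch the components by hand, note that the operator owns these StatefulSets, so its reconcile loop may revert manual edits and the change may need re-applying; a hedged sketch (the metad/storaged names are assumed to follow the graphd one):

kubectl edit statefulset nebula-graphd -n nebula
kubectl edit statefulset nebula-metad -n nebula
kubectl edit statefulset nebula-storaged -n nebula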

