Health check fails when deploying storaged with a plain YAML file in a k8s cluster

  • nebula version: 3.6.0

  • Deployment method: manual k8s YAML

  • Problem description and related information

For particular requirements, I deployed storaged using plain YAML files.

The storaged YAML file is as follows:

apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/cluster: nebula
    app.kubernetes.io/component: storaged
    app.kubernetes.io/managed-by: nebula-operator
    app.kubernetes.io/name: nebula-graph
  name: nebula-storaged-headless
  namespace: nebula-latest
spec:
  clusterIP: None
  ports:
  - name: thrift
    port: 9779
    protocol: TCP
    targetPort: 9779
  - name: http
    port: 19779
    protocol: TCP
    targetPort: 19779
  - name: admin
    port: 9778
    protocol: TCP
    targetPort: 9778
  publishNotReadyAddresses: true
  selector:
    app.kubernetes.io/cluster: nebula
    app.kubernetes.io/component: storaged
    app.kubernetes.io/managed-by: nebula-operator
    app.kubernetes.io/name: nebula-graph
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  generation: 1
  labels:
    app.kubernetes.io/cluster: nebula
    app.kubernetes.io/component: storaged
    app.kubernetes.io/managed-by: nebula-operator
    app.kubernetes.io/name: nebula-graph
  name: nebula-storaged
  namespace: nebula-latest
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app.kubernetes.io/cluster: nebula
      app.kubernetes.io/component: storaged
      app.kubernetes.io/managed-by: nebula-operator
      app.kubernetes.io/name: nebula-graph
  serviceName: nebula-storaged-headless
  template:
    metadata:
      creationTimestamp: null
      labels:
        app.kubernetes.io/cluster: nebula
        app.kubernetes.io/component: storaged
        app.kubernetes.io/managed-by: nebula-operator
        app.kubernetes.io/name: nebula-graph
    spec:
      containers:
      - command:
        - /bin/sh
        - -ecx
        - exec /usr/local/nebula/bin/nebula-storaged --flagfile=/usr/local/nebula/etc/nebula-storaged.conf
          --meta_server_addrs=nebula-metad-0.nebula-metad-headless:9559
          --local_ip=$(hostname).nebula-storaged-headless
          --ws_ip=$(hostname).nebula-storaged-headless --daemonize=false
        image: vesoft/nebula-storaged:v3.6.0
        imagePullPolicy: Always
        name: storaged
        ports:
        - containerPort: 9779
          name: thrift
          protocol: TCP
        - containerPort: 19779
          name: http
          protocol: TCP
        - containerPort: 9778
          name: admin
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /status
            port: 19779
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
          requests:
            cpu: 500m
            memory: 500Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/local/nebula/data
          name: storaged-claim0
          subPath: data
        - mountPath: /usr/local/nebula/logs
          name: storaged-claim1
          subPath: logs
      - command:
        - /bin/sh
        - -ecx
        - sh /logrotate.sh; exec cron -f
        env:
        - name: LOGROTATE_ROTATE
          value: "5"
        - name: LOGROTATE_SIZE
          value: 100M
        image: vesoft/nebula-agent:latest
        imagePullPolicy: Always
        name: ng-agent
        resources:
          limits:
            cpu: 200m
            memory: 256Mi
          requests:
            cpu: 100m
            memory: 128Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/local/nebula/logs
          name: storaged-claim1 
          subPath: logs
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/cluster: nebula
            app.kubernetes.io/component: storaged
            app.kubernetes.io/managed-by: nebula-operator
            app.kubernetes.io/name: nebula-graph
        maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
  volumeClaimTemplates:
  - metadata:
      name: storaged-claim0
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: standard-nfs-storage
      resources:
        requests:
          storage: 100Mi
  - metadata:
      name: storaged-claim1
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: standard-nfs-storage
      resources:
        requests:
          storage: 100Mi

The result of kubectl apply is that the pod's health check fails:

Connecting manually also fails:
[screenshot]
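When a readiness probe fails like this, the following checks usually narrow things down (a sketch; pod and namespace names are taken from the manifests above):

```shell
# probe failure events and the exact failure reason
kubectl describe pod nebula-storaged-0 -n nebula-latest

# anything the container wrote to stdout/stderr
kubectl logs nebula-storaged-0 -c storaged -n nebula-latest

# hit the probe endpoint from inside the container
kubectl exec -n nebula-latest nebula-storaged-0 -c storaged -- \
  curl -s http://127.0.0.1:19779/status
```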

Relevant logs:
nebula-storaged.ERROR
[screenshot]

nebula-storaged.WARNING

nebula-storaged.INFO

Hi, you could try the operator's YAML plus a NebulaCluster YAML; the operator does a lot of work internally. This error, for example, means the service started but ADD HOSTS was never executed.

Could you point me a bit further? On which service do I need to execute ADD HOSTS?
It shouldn't be done through nebula-console, should it?

I took the Service and StatefulSet YAML from a successfully deployed operator setup and applied it after some slight modifications.

The metad YAML is as follows:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: nebula-metad-headless
  name: nebula-metad-headless
  namespace: nebula-latest
spec:
  clusterIP: None
  ports:
  - name: thrift
    port: 9559
    protocol: TCP
    targetPort: 9559
  - name: http
    port: 19559
    protocol: TCP
    targetPort: 19559
  publishNotReadyAddresses: true
  selector:
    app: nebula-metad-headless
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: nebula-metad-headless
  name: nebula-metad
  namespace: nebula-latest
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nebula-metad-headless
  serviceName: nebula-metad-headless
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nebula-metad-headless
    spec:
      containers:
      - command:
        - /bin/sh
        - -ecx
        - exec /usr/local/nebula/bin/nebula-metad --flagfile=/usr/local/nebula/etc/nebula-metad.conf
          --meta_server_addrs=nebula-metad-0.nebula-metad-headless:9559
          --local_ip=$(hostname).nebula-metad-headless --ws_ip=$(hostname).nebula-metad-headless
          --daemonize=false
        image: vesoft/nebula-metad:v3.6.0
        imagePullPolicy: IfNotPresent
        name: metad
        ports:
        - containerPort: 9559
          name: thrift
          protocol: TCP
        - containerPort: 19559
          name: http
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /status
            port: 19559
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: "1"
            memory: 1Gi
          requests:
            cpu: 500m
            memory: 500Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/local/nebula/data
          name: metad-claim0
          subPath: data
        - mountPath: /usr/local/nebula/logs
          name: metad-claim1
          subPath: logs
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
  volumeClaimTemplates:
  - metadata:
      name: metad-claim0
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: standard-nfs-storage
      resources:
        requests:
          storage: 100Mi
  - metadata:
      name: metad-claim1
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: standard-nfs-storage
      resources:
        requests:
          storage: 100Mi

The storaged YAML:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: nebula-storaged-headless
  name: nebula-storaged-headless
  namespace: nebula-latest
spec:
  clusterIP: None
  ports:
  - name: storaged-thrift
    port: 9779
    protocol: TCP
    targetPort: 9779
  - name: storaged-http
    port: 19779
    protocol: TCP
    targetPort: 19779
  publishNotReadyAddresses: true
  selector:
    app: nebula-storaged-headless
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: nebula-storaged-headless
  name: nebula-storaged
  namespace: nebula-latest
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nebula-storaged-headless
  serviceName: nebula-storaged-headless
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nebula-storaged-headless
    spec:
      containers:
      - command:
        - /bin/sh
        - -ecx
        - exec /usr/local/nebula/bin/nebula-storaged --flagfile=/usr/local/nebula/etc/nebula-storaged.conf
          --meta_server_addrs=nebula-metad-0.nebula-metad-headless:9559
          --local_ip=$(hostname).nebula-storaged-headless
          --ws_ip=$(hostname).nebula-storaged-headless --daemonize=false
        image: vesoft/nebula-storaged:v3.6.0
        imagePullPolicy: IfNotPresent
        name: storaged
        ports:
        - containerPort: 9779
          name: storaged-thrift
          protocol: TCP
        - containerPort: 19779
          name: storaged-http
          protocol: TCP
        - containerPort: 9778
          name: storaged-admin
          protocol: TCP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /status
            port: 19779
            scheme: HTTP
          initialDelaySeconds: 10
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 5
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
          requests:
            cpu: 500m
            memory: 500Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /usr/local/nebula/data
          name: storaged-claim0
          subPath: data
        - mountPath: /usr/local/nebula/logs
          name: storaged-claim1
          subPath: logs
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
  volumeClaimTemplates:
  - metadata:
      name: storaged-claim0
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: standard-nfs-storage
      resources:
        requests:
          storage: 100Mi
  - metadata:
      name: storaged-claim1
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: standard-nfs-storage
      resources:
        requests:
          storage: 100Mi  

Hi, you can use nebula-console to connect to graphd and then execute an ADD HOSTS statement to register the storaged node with meta. See the docs: Manage Storage hosts - NebulaGraph Database manual.
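A minimal console session for that (a sketch; the storaged address below is derived from the pod and headless-service names in the manifests above, and the key point is that it must match what storaged itself is configured with):

```ngql
// run from nebula-console after connecting to graphd
ADD HOSTS "nebula-storaged-0.nebula-storaged-headless":9779;
SHOW HOSTS;
```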

I've tried that; it doesn't work.
Because the storaged pod never comes up, after ADD HOSTS the SHOW HOSTS result stays OFFLINE,
and the pod still won't start.

After ADD HOSTS, what does storaged's own log say?

storaged's logs after ADD HOSTS:

nebula-storaged.ERROR

nebula-storaged.WARNING

nebula-storaged.INFO

metad's nebula-metad.INFO

Later, from the entries in nebula-metad.INFO, I figured out what the problem probably was. So I ran:
ADD HOSTS "nebula-storaged-0.nebula-storaged-headless":9779
and the host actually turned ONLINE.
What I'm curious about, though: why does ADD HOSTS "nebula-storaged-headless":9779 just stay OFFLINE?

I then tried a different approach, and strangely storaged still would not come up.
The deployment YAML is as follows:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: nebula-metad-headless
  name: nebula-metad-headless
  namespace: nebula-latest
spec:
  clusterIP: None
  ports:
    - name: thrift
      port: 9559
      protocol: TCP
      targetPort: 9559
    - name: http
      port: 19559
      protocol: TCP
      targetPort: 19559

    - name: graphd-thrift
      port: 9669
      protocol: TCP
      targetPort: 9669
    - name: graphd-http
      port: 19669
      protocol: TCP
      targetPort: 19669

    - name: storaged-thrift
      port: 9779
      protocol: TCP
      targetPort: 9779
    - name: storaged-http
      port: 19779
      protocol: TCP
      targetPort: 19779
    - name: storaged-admin
      port: 9778
      protocol: TCP
      targetPort: 9778

  publishNotReadyAddresses: true
  selector:
    app: nebula-metad-headless
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: nebula-metad-headless
  name: nebula-metad
  namespace: nebula-latest
spec:
  podManagementPolicy: Parallel
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nebula-metad-headless
  serviceName: nebula-metad-headless
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nebula-metad-headless
    spec:
      containers:
        - command:
            - /bin/sh
            - -ecx
            - exec /usr/local/nebula/bin/nebula-metad --flagfile=/usr/local/nebula/etc/nebula-metad.conf
              --meta_server_addrs=nebula-metad-0.nebula-metad-headless:9559
              --local_ip=$(hostname).nebula-metad-headless --ws_ip=$(hostname).nebula-metad-headless
              --daemonize=false
          image: vesoft/nebula-metad:v3.6.0
          imagePullPolicy: IfNotPresent
          name: metad
          ports:
            - containerPort: 9559
              name: thrift
              protocol: TCP
            - containerPort: 19559
              name: http
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /status
              port: 19559
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          resources:
            limits:
              cpu: "1"
              memory: 1Gi
            requests:
              cpu: 500m
              memory: 500Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /usr/local/nebula/data
              name: metad-claim0
              subPath: data
            - mountPath: /usr/local/nebula/logs
              name: metad-claim1
              subPath: logs
        - command:
          - /bin/sh
          - -ecx
          - exec /usr/local/nebula/bin/nebula-graphd --flagfile=/usr/local/nebula/etc/nebula-graphd.conf
            --meta_server_addrs=nebula-metad-0.nebula-metad-headless:9559
            --local_ip=nebula-metad-0.nebula-metad-headless.nebula-latest.svc.cluster.local --ws_ip=nebula-metad-0.nebula-metad-headless.nebula-latest.svc.cluster.local
            --daemonize=false
          image: vesoft/nebula-graphd:v3.6.0
          imagePullPolicy: IfNotPresent
          name: graphd
          ports:
            - containerPort: 9669
              name: graphd-thrift
              protocol: TCP
            - containerPort: 19669
              name: graphd-http
              protocol: TCP
          readinessProbe:
            failureThreshold: 3
            httpGet:
              path: /status
              port: 19669
              scheme: HTTP
            initialDelaySeconds: 10
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 5
          resources:
            limits:
              cpu: "2"
              memory: 2Gi
            requests:
              cpu: 500m
              memory: 500Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /usr/local/nebula/logs
              name: graphd-claim0
              subPath: logs
        - command:
            - /bin/sh
            - -ecx
            - exec /usr/local/nebula/bin/nebula-storaged --flagfile=/usr/local/nebula/etc/nebula-storaged.conf
              --meta_server_addrs=nebula-metad-0.nebula-metad-headless:9559
              --local_ip=nebula-metad-0.nebula-metad-headless
              --ws_ip=nebula-metad-0.nebula-metad-headless
          image: vesoft/nebula-storaged:v3.6.0
          imagePullPolicy: IfNotPresent
          name: storaged
          ports:
            - containerPort: 9779
              name: storaged-thrift
              protocol: TCP
            - containerPort: 19779
              name: storaged-http
              protocol: TCP
            - containerPort: 9778
              name: storaged-admin
              protocol: TCP
          resources:
            limits:
              cpu: "2"
              memory: 2Gi
            requests:
              cpu: 500m
              memory: 500Mi
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /usr/local/nebula/data
              name: storaged-claim0
              subPath: data
            - mountPath: /usr/local/nebula/logs
              name: storaged-claim1
              subPath: logs

      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
  volumeClaimTemplates:
    - metadata:
        name: metad-claim0
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: standard-nfs-storage
        resources:
          requests:
            storage: 100Mi
    - metadata:
        name: metad-claim1
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: standard-nfs-storage
        resources:
          requests:
            storage: 100Mi
    - metadata:
        name: graphd-claim0
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: standard-nfs-storage
        resources:
          requests:
            storage: 100Mi

    - metadata:
        name: storaged-claim0
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: standard-nfs-storage
        resources:
          requests:
            storage: 100Mi
    - metadata:
        name: storaged-claim1
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: standard-nfs-storage
        resources:
          requests:
            storage: 100Mi

Below are some symptoms of the problem:

Also, storaged's logs are all empty:
[screenshot]

metad has no storaged-related log entries either.

The host I added stays OFFLINE the whole time.

Hi, it looks like there's a problem with the ADD HOSTS statement. The storaged domain should be nebula-storaged-xxx; the screenshot shows metad's domain being used.

Because storaged reports itself with the domain nebula-storaged-0.nebula-storaged-headless while meta has nebula-storaged-headless on record, the two don't match.

So it looks like I got it wrong.
Do I need to adjust the command in the manifest?

I'm stuck here now; any help appreciated.
I've found one issue: storaged cannot be started in the same pod together with graphd or metad, otherwise storaged fails to start.

Right, you can DROP HOSTS first and then re-ADD the correct address.
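Sketched as console statements, with the two addresses taken from this thread:

```ngql
// remove the address that never comes online
DROP HOSTS "nebula-storaged-headless":9779;
// re-add the address storaged actually reports
ADD HOSTS "nebula-storaged-0.nebula-storaged-headless":9779;
SHOW HOSTS;
```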


I'd still recommend using nebula-operator to deploy a NebulaCluster; it should save you a lot of pitfalls.

Because the service-discovery mechanism here works like this:

  • each service's own service id is its own address from its own config
  • whatever storaged writes in its config is exactly the id that represents that service; when it talks to meta, it only ever says "I am <the configured address>"
  • during ADD HOSTS, meta compares against the addresses that the services actively report

So even when several IPs / domain names all resolve and are reachable, only the configured address can be used for ADD HOSTS.
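In other words, the flag value and the ADD HOSTS argument have to be the same string, character for character. With the manifests above:

```ngql
// storaged is started with:
//   --local_ip=$(hostname).nebula-storaged-headless
// which expands to nebula-storaged-0.nebula-storaged-headless, so:
ADD HOSTS "nebula-storaged-0.nebula-storaged-headless":9779;  // matches -> ONLINE
// ADD HOSTS "nebula-storaged-headless":9779;  // resolves, but never matches -> OFFLINE
```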

One thing I don't quite understand: why can't storaged be put in the same container as graphd or metad (doing so keeps the pod from starting), while graphd and metad together are fine?

Technically speaking, other factors aside, running them in the same container is possible; the startup failure must have some other cause. You can configure the errors to be forwarded to stderr and see what the error actually is.
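One way to do that with the manifests above: NebulaGraph uses glog-style logging flags, so adding them to the container command surfaces errors on stderr, where kubectl logs can see them (a sketch; the two extra flags are assumptions based on the defaults shipped in nebula-storaged.conf, adjust as needed):

```yaml
- command:
    - /bin/sh
    - -ecx
    - exec /usr/local/nebula/bin/nebula-storaged --flagfile=/usr/local/nebula/etc/nebula-storaged.conf
      --meta_server_addrs=nebula-metad-0.nebula-metad-headless:9559
      --daemonize=false
      --stderrthreshold=0      # copy INFO and above to stderr
      --redirect_stdout=false  # keep stdout/stderr on the container console
```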

It may be a conflict over something (a port, a file); in principle you can always find a clever hack around it.
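One concrete candidate worth ruling out in the combined manifest above: unlike the metad and graphd commands, the storaged command does not pass --daemonize=false. The shipped config defaults to daemonizing, and a process that daemonizes detaches from the exec'd foreground, so the container's main process exits and k8s restarts the pod. A sketch of the same command with only the flag restored:

```yaml
- exec /usr/local/nebula/bin/nebula-storaged --flagfile=/usr/local/nebula/etc/nebula-storaged.conf
  --meta_server_addrs=nebula-metad-0.nebula-metad-headless:9559
  --local_ip=nebula-metad-0.nebula-metad-headless
  --ws_ip=nebula-metad-0.nebula-metad-headless
  --daemonize=false
```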

I don't know the reason behind this customization. Stateful workloads are already hard in k8s; it's not just about pulling things up on day 1, the maintainability afterwards matters a lot. The operator handles all these headaches and has a community maintaining it, so the benefits are many; it only introduces some control-plane overhead, which is entirely manageable.

If the reason is that using a storage provider is inconvenient, you could also hack local disks yourself; as I recall the operator is about to support local disks too. I really don't recommend going through all this trouble.

If you really don't want to use the operator, docker compose/swarm would still be better than this.


Have you tried single pod, multi container?