nebula-metad fails to start

  • NebulaGraph version: v3.2.0
  • Deployment: distributed
  • Installation method: k8s
  • Production deployment: No
  • Hardware
    • Disk: mounted PVC, SAS
    • CPU / memory: 4C16G
  • Problem description
    I converted the official docker-compose file to k8s YAML with kompose and deployed it manually on a k8s cluster. Starting with metad: the first metad reports errors right at startup, but the process usually stays alive and does not exit (sometimes it exits immediately). When the second metad node starts, its process exits right away and cannot start at all.
  • k8s YAML

    kind: Service
    apiVersion: v1
    metadata:
      name: nebula-metad
      namespace: scas
      labels:
        app: nebula-metad
    spec:
      ports:
        - name: '9559'
          protocol: TCP
          port: 9559
          targetPort: 9559
        - name: '19559'
          protocol: TCP
          port: 19559
          targetPort: 19559
        - name: '19560'
          protocol: TCP
          port: 19560
          targetPort: 19560
      selector:
        app: nebula-metad
      clusterIP: None
      type: ClusterIP
      sessionAffinity: None
    ---
    kind: StatefulSet
    apiVersion: apps/v1
    metadata:
      name: nebula-metad
      namespace: scas
      labels:
        app: nebula-metad
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: nebula-metad
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: nebula-metad
        spec:
          volumes:
            - name: log-dir
              flexVolume:
                driver: sgt.shareit.com/hostpathperpod
                options:
                  hostPath: /data/logs
          containers:
            - name: nebula-metad
              image: 'vesoft/nebula-metad:v3.2.0'
              command:
                - /bin/sh
                - '-c'
                - 'exec /usr/local/nebula/bin/nebula-metad --flagfile=/usr/local/nebula/etc/nebula-metad.conf --meta_server_addrs=nebula-metad-0.nebula-metad:9559,nebula-metad-1.nebula-metad:9559,nebula-metad-2.nebula-metad:9559 --local_ip=$HOSTNAME.nebula-metad --ws_ip=$HOSTNAME.nebula-metad --data_path=/data/meta --log_dir=/data/logs --v=4 --minloglevel=0 --port=9559 --ws_http_port=19559 --daemonize=false'
              ports:
                - name: thrift
                  containerPort: 9559
                  protocol: TCP
                - name: http
                  containerPort: 19559
                  protocol: TCP
                - name: http2
                  containerPort: 19560
                  protocol: TCP
              env:
                - name: TZ
                  value: UTC
                - name: USER
                  value: root
              resources:
                limits:
                  cpu: '4'
                  memory: 16Gi
                requests:
                  cpu: 100m
                  memory: 300Mi
              volumeMounts:
                - name: nebula-metad
                  mountPath: /data/meta
                - name: log-dir
                  mountPath: /data/logs
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              imagePullPolicy: IfNotPresent
              securityContext:
                capabilities:
                  add:
                    - SYS_PTRACE
          restartPolicy: Always
          terminationGracePeriodSeconds: 30
          dnsPolicy: ClusterFirst
          securityContext: {}
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: app
                        operator: In
                        values:
                          - nebula-metad
                  topologyKey: kubernetes.io/hostname
          schedulerName: default-scheduler
      volumeClaimTemplates:
        - kind: PersistentVolumeClaim
          apiVersion: v1
          metadata:
            name: nebula-metad
            namespace: scas
            creationTimestamp: null
            annotations:
              everest.io/disk-volume-type: SAS
              volume.beta.kubernetes.io/cce-storage-additional-resource-tags: 'env=test,group=SCAS,project=nebula-metad'
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 1Gi
            storageClassName: csi-disk-topology
            volumeMode: Filesystem
          status:
            phase: Pending
      serviceName: nebula-metad
      podManagementPolicy: OrderedReady
      updateStrategy:
        type: RollingUpdate
        rollingUpdate:
          partition: 0
      revisionHistoryLimit: 10
    
  • metad0 stderr log:
    metad0.log (116.8 KB)

  • metad1 stderr log:
    metad1.log (9.1 KB)

E20220908 16:50:32.298689    62 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-2.nebula-metad': Name or service not known (error=-2): Unknown error -2

The hostnames the services use to reach each other under Docker Compose differ from the ones you get after the kompose conversion to k8s. Update meta_server_addrs to the actual DNS names of the meta pods in your k8s cluster, then go over the rest of this configuration and make sure it is self-consistent.
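For illustration, a sketch of what the corrected container command could look like, using the pods' fully qualified stable DNS names (`<pod>.<service>.<namespace>.svc.cluster.local`). This assumes the namespace (`scas`) and headless Service (`nebula-metad`) from the manifest above; note also that if meta_server_addrs lists three metads, the StatefulSet should have `replicas: 3` rather than 2:

```yaml
# Sketch only: metad command with fully qualified per-pod DNS names.
# Assumes namespace "scas" and headless Service "nebula-metad" as above.
command:
  - /bin/sh
  - '-c'
  - >-
    exec /usr/local/nebula/bin/nebula-metad
    --flagfile=/usr/local/nebula/etc/nebula-metad.conf
    --meta_server_addrs=nebula-metad-0.nebula-metad.scas.svc.cluster.local:9559,nebula-metad-1.nebula-metad.scas.svc.cluster.local:9559,nebula-metad-2.nebula-metad.scas.svc.cluster.local:9559
    --local_ip=$HOSTNAME.nebula-metad.scas.svc.cluster.local
    --ws_ip=$HOSTNAME.nebula-metad.scas.svc.cluster.local
    --data_path=/data/meta --log_dir=/data/logs
    --port=9559 --ws_http_port=19559 --daemonize=false
```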

Also, NebulaGraph has a k8s operator, so you don't necessarily need to take the detour through kompose. That said, if you do get the kompose route working, you're welcome to come back and share a write-up.

I did create a headless service, and meta_server_addrs uses the service addresses. The "nebula-metad-2.nebula-metad Name or service not known" error appears because the first two metad pods already fail at startup, so the third metad pod never gets started.

Intra-cluster communication needs each peer's individual pod address. You can spin up a NebulaGraph instance with nebula-operator and look at the resulting pods' YAML for reference.

The root cause: when a metad registers itself, it is still in a not-ready state, so its DNS record has not been published yet and its peers get "host not found". Adding publishNotReadyAddresses: true to the metad Service fixes it.
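A sketch of the headless Service with that field set, based on the Service from the manifest above (only the thrift port is shown for brevity). With `publishNotReadyAddresses: true`, the pods' DNS records are published before they pass readiness, so the metad peers can resolve each other during initial startup:

```yaml
# Sketch: headless Service for metad with not-ready addresses published,
# so peer DNS resolves before the pods become Ready.
kind: Service
apiVersion: v1
metadata:
  name: nebula-metad
  namespace: scas
  labels:
    app: nebula-metad
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  selector:
    app: nebula-metad
  ports:
    - name: thrift
      protocol: TCP
      port: 9559
      targetPort: 9559
```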

