nebula-metad fails to start

  • NebulaGraph version: v3.2.0
  • Deployment: distributed
  • Installation method: k8s
  • Production deployment: No
  • Hardware
    • Disk: mounted PVC, SAS
    • CPU / memory: 4C16G
  • Problem description
    I converted the official docker-compose file to k8s YAML with kompose and deployed it manually on a k8s cluster. Starting with metad: the first metad reports errors right at startup, but the process usually stays alive and does not exit (sometimes it exits immediately). When the second metad node starts, its process exits right away and cannot start at all.
  • k8s YAML

    kind: Service
    apiVersion: v1
    metadata:
      name: nebula-metad
      namespace: scas
      labels:
        app: nebula-metad
    spec:
      ports:
        - name: '9559'
          protocol: TCP
          port: 9559
          targetPort: 9559
        - name: '19559'
          protocol: TCP
          port: 19559
          targetPort: 19559
        - name: '19560'
          protocol: TCP
          port: 19560
          targetPort: 19560
      selector:
        app: nebula-metad
      clusterIP: None
      type: ClusterIP
      sessionAffinity: None
    ---
    kind: StatefulSet
    apiVersion: apps/v1
    metadata:
      name: nebula-metad
      namespace: scas
      labels:
        app: nebula-metad
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: nebula-metad
      template:
        metadata:
          creationTimestamp: null
          labels:
            app: nebula-metad
        spec:
          volumes:
            - name: log-dir
              flexVolume:
                driver: sgt.shareit.com/hostpathperpod
                options:
                  hostPath: /data/logs
          containers:
            - name: nebula-metad
              image: 'vesoft/nebula-metad:v3.2.0'
              command:
                - /bin/sh
                - '-c'
                - 'exec /usr/local/nebula/bin/nebula-metad --flagfile=/usr/local/nebula/etc/nebula-metad.conf --meta_server_addrs=nebula-metad-0.nebula-metad:9559,nebula-metad-1.nebula-metad:9559,nebula-metad-2.nebula-metad:9559 --local_ip=$HOSTNAME.nebula-metad --ws_ip=$HOSTNAME.nebula-metad --data_path=/data/meta --log_dir=/data/logs --v=4 --minloglevel=0 --port=9559 --ws_http_port=19559 --daemonize=false'
              ports:
                - name: thrift
                  containerPort: 9559
                  protocol: TCP
                - name: http
                  containerPort: 19559
                  protocol: TCP
                - name: http2
                  containerPort: 19560
                  protocol: TCP
              env:
                - name: TZ
                  value: UTC
                - name: USER
                  value: root
              resources:
                limits:
                  cpu: '4'
                  memory: 16Gi
                requests:
                  cpu: 100m
                  memory: 300Mi
              volumeMounts:
                - name: nebula-metad
                  mountPath: /data/meta
                - name: log-dir
                  mountPath: /data/logs
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
              imagePullPolicy: IfNotPresent
              securityContext:
                capabilities:
                  add:
                    - SYS_PTRACE
          restartPolicy: Always
          terminationGracePeriodSeconds: 30
          dnsPolicy: ClusterFirst
          securityContext: {}
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: app
                        operator: In
                        values:
                          - nebula-metad
                  topologyKey: kubernetes.io/hostname
          schedulerName: default-scheduler
      volumeClaimTemplates:
        - kind: PersistentVolumeClaim
          apiVersion: v1
          metadata:
            name: nebula-metad
            namespace: scas
            creationTimestamp: null
            annotations:
              everest.io/disk-volume-type: SAS
              volume.beta.kubernetes.io/cce-storage-additional-resource-tags: 'env=test,group=SCAS,project=nebula-metad'
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 1Gi
            storageClassName: csi-disk-topology
            volumeMode: Filesystem
          status:
            phase: Pending
      serviceName: nebula-metad
      podManagementPolicy: OrderedReady
      updateStrategy:
        type: RollingUpdate
        rollingUpdate:
          partition: 0
      revisionHistoryLimit: 10
    
  • metad0 stderr log:
    metad0.log (116.8 KB)

  • metad1 stderr log:
    metad1.log (9.1 KB)

E20220908 16:50:32.298689    62 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-2.nebula-metad': Name or service not known (error=-2): Unknown error -2

The hostnames the services use to reach each other under Docker Compose differ from the ones you get after the kompose conversion to k8s. Update meta_server_addrs to the actual DNS names of the meta pods in your k8s cluster, then go over the rest of this configuration and make sure it is self-consistent.
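For illustration, a sketch of what the corrected container command could look like, using the pods' fully qualified stable DNS names (`<pod>.<service>.<namespace>.svc.cluster.local`). This assumes the namespace (`scas`) and headless Service (`nebula-metad`) from the manifest above; note also that if meta_server_addrs lists three metads, the StatefulSet should have `replicas: 3` rather than 2:

```yaml
# Sketch only: metad command with fully qualified per-pod DNS names.
# Assumes namespace "scas" and headless Service "nebula-metad" as above.
command:
  - /bin/sh
  - '-c'
  - >-
    exec /usr/local/nebula/bin/nebula-metad
    --flagfile=/usr/local/nebula/etc/nebula-metad.conf
    --meta_server_addrs=nebula-metad-0.nebula-metad.scas.svc.cluster.local:9559,nebula-metad-1.nebula-metad.scas.svc.cluster.local:9559,nebula-metad-2.nebula-metad.scas.svc.cluster.local:9559
    --local_ip=$HOSTNAME.nebula-metad.scas.svc.cluster.local
    --ws_ip=$HOSTNAME.nebula-metad.scas.svc.cluster.local
    --data_path=/data/meta --log_dir=/data/logs
    --port=9559 --ws_http_port=19559 --daemonize=false
```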

Also, NebulaGraph has a k8s operator, so you don't necessarily need to take the detour through kompose. That said, if you do get the kompose route working, you're welcome to come back and share a write-up.

I did create a headless service, and meta_server_addrs uses the service addresses. The "nebula-metad-2.nebula-metad Name or service not known" error appears because the first two metad pods already fail at startup, so the third metad pod never gets started.

Intra-cluster communication needs each peer's individual pod address. You can spin up a NebulaGraph instance with nebula-operator and look at the resulting pods' YAML for reference.

The root cause: when a metad registers itself, it is still in a not-ready state, so its DNS record has not been published yet and its peers get "host not found". Adding publishNotReadyAddresses: true to the metad Service fixes it.
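A sketch of the headless Service with that field set, based on the Service from the manifest above (only the thrift port is shown for brevity). With `publishNotReadyAddresses: true`, the pods' DNS records are published before they pass readiness, so the metad peers can resolve each other during initial startup:

```yaml
# Sketch: headless Service for metad with not-ready addresses published,
# so peer DNS resolves before the pods become Ready.
kind: Service
apiVersion: v1
metadata:
  name: nebula-metad
  namespace: scas
  labels:
    app: nebula-metad
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  selector:
    app: nebula-metad
  ports:
    - name: thrift
      protocol: TCP
      port: 9559
      targetPort: 9559
```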

