Deploying NebulaGraph with Helm fails

刘杰1 · April 12, 2023, 03:14

nebula version: not specified, so it should be the latest. nebula-operator is 1.4.0.
Deployment method: k8s
Installation method: k8s
Production environment: Y
Hardware information
- Disk (SSD recommended)
- CPU and memory information
Problem description
I deployed with the following command, and the deployment failed:
helm install ${NEBULA_CLUSTER_NAME} nebula-operator/nebula-cluster \
  --namespace ${NEBULA_CLUSTER_NAMESPACE} \
  --set nameOverride=${NEBULA_CLUSTER_NAME} \
  --set nebula.storageClassName="${STORAGE_CLASS_NAME}"

Error: failed to download "nebula-operator/nebula-cluster" (hint: running helm repo update may help)
Running helm repo update succeeded. Running the helm install command again returned the following error:
Error: unable to build kubernetes objects from release manifest: error validating "": error validating data: ValidationError(NebulaCluster.spec): unknown field "exporter" in io.nebula-graph.apps.v1alpha1.NebulaCluster.spec

kqzh · April 12, 2023, 03:33

Hi, it looks like the 1.4.0 Helm chart has some problems. Please try again with the latest operator, 1.4.2:
helm install nebula-operator nebula-operator/nebula-operator --namespace=nebula-operator-system --version=1.4.2

Before installing, you need to delete the old nebula-cluster CRD from the cluster.

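刘杰1 · April 12, 2023, 03:41

Hello, thank you very much for the reply. After switching to the new version, the command still fails:
Error: failed to download "nebula-operator/nebula-operator" at version "1.4.2" (hint: running helm repo update may help)
Running helm repo update succeeds, but helm install still reports the error above.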

刘杰1 · April 12, 2023, 03:42

How do I identify and delete the old nebula-cluster CRD in the cluster?
If you mean uninstalling the previous nebula-operator, I had already uninstalled it before running the new command.

kubectl get crds | grep nebula
nebulaclusters.apps.nebula-graph.io   2023-04-12T01:42:29Z
nebularestores.apps.nebula-graph.io   2023-04-12T01:42:29Z

kqzh · April 12, 2023, 03:47

You can run kubectl delete crd nebulaclusters.apps.nebula-graph.io. After that, reinstall the operator and it will create the latest CRD.
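
The steps kqzh describes (delete the stale CRD, then reinstall the operator at 1.4.2) could be sketched as a small script. This is only a sketch: the helper names run and reinstall_operator are hypothetical, and the script prints the commands by default rather than running them, so they can be reviewed first. It assumes the previous operator release has already been uninstalled. Any extra arguments are passed through to helm install unchanged.

```shell
#!/bin/sh
# Print each command instead of running it, unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# reinstall_operator VERSION [extra helm install flags...]
# (hypothetical helper, not part of nebula-operator)
reinstall_operator() {
  version="$1"
  shift
  # Caution: deleting a CRD also deletes every object of that kind,
  # so any existing NebulaCluster resources are removed with it.
  run kubectl delete crd nebulaclusters.apps.nebula-graph.io
  # Refresh the local chart index so the requested version can be found.
  run helm repo update
  # Installing the operator chart recreates the CRD at the chart's version.
  run helm install nebula-operator nebula-operator/nebula-operator \
    --namespace=nebula-operator-system --version="$version" "$@"
}
```

Calling reinstall_operator 1.4.2 prints the three commands; running it with DRY_RUN=0 set in the environment executes them against the current kubectl context.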

刘杰1 · April 12, 2023, 03:51

Failed to pull image "gcr.io/kubebuilder/kube-rbac-proxy:v0.8.0": rpc error: code = Unknown desc = Error response from daemon: Get https://gcr.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
This image cannot be pulled. Earlier I created the operator with the following command. Would that work?
helm install nebula-operator nebula-operator/nebula-operator --namespace=nebula-operator-system --version=1.4.2 \
  --set image.kubeRBACProxy.image=kubesphere/kube-rbac-proxy:v0.8.0 \
  --set image.kubeScheduler.image=kubesphere/kube-scheduler:v1.18.8

kqzh · April 12, 2023, 03:54

Yes, the operator's Helm chart supports overriding these parameters.

刘杰1 · April 12, 2023, 04:03

Hello, sorry to bother you again. The helm command succeeded, but all the pods are Pending, and nebula-graphd-0 was not created.
kubectl -n "${NEBULA_CLUSTER_NAMESPACE}" get pod -l "app.kubernetes.io/cluster=${NEBULA_CLUSTER_NAME}"
NAME                READY   STATUS    RESTARTS   AGE
nebula-metad-0      0/1     Pending   0          2m20s
nebula-metad-1      0/1     Pending   0          2m20s
nebula-metad-2      0/1     Pending   0          2m20s
nebula-storaged-0   0/1     Pending   0          2m20s
nebula-storaged-1   0/1     Pending   0          2m20s
nebula-storaged-2   0/1     Pending   0          2m19s

Here are the events from kubectl describe on a pod:
Warning FailedScheduling 65s default-scheduler 0/2 nodes are available: 2 pod has unbound immediate PersistentVolumeClaims.
Warning FailedScheduling 63s default-scheduler 0/2 nodes are available: 2 Insufficient cpu.

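kqzh · April 12, 2023, 05:38

For these two errors you need to check your environment: whether the PVC access mode and storage class are set correctly, and whether the nodes have enough CPU. You can adjust resources.requests and resources.limits.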

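刘杰1 · April 12, 2023, 05:43

Is the PVC access mode already specified in the Helm chart? The storage class uses Alibaba Cloud NAS, and the console shows that the PVCs are created successfully. I manually set the cluster replica count to 1 and changed the CPU limit to 500m and the CPU request to 300m. The pods are still Pending.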

刘杰1 · April 12, 2023, 05:46

Log from inside the pod:
{"code":"SERVER_ERROR_CODE","message":"Cannot invoke method getContent() on null object","requestId":"c7f5c5d0-d459-4641-af4f-03629ecbe4ad","successResponse":false}

刘杰1 · April 12, 2023, 05:52

My cluster has only two worker nodes. Could that be related?

kqzh · April 12, 2023, 05:54

You can run kubectl get sc to confirm the StorageClass name. When you installed the NebulaCluster, did you pass --set nebula.storageClassName=xx?

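刘杰1 · April 12, 2023, 06:03

Hello, the PVC part is no longer a problem. Now the only message is that CPU is insufficient. Changing the StatefulSet with kubectl edit does not take effect; the pods keep the preconfigured default (cpu.limit.1). How should I change the resource requests?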

kqzh · April 12, 2023, 06:08

You can run kubectl edit nc ${NEBULA_CLUSTER_NAME} -n ${NEBULA_CLUSTER_NAMESPACE}
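
As a non-interactive alternative to kubectl edit, the same change could be applied with a JSON merge patch against the NebulaCluster object. The sketch below only builds the patch body. It assumes the CPU settings live under spec.graphd.resources, spec.metad.resources and spec.storaged.resources, which should be confirmed with kubectl get nc -o yaml before use. The function name build_resources_patch is hypothetical.

```shell
#!/bin/sh
# build_resources_patch CPU_REQUEST CPU_LIMIT
# Prints a JSON merge patch that sets the same CPU request and limit
# on the graphd, metad and storaged components.
# Assumption: the field path spec.<component>.resources matches your CRD version.
build_resources_patch() {
  request="$1"
  limit="$2"
  sep=""
  printf '{"spec":{'
  for component in graphd metad storaged; do
    printf '%s"%s":{"resources":{"requests":{"cpu":"%s"},"limits":{"cpu":"%s"}}}' \
      "$sep" "$component" "$request" "$limit"
    sep=","
  done
  printf '}}\n'
}
```

It could then be applied with kubectl patch nc "${NEBULA_CLUSTER_NAME}" -n "${NEBULA_CLUSTER_NAMESPACE}" --type merge -p "$(build_resources_patch 500m 1)". A merge patch leaves fields it does not mention, such as the memory settings, untouched.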

刘杰1 · April 12, 2023, 06:14

Limits:
  cpu: 2
  memory: 1Gi
Requests:
  cpu: 1
  memory: 500Mi
With these settings it still reports 0/2 nodes are available: 2 Insufficient cpu.

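kqzh · April 12, 2023, 06:23

How many cores does each of your worker nodes have? If resources are insufficient, you may need to add nodes or upgrade the node configuration.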

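刘杰1 · April 12, 2023, 06:27

Hello, both of my nodes are 4C16G. No other applications are deployed on them at the moment.
One more question: there should also be two graphd pods. Why were they not created in my case?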

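kqzh · April 12, 2023, 07:32

Hi, this is most likely caused by insufficient resources. If this is only for testing, you can set the replicas of each service in the NebulaCluster to 1 1 1 and try again. The CPU requests can also be changed to 500m.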
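
A rough back-of-the-envelope check of the CPU budget may help show why reducing replicas and requests matters. The thread gives two 4-core nodes, a CPU request of 1 per pod, three metad and three storaged pods, and two expected graphd pods (eight pods in total). The sketch below estimates how many pods of a given CPU request fit, keeping in mind that each pod must fit on a single node. The allocatable value of 3500m per node is an assumption for illustration only; the real figure is shown by kubectl describe node and is lower than 4000m because of system reservations and system pods. The function names are hypothetical.

```shell
#!/bin/sh
# Convert a Kubernetes CPU quantity to millicores.
# Handles whole cores ("2") and millicores ("500m"); decimals such as "0.5" are not handled.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# max_pods NODES ALLOCATABLE_PER_NODE REQUEST_PER_POD
# Each pod must fit on one node, so count whole pods per node first.
max_pods() {
  nodes="$1"
  alloc=$(to_millicores "$2")
  req=$(to_millicores "$3")
  echo $(( nodes * (alloc / req) ))
}
```

Under the assumed 3500m allocatable, max_pods 2 3500m 1 gives 6, which is fewer than the eight pods of a full-size cluster, while max_pods 2 3500m 500m gives 14, comfortably more than the three pods of a 1 1 1 test cluster.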

刘杰1 · April 13, 2023, 02:10

Hello. I followed your suggestion and set all replicas to 1. The pods can now be scheduled onto nodes:
[wrs-release@wrs-test-001 ~]$ kubectl get po -n nebula
NAME                               READY   STATUS    RESTARTS        AGE
nebula-exporter-58db8f6d9d-xzsz2   1/1     Running   0               17m
nebula-graphd-0                    1/1     Running   2 (106s ago)    11m
nebula-metad-0                     1/1     Running   0               17m
nebula-storaged-0                  0/1     Running   0               11m
However, the pods restart frequently. These are the logs from graphd and nebula-storaged-0:
++ hostname
++ hostname

exec /usr/local/nebula/bin/nebula-graphd --flagfile=/usr/local/nebula/etc/nebula-graphd.conf --meta_server_addrs=nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local:9559,nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local:9559,nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local:9559 --local_ip=nebula-graphd-0.nebula-graphd-svc.nebula.svc.cluster.local --ws_ip=nebula-graphd-0.nebula-graphd-svc.nebula.svc.cluster.local --daemonize=false

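The other containers show as Running, but they have not actually started successfully.
Could you also take a look at one more thing: according to the official documentation, graphd-svc should be of type ClusterIP after deployment. Why is mine NodePort?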
kubectl get svc -n nebula
NAME                       TYPE        CLUSTER-IP        EXTERNAL-IP
nebula-exporter-svc        ClusterIP   192.168.129.104
nebula-graphd-svc          NodePort    192.168.177.109
nebula-metad-headless      ClusterIP   None
nebula-storaged-headless   ClusterIP   None

Finally, this is my helm install command:
helm install ${NEBULA_CLUSTER_NAME} nebula-operator/nebula-cluster \
  --namespace ${NEBULA_CLUSTER_NAMESPACE} \
  --set nameOverride=${NEBULA_CLUSTER_NAME} \
  --set nebula.storageClassName="${STORAGE_CLASS_NAME}"
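
One thing that may be worth checking in the graphd startup line above: --meta_server_addrs lists nebula-metad-0, nebula-metad-1 and nebula-metad-2, while the pod list shows only nebula-metad-0 running. Whether this explains the restarts is not established in the thread. The sketch below simply counts the addresses in a startup line so the number can be compared with the metad pods that actually exist. The function name count_meta_addrs is hypothetical.

```shell
#!/bin/sh
# count_meta_addrs "STARTUP_LINE"
# Prints how many comma-separated addresses follow --meta_server_addrs=,
# or 0 if the flag is not present in the line.
count_meta_addrs() {
  addrs=$(printf '%s\n' "$1" | sed -n 's/.*--meta_server_addrs=\([^ ]*\).*/\1/p')
  if [ -z "$addrs" ]; then
    echo 0
    return
  fi
  # One address per line, then count the non-empty lines.
  printf '%s\n' "$addrs" | tr ',' '\n' | grep -c .
}
```

The startup line can be taken from kubectl logs nebula-graphd-0 -n nebula. If the count is higher than the number of metad pods, graphd is being pointed at metad hosts that do not exist.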