A strange problem when removing a Storaged node from NebulaGraph on k8s

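Sajo · June 23, 2021, 03:47

Nebula version: 2.0
Deployment: k8s
Production environment: Y
Problem description
I previously opened a thread asking how to remove a storaged node running on k8s from Nebula-Console.
When I was using Nebula Operator, removing a node with
BALANCE DATA REMOVE "nebula-storage-0":9779 worked without problems.
On our in-house Kubernetes platform, however, the same statement reports:

[screenshot of the error message, not preserved]

Using the pod IP directly reports:

[screenshot of the error message, not preserved]

Where is the problem?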

kevin.qiao · June 24, 2021, 02:26

What is the replica_factor of the space basketballplayer?

Sajo · June 24, 2021, 02:28

3

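kevin.qiao · June 24, 2021, 02:36

The error occurs because, once one node is taken offline, the number of remaining nodes is smaller than the replica count, so the data partitions can no longer be distributed with high availability.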
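
The condition described in this reply can be written as a one-line check. This is only a sketch of the rule as stated in the thread (remaining hosts must not drop below the replica count); the function name and parameters are hypothetical and are not part of NebulaGraph.

```python
def can_remove_storaged(host_count: int, replica_factor: int, removing: int = 1) -> bool:
    """Return True if enough storaged hosts remain to hold every replica on a distinct host.

    Hypothetical helper: it encodes only the rule quoted in the thread,
    not NebulaGraph's actual balance logic.
    """
    return host_count - removing >= replica_factor


# 3 hosts with replica_factor 3: removing one leaves 2 hosts, which is too few.
print(can_remove_storaged(3, 3))  # False
# Adding a fourth host first, or lowering the replica count, satisfies the rule.
print(can_remove_storaged(4, 3))  # True
print(can_remove_storaged(3, 1))  # True
```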

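Sajo · June 24, 2021, 02:44

Understood. Speaking of high availability, I would like to know roughly what ratio of replicas to nodes is appropriate. Is there a recommended configuration?
In theory three replicas can tolerate losing half of the nodes, but with three nodes and three replicas, restarting a single storaged pod already causes read/write errors.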
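
For reference, NebulaGraph's storage layer replicates each part with Raft, which needs a majority of replicas to make progress. The sketch below works through that arithmetic; it assumes plain majority quorum and says nothing about how long a leader change takes. Under that assumption three replicas tolerate one lost replica, which is less than half.

```python
def raft_quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be down while a majority is still reachable."""
    return replicas - raft_quorum(replicas)


for n in (1, 3, 5):
    print(f"replicas={n} quorum={raft_quorum(n)} tolerated_failures={tolerated_failures(n)}")
# replicas=1 quorum=1 tolerated_failures=0
# replicas=3 quorum=2 tolerated_failures=1
# replicas=5 quorum=3 tolerated_failures=2
```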

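Sajo · June 24, 2021, 03:00

After changing the replica count to 1, the remove operation could be submitted.
But it looks like the balance data step that runs before the removal failed.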

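kevin.qiao · June 24, 2021, 03:16

If replica_factor is 3, you need at least 3 storaged instances. Beyond that, the recommended configuration depends on your use case.
You said restarting a pod causes read/write errors. What error is returned?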

Sajo · June 24, 2021, 03:23

@kevin.qiao
The replica count is 3 and there are 3 Storaged nodes.
Earlier, while testing with import, randomly restarting one storaged pod produced:
ErrMsg: Storage Error: The leader has changed. Try again later, ErrCode: -8
ErrMsg: Storage Error: part: 8, error: E_RPC_FAILURE(-3)., ErrCode: -8
Everything recovered once the pod was healthy again.

kevin.qiao · June 24, 2021, 05:40

On the client side you can add a retry mechanism keyed on the returned error code.
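
A minimal sketch of such a retry wrapper follows. It does not use any real NebulaGraph client API: `execute` is a hypothetical callable that takes a statement and returns an `(error_code, message)` pair, and treating code -8 (taken from the messages quoted in this thread) as retryable is an assumption.

```python
import time
from typing import Callable, Tuple

# Error code seen in the messages quoted in this thread.
# Treating it as safe to retry is an assumption, not documented behavior.
RETRYABLE_CODES = {-8}


def execute_with_retry(
    execute: Callable[[str], Tuple[int, str]],
    statement: str,
    max_attempts: int = 5,
    base_delay_secs: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Run execute(statement), retrying with exponential backoff on retryable codes.

    Returns the last (error_code, message) pair: either a success, a
    non-retryable failure, or the final retryable failure once attempts run out.
    """
    code, msg = execute(statement)
    for attempt in range(1, max_attempts):
        if code not in RETRYABLE_CODES:
            break
        # Wait 0.5 s, 1 s, 2 s, ... to give leader election time to finish.
        sleep(base_delay_secs * 2 ** (attempt - 1))
        code, msg = execute(statement)
    return code, msg


# Usage with a fake client that fails twice with -8 and then succeeds.
responses = iter([
    (-8, "Storage Error: The leader has changed. Try again later"),
    (-8, "Storage Error: The leader has changed. Try again later"),
    (0, ""),
])
result = execute_with_retry(lambda stmt: next(responses), "SHOW HOSTS", sleep=lambda s: None)
print(result)  # (0, '')
```

Whether a write can be retried safely depends on the statement; the wrapper only decides when to call again.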

Sajo · June 24, 2021, 05:57

@kevin.qiao
Understood. Do you have any idea why that BALANCE DATA failed? I cannot see anything wrong in the logs.

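pandasheeps · June 24, 2021, 06:30

This error is expected; nothing is wrong.
The machine you restarted held the leader for some of the parts.
When you stop a machine, those parts elect a new leader.
Once the leader election has finished, everything works normally again.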

pandasheeps · June 24, 2021, 06:31

Try the latest nightly build; it should work there.

Sajo · June 24, 2021, 06:32

Thanks, I will give it a try.

Sajo · June 24, 2021, 06:51

Have the parameters changed?

E0624 06:49:56.986775     1 MetaDaemon.cpp:206] Bad local host addr, status:Bad ip format:nebula-metad-0

The flag I used before to specify the local address now causes an error.

pandasheeps · June 24, 2021, 07:03

Check whether your configuration file is formatted correctly.

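Sajo · June 24, 2021, 07:04

The configuration file is mounted directly from the ConfigMap generated by Nebula Operator.
The runtime parameters are customized with flags. I only changed the image version and nothing else; with 2.0 there was no problem.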

pandasheeps · June 24, 2021, 07:10

Compare your configuration file with the one shipped in the nightly image.
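
One way to do that comparison is to parse the `--name=value` lines of both files and diff the resulting flag sets. The sketch below assumes the gflags-style flagfile layout seen in this thread; the helper names are hypothetical and the sample contents in the usage part are invented for illustration.

```python
def parse_flagfile(text: str) -> dict:
    """Collect --name=value lines from a gflags-style flagfile into a dict."""
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("--"):
            continue  # skip blank lines and comment text
        name, _, value = line[2:].partition("=")
        flags[name] = value
    return flags


def diff_flags(old: dict, new: dict) -> dict:
    """Report flags that exist on one side only or whose values differ."""
    return {
        "only_in_old": sorted(old.keys() - new.keys()),
        "only_in_new": sorted(new.keys() - old.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Invented sample contents; in practice read the two files from disk.
old_conf = "# Local ip\n--local_ip=127.0.0.1\n--port=3699\n--v=0\n"
new_conf = "--port=3700\n--v=0\n--made_up_flag=1\n"
print(diff_flags(parse_flagfile(old_conf), parse_flagfile(new_conf)))
# {'only_in_old': ['local_ip'], 'only_in_new': ['made_up_flag'], 'changed': ['port']}
```

A flag listed under `only_in_old` is a candidate for an "unknown command line flag" error when the old command line is reused with the new binary.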

Sajo · June 24, 2021, 08:09

I imported the configuration file from the nightly image as a ConfigMap.
The configuration file does contain a local_ip field,
but startup reports:

ERROR: unknown command line flag 'local_ip'

The command used at startup:

command:
        - /bin/bash
        - '-ecx'
        - >-
          exec /usr/local/nebula/bin/nebula-graphd
          --flagfile=/usr/local/nebula/etc/nebula-graphd.conf
          --meta_server_addrs=nebula-metad-0:9559,nebula-metad-1:9559
          --local_ip=nebula-graphd-0 --heartbeat_interval_secs=10
          --minloglevel=1 --v=0 --daemonize=false

pandasheeps · June 24, 2021, 08:21

That is because the latest conf uses --local_config=true,
so you need to move the flags from the command line into the conf file.

Sajo · June 24, 2021, 08:23

I have not enabled local_config=true here.

configmapName: nebula-graphd
data:
  nebula-graphd.conf: >

    ########## basics ##########

    # Whether to run as a daemon process

    --daemonize=true

    # The file to host the process id

    --pid_file=pids/nebula-graphd.pid

    ########## logging ##########

    # The directory to host logging files, which must already exist

    --log_dir=logs

    # Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively

    --minloglevel=0

    # Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose
    of the logging

    --v=0

    # Maximum seconds to buffer the log messages

    --logbufsecs=0

    # Whether to redirect stdout and stderr to separate output files

    --redirect_stdout=true

    # Destination filename of stdout and stderr, which will also reside in
    log_dir.

    --stdout_log_file=stdout.log

    --stderr_log_file=stderr.log

    # Copy log messages at or above this level to stderr in addition to
    logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are
    0, 1, 2, and 3, respectively.

    --stderrthreshold=2

    ########## networking ##########

    # Meta Server Address

    --meta_server_addrs=127.0.0.1:45500

    # Local ip

    --local_ip=127.0.0.1

    # Network device to listen on

    --listen_netdev=any

    # Port to listen on

    --port=3699

    # To turn on SO_REUSEPORT or not

    --reuse_port=false

    # Backlog of the listen socket, adjust this together with net.core.somaxconn

    --listen_backlog=1024

    # Seconds before the idle connections are closed, 0 for never closed

    --client_idle_timeout_secs=0

    # Seconds before the idle sessions are expired, 0 for no expiration

    --session_idle_timeout_secs=0

    # The number of threads to accept incoming connections

    --num_accept_threads=1

    # The number of networking IO threads, 0 for # of CPU cores

    --num_netio_threads=0

    # The number of threads to execute user queries, 0 for # of CPU cores

    --num_worker_threads=0

    # HTTP service ip

    --ws_ip=127.0.0.1

    # HTTP service port

    --ws_http_port=13000

    # HTTP2 service port

    --ws_h2_port=13002

    # The default charset when a space is created

    --default_charset=utf8

    # The default collate when a space is created

    --default_collate=utf8_bin


    ########## authorization ##########

    # Enable authorization

    --enable_authorize=false

    ########## Authentication ##########

    # User login authentication type, password for nebula authentication, ldap
    for ldap authentication, cloud for cloud authentication

    --auth_type=password