br backup to local storage reports an error: E_LIST_CLUSTER_NO_AGENT_FAILURE

Background

Running a single-node data backup with nebula-br reports an error:

$ ./bin/br backup full --meta "127.0.0.1:9559" --storage "local:///Users/jermey/projects/nebula-backup"

{"level":"info","meta address":"127.0.0.1:9559","msg":"Try to connect meta service.","time":"2022-07-22T16:48:33.974Z"}
{"level":"info","meta address":"127.0.0.1:9559","msg":"Connect meta server successfully.","time":"2022-07-22T16:48:33.981Z"}
Error: parse cluster response failed: response is not successful, code is E_LIST_CLUSTER_NO_AGENT_FAILURE

Environment

nebula-br

My nebula-br version:

$ ./bin/br version
Nebula Backup And Restore Utility Tool,V-0.6.1
   GitSha: 5aca40c
   GitRef: master
please run "help" subcommand for more infomation.

Nebula services

Nebula is a single-node deployment started via Docker:

> docker-compose ps
  Name                Command                  State                                                  Ports
------------------------------------------------------------------------------------------------------------------------------------------------------
graphd     /usr/local/nebula/bin/nebu ...   Up (healthy)   0.0.0.0:19669->19669/tcp,:::19669->19669/tcp, 0.0.0.0:19670->19670/tcp,:::19670->19670/tcp,
                                                           0.0.0.0:9669->9669/tcp,:::9669->9669/tcp
metad      /usr/local/nebula/bin/nebu ...   Up (healthy)   0.0.0.0:19559->19559/tcp,:::19559->19559/tcp, 0.0.0.0:19560->19560/tcp,:::19560->19560/tcp,
                                                           0.0.0.0:9559->9559/tcp,:::9559->9559/tcp, 9560/tcp
storaged   /usr/local/nebula/bin/nebu ...   Up (healthy)   0.0.0.0:19779->19779/tcp,:::19779->19779/tcp, 0.0.0.0:19780->19780/tcp,:::19780->19780/tcp,
                                                           9777/tcp, 9778/tcp, 0.0.0.0:9779->9779/tcp,:::9779->9779/tcp, 9780/tcp
studio     ./server                         Up             0.0.0.0:7001->7001/tcp,:::7001->7001/tcp

graphd/metad/storaged are all v3.1.0.

Below is the metad configuration file:

########## basics ##########
# Whether to run as a daemon process
--daemonize=true
# The file to host the process id
--pid_file=pids/nebula-metad.pid

########## logging ##########
# The directory to host logging files
--log_dir=logs
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=0
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=0
# Maximum seconds to buffer the log messages
--logbufsecs=0
# Whether to redirect stdout and stderr to separate output files
--redirect_stdout=true
# Destination filename of stdout and stderr, which will also reside in log_dir.
--stdout_log_file=metad-stdout.log
--stderr_log_file=metad-stderr.log
# Copy log messages at or above this level to stderr in addition to logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are 0, 1, 2, and 3, respectively.
--stderrthreshold=2
# Whether logging file names contain a timestamp. If using logrotate to rotate logging files, this should be set to true.
--timestamp_in_logfile_name=true

########## networking ##########
# Comma separated Meta Server addresses
--meta_server_addrs=127.0.0.1:9559
# Local IP used to identify the nebula-metad process.
# Change it to an address other than loopback if the service is distributed or
# will be accessed remotely.
--local_ip=127.0.0.1
# Meta daemon listening port
--port=9559
# HTTP service ip
--ws_ip=0.0.0.0
# HTTP service port
--ws_http_port=19559
# Port to listen on Storage with HTTP protocol, it corresponds to ws_http_port in storage's configuration file
--ws_storage_http_port=19779

########## storage ##########
# Root data path, here should be only single path for metad
--data_path=data/meta

########## Misc #########
# The default number of parts when a space is created
--default_parts_num=100
# The default replica factor when a space is created
--default_replica_factor=1

--heartbeat_interval_secs=10
--agent_heartbeat_interval_secs=60

nebula-agent

> ./agent --agent="127.0.0.1:8888" --meta="127.0.0.1:9559"
{"file":"command-line-arguments/agent.go:31","func":"main.main","level":"info","msg":"Start agent server...","time":"2022-07-22T16:59:49.120Z","version":"96646b8"}
{"file":"github.com/vesoft-inc/nebula-agent/internal/clients/meta.go:75","func":"github.com/vesoft-inc/nebula-agent/internal/clients.connect","level":"info","meta address":"127.0.0.1:9559","msg":"try to connect meta service","time":"2022-07-22T16:59:49.121Z"}
{"file":"github.com/vesoft-inc/nebula-agent/internal/clients/meta.go:102","func":"github.com/vesoft-inc/nebula-agent/internal/clients.connect","level":"info","meta address":"127.0.0.1:9559","msg":"connect meta server successfully","time":"2022-07-22T16:59:49.126Z"}

The agent logs show it has connected successfully.

But when I execute the backup command:

> ./bin/br backup full --meta "127.0.0.1:9559" --storage "local:///Users/jermey/projects/nebula-backup"
{"level":"info","meta address":"127.0.0.1:9559","msg":"Try to connect meta service.","time":"2022-07-22T17:01:02.004Z"}
{"level":"info","meta address":"127.0.0.1:9559","msg":"Connect meta server successfully.","time":"2022-07-22T17:01:02.009Z"}
Error: parse cluster response failed: response is not successful, code is E_LIST_CLUSTER_NO_AGENT_FAILURE

The error message made me think the connection to the metad service was failing,

but meta_server_addrs in the metad config file is 127.0.0.1 as well, and telnet to it also works:

telnet 127.0.0.1 9559
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

I also tested with the real IP, and got the same error.

Could anyone help me figure out the cause?

Best wishes!

The agent makes an implicit assumption that it shares the hostname/IP of the host it manages. In a Docker deployment, you need to use network_mode: so that an agent runs on <hostname>:8888 for every service instance.

You can refer to my nebula-up (https://github.com/wey-gu/nebula-up/blob/main/backup_restore/docker-compose.yaml).

Or, you can also directly run:

curl -fsSL nebula-up.siwei.io/all-in-one.sh | bash -s -- v3 br

to deploy the full NebulaGraph + BR stack.

hi wey!

I've started a corresponding agent for each service now:

  metad-agent:
    image: weygu/nebula-br:0.6.0
    container_name: metad-agent
    command: --agent="metad:8888" --meta="metad:9559"
    network_mode: 'container:metad'

  storaged-agent:
    image: weygu/nebula-br:0.6.0
    container_name: storaged-agent
    command: --agent="storaged:8888" --meta="metad:9559"
    network_mode: 'container:storaged'

  graphd-agent:
    image: weygu/nebula-br:0.6.0
    container_name: graphd-agent
    command: --agent="graphd:8888" --meta="metad:9559"
    network_mode: 'container:graphd'

Current container status:

> docker-compose ps
     Name                   Command                  State                                                                                     Ports
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
graphd           /usr/local/nebula/bin/nebu ...   Up (healthy)   0.0.0.0:19669->19669/tcp,:::19669->19669/tcp, 0.0.0.0:19670->19670/tcp,:::19670->19670/tcp, 0.0.0.0:9669->9669/tcp,:::9669->9669/tcp
graphd-agent     /usr/local/bin/agent --age ...   Up
metad            /usr/local/nebula/bin/nebu ...   Up (healthy)   0.0.0.0:19559->19559/tcp,:::19559->19559/tcp, 0.0.0.0:19560->19560/tcp,:::19560->19560/tcp, 0.0.0.0:9559->9559/tcp,:::9559->9559/tcp, 9560/tcp
metad-agent      /usr/local/bin/agent --age ...   Up
storaged         /usr/local/nebula/bin/nebu ...   Up (healthy)   0.0.0.0:19779->19779/tcp,:::19779->19779/tcp, 0.0.0.0:19780->19780/tcp,:::19780->19780/tcp, 9777/tcp, 9778/tcp, 0.0.0.0:9779->9779/tcp,:::9779->9779/tcp, 9780/tcp
storaged-agent   /usr/local/bin/agent --age ...   Up
studio           ./server                         Up             0.0.0.0:7001->7001/tcp,:::7001->7001/tcp

But in that case, do I run the br backup command through a container, for example:

docker exec -it graphd-agent br backup full --meta "metad:9559" --storage "local:///Users/jermey/projects/nebula-backup"

And if I want the backup files generated on the local host,

does --storage here correspond to a directory path inside the container?

In this case you can't use local storage :slight_smile: (with local://, each agent writes its share of the backup to that path on its own host, which here would be inside each agent's container).

You can just copy everything from https://github.com/wey-gu/nebula-up/tree/main/backup_restore over; it will set up a ready-to-use MinIO cluster for you.
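
(If you'd rather not copy the whole compose setup, a rough single-node stand-in can be started by hand. This is only a sketch: <your-compose-network> is whatever network your compose project created, the minioadmin credentials are MinIO's defaults, and nebula-up itself runs a multi-node MinIO behind nginx.)

docker run -d --name minio --network <your-compose-network> \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin -e MINIO_ROOT_PASSWORD=minioadmin \
  minio/minio server /data --console-address ":9001"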

Then, for example, here's a br show invocation; the prerequisite is that you've already created the nebula-br-bucket bucket in MinIO:

docker exec -it backup_restore_graphd1-agent_1 br show --s3.endpoint "http://nginx:9000" --storage="s3://nebula-br-bucket/" --s3.access_key=minioadmin --s3.secret_key=minioadmin --s3.region=default
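
(If the bucket doesn't exist yet, one way to create it is with the MinIO client image. A sketch only: it assumes the default minioadmin credentials and that the compose network is named backup_restore_default.)

docker run --rm --network backup_restore_default --entrypoint /bin/sh minio/mc -c \
  "mc alias set local http://nginx:9000 minioadmin minioadmin && mc mb local/nebula-br-bucket"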

And to run a backup:

docker exec -it backup_restore_graphd1-agent_1 br backup full --meta "metad0:9559" --s3.endpoint "http://nginx:9000" --storage="s3://nebula-br-bucket/" --s3.access_key=minioadmin --s3.secret_key=minioadmin --s3.region=default

https://github.com/wey-gu/nebula-up#try-backup-and-restore-with-minio-as-storage explains how you can view the backup files you created in MinIO from your browser.

--s3.endpoint "http://nginx:9000"

If I need to access a MinIO outside the container network, can this parameter be written as ip + port?

I tried that, but it shows a timeout error:

docker exec -it graphd-agent br backup full --meta "metad:9559" --s3.endpoint "http://192.168.31.xx:30080" --storage="s3://minio/jermey/" --s3.access_key=xxx --s3.secret_key=*** --s3.region=default

error.log

{"backup name":"BACKUP_2022_07_25_11_25_02","file":"github.com/vesoft-inc/nebula-br/pkg/cleanup/cleanup.go:123","func":"github.com/vesoft-inc/nebula-br/pkg/cleanup.(*Cleanup).Clean","level":"info","msg":"Clean up backup data successfully.","time":"2022-07-25T03:29:03.739Z"}
Cleanup backup BACKUP_2022_07_25_11_25_02 successfully after backup failed.Error: upload local tmp file to remote storage s3://minio/jermey/BACKUP_2022_07_25_11_25_02/BACKUP_2022_07_25_11_25_02.meta failed: upload from /tmp/nebula-br/BACKUP_2022_07_25_11_25_02.meta to jermey/BACKUP_2022_07_25_11_25_02/BACKUP_2022_07_25_11_25_02.meta failed: RequestError: send request failed
caused by: Put "http://192.168.31.10:30080/minio/jermey/BACKUP_2022_07_25_11_25_02/BACKUP_2022_07_25_11_25_02.meta": dial tcp 192.168.31.10:30080: i/o timeout

You can access it from outside with an IP, because nginx's port has an external mapping. But in my config the externally mapped port isn't 9000, it's 19000. In my setup NebulaGraph and MinIO are on the same container network, so using the container name as the address is enough; for external IP access, switch to 19000.

  nginx:
    image: nginx:1.19.2-alpine
    hostname: nginx
    volumes:
      - ${PWD}/nginx.conf:/etc/nginx/nginx.conf:ro
    ports:
      - "19000:9000" # <------- 有外部映射
      - "9001:9001"  # <------- 

Check my network config; if yours is the same, you should be able to use hostnames everywhere instead of IPs and go through the container network.
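
(To make the two variants concrete, a sketch based on the backup command above; the bucket and credentials are the ones from the earlier example, and 192.168.31.xx is the host IP from your log:)

# inside the container network: address MinIO by the nginx container name
docker exec -it graphd-agent br backup full --meta "metad:9559" \
  --s3.endpoint "http://nginx:9000" --storage="s3://nebula-br-bucket/" \
  --s3.access_key=minioadmin --s3.secret_key=minioadmin --s3.region=default

# from outside the container network: host IP plus the externally mapped port
#   --s3.endpoint "http://192.168.31.xx:19000"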

Great, I can access it now, thanks!

hi wey

One more question, about restoring data.

I only have one metad node.

When I run the restore command:

docker exec -it graphd-agent br restore full --meta "metad:9559" --s3.endpoint "http://192.168.31.xx:30080" --storage="s3://jermey/backup/" --s3.access_key=aasminio --s3.secret_key=*** --s3.region=default --name BACKUP_2022_07_25_15_43_56

It reports this error:

{"file":"github.com/vesoft-inc/nebula-br/pkg/clients/utils.go:20","func":"github.com/vesoft-inc/nebula-br/pkg/clients.connect","level":"info","meta address":"metad:9559","msg":"Try to connect meta service.","time":"2022-07-25T08:02:52.125Z"}
{"file":"github.com/vesoft-inc/nebula-br/pkg/clients/utils.go:44","func":"github.com/vesoft-inc/nebula-br/pkg/clients.connect","level":"info","meta address":"metad:9559","msg":"Connect meta server successfully.","time":"2022-07-25T08:02:52.131Z"}
{"file":"github.com/vesoft-inc/nebula-br/pkg/clients/utils.go:20","func":"github.com/vesoft-inc/nebula-br/pkg/clients.connect","level":"info","meta address":"metad:9559","msg":"Try to connect meta service.","time":"2022-07-25T08:02:52.131Z"}
{"file":"github.com/vesoft-inc/nebula-br/pkg/clients/utils.go:44","func":"github.com/vesoft-inc/nebula-br/pkg/clients.connect","level":"info","meta address":"metad:9559","msg":"Connect meta server successfully.","time":"2022-07-25T08:02:52.133Z"}
{"file":"github.com/vesoft-inc/nebula-br/pkg/utils/hosts.go:67","func":"github.com/vesoft-inc/nebula-br/pkg/utils.(*NebulaHosts).LoadFrom","host info":"map[graphd:graphd:8888[AGENT]: (data: , root: ) | graphd:9669[GRAPH]: (data: , root: /usr/local/nebula) metad:metad:8888[AGENT]: (data: , root: ) | metad:9559[META]: (data: /data/meta, root: /usr/local/nebula) storaged:storaged:9779[STORAGE]: (data: /data/storage, root: /usr/local/nebula) | storaged:8888[AGENT]: (data: , root: )]","level":"info","msg":"Get cluster topology from the nebula.","time":"2022-07-25T08:02:52.133Z"}
{"backup":"BACKUP_2022_07_25_15_43_56","file":"github.com/vesoft-inc/nebula-br/pkg/restore/restore.go:509","func":"github.com/vesoft-inc/nebula-br/pkg/restore.(*Restore).Restore","level":"info","msg":"Check backup dir successfully.","time":"2022-07-25T08:02:52.141Z","uri":"s3://jermey/backup/BACKUP_2022_07_25_15_43_56"}
{"dir":"/usr/local/nebula","file":"github.com/vesoft-inc/nebula-br/pkg/restore/restore.go:358","func":"github.com/vesoft-inc/nebula-br/pkg/restore.(*Restore).stopCluster","host":"storaged","level":"info","msg":"Stop services.","role":"STORAGE","time":"2022-07-25T08:02:52.144Z"}
{"error":"get service status in host storaged failed: agent, get service status failed: rpc error: code = Unknown desc = get STORAGE status by daemon failed: exec: \"bash\": executable file not found in $PATH","file":"github.com/vesoft-inc/nebula-br/pkg/restore/fix.go:181","func":"github.com/vesoft-inc/nebula-br/pkg/restore.retry","level":"info","msg":"Get dead services failed, try times=1.","time":"2022-07-25T08:02:52.148Z"}
{"error":"get service status in host storaged failed: agent, get service status failed: rpc error: code = Unknown desc = get STORAGE status by daemon failed: exec: \"bash\": executable file not found in $PATH","file":"github.com/vesoft-inc/nebula-br/pkg/restore/fix.go:181","func":"github.com/vesoft-inc/nebula-br/pkg/restore.retry","level":"info","msg":"Get dead services failed, try times=2.","time":"2022-07-25T08:02:53.150Z"}
{"error":"get service status in host storaged failed: agent, get service status failed: rpc error: code = Unknown desc = get STORAGE status by daemon failed: exec: \"bash\": executable file not found in $PATH","file":"github.com/vesoft-inc/nebula-br/pkg/restore/fix.go:181","func":"github.com/vesoft-inc/nebula-br/pkg/restore.retry","level":"info","msg":"Get dead services failed, try times=3.","time":"2022-07-25T08:02:55.151Z"}
Fix failed when restore failed get service status in host storaged failed: agent, get service status failed: rpc error: code = Unknown desc = get STORAGE status by daemon failed: exec: "bash": executable file not found in $PATH
Error: stop cluster failed: stop services in host storaged failed: agent, stop service failed: rpc error: code = Unknown desc = exec: "bash": executable file not found in $PATH

But my storaged status is healthy:

  Name                Command                  State                                                  Ports
------------------------------------------------------------------------------------------------------------------------------------------------------
graphd     /usr/local/nebula/bin/nebu ...   Up (healthy)   0.0.0.0:49669->19669/tcp, 0.0.0.0:49670->19670/tcp, 0.0.0.0:59669->9669/tcp
metad      /usr/local/nebula/bin/nebu ...   Up (healthy)   0.0.0.0:49559->19559/tcp, 0.0.0.0:49560->19560/tcp, 0.0.0.0:39559->9559/tcp, 9560/tcp
storaged   /usr/local/nebula/bin/nebu ...   Up (healthy)   0.0.0.0:49779->19779/tcp, 0.0.0.0:49780->19780/tcp, 9777/tcp, 9778/tcp,
                                                           0.0.0.0:39779->9779/tcp, 9780/tcp
studio     ./server                         Up             0.0.0.0:7007->7001/tcp

What's going on here?

show hosts agent

Does the agent version match the database version?
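
(For reference, one way to run that check from the host. A sketch: it assumes the vesoft/nebula-console image, the default root/nebula credentials, and that graphd is reachable on your compose network.)

docker run --rm --network <your-compose-network> vesoft/nebula-console:v3.1.0 \
  -addr graphd -port 9669 -u root -p nebula -e "SHOW HOSTS AGENT"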

They're all online.

For the agent I'm using your image: weygu/nebula-br:0.6.0
The database version is v3.1.0.

Where can I check the corresponding versions?

The software in nebula-up runs in containers I packaged myself, pinned to NebulaGraph 3.1.0; if you're also on NebulaGraph 3.1.0, that's fine.

From the error it looks like a mismatch: it's at the get dead services step, and the peer may not have the corresponding interface yet.

cc @spw

I did change the exposed ports.

But restore should go through the container-internal ports, so that shouldn't matter, right?

108eb931f636 vesoft/nebula-storaged:v3.1.0 "/usr/local/nebula/b…" 45 minutes ago Up 45 minutes (healthy) 9777-9778/tcp, 9780/tcp, 0.0.0.0:39779->9779/tcp, 0.0.0.0:49779->19779/tcp, 0.0.0.0:49780->19780/tcp storaged

Hold on, it's not a problem with your configuration. My original setup apparently can't restore either; let me look into why.

Confirmed with @spw: right now, when the agent starts and stops services during a restore, it assumes a non-container environment (bare metal, with the agent and the services on the same host), so restore is not supported inside containers for the time being. It looks like if I want my way of playing with BR in Docker to run end to end, I'd first have to hack/mock this bash call:

https://github.com/vesoft-inc/nebula-agent/blob/96646b8f19bea1d9faae99dc72990d8110532aa0/internal/clients/daemon.go#L95-L100
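
(The symptom is easy to confirm from outside. A quick check, assuming the agent container names above and that the image ships a POSIX sh:)

# the restore error means the agent container itself has no bash to exec
docker exec -it storaged-agent sh -c 'command -v bash || echo "no bash in PATH"'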

If you want to test restore, bare-metal deployment is the only way for now :sob:

I'll file an issue for this.

Got it, thanks a lot for the explanation!

:+1:

https://github.com/vesoft-inc/nebula-agent/issues/22 :slight_smile:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.