Nebula Graph 1.2: CREATE SNAPSHOT fails

Question template:

  • Nebula version: 1.2
  • Deployment (distributed / standalone / Docker / DBaaS): distributed
  • Hardware info
    • Disk (SSD recommended): SSD
    • CPU / memory info
  • Problem description:
    CREATE SNAPSHOT fails.

However, SHOW SNAPSHOTS shows that the snapshot exists:
[screenshot: SHOW SNAPSHOTS output]


Moreover, snapshots keep being created on their own:

storaged log:

Is this running the latest master code?

The first 409 error means that a leader change happened during the checkpoint, which caused the checkpoint to fail.
The snapshots being created automatically is something that needs investigation, though; it looks like one per minute. Is a scheduled checkpoint configured on the application side?
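If it helps to rule out leader churn, the leader distribution can be checked from the console before retrying (a minimal sketch; SHOW HOSTS is standard nGQL in the 1.x line):

(root@nebula) [(none)]> SHOW HOSTS

If the Leader count column keeps changing between two consecutive runs, leaders are still moving, and a retried CREATE SNAPSHOT is likely to hit the same 409.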

No balance leader was run during the checkpoint, and no scheduled checkpoint is configured on the application side. It looks as if the failed operation is being retried automatically.

It's deployed with Docker Swarm; the image is vesoft/nebula-graphd:v1.2.0. It has been running for over a month, so it's probably not the latest.

By the way, which link points to the latest Docker image?

I updated the image; it succeeded a few times, but now it fails again.
Error: [ERROR (-8)]: RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: Timed Out

I ran it only once, and three snapshots were created:

storaged log:

E0309 09:59:53.399646    29 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:53.450421    29 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:53.501212    29 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:53.551978    29 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:53.602739    29 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:53.653492    29 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:53.704360    29 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:53.755228    23 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:53.806111    23 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:53.856842    23 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:53.907537    23 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:53.958381    23 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:54.009222    23 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:54.060096    23 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:54.111002    26 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:54.161890    29 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:54.212622    26 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:54.263267    40 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:54.314007    40 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:54.364724    40 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:54.415467    40 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:54.466233    40 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:54.516908    40 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:54.567560    40 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:54.618185    40 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267
E0309 09:59:54.668829    40 NebulaStore.cpp:815] Part sync failed. space : 739 Part : 267

Please post your cluster configuration and architecture, and describe the reproduction steps.
We haven't reproduced the problem. Looking at the code logic, it is unlikely to retry after a failure, and unlikely to create three snapshots with different names.

The images I'm using now are:
vesoft/nebula-graphd:v1.2.0
vesoft/nebula-metad:v1.2.0
vesoft/nebula-storaged:v1.2.0
I checked; these images are three months old.

I'd like to try the latest Docker image first. Since I'm on the 1.x line, which of the images below should I use?
1.
vesoft/nebula-graphd:nightly
vesoft/nebula-metad:nightly
vesoft/nebula-storaged:nightly
2.
vesoft/nebula-graphd:latest
vesoft/nebula-metad:latest
vesoft/nebula-storaged:latest
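For reference, whichever tag is current for the 1.x line, switching a running Swarm service over to it would look roughly like this (a sketch assuming the nightly tag and the service names shown later in docker service list):

docker pull vesoft/nebula-graphd:nightly
docker service update --image vesoft/nebula-graphd:nightly nebula_graphd0

docker service update rolls the service's tasks over to the new image one at a time.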

It may be because the code isn't the latest.

The cluster has 15 machines, deployed with Docker Swarm: 5 metad, 15 storaged, 15 graphd.
Configuration:
meta:

      - --meta_server_addrs=
      - --local_ip=
      - --ws_ip=
      - --port=45500
      - --data_path=/data/meta
      - --log_dir=/logs
      - --v=0
      - --minloglevel=2
      - --heartbeat_interval_secs=10

storaged:

      - --meta_server_addrs=
      - --local_ip= 
      - --ws_ip= 
      - --port=44500
      - --rocksdb_block_cache=32768
      - --rocksdb_batch_size=4096
      - --data_path=/data/storage0,/data/storage1,/data/storage2,/data/storage3,/data/storage4,/data/storage5,/data/storage6
      - --log_dir=/logs
      - --v=0
      - --minloglevel=2
      - --auto_remove_invalid_space=true

graphd:

      - --meta_server_addrs= 
      - --port=3699
      - --ws_ip= 
      - --log_dir=/logs
      - --v=0
      - --minloglevel=2
      - --enable_authorize=true

Other configs:

(root@nebula) [nba]> show configs
==============================================================================================================================================================================================================================
| module  | name                              | type   | mode    | value                                                                                                                                                     |
==============================================================================================================================================================================================================================
| GRAPH   | v                                 | INT64  | MUTABLE | 0                                                                                                                                                         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| GRAPH   | minloglevel                       | INT64  | MUTABLE | 2                                                                                                                                                         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| GRAPH   | slow_op_threshhold_ms             | INT64  | MUTABLE | 50                                                                                                                                                        |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| GRAPH   | heartbeat_interval_secs           | INT64  | MUTABLE | 3                                                                                                                                                         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| GRAPH   | meta_client_retry_times           | INT64  | MUTABLE | 3                                                                                                                                                         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| STORAGE | v                                 | INT64  | MUTABLE | 0                                                                                                                                                         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| STORAGE | wal_ttl                           | INT64  | MUTABLE | 3600                                                                                                                                                      |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| STORAGE | minloglevel                       | INT64  | MUTABLE | 2                                                                                                                                                         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| STORAGE | rocksdb_db_options                | NESTED | MUTABLE | {
  "max_background_jobs": "1",
  "max_subcompactions": "1"
}                                                                                             |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| STORAGE | enable_multi_versions             | BOOL   | MUTABLE | False                                                                                                                                                     |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| STORAGE | slow_op_threshhold_ms             | INT64  | MUTABLE | 50                                                                                                                                                        |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| STORAGE | clean_wal_interval_secs           | INT64  | MUTABLE | 600                                                                                                                                                       |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| STORAGE | heartbeat_interval_secs           | INT64  | MUTABLE | 10                                                                                                                                                        |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| STORAGE | meta_client_retry_times           | INT64  | MUTABLE | 3                                                                                                                                                         |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| STORAGE | enable_reservoir_sampling         | BOOL   | MUTABLE | False                                                                                                                                                     |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| STORAGE | custom_filter_interval_secs       | INT64  | MUTABLE | 86400                                                                                                                                                     |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| STORAGE | max_edge_returned_per_vertex      | INT64  | MUTABLE | 2147483647                                                                                                                                                |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| STORAGE | rocksdb_column_family_options     | NESTED | MUTABLE | {
  "max_bytes_for_level_base": "268435456",
  "max_write_buffer_number": "4",
  "disable_auto_compactions": "false",
  "write_buffer_size": "67108864"
} |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| STORAGE | rocksdb_block_based_table_options | NESTED | MUTABLE | {
  "block_size": "8192"
}                                                                                                                                |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Got 19 rows (Time spent: 12.493/14.063 ms)

Reproduction steps:
Very simple: just run CREATE SNAPSHOT, as in the sketch below.
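In console terms, the full reproduction is just (a sketch of the statements involved):

(root@nebula) [nba]> CREATE SNAPSHOT
(root@nebula) [nba]> SHOW SNAPSHOTS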

We suspect the metad state got corrupted. Could you try keeping only one metad?
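Under Docker Swarm, the extra metad replicas can be scaled to zero without touching the stack file, e.g. (a sketch using the metad service names that appear in the docker service list output below):

docker service scale nebula_metad1=0 nebula_metad2=0 nebula_metad3=0 nebula_metad4=0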

I kept only one metad:

(root@nebula) [(none)]> create snapshot
[ERROR (-8)]: RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
[root@node-01 logs]# docker service list       
ID                  NAME                MODE                REPLICAS            IMAGE                                            PORTS
pomfudrdtd40        nebula_graphd0      replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
58084hcew9ve        nebula_graphd1      replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
80ajief2nraj        nebula_graphd2      replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
iis745y1s1wy        nebula_graphd3      replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
twffk519r71n        nebula_graphd4      replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
1hd8y2euljs2        nebula_graphd5      replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
lqmz7zy0kl6p        nebula_graphd6      replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
motj0hbdl4il        nebula_graphd7      replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
oc8fqv1oib4r        nebula_graphd8      replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
80y21eh0quaa        nebula_graphd9      replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
vwmsya6ljlhj        nebula_graphd10     replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
u59h2dsev827        nebula_graphd11     replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
ym42o54ywcji        nebula_graphd12     replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
us1ntt8g0b2p        nebula_graphd13     replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
cyyjwkkqbawq        nebula_graphd14     replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-graphd:v1.2.0     
9jrexp96ol88        nebula_metad0       replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-metad:v1.2.0      
0z1eu5vub7cz        nebula_metad1       replicated          0/0                 mirror.jd.com/9n/vesoft/nebula-metad:v1.2.0      
to2umpv44jbm        nebula_metad2       replicated          0/0                 mirror.jd.com/9n/vesoft/nebula-metad:v1.2.0      
x6e6e0ls3m4t        nebula_metad3       replicated          0/0                 mirror.jd.com/9n/vesoft/nebula-metad:v1.2.0      
l6vs48mv2oq3        nebula_metad4       replicated          0/0                 mirror.jd.com/9n/vesoft/nebula-metad:v1.2.0      
33lixqtz7nzh        nebula_storaged0    replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   
59an08nml659        nebula_storaged1    replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   
6y6hmwb3rfiv        nebula_storaged2    replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   
pg0smiwhk2hg        nebula_storaged3    replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   
lcwkevlbtjox        nebula_storaged4    replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   
p9l7knsjn6fe        nebula_storaged5    replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   
d34t2cdsp973        nebula_storaged6    replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   
ssakb491sico        nebula_storaged7    replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   
72q5pvg90w9m        nebula_storaged8    replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   
z82jc8afjah3        nebula_storaged9    replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   
pflxngmarl6f        nebula_storaged10   replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   
2pcdbv0kv6g2        nebula_storaged11   replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   
nr4duju058kl        nebula_storaged12   replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   
7yr3s2u6i90j        nebula_storaged13   replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   
k6gn22ocpnum        nebula_storaged14   replicated          1/1                 mirror.jd.com/9n/vesoft/nebula-storaged:v1.2.0   

metad log:

E0309 11:20:09.176964   147 RaftPart.cpp:365] [Port: 45501, Space: 0, Part: 0] The partition is not a leader
E0309 11:20:09.177827   147 RaftPart.cpp:635] [Port: 45501, Space: 0, Part: 0] Cannot append logs, clean the buffer
E0309 11:22:56.943444    44 RaftPart.cpp:773] [Port: 45501, Space: 0, Part: 0] Replicate logs failed
E0309 11:24:16.970481   143 CreateSnapshotProcessor.cpp:54] Write snapshot meta error
E0309 11:24:18.973976   143 CreateSnapshotProcessor.cpp:54] Write snapshot meta error
E0309 11:24:31.523420   143 RaftPart.cpp:365] [Port: 45501, Space: 0, Part: 0] The partition is not a leader
E0309 11:24:31.523509   143 RaftPart.cpp:635] [Port: 45501, Space: 0, Part: 0] Cannot append logs, clean the buffer
E0309 11:25:09.061215   143 CreateSnapshotProcessor.cpp:54] Write snapshot meta error
E0309 11:26:13.871280   143 RaftPart.cpp:365] [Port: 45501, Space: 0, Part: 0] The partition is not a leader
E0309 11:26:13.871376   143 RaftPart.cpp:635] [Port: 45501, Space: 0, Part: 0] Cannot append logs, clean the buffer

storaged log:

E0309 11:27:20.739928    64 MetaClient.cpp:110] Heartbeat failed, status:Unknown error(409): Leader changed!
E0309 11:27:33.754613    63 MetaClient.cpp:524] Send request to [172.18.154.38:45500], exceed retry limit
E0309 11:27:33.754868    64 MetaClient.cpp:110] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0309 11:27:46.768954    48 MetaClient.cpp:524] Send request to [172.18.154.38:45500], exceed retry limit
E0309 11:27:46.769156    64 MetaClient.cpp:110] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0309 11:27:59.784420    57 MetaClient.cpp:524] Send request to [172.18.153.39:45500], exceed retry limit
E0309 11:27:59.784773    64 MetaClient.cpp:110] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0309 11:28:12.793458    64 MetaClient.cpp:110] Heartbeat failed, status:Unknown error(409): Leader changed!

If the metad state is corrupted, how should I fix it?

After keeping only one metad, the Leader count is all 0.

Is the metad you kept one of the addresses listed in the graphd and storaged config parameter --meta_server_addrs?
My guess is that the remaining metad is not the leader that was actually serving.

You can try changing the graphd and storaged config parameter (--meta_server_addrs) to list only that one metad, and then restart everything (all storaged, all graphd, one metad).
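Concretely, that means trimming the address list in every graphd and storaged service definition down to the single surviving metad (a sketch; <meta-ip> stands in for its real address, which was elided from the configs above):

      - --meta_server_addrs=<meta-ip>:45500

and then redeploying the stack so that all storaged and graphd processes restart against that one metad.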

[ERROR (-8)]: RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: Timed Out
Tue Mar 9 11:14:24 2021

E0309 11:14:47.508344 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.154.38:44500
E0309 11:14:49.157444 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.155.101:44500
E0309 11:14:50.782269 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.155.102:44500
E0309 11:14:52.918222 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.31:44500
E0309 11:14:55.065866 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.33:44500
E0309 11:14:57.278834 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.39:44500
E0309 11:14:59.217269 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.40:44500
E0309 11:15:00.958760 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.42:44500
E0309 11:15:02.733294 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.44:44500
E0309 11:15:04.362973 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.46:44500
E0309 11:15:06.031448 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.48:44500
E0309 11:15:07.661350 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.55:44500
E0309 11:15:10.424243 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.57:44500
E0309 11:15:12.092144 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.59:44500
E0309 11:15:13.716800 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.61:44500
E0309 11:15:15.579665 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.154.38:44500
E0309 11:15:17.234797 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.155.101:44500
E0309 11:15:19.420120 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.155.102:44500
E0309 11:15:21.591212 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.31:44500
E0309 11:15:23.432749 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.33:44500
E0309 11:15:25.347010 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.39:44500
E0309 11:15:27.279222 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.40:44500
E0309 11:15:29.606353 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.42:44500
E0309 11:15:31.310940 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.44:44500
E0309 11:15:32.941073 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.46:44500
E0309 11:15:34.608043 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.48:44500
E0309 11:15:36.234722 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.55:44500
E0309 11:15:39.189229 144 SnapShot.cpp:68] Send blocking sign error on host : 172.18.153.57:44500