Cluster deployment with multiple replicas: stopping one storaged service makes queries fail

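  • nebula version: 3.8.0

  • Deployment type: distributed
  • Installation method: compiled from source
  • In production: N
  • Hardware
    • Disk (SSD recommended): SSD
    • CPU and memory: (4 cores, 32 GB) * 2
  • Detailed description of the problem
    The test cluster has 2 machines, each running one graphd, one metad, and one storaged service. I created a test graph space demo2 with 10 partitions and 2 replicas, using the following statements: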
# Create Space 
CREATE SPACE `demo2` (partition_num = 10, replica_factor = 2, charset = utf8, collate = utf8_bin, vid_type = FIXED_STRING(32));
:sleep 20;
USE `demo2`;

# Create Tag: 
CREATE TAG `player` ( `name` string NULL, `age` int64 NULL) ttl_duration = 0, ttl_col = "";
CREATE TAG `team` ( `name` string NULL) ttl_duration = 0, ttl_col = "";

# Create Edge: 
CREATE EDGE `follow` ( `degree` int64 NULL) ttl_duration = 0, ttl_col = "";
CREATE EDGE `serve` ( `start_year` int64 NULL, `end_year` int64 NULL) ttl_duration = 0, ttl_col = "";
:sleep 20;

# Create Index: 
CREATE TAG INDEX `player_index_0` ON `player` ();
CREATE TAG INDEX `player_index_1` ON `player` ( `name`(20));

When I stop the storaged service on one of the two machines, queries fail with the following errors:

E20250321 09:46:45.305526 11815 StorageClientBase-inl.h:227] Request to "172.21.120.3":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20250321 09:46:45.305652 11806 StorageClientBase-inl.h:143] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20250321 09:46:45.305724 11803 StorageAccessExecutor.h:47] ScanVerticesExecutor failed, error E_RPC_FAILURE, part 6
E20250321 09:46:45.305742 11803 StorageAccessExecutor.h:47] ScanVerticesExecutor failed, error E_RPC_FAILURE, part 5
E20250321 09:46:45.305752 11803 StorageAccessExecutor.h:47] ScanVerticesExecutor failed, error E_RPC_FAILURE, part 7
E20250321 09:46:45.305759 11803 StorageAccessExecutor.h:47] ScanVerticesExecutor failed, error E_RPC_FAILURE, part 2
E20250321 09:46:45.305776 11803 StorageAccessExecutor.h:47] ScanVerticesExecutor failed, error E_RPC_FAILURE, part 4
E20250321 09:46:45.305785 11803 StorageAccessExecutor.h:47] ScanVerticesExecutor failed, error E_RPC_FAILURE, part 8
E20250321 09:46:45.305806 11803 StorageAccessExecutor.h:47] ScanVerticesExecutor failed, error E_RPC_FAILURE, part 9
E20250321 09:46:45.305814 11803 StorageAccessExecutor.h:47] ScanVerticesExecutor failed, error E_RPC_FAILURE, part 10
E20250321 09:46:45.305864 11805 QueryInstance.cpp:151] Storage Error: RPC failure, probably timeout., query: MATCH (v) RETURN v LIMIT 3;
E20250321 09:46:49.184321 11815 StorageClientBase-inl.h:227] Request to "172.21.120.3":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20250321 09:46:49.184422 11803 StorageClientBase-inl.h:143] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20250321 09:46:49.184474 11806 StorageAccessExecutor.h:47] ScanVerticesExecutor failed, error E_RPC_FAILURE, part 3
E20250321 09:46:49.184510 11806 StorageAccessExecutor.h:47] ScanVerticesExecutor failed, error E_RPC_FAILURE, part 1
E20250321 09:46:49.184518 11806 StorageAccessExecutor.h:47] ScanVerticesExecutor failed, error E_RPC_FAILURE, part 5
E20250321 09:46:49.184525 11806 StorageAccessExecutor.h:47] ScanVerticesExecutor failed, error E_RPC_FAILURE, part 7
E20250321 09:46:49.184533 11806 StorageAccessExecutor.h:47] ScanVerticesExecutor failed, error E_RPC_FAILURE, part 9
E20250321 09:46:49.184569 11803 QueryInstance.cpp:151] Storage Error: RPC failure, probably timeout., query: MATCH (v) RETURN v LIMIT 3;

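What causes this? I configured 2 replicas, so in theory, after stopping one storaged service, the other storaged service still holds the full data set and should be able to serve queries. How can I fix this?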

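Raft's reliability mechanism requires at least 3 replicas, and an odd number is recommended. A Raft group can only make progress while a majority of its replicas is reachable; with 2 replicas the majority is 2, so losing either storaged leaves every partition without a quorum. For more detail, read up on how Raft itself works.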
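
To make the majority rule concrete, here is a minimal sketch of the arithmetic in Python. It is not NebulaGraph code; the function names `quorum` and `tolerated_failures` are made up for illustration, and it only assumes the standard Raft rule that a majority is `floor(n / 2) + 1`.

```python
def quorum(replicas: int) -> int:
    """Smallest majority of a Raft group with `replicas` members."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Number of members that can be down while a majority is still reachable."""
    return replicas - quorum(replicas)


# Compare a few replica_factor values side by side.
for n in (1, 2, 3, 4, 5):
    print(f"replica_factor={n}: quorum={quorum(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

With `replica_factor = 2` the tolerated failure count is 0, the same as with a single replica, which matches the behaviour described in the question. Going from 3 to 4 replicas does not raise the count either, which is why an odd number is the usual recommendation.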