Queries hang without returning an error after the Nebula cluster is stopped

  • nebula version: 1.0.0
  • Deployment: distributed
  • Hardware
    • Disk: SSD 2.9T
    • CPU and memory information
  • Detailed description of the problem

After connecting to Nebula successfully, we stop the Nebula cluster. Queries then hang and never return an error.

std::shared_ptr<nebula::storage::StorageClient> tmpstorageClient = std::make_shared<nebula::storage::StorageClient>(ioThreadPool, tmpmetaClient.get());
metaVector.push_back(tmpmetaClient);
clientVector.push_back(tmpstorageClient);

-------
std::shared_ptr<nebula::storage::StorageClient> tmpstorageClient = clientVector.at(t);
auto futureGet = tmpstorageClient->get(spaceId, std::move(keys), true);

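Symptoms:

Inside our business server we connect to the Nebula server successfully with the Storage Client, then stop the Nebula cluster and trigger another query. The query hangs in the get stage without returning an error, and meta heartbeat errors are printed continuously.

  1. What is the cause?
  2. How can we get an error returned?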

Client log:

E0402 09:47:26.335455 15214 MetaClient.cpp:118] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0402 09:47:27.082682 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:28.084795 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:29.091791 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:30.098790 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:30.339421 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:30.343441 15214 MetaClient.cpp:118] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0402 09:47:31.100625 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:32.103116 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:33.105141 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:34.110198 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:34.347332 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:34.348481 15214 MetaClient.cpp:118] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused

It does not return an error even after a long time? As I recall, there is a timeout error that should be returned.

Right, no error is returned for a long time. It only keeps printing meta heartbeat errors, for more than half an hour at least.

Did you stop the entire nebula cluster, or only the storage and meta nodes?

@dingding, can we use this parameter for our storageClient? storage_client_timeout_ms

Yes. That is the timeout for requests sent to storage, and the default is currently 60 seconds. @zmh0531, is your storageclient the one from the nebula repo? nebula/StorageClient.cpp at master · vesoft-inc/nebula · GitHub

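The whole cluster was stopped. We have tried the storage client timeout, the meta timeout, and the retry timeout. It is also easy to reproduce: after the client connects successfully, sleep 30s, stop the cluster, then query again.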

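After the cluster is stopped, the storageclient should receive an exception. Did you stop the nebula service while a get was still in flight, then restart the service and trigger another get? Also, does your ioThreadPool have only one thread? Was the get issued before the shutdown left unfinished?
Please post the log of the program that uses storageClient.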

The service was stopped before any get request had been issued. As for whether ioThreadPool has only one thread: it is created with ioThreadNum threads, as in the code below. The log reported is identical to the client log posted above (repeated "Heartbeat failed ... Connection refused" from MetaClient.cpp:118 and "Send request to [10.243.65.***:45500], exceed retry limit" from MetaClient.cpp:467).

What I cannot understand is why a storage get request would trigger the meta heartbeat mechanism. The code is as follows:

for (int t = 0; t < thNum; ++t) {
    // get ioThreadPool
    std::shared_ptr<folly::IOThreadPoolExecutor> ioThreadPool =
        std::make_shared<folly::IOThreadPoolExecutor>(ioThreadNum);
    nebula::meta::MetaClientOptions tmpoptions;
    // get meta client
    std::shared_ptr<nebula::meta::MetaClient> tmpmetaClient =
        std::make_shared<nebula::meta::MetaClient>(ioThreadPool, hostAddrs.value(), tmpoptions);
    tmpmetaClient->waitForMetadReady();
    // get storage client
    std::shared_ptr<nebula::storage::StorageClient> tmpstorageClient =
        std::make_shared<nebula::storage::StorageClient>(ioThreadPool, tmpmetaClient.get());
    metaVector.push_back(tmpmetaClient);
    clientVector.push_back(tmpstorageClient);
}

for (int t = 0; t < thNum; ++t) {
    int mode = t % size;
    int spaceId = spaceIdArr[mode];
    future[t] = async(launch::async, [t, thNum, vecKeys, clientVector, mode, spaceId] {
        std::shared_ptr<nebula::storage::StorageClient> tmpstorageClient = clientVector.at(t);
        // keys: the keys for this thread; how they are built from vecKeys is not shown in the post
        auto futureGet = tmpstorageClient->get(spaceId, std::move(keys), true);
    });
}
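
On the question of how to get an error back: independently of any Nebula timeout setting, the caller can bound how long it waits on the std::future returned by async. The following is only a minimal sketch, assuming C++17; getWithDeadline is a hypothetical helper and not part of Nebula.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <optional>

// Hypothetical helper (not a Nebula API): wait on a std::future for at most
// `timeout`. Returns std::nullopt if the result is not ready in time, so the
// caller can report an error instead of blocking forever.
template <typename T>
std::optional<T> getWithDeadline(std::future<T>& fut,
                                 std::chrono::milliseconds timeout) {
    assert(fut.valid());  // the future must still own a shared state
    if (fut.wait_for(timeout) != std::future_status::ready) {
        return std::nullopt;  // timed out (or the task is deferred)
    }
    return fut.get();  // rethrows any exception stored in the future
}
```

Note that this only lets the caller notice the stall. A std::future obtained from std::async still blocks in its destructor until the task finishes, so the stuck task itself is not cancelled.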

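From the source code we can see that a background heartbeat thread keeps checking the heartbeat status. How can we set a limit on the number of heartbeat failures?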

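The heartbeat is sent periodically and there is no limit on the number of failures. You can set how often it is sent, which is controlled by heartbeat_interval_secs. You can also add a parameter in the code so that it does not send heartbeats at all.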
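
To illustrate what a failure limit could look like if you add one in your own code, here is a minimal sketch of a periodic heartbeat loop that gives up after a number of consecutive failures. It is not Nebula's MetaClient implementation; HeartbeatLoop, sendHeartbeat, and maxFailures are all hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical sketch (not Nebula code): a periodic heartbeat loop that stops
// after maxFailures consecutive failed heartbeats.
class HeartbeatLoop {
public:
    HeartbeatLoop(std::function<bool()> sendHeartbeat,
                  std::chrono::milliseconds interval,
                  int maxFailures)
        : send_(std::move(sendHeartbeat)),
          interval_(interval),
          maxFailures_(maxFailures) {
        assert(maxFailures_ > 0);
    }

    // Runs on the calling thread until stop() is called or the failure limit
    // is reached. Returns true if it ended because of the failure limit.
    bool run() {
        int failures = 0;
        while (!stopped_.load()) {
            // A successful heartbeat resets the consecutive-failure counter.
            failures = send_() ? 0 : failures + 1;
            if (failures >= maxFailures_) {
                return true;
            }
            std::this_thread::sleep_for(interval_);
        }
        return false;
    }

    void stop() { stopped_.store(true); }

private:
    std::function<bool()> send_;
    std::chrono::milliseconds interval_;
    int maxFailures_;
    std::atomic<bool> stopped_{false};
};
```

In a real client, run() would be started on its own background thread, and the caller would mark the connection as unusable once it returns true.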

For what the heartbeat is used for, see this blog post.

Currently we use the meta client by extracting it from the source code, and this meta client seems to be of no use to us. Can we simply remove it? Please advise.

I do not quite follow. You extracted the meta client in order to use it, yet you feel the meta client is of no use. What do you mean?

Do you want to use storage only, without meta? Currently storage must rely on meta to manage metadata.

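  1. The "Heartbeat failed" message you see is not reported by ioThreadPool. It comes from a background thread.
  2. The storage client has to be built on top of the meta client. Without it, neither the schema information nor the space information can be obtained, let alone running queries.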
