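Queries hang without returning an error after the Nebula cluster is stopped

zmh0531 · April 2, 2021, 01:58

Nebula version: 1.0.0
Deployment: distributed
Hardware information
- Disk: SSD, 2.9 TB
- CPU and memory information
Detailed description of the problem

After connecting to Nebula successfully, we stopped the Nebula cluster. Queries then hang and never return an error.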

std::shared_ptr<nebula::storage::StorageClient> tmpstorageClient = std::make_shared<nebula::storage::StorageClient>(ioThreadPool, tmpmetaClient.get());
metaVector.push_back(tmpmetaClient);
clientVector.push_back(tmpstorageClient);

-------
std::shared_ptr<nebula::storage::StorageClient> tmpstorageClient = clientVector.at(t);
auto futureGet = tmpstorageClient->get(spaceId, std::move(keys), true);

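Symptoms:

Inside our business server we use the Storage Client to connect to the Nebula server. After the connection succeeds we stop the Nebula cluster and trigger another query. The call hangs in the get stage and never returns an error, while meta heartbeat errors are printed continuously.

What is the cause?
How can we get an error returned?

Client log: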

E0402 09:47:26.335455 15214 MetaClient.cpp:118] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0402 09:47:27.082682 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:28.084795 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:29.091791 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:30.098790 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:30.339421 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:30.343441 15214 MetaClient.cpp:118] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0402 09:47:31.100625 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:32.103116 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:33.105141 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:34.110198 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:34.347332 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:34.348481 15214 MetaClient.cpp:118] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused

bright-starry-sky · April 6, 2021, 03:00

Does it still not return an error after a long time? As I recall, a timeout error should be returned.

zmh0531 · April 6, 2021, 03:42

Yes. No error is returned for a long time, at least half an hour, and the meta heartbeat errors keep printing.

bright-starry-sky · April 7, 2021, 02:21

Did you stop the entire Nebula cluster, or only the storage and meta nodes?

bright-starry-sky · April 7, 2021, 02:25

@dingding, can our StorageClient use this parameter? storage_client_timeout_ms

dingding · April 7, 2021, 02:36

Yes. That is the timeout for requests sent to storage, and the default is currently 60 seconds. @zmh0531, is your StorageClient the one from the nebula repo? (nebula/StorageClient.cpp at master · vesoft-inc/nebula · GitHub)

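zmh0531 · April 7, 2021, 02:36

The whole cluster was stopped. I have tried the storage client timeout, the meta timeout, and the retry timeout. It is easy to reproduce: after the client connects successfully, sleep for 30 s, stop the cluster, and query again.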

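dingding · April 7, 2021, 02:44

Once the cluster is stopped, the StorageClient should receive an exception. Did you stop the Nebula services while a get was still in flight, then restart them and trigger another get? Also, does your ioThreadPool have only one thread? Had the get issued before the shutdown not completed?
Please post the logs of the program that uses the StorageClient.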

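ByTracy · April 17, 2021, 09:30

The services were stopped before any get request was sent. Regarding the question of whether ioThreadPool has only one thread, the log output is as follows: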

E0402 09:47:26.335455 15214 MetaClient.cpp:118] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0402 09:47:27.082682 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:28.084795 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:29.091791 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:30.098790 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:30.339421 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:30.343441 15214 MetaClient.cpp:118] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0402 09:47:31.100625 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:32.103116 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:33.105141 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:34.110198 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:34.347332 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:34.348481 15214 MetaClient.cpp:118] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused

ByTracy · April 17, 2021, 09:38

What I cannot figure out is why a storage get request triggers the meta heartbeat mechanism. The code is as follows:

for (int t = 0; t < thNum; ++t) {
    // Create the IO thread pool shared by the meta and storage clients
    std::shared_ptr<folly::IOThreadPoolExecutor> ioThreadPool =
        std::make_shared<folly::IOThreadPoolExecutor>(ioThreadNum);
    nebula::meta::MetaClientOptions tmpoptions;
    // Create the meta client and wait until metad is reachable
    std::shared_ptr<nebula::meta::MetaClient> tmpmetaClient =
        std::make_shared<nebula::meta::MetaClient>(ioThreadPool, hostAddrs.value(), tmpoptions);
    tmpmetaClient->waitForMetadReady();
    // Create the storage client on top of the meta client
    std::shared_ptr<nebula::storage::StorageClient> tmpstorageClient =
        std::make_shared<nebula::storage::StorageClient>(ioThreadPool, tmpmetaClient.get());
    metaVector.push_back(tmpmetaClient);
    clientVector.push_back(tmpstorageClient);
}

for (int t = 0; t < thNum; ++t) {
    int mode = t % size;
    int spaceId = spaceIdArr[mode];
    future[t] = async(launch::async, [t, thNum, vecKeys, clientVector, mode, spaceId] {
        // keys is presumably built from vecKeys; that step is not shown in the post
        std::shared_ptr<nebula::storage::StorageClient> tmpstorageClient = clientVector.at(t);
        auto futureGet = tmpstorageClient->get(spaceId, std::move(keys), true);
    });
}

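ByTracy · April 19, 2021, 13:04

From the source code I can see that a background heartbeat thread keeps checking the heartbeat status. How can I set a limit on the number of heartbeat failures?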

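dingding · April 19, 2021, 13:14

The heartbeat is sent periodically, and there is no limit on the number of failures. You can, however, set how often it is sent, which is controlled by heartbeat_interval_secs. You could also add a parameter in the code so that it does not send heartbeats at all.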
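
Since the heartbeat itself has no failure limit, one option is to count failures on the application side. The following is a minimal sketch under that assumption: HeartbeatTracker and its methods are hypothetical names, not part of Nebula, and it assumes your code has some place where it learns the outcome of each heartbeat (for example a patched heartbeat callback).

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper, not part of Nebula: counts consecutive heartbeat
// failures so callers can stop issuing requests once a limit is reached.
class HeartbeatTracker {
public:
    explicit HeartbeatTracker(uint32_t maxConsecutiveFailures)
        : limit_(maxConsecutiveFailures) {}

    // Call after each heartbeat attempt with its outcome.
    // A success resets the counter; a failure increments it.
    void record(bool succeeded) {
        if (succeeded) {
            failures_.store(0, std::memory_order_relaxed);
        } else {
            failures_.fetch_add(1, std::memory_order_relaxed);
        }
    }

    // True once the number of consecutive failures has reached the limit.
    bool clusterLooksDown() const {
        return failures_.load(std::memory_order_relaxed) >= limit_;
    }

private:
    const uint32_t limit_;
    std::atomic<uint32_t> failures_{0};
};
```

The query path could check clusterLooksDown() before calling get and return an error right away instead of entering the retry loop.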

yee · April 20, 2021, 01:41

For the role of the heartbeat, see this blog post.

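ByTracy · April 20, 2021, 01:51

Our current approach is to extract the meta client from the source code and use it directly, but the meta client seems to be of little use to us. Can we simply remove it? Please advise.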

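yee · April 20, 2021, 02:18

I do not quite follow. What do you mean by extracting the meta client for your own use while also finding it of little use?

Do you want to use storage only, without meta? At present storage has to rely on meta to manage metadata.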

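critical27 · April 20, 2021, 10:06

The "Heartbeat failed" message you see is not reported by ioThreadPool; it comes from a background thread.
The storage client has to be built on the meta client. Otherwise it cannot obtain schema or space information, let alone run queries.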

bright-starry-sky · April 25, 2021, 04:49

I still do not quite understand what you need. Please describe your requirements and business scenario.

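critical27 · April 25, 2021, 06:18

The code you posted and the error log do not come from the same place, so you may need to study the code more closely. As for "return an error immediately when a single lookup fails, instead of looping over all the keys", that will probably need some changes; see StorageClient::collectResponse.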

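ByTracy · April 25, 2021, 06:56

Our requirement is to use the storage service as a KV store. We pass a set of keys, possibly thousands of them, to the storage client from your source code, and it returns the corresponding set of values. The problem is that when I stop the whole cluster, no exception is reported right away. Instead, retry messages are printed once per second, and only after all of the thousands of retry messages have been printed does the program exit with an exception. That takes roughly half an hour. What we want is to learn immediately that the cluster has stopped and have the program exit; even a delay of a few seconds would be fine.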
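
One way to bound the wait to a few seconds, independent of how the client retries internally, is to put a deadline on the caller side. The sketch below uses std::future so that it stays standalone (C++17); getWithDeadline is a hypothetical helper, and the future returned by StorageClient::get is a folly future, which has its own timeout facilities, so treat this as an illustration of the idea rather than drop-in code.

```cpp
#include <chrono>
#include <future>
#include <optional>

// Hypothetical helper: waits up to `timeout` for `fut` to become ready.
// Returns the value if it arrived in time, std::nullopt otherwise.
// A deferred future (std::launch::deferred) is never ready here, so it
// also yields std::nullopt.
template <typename T>
std::optional<T> getWithDeadline(std::future<T>& fut,
                                 std::chrono::milliseconds timeout) {
    if (fut.wait_for(timeout) == std::future_status::ready) {
        return fut.get();
    }
    return std::nullopt;
}
```

A caller would treat std::nullopt as "cluster unreachable" and exit or report an error. Note that a future obtained from std::async blocks in its destructor until the task finishes, so the background work itself still needs a way to stop.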

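bright-starry-sky · April 25, 2021, 07:36

So you only want to use Nebula's distributed storage, and the stored data has nothing to do with the graph; it is plain KV, right?
I suspect this is actually a large undertaking: on the meta side the Raft-related code would need to be separated out, and on the graph side you would need to define your own query interface.
For the actual error described above, take a close look at the StorageClient::collectResponse code and adjust the error checks and the retry count to suit your needs.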
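
The kind of change suggested here, stopping early instead of retrying every key, can be illustrated in isolation. This is a minimal sketch and not the actual StorageClient::collectResponse logic: fetchOne, BatchResult, and fetchAllFailFast are hypothetical names, and a failed lookup is modeled simply as std::nullopt (C++17).

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Outcome of a batch lookup: the values fetched so far, plus a flag that
// says whether the batch was abandoned early.
struct BatchResult {
    std::vector<std::string> values;
    bool aborted = false;
};

// Hypothetical stand-in for a single-key lookup. It returns std::nullopt
// when the request fails (for example, connection refused).
using FetchFn = std::function<std::optional<std::string>(const std::string&)>;

// Looks up each key in order, but gives up as soon as `maxFailures`
// lookups have failed in total, instead of trying every remaining key.
BatchResult fetchAllFailFast(const std::vector<std::string>& keys,
                             const FetchFn& fetchOne,
                             std::size_t maxFailures) {
    BatchResult result;
    std::size_t failures = 0;
    for (const auto& key : keys) {
        std::optional<std::string> value = fetchOne(key);
        if (value) {
            result.values.push_back(std::move(*value));
            continue;
        }
        if (++failures >= maxFailures) {
            result.aborted = true;  // skip the remaining keys
            break;
        }
    }
    return result;
}
```

With a budget of 3 and a cluster that is down, a batch of 1000 keys stops after 3 failed lookups, so at one retry message per second the caller learns of the outage in seconds rather than after the whole batch.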