- Nebula version: 1.0.0
- Deployment: distributed (cluster)
- Hardware info
- Problem description
After connecting to Nebula successfully, we stop the Nebula cluster; the query then hangs and never returns an error.
std::shared_ptr<nebula::storage::StorageClient> tmpstorageClient = std::make_shared<nebula::storage::StorageClient>(ioThreadPool, tmpmetaClient.get());
metaVector.push_back(tmpmetaClient);
clientVector.push_back(tmpstorageClient);
-------
std::shared_ptr<nebula::storage::StorageClient> tmpstorageClient = clientVector.at(t);
auto futureGet = tmpstorageClient->get(spaceId, std::move(keys), true);
Symptom:
After our service connects to the Nebula server via StorageClient, we stop the Nebula cluster and trigger a query again. It hangs at the get stage without ever returning an error, while meta heartbeat error messages are printed continuously.
- What is the cause?
- How can we make it return an error?
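One generic way to stop a call like this from blocking forever is to wait on the future with a deadline instead of blocking unconditionally. Below is a minimal, self-contained sketch of that pattern using `std::future` (our own helper, not a Nebula API); the same idea should apply to the folly future returned by `StorageClient::get`, e.g. via `folly::Future::wait(Duration)` or `.within(Duration)`:

```cpp
#include <chrono>
#include <future>

// Hypothetical fail-fast wrapper: wait for an async result with a deadline
// instead of blocking indefinitely. Returns false when the deadline is hit,
// which the caller can treat as "cluster unreachable".
template <typename T>
bool getWithTimeout(std::future<T>& fut, std::chrono::milliseconds deadline, T& out) {
    if (fut.wait_for(deadline) == std::future_status::ready) {
        out = fut.get();   // result arrived in time
        return true;
    }
    return false;          // deadline hit: give up instead of hanging
}
```

The caller decides the deadline per call, so a dead cluster is detected within seconds rather than after thousands of per-key retries.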
Client logs:
E0402 09:47:26.335455 15214 MetaClient.cpp:118] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0402 09:47:27.082682 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:28.084795 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:29.091791 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:30.098790 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:30.339421 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:30.343441 15214 MetaClient.cpp:118] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0402 09:47:31.100625 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:32.103116 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:33.105141 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:34.110198 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:34.347332 15213 MetaClient.cpp:467] Send request to [10.243.65.***:45500], exceed retry limit
E0402 09:47:34.348481 15214 MetaClient.cpp:118] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
It never returns an error, even after a long time? I recall there is a timeout error that should be returned.
Yes, it does not return an error for a long time and keeps printing meta heartbeat errors, for at least half an hour.
Did you stop the whole Nebula cluster, or only the storage nodes and meta nodes?
@dingding, can our StorageClient use this parameter: storage_client_timeout_ms?
We stopped the whole cluster. We have already tried the storage client timeout, meta timeout, and retry timeout. It is easy to reproduce: after the client connects successfully, sleep 30 s, stop the cluster, then query again.
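For reference, and assuming the client binary parses these gflags (the flag names come from this thread; whether they take effect in this code path is exactly what is being debugged), the knobs could be lowered when starting the client process so that RPC failures surface sooner:

```
--storage_client_timeout_ms=5000   # per-RPC timeout used by StorageClient
--heartbeat_interval_secs=3        # how often MetaClient sends heartbeats
```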
After the cluster is stopped, the StorageClient should receive an exception. Did you stop the Nebula service while a get was still in flight with no result yet, then restart the service and trigger another get? Also, does your ioThreadPool have only one thread? Was the get issued before the shutdown left unfinished?
Please paste the logs of the program that uses StorageClient.
The service was stopped before the get request was issued. Yes, the ioThreadPool has only one thread. The logs are as follows:
(same `Heartbeat failed` / `exceed retry limit` lines as in the log above, with one retry line printed per second)
What I cannot understand is why a storage get request triggers the meta heartbeat mechanism. The code is as follows:
for (int t = 0; t < thNum; ++t) {
    // create the ioThreadPool
    std::shared_ptr<folly::IOThreadPoolExecutor> ioThreadPool =
        std::make_shared<folly::IOThreadPoolExecutor>(ioThreadNum);
    nebula::meta::MetaClientOptions tmpoptions;
    // create the meta client
    std::shared_ptr<nebula::meta::MetaClient> tmpmetaClient =
        std::make_shared<nebula::meta::MetaClient>(ioThreadPool, hostAddrs.value(), tmpoptions);
    tmpmetaClient->waitForMetadReady();
    // create the storage client
    std::shared_ptr<nebula::storage::StorageClient> tmpstorageClient =
        std::make_shared<nebula::storage::StorageClient>(ioThreadPool, tmpmetaClient.get());
    metaVector.push_back(tmpmetaClient);
    clientVector.push_back(tmpstorageClient);
}
for (int t = 0; t < thNum; ++t) {
    int mode = t % size;
    int spaceId = spaceIdArr[mode];
    future[t] = async(launch::async, [t, thNum, vecKeys, clientVector, mode, spaceId] {
        std::shared_ptr<nebula::storage::StorageClient> tmpstorageClient = clientVector.at(t);
        auto keys = vecKeys;  // local copy so std::move below is valid
        auto futureGet = tmpstorageClient->get(spaceId, std::move(keys), true);
    });
}
From reading the source, the background heartbeat thread keeps checking the heartbeat status. How do we set a limit on the number of heartbeat failures?
The heartbeat is sent periodically; there is no limit on the failure count. You can set the sending frequency, which is controlled by heartbeat_interval_secs, or you can change the code so it does not send heartbeats at all.
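Since MetaClient itself has no failure-count limit, one would have to add a counter in the heartbeat loop. Below is a self-contained sketch of the idea (our own hypothetical `heartbeatLoop`, not Nebula code): bound the number of consecutive failures and surface the error instead of retrying forever.

```cpp
#include <functional>

// Sketch: run up to `iterations` heartbeats; give up once `maxFailures`
// consecutive heartbeats fail. Returns true when the loop gave up,
// i.e. the caller should treat the cluster as down.
bool heartbeatLoop(const std::function<bool()>& sendHeartbeat,
                   int maxFailures, int iterations) {
    int consecutive = 0;
    for (int i = 0; i < iterations; ++i) {
        if (sendHeartbeat()) {
            consecutive = 0;              // success resets the counter
        } else if (++consecutive >= maxFailures) {
            return true;                  // surface the error to the caller
        }
    }
    return false;
}
```

With `maxFailures = 3` and `heartbeat_interval_secs = 3`, a dead cluster would be reported within roughly ten seconds instead of never.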
Currently we have extracted the meta client from the source code and use it directly, but the meta client does not seem useful for our case. Can we just remove it? Please advise.
I don't quite understand what you mean by extracting the meta client to use it, yet finding the meta client useless.
Do you want to use only storage without meta? Right now storage must rely on meta to manage metadata.
I still don't quite understand your requirement. Please describe your requirement and business scenario.
The code you pasted and the error logs are not from the same place; you may need to study the code more carefully. Also, "return an error immediately when one lookup fails, instead of looping over all the keys" probably requires some changes; see StorageClient::collectResponse.
Our requirement is to use the storage service as a KV store. We pass a set of keys (possibly thousands) to your StorageClient and get back the corresponding values. The problem is that when I stop the whole cluster, no exception is raised immediately; instead, retry messages are printed once per second, and only after thousands of them have been printed does the program exit on an exception, which takes about half an hour. What we want is to know immediately, even within a few seconds, that the cluster has stopped, so the program can exit.
So you only want Nebula's distributed storage, and the stored data has nothing to do with graphs, just plain KV, right?
I suspect this is quite a large project: on the meta side the raft-related code would need to be extracted, and on the graph side you would need to define your own query interfaces.
As for the actual error above, read the code of StorageClient::collectResponse carefully and modify the error-handling and retry count to fit your needs.
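To illustrate the fail-fast behavior being asked for, here is a self-contained sketch (a hypothetical helper, not Nebula code) of collecting a batch of key lookups while aborting on the first failure, instead of retrying every one of thousands of keys the way the per-key retry loop does:

```cpp
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Fetch every key via `fetch`; return nullopt as soon as any single
// lookup fails, so a dead cluster is reported after one failure rather
// than after a retry cycle per key.
std::optional<std::unordered_map<std::string, std::string>>
getAllOrFail(const std::vector<std::string>& keys,
             const std::function<std::optional<std::string>(const std::string&)>& fetch) {
    std::unordered_map<std::string, std::string> result;
    for (const auto& k : keys) {
        auto v = fetch(k);
        if (!v) {
            return std::nullopt;  // first failure: bail out immediately
        }
        result.emplace(k, *v);
    }
    return result;
}
```

Applying the same "first error aborts the batch" rule inside `StorageClient::collectResponse` (plus a low per-RPC timeout) is what would turn the half-hour hang into a failure within seconds.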