storaged restart drops the connection, but graphd does not re-create it

  • Nebula version: master
  • Deployment (distributed / standalone / Docker / DBaaS): distributed, 3 machines
  • Hardware
    • Disk (SSD recommended)
    • CPU and memory: 40-core CPU, 256 GB memory
  • Problem description:

After storaged restarted, graphd kept using the connection cached in clientManager, so requests failed. The log is as follows:

I0205 10:18:48.595340 30860 GetVerticesExecutor.cpp:36] GV input var: __VAR_0 iter kind: sequential iterator
I0205 10:18:48.595355 30860 GetVerticesExecutor.cpp:40] src vid: 1
I0205 10:18:48.595451 30845 ThriftClientManager.inl:21] Getting a client to [xxx.xxx.xxx.xxx:44500]
E0205 10:18:48.595679 30845 StorageClientBase.inl:223] Request to [xxx.xxx.xxx.xxx:44500] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
I0205 10:18:48.595712 30845 StorageClientBase.inl:160] Invalidate the leader for [6, 6]
I0205 10:18:48.595779 30860 GetVerticesExecutor.cpp:74] Get props time: 418us
E0205 10:18:48.595806 30860 StorageAccessExecutor.h:32] GetVerticesExecutor failed, error E_RPC_FAILURE, part 6
E0205 10:18:48.595820 30860 StorageAccessExecutor.h:103] Storage Error: part: 6, error code: -3.
E0205 10:18:48.595866 30860 QueryInstance.cpp:124] Storage Error: part: 6, error code: -3.

With multiple replicas, a new leader is elected when storage restarts, and meta syncs the new leader information to graph. Could you post the logs from your meta?

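The leader is indeed re-elected. The problem is that this connection is never released. The error log also shows "got EOF", which means the server side closed the connection. But the reconnect in ReconnectingRequestChannel never takes effect.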
E0205 10:18:48.595679 30845 StorageClientBase.inl:223] Request to [xxx.xxx.xxx.xxx:44500] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.

@darionyaphet, could you please take a look at this?

ReconnectingRequestChannel does not work well here: its good check needs a probe before it takes effect.
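
To illustrate the point above, here is a toy C++ model, not the code from nebula-common or fbthrift. Every name in it (FakeServer, FakeClient, ClientCache, requestWithEvict, runDemo) is hypothetical, and it assumes that good() only reports the last probed state and that a failed request does not update it. Under that assumption the cache keeps handing out the stale client after a restart, while evicting the client on failure lets the next attempt build a fresh one.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for storaged; `generation` changes on every restart.
struct FakeServer {
    int generation = 0;
    void restart() { ++generation; }
};

// Hypothetical stand-in for a cached client connection.
class FakeClient {
public:
    explicit FakeClient(const FakeServer& s)
        : server_(s), generation_(s.generation) {}

    // Assumption: reports the last *probed* state, it does not touch the socket.
    bool good() const { return good_; }

    // Explicit probe: the only place where good_ is refreshed in this model.
    void probe() { good_ = (generation_ == server_.generation); }

    // Sends a request; fails if the server restarted since we connected.
    // Assumption: a failed send does not flip good_.
    bool send() const { return generation_ == server_.generation; }

private:
    const FakeServer& server_;
    int generation_;
    bool good_ = true;
};

// Hypothetical client cache keyed by address.
class ClientCache {
public:
    std::shared_ptr<FakeClient> get(const std::string& addr,
                                    const FakeServer& server) {
        auto it = clients_.find(addr);
        if (it != clients_.end() && it->second->good()) {
            return it->second;  // reused even if the peer already closed it
        }
        auto client = std::make_shared<FakeClient>(server);
        clients_[addr] = client;
        ++created_;
        return client;
    }

    void evict(const std::string& addr) { clients_.erase(addr); }
    int created() const { return created_; }

private:
    std::map<std::string, std::shared_ptr<FakeClient>> clients_;
    int created_ = 0;
};

// Sends once; on failure evicts the cached client and retries once.
bool requestWithEvict(ClientCache& cache, const std::string& addr,
                      const FakeServer& server) {
    for (int attempt = 0; attempt < 2; ++attempt) {
        auto client = cache.get(addr, server);
        if (client->send()) {
            return true;
        }
        cache.evict(addr);  // drop the stale connection before retrying
    }
    return false;
}

struct DemoResult {
    bool before;      // request before the restart
    bool staleReuse;  // request after the restart, no eviction
    bool afterEvict;  // request after the restart, with eviction
    int created;      // number of clients created in total
};

DemoResult runDemo() {
    FakeServer server;
    ClientCache cache;
    const std::string addr = "storaged:44500";  // hypothetical address
    DemoResult r{};
    r.before = cache.get(addr, server)->send();
    server.restart();
    r.staleReuse = cache.get(addr, server)->send();
    r.afterEvict = requestWithEvict(cache, addr, server);
    r.created = cache.created();
    return r;
}
```

In this model the request right after the restart fails because the cache still returns the old client, and only the evict-and-retry path creates a second client. Whether the real fix should evict in the client manager or probe inside the channel is a design choice this sketch does not settle.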

pull request for this problem: Fix problem: storage restarted but reconnect is not working. by guojun85 · Pull Request #411 · vesoft-inc/nebula-common · GitHub

Thanks for your PR :handshake:, we will review it as soon as possible.


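I need to take another look. On my side, the errors after a restart all look like N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, and with that kind of error the master code reconnects correctly.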