Intermittent failures when writing data with go-client, connection refused: Failed to open transport, error: dial tcp xxx: connect: connection refused

  • Nebula version: v2.6.1
  • Deployment method: nebula-operator
  • Installation method: K8s
  • Production deployment: Y
  • Hardware info
    • CPU / memory info
  • Detailed description of the problem
  • Relevant meta / storage / graph info logs (plain text preferred so they are searchable)

Writes mostly succeed: sometimes 100 consecutive write calls totaling 100k records go through without any problem, but sometimes a write suddenly fails.

INFO[2021-12-06 15:12:47] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:21180" status=OK took_ns=14.781499ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 15:12:47] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:13259" status=OK took_ns=18.763531ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 15:12:47] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:18789" status=OK took_ns=17.779718ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 15:12:47] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:36441" status=OK took_ns=20.245243ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 15:12:47] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:37424" status=OK took_ns=18.627495ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 15:12:47] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:5625" status=OK took_ns=25.176534ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 15:12:47] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:44506" status=OK took_ns=21.07159ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:54318" status=OK took_ns=20.078816ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:29059" status=OK took_ns=19.695645ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:27381" status=OK took_ns=18.106921ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:6209" status=OK took_ns=17.49587ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:11548" status=OK took_ns=25.973524ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:48922" status=OK took_ns=16.251233ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:1986" status=OK took_ns=17.417241ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:60143" status=OK took_ns=17.027376ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:30840" status=OK took_ns=22.861761ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:46159" status=OK took_ns=15.116735ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:8077" status=OK took_ns=17.657854ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:49690" status=OK took_ns=15.406722ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:40085" status=OK took_ns=16.604359ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
2021/12/06 16:01:09 [INFO] Successfully reconnect to host: 170.18.9.200, port: 31969
INFO[2021-12-06 16:01:09] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:22514" status=OK took_ns=3.182172427s user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:09] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:14238" status=OK took_ns=19.305098ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:09] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:36784" status=OK took_ns=16.681084ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:09] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:53993" status=OK took_ns=14.690944ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
2021/12/06 16:01:13 [ERROR] Failed to reconnect, Failed to open transport, error: dial tcp 170.18.9.200:31969: connect: connection refused
ERRO[2021-12-06 16:01:13] graphservice_grpc.pb_impl.go:148 错误原因:Failed to open transport, error: dial tcp 170.18.9.200:31969: connect: connection refused 
INFO[2021-12-06 16:01:13] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:62730" status=OK took_ns=3.390351866s user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
Failed to open transport, error: dial tcp 170.18.9.200:31969: connect: connection refusedERRO[2021-12-06 16:01:13] /pb.GraphService/SyncNewSchemaData            err="rpc error: code = Internal desc = error is runtime error: invalid memory address or nil pointer dereference" peer="170.18.9.200:36651" status=KO took_ns=10.477364ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
2021/12/08 16:08:32 [ERROR] Failed to reconnect, Failed to open transport, error: dial tcp 170.18.9.200:31969: connect: connection refused
ERRO[2021-12-08 16:08:32] graphservice_grpc.pb_impl.go:148 错误原因:Failed to open transport, error: dial tcp 170.18.9.200:31969: connect: connection refused 
INFO[2021-12-08 16:08:32] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:56264" status=OK took_ns=4.694826314s user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
Failed to open transport, error: dial tcp 170.18.9.200:31969: connect: connection refusedERRO[2021-12-08 16:08:32] /pb.GraphService/SyncNewSchemaData            err="rpc error: code = Internal desc = error is runtime error: invalid memory address or nil pointer dereference" peer="170.18.9.200:40679" status=KO took_ns=6.127708ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"

I've tried versions 2.0, 2.5.0, and 2.6.1, and the problem persists in all of them. I'm not sure where the issue is, but I lean towards a problem in my code. Every call fetches a session from the pool and executes:

func Execute(gql string) (*nebula.ResultSet, error) {
	session, err := connectionPool.GetSession(username, password)
	if err != nil {
		// Without this early return, a failed GetSession leaves session nil
		// and the deferred Release below panics with a nil pointer dereference.
		return nil, fmt.Errorf("failed to get session: %w", err)
	}
	defer session.Release()

	resultSet, err := session.Execute(spaceDesc.UseSpaceString())
	if err != nil {
		return nil, err
	}
	if !resultSet.IsSucceed() {
		return nil, fmt.Errorf("failed to use nebula space %s", spaceDesc.Name)
	}

	return session.Execute(gql)
}

What I've observed on v2.6.1 is that every failed write happens because the graphd pod dies, but the logs don't tell me much: a restart creates a new log file, and the previous one doesn't reveal any cause.

[root@master ~]# kubectl get pod | grep nebula
mdm-nebula-graphd-0                       1/1     Running     7          2d18h
mdm-nebula-metad-0                        1/1     Running     0          2d18h
mdm-nebula-storaged-0                     1/1     Running     0          2d18h

Looking through the earlier logs didn't turn up any key information either.

sh-4.2# ls -lt
total 212
-rw-r--r--. 1 root root   3774 Dec  9 01:48 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211209-014731.1
lrwxrwxrwx. 1 root root     65 Dec  9 01:47 nebula-graphd.INFO -> nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211209-014731.1
-rw-r--r--. 1 root root   7339 Dec  9 01:47 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211208-100029.1
-rw-r--r--. 1 root root   4330 Dec  8 10:00 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211208-094334.1
-rw-r--r--. 1 root root   4330 Dec  8 09:43 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211208-093332.1
-rw-r--r--. 1 root root 104338 Dec  8 09:33 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211208-080843.1
-rw-r--r--. 1 root root   4427 Dec  8 08:08 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211206-080114.1
-rw-r--r--. 1 root root  30658 Dec  6 08:01 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211206-070612.1
-rw-r--r--. 1 root root   8474 Dec  6 07:06 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211206-070117.1
-rw-r--r--. 1 root root   3135 Dec  6 07:01 graphd-stderr.log
-rw-r--r--. 1 root root   3284 Dec  6 07:01 nebula-graphd.mdm-nebula-graphd-0.root.log.ERROR.20211206-070117.1
-rw-r--r--. 1 root root   3284 Dec  6 07:01 nebula-graphd.mdm-nebula-graphd-0.root.log.WARNING.20211206-070117.1
lrwxrwxrwx. 1 root root     68 Dec  6 07:01 nebula-graphd.WARNING -> nebula-graphd.mdm-nebula-graphd-0.root.log.WARNING.20211206-070117.1
lrwxrwxrwx. 1 root root     66 Dec  6 07:01 nebula-graphd.ERROR -> nebula-graphd.mdm-nebula-graphd-0.root.log.ERROR.20211206-070117.1
-rw-r--r--. 1 root root      0 Dec  6 07:01 graphd-stdout.log

Guys, if there are any logs in here you need to see, ping me and I'll post them.

Paging @wey @kevin.qiao

Each large time gap in the log list corresponds to one pod restart.

A restart doesn't delete the log files already on the mounted disk, so you can check the error log. Also, what does the load on the graph node look like while you're writing, and is there any abnormal output in the system logs?


The error log hasn't been updated (you can see from its modification time that it dates from the day of deployment), and the pod log only ever contains this:

[root@master ~]# kubectl logs -p mdm-nebula-graphd-0 
++ hostname
++ hostname
+ exec /usr/local/nebula/bin/nebula-graphd --flagfile=/usr/local/nebula/etc/nebula-graphd.conf --meta_server_addrs=mdm-nebula-metad-0.mdm-nebula-metad-headless.default.svc.cluster.local:9559 --local_ip=mdm-nebula-graphd-0.mdm-nebula-graphd-svc.default.svc.cluster.local --ws_ip=mdm-nebula-graphd-0.mdm-nebula-graphd-svc.default.svc.cluster.local --daemonize=false

How do I check the graph node's load? In any case, in today's test the error appeared as soon as the write started, so it probably hasn't hit a load problem yet.

Check it with htop. But first we need to confirm why connecting to graph fails: what address are you using to connect to graph?

Port 9669, mapped and forwarded through a Kubernetes NodePort. The failure happens because the graphd pod restarts, and the restart is triggered by a query or a write.
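
For reference, the connection pool on my side is created roughly like this (a minimal sketch assuming the nebula-go v2.x API; the node IP and NodePort 31969 are the ones that show up in the client logs above):

func newPool() (*nebula.ConnectionPool, error) {
	// The client reaches graphd through the K8s NodePort that fronts 9669,
	// so the address is the node IP plus the NodePort.
	hosts := []nebula.HostAddress{{Host: "170.18.9.200", Port: 31969}}
	return nebula.NewConnectionPool(hosts, nebula.GetDefaultConf(), nebula.DefaultLogger{})
}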

Check whether a core file was produced after the restart; you can refer to this link.

My testing shows that the graph restarts on v2.6.1 are caused by executing a particular gql statement: one thousand copies of (lookup on item where item.recordID==-5001892981448333553 yield item.recordID as recordID,item.version as version | order by $-.version desc | limit 1) chained together with union. Executing it throws an error and triggers the restart. I've changed it to chain 500 with union instead and haven't seen the error since. Hopefully that's the cause :laughing:
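
Roughly what the batching looks like now (a sketch only; it reuses the Execute helper above, assumes fmt and strings are imported, and 500 is just the chunk size that happens to work here):

func lookupLatestVersions(recordIDs []int64) error {
	const batchSize = 500 // chaining ~1000 lookups with union crashes graphd
	for start := 0; start < len(recordIDs); start += batchSize {
		end := start + batchSize
		if end > len(recordIDs) {
			end = len(recordIDs)
		}
		// Build one statement per chunk instead of one giant union over everything.
		parts := make([]string, 0, end-start)
		for _, id := range recordIDs[start:end] {
			parts = append(parts, fmt.Sprintf(
				"(lookup on item where item.recordID==%d yield item.recordID as recordID,item.version as version | order by $-.version desc | limit 1)", id))
		}
		if _, err := Execute(strings.Join(parts, " union ")); err != nil {
			return err
		}
	}
	return nil
}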

Keep an eye on it, and if you run into any other problems, feel free to come back and update this thread.

OK. Hopefully reducing the size of the query statements is enough for normal use.

This is more or less a known issue: the stack overflows.

One option is to see whether the union can be rewritten as in (1,2,3).
The other thing is that we will limit the stack size later on, so that exceeding it just returns an error; that way graph at least won't crash and exit.
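
A rough sketch of what the union-to-in rewrite from the first point could look like on the client side (hypothetical helper, assuming fmt, strconv, and strings are imported; note the per-recordID order by / limit 1 from the original statement is dropped and would have to be handled separately):

func buildInLookup(recordIDs []int64) string {
	// One lookup with an IN list instead of many union'ed lookups.
	ids := make([]string, 0, len(recordIDs))
	for _, id := range recordIDs {
		ids = append(ids, strconv.FormatInt(id, 10))
	}
	return fmt.Sprintf(
		"lookup on item where item.recordID in (%s) yield item.recordID as recordID,item.version as version",
		strings.Join(ids, ","))
}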


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.