使用go-client写入数据时不定时失败，拒绝访问，Failed to open transport, error: dial tcp xxx: connect: connection refused

LongZaiAiYueLiang · 2021 年12 月 8 日 09:03

nebula 版本：v2.6.1
部署方式：nebula-operator
安装方式：K8S
是否为线上版本：Y
硬件信息
- CPU、内存信息
问题的具体描述
相关的 meta / storage / graph info 日志信息（尽量使用文本形式方便检索）

部分写入数据，有时候连续写入10W数据，100次写入记录都没问题，但是有时会突然写入失败

INFO[2021-12-06 15:12:47] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:21180" status=OK took_ns=14.781499ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 15:12:47] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:13259" status=OK took_ns=18.763531ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 15:12:47] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:18789" status=OK took_ns=17.779718ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 15:12:47] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:36441" status=OK took_ns=20.245243ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 15:12:47] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:37424" status=OK took_ns=18.627495ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 15:12:47] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:5625" status=OK took_ns=25.176534ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 15:12:47] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:44506" status=OK took_ns=21.07159ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:54318" status=OK took_ns=20.078816ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:29059" status=OK took_ns=19.695645ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:27381" status=OK took_ns=18.106921ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:6209" status=OK took_ns=17.49587ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:11548" status=OK took_ns=25.973524ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:48922" status=OK took_ns=16.251233ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:1986" status=OK took_ns=17.417241ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:60143" status=OK took_ns=17.027376ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:30840" status=OK took_ns=22.861761ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:46159" status=OK took_ns=15.116735ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:8077" status=OK took_ns=17.657854ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:49690" status=OK took_ns=15.406722ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:06] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:40085" status=OK took_ns=16.604359ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
2021/12/06 16:01:09 [INFO] Successfully reconnect to host: 170.18.9.200, port: 31969
INFO[2021-12-06 16:01:09] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:22514" status=OK took_ns=3.182172427s user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:09] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:14238" status=OK took_ns=19.305098ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:09] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:36784" status=OK took_ns=16.681084ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
INFO[2021-12-06 16:01:09] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:53993" status=OK took_ns=14.690944ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
2021/12/06 16:01:13 [ERROR] Failed to reconnect, Failed to open transport, error: dial tcp 170.18.9.200:31969: connect: connection refused
ERRO[2021-12-06 16:01:13] graphservice_grpc.pb_impl.go:148 错误原因:Failed to open transport, error: dial tcp 170.18.9.200:31969: connect: connection refused 
INFO[2021-12-06 16:01:13] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:62730" status=OK took_ns=3.390351866s user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
Failed to open transport, error: dial tcp 170.18.9.200:31969: connect: connection refusedERRO[2021-12-06 16:01:13] /pb.GraphService/SyncNewSchemaData            err="rpc error: code = Internal desc = error is runtime error: invalid memory address or nil pointer dereference" peer="170.18.9.200:36651" status=KO took_ns=10.477364ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
2021/12/08 16:08:32 [ERROR] Failed to reconnect, Failed to open transport, error: dial tcp 170.18.9.200:31969: connect: connection refused
ERRO[2021-12-08 16:08:32] graphservice_grpc.pb_impl.go:148 错误原因:Failed to open transport, error: dial tcp 170.18.9.200:31969: connect: connection refused 
INFO[2021-12-08 16:08:32] /pb.GraphService/SyncNewSchemaData            err="<nil>" peer="170.18.9.200:56264" status=OK took_ns=4.694826314s user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"
Failed to open transport, error: dial tcp 170.18.9.200:31969: connect: connection refusedERRO[2021-12-08 16:08:32] /pb.GraphService/SyncNewSchemaData            err="rpc error: code = Internal desc = error is runtime error: invalid memory address or nil pointer dereference" peer="170.18.9.200:40679" status=KO took_ns=6.127708ms user-agent="[grpc-python/1.33.2 grpc-c/13.0.0 (linux; chttp2)]"

尝试过2.0版本，2.5.0和2.6.1版本，但是一直存在这种问题，我现在不确定哪里有问题，但倾向于代码问题，每次执行都去获取session执行

func Execute(gql string) (*nebula.ResultSet, error) {
	session, err := connectionPool.GetSession(username, password)
	if err != nil {
		fmt.Print(err.Error())
	}

	defer session.Release()

	resultSet, err := session.Execute(spaceDesc.UseSpaceString())
	if err != nil {
		fmt.Print(err.Error())
	}
	if !resultSet.IsSucceed() {
		fmt.Errorf("nebula使用空间%s失败", spaceDesc.Name)
	}

	resultSet, err = session.Execute(gql)
	return resultSet, err
}

LongZaiAiYueLiang · 2021 年12 月 9 日 01:53

这边观察出来，v2.6.1版本，每次写入数据失败，都是因为graphd的pod死掉，但是看日志也看不来东西，因为重启会创建新的日志，在上一次的日志中观察不到啥原因

[root@master ~]# kubectl get pod | grep nebula
mdm-nebula-graphd-0                       1/1     Running     7          2d18h
mdm-nebula-metad-0                        1/1     Running     0          2d18h
mdm-nebula-storaged-0                     1/1     Running     0          2d18h

观察之前的日志也没有发现啥关键信息

sh-4.2# ls -lt
total 212
-rw-r--r--. 1 root root   3774 Dec  9 01:48 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211209-014731.1
lrwxrwxrwx. 1 root root     65 Dec  9 01:47 nebula-graphd.INFO -> nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211209-014731.1
-rw-r--r--. 1 root root   7339 Dec  9 01:47 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211208-100029.1
-rw-r--r--. 1 root root   4330 Dec  8 10:00 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211208-094334.1
-rw-r--r--. 1 root root   4330 Dec  8 09:43 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211208-093332.1
-rw-r--r--. 1 root root 104338 Dec  8 09:33 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211208-080843.1
-rw-r--r--. 1 root root   4427 Dec  8 08:08 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211206-080114.1
-rw-r--r--. 1 root root  30658 Dec  6 08:01 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211206-070612.1
-rw-r--r--. 1 root root   8474 Dec  6 07:06 nebula-graphd.mdm-nebula-graphd-0.root.log.INFO.20211206-070117.1
-rw-r--r--. 1 root root   3135 Dec  6 07:01 graphd-stderr.log
-rw-r--r--. 1 root root   3284 Dec  6 07:01 nebula-graphd.mdm-nebula-graphd-0.root.log.ERROR.20211206-070117.1
-rw-r--r--. 1 root root   3284 Dec  6 07:01 nebula-graphd.mdm-nebula-graphd-0.root.log.WARNING.20211206-070117.1
lrwxrwxrwx. 1 root root     68 Dec  6 07:01 nebula-graphd.WARNING -> nebula-graphd.mdm-nebula-graphd-0.root.log.WARNING.20211206-070117.1
lrwxrwxrwx. 1 root root     66 Dec  6 07:01 nebula-graphd.ERROR -> nebula-graphd.mdm-nebula-graphd-0.root.log.ERROR.20211206-070117.1
-rw-r--r--. 1 root root      0 Dec  6 07:01 graphd-stdout.log

大哥们，这边里面有需要看的日志，可以call我发出来

LongZaiAiYueLiang · 2021 年12 月 9 日 01:55

摇人 @wey @kevin.qiao

LongZaiAiYueLiang · 2021 年12 月 9 日 01:55

日志列表中每个大的时间段都是一次pod重启

kevin.qiao · 2021 年12 月 9 日 02:26

重启后不会删除之前挂载磁盘里的日志文件，可以看看error日志，另外写入数据时graph节点负载怎样的，系统日志里有没有看到异常输出

LongZaiAiYueLiang · 2021 年12 月 9 日 02:29

err日志并没有更新，可以看更新时间，是部署那天的，而pod日志一直只有这个

[root@master ~]# kubectl logs -p mdm-nebula-graphd-0 
++ hostname
++ hostname
+ exec /usr/local/nebula/bin/nebula-graphd --flagfile=/usr/local/nebula/etc/nebula-graphd.conf --meta_server_addrs=mdm-nebula-metad-0.mdm-nebula-metad-headless.default.svc.cluster.local:9559 --local_ip=mdm-nebula-graphd-0.mdm-nebula-graphd-svc.default.svc.cluster.local --ws_ip=mdm-nebula-graphd-0.mdm-nebula-graphd-svc.default.svc.cluster.local --daemonize=false

LongZaiAiYueLiang · 2021 年12 月 9 日 02:31

graph节点负载怎么看，不过我今天测的情况是一开始写入就报错，应该还没到负载吧

kevin.qiao · 2021 年12 月 9 日 04:59

htop查看，不过要先确认你连接到graph失败的问题，你连接graph是用的什么地址

LongZaiAiYueLiang · 2021 年12 月 9 日 05:11

9669端口，使用k8s的nodeport进行映射转发，失败是因为graphd的pod重启了，这个重启是因为查询或者写入导致的

kevin.qiao · 2021 年12 月 9 日 06:08

可以查看重启后有没有产生core文件，可以参考这个链接

LongZaiAiYueLiang · 2021 年12 月 13 日 02:44

我这边测试发现v2.6.1的graph重启原因是执行某条gql语句，一千条(lookup on item where item.recordID==-5001892981448333553 yield item.recordID as recordID,item.version as version | order by $-.version desc | limit 1)使用union相连，执行就会报错，导致重启，所以我这里改为500条union相连，暂时没发现错误，希望是这个原因

steam · 2021 年12 月 13 日 06:02

你可以后续留意下，如果还有其他问题的话，欢迎来更新下帖子哈~

LongZaiAiYueLiang · 2021 年12 月 13 日 06:56

好的希望可以通过减少查询语句的大小来实现正常使用

HarrisChu · 2021 年12 月 13 日 07:27

算是已知问题吧，栈溢出了。

一个是看看能不能把 union 改写成 in (1,2,3)
另外就是我们后续会限制栈大小，如果超过了就直接报错了，这样最少 graph 不会崩溃退出。

github.com/vesoft-inc/nebula

fix crash when the expression exceed the depth

vesoft-inc:master ← heroicNeZha:limit_depth_of_expression

opened 07:17AM - 08 Dec 21 UTC

heroicNeZha

+532 -0

#### What type of PR is this? - [x] bug - [ ] feature - [ ] enhancement ##…## What does this PR do? fix crash When the expression's depth exceeds the maximum limit （The depth is related to the stack size. This pr limited 815 is calculated based on the stack size of the development machine） #### Which issue(s)/PR(s) this PR relates to? fixed #3393 #### Special notes for your reviewer, ex. impact of this fix, etc: #### Additional context: #### Checklist： - [ ] Documentation affected （Please add the label if documentation needs to be modified.) - [ ] Incompatible （If it is incompatible, please describe it and add corresponding label.） - [ ] Need to cherry-pick （If need to cherry-pick to some branches, please label the destination version(s).) - [ ] Performance impacted: Consumes more CPU/Memory #### Release notes： Please confirm whether to reflect in release notes and how to describe: > `

system · 2022 年1 月 10 日 08:31

此话题已在最后回复的 7 天后被自动关闭。不再允许新回复。