nebula-graph service frequently OOMs

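  • nebula version: 3.1.0
  • Deployment: distributed
  • Installation: compiled from source
  • In production: Y
  • Hardware
    • Disk: 10 TB SSD
    • CPU: 64 cores
    • Memory: 256 GB
  • Detailed description of the problem
  • The memory used by the nebula-graph process keeps rising until the service hits OOM; no other errors are logged.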
E20230721 12:35:00.578649 131791 GraphSessionManager.cpp:260] Update sessions failed: Session not existed!
E20230721 12:35:00.578802 131826 GraphSessionManager.cpp:284] Update sessions failed: Update sessions failed: Session not existed!
E20230723 10:47:00.124019 131668 QueryInstance.cpp:137] Used memory hits the high watermark(0.500000) of total system memory.
E20230723 10:47:00.130874 131658 QueryInstance.cpp:137] Used memory hits the high watermark(0.500000) of total system memory.
E20230723 10:47:00.132870 131668 QueryInstance.cpp:137] Used memory hits the high watermark(0.500000) of total system memory.

You can raise the memory watermark a bit; see this link: https://docs.nebula-graph.com.cn/3.0.0/20.appendix/0.FAQ/#error_-1005_used_memory_hits_the_high_watermark0800000_of_total_system_memory

The log records which statement triggered the high watermark. You can analyze that statement to see whether it hit a super node.

That parameter doesn't help much. Memory usage is normally around 10%, and once the problem occurs it can't be controlled at all.

The nebula-graph INFO log doesn't contain the actual query statements:

I20230723 10:46:01.253739 131688 SwitchSpaceExecutor.cpp:37] Graph switched to `risk_cx_yaoka', space id: 36
I20230723 10:46:01.256418 131658 SwitchSpaceExecutor.cpp:37] Graph switched to `risk_cx_yaoka', space id: 36
I20230723 10:46:01.258870 131658 GraphService.cpp:68] Authenticating user root from 10.129.143.100:48900
I20230723 10:46:01.261724 131690 GraphService.cpp:68] Authenticating user root from 10.129.161.146:55146
I20230723 10:46:01.261854 131690 SwitchSpaceExecutor.cpp:37] Graph switched to `risk_cx_yaoka', space id: 36
I20230723 10:46:01.264469 131690 SwitchSpaceExecutor.cpp:37] Graph switched to `risk_cx_yaoka', space id: 36
I20230723 10:46:01.270134 131650 GraphService.cpp:68] Authenticating user root from 10.129.143.100:55606
I20230723 10:46:01.272780 131658 GraphService.cpp:68] Authenticating user root from 10.129.143.100:48906
I20230723 10:46:01.273185 131688 SwitchSpaceExecutor.cpp:37] Graph switched to `risk_cx_yaoka', space id: 36
I20230723 10:46:01.276281 131686 SwitchSpaceExecutor.cpp:37] Graph switched to `risk_cx_yaoka', space id: 36
I20230723 10:46:01.278481 131688 GraphService.cpp:68] Authenticating user root from 10.129.143.100:55606
I20230723 10:46:01.281191 131668 GraphService.cpp:68] Authenticating user root from 10.129.161.146:55146
I20230723 10:46:01.281841 131688 SwitchSpaceExecutor.cpp:37] Graph switched to `risk_cx_yaoka', space id: 36
I20230723 10:46:01.283851 131658 GraphService.cpp:68] Authenticating user root from 10.129.161.146:54652
I20230723 10:46:01.284042 131658 SwitchSpaceExecutor.cpp:37] Graph switched to `risk_cx_yaoka', space id: 36
I20230723 10:46:01.286588 131679 GraphService.cpp:68] Authenticating user root from 10.129.143.100:48906
I20230723 10:46:01.286599 131688 SwitchSpaceExecutor.cpp:37] Graph switched to `risk_cx_yaoka', space id: 36
I20230723 10:46:01.289580 131668 SwitchSpaceExecutor.cpp:37] Graph switched to `risk_cx_yaoka', space id: 36
I20230723 10:47:00.119834 131668 GraphService.cpp:68] Authenticating user root from 10.129.161.146:55146
I20230723 10:47:00.123724 131658 SwitchSpaceExecutor.cpp:37] Graph switched to `risk_cx_yaoka', space id: 36
E20230723 10:47:00.124019 131668 QueryInstance.cpp:137] Used memory hits the high watermark(0.500000) of total system memory.
I20230723 10:47:00.125562 131668 GraphService.cpp:68] Authenticating user root from 10.129.143.100:48900

nebula-meta INFO log

I20230723 10:05:29.209256 131927 GetSpaceProcessor.cpp:18] Get space Failed, SpaceName risk_cx_tj error: E_LEADER_CHANGED
I20230723 10:09:23.813501 131927 GetSpaceProcessor.cpp:18] Get space Failed, SpaceName risk_cx_yaoka error: E_LEADER_CHANGED
I20230723 10:12:31.318755 131927 GetSpaceProcessor.cpp:18] Get space Failed, SpaceName risk_cx_yaoka error: E_LEADER_CHANGED
I20230723 10:16:36.658339 131927 GetSpaceProcessor.cpp:18] Get space Failed, SpaceName risk_cx_yaoka error: E_LEADER_CHANGED
I20230723 10:20:16.896430 131927 GetSpaceProcessor.cpp:18] Get space Failed, SpaceName risk_cx_tj error: E_LEADER_CHANGED
I20230723 10:21:07.131704 131927 GetSpaceProcessor.cpp:18] Get space Failed, SpaceName risk_cx_tj error: E_LEADER_CHANGED
I20230723 10:23:13.299448 131927 GetSpaceProcessor.cpp:18] Get space Failed, SpaceName risk_cx_yaoka error: E_LEADER_CHANGED
I20230723 10:27:29.219110 131927 GetSpaceProcessor.cpp:18] Get space Failed, SpaceName risk_cx_yaoka error: E_LEADER_CHANGED
I20230723 10:30:59.652446 131927 GetSpaceProcessor.cpp:18] Get space Failed, SpaceName risk_cx_yaoka error: E_LEADER_CHANGED
I20230723 11:09:38.952163 131927 HBProcessor.cpp:33] Receive heartbeat from "10.129.72.35":9669, role = GRAPH

nebula-storaged.INFO log

I20230723 10:31:13.434484 136883 CompactionFilter.h:92] Do default minor compaction!
I20230723 10:31:19.843351 136883 EventListener.h:35] Rocksdb compaction completed column family: default because of Ttl, status: OK, compacted 6 files into 3, base level is 1, output level is 2
I20230723 10:31:19.883831 136883 EventListener.h:21] Rocksdb start compaction column family: default because of Ttl, status: OK, compacted 3 files into 0, base level is 1, output level is 2
I20230723 10:31:19.883878 136883 CompactionFilter.h:92] Do default minor compaction!
I20230723 10:31:26.771695 136883 EventListener.h:35] Rocksdb compaction completed column family: default because of Ttl, status: OK, compacted 3 files into 2, base level is 1, output level is 2
I20230723 10:31:26.809937 136883 EventListener.h:21] Rocksdb start compaction column family: default because of Ttl, status: OK, compacted 4 files into 0, base level is 1, output level is 2
I20230723 10:31:26.809978 136883 CompactionFilter.h:92] Do default minor compaction!
I20230723 10:31:31.965975 136883 EventListener.h:35] Rocksdb compaction completed column family: default because of Ttl, status: OK, compacted 4 files into 2, base level is 1, output level is 2
I20230724 10:08:32.823822 136883 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 6 files into 0, base level is 0, output level is 1
I20230724 10:08:32.832691 136883 CompactionFilter.h:82] Do full/manual compaction!
I20230724 10:08:32.832726 120420 CompactionFilter.h:82] Do full/manual compaction!

Search the graph error log for "high watermark", find the statement, and PROFILE it to see what it does.

E20230506 09:15:43.531373  4904 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total system memory., query: FIND ALL PATH WITH PROP FROM 6167652577562338980 TO 755392968367704043 OVER Relation UPTO 3 STEPS YIELD path AS nPath | ORDER BY $-.nPath
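To make that search repeatable, here is a minimal sketch that scans glog-style lines for the watermark message and pulls out the trailing `query:` part when the version logs one. The function and type names (`find_watermark_hits`, `WatermarkHit`) are made up for illustration, and the line format is assumed from the log excerpts in this thread only.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# glog-style prefix as seen in this thread:
# <level><yyyymmdd> <hh:mm:ss.usec> <thread id> <file:line>] <message>
_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<tid>\d+) (?P<src>[^\]]+)\] (?P<msg>.*)$"
)
_MARKER = "Used memory hits the high watermark"
# Some versions append ", query: <statement>" to the message; others do not.
_QUERY = re.compile(r", query: (?P<query>.*)$")


class WatermarkHit(NamedTuple):
    date: str
    time: str
    source: str
    query: Optional[str]  # None when the log line carries no statement


def find_watermark_hits(lines: Iterable[str]) -> Iterator[WatermarkHit]:
    """Yield one WatermarkHit per log line that reports the high watermark."""
    for line in lines:
        m = _LINE.match(line.rstrip("\n"))
        if m is None or _MARKER not in m.group("msg"):
            continue
        q = _QUERY.search(m.group("msg"))
        yield WatermarkHit(
            date=m.group("date"),
            time=m.group("time"),
            source=m.group("src"),
            query=q.group("query") if q else None,
        )
```

Feeding it an open error-log file (for example `find_watermark_hits(open(path))`, with `path` pointing at your own graphd error log) gives the timestamps of every hit; on versions that omit the statement, `query` is `None` and the timestamps can still be matched against business-side logs.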

The graph error log only contains

Used memory hits the high watermark(0.500000) of total system memory

with no query or any other information, right up until the OOM. See the screenshot below.

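That may be a version difference; mine is v3.4. In that case you can only check on the business side which queries were running at that point in time and whether the business side kept any logs, then pull out the relevant statements and try them one by one.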

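It's hard to check them one by one, because the business runs batch jobs. I counted: more than ten thousand queries run at this time every day. The only thing that changes in this database is the daily data increment; the query statements are the same every day, yet sometimes it OOMs.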

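I tried stopping all queries from the outside once the nebula-graph service's memory had grown to 20%, but its memory kept growing until OOM; it is never released and cannot be controlled. Even if I restart that nebula-graph service, the memory of nebula-graph on another, random node in the cluster then grows until OOM. So the nodes in the cluster go down one after another, and things return to normal only after every node has gone down once.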
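For anyone who wants to watch that growth as a number rather than by eye, here is a rough, Linux-only sketch that reads the process RSS from `/proc/<pid>/status` and divides it by `MemTotal` from `/proc/meminfo`. The helper names are hypothetical, and this ratio is the process's own share of RAM, which is not necessarily the same quantity that graphd's watermark check uses.

```python
import re


def parse_kb(text: str, key: str) -> int:
    """Return the kB value of a 'Key:   12345 kB' line in /proc-style text."""
    m = re.search(rf"^{re.escape(key)}:\s+(\d+)\s+kB", text, re.MULTILINE)
    if m is None:
        raise ValueError(f"{key} not found")
    return int(m.group(1))


def rss_ratio(status_text: str, meminfo_text: str) -> float:
    """Resident set size of one process as a fraction of total system RAM."""
    return parse_kb(status_text, "VmRSS") / parse_kb(meminfo_text, "MemTotal")


def process_memory_ratio(pid: int) -> float:
    """Read both /proc files for the given pid (Linux only)."""
    with open(f"/proc/{pid}/status") as status, open("/proc/meminfo") as meminfo:
        return rss_ratio(status.read(), meminfo.read())
```

Calling `process_memory_ratio(pid)` with the graphd pid every few seconds and logging the result with a timestamp makes it easier to line the growth up against the batch job's own logs.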

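Most likely some queries are hitting a super node. Since the query statements are the same every day, you can write code that calls the interface in batches, 100 statements at a time, in a loop, and see which batch causes the OOM. That tells you which statements are responsible.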
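The batching idea above could be sketched like this. It is only an outline under assumptions: `execute` and `memory_ratio` are caller-supplied callables (with the nebula3-python client, `execute` would presumably wrap a session's `execute` method), and the loop stops at a memory threshold instead of waiting for a real OOM. The 0.2 default is an arbitrary illustration, not a recommended value.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def replay_in_batches(
    statements: Sequence[str],
    execute: Callable[[str], object],
    memory_ratio: Callable[[], float],
    threshold: float = 0.2,   # assumed cut-off; pick one that suits the cluster
    batch_size: int = 100,
) -> Optional[Tuple[int, List[str]]]:
    """Run statements batch by batch and report the first suspicious batch.

    Returns (batch_index, batch) for the first batch after which
    memory_ratio() is at or above threshold, or None if no batch crosses it.
    """
    for start in range(0, len(statements), batch_size):
        batch = list(statements[start:start + batch_size])
        for stmt in batch:
            execute(stmt)
        # Probe memory only between batches so each batch is one unit of blame.
        if memory_ratio() >= threshold:
            return start // batch_size, batch
    return None
```

Since the thread reports that memory is not released once it starts climbing, narrowing a suspicious batch down further would probably need a graphd restart between runs; rerunning the same function on the returned batch with a smaller `batch_size` is one way to do that.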

OK, thanks. I'll dig into it on my side.
