graphd memory is not released, leading to OOM

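  • Nebula version: v2.0.1
  • Deployment (distributed / standalone / Docker / DBaaS): distributed, 3 nodes, 3 replicas
  • Production environment: Y
  • Hardware
    Per-node configuration: 32 cores + 128 GB memory + 2 × 2 TB SSD
  • Data distribution
    vertices: 670 million
    edges: 840 million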

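Background

1. Characteristics of the queried data (number of rows reachable within 2 hops of each queried vertex, for nearly 40 vertices)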

340973,304050,199631,141379,123531,92423,92227,86868,26662,20324,9212,7336,5019,458,36,28,27,24,22,21,13,9,8,7,7,5,5,4,4,4,3,3,3,2,2,1,1,1
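
The distribution above is heavily skewed toward a few large vertices. As a rough illustration only, the following sketch totals the counts listed above and reports how much of the volume comes from the largest vertices. The function name `skew_summary` and the choice of `top_n=8` are assumptions made for this example, not something from the original post.

```python
# 2-hop row counts copied from the list above, one value per queried vertex.
counts = [
    340973, 304050, 199631, 141379, 123531, 92423, 92227, 86868,
    26662, 20324, 9212, 7336, 5019, 458, 36, 28, 27, 24, 22, 21,
    13, 9, 8, 7, 7, 5, 5, 4, 4, 4, 3, 3, 3, 2, 2, 1, 1, 1,
]


def skew_summary(values, top_n=8):
    """Summarise how concentrated the row counts are in the largest vertices."""
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    top = sum(ordered[:top_n])
    return {
        "vertices": len(ordered),
        "total_rows": total,
        "top_share": top / total,  # fraction of rows owned by the top_n vertices
    }


summary = skew_summary(counts)
print(summary)
```

If the share of the top few vertices is close to 1, memory pressure on graphd is likely dominated by those few queries rather than by the query count.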

2. Query statement

MATCH (v)--(v2)--(v3) where id(v) == 'vid' and v.attr!='1' and v2.attr!='1' and v3.attr!='1' RETURN v.key,v.tag_type,v2.key,v2.tag_type,v3.key,v3.tag_type;

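3. Description of the data in the database

10 entity types, 8 relationship types
  • There are two problems
    1. After querying a super vertex (a vertex with a very large number of neighbors) with nebula-console, graphd does not release the memory it used once the query has finished.
    2. The client is written in Go. Two goroutines each open their own session and send requests to two different graphd instances; the memory used by one of the graphd instances keeps growing until OOM.

The requests are as described in the Background section.
Resource usage from top: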

18401 xx    20   0  163.1g  67.9g      0 S   2.8 54.0  37:02.71 nebula-graphd
17353 xx    20   0   74.0g  48.8g      0 S   2.5 38.8  84:41.23 nebula-storaged
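
As a small aid for reading the two top lines above, here is a sketch that extracts the RES and %MEM columns and back-computes the machine's total memory from them. It assumes the default top column order (PID, USER, PR, NI, VIRT, RES, SHR, S, %CPU, %MEM, TIME+, COMMAND); the helper names are made up for this example.

```python
def to_gib(field):
    """Convert a top memory field such as '67.9g' to GiB (bare numbers are KiB)."""
    scale = {"k": 1 / 1024**2, "m": 1 / 1024, "g": 1.0, "t": 1024.0}
    suffix = field[-1].lower()
    if suffix in scale:
        return float(field[:-1]) * scale[suffix]
    return float(field) / 1024**2


def parse_top_line(line):
    """Parse one process line of top output, assuming the default column order."""
    cols = line.split()
    return {
        "pid": int(cols[0]),
        "virt_gib": to_gib(cols[4]),
        "res_gib": to_gib(cols[5]),
        "mem_pct": float(cols[9]),
        "command": cols[11],
    }


top_lines = [
    "18401 xx    20   0  163.1g  67.9g      0 S   2.8 54.0  37:02.71 nebula-graphd",
    "17353 xx    20   0   74.0g  48.8g      0 S   2.5 38.8  84:41.23 nebula-storaged",
]
procs = [parse_top_line(line) for line in top_lines]
for p in procs:
    # RES divided by its %MEM share gives an estimate of total physical memory.
    estimated_total = p["res_gib"] / (p["mem_pct"] / 100)
    print(p["command"], f"RES={p['res_gib']:.1f} GiB", f"total~{estimated_total:.1f} GiB")
```

Read this way, graphd and storaged on this node together hold 54.0% + 38.8% = 92.8% of physical memory, which leaves very little headroom before the OOM killer steps in.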
  • Configuration notes
    THP (transparent huge pages) is disabled

Part of the graphd configuration:

--num_netio_threads=0
# The number of threads to execute user queries, 0 for # of CPU cores
--num_worker_threads=0

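What is the scale of your vertices and edges? When you run the query from console, is graphd's memory still not released after the result is returned? For the console run, could you share the execution plan produced by a PROFILE query?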

I believe it is related only to the data being queried (see the Background section).
Data scale: 670 million vertices, 840 million edges

Execution plan

"id","name","dependencies","profiling data","operator info"
"21","Project","20","ver: 0, rows: 348367, execTime: 1227010us, totalTime: 1227015us","outputVar: [
  {
    ""colNames"": [
      ""v.key"",
      ""v.tag_type"",
      ""v2.key"",
      ""v2.tag_type"",
      ""v3.key"",
      ""v3.tag_type""
    ],
    ""name"": ""__Project_21"",
    ""type"": ""DATASET""
  }
]
inputVar: __Filter_20
columns: [
  ""$v.key"",
  ""$v.tag_type"",
  ""$v2.key"",
  ""$v2.tag_type"",
  ""$v3.key"",
  ""$v3.tag_type""
]"
"20","Filter","19","ver: 0, rows: 348367, execTime: 1251362us, totalTime: 1251367us","outputVar: [
  {
    ""colNames"": [
      ""v"",
      ""v2"",
      ""v3"",
      ""__COL_0""
    ],
    ""name"": ""__Filter_20"",
    ""type"": ""DATASET""
  }
]
inputVar: __Filter_19
condition: ((((id($v)==""effd57f13d22eec8b5554d1b51a7059d"") AND ($v.attr!=""1"")) AND ($v2.attr!=""1"")) AND ($v3.attr!=""1""))"
"19","Filter","18","ver: 0, rows: 349135, execTime: 1492374us, totalTime: 1492380us","outputVar: [
  {
    ""colNames"": [
      ""v"",
      ""v2"",
      ""v3"",
      ""__COL_0""
    ],
    ""name"": ""__Filter_19"",
    ""type"": ""DATASET""
  }
]
inputVar: __Project_18
condition: (hasSameEdgeInPath($-.__COL_0)==false)"
"18","Project","17","ver: 0, rows: 349145, execTime: 8518223us, totalTime: 8518230us","outputVar: [
  {
    ""colNames"": [
      ""v"",
      ""v2"",
      ""v3"",
      ""__COL_0""
    ],
    ""name"": ""__Project_18"",
    ""type"": ""DATASET""
  }
]
inputVar: __InnerJoin_17
columns: [
  ""startNode($-._path_0) AS v"",
  ""startNode($-._path_1) AS v2"",
  ""startNode($-._path_2) AS v3"",
  ""PathBuild[PathBuild[$-._path_0,$-._path_1,$-._path_2]] AS __COL_0""
]"
"17","InnerJoin","16","ver: 0, rows: 349145, execTime: 4791868us, totalTime: 4791873us","outputVar: [
  {
    ""colNames"": [
      ""_path_0"",
      ""_path_1"",
      ""_path_2""
    ],
    ""name"": ""__InnerJoin_17"",
    ""type"": ""DATASET""
  }
]
inputVar: {
  ""rightVar"": {
    ""__Project_16"": ""0""
  },
  ""leftVar"": {
    ""__InnerJoin_12"": ""0""
  }
}
hashKeys: [
  ""endNode($-._path_1)._vid""
]
probeKeys: [
  ""startNode($-._path)._vid""
]
kind: InnerJoin"
"16","Project","15","ver: 0, rows: 347748, execTime: 1803401us, totalTime: 1803407us","outputVar: [
  {
    ""colNames"": [
      ""_path""
    ],
    ""name"": ""__Project_16"",
    ""type"": ""DATASET""
  }
]
inputVar: __GetVertices_15
columns: [
  ""PathBuild[VERTEX]""
]"
"15","GetVertices","14","{
ver: 0, rows: 347748, execTime: 203372us, totalTime: 17324091us
total_rpc: 17242680(us)
""172.25.41.11"":44500 exec/total: 16080829(us)/16980464(us)
""172.25.41.12"":44500 exec/total: 11767284(us)/12474453(us)
""172.25.41.13"":44500 exec/total: 16292667(us)/17113934(us)
}","outputVar: [
  {
    ""colNames"": [],
    ""name"": ""__GetVertices_15"",
    ""type"": ""DATASET""
  }
]
inputVar: __Dedup_14
space: 9
dedup: false
limit: 9223372036854775807
filter: 
orderBy: []
src: $-._vid
props: [
  {
    ""props"": [
      ""key"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""tag_type"",
      ""_tag""
    ],
    ""tagId"": 14
  },
  {
    ""props"": [
      ""key"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""tag_type"",
      ""attr"",
      ""label"",
      ""file_type"",
      ""_tag""
    ],
    ""tagId"": 13
  },
  {
    ""props"": [
      ""key"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""tag_type"",
      ""attr"",
      ""label"",
      ""_tag""
    ],
    ""tagId"": 12
  },
  {
    ""props"": [
      ""key"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""tag_type"",
      ""attr"",
      ""label"",
      ""_tag""
    ],
    ""tagId"": 10
  },
  {
    ""props"": [
      ""key"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""tag_type"",
      ""attr"",
      ""label"",
      ""protocol"",
      ""_tag""
    ],
    ""tagId"": 11
  },
  {
    ""props"": [
      ""key"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""tag_type"",
      ""_tag""
    ],
    ""tagId"": 15
  },
  {
    ""props"": [
      ""key"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""tag_type"",
      ""_tag""
    ],
    ""tagId"": 16
  },
  {
    ""props"": [
      ""key"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""tag_type"",
      ""_tag""
    ],
    ""tagId"": 17
  },
  {
    ""props"": [
      ""key"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""tag_type"",
      ""title"",
      ""label"",
      ""url"",
      ""_tag""
    ],
    ""tagId"": 18
  },
  {
    ""props"": [
      ""key"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""tag_type"",
      ""_tag""
    ],
    ""tagId"": 19
  }
]
exprs: []"
"14","Dedup","13","ver: 0, rows: 347854, execTime: 70755us, totalTime: 70764us","outputVar: [
  {
    ""colNames"": [
      ""_vid""
    ],
    ""name"": ""__Dedup_14"",
    ""type"": ""DATASET""
  }
]
inputVar: __Project_13"
"13","Project","12","ver: 0, rows: 349252, execTime: 618590us, totalTime: 618598us","outputVar: [
  {
    ""colNames"": [
      ""_vid""
    ],
    ""name"": ""__Project_13"",
    ""type"": ""DATASET""
  }
]
inputVar: __InnerJoin_12
columns: [
  ""endNode($-._path_1)._vid""
]"
"12","InnerJoin","11","ver: 0, rows: 349252, execTime: 2327018us, totalTime: 2327026us","outputVar: [
  {
    ""colNames"": [
      ""_path_0"",
      ""_path_1""
    ],
    ""name"": ""__InnerJoin_12"",
    ""type"": ""DATASET""
  }
]
inputVar: {
  ""rightVar"": {
    ""__Filter_11"": ""0""
  },
  ""leftVar"": {
    ""__Filter_6"": ""0""
  }
}
hashKeys: [
  ""endNode($-._path)._vid""
]
probeKeys: [
  ""startNode($-._path)._vid""
]
kind: InnerJoin"
"11","Filter","10","ver: 0, rows: 349252, execTime: 682283us, totalTime: 682290us","outputVar: [
  {
    ""colNames"": [
      ""_path""
    ],
    ""name"": ""__Filter_11"",
    ""type"": ""DATASET""
  }
]
inputVar: __Project_10
condition: (length($-._path)>=1)"
"10","Project","9","ver: 0, rows: 349252, execTime: 2416481us, totalTime: 2416489us","outputVar: [
  {
    ""colNames"": [
      ""_path""
    ],
    ""name"": ""__Project_10"",
    ""type"": ""DATASET""
  }
]
inputVar: __GetNeighbors_9
columns: [
  ""PathBuild[VERTEX,EDGE] AS _path""
]"
"9","GetNeighbors","8","{
ver: 0, rows: 0, execTime: 262229us, totalTime: 1315418us
""172.25.41.12"":44500 exec/total/vertices: 2978(us)/3949(us)/4,
total_rpc_time: 1053087(us)
""172.25.41.13"":44500 exec/total/vertices: 596328(us)/1028378(us)/4,
""172.25.41.11"":44500 exec/total/vertices: 2128(us)/3122(us)/2,
}","outputVar: [
  {
    ""colNames"": [],
    ""name"": ""__GetNeighbors_9"",
    ""type"": ""DATASET""
  }
]
inputVar: __Dedup_8
space: 9
dedup: false
limit: -1
filter: 
orderBy: []
src: $-._vid
edgeTypes: []
edgeDirection: BOTH
vertexProps: []
edgeProps: [
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""-27""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""27""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""-26""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""26""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""create_date"",
      ""expire_date""
    ],
    ""type"": ""-25""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""create_date"",
      ""expire_date""
    ],
    ""type"": ""25""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""-21""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""21""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""-20""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""20""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""status"",
      ""method"",
      ""port""
    ],
    ""type"": ""-22""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""status"",
      ""method"",
      ""port""
    ],
    ""type"": ""22""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""query_type""
    ],
    ""type"": ""-23""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""query_type""
    ],
    ""type"": ""23""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""-24""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""24""
  }
]
statProps: 
exprs: 
random: false"
"8","Dedup","7","ver: 0, rows: 10, execTime: 5us, totalTime: 6us","outputVar: [
  {
    ""colNames"": [
      ""_vid""
    ],
    ""name"": ""__Dedup_8"",
    ""type"": ""DATASET""
  }
]
inputVar: __Project_7"
"7","Project","6","ver: 0, rows: 10, execTime: 16us, totalTime: 17us","outputVar: [
  {
    ""colNames"": [
      ""_vid""
    ],
    ""name"": ""__Project_7"",
    ""type"": ""DATASET""
  }
]
inputVar: __Filter_6
columns: [
  ""endNode($-._path)._vid""
]"
"6","Filter","5","ver: 0, rows: 10, execTime: 30us, totalTime: 31us","outputVar: [
  {
    ""colNames"": [
      ""_path""
    ],
    ""name"": ""__Filter_6"",
    ""type"": ""DATASET""
  }
]
inputVar: __Project_5
condition: (length($-._path)>=1)"
"5","Project","4","ver: 0, rows: 10, execTime: 197us, totalTime: 198us","outputVar: [
  {
    ""colNames"": [
      ""_path""
    ],
    ""name"": ""__Project_5"",
    ""type"": ""DATASET""
  }
]
inputVar: __GetNeighbors_4
columns: [
  ""PathBuild[VERTEX,EDGE] AS _path""
]"
"4","GetNeighbors","3","{
ver: 0, rows: 0, execTime: 99us, totalTime: 2861us
""172.25.41.12"":44500 exec/total/vertices: 1471(us)/2555(us)/1,
total_rpc_time: 2732(us)
}","outputVar: [
  {
    ""colNames"": [],
    ""name"": ""__GetNeighbors_4"",
    ""type"": ""DATASET""
  }
]
inputVar: __Dedup_3
space: 9
dedup: false
limit: -1
filter: 
orderBy: []
src: $-._vid
edgeTypes: []
edgeDirection: BOTH
vertexProps: []
edgeProps: [
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""-27""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""27""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""-26""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""26""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""create_date"",
      ""expire_date""
    ],
    ""type"": ""-25""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""create_date"",
      ""expire_date""
    ],
    ""type"": ""25""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""-21""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""21""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""-20""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""20""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""status"",
      ""method"",
      ""port""
    ],
    ""type"": ""-22""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""status"",
      ""method"",
      ""port""
    ],
    ""type"": ""22""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""query_type""
    ],
    ""type"": ""-23""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source"",
      ""query_type""
    ],
    ""type"": ""23""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""-24""
  },
  {
    ""props"": [
      ""_src"",
      ""_type"",
      ""_rank"",
      ""_dst"",
      ""add_time"",
      ""found_time"",
      ""update_time"",
      ""source""
    ],
    ""type"": ""24""
  }
]
statProps: 
exprs: 
random: false"
"3","Dedup","2","ver: 0, rows: 1, execTime: 2us, totalTime: 3us","outputVar: [
  {
    ""colNames"": [
      ""_vid""
    ],
    ""name"": ""__Dedup_3"",
    ""type"": ""DATASET""
  }
]
inputVar: __Project_2"
"2","Project","1","ver: 0, rows: 1, execTime: 7us, totalTime: 8us","outputVar: [
  {
    ""colNames"": [
      ""_vid""
    ],
    ""name"": ""__Project_2"",
    ""type"": ""DATASET""
  }
]
inputVar: __VAR_3
columns: [
  ""$__VAR_3._vid""
]"
"1","PassThrough","0","ver: 0, rows: 0, execTime: 3us, totalTime: 4us","outputVar: [
  {
    ""colNames"": [
      ""_vid""
    ],
    ""name"": ""__VAR_3"",
    ""type"": ""DATASET""
  }
]
inputVar: "
"0","Start","","ver: 0, rows: 0, execTime: 2us, totalTime: 81us","outputVar: [
  {
    ""colNames"": [],
    ""name"": ""__Start_0"",
    ""type"": ""DATASET""
  }
]"

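Execution plan attachment: result.csv (16.0 KB)

At the moment we are not sure what kind of data topology would produce the behavior you describe, and we have not been able to reproduce it locally. Could you use a memory analysis tool such as valgrind or heaptrack to give us some clues?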
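
To make a plan like the one above easier to scan, the sketch below ranks operators by `totalTime`. It is only an illustration: `SAMPLE` holds four rows from the plan with the "operator info" column trimmed to a single line, and `rank_operators` is a name chosen for this example. The same function could be pointed at the full result.csv.

```python
import csv
import io
import re

# Four rows taken from the plan above; the "operator info" column is trimmed.
SAMPLE = '''"id","name","dependencies","profiling data","operator info"
"18","Project","17","ver: 0, rows: 349145, execTime: 8518223us, totalTime: 8518230us","inputVar: __InnerJoin_17"
"17","InnerJoin","16","ver: 0, rows: 349145, execTime: 4791868us, totalTime: 4791873us","kind: InnerJoin"
"15","GetVertices","14","{
ver: 0, rows: 347748, execTime: 203372us, totalTime: 17324091us
total_rpc: 17242680(us)
}","inputVar: __Dedup_14"
"9","GetNeighbors","8","{
ver: 0, rows: 0, execTime: 262229us, totalTime: 1315418us
total_rpc_time: 1053087(us)
}","inputVar: __Dedup_8"
'''

STATS = re.compile(r"rows: (\d+), execTime: (\d+)us, totalTime: (\d+)us")


def rank_operators(csv_text):
    """Return (total_seconds, rows, id, name) tuples, slowest operator first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    ranked = []
    for row in reader:
        match = STATS.search(row["profiling data"])
        if match is None:
            continue  # skip rows without the expected profiling fields
        rows, _exec_us, total_us = map(int, match.groups())
        ranked.append((total_us / 1e6, rows, row["id"], row["name"]))
    return sorted(ranked, reverse=True)


ranked = rank_operators(SAMPLE)
for seconds, rows, op_id, name in ranked:
    print(f"{seconds:10.3f}s  rows={rows:<8} #{op_id} {name}")
```

On these rows the GetVertices step (id 15) dominates at roughly 17 s, and it is also the step that fetches every property of about 348,000 vertices into graphd.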

Are the statements in 1 and 2 the same? Did 1 not run into OOM?

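The two statements are the same. However, 2 queried many vertices one after another, while 1 queried only a single vertex.
I think both are caused by memory not being released.
I ran a test today with graphd deployed on a virtual machine with 128 GB of memory, and memory usage kept rising up to 128 GB.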

When you say "memory usage kept rising up to 128 GB", was that caused by a single statement or by multiple statements?

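It was caused by multiple query statements issued from a single thread.
From what I have observed, a single query uses at most about 15 GB of memory (around 12%).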

Collecting the memory information with valgrind took nearly two hours.

Startup command (from ps output)

root     12710 99.0 35.3 47258940 46672804 ?   Ssl  17:15  54:03 valgrind --tool=memcheck --log-file=valgrind_log --trace-children=yes /usr/local/nebula/bin/nebula-graphd --flagfile /usr/local/nebula/etc/nebula-graphd.conf

top details

22352 nebula    20   0   16.9g  13.9g   9600 S   0.3 11.0   6:31.65 nebula-storaged
12710 root      20   0   45.1g  44.5g   9032 S 100.0 35.4  49:33.43 memcheck-amd64-

Log

valgrind_log (46.2 KB)

Please post the statement and the schema, and I will try to reproduce it on my side.

Query statement

@steam Let's set up a small group chat and come back here to sync the results afterwards.

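Thanks to the Nebula team for their help.
To close the loop, here are the final conclusions.
1. The OOM was caused by the queried data being too large; it even exceeded 128 GB.
2. graphd does not release all of the memory it has used so that the next query can reuse it; this involves the memory pool managed by jemalloc.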
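
One way to tell allocator retention apart from unbounded growth is to run the same query repeatedly and sample graphd's resident set size after each run. The sketch below is a heuristic written for this thread, not an official Nebula or jemalloc tool. It parses the `VmRSS` line of Linux's `/proc/<pid>/status`; the sample values and the 5% tolerance are made-up assumptions.

```python
def parse_vmrss_kib(status_text):
    """Extract VmRSS (in KiB) from the text of /proc/<pid>/status on Linux."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    raise ValueError("VmRSS not found")


def still_growing(samples_kib, tolerance=0.05):
    """Heuristic: True if the peak of the later half of the samples exceeds
    the peak of the earlier half by more than `tolerance` (a fraction)."""
    if len(samples_kib) < 2:
        raise ValueError("need at least two samples")
    half = len(samples_kib) // 2
    first_peak = max(samples_kib[:half])
    second_peak = max(samples_kib[half:])
    return second_peak > first_peak * (1 + tolerance)


# Hypothetical RSS samples (KiB), one taken after each repetition of a query.
plateau = [15_000_000, 15_200_000, 15_100_000, 15_150_000]
climbing = [15_000_000, 30_000_000, 45_000_000, 60_000_000]

print(parse_vmrss_kib("Name:\tnebula-graphd\nVmRSS:\t 15000000 kB\n"))
print(still_growing(plateau), still_growing(climbing))
```

A plateau after the first run is consistent with memory being kept in the allocator's pool for reuse, as described in conclusion 2. A steady climb across different large queries points instead to the working set itself outgrowing the machine, as in conclusion 1.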
