High heartbeat latency on the meta service

  • NebulaGraph version: 3.4.1
  • Deployment: distributed
  • Installation: built from source
  • In production: Y
  • Hardware: 4 physical machines, 256 GB SSD, 32 cores / 64 GB RAM each
  • The meta service on one of the machines shows very high latency (monitoring screenshot omitted)

    The problematic machine keeps writing INFO log entries non-stop (screenshot omitted). The logs quickly fill the disk, the machine load rises, and both data import and query performance against the graph database degrade.
  • meta.INFO log on the problematic node, i.e. the meta0 machine:
I20230626 19:37:31.348703 32176 JobDescription.cpp:113] Loading job description failed, error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.348721 32176 JobManager.cpp:441] LoadJobDesc failed, jobId 619 error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.348788 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.349120 32178 JobManager.cpp:311] jobFinished, spaceId=97, jobId=835, result=FAILED
I20230626 19:37:31.349143 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.349198 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.349524 32180 JobManager.cpp:311] jobFinished, spaceId=97, jobId=830, result=FAILED
I20230626 19:37:31.349545 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.349627 32176 JobDescription.cpp:53] p = depended_by_day_index
I20230626 19:37:31.349946 32176 JobManager.cpp:311] jobFinished, spaceId=97, jobId=875, result=FAILED
I20230626 19:37:31.349978 32176 JobDescription.cpp:53] p = depended_by_day_index
I20230626 19:37:31.350045 32179 JobDescription.cpp:113] Loading job description failed, error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.350064 32179 JobManager.cpp:441] LoadJobDesc failed, jobId 619 error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.350131 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.350433 32178 JobManager.cpp:311] jobFinished, spaceId=97, jobId=835, result=FAILED
I20230626 19:37:31.350461 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.350512 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.350775 32180 JobManager.cpp:311] jobFinished, spaceId=97, jobId=830, result=FAILED
I20230626 19:37:31.350793 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.350847 32179 JobDescription.cpp:53] p = depended_by_day_index
I20230626 19:37:31.351155 32179 JobManager.cpp:311] jobFinished, spaceId=97, jobId=875, result=FAILED
I20230626 19:37:31.351179 32179 JobDescription.cpp:53] p = depended_by_day_index
I20230626 19:37:31.351233 32176 JobDescription.cpp:113] Loading job description failed, error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.351249 32176 JobManager.cpp:441] LoadJobDesc failed, jobId 619 error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.351321 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.351646 32178 JobManager.cpp:311] jobFinished, spaceId=97, jobId=835, result=FAILED
I20230626 19:37:31.351671 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.351722 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.352016 32180 JobManager.cpp:311] jobFinished, spaceId=97, jobId=830, result=FAILED
I20230626 19:37:31.352031 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.352083 32176 JobDescription.cpp:53] p = depended_by_day_index
I20230626 19:37:31.352377 32176 JobManager.cpp:311] jobFinished, spaceId=97, jobId=875, result=FAILED
I20230626 19:37:31.352398 32176 JobDescription.cpp:53] p = depended_by_day_index
I20230626 19:37:31.352452 32179 JobDescription.cpp:113] Loading job description failed, error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.352465 32179 JobManager.cpp:441] LoadJobDesc failed, jobId 619 error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.352533 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.352854 32178 JobManager.cpp:311] jobFinished, spaceId=97, jobId=835, result=FAILED
I20230626 19:37:31.352880 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.352936 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.353214 32180 JobManager.cpp:311] jobFinished, spaceId=97, jobId=830, result=FAILED
I20230626 19:37:31.353241 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.353307 32179 JobDescription.cpp:53] p = depended_by_day_index

meta.INFO log on the other, healthy machines:

I20230626 18:08:07.795874  5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 18:16:42.965276  5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 18:16:42.985111  5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 18:25:19.666574  5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 18:25:19.677080  5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 18:33:54.337836  5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 18:33:54.345790  5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 18:42:31.912842  5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 18:42:31.926230  5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 18:51:08.028556  5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 18:51:08.041301  5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 18:59:44.096387  5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 18:59:44.099869  5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 19:08:24.429461  5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 19:08:24.433949  5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 19:19:11.120577  5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 19:19:11.126242  5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 19:37:02.020364  5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 19:37:02.025600  5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1

Status of the jobs reported as failed in the log (screenshot omitted; the jobs show status FINISHED):

Because the jobs are already in FINISHED state, they cannot be recovered with RECOVER JOB <id>.
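For reference, a minimal nGQL sketch of how the job state can be inspected and, for non-FINISHED jobs, recovered. The job ID below is one of the IDs from the log; in 3.x jobs are managed per graph space, so a space has to be selected first (the space name is a placeholder):

USE your_space;     // hypothetical space name; jobs are space-scoped in 3.x
SHOW JOBS;          // lists job id, command, status, start/stop time
SHOW JOB 835;       // detail of one of the jobs reported as FAILED in the log
RECOVER JOB 835;    // intended for failed jobs, which is why a job already
                    // marked FINISHED cannot be recovered this way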

What does your deployment look like, 4 storaged? How many metad? And what log level is set on meta0?
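In case it helps with the disk-filling logs, a hedged sketch of checking and lowering the metad log verbosity, assuming the default metad ws HTTP port 19559 and the /flags endpoint documented for 3.x (adjust host/port to your deployment; the permanent equivalents are the --minloglevel and --v flags in nebula-metad.conf):

# inspect the current gflags of the running metad (output format may vary by version)
curl "http://127.0.0.1:19559/flags"
# raise minloglevel so routine INFO lines stop being written (runtime change, lost on restart)
curl -X PUT -H "Content-Type: application/json" -d '{"minloglevel":1,"v":0}' "http://127.0.0.1:19559/flags"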

The simplest thing to try: stop the whole cluster, wait a while, then start it again?
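A minimal sketch of that stop/start using the bundled service script, assuming the default install prefix /usr/local/nebula (adjust the path to wherever your source build was installed):

# run on every machine: stop graphd, metad and storaged
/usr/local/nebula/scripts/nebula.service stop all
# wait a moment, then bring everything back up
/usr/local/nebula/scripts/nebula.service start all
# confirm all three daemons are running again
/usr/local/nebula/scripts/nebula.service status all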

To @sunny's question: it looks like you are running 4 metad.

Each meta instance is a member of a Raft group, so the total number of metad should not be even. Clean up the cluster, redeploy, and run metad on only three of the machines.
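A sketch of what the relevant configuration could look like after redeploying with three metad; the IPs are placeholders, and the same three-address meta_server_addrs list has to appear in nebula-metad.conf, nebula-graphd.conf and nebula-storaged.conf on every machine:

# nebula-metad.conf on the three machines that keep running metad
--meta_server_addrs=192.168.0.1:9559,192.168.0.2:9559,192.168.0.3:9559
--local_ip=192.168.0.1        # each machine's own address
--port=9559

# nebula-graphd.conf and nebula-storaged.conf on all four machines
--meta_server_addrs=192.168.0.1:9559,192.168.0.2:9559,192.168.0.3:9559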


We currently have four machines, each running graphd, metad and storaged. Should I also cut every service back to three machines?
Data volume: (screenshot omitted)
A job runs every day that computes, for every HiveTable vertex, the number of downstream vertices within n hops.

In addition, the graph service has also been going down frequently over the last few days. The graph.INFO log from one of the nodes is as follows:

E20230630 11:03:10.482440 20379 QueryInstance.cpp:151] std::runtime_error: Used memory hits the high watermark(0.9D COUNT($-.a) - 1  AS vCount;
E20230630 11:03:10.483747 20373 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total syst1  AS vCount;
E20230630 11:03:10.509338 20377 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total syst  AS vCount;
E20230630 11:03:10.534242 20385 QueryInstance.cpp:151] std::runtime_error: Used memory hits the high watermark(0.9 COUNT($-.a) - 1  AS vCount;
E20230630 11:03:10.568114 20389 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total syst  AS vCount;
E20230630 11:03:10.600255 20382 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total syst1  AS vCount;
E20230630 11:03:10.729213 20370 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total syst  AS vCount;
E20230630 11:03:10.732945 20385 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total syst  AS vCount;
E20230630 11:03:10.732939 20382 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total syst  AS vCount;
E20230630 11:03:10.733844 20371 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total syst  AS vCount;
E20230630 11:03:10.734591 20376 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total syst  AS vCount;
E20230630 11:03:10.734766 20397 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total syst  AS vCount;
E20230630 11:03:10.735172 20387 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total syst AS vCount;
E20230630 11:03:10.735780 20375 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total syst  AS vCount;
E20230630 11:03:10.735889 20375 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total syst YIELD SUM(CASE WHEN tags($-.a)[0] == 'HiveTable' THEN 0 ELSE 1 END) AS vCount | YIELD $-.vCount > 0 AS flag;

graph.INFO log from another node:

E20230630 15:03:29.173085 19262 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total systmemory., query: GET SUBGRAPH 10000 STEPS FROM 'bdsp:bdsptask:id:713154' IN DEPENDED_BY YIELD VERTICES as e | UNWIN-.e as a | YIELD COUNT($-.a) - 1  AS vCount;
E20230630 15:03:29.173069 19295 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total systmemory., query: GET SUBGRAPH 10000 STEPS FROM 'bdsp:bdsptask:id:712934' IN DEPENDED_BY YIELD VERTICES as e | UNWIN-.e as a | YIELD COUNT($-.a) - 1  AS vCount;
E20230630 15:03:29.173000 19301 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total systmemory., query: GET SUBGRAPH 10000 STEPS FROM 'bdsp:bdsptask:id:713035' IN DEPENDED_BY YIELD VERTICES as e | UNWIN-.e as a | YIELD COUNT($-.a) - 1  AS vCount;
E20230630 15:03:29.173086 19299 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total systmemory., query: GET SUBGRAPH 10000 STEPS FROM 'bdsp:bdsptask:id:713342' IN DEPENDED_BY YIELD VERTICES as e | UNWIN-.e as a | YIELD COUNT($-.a) - 1  AS vCount;
E20230630 15:03:29.173099 19288 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total systmemory., query: GET SUBGRAPH 10000 STEPS FROM 'bdsp:bdsptask:id:659370' OUT DEPENDED_BY YIELD VERTICES as e | UNWI$-.e as a | YIELD COUNT($-.a) - 1  AS vCount;
E20230630 15:03:29.172998 19291 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total systmemory., query: GET SUBGRAPH 10000 STEPS FROM 'bdsp:bdsptask:id:713094' IN DEPENDED_BY YIELD VERTICES as e | UNWIN-.e as a | YIELD COUNT($-.a) - 1  AS vCount;
E20230630 15:03:29.300027 19291 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total systmemory., query: GET SUBGRAPH 10000 STEPS FROM 'bdsp:bdsptask:id:659525' OUT DEPENDED_BY YIELD VERTICES as e | UNWI$-.e as a | YIELD COUNT($-.a) - 1  AS vCount;
E20230630 15:03:29.300024 19289 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total systmemory., query: GET SUBGRAPH 10000 STEPS FROM 'bdsp:bdsptask:id:713962' IN DEPENDED_BY YIELD VERTICES as e | UNWIN-.e as a | YIELD COUNT($-.a) - 1  AS vCount;
E20230630 15:03:29.300022 19279 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total systmemory., query: GET SUBGRAPH 10000 STEPS FROM 'bdsp:bdsptask:id:713742' IN DEPENDED_BY YIELD VERTICES as e | UNWIN-.e as a | YIELD COUNT($-.a) - 1  AS vCount;
E20230630 15:03:29.355511 19286 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total systmemory., query: GET SUBGRAPH 10000 STEPS FROM 'bdsp:bdsptask:id:713983' IN DEPENDED_BY YIELD VERTICES as e | UNWIN-.e as a | YIELD COUNT($-.a) - 1  AS vCount;
E20230630 15:03:29.399291 19291 QueryInstance.cpp:151] Used memory hits the high watermark(0.900000) of total systmemory., query: GET SUBGRAPH 10000 STEPS FROM 'bdsp:bdsptask:id:714073' IN DEPENDED_BY YIELD VERTICES as e | UNWIN-.e as a | YIELD COUNT($-.a) - 1  AS vCount;
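For context, reconstructed from the truncated lines above, each of those per-vertex queries is roughly of this form (the middle of the pipe is cut off in the log, so treat the UNWIND step as an approximation):

GET SUBGRAPH 10000 STEPS FROM 'bdsp:bdsptask:id:713154' IN DEPENDED_BY YIELD VERTICES AS e | UNWIND $-.e AS a | YIELD COUNT($-.a) - 1 AS vCount;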

These queries had been running stably for quite a long time and only started failing in the last few days. I'm not sure whether this is related to the issue in the title of this post.
Earlier, following another post on this forum, I changed the following two RocksDB-related settings in nebula-storaged.conf:

--rocksdb_db_options={"max_open_files":"50000"}
--rocksdb_block_based_table_options={"block_size":"32768"}

No need, only meta has to be reduced to three nodes. However, the number of metad cannot be changed in place, so you need to wipe the environment and redeploy.


Could you take another look at this issue?

From the log it doesn't look like the process actually crashed; it hit the memory high watermark.
As for why it used to run fine and now fails, check whether your data has changed, for example whether new data was added or the schema was modified.
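For reference, the watermark in those errors is controlled by a graphd flag; a hedged sketch of the relevant nebula-graphd.conf line (0.9 matches the value printed in the log; whether raising it is safe depends on how much memory the co-located storaged and metad on the same machine need):

# nebula-graphd.conf
# Queries are rejected once used system memory exceeds this fraction of total
# memory; the errors above show the current threshold is 0.9.
--system_memory_high_watermark_ratio=0.9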