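- nebula version: 3.4.1
- Deployment: distributed
- Installation: compiled from source
- In production: Y
- Hardware: 4 physical machines, 256G SSD, 32C64G
- The meta service on one machine has very high latency, as shown in the screenshot below.
The faulty machine writes INFO logs continuously (see the screenshot below). The logs quickly fill the disk and the machine load rises, which in turn slows down both imports and queries against the graph database.
- meta.INFO log on the faulty node (the meta0 machine):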
I20230626 19:37:31.348703 32176 JobDescription.cpp:113] Loading job description failed, error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.348721 32176 JobManager.cpp:441] LoadJobDesc failed, jobId 619 error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.348788 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.349120 32178 JobManager.cpp:311] jobFinished, spaceId=97, jobId=835, result=FAILED
I20230626 19:37:31.349143 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.349198 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.349524 32180 JobManager.cpp:311] jobFinished, spaceId=97, jobId=830, result=FAILED
I20230626 19:37:31.349545 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.349627 32176 JobDescription.cpp:53] p = depended_by_day_index
I20230626 19:37:31.349946 32176 JobManager.cpp:311] jobFinished, spaceId=97, jobId=875, result=FAILED
I20230626 19:37:31.349978 32176 JobDescription.cpp:53] p = depended_by_day_index
I20230626 19:37:31.350045 32179 JobDescription.cpp:113] Loading job description failed, error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.350064 32179 JobManager.cpp:441] LoadJobDesc failed, jobId 619 error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.350131 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.350433 32178 JobManager.cpp:311] jobFinished, spaceId=97, jobId=835, result=FAILED
I20230626 19:37:31.350461 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.350512 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.350775 32180 JobManager.cpp:311] jobFinished, spaceId=97, jobId=830, result=FAILED
I20230626 19:37:31.350793 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.350847 32179 JobDescription.cpp:53] p = depended_by_day_index
I20230626 19:37:31.351155 32179 JobManager.cpp:311] jobFinished, spaceId=97, jobId=875, result=FAILED
I20230626 19:37:31.351179 32179 JobDescription.cpp:53] p = depended_by_day_index
I20230626 19:37:31.351233 32176 JobDescription.cpp:113] Loading job description failed, error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.351249 32176 JobManager.cpp:441] LoadJobDesc failed, jobId 619 error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.351321 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.351646 32178 JobManager.cpp:311] jobFinished, spaceId=97, jobId=835, result=FAILED
I20230626 19:37:31.351671 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.351722 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.352016 32180 JobManager.cpp:311] jobFinished, spaceId=97, jobId=830, result=FAILED
I20230626 19:37:31.352031 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.352083 32176 JobDescription.cpp:53] p = depended_by_day_index
I20230626 19:37:31.352377 32176 JobManager.cpp:311] jobFinished, spaceId=97, jobId=875, result=FAILED
I20230626 19:37:31.352398 32176 JobDescription.cpp:53] p = depended_by_day_index
I20230626 19:37:31.352452 32179 JobDescription.cpp:113] Loading job description failed, error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.352465 32179 JobManager.cpp:441] LoadJobDesc failed, jobId 619 error: E_JOB_NOT_IN_SPACE
I20230626 19:37:31.352533 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.352854 32178 JobManager.cpp:311] jobFinished, spaceId=97, jobId=835, result=FAILED
I20230626 19:37:31.352880 32178 JobDescription.cpp:53] p = bdsptask_day_index
I20230626 19:37:31.352936 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.353214 32180 JobManager.cpp:311] jobFinished, spaceId=97, jobId=830, result=FAILED
I20230626 19:37:31.353241 32180 JobDescription.cpp:53] p = inherited_by_day_index
I20230626 19:37:31.353307 32179 JobDescription.cpp:53] p = depended_by_day_index
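To see which jobs account for the repeated lines, the log can be tallied by job ID. The following is a minimal sketch in Python, assuming the line formats match the excerpt above; the function name `summarize_job_messages` is hypothetical and is not part of NebulaGraph.

```python
import re
from collections import Counter

# Patterns copied from the meta.INFO lines quoted above (assumed stable).
JOB_FINISHED_RE = re.compile(
    r"jobFinished, spaceId=(\d+), jobId=(\d+), result=(\w+)"
)
LOAD_FAILED_RE = re.compile(r"LoadJobDesc failed, jobId (\d+) error: (\w+)")


def summarize_job_messages(lines):
    """Count job-related log lines, keyed by (job id, result or error code)."""
    counts = Counter()
    for line in lines:
        match = JOB_FINISHED_RE.search(line)
        if match:
            counts[(int(match.group(2)), match.group(3))] += 1
            continue
        match = LOAD_FAILED_RE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts
```

Passing an open file handle for meta.INFO to `summarize_job_messages` and printing `most_common()` lists the job IDs by frequency. In the excerpt above, the same four jobs (619, 830, 835, 875) repeat within a few milliseconds.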
meta.INFO log on the other, healthy machines:
I20230626 18:08:07.795874 5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 18:16:42.965276 5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 18:16:42.985111 5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 18:25:19.666574 5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 18:25:19.677080 5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 18:33:54.337836 5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 18:33:54.345790 5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 18:42:31.912842 5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 18:42:31.926230 5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 18:51:08.028556 5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 18:51:08.041301 5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 18:59:44.096387 5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 18:59:44.099869 5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 19:08:24.429461 5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 19:08:24.433949 5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 19:19:11.120577 5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 19:19:11.126242 5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
I20230626 19:37:02.020364 5793 EventListener.h:21] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 0, base level is 0, output level is 1
I20230626 19:37:02.025600 5793 EventListener.h:35] Rocksdb compaction completed column family: default because of LevelL0FilesNum, status: OK, compacted 5 files into 1, base level is 0, output level is 1
Status of the jobs reported as failed in the log:
Because these jobs are in the FINISHED state, they cannot be recovered with RECOVER JOB id.