job compact stuck

Question template reference:

  • NebulaGraph version: v3.4.0
  • Deployment: distributed
  • Installation method: RPM
  • In production: N
  • Hardware info
    • Disk: SSD
    • CPU / memory: 128 GB RAM
  • Problem description
  • Relevant meta / storage / graph INFO logs (text form preferred, for easier searching)

When testing v3.3 earlier, we found that a compact job would get stuck: the compaction had still not finished after two days.
Now on v3.4 we are hitting the same problem again.

Operations: two days ago I submitted a compact job in space A, and also submitted a compact job in space B, which stayed in QUEUE.
Today I found the job in space A still in RUNNING status. After restarting meta today, the job in A shows FAILED and the job in B moved to RUNNING, but from the logs it only started; no real compaction is happening.

# A Space
 show jobs
+--------+------------------+------------+----------------------------+----------------------------+
| Job Id | Command          | Status     | Start Time                 | Stop Time                  |
+--------+------------------+------------+----------------------------+----------------------------+
| 24     | "LEADER_BALANCE" | "FINISHED" | 2023-03-24T02:49:52.000000 | 2023-03-24T02:49:57.000000 |
| 23     | "COMPACT"        | "FAILED"   | 2023-03-23T01:36:59.000000 | 2023-03-24T02:49:52.000000 |
+--------+------------------+------------+----------------------------+----------------------------+
Got 2 rows (time spent 39.865ms/40.779452ms)

Fri, 24 Mar 2023 11:36:36 CST

# B Space

show jobs
+--------+-----------+-----------+----------------------------+----------------------------+
| Job Id | Command   | Status    | Start Time                 | Stop Time                  |
+--------+-----------+-----------+----------------------------+----------------------------+
| 22     | "COMPACT" | "RUNNING" | 2023-03-24T02:51:58.000000 |                            |
| 21     | "STATS"   | "FAILED"  | 2023-03-23T01:35:16.000000 | 2023-03-24T02:51:58.000000 |
+--------+-----------+-----------+----------------------------+----------------------------+
Got 2 rows (time spent 38.89ms/39.747516ms)

Fri, 24 Mar 2023 11:37:07 CST

Logs: every node is in this same state, and nothing follows these lines:

I20230324 10:51:58.838047 14269 AdminTaskManager.cpp:315] subtask of task(21, 2) finished, unfinished task 1
I20230324 10:51:58.844677 14274 AdminTaskManager.cpp:326] task(21, 2) runSubTask() exit
I20230324 10:51:58.848191 14269 AdminTaskManager.cpp:326] task(21, 2) runSubTask() exit
I20230324 10:51:58.944547 14278 AdminTaskManager.cpp:315] subtask of task(21, 2) finished, unfinished task 0
I20230324 10:51:58.944617 14278 StatsTask.cpp:312] task(21, 2) finished, rc=[E_USER_CANCEL]
I20230324 10:51:58.948653 14079 AdminTaskManager.cpp:92] reportTaskFinish(), job=21, task=2, rc=E_USER_CANCEL
I20230324 10:51:58.948897 14079 AdminTaskManager.cpp:92] reportTaskFinish(), job=22, task=3, rc=E_TASK_EXECUTION_FAILED
I20230324 10:51:58.999516 14079 AdminTaskManager.cpp:134] reportTaskFinish(), job=21, task=2, rc=SUCCEEDED
I20230324 10:51:59.010535 13949 EventListener.h:21] Rocksdb start compaction column family: default because of ManualCompaction, status: OK, compacted 47741 files into 0, base level is 0, output level is 1
I20230324 10:51:59.031738 13949 CompactionFilter.h:82] Do full/manual compaction!

Additionally:
The configuration below has memory_tracker enabled, and storaged is using a very high amount of memory.

Two questions:

  1. Does enabling memory_tracker serve the same purpose as the older parameter --system_memory_high_watermark_ratio=0.9?
  2. There are no queries running right now, so why does memory usage stay this high? How should I resolve it? By restarting the storaged service?
# System memory high watermark ratio, cancel the memory checking when the ratio greater than 1.0
--system_memory_high_watermark_ratio=0.9

########## metrics ##########
--enable_space_level_metrics=false

--meta_client_timeout_ms=200000
--num_operator_threads=16
########## experimental feature ##########
# if use experimental features
--enable_experimental_feature=true

# if use balance data feature, only work if enable_experimental_feature is true
--enable_data_balance=true

########## session ##########
# Maximum number of sessions that can be created per IP and per user
--max_sessions_per_ip_per_user=300

########## memory tracker ##########
# trackable memory ratio (trackable_memory / (total_memory - untracked_reserved_memory) )
--memory_tracker_limit_ratio=0.8
# untracked reserved memory in Mib
--memory_tracker_untracked_reserved_memory_mb=50

# enable log memory tracker stats periodically
--memory_tracker_detail_log=false
# log memory tracker stats interval in milliseconds
--memory_tracker_detail_log_interval_ms=60000

# enable memory background purge (if jemalloc is used)
--memory_purge_enabled=true
# memory background purge interval in seconds
--memory_purge_interval_seconds=10

Job 22, i.e. the compaction, never really ran because its execution failed:

I20230324 10:51:58.948653 14079 AdminTaskManager.cpp:92] reportTaskFinish(), job=21, task=2, rc=E_USER_CANCEL
I20230324 10:51:58.948897 14079 AdminTaskManager.cpp:92] reportTaskFinish(), job=22, task=3, rc=E_TASK_EXECUTION_FAILED

So meta may not have dispatched the task again, or it did dispatch it but storage decided it had already failed and did not need to run it. It's best to avoid operations like STOP JOB where possible. You can restart meta, and subsequent jobs should run normally.
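With a default RPM install, restarting only metad is usually done with the bundled service script (a sketch assuming the default /usr/local/nebula install path; adjust if installed elsewhere):

# Restart only the metad component on each meta host, then confirm it came back up.
/usr/local/nebula/scripts/nebula.service restart metad
/usr/local/nebula/scripts/nebula.service status metad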

They no longer do quite the same thing: system_memory_high_watermark_ratio is only a monitoring thread, while actual memory management is handled by that group of memory tracker parameters.

Hard to say why usage is so high; you'd need to provide some extra information, such as monitoring history.
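One low-effort way to gather that, assuming the memory tracker flags shown in the config above: turn on the periodic detail log in nebula-storaged.conf, restart storaged, and watch the INFO log (the exact log wording can differ between versions):

# nebula-storaged.conf — log MemoryTracker stats once a minute
--memory_tracker_detail_log=true
--memory_tracker_detail_log_interval_ms=60000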

  • "So meta may not have dispatched the task again, or it did dispatch it but storage decided it had already failed"
    Which logs should I look at to pin this down?

  • "You'd need to provide some extra information, such as monitoring history"
    Which logs are needed for this? storaged's? From what I can see, the memory increase happened right after the first meta restart, once the second compaction job started.

Ah, we looked at the logic and there does seem to be a problem. One moment.

Let me describe the sequence of operations I'm guessing from the logs; check whether it's right (ignoring space A, since there aren't enough logs to tell what happened there):

  1. In space B you submitted a stats job, id 21
  2. In space B you submitted a compact job, id 22
  3. You expected 21 and 22 to run concurrently (in fact they can't)
  4. You wanted the compact to run first, so you executed stop job 21, but found 22 still didn't run
  5. You restarted meta (I'd like to know how much time passed between steps 4 and 5)
  6. You saw 21 as FAILED and 22 as RUNNING, but suspected it wasn't actually executing

Those steps aside, my guess is that the compaction is actually running, because the log lines below show a compaction in progress. With that many L0 files it will take a very long time, and on top of that you said memory went up. You can run top -Hp <pid> to check whether RocksDB threads are busy (see the sketch after the log lines).

I20230324 10:51:59.010535 13949 EventListener.h:21] Rocksdb start compaction column family: default because of ManualCompaction, status: OK, compacted 47741 files into 0, base level is 0, output level is 1
I20230324 10:51:59.031738 13949 CompactionFilter.h:82] Do full/manual compaction!
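A minimal way to check, assuming the storaged process is named nebula-storaged (thread names like rocksdb:low0 / rocksdb:high0 are typical but may differ by RocksDB version):

# Show per-thread CPU usage of the storaged process; busy rocksdb:* threads indicate compaction work.
top -Hp $(pidof nebula-storaged)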

Also, paste the SHOW JOBS output for space B and I'll take another look.


If that's the case it's probably fine; restarting storaged should be harmless. You could paste the storaged logs here so we can take a look; it's hard to guess otherwise.

I went back over my operation history and realized I described it wrong:

  1. In space B: job 21 (stats, stayed RUNNING), then job 22 (compact, QUEUE)
  2. In space A: job 23 (compact, RUNNING), then job 24 (leader balance)
  3. I never noticed that things were blocked by the stats job 21 still RUNNING. Since I assumed compact jobs run sequentially, seeing job 23 RUNNING I thought job 23 was what was blocking it.
  4. Then I restarted one meta, found job 23 had failed, and ran job 24 (because leaders were unbalanced)
  5. Then, back in space B, I finally noticed job 21 (stats) was RUNNING, so I stopped job 21
  6. Then I found job 22 still wasn't executing, so I restarted all the meta nodes
  7. Right now the job is still in RUNNING status

The SHOW JOBS history for space B is as follows:

(root@nebula) [SupplyChain]> show jobs
+--------+-----------+-----------+----------------------------+-----------+
| Job Id | Command   | Status    | Start Time                 | Stop Time |
+--------+-----------+-----------+----------------------------+-----------+
| 22     | "COMPACT" | "QUEUE"   |                            |           |
| 21     | "STATS"   | "RUNNING" | 2023-03-23T01:35:16.000000 |           |
+--------+-----------+-----------+----------------------------+-----------+
Got 2 rows (time spent 35.036ms/35.725287ms)

Fri, 24 Mar 2023 10:51:42 CST

(root@nebula) [SupplyChain]> stop job 21
+---------------+
| Result        |
+---------------+
| "Job stopped" |
+---------------+
Got 1 rows (time spent 31.276ms/31.911511ms)

Fri, 24 Mar 2023 10:51:58 CST

(root@nebula) [SupplyChain]> show jobs
+--------+-----------+-----------+----------------------------+----------------------------+
| Job Id | Command   | Status    | Start Time                 | Stop Time                  |
+--------+-----------+-----------+----------------------------+----------------------------+
| 22     | "COMPACT" | "RUNNING" | 2023-03-24T02:51:58.000000 |                            |
| 21     | "STATS"   | "FAILED"  | 2023-03-23T01:35:16.000000 | 2023-03-24T02:51:58.000000 |
+--------+-----------+-----------+----------------------------+----------------------------+
Got 2 rows (time spent 31.669ms/32.41315ms)

Fri, 24 Mar 2023 10:52:00 CST

(root@nebula) [SupplyChain]> 
(root@nebula) [SupplyChain]> show jobs
+--------+-----------+-----------+----------------------------+----------------------------+
| Job Id | Command   | Status    | Start Time                 | Stop Time                  |
+--------+-----------+-----------+----------------------------+----------------------------+
| 22     | "COMPACT" | "RUNNING" | 2023-03-24T02:51:58.000000 |                            |
| 21     | "STATS"   | "FAILED"  | 2023-03-23T01:35:16.000000 | 2023-03-24T02:51:58.000000 |
+--------+-----------+-----------+----------------------------+----------------------------+
Got 2 rows (time spent 33.157ms/34.062213ms)

Fri, 24 Mar 2023 10:59:08 CST


After restarting storaged, the abnormal log below appeared:

Log file created at: 2023/03/24 14:36:54
Running on machine: nebula3-02
Running duration (h:mm:ss): 0:00:00
Log line format: [IWEF]yyyymmdd hh:mm:ss.uuuuuu threadid file:line] msg
E20230324 14:36:54.473713  7264 Serializer.h:43] Thrift serialization is only defined for structs and unions, not containers thereof. Attemping to deserialize a value of type `nebula::HostAddr`.

Steps 2/3/4 in the middle should be unrelated to the symptom. The compaction in space B is expected to take a long time; while it runs, keep an eye on CPU usage.

You can ignore that error log.

Right now it shows RUNNING, but from the logs it still hasn't actually started. Should I just wait and see?

It never started running. After restarting metad, I stopped the job and resubmitted it.
One more question: I saw in the docs that balance data is now an enterprise-only feature? So the community edition can't use it? When scaling in or out later, how do I rebalance partition data onto other nodes?

It still never ran properly. In the end I dropped the space and re-imported the data. :sleepy:

Really? The logs you pasted earlier show the compaction had already started. How did you conclude it wasn't running?

For now there doesn't seem to be a way in the community edition. The version released in the community edition earlier carried a risk of data loss, so it was disabled; it may be reworked later.

It just stayed stuck at that first line; there was no further log output, disk IO never ramped up, and df -h showed no decrease in disk usage.
CompactionFilter.h:82] Do full/manual compaction!
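A few ways to confirm whether a compaction is actually making progress (a sketch; the data path assumes the default /usr/local/nebula layout, adjust to your --data_path):

# Disk throughput over time; a running compaction should show sustained read/write activity.
iostat -x 2
# Data directory size and SST file count; the file count should shrink as L0 files are merged.
du -sh /usr/local/nebula/data/storage
find /usr/local/nebula/data/storage -name '*.sst' | wc -l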

Could you help check whether this is a bug? We're deciding whether to upgrade.
The compact job keeps failing. On node 2 I saw a "too many open files" error in the listen log; after raising ulimit -n from 65535 to 100000 it still failed.
I've now rerun it once more, but from the logs node 2 never triggers the compact job, while the other nodes trigger it and run normally.
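Note that a ulimit -n raised in one shell only applies to processes started from that shell, so it's worth confirming the running storaged actually has the higher limit (a sketch assuming the default RPM/script deployment; limits.conf or LimitNOFILE are the usual ways to make it permanent):

# Restart storaged from a shell that has the raised limit, then verify the live process.
ulimit -n 100000 && /usr/local/nebula/scripts/nebula.service restart storaged
grep 'open files' /proc/$(pidof nebula-storaged)/limits
# For a permanent change, set nofile in /etc/security/limits.conf, or LimitNOFILE= in the
# systemd unit if storaged is managed by systemd, then restart the service.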


(root@nebula) [SupplyChain]> show jobs
+--------+------------------+------------+----------------------------+----------------------------+
| Job Id | Command          | Status     | Start Time                 | Stop Time                  |
+--------+------------------+------------+----------------------------+----------------------------+
| 40     | "LEADER_BALANCE" | "FINISHED" | 2023-03-27T11:55:05.000000 | 2023-03-27T11:55:10.000000 |
| 39     | "COMPACT"        | "FAILED"   | 2023-03-27T08:59:19.000000 | 2023-03-27T11:01:37.000000 |
| 38     | "LEADER_BALANCE" | "FINISHED" | 2023-03-27T08:51:57.000000 | 2023-03-27T08:52:22.000000 |
| 37     | "LEADER_BALANCE" | "FINISHED" | 2023-03-27T07:48:44.000000 | 2023-03-27T07:49:14.000000 |
| 36     | "COMPACT"        | "FAILED"   | 2023-03-27T03:57:44.000000 | 2023-03-27T07:48:44.000000 |
| 35     | "LEADER_BALANCE" | "FINISHED" | 2023-03-27T03:57:11.000000 | 2023-03-27T03:57:36.000000 |
| 34     | "LEADER_BALANCE" | "FINISHED" | 2023-03-27T03:54:22.000000 | 2023-03-27T03:54:57.000000 |
| 33     | "COMPACT"        | "FAILED"   | 2023-03-26T23:31:13.000000 | 2023-03-27T03:21:19.000000 |
| 32     | "COMPACT"        | "FAILED"   | 2023-03-26T12:05:04.000000 | 2023-03-26T23:01:05.000000 |
| 31     | "COMPACT"        | "FINISHED" | 2023-03-25T02:01:00.000000 | 2023-03-25T08:59:08.000000 |
| 30     | "LEADER_BALANCE" | "FINISHED" | 2023-03-24T14:10:34.000000 | 2023-03-24T14:10:39.000000 |
+--------+------------------+------------+----------------------------+----------------------------+
Got 11 rows (time spent 31.208ms/32.240561ms)


nebula-storaged.nebula3-02.root.log.INFO.20230327-150637.1948 (28.6 KB)
nebula-storaged.nebula3-02.root.log.INFO.20230327-173238.2510 (24.2 KB)
nebula-graphd.nebula3-02.root.log.INFO.20230327-150637.1896 (8.9 KB)

On node 2, storaged used too much memory and appears to have been killed by the OS; storaged is down now.
Checking dmesg, the executor-pri3-2 thread was the one killed. Searching around, there was a similar scheduling issue before: 2.0版本一个奇怪操作引发coredump - #18, from sworduo

[29296.431825] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[29296.433521] [  462]     0   462     9765     1129      23        0             0 systemd-journal
[29296.435560] [  494]     0   494    11421      209      24        0         -1000 systemd-udevd
[29296.437315] [  614]     0   614    13883      101      27        0         -1000 auditd
[29296.439233] [  641]     0   641    11042       88      27        0             0 qemu-ga
[29296.441001] [  642]    81   642    14559      165      34        0          -900 dbus-daemon
[29296.442927] [  643]     0   643     1097       35       8        0             0 acpid
[29296.444567] [  644]     0   644     6596       76      18        0             0 systemd-logind
[29296.446312] [  645]   999   645   153058     1860      61        0             0 polkitd
[29296.448144] [  647]     0   647     5404       77      16        0             0 irqbalance
[29296.450325] [  653]   998   653    29452      115      30        0             0 chronyd
[29296.452199] [  729]     0   729   137545     1032      86        0             0 NetworkManager
[29296.454267] [  970]     0   970   181836     3250      25        0             0 node_exporter
[29296.456456] [  971]     0   971   143569     3325      97        0             0 tuned
[29296.458253] [ 1214]     0  1214    22427      259      42        0             0 master
[29296.460040] [ 1240]    89  1240    22470      250      43        0             0 qmgr
[29296.461720] [ 1309]     0  1309    56680      486      46        0             0 rsyslogd
[29296.463367] [ 1314]     0  1314   285265     4237      42        0             0 elkeid-agent
[29296.465180] [ 1358]     0  1358    31596      156      19        0             0 crond
[29296.466960] [ 1425]     0  1425    28225      256      57        0         -1000 sshd
[29296.468582] [ 1426]     0  1426    27551       33      11        0             0 agetty
[29296.470228] [ 1428]     0  1428    27551       34      10        0             0 agetty
[29296.472064] [ 1671]     0  1671    25726      514      51        0             0 dhclient
[29296.474004] [ 1886]     0  1886   271750    71372     296        0             0 nebula-metad
[29296.475740] [ 1896]     0  1896   156429    71996     206        0             0 nebula-graphd
[29296.477556] [ 2399]     0  2399    40377      344      81        0             0 sshd
[29296.479332] [ 2401]     0  2401    28886      108      14        0             0 bash
[29296.480899] [ 2543]     0  2510 45081322 31527927   79735        0             0 executor-pri3-2
[29296.482541] [ 2945]     0  2945    40412      358      82        0             0 sshd
[29296.484046] [ 2947]     0  2947    28886      107      14        0             0 bash
[29296.485518] [ 2993]     0  2993    18063      188      40        0             0 sftp-server
[29296.487146] [ 3022]    89  3022    22453      249      43        0             0 pickup
[29296.488840] Out of memory: Kill process 2543 (executor-pri3-2) score 930 or sacrifice child
[29296.490471] Killed process 2543 (executor-pri3-2), UID 0, total-vm:180325288kB, anon-rss:126109248kB, file-rss:2460kB, shmem-rss:0kB

Please paste the full logs; I haven't seen that error before. In your environment, auto compaction was not enabled previously, right?

Correct, it was not enabled. In earlier tests with nebula-spark, imports were very slow with auto compaction enabled, so we turned it off; after the data import finishes we run a manual compact. This space has a relatively large number of edges.
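For context, this is typically how auto compaction is turned off for bulk imports in nebula-storaged.conf (a sketch of the usual setting, not copied from this cluster's config):

# nebula-storaged.conf — disable RocksDB auto compaction during import;
# re-enable it (or run a manual SUBMIT JOB COMPACT) once the import is done.
--rocksdb_column_family_options={"disable_auto_compactions":"true"}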

The listen log doesn't contain much. The most recent crash is no longer related to that.

Could not create logging file: Too many open files
COULD NOT CREATE A LOGGINGFILE 20230327-171239.1948!F20230327 17:12:39.974443  2030 RaftPart.cpp:1092] [Port: 9780, Space: 29, Part: 36] Failed to commit logs
*** Check failure stack trace: ***
ExceptionHandler::GenerateDump sys_pipe failed:Too many open files
ExceptionHandler::SendContinueSignalToChild sys_write failed:ExceptionHandler::WaitForContinueSignal sys_read failed:Bad file descriptor
Bad file descriptor
*** Aborted at 1679908360 (Unix time, try 'date -d @1679908360') ***
*** Signal 6 (SIGABRT) (0x79c) received by PID 1948 (pthread TID 0x7fb2af2fe700) (linux TID 2030) (maybe from PID 1948, UID 0) (code: -6), stack trace: ***
(error retrieving stack trace)

These are the logs from another compact run this morning. This node crashed again; looking at the memory curve, memory again hit 100% and then it crashed.

nebula-graphd.nebula3-02.root.log.INFO.20230327-150637.1896 (15.2 KB)
nebula-graphd.nebula3-02.root.log.WARNING.20230327-230137.1896 (502 字节)
nebula-storaged.nebula3-02.root.log.INFO.20230328-100531.3427 (24.8 KB)
nebula-storaged.nebula3-02.root.log.WARNING.20230328-103813.3427 (695 字节)
nebula-storaged.nebula3-02.root.log.ERROR.20230328-103813.3427 (375 字节)

Any ideas on this crash issue?