Compact job stuck

Too few file handles; raise the limit with ulimit.
Also reduce the number of compaction threads.

ulimit -n is now set to 130000. I'll try lowering subcompactions from 8 to 4:

--rocksdb_db_options={"max_subcompactions":"4","max_background_jobs":"4"}
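
Since `ulimit -n` only applies to the shell it is run in and to processes started from it afterward, it may be worth confirming the limit the running storaged process actually has. A minimal Python sketch, assuming a Linux host where `/proc/<pid>/limits` is readable; the helper name `max_open_files`, the sample row, and the pid in the comment are illustrative and not taken from this thread:

```python
from typing import Optional, Tuple

def max_open_files(limits_text: str) -> Optional[Tuple[str, str]]:
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits."""
    prefix = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(prefix):
            # Remaining columns: soft limit, hard limit, units.
            fields = line[len(prefix):].split()
            return fields[0], fields[1]
    return None

# Illustrative sample row, not real output from this cluster.
sample = "Max open files            130000               130000               files"
print(max_open_files(sample))

# On a real host (hypothetical pid):
# with open("/proc/12345/limits") as f:
#     print(max_open_files(f.read()))
```

The values are kept as strings because the kernel can also report `unlimited` in those columns.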

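After applying the setting and restarting, the original compact job failed. I resubmitted the compact job, but it still didn't get going.
For example, on node 1 the IO and log look like this: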

I20230403 11:08:19.130861 27479 EventListener.h:35] Rocksdb compaction completed column family: default because of ManualCompaction, status: OK, compacted 13 files into 13, base level is 1, output level is 1
I20230403 11:08:19.261539 27480 EventListener.h:21] Rocksdb start compaction column family: default because of ManualCompaction, status: OK, compacted 13 files into 0, base level is 1, output level is 1
I20230403 11:08:19.261695 31659 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:19.261770 31660 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:19.261727 31658 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:19.261780 31661 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:19.261999 31662 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:19.262145 31663 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:19.262279 31664 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:19.262403 31665 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:19.262455 27480 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:19.262497 31666 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:37.096343 27480 EventListener.h:35] Rocksdb compaction completed column family: default because of ManualCompaction, status: OK, compacted 13 files into 13, base level is 1, output level is 1
I20230403 11:08:37.225275 27482 EventListener.h:21] Rocksdb start compaction column family: default because of ManualCompaction, status: OK, compacted 13 files into 0, base level is 1, output level is 1
I20230403 11:08:37.225440 31692 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:37.225473 31694 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:37.225489 31695 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:37.225448 31693 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:37.225710 31696 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:37.225968 31697 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:37.226039 31698 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:37.226140 27482 CompactionFilter.h:82] Do full/manual compaction!
I20230403 11:08:37.226152 31699 CompactionFilter.h:82] Do full/manual compaction!
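
To judge whether node 1 is actually making progress, one option is to measure how long each manual compaction round takes. The following is a rough Python sketch, assuming the glog line layout shown above and assuming that a compaction's start and completion are logged by the same thread id (which holds for thread 27480 in this excerpt); the function names are made up for illustration:

```python
from datetime import datetime

def parse_ts(line: str) -> datetime:
    # glog prefix: severity letter + YYYYMMDD, then HH:MM:SS.ffffff
    date, time = line.split()[:2]
    return datetime.strptime(date[1:] + " " + time, "%Y%m%d %H:%M:%S.%f")

def compaction_durations(lines):
    """Seconds between 'start compaction' and 'compaction completed' per thread id."""
    started = {}
    durations = []
    for line in lines:
        if "EventListener.h" not in line:
            continue
        tid = line.split()[2]
        if "start compaction" in line:
            started[tid] = parse_ts(line)
        elif "compaction completed" in line and tid in started:
            elapsed = parse_ts(line) - started.pop(tid)
            durations.append(elapsed.total_seconds())
    return durations

# Two lines copied from the node 1 log above.
log = [
    "I20230403 11:08:19.261539 27480 EventListener.h:21] Rocksdb start compaction column family: default because of ManualCompaction, status: OK, compacted 13 files into 0, base level is 1, output level is 1",
    "I20230403 11:08:37.096343 27480 EventListener.h:35] Rocksdb compaction completed column family: default because of ManualCompaction, status: OK, compacted 13 files into 13, base level is 1, output level is 1",
]
print(compaction_durations(log))
```

For the pair above this comes to a little under 18 seconds per round of 13 files.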

On node 2 the IO never picked up. The log still looks like the following, same as before: it waits for the job, and after a while storaged crashes.

I20230403 10:52:38.517237 14512 NebulaStore.cpp:79] Register handler...
I20230403 10:52:38.517266 14512 StorageServer.cpp:253] Init LogMonitor
I20230403 10:52:38.517587 14512 StorageServer.cpp:120] Starting Storage HTTP Service
I20230403 10:52:38.519131 14512 StorageServer.cpp:124] Http Thread Pool started
I20230403 10:52:38.524035 14944 WebService.cpp:124] Web service started on HTTP[19779]
I20230403 10:52:38.524145 14512 RocksEngineConfig.cpp:371] Emplace rocksdb option max_background_jobs=4
I20230403 10:52:38.524164 14512 RocksEngineConfig.cpp:371] Emplace rocksdb option max_subcompactions=4
I20230403 10:52:38.524308 14512 RocksEngineConfig.cpp:371] Emplace rocksdb option max_bytes_for_level_base=268435456
I20230403 10:52:38.524323 14512 RocksEngineConfig.cpp:371] Emplace rocksdb option max_write_buffer_number=4
I20230403 10:52:38.524331 14512 RocksEngineConfig.cpp:371] Emplace rocksdb option write_buffer_size=67108864
I20230403 10:52:38.524339 14512 RocksEngineConfig.cpp:371] Emplace rocksdb option disable_auto_compactions=true
I20230403 10:52:38.524479 14512 RocksEngineConfig.cpp:371] Emplace rocksdb option block_size=8192
I20230403 10:52:38.626150 14512 RocksEngine.cpp:107] open rocksdb on /usr/local/nebula/data/storage/nebula/0/data
I20230403 10:52:38.626317 14512 AdminTaskManager.cpp:22] max concurrent subtasks: 10
I20230403 10:52:38.627151 14512 AdminTaskManager.cpp:40] exit AdminTaskManager::init()
I20230403 10:52:38.627336 14965 AdminTaskManager.cpp:224] waiting for incoming task
I20230403 10:52:38.627573 14966 AdminTaskManager.cpp:92] reportTaskFinish(), job=45, task=0, rc=E_TASK_EXECUTION_FAILED
I20230403 10:52:38.641672 14512 MemoryUtils.cpp:171] MemoryTracker set static ratio: 0.4
I20230403 10:52:38.682651 14966 AdminTaskManager.cpp:134] reportTaskFinish(), job=45, task=0, rc=SUCCEEDED
I20230403 10:52:49.496403 14575 MetaClient.cpp:3259] Load leader of "172.18.163.85":9779 in 3 space
I20230403 10:52:49.496490 14575 MetaClient.cpp:3259] Load leader of "172.18.163.114":9779 in 3 space
I20230403 10:52:49.496502 14575 MetaClient.cpp:3259] Load leader of "172.18.163.115":9779 in 0 space
I20230403 10:52:49.496527 14575 MetaClient.cpp:3259] Load leader of "172.18.163.124":9779 in 3 space
I20230403 10:52:49.496536 14575 MetaClient.cpp:3265] Load leader ok
E20230403 11:04:38.536161 14548 Serializer.h:43] Thrift serialization is only defined for structs and unions, not containers thereof. Attemping to deserialize a value of type `nebula::HostAddr`.
I20230403 11:04:52.175076 14575 MetaClient.cpp:3259] Load leader of "172.18.163.85":9779 in 3 space
I20230403 11:04:52.175132 14575 MetaClient.cpp:3259] Load leader of "172.18.163.114":9779 in 3 space
I20230403 11:04:52.175161 14575 MetaClient.cpp:3259] Load leader of "172.18.163.115":9779 in 3 space
I20230403 11:04:52.175199 14575 MetaClient.cpp:3259] Load leader of "172.18.163.124":9779 in 3 space
I20230403 11:04:52.175212 14575 MetaClient.cpp:3265] Load leader ok
I20230403 11:05:01.575136 14542 AdminTask.cpp:21] createAdminTask (47, 0)
I20230403 11:05:01.575318 14542 AdminTaskManager.cpp:155] enqueue task(47, 0)
I20230403 11:05:01.575482 14965 AdminTaskManager.cpp:236] dequeue task(47, 0)
I20230403 11:05:01.575554 14965 AdminTaskManager.cpp:279] run task(47, 0), 1 subtasks in 1 thread
I20230403 11:05:01.576241 14965 AdminTaskManager.cpp:224] waiting for incoming task

I tried several more times, and storaged still crashed:

I20230403 17:13:20.035351 17080 EventListener.h:21] Rocksdb start compaction column family: default because of ManualCompaction, status: OK, compacted 66311 files into 0, base level is 0, output level is 1
I20230403 17:13:20.065162 18972 CompactionFilter.h:82] Do full/manual compaction!
I20230403 17:13:20.065271 18973 CompactionFilter.h:82] Do full/manual compaction!
I20230403 17:13:20.065281 17080 CompactionFilter.h:82] Do full/manual compaction!
I20230403 17:13:20.065320 18974 CompactionFilter.h:82] Do full/manual compaction!
I20230403 17:14:18.985951 17324 MemoryUtils.cpp:227] sys:34.712GiB/125.754GiB 27.60% usr:28.763GiB/50.282GiB 57.20%
I20230403 17:15:18.986356 17324 MemoryUtils.cpp:227] sys:42.229GiB/125.754GiB 33.58% usr:36.218GiB/50.282GiB 72.03%
I20230403 17:16:19.985954 17324 MemoryUtils.cpp:227] sys:49.830GiB/125.754GiB 39.62% usr:43.767GiB/50.282GiB 87.04%
I20230403 17:17:19.986378 17324 MemoryUtils.cpp:227] sys:57.357GiB/125.754GiB 45.61% usr:51.232GiB/50.282GiB 101.89%
I20230403 17:18:19.985491 17324 MemoryUtils.cpp:227] sys:64.789GiB/125.754GiB 51.52% usr:58.604GiB/50.282GiB 116.55%
I20230403 17:19:19.986600 17324 MemoryUtils.cpp:227] sys:72.368GiB/125.754GiB 57.55% usr:66.142GiB/50.282GiB 131.54%
I20230403 17:20:19.985999 17324 MemoryUtils.cpp:227] sys:79.704GiB/125.754GiB 63.38% usr:73.421GiB/50.282GiB 146.02%
I20230403 17:21:19.986316 17324 MemoryUtils.cpp:227] sys:87.090GiB/125.754GiB 69.25% usr:80.751GiB/50.282GiB 160.60%
I20230403 17:22:19.985656 17324 MemoryUtils.cpp:227] sys:94.456GiB/125.754GiB 75.11% usr:88.057GiB/50.282GiB 175.13%
I20230403 17:23:19.985388 17324 MemoryUtils.cpp:227] sys:100.513GiB/125.754GiB 79.93% usr:95.376GiB/50.282GiB 189.68%
W20230403 17:23:47.987363 17324 MemoryUtils.cpp:133] Memory usage has hit the high watermark of system, available: 2.41871e+10 vs. total: 135026831360 in bytes.
I20230403 17:24:19.986222 17324 MemoryUtils.cpp:227] sys:106.319GiB/125.754GiB 84.55% usr:102.582GiB/50.282GiB 204.01%
I20230403 17:25:20.984989 17324 MemoryUtils.cpp:227] sys:112.217GiB/125.754GiB 89.24% usr:109.902GiB/50.282GiB 218.57%
W20230403 17:25:27.985296 17324 MemoryUtils.cpp:133] Memory usage has hit the high watermark of system, available: 1.38074e+10 vs. total: 135026831360 in bytes.
I20230403 17:26:20.986110 17324 MemoryUtils.cpp:227] sys:118.011GiB/125.754GiB 93.84% usr:117.088GiB/50.282GiB 232.86%
W20230403 17:27:07.985222 17324 MemoryUtils.cpp:133] Memory usage has hit the high watermark of system, available: 3.44659e+09 vs. total: 135026831360 in bytes.
I20230403 17:27:21.985656 17324 MemoryUtils.cpp:227] sys:123.895GiB/125.754GiB 98.52% usr:124.388GiB/50.282GiB 247.38%
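
The memory lines above climb fairly steadily until the host runs out. A small Python sketch for estimating the growth rate from such lines, assuming the `sys:<used>GiB/<total>GiB` layout printed by MemoryUtils.cpp in this log; the regex and function names are illustrative:

```python
import re
from datetime import datetime

SYS_USAGE = re.compile(r"sys:([\d.]+)GiB/([\d.]+)GiB")

def sys_usage(line):
    """Return (timestamp, used GiB) from a memory usage line, or None."""
    m = SYS_USAGE.search(line)
    if not m:
        return None
    date, time = line.split()[:2]
    ts = datetime.strptime(date[1:] + " " + time, "%Y%m%d %H:%M:%S.%f")
    return ts, float(m.group(1))

def growth_gib_per_min(lines):
    """Average growth between the first and last usage sample."""
    samples = [s for s in map(sys_usage, lines) if s]
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    return (u1 - u0) / ((t1 - t0).total_seconds() / 60)

# First and last usage lines copied from the log above.
log = [
    "I20230403 17:14:18.985951 17324 MemoryUtils.cpp:227] sys:34.712GiB/125.754GiB 27.60% usr:28.763GiB/50.282GiB 57.20%",
    "I20230403 17:27:21.985656 17324 MemoryUtils.cpp:227] sys:123.895GiB/125.754GiB 98.52% usr:124.388GiB/50.282GiB 247.38%",
]
print(round(growth_gib_per_min(log), 2))
```

Over these 13 minutes the system usage grows by roughly 6.8 GiB per minute, which matches the crash a few minutes after the high-watermark warnings.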

Could you help figure out the cause? I'm worried we'll hit this once we go to production.

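The first node is fine now. The second node has too many L0 files. Change the thread count to 1. You can also trigger a manual compact on the second node directly with curl, for example: curl -G "http://127.0.0.1:19779/admin?space=ldbc_snb_sf10&op=compact"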

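Replace the space name with yours. Try 1 first. When you go to production it's best not to disable compaction; otherwise, with this many L0 files, memory may well not hold up.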
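
If shell quoting gets in the way (an unquoted `&` in the URL is interpreted by the shell), the same request can be built in code instead. A minimal Python sketch using only the standard library; the host, port, and query parameters are the ones from the curl example in this thread, while `compact_url` and `submit` are hypothetical helper names:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def compact_url(host: str, port: int, space: str) -> str:
    """Build the storaged admin URL that requests a compact for one space."""
    query = urlencode({"space": space, "op": "compact"})
    return f"http://{host}:{port}/admin?{query}"

def submit(url: str) -> str:
    """Send the GET request; needs a reachable storaged HTTP service."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode()

url = compact_url("127.0.0.1", 19779, "ldbc_snb_sf10")
print(url)
# print(submit(url))  # uncomment on the storaged host
```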

Is this status correct?


It can't run in the background, but judging from the log it is running.

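I plan to import the data a little at a time: import part of it, then compact once.
The reason I disabled compaction: I tried spark-import, and with auto compaction on the import was very slow, slower even than the nebula-importer CSV tool. I had no better option at the time, so I turned it off.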

My mistake. For some reason the submission failed with double quotes and only succeeded after I switched to single quotes.

After changing it to 1, storaged still hit OOM and went down. After restarting storaged, how can I check the status of the compact job I submitted via curl?

You can't query it with curl. Only submission has an HTTP interface.

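Then try another approach: turn auto compaction on, restart, and hope the automatic compaction doesn't crash. There doesn't seem to be much else to try. You could also check whether THP is disabled at the system level.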

OK, thanks! So there is no way to check the status of a compact submitted via curl, right? I don't see any related record in show jobs.

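THP is disabled. I've turned auto compaction on, and the log does show minor compactions running. I set the compaction thread count to 1, though, so it may be slow. I'll keep an eye on when it finishes.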

[root@nebula-server-01 scripts]# cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
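
In that output the bracketed word is the active mode, so `[never]` means THP is off. For checking several nodes from a script, a tiny Python sketch could extract it; the function name is illustrative, and the file path in the comment is the one from the command above:

```python
def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from the THP 'enabled' file contents."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no bracketed mode found")

print(active_thp_mode("always madvise [never]"))

# On a real host:
# with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
#     print(active_thp_mode(f.read()))
```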
