Dropped a space, but the disk space was not released: file descriptor leak

Question template:

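  • nebula version: release3.1.0
  • Deployment: distributed
  • Installation: compiled from source
  • Production environment: Y
  • Hardware
    • Disk: NVMe
    • CPU: 64 cores; memory: 512 GB
  • Problem description
    I ran the nGQL statement drop space and it returned success. The meta and storage logs also look as expected.
    On disk, however, the space's directory was deleted but the disk space was not released.
    Running lsof -p storagePid | grep delete showed that the sst files were still being held open.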
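
For anyone who wants to quantify the same symptom, here is a minimal sketch that does roughly what lsof -p storagePid | grep delete does, by reading /proc directly. It assumes Linux, where the kernel appends " (deleted)" to the link target of an fd whose file has been unlinked. The function names and the proc_root parameter are hypothetical helpers for illustration, not part of nebula.

```python
import os

DELETED_SUFFIX = " (deleted)"  # Linux marks unlinked-but-open files this way


def deleted_open_files(pid, proc_root="/proc"):
    """Return sorted (fd, path, size_bytes) for files `pid` still holds open after unlink."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    held = []
    for name in os.listdir(fd_dir):
        link = os.path.join(fd_dir, name)
        try:
            target = os.readlink(link)
        except OSError:
            continue  # the fd was closed while we were scanning
        if not target.endswith(DELETED_SUFFIX):
            continue
        try:
            size = os.stat(link).st_size  # follows the link to the still-open inode
        except OSError:
            size = 0  # size unknown; report the fd anyway
        held.append((int(name), target[: -len(DELETED_SUFFIX)], size))
    return sorted(held)


def summarize(held, ext=".sst"):
    """Count the held files ending in `ext` and sum the bytes they still occupy."""
    matching = [h for h in held if h[1].endswith(ext)]
    return len(matching), sum(h[2] for h in matching)
```

Calling summarize(deleted_open_files(storage_pid)) as the user that owns the storaged process (or as root) would give the number of leaked sst files and the bytes they still pin on disk.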

Relevant storage INFO log:

I20220928 16:28:40.227464 50974 NebulaStore.cpp:584] Space 180, part 24 has been removed!
I20220928 16:28:43.053258 50977 Part.h:59] [Port: 9780, Space: 180, Part: 87] ~Part()
I20220928 16:28:43.136325 50978 Part.h:59] [Port: 9780, Space: 180, Part: 57] ~Part()
I20220928 16:28:43.483928 50975 Part.h:59] [Port: 9780, Space: 180, Part: 52] ~Part()
I20220928 16:28:43.575768 50976 Part.h:59] [Port: 9780, Space: 180, Part: 97] ~Part()
I20220928 16:28:43.841101 50977 Part.h:59] [Port: 9780, Space: 180, Part: 82] ~Part()
I20220928 16:28:44.061311 50978 Part.h:59] [Port: 9780, Space: 180, Part: 17] ~Part()
I20220928 16:28:44.272780 50975 Part.h:59] [Port: 9780, Space: 180, Part: 12] ~Part()
I20220928 16:28:44.591847 50976 Part.h:59] [Port: 9780, Space: 180, Part: 37] ~Part()
I20220928 16:28:44.741097 50977 Part.h:59] [Port: 9780, Space: 180, Part: 47] ~Part()
I20220928 16:28:44.845252 50975 Part.h:59] [Port: 9780, Space: 180, Part: 62] ~Part()
I20220928 16:28:44.856282 50978 Part.h:59] [Port: 9780, Space: 180, Part: 2] ~Part()
I20220928 16:28:44.972342 50977 Part.h:59] [Port: 9780, Space: 180, Part: 77] ~Part()
I20220928 16:28:45.151687 50977 Part.h:59] [Port: 9780, Space: 180, Part: 84] ~Part()
I20220928 16:28:45.359885 50978 Part.h:59] [Port: 9780, Space: 180, Part: 92] ~Part()
I20220928 16:28:45.467921 50975 Part.h:59] [Port: 9780, Space: 180, Part: 22] ~Part()
I20220928 16:28:45.590116 50975 Part.h:59] [Port: 9780, Space: 180, Part: 7] ~Part()
I20220928 16:28:45.654263 50978 Part.h:59] [Port: 9780, Space: 180, Part: 42] ~Part()
I20220928 16:28:45.739512 50976 Part.h:59] [Port: 9780, Space: 180, Part: 67] ~Part()
I20220928 16:28:46.082106 50975 Part.h:59] [Port: 9780, Space: 180, Part: 5] ~Part()
I20220928 16:28:46.104126 50976 Part.h:59] [Port: 9780, Space: 180, Part: 65] ~Part()
I20220928 16:28:46.119904 50978 Part.h:59] [Port: 9780, Space: 180, Part: 15] ~Part()
I20220928 16:28:46.240337 50975 Part.h:59] [Port: 9780, Space: 180, Part: 55] ~Part()
I20220928 16:28:46.533342 50977 Part.h:59] [Port: 9780, Space: 180, Part: 14] ~Part()
I20220928 16:28:46.552376 50978 Part.h:59] [Port: 9780, Space: 180, Part: 72] ~Part()
I20220928 16:28:46.713559 50977 Part.h:59] [Port: 9780, Space: 180, Part: 100] ~Part()
I20220928 16:28:46.795778 50976 Part.h:59] [Port: 9780, Space: 180, Part: 27] ~Part()
I20220928 16:28:46.807927 50975 Part.h:59] [Port: 9780, Space: 180, Part: 40] ~Part()
I20220928 16:28:46.867821 50978 Part.h:59] [Port: 9780, Space: 180, Part: 20] ~Part()
I20220928 16:28:46.931847 50977 Part.h:59] [Port: 9780, Space: 180, Part: 70] ~Part()
I20220928 16:28:47.180322 50975 Part.h:59] [Port: 9780, Space: 180, Part: 50] ~Part()
I20220928 16:28:47.252436 50975 Part.h:59] [Port: 9780, Space: 180, Part: 69] ~Part()
I20220928 16:28:47.268565 50975 Part.h:59] [Port: 9780, Space: 180, Part: 45] ~Part()
I20220928 16:28:47.391836 50975 Part.h:59] [Port: 9780, Space: 180, Part: 35] ~Part()
I20220928 16:28:47.495918 50978 Part.h:59] [Port: 9780, Space: 180, Part: 34] ~Part()
I20220928 16:28:47.579764 50976 Part.h:59] [Port: 9780, Space: 180, Part: 39] ~Part()
I20220928 16:28:47.728164 50977 Part.h:59] [Port: 9780, Space: 180, Part: 80] ~Part()
I20220928 16:28:47.796299 50978 Part.h:59] [Port: 9780, Space: 180, Part: 49] ~Part()
I20220928 16:28:47.852370 50975 Part.h:59] [Port: 9780, Space: 180, Part: 54] ~Part()
I20220928 16:28:47.883455 50977 Part.h:59] [Port: 9780, Space: 180, Part: 4] ~Part()
I20220928 16:28:47.895524 50978 Part.h:59] [Port: 9780, Space: 180, Part: 24] ~Part()
I20220928 16:28:47.898157 50976 Part.h:59] [Port: 9780, Space: 180, Part: 95] ~Part()
I20220928 16:28:47.927596 50978 Part.h:59] [Port: 9780, Space: 180, Part: 60] ~Part()
I20220928 16:28:47.928581 50977 Part.h:59] [Port: 9780, Space: 180, Part: 59] ~Part()
I20220928 16:28:47.944646 50977 Part.h:59] [Port: 9780, Space: 180, Part: 64] ~Part()
I20220928 16:28:47.960238 50976 Part.h:59] [Port: 9780, Space: 180, Part: 25] ~Part()
I20220928 16:28:47.984714 50977 Part.h:59] [Port: 9780, Space: 180, Part: 19] ~Part()
I20220928 16:28:47.991362 50978 Part.h:59] [Port: 9780, Space: 180, Part: 9] ~Part()
I20220928 16:28:48.075845 50975 Part.h:59] [Port: 9780, Space: 180, Part: 89] ~Part()
I20220928 16:28:48.119575 50976 Part.h:59] [Port: 9780, Space: 180, Part: 10] ~Part()
I20220928 16:28:48.128206 50975 Part.h:59] [Port: 9780, Space: 180, Part: 75] ~Part()
I20220928 16:28:48.243876 50976 Part.h:59] [Port: 9780, Space: 180, Part: 79] ~Part()
I20220928 16:28:48.248394 50977 Part.h:59] [Port: 9780, Space: 180, Part: 29] ~Part()
I20220928 16:28:48.263955 50976 Part.h:59] [Port: 9780, Space: 180, Part: 44] ~Part()
I20220928 16:28:48.631410 50976 Part.h:59] [Port: 9780, Space: 180, Part: 94] ~Part()
I20220928 16:28:48.712472 50975 Part.h:59] [Port: 9780, Space: 180, Part: 74] ~Part()
I20220928 16:28:48.819569 50978 Part.h:59] [Port: 9780, Space: 180, Part: 90] ~Part()
I20220928 16:28:49.620184 50977 Part.h:59] [Port: 9780, Space: 180, Part: 99] ~Part()
I20220928 16:28:49.904706 50977 Part.h:59] [Port: 9780, Space: 180, Part: 30] ~Part()
I20220928 16:28:50.164180 50976 Part.h:59] [Port: 9780, Space: 180, Part: 85] ~Part()
I20220928 16:29:23.400190 50974 RocksEngine.h:203] Release rocksdb on /data/mmdb/storage/nebula/180
I20220928 16:29:23.444912 50974 NebulaStore.cpp:717] Try to remove space directory: /data/mmdb/storage/nebula/180
I20220928 16:29:23.985895 50974 NebulaStore.cpp:719] Space directory removed: /data/mmdb/storage/nebula/180
I20220928 16:29:23.985921 50974 NebulaStore.cpp:535] Data space 180 has been removed!
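
A log like the one above can be checked mechanically rather than by eye. The sketch below parses the glog-style lines, collects the part IDs whose ~Part() destructor was logged for a given space, and measures the time between the last ~Part() and the "Release rocksdb" line. The message patterns are copied from this log excerpt; whether they are the same in other nebula versions is an assumption, and the function names are hypothetical.

```python
import re
from datetime import datetime

# glog layout: severity, yyyymmdd, time, thread id, file:line] message
GLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)
PART_DTOR_RE = re.compile(r"Space: (\d+), Part: (\d+)\] ~Part\(\)")


def parse_line(line):
    """Return (timestamp, thread_id, message), or None if the line is not glog-formatted."""
    m = GLOG_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m["date"] + " " + m["time"], "%Y%m%d %H:%M:%S.%f")
    return ts, int(m["tid"]), m["msg"]


def summarize_drop(lines, space_id):
    """Return (set of destroyed part IDs, seconds from last ~Part() to the rocksdb release)."""
    destroyed = set()
    last_dtor = None
    release = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _tid, msg = parsed
        m = PART_DTOR_RE.search(msg)
        if m and int(m.group(1)) == space_id:
            destroyed.add(int(m.group(2)))
            last_dtor = ts
        elif msg.startswith("Release rocksdb on ") and msg.endswith("/%d" % space_id):
            release = ts
    gap = None
    if release is not None and last_dtor is not None:
        gap = (release - last_dtor).total_seconds()
    return destroyed, gap
```

Comparing the returned part IDs with the parts this host is expected to hold shows whether every Part object was destroyed before the engine was released. The gap is only an observation and does not by itself identify the cause.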


Could you check the documentation? Is this parameter set to true or false on your side?

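It is set to true. There are three storage hosts with 3 replicas, and the other two storage hosts deleted the data and released the space normally. On this host the code path also removed the data directory, but the file descriptors for the sst files are still held and never released, so the corresponding disk space cannot be reclaimed.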

Is 50893 the storaged process? Restarting it should fix this.

Why do we need to restart?
Having to restart at every turn is miserable.

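A restart does solve it, but this is our production cluster and restarting affects the online service. We would like to find out what caused the fd leak.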

Judging from the current drop space code path, this should not happen. It looks like a bug.

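I'll look into it. My guess is that a rocksdb background thread was still using the sst files. Could you share a bit more information?

  1. Were any jobs or similar tasks running before the drop space?
  2. Roughly what fraction of the original sst files were not deleted? Were none of the sst files released?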

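1. No tasks were running. Only the drop space command was executed, and it returned success.
2. lsof -p StorageID | grep delete lists the unreleased sst files: all of the sst files of space ID 180, none of which were released.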

Have you ever taken a snapshot?

Please provide a few things for us to look at:

  1. The rocksdb log of this space: nebula/180/data/LOG
  2. Run lsof | grep on one of the sst files, note the tid, then look up the thread name
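
For step 2, once lsof has given a tid, the thread name can be read from /proc on Linux, where each thread exposes its name in /proc/<pid>/task/<tid>/comm. A minimal sketch follows; the function name and the proc_root parameter are hypothetical.

```python
import os


def thread_names(pid, proc_root="/proc"):
    """Map each thread id of `pid` to its name, read from task/<tid>/comm."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    names = {}
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "comm")) as f:
                names[int(tid)] = f.read().strip()
        except OSError:
            continue  # the thread exited while we were scanning
    return names
```

thread_names(storage_pid).get(tid) then returns the name of the thread that lsof reported, or None if that thread no longer exists.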

BTW, it should be possible to release the fd without restarting; you can search for how.

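No. This space holds a fairly large amount of data. About 10 minutes after starting storage I ran drop space, and did nothing else in between.

1. As described above, this space's directory has already been deleted, including rocksdb's sst and log files. Only the file descriptors were not released.
2. The process has already been restarted, so the tid can no longer be checked.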

Thanks for raising this. I filed an issue to track it: Drop space and the folder was removed but the disk space was not released · Issue #5157 · vesoft-inc/nebula · GitHub

We will reproduce it first and then fix it.
