nebula 2.0.0-rc1 storage fails to restart

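Question template:

  • nebula version: nebula 2.0rc1
  • Deployment (distributed / standalone / Docker / DBaaS): docker swarm
  • Hardware info
    • hdd

After docker swarm was restarted, storage never comes back up; it restarts every few minutes.
The storage log is below. After "Do custom minor compaction!" the process goes into stopping, and I don't know why.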

storage log

I0123 11:03:01.069562    69 EventListner.h:18] Rocksdb start compaction column family: default because of LevelMaxLevelSize, status: OK, compacted 1 files into 0, base level is 1, output level is 2
I0123 11:03:01.102402    69 EventListner.h:30] Rocksdb compaction completed column family: default because of LevelMaxLevelSize, status: OK, compacted 1 files into 1, base level is 1, output level is 2
I0123 11:03:01.111342    69 EventListner.h:18] Rocksdb start compaction column family: default because of LevelL0FilesNum, status: OK, compacted 4143 files into 0, base level is 0, output level is 1
I0123 11:03:01.118247    69 CompactionFilter.h:62] Do custom minor compaction!
I0123 11:04:23.432250     1 StorageDaemon.cpp:142] Signal 15(Terminated) received, stopping this server

Could you post your configuration?

Here is the storage configuration

storaged2:
    image: vesoft/nebula-storaged:v2.0.0-rc1
    env_file:
      - ./nebula.env
    command:
      - --meta_server_addrs=10.0.0.3:9559,10.0.0.3:9559,10.0.0.3:9559
      - --local_ip=10.0.0.3
      - --ws_ip=10.0.0.3
      - --port=9779
      - --data_path=/data1/storaged,/data2/storaged,/data3/storaged,/data4/storaged,/data5/storaged,/data6/storaged,/data7/storaged,/data8/storaged,/data9/storaged,/data10/storaged,/data11/storaged
      - --log_dir=/logs
      - --v=0
      - --minloglevel=0
      - --raft_heartbeat_interval_secs=60
      - --raft_rpc_timeout_ms=30000
      - --heartbeat_interval_secs=90
      - --rocksdb_block_cache=20480
      - --enable_rocksdb_prefix_filtering=true
    deploy:
      replicas: 1
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.hostname == node2
    depends_on:
      - metad0
      - metad1
      - metad2
    healthcheck:
      test: ["CMD", "curl", "-f", "http://10.0.0.3:19779/status"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 20s      
    ports:
      - target: 19779
        published: 19779
        protocol: tcp
        mode: host
      - target: 19780
        published: 19780
        protocol: tcp
        mode: host
      - target: 9779
        published: 9779
        protocol: tcp
        mode: host
    volumes:
      - /data1/nebula/data/storaged:/data1/storaged
      - /data2/nebula/data/storaged:/data2/storaged
      - /data3/nebula/data/storaged:/data3/storaged
      - /data4/nebula/data/storaged:/data4/storaged
      - /data5/nebula/data/storaged:/data5/storaged
      - /data6/nebula/data/storaged:/data6/storaged
      - /data7/nebula/data/storaged:/data7/storaged
      - /data8/nebula/data/storaged:/data8/storaged
      - /data9/nebula/data/storaged:/data9/storaged
      - /data10/nebula/data/storaged:/data10/storaged
      - /data11/nebula/data/storaged:/data11/storaged
      - /data/nebula/logs/storaged:/logs
    networks:
      - nebula-net

Try lowering rocksdb_block_cache.

Not enough memory. How much memory is the container configured with?

docker info reports 250.8GiB.
No other memory parameters are configured for the container.


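As I understand it, I have only one storaged here, with 11 rocksdb instances. Isn't this the rocksdb block cache, and isn't it shared across the rocksdb instances?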

OK, I'll try that. I'll switch to SSD later.

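I changed it to rocksdb_block_cache=2048. The storage memory usage then kept climbing to 30G+, and it died and restarted again. As I recall, memory usage was 60G+ before the restart.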

I0126 07:01:05.685215     1 NebulaStore.cpp:81] Scan path "/data4/storaged/1"
I0126 07:01:05.697016     1 RocksEngineConfig.cpp:244] Emplace rocksdb option max_bytes_for_level_base=268435456
I0126 07:01:05.697207     1 RocksEngineConfig.cpp:244] Emplace rocksdb option max_write_buffer_number=4
I0126 07:01:05.697407     1 RocksEngineConfig.cpp:244] Emplace rocksdb option write_buffer_size=67108864
I0126 07:01:05.697607     1 RocksEngineConfig.cpp:244] Emplace rocksdb option disable_auto_compactions=true
I0126 07:01:05.697825     1 RocksEngineConfig.cpp:244] Emplace rocksdb option block_size=8192
I0126 07:01:50.948037     1 StorageDaemon.cpp:142] Signal 15(Terminated) received, stopping this server
I0126 07:01:52.346240     1 RocksEngine.cpp:105] open rocksdb on /data4/storaged/nebula/1/data
I0126 07:01:52.358177     1 NebulaStore.cpp:81] Scan path "/data5/storaged/1"
I0126 07:01:52.358435     1 RocksEngineConfig.cpp:244] Emplace rocksdb option max_bytes_for_level_base=268435456
I0126 07:01:52.358605     1 RocksEngineConfig.cpp:244] Emplace rocksdb option max_write_buffer_number=4
I0126 07:01:52.358762     1 RocksEngineConfig.cpp:244] Emplace rocksdb option write_buffer_size=67108864

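Sorry, I need to correct a wrong answer I gave above: the rocksdb_block_cache mentioned earlier is shared among the multiple kvEngines in the same process, and its unit is MB.
After you lowered this setting, memory climbed to 30GB+ and the process died again; the log shows it received a kill signal at 07:01:50. What is the current ulimit -n on the system?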
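To put numbers on the settings discussed so far, here is a rough sketch that adds the shared block cache to the memtable upper bound, using the values from the posted config and the "Emplace rocksdb option" log lines. It assumes one rocksdb instance per data path, a single space, and one column family per instance; none of that is confirmed in the thread, and the helper names are made up for illustration.

```python
MIB = 1024 * 1024

# Values taken from the posted config and startup log.
WRITE_BUFFER_SIZE = 67108864      # bytes, write_buffer_size from the log
MAX_WRITE_BUFFER_NUMBER = 4       # max_write_buffer_number from the log
DATA_PATHS = 11                   # entries in --data_path

# Assumptions (not confirmed in the thread).
SPACES = 1                        # only space 1 shows up in the log
COLUMN_FAMILIES = 1               # only "default" shows up in the log


def memtable_upper_bound_mib(instances):
    """Upper bound on memtable memory if every write buffer were full."""
    total_bytes = (instances * COLUMN_FAMILIES
                   * MAX_WRITE_BUFFER_NUMBER * WRITE_BUFFER_SIZE)
    return total_bytes // MIB


def estimate_mib(block_cache_mb):
    """Block cache (MB, shared per process) plus the memtable upper bound."""
    return block_cache_mb + memtable_upper_bound_mib(DATA_PATHS * SPACES)


if __name__ == "__main__":
    for cache_mb in (20480, 2048, 100):   # the three values tried in this thread
        print(f"rocksdb_block_cache={cache_mb}: about {estimate_mib(cache_mb)} MiB")
```

Under these assumptions the memtables are bounded at 2816 MiB, so with rocksdb_block_cache=2048 the two settings together come to under 5 GiB. The estimate ignores anything outside those two settings, so it cannot by itself explain the 30G+ that was observed.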

Is there a core file? Could you take a look at the stack trace?

Set that partition bloomfilter parameter to true.

I configured it and the result is still the same.

With a docker swarm deployment, how do I get the core dump?

ulimit -n is 102400
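A value read from a host shell is not necessarily the limit the containerized process runs with. As a minimal sketch, the limits that actually apply to the storaged process can be read from /proc/<pid>/limits on the host, where the pid is the nebula-storaged process ID as seen from the host. The same file also shows the core file size limit, which matters for the core file question above (a process that exits on Signal 15 does not write a core file in any case). The sample text is made up for illustration and is not output from this cluster.

```python
import re


def parse_limits(text):
    """Parse the contents of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:            # skip the header row
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) >= 3:                      # name, soft limit, hard limit
            limits[fields[0]] = (fields[1], fields[2])
    return limits


def read_limits(pid):
    """Read the limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_limits(f.read())


# Hypothetical sample in the /proc/<pid>/limits layout.
SAMPLE = """\
Limit                     Soft Limit           Hard Limit           Units
Max core file size        0                    unlimited            bytes
Max open files            1024                 4096                 files
"""

if __name__ == "__main__":
    parsed = parse_limits(SAMPLE)
    print("open files:", parsed["Max open files"])
    print("core size: ", parsed["Max core file size"])
```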

Change 20480 to 100.

Still the same: memory grows to 10G+, then the process receives the termination signal and restarts.

How much memory did you give the container?


As shown in the screenshot, the container memory is not limited. Nothing shows up here because I have stopped the service.

So the memory it grows to is different on each restart? Yesterday's post said 30GB+, and today it's 10GB+?

"Signal 15(Terminated) received, stopping this server": where is this signal being sent from? Could you also post the complete INFO log?
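One candidate that can be ruled in or out from the posted config is the service's own healthcheck, assuming Swarm stops a task whose container turns unhealthy. The thread does not establish that this is where the signal comes from. The sketch below uses a simplified model of the Docker healthcheck (each probe starts one interval after the previous probe finishes, and failures inside start_period are not counted) to estimate how long storaged has to answer on /status before it is marked unhealthy. The function name and the probe_duration parameter are illustrative.

```python
def seconds_until_unhealthy(interval, timeout, retries, start_period,
                            probe_duration=0.0):
    """Time from container start until `retries` counted failures in a row,
    assuming every probe fails. Simplified model, not Docker's exact logic."""
    probe_duration = min(probe_duration, timeout)   # a probe is cut off at timeout
    elapsed = 0.0
    failures = 0
    while True:
        elapsed += interval            # wait one interval before the next probe
        started = elapsed
        elapsed += probe_duration      # the probe runs and fails
        if started >= start_period:    # failures inside start_period are ignored
            failures += 1
            if failures >= retries:
                return elapsed


if __name__ == "__main__":
    # Values from the healthcheck section of the posted config.
    fast = seconds_until_unhealthy(30, 10, 3, 20, probe_duration=0)
    slow = seconds_until_unhealthy(30, 10, 3, 20, probe_duration=10)
    print(f"unhealthy after roughly {fast:.0f}s to {slow:.0f}s")
```

With the posted values this gives a window of about 90 to 120 seconds. Comparing that with the gap between the first log line of a start attempt and the Signal 15 line would show whether the healthcheck is a plausible sender.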

If you test without any data, does it start normally? You said you would switch to SSD for testing; have you done that?

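What about your physical machine: have you watched its memory usage during startup? How many storaged instances is your swarm supposed to start? Three? Does each of them need that many resources?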