NebulaGraph 2.0 disk usage issue

  • Nebula version: NebulaGraph 2.0 GA
  • Deployment: distributed
  • In production: N
  • Hardware info
    • Disk: HDD
      GH-204: 40 GB
      GH-205: 40 GB
      GH-206: 40 GB
    • CPU / memory info
  • Problem description:
    Imported data from Hive into Nebula with Exchange 2.0.
    The source data in Hive is about 4.2 GB, yet the three Nebula servers use roughly 90 GB of disk space in total.
    (Screenshots of disk usage on GH-204, GH-205, and GH-206 omitted.)
    Nebula space creation:
    CREATE SPACE neb(partition_num=15, replica_factor=1, vid_type=fixed_string(800));
    CREATE TAG entity(name string);

Exchange configuration:

{
  # Spark relation config
  spark: {
    app: {
      name: Nebula Exchange 2.0
    }

    driver: {
      cores: 1
      maxResultSize: 4G
      memory: 4G
    }

    executor: {
      memory: 6G
    }

    cores: {
      max: 10
    }
  }

  # if the hive is hive-on-spark with derby mode, you can ignore this hive configure
  # get the config values from file $HIVE_HOME/conf/hive-site.xml or hive-default.xml
    #hive: {
    #  waredir: "hdfs://act62.hadoop:8020/apps/hive/warehouse/"
    #  connectionURL: "jdbc:mysql://act62.hadoop:3306/hive?characterEncoding=UTF-8"
    #  connectionDriverName: "com.mysql.jdbc.Driver"
    #  connectionUserName: "hive"
    #  connectionPassword: "123456"
    #}

  # Nebula Graph relation config
  nebula: {
    address:{
      graph:["GH-204:9669","GH-205:9669","GH-206:9669"]
      meta:["GH-204:9559","GH-205:9559","GH-206:9559"]
    }
    user: user
    pswd: password
    space: neb

    # parameters for SST import, not required
    path:{
        local:"/tmp"
        remote:"/sst"
        hdfs.namenode: "hdfs://act62.hadoop:8020"
    }

    connection {
      timeout: 3000
      retry: 3
    }

    execution {
      retry: 3
    }

    error: {
      max: 32
      # failed import job will be recorded in output path
      output: /tmp/errors
    }

    rate: {
      limit: 1024
      timeout: 1000
    }
  }

  # Processing tags
  # There are tag config examples for different dataSources.
  tags: [

 
    # Hive
    {
      name: entity
      type: {
        source: hive
        sink: client
      }
      exec: "select vid,name from neb.entity"
      fields: [name]
      nebula.fields: [name]
      vertex: {
        field: vid
        # policy: "hash"
      }
      batch: 256
      partition: 32
    }

  ]

  # Processing edges
  # There are edge config examples for different dataSources.
 #edges: [
 # 
 #  # Hive
 #  {
 #    name: follow
 #    type: {
 #      source: hive
 #      sink: client
 #    }
 #    exec: "select degree,b_id,e_id from neb.follow"
 #    fields: [ degree]
 #    nebula.fields: [degree]
 #    source: b_id
 #    target: e_id
 #    batch: 256
 #    partition: 32
 #  }
 #]
}

First, check how many replicas your space has. If it uses multiple replicas, the stored data will naturally be larger than 4.2 GB × the replica factor.
Next, check how much each subdirectory under /usr/local/nebula occupies; the logs directory may also be taking a lot of space. If it is all in the data directory, run a compaction as described in the docs:
https://docs.nebula-graph.com.cn/2.0.1/3.ngql-guide/18.operation-and-maintenance-statements/4.job-statements/#submit_job_compact
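
For reference, a minimal nGQL sketch of these checks, run from the console; the space name neb is taken from the post, and <job_id> is a placeholder for the id returned by SUBMIT JOB:

    -- Confirm the replica factor and other space settings
    DESC SPACE neb;

    -- Switch to the space and trigger a manual compaction
    USE neb;
    SUBMIT JOB COMPACT;

    -- Check the compaction job's status using the returned job id
    SHOW JOB <job_id>;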

The VID may be wasting too much space, 800…
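
If the real VIDs are much shorter than 800 bytes, a smaller fixed_string length shrinks every vertex and edge key, since keys are sized by the declared VID length rather than the actual string length. vid_type cannot be changed on an existing space, so this is only a sketch of what a re-created space could look like; the 64 below is an illustrative value and must still cover the longest actual VID:

    -- Hypothetical space with a tighter fixed-length VID (64 is an assumed value)
    CREATE SPACE neb_small(partition_num=15, replica_factor=1, vid_type=fixed_string(64));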


fixed_string(800) should have some impact. What I observed: right after the import job finished, disk usage was very high (90 GB), but when I checked again a few days later it had dropped considerably (to around 10 GB; I confirmed the data can still be queried in Nebula). So my question is: after Exchange finishes, is Nebula still performing other disk operations?

Compaction — the automatic background compaction after the import is what reduced the disk usage over the following days.