nebula-exchange generates many small SST files when writing to HDFS

zhuliquan · July 8, 2022, 07:31

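I ran an import with batch set to 1024 and partition set to 10. Is there any way to make the SST files larger? Uploading small files is very slow, so most of the run time goes to uploading files and the overall job is very inefficient. A dataset with only 10,000 vertices takes more than 30 minutes.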

nicole · July 9, 2022, 13:27

Use the repartitionWithNebula option. With it enabled, the number of generated SST files equals the number of parts in the Nebula space.
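A rough way to see the effect on file count is sketched below in plain Scala (no Spark). It is only an illustration under stated assumptions: `nebulaPart` is a hypothetical stand-in for the real vid-to-part mapping, rows are assumed to be spread round-robin over Spark partitions, and each Spark partition is assumed to write one file per Nebula part it contains. The part count of 7 is also made up for the example.

```scala
object SstFileCountSketch {
  // Hypothetical stand-in for the vid -> part mapping (parts numbered from 1).
  def nebulaPart(vid: Long, numParts: Int): Int =
    (Math.floorMod(vid, numParts.toLong) + 1).toInt

  // Assumption: rows are spread round-robin over Spark partitions, and each
  // Spark partition writes one file per Nebula part that appears in it.
  def filesWithoutRepartition(vids: Seq[Long], sparkPartitions: Int, numParts: Int): Int =
    vids.zipWithIndex
      .map { case (vid, i) => (i % sparkPartitions, nebulaPart(vid, numParts)) }
      .distinct
      .size

  // Rows are regrouped first so that each group holds exactly one Nebula part.
  def filesWithRepartition(vids: Seq[Long], numParts: Int): Int =
    vids.map(nebulaPart(_, numParts)).distinct.size

  def main(args: Array[String]): Unit = {
    val vids = 1L to 10000L
    println(filesWithoutRepartition(vids, sparkPartitions = 10, numParts = 7))
    println(filesWithRepartition(vids, numParts = 7))
  }
}
```

Under these assumptions the first count grows with Spark partitions times Nebula parts, while the second is bounded by the number of Nebula parts, so each file holds correspondingly more data.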

zhuliquan · July 10, 2022, 16:17

I looked at the source code. With that option, the repartition count is set to the number of partitions in the graph space. Why is that?

vesoft-inc/nebula-exchange/blob/543659e2dda90ff40e74a9eb2af9ee36b2d156e2/nebula-exchange_spark_2.2/src/main/scala/com/vesoft/nebula/exchange/processor/VerticesProcessor.scala#L137

    
      
        iter.map { row =>
          encodeVertex(row, partitionNum, vidType, spaceVidLen, tagItem, fieldTypeMap)
        }
      }(Encoders.tuple(Encoders.BINARY, Encoders.BINARY, Encoders.BINARY))
      .flatMap(line => {
        List((line._1, emptyValue), (line._2, line._3))
      })(Encoders.tuple(Encoders.BINARY, Encoders.BINARY))

    // repartition dataframe according to nebula part, to make sure sst files for one part has no overlap
    if (tagConfig.repartitionWithNebula) {
      sstKeyValueData = customRepartition(spark, sstKeyValueData, partitionNum)
    }

    sstKeyValueData
      .toDF("key", "value")
      .sortWithinPartitions("key")
      .foreachPartition { iterator: Iterator[Row] =>
        val generateSstFile = new GenerateSstFile
        generateSstFile.writeSstFiles(iterator,
                                      fileBaseConfig,
                                      partitionNum,

Could it be set to an integer multiple of the graph space's partition count instead?

      if (tagConfig.repartitionWithNebula) {
        sstKeyValueData = customRepartition(spark, sstKeyValueData, partitionNum)
      }

nicole · August 7, 2022, 02:43

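Because this ensures that the SST data for one Nebula partition lands in exactly one SST file, which guarantees that there is no key overlap between different SST files. When the SST files are then ingested into the underlying RocksDB storage, they are placed directly at level L6 of the LSM tree (provided the database is empty).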
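The no-overlap argument can be checked with a small plain-Scala sketch. It assumes a simplified, hypothetical key `Key(part, vid)` whose ordering stands in for the byte-wise order of the real encoded keys, and it models each SST file only by the closed range between its smallest and largest key. It is not the Exchange implementation.

```scala
object SstOverlapSketch {
  // Simplified, hypothetical key: ordering by (part, vid) stands in for byte-wise key order.
  final case class Key(part: Int, vid: Long)

  implicit val keyOrdering: Ordering[Key] = Ordering.by((k: Key) => (k.part, k.vid))

  // Four parts with ten vertices each (made-up data).
  val keys: Seq[Key] = for (p <- 1 to 4; v <- 1L to 10L) yield Key(p, v)

  // Each file covers the closed range [min key, max key] of the rows assigned to it.
  def fileRanges(ks: Seq[Key], fileOf: Key => Int): Seq[(Key, Key)] =
    ks.groupBy(fileOf).values.map(g => (g.min, g.max)).toSeq

  // After sorting by range start, any overlap shows up between adjacent ranges.
  def hasOverlap(ranges: Seq[(Key, Key)]): Boolean = {
    val sorted = ranges.sortBy(_._1)
    sorted.zip(sorted.drop(1)).exists { case ((_, prevMax), (nextMin, _)) =>
      keyOrdering.gteq(prevMax, nextMin)
    }
  }

  def main(args: Array[String]): Unit = {
    // One file per part: ranges are disjoint.
    println(hasOverlap(fileRanges(keys, _.part)))
    // Two files per part, split by vid parity: ranges within a part interleave.
    println(hasOverlap(fileRanges(keys, k => k.part * 2 + (k.vid % 2).toInt)))
  }
}
```

In this model, one file per part gives disjoint ranges, while splitting each part across two files by a hash-like rule (here, vid parity) makes the two ranges of a part interleave. That suggests why using a plain multiple of the part count would lose the property described above, unless each part were split by key range rather than by hash.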

system · September 6, 2022, 02:43

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.