Nebula 2.5.0 SST import: error when generating SST files from a Hive source

zhengshuai1030 · 2021-10-19 09:52

The data contains both nulls and empty strings.

zhengshuai1030 · 2021-10-19 09:57

Can the fix for this issue be committed to the client's master branch? As far as I can see, master has not been changed yet.

nicole · 2021-10-20 03:03

The PR with the fix on the client has not been merged yet. Once it is merged, you can use the snapshot version of Exchange.

zhengshuai1030 · 2021-10-20 03:23

OK, but I couldn't wait. I made the change and built the package myself. Thanks!

zhengshuai1030 · 2021-10-20 06:37

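One more question: does SST generation process tags one at a time, or are multiple tags processed in parallel? Right now, when I run multiple tags, the Spark job hangs and makes no progress.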

zhengshuai1030 · 2021-10-20 07:29

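While testing SST generation, I found that the files are first written to the Spark temporary directory and then uploaded to HDFS. With a large amount of data, the Spark temporary disk runs out of space and SST generation fails. Can the SST files be written directly to HDFS? Is it really necessary to generate the SST file locally first and upload it afterwards?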

nicole · 2021-10-20 10:09

Tags are processed one at a time. Within each tag, the data is processed concurrently.

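Because data has to be written to the same file multiple times, the SST files cannot be written directly to HDFS. You can change the local path in the configuration file so that the temporary disk is not used.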
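Before pointing the local path at a different directory, it may help to confirm that the directory actually has enough free space. The following is a minimal sketch using only the JDK; the object and method names (`LocalPathCheck`, `usableBytes`, `hasRoom`) are hypothetical and are not part of Exchange.

```scala
import java.io.File

object LocalPathCheck {
  // Usable bytes on the volume that holds `path`.
  // The JDK returns 0 when the path does not name a partition.
  def usableBytes(path: String): Long = new File(path).getUsableSpace

  // True when the volume reports at least `requiredBytes` of usable space.
  def hasRoom(path: String, requiredBytes: Long): Boolean =
    usableBytes(path) >= requiredBytes

  def main(args: Array[String]): Unit = {
    // Hypothetical default: fall back to the JVM temp directory when no path is given.
    val candidate = args.headOption.getOrElse(System.getProperty("java.io.tmpdir"))
    val gib = usableBytes(candidate) / (1024L * 1024 * 1024)
    println(s"$candidate: about $gib GiB usable")
  }
}
```

Since the SST files are written inside `foreachPartition`, the check presumably needs to run on the nodes that host the executors, not only on the machine that submits the job.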

zhengshuai1030 · 2021-10-21 01:32

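There is another problem now. One tag has 100 million rows, and the job hangs while processing it. I can see that there is only a single task; the work is not split across multiple tasks. I added a repartition, and that did not help either.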

(Encoders.tuple(Encoders.BINARY, Encoders.BINARY))
  .toDF("key", "value")
  .repartition(100) // the repartition I added; it does not help
  .sortWithinPartitions("key") // execution stalls at this operator for a long time
  .foreachPartition { iterator: Iterator[Row] =>
    val taskID = TaskContext.get().taskAttemptId()
    var writer: NebulaSSTWriter = null
    var currentPart = -1
    val localPath = fileBaseConfig.localPath
    val remotePath = fileBaseConfig.remotePath

    try {
      iterator.foreach { vertex =>
        val key   = vertex.getAs[Array[Byte]](0)
        val value = vertex.getAs[Array[Byte]](1)
        var part = ByteBuffer
          .wrap(key, 0, 4)
          .order(ByteOrder.nativeOrder)
          .getInt >> 8
        if (part <= 0) {
          part = part + partitionNum
        }
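For readers following the excerpt, the last few lines derive a partition ID from the first four bytes of each encoded key. The sketch below isolates that step so it can be run without Spark. It assumes the key header stores `(partId << 8) | keyType` in little-endian order, which is an assumption about the key layout and not something stated in this thread; the names `SstPartId`, `header`, and `partOf` are hypothetical.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object SstPartId {
  // Hypothetical helper: builds only the 4-byte key header, assuming the
  // layout (partId << 8) | keyType stored in little-endian order.
  def header(partId: Int, keyType: Int): Array[Byte] =
    ByteBuffer
      .allocate(4)
      .order(ByteOrder.LITTLE_ENDIAN)
      .putInt((partId << 8) | (keyType & 0xff))
      .array()

  // Mirrors the excerpt: read the first 4 bytes as an Int, shift out the
  // low byte, and add partitionNum when the result is not positive.
  def partOf(key: Array[Byte], partitionNum: Int): Int = {
    var part = ByteBuffer
      .wrap(key, 0, 4)
      .order(ByteOrder.LITTLE_ENDIAN) // the excerpt uses ByteOrder.nativeOrder instead
      .getInt >> 8
    if (part <= 0) part = part + partitionNum
    part
  }

  def main(args: Array[String]): Unit = {
    val key = header(partId = 5, keyType = 1)
    println(partOf(key, partitionNum = 10))
  }
}
```

The sketch fixes the byte order to little-endian so that the result does not depend on the machine it runs on; the excerpt relies on the native byte order of the executor host.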

nicole · 2021-10-21 02:10

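How many executors did you allocate when you submitted the job? You don't need to add another repartition in the code, because after the Hive data is read it is repartitioned according to the partition number set in the configuration file.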

zhengshuai1030 · 2021-10-21 02:18

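I did configure multiple executors, but it had no effect, which is why I added the repartition. Even with the repartition, it still fails once the data gets large. It's strange; could you take a look?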
${SPARK_HOME}/bin/spark-submit \
--queue root.ipd.daily \
--name "nebula-import-sst" \
--master yarn \
--driver-cores 16 \
--driver-memory 48g \
--executor-memory 48g \
--deploy-mode cluster \
--num-executors 48 \
--executor-cores 16 \
--conf spark.port.maxRetries=1 \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.executor.memoryOverhead=8g \
--conf spark.driver.memoryOverhead=8g \
--conf spark.hadoop.fs.defaultFS="$ALG_HDFS" \
--conf spark.default.parallelism=48 \
--conf spark.executor.extraJavaOptions="-XX:MaxDirectMemorySize=7372m" \
--files "$conf" \
--class com.vesoft.nebula.exchange.Exchange \
lib/nebula-exchange-2.5-SNAPSHOT.jar -c $conf -h -d

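.sortWithinPartitions("key") // Execution stalls at this operator for a long time. This method does not seem to make use of the partitions. Oddly, it hangs with a large amount of data but not with a smaller amount.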