Nebula 2.5.0 SST import: error when connecting to Hive to generate SST files

Nebula version: 2.5.0
Deployment: distributed (RPM)
Production environment: yes
Hardware:
  Disk: 500 GB SSD
  CPU/Memory: 16 cores, 32 GB RAM
Problem description:
Nebula 2.5.0 SST import: error when connecting to Hive to generate SST files
tags: [
  # Similar to the sections above.
  # For a Hive source, the command in ${exec} is executed to produce the dataset.
  {
    name: Thing
    type: {
      source: hive
      sink: SST
    }
    exec: "select thing_id, thing_name, thing_title from oppo_kg_dw.dwd_kg_release_spo_thing_df_v3_4_ht_v6 where ds = '20211011' limit 300"
    fields: [thing_name, thing_title]
    nebula.fields: [Thing_name, Thing_title]
    vertex: {field: thing_id}
    header: true
    batch: 128
    partition: 24
  }
]
21/10/19 14:27:56 ERROR VerticesProcessor: java.lang.RuntimeException: Unsupported default value yet
java.lang.RuntimeException: Unsupported default value yet
at com.vesoft.nebula.encoder.RowWriterImpl.checkUnsetFields(RowWriterImpl.java:766)
at com.vesoft.nebula.encoder.RowWriterImpl.finish(RowWriterImpl.java:855)
at com.vesoft.nebula.encoder.NebulaCodecImpl.encode(NebulaCodecImpl.java:200)
at com.vesoft.nebula.encoder.NebulaCodecImpl.encodeTag(NebulaCodecImpl.java:157)
at com.vesoft.nebula.exchange.processor.VerticesProcessor$$anonfun$process$1$$anonfun$apply$1.apply(VerticesProcessor.scala:175)
at com.vesoft.nebula.exchange.processor.VerticesProcessor$$anonfun$process$1$$anonfun$apply$1.apply(VerticesProcessor.scala:127)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.sort_addToSorter_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at com.vesoft.nebula.exchange.processor.VerticesProcessor$$anonfun$process$2.apply(VerticesProcessor.scala:189)
at com.vesoft.nebula.exchange.processor.VerticesProcessor$$anonfun$process$2.apply(VerticesProcessor.scala:181)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

We have not implemented default-value handling in the encoder yet, so if your schema has a default value and the incoming data is empty, it throws this error.
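If the error is triggered by default values in the schema, one workaround (a sketch only; the tag and property names below are assumed from the config above, and the exact `ALTER TAG` syntax should be checked against the nGQL 2.x reference) is to redefine the affected properties without a `DEFAULT` clause and let them be nullable:

```ngql
-- Inspect the current schema for DEFAULT clauses
DESC TAG Thing;
-- Redefine one property without a default, allowing NULL
-- (repeat for each property that currently carries a default)
ALTER TAG Thing CHANGE (Thing_name string NULL);
```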

Of course the data contains empty values; we cannot guarantee that every field has a value. How should we handle that?

There is another error as well:
21/10/19 17:07:46 ERROR VerticesProcessor: java.lang.RuntimeException: Wrong strNum: 13
java.lang.RuntimeException: Wrong strNum: 13
at com.vesoft.nebula.encoder.RowWriterImpl.processOutOfSpace(RowWriterImpl.java:837)
at com.vesoft.nebula.encoder.RowWriterImpl.finish(RowWriterImpl.java:859)
at com.vesoft.nebula.encoder.NebulaCodecImpl.encode(NebulaCodecImpl.java:200)
at com.vesoft.nebula.encoder.NebulaCodecImpl.encodeTag(NebulaCodecImpl.java:157)
at com.vesoft.nebula.exchange.processor.VerticesProcessor$$anonfun$process$1$$anonfun$apply$1.apply(VerticesProcessor.scala:177)
at com.vesoft.nebula.exchange.processor.VerticesProcessor$$anonfun$process$1$$anonfun$apply$1.apply(VerticesProcessor.scala:127)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.sort_addToSorter_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at com.vesoft.nebula.exchange.processor.VerticesProcessor$$anonfun$process$2.apply(VerticesProcessor.scala:191)
at com.vesoft.nebula.exchange.processor.VerticesProcessor$$anonfun$process$2.apply(VerticesProcessor.scala:183)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

The data fields contain commas:
刘灿, 报社主编, , , , , 山木, 刘灿,笔名山木。山东新泰人。中共党员。1970年入伍。,
Can the import program parse this correctly? The log above suggests that parsing failed.

That error is about default values; that part is not supported yet.

If your schema has no default value and allows NULL, then empty values are fine.

Is your property type fixed_string, with values that exceed the fixed_string length?

Right, everything defaults to string for me; I did not set a length.

But previously, when I imported directly with Exchange, I did not set a length either and the import worked. Why does generating SST fail?

Please paste the output of `desc tag Thing`.

(刘灿,
报社主编,
,
,
,
,
山木,
刘灿,笔名山木。山东新泰人。中共党员。1970年入伍。,
,
,
,
,
,
60.42,
false,
baike.baidu.com,
刘灿(报社主编)_百度百科,
Person,
,
,
{"baike.baidu.com":"3662838"},

None of the fields in this row are very long. How large is string by default?

string has no length limit; only fixed_string has one.
PS: why did you set an extra default value of null? If you do not set a default, the value is _NULL_ anyway.
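The distinction can be illustrated with a schema sketch (hypothetical tag names; the exact behavior for over-length fixed_string values, truncation versus rejection, depends on the write path and should be verified against the docs):

```ngql
-- Variable-length string: no declared limit
CREATE TAG t_var (s string);
-- Fixed-length string: values longer than 8 bytes may be truncated or rejected
CREATE TAG t_fixed (f fixed_string(8));
```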

Your 1.0 and 2.0 return data inconsistently. Our business queries require a unified format, so we set everything to null.
Your 2.0 returns _NULL_ with extra underscores on both sides; given the inconsistency, we had no other option.

What causes the `Wrong strNum: 13` error? Is it a parse failure on empty data?

This issue should not be related to empty data. Your config file only lists those two properties; what are the values of those two properties in the data?

When I configure all the fields, I still get Wrong strNum: 13:
exec: "select thing_id, thing_name, thing_title, thing_namech, thing_nameen, thing_abbreviation, thing_tag, thing_alias, thing_abstract, thing_image, thing_video, thing_audio, thing_gmtcreated, thing_gmtmodified, thing_popularity, thing_prior, thing_datasource, thing_urls, thing_class, thing_imagejson, thing_embedding, thing_sourceids, thing_videocover from oppo_kg_dw.dwd_kg_release_spo_thing_df_v3_4_ht_v6 where ds = '20211011' limit 300"
fields: [thing_name, thing_title, thing_namech, thing_nameen, thing_abbreviation, thing_tag, thing_alias, thing_abstract, thing_image, thing_video, thing_audio, thing_gmtcreated, thing_gmtmodified, thing_popularity, thing_prior, thing_datasource, thing_urls, thing_class, thing_imagejson, thing_embedding, thing_sourceids, thing_videocover]
nebula.fields: [Thing_name, Thing_title, Thing_nameCh, Thing_nameEn, Thing_abbreviation, Thing_tag, Thing_alias, Thing_abstract, Thing_image, Thing_video, Thing_audio, Thing_gmtCreated, Thing_gmtModified, Thing_popularity, Thing_prior, Thing_dataSource, Thing_urls, Thing_class, Thing_imageJson, Thing_embedding, Thing_sourceIds, Thing_videoCover]
vertex: {field:thing_id}

It is this row of data:
(刘灿,
报社主编,
,
,
,
,
山木,
刘灿,笔名山木。山东新泰人。中共党员。1970年入伍。,
,
,
,
,
,
60.42,
false,
baike.baidu.com,
刘灿(报社主编)_百度百科,
Person,
,
,
{"baike.baidu.com":"3662838"},

This looks like a bug. Does any of your data contain null values?
See this PR: Bugfix/encoder row writer by MMyheart · Pull Request #366 · vesoft-inc/nebula-java · GitHub
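The PR above fixes null handling in the row writer. A rough illustration of why null values could produce `Wrong strNum` (this is a simplified sketch, not the actual `RowWriterImpl` code): the writer compares the number of string properties it actually serialized against the number the schema expects, and if null string values are skipped rather than counted, the totals disagree:

```python
def check_str_num(schema_str_fields, row):
    """Simplified model of the strNum consistency check.

    schema_str_fields: names of string-typed properties in the schema.
    row: mapping of property name -> value (None models a null value).
    A writer that skips nulls serializes fewer strings than the schema
    expects, which surfaces as a 'Wrong strNum' error.
    """
    # Count the string values a null-skipping writer would serialize.
    written = sum(1 for f in schema_str_fields if row.get(f) is not None)
    expected = len(schema_str_fields)
    if written != expected:
        raise RuntimeError(f"Wrong strNum: {written}")
    return written
```

Under this model, a row with any null string property trips the check, which matches the observation that the error disappears once the row writer counts nulls correctly.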
