Exchange: build error when compiling nebula-exchange_spark_3.0

  • java version “21.0.2” 2024-01-16 LTS
  • Apache Maven 3.9.6
Running mvn clean package -Dmaven.test.skip=true -Dgpg.skip -Dmaven.javadoc.skip=true -pl nebula-exchange_spark_2.2 -am -Pscala-2.11 -Pspark-2.2 packages successfully.
Running mvn clean package -Dmaven.test.skip=true -Dgpg.skip -Dmaven.javadoc.skip=true -pl nebula-exchange_spark_3.0 -am -Pscala-2.12 -Pspark-3.0 fails:
Reactor Summary for nebula-exchange 3.0-SNAPSHOT:
[INFO] 
[INFO] nebula-exchange .................................... SUCCESS [  0.191 s]
[INFO] exchange-common .................................... FAILURE [  9.581 s]
[INFO] nebula-exchange_spark_3.0 .......................... SKIPPED

[ERROR] Failed to execute goal org.scala-tools:maven-scala-plugin:2.15.2:compile (scala-compile) on project exchange-common: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 1(Exit value: 1) -> [Help 1]

[ERROR] error:
[INFO]   bad constant pool index: 0 at pos: 48461
[INFO]      while compiling: <no file>
[INFO]         during phase: globalPhase=<no phase>, enteringPhase=<some phase>
[INFO]      library version: version 2.12.10
[INFO]     compiler version: version 2.12.10

[ERROR] error: scala.reflect.internal.FatalError:
[INFO]   bad constant pool index: 0 at pos: 48461
[INFO]      while compiling: <no file>
[INFO]         during phase: globalPhase=<no phase>, enteringPhase=<some phase>
[INFO]      library version: version 2.12.10
[INFO]     compiler version: version 2.12.10

While investigating this, I found that Spark 2.4 uses Scala 2.11 whereas Spark 3.0 uses Scala 2.12, and the maven-scala-plugin 2.15.2 invoked here only supports 2.11. Could this be the reason the Spark 3.0 build fails?

[DEBUG] cmd:  /home/iie4bu/jdk-21.0.2/bin/java -classpath /home/iie4bu/.m2/repository/org/scala-lang/scala-compiler/2.12.10/scala-compiler-2.12.10.jar:/home/iie4bu/.m2/repository/org/scala-lang/scala-library/2.12.10/scala-library-2.12.10.jar:/home/iie4bu/.m2/repository/org/scala-lang/scala-reflect/2.12.10/scala-reflect-2.12.10.jar:/home/iie4bu/.m2/repository/org/scala-lang/scala-library/2.12.0/scala-library-2.12.0.jar:/home/iie4bu/.m2/repository/org/scala-lang/modules/scala-xml_2.12/1.0.6/scala-xml_2.12-1.0.6.jar:/home/iie4bu/.m2/repository/org/scala-tools/maven-scala-plugin/2.15.2/maven-scala-plugin-2.15.2.jar -Xbootclasspath/a:/home/iie4bu/.m2/repository/org/scala-lang/scala-library/2.12.10/scala-library-2.12.10.jar -Xss4096K org_scala_tools_maven_executions.MainWithArgsInFile scala.tools.nsc.Main /tmp/scala-maven-11661124521732053355.args
[ERROR] error:
[INFO]   bad constant pool index: 0 at pos: 48461
[INFO]      while compiling: <no file>
[INFO]         during phase: globalPhase=<no phase>, enteringPhase=<some phase>
[INFO]      library version: version 2.12.10
[INFO]     compiler version: version 2.12.10

In the output above, that command runs the Scala compiler via Java; the Scala version is 2.12 and the classpath includes maven-scala-plugin 2.15.2. Is this the kind of error that causes? That command is the one that fails.


It looks like the Scala-Maven-plugin version needs to be changed.

In principle you shouldn't need to modify the pom; the GitHub repo compiles and packages a snapshot every day: snapshot · vesoft-inc/nebula-exchange@0784c0b · GitHub


I'm not sure, but after changing the version it compiled normally. One more question: for vertices, setting the policy to hash works fine, but for edges, inserting with policy set to hash fails, while it succeeds with hash disabled. Have you seen this before? The inserts run in two jobs; vertices work in both cases, but edges are only correct with hash disabled and insert nothing with hash enabled. This is on Spark 3.3.

What is the error message for the failed inserts? It will be printed in the logs; please paste it here.

The configuration in the conf file is as follows:

edges:
[
  {
      name: 人拥有身份证号
      data_source : wangx_test
      type: {
        source: hive
        sink: client
      }
      exec: "select ssn,name ,ssn as `身份证号`,  name as `姓名` ,email as `电子邮箱`,address as `地址` from bigData.fakedata20240304_test"
      fields: [电子邮箱,地址,身份证号,姓名]
      nebula.fields: [电子邮箱,地址,身份证号,姓名]
      source: {
        field: ssn
      policy:hash
      }
      target: {
        field: name
      policy:hash
      }
      batch: 256
      partition: 32
    }
    {
      name: 身份证关联人
      type: {
        source: hive
        sink: client
      }
      exec: "select ssn as `身份证号`,name as `姓名` from bigData.fakedata20240304_test"
      fields: [身份证号,姓名]
      nebula.fields: [身份证号,姓名]
      source: {
        field: 身份证号
      "prefix":"身份证"
      policy:hash
      }
      target: {
        field: 姓名
      "prefix":"姓名"
      policy:hash
      }
      batch: 256
      partition: 32
    }
  ]

The only difference between 人拥有身份证号 and 身份证关联人 is that 身份证关联人 adds a prefix. The log output from the run is below:

24/03/28 15:30:14 INFO Exchange$: >>>>> import for edge 身份证关联人, cost time: 17.60s
24/03/28 15:30:14 INFO Exchange$: >>>>> Client-Import: batchSuccess.身份证关联人: 0
24/03/28 15:30:14 INFO Exchange$: >>>>> Client-Import: recordSuccess.身份证关联人: 0
24/03/28 15:30:14 INFO Exchange$: >>>>> Client-Import: batchFailure.身份证关联人: 0
24/03/28 15:30:14 INFO Exchange$: >>>>> Client-Import: recordFailure.身份证关联人: 0

24/03/28 15:30:10 INFO Exchange$: >>>>> import for edge 人拥有身份证号, cost time: 14.51s
24/03/28 15:30:10 INFO Exchange$: >>>>> Client-Import: batchSuccess.人拥有身份证号: 7
24/03/28 15:30:10 INFO Exchange$: >>>>> Client-Import: recordSuccess.人拥有身份证号: 14
24/03/28 15:30:10 INFO Exchange$: >>>>> Client-Import: batchFailure.人拥有身份证号: 0
24/03/28 15:30:10 INFO Exchange$: >>>>> Client-Import: recordFailure.人拥有身份证号: 0


As you can see, the edge with the prefix imports 0 records; without the prefix the import is normal, so this is unrelated to hash.
Another question: is my way of handling query columns whose names differ from the schema definition correct? If I don't use as and put name in fields, it complains that NebulaGraph has no name property.

With the same configuration, vertices import fine; only edges produce no output once a prefix is added, regardless of whether the prefix is Chinese or English.

I looked at the code and the logic is the same in both cases; I haven't found the cause yet. The executors will have error logs with the write-failure error message; please paste that log.

In this log the failure counts are also 0. Does this query actually return any data?

select ssn as `身份证号`,name as `姓名` from bigData.fakedata20240304_test

The query does return results. As soon as the prefix is added there is no output, and removing it works again, which is strange. I also printed the edgeSet: it is empty whenever the prefix is set.

Are there really no error logs at all?

Keep only the 身份证关联人 edge in your Exchange config file, then run:
spark-submit --master xxx nebula-exchange_spark_3.0-3.0-SNAPSHOT.jar -c xxx.conf -D

This will show the source data; check whether it contains any rows.

24/03/28 17:27:53 INFO DAGScheduler: Job 1 finished: show at Exchange.scala:236, took 6.097156 s
24/03/28 17:27:53 INFO CodeGenerator: Code generated in 25.909857 ms
+---------+---------------------+
|身份证号 |姓名                 |
+---------+---------------------+
|887191301|Brian Nguyen         |
|735404485|Jeffrey Mcgee        |
|761846187|Michael Singh        |
|592745125|Gabriela Lawrence    |
|670533796|Matthew Hammond      |
|403585996|Jose Wade            |
|323878136|Emily Bennett        |
|538603116|Susan Thomas         |
|554585414|Mrs. Christina Thomas|
|388540069|Margaret Ruiz        |
|388540069|Margaret Ruiz        |
|388540069|Margaret Ruiz        |
|554585414|Mrs. Christina Thomas|
|887191301|Brian Nguyen         |
+---------+---------------------+

24/03/28 17:27:53 INFO SparkContext: Successfully stopped SparkContext
24/03/28 17:27:53 INFO Exchange$: 
>>>>>> exchange job finished, cost 16.57s 
>>>>>> total client batchSuccess:0 
>>>>>> total client recordsSuccess:0 
>>>>>> total client batchFailure:0 
>>>>>> total client recordsFailure:0 
>>>>>> total SST failure:0 
>>>>>> total SST Success:0
24/03/28 17:27:53 INFO Exchange$: >>>>>> exchange import qps: 0.00/s
24/03/28 17:27:53 INFO ShutdownHookManager: Shutdown hook called

If the prefix is removed, the import works and the data goes in; hash is also fine.

It looks as if configuring the prefix blocks the output of some step. I recompiled the Exchange Spark 3.0 code to print the prefix and policy at each step; without the prefix set, it prints normally:

SourcePrefix: null, TargetPrefix: null, SourcePolicy: hash, TargetPolicy: hash

After setting the prefix, nothing is printed either, yet the Spark job still reports success:

24/03/28 19:21:39 INFO SparkContext: Successfully stopped SparkContext
24/03/28 19:21:39 INFO Exchange$: 
>>>>>> exchange job finished, cost 13.17s 
>>>>>> total client batchSuccess:0 
>>>>>> total client recordsSuccess:0 
>>>>>> total client batchFailure:0 
>>>>>> total client recordsFailure:0 
>>>>>> total SST failure:0 
>>>>>> total SST Success:0
24/03/28 19:21:39 INFO Exchange$: >>>>>> exchange import qps: 0.00/s
24/03/28 19:21:39 INFO ShutdownHookManager: Shutdown hook called

24/03/29 17:12:00 INFO Configs$: Source Config Hive source exec: select ssn as `身份证号`,name as `姓名` from bigData.fakedata20240304_test
24/03/29 17:12:00 INFO Configs$: Sink Config Hive source exec: select ssn as `身份证号`,name as `姓名` from bigData.fakedata20240304_test
24/03/29 17:12:00 INFO Configs$: Edge Config: Edge name: 身份证关联人, source: Hive source exec: select ssn as `身份证号`,name as `姓名` from bigData.fakedata20240304_test, sink: Nebula sink addresses: [10.26.120.55:9669, 10.26.120.53:9669], writeMode: insert, source field: 身份证号, source policy: Some(hash), ranking: None, target field: 姓名, target policy: Some(hash), batch: 256, partition: 32, ignoreIndex: false, srcVertexUdf: NonedstVertexUdf: None.
24/03/29 17:12:00 INFO Exchange$: >>>>> Config Configs(DataBaseConfigEntry:{graphAddress:List(10.26.120.55:9669, 10.26.120.53:9669), space:test汉语, metaAddress:List(10.26.120.55:9559, 10.26.120.53:9559)},UserConfigEntry{user:root, password:xxxxx},cConnectionConfigEntry:{timeout:3000, retry:3},ExecutionConfigEntry:{timeout:2147483647, retry:3},ErrorConfigEntry:{errorPath:file:///tmp/errors, errorMaxSize:32},RateConfigEntry:{limit:1024, timeout:1000},SslConfigEntry:{enableGraph:false, enableMeta:false, signType:ca},,List(),List(Edge name: 身份证关联人, source: Hive source exec: select ssn as `身份证号`,name as `姓名` from bigData.fakedata20240304_test, sink: Nebula sink addresses: [10.26.120.55:9669, 10.26.120.53:9669], writeMode: insert, source field: 身份证号, source policy: Some(hash), ranking: None, target field: 姓名, target policy: Some(hash), batch: 256, partition: 32, ignoreIndex: false, srcVertexUdf: NonedstVertexUdf: None.),None)
24/03/29 17:12:00 INFO Exchange$: >>>>> you don't com.vesoft.exchange.common.config hive source, so using hive tied with spark.

From this log, the prefix setting does not appear to be recognized at all. The Exchange version used is nebula-exchange_spark_3.0-3.7.0.jar.
The UDF functionality works correctly; only setting the prefix produces no output.

I see the problem now: prefix only applies to string-typed VIDs. If your space's vid_type is int, you cannot use prefix, because the prefix is applied to the final ID.
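
To illustrate the point, here is a hypothetical sketch of why a prefix forces a string VID; the exact separator Exchange uses when joining the prefix and the value is an assumption here, not taken from this thread:

// Hypothetical illustration only: the prefix is concatenated in front of the field value.
val field  = "887191301"              // value read from the 身份证号 column
val prefix = "身份证"                  // prefix from the edge's source config
val vid    = s"${prefix}_${field}"    // e.g. "身份证_887191301" -- always a string
// A VID like this can only be written when the space was created with
// vid_type = FIXED_STRING(n); with vid_type = INT64 the insert cannot succeed.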


Hello, that solved the problem nicely. One more question: suppose a column can contain multiple values separated by commas, e.g. a phone number column with the value 13788768898,13455334456, and the number of values varies (in practice it may be dozens). Clearly each phone number should be its own vertex; they only ended up in one column for statistical reasons, and flattening this in Hive would add quite a lot of redundancy. Does Exchange provide a ready-made way to import data like this, or do I need to modify Exchange myself?

For this case we recommend using the Spark Connector: read the Hive data into a Spark DataFrame, do your custom preprocessing on that column, and then call the connector's write API to write to NebulaGraph.
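
A minimal sketch of that approach, assuming a hypothetical phone_numbers column with comma-separated values; the tag name, column names, and connector options below are placeholders, and the write API should be checked against the nebula-spark-connector version you use:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split, trim}
import com.vesoft.nebula.connector.{NebulaConnectionConfig, WriteNebulaVertexConfig}
import com.vesoft.nebula.connector.connector.NebulaDataFrameWriter

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Split the comma-separated column so that each phone number becomes its own row.
val phones = spark.sql("select phone_numbers from bigData.fakedata20240304_test")
  .withColumn("phone", explode(split(col("phone_numbers"), ",")))
  .withColumn("phone", trim(col("phone")))
  .select("phone")
  .distinct()

// Write each phone number as a vertex through the Spark Connector
// (addresses taken from the logs above; space and tag names are placeholders).
val connConfig = NebulaConnectionConfig.builder()
  .withMetaAddress("10.26.120.55:9559")
  .withGraphAddress("10.26.120.55:9669")
  .build()
val writeConfig = WriteNebulaVertexConfig.builder()
  .withSpace("test汉语")
  .withTag("电话号码")
  .withVidField("phone")
  .withBatch(256)
  .build()
phones.write.nebula(connConfig, writeConfig).writeVertices()

The edges between a person and each of their phone numbers could then be written in a similar way using the connector's edge write config, if needed.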
