Exchange: build error when compiling nebula-exchange_spark_3.0

  • java version “21.0.2” 2024-01-16 LTS
  • Apache Maven 3.9.6
Running mvn clean package -Dmaven.test.skip=true -Dgpg.skip -Dmaven.javadoc.skip=true -pl nebula-exchange_spark_2.2 -am -Pscala-2.11 -Pspark-2.2 packages successfully.
Running mvn clean package -Dmaven.test.skip=true -Dgpg.skip -Dmaven.javadoc.skip=true -pl nebula-exchange_spark_3.0 -am -Pscala-2.12 -Pspark-3.0 fails:
Reactor Summary for nebula-exchange 3.0-SNAPSHOT:
[INFO] 
[INFO] nebula-exchange .................................... SUCCESS [  0.191 s]
[INFO] exchange-common .................................... FAILURE [  9.581 s]
[INFO] nebula-exchange_spark_3.0 .......................... SKIPPED

[ERROR] Failed to execute goal org.scala-tools:maven-scala-plugin:2.15.2:compile (scala-compile) on project exchange-common: wrap: org.apache.commons.exec.ExecuteException: Process exited with an error: 1(Exit value: 1) -> [Help 1]

[ERROR] error:
[INFO]   bad constant pool index: 0 at pos: 48461
[INFO]      while compiling: <no file>
[INFO]         during phase: globalPhase=<no phase>, enteringPhase=<some phase>
[INFO]      library version: version 2.12.10
[INFO]     compiler version: version 2.12.10

[ERROR] error: scala.reflect.internal.FatalError:
[INFO]   bad constant pool index: 0 at pos: 48461
[INFO]      while compiling: <no file>
[INFO]         during phase: globalPhase=<no phase>, enteringPhase=<some phase>
[INFO]      library version: version 2.12.10
[INFO]     compiler version: version 2.12.10

While investigating this, I found that Spark 2.4 uses Scala 2.11 whereas Spark 3.0 uses Scala 2.12, and the maven-scala-plugin 2.15.2 invoked here only supports 2.11. Could this be the reason the Spark 3.0 build fails?

[DEBUG] cmd:  /home/iie4bu/jdk-21.0.2/bin/java -classpath /home/iie4bu/.m2/repository/org/scala-lang/scala-compiler/2.12.10/scala-compiler-2.12.10.jar:/home/iie4bu/.m2/repository/org/scala-lang/scala-library/2.12.10/scala-library-2.12.10.jar:/home/iie4bu/.m2/repository/org/scala-lang/scala-reflect/2.12.10/scala-reflect-2.12.10.jar:/home/iie4bu/.m2/repository/org/scala-lang/scala-library/2.12.0/scala-library-2.12.0.jar:/home/iie4bu/.m2/repository/org/scala-lang/modules/scala-xml_2.12/1.0.6/scala-xml_2.12-1.0.6.jar:/home/iie4bu/.m2/repository/org/scala-tools/maven-scala-plugin/2.15.2/maven-scala-plugin-2.15.2.jar -Xbootclasspath/a:/home/iie4bu/.m2/repository/org/scala-lang/scala-library/2.12.10/scala-library-2.12.10.jar -Xss4096K org_scala_tools_maven_executions.MainWithArgsInFile scala.tools.nsc.Main /tmp/scala-maven-11661124521732053355.args
[ERROR] error:
[INFO]   bad constant pool index: 0 at pos: 48461
[INFO]      while compiling: <no file>
[INFO]         during phase: globalPhase=<no phase>, enteringPhase=<some phase>
[INFO]      library version: version 2.12.10
[INFO]     compiler version: version 2.12.10

In the output above, that command runs the Scala compiler via Java; the Scala version is 2.12 and the classpath includes maven-scala-plugin 2.15.2. Is this the kind of error that causes? That command is the one that fails.


It looks like the Scala-Maven-plugin version needs to be changed.

In principle you shouldn't need to modify the pom; the GitHub repo compiles and packages a snapshot every day: snapshot · vesoft-inc/nebula-exchange@0784c0b · GitHub


I'm not sure, but after changing the version it compiled normally. One more question: for vertices, setting the policy to hash works fine, but for edges, inserting with policy set to hash fails, while it succeeds with hash disabled. Have you seen this before? The inserts run in two jobs; vertices work in both cases, but edges are only correct with hash disabled and insert nothing with hash enabled. This is on Spark 3.3.

What is the error message for the failed inserts? It will be printed in the logs; please paste it here.

The configuration in the conf file is as follows:

edges:
[
  {
      name: 人拥有身份证号
      data_source : wangx_test
      type: {
        source: hive
        sink: client
      }
      exec: "select ssn,name ,ssn as `身份证号`,  name as `姓名` ,email as `电子邮箱`,address as `地址` from bigData.fakedata20240304_test"
      fields: [电子邮箱,地址,身份证号,姓名]
      nebula.fields: [电子邮箱,地址,身份证号,姓名]
      source: {
        field: ssn
      policy:hash
      }
      target: {
        field: name
      policy:hash
      }
      batch: 256
      partition: 32
    }
    {
      name: 身份证关联人
      type: {
        source: hive
        sink: client
      }
      exec: "select ssn as `身份证号`,name as `姓名` from bigData.fakedata20240304_test"
      fields: [身份证号,姓名]
      nebula.fields: [身份证号,姓名]
      source: {
        field: 身份证号
      "prefix":"身份证"
      policy:hash
      }
      target: {
        field: 姓名
      "prefix":"姓名"
      policy:hash
      }
      batch: 256
      partition: 32
    }
  ]

The only difference between 人拥有身份证号 and 身份证关联人 is that 身份证关联人 adds a prefix. The log output from the run is below:

24/03/28 15:30:14 INFO Exchange$: >>>>> import for edge 身份证关联人, cost time: 17.60s
24/03/28 15:30:14 INFO Exchange$: >>>>> Client-Import: batchSuccess.身份证关联人: 0
24/03/28 15:30:14 INFO Exchange$: >>>>> Client-Import: recordSuccess.身份证关联人: 0
24/03/28 15:30:14 INFO Exchange$: >>>>> Client-Import: batchFailure.身份证关联人: 0
24/03/28 15:30:14 INFO Exchange$: >>>>> Client-Import: recordFailure.身份证关联人: 0

24/03/28 15:30:10 INFO Exchange$: >>>>> import for edge 人拥有身份证号, cost time: 14.51s
24/03/28 15:30:10 INFO Exchange$: >>>>> Client-Import: batchSuccess.人拥有身份证号: 7
24/03/28 15:30:10 INFO Exchange$: >>>>> Client-Import: recordSuccess.人拥有身份证号: 14
24/03/28 15:30:10 INFO Exchange$: >>>>> Client-Import: batchFailure.人拥有身份证号: 0
24/03/28 15:30:10 INFO Exchange$: >>>>> Client-Import: recordFailure.人拥有身份证号: 0


As you can see, the edge with the prefix imports 0 records; without the prefix the import is normal, so this is unrelated to hash.
Another question: is my way of handling query columns whose names differ from the schema definition correct? If I don't use as and put name in fields, it complains that NebulaGraph has no name property.

With the same configuration, vertices import fine; only edges produce no output once a prefix is added, regardless of whether the prefix is Chinese or English.

I looked at the code and the logic is the same in both cases; I haven't found the cause yet. The executors will have error logs with the write-failure error message; please paste that log.

In this log the failure counts are also 0. Does this query actually return any data?

select ssn as `身份证号`,name as `姓名` from bigData.fakedata20240304_test

The query does return results. As soon as the prefix is added there is no output, and removing it works again, which is strange. I also printed the edgeSet: it is empty whenever the prefix is set.

Are there really no error logs at all?

Keep only the 身份证关联人 edge in your Exchange config file, then run:
spark-submit --master xxx nebula-exchange_spark_3.0-3.0-SNAPSHOT.jar -c xxx.conf -D

This will show the source data; check whether it contains any rows.

24/03/28 17:27:53 INFO DAGScheduler: Job 1 finished: show at Exchange.scala:236, took 6.097156 s
24/03/28 17:27:53 INFO CodeGenerator: Code generated in 25.909857 ms
+---------+---------------------+
|身份证号 |姓名                 |
+---------+---------------------+
|887191301|Brian Nguyen         |
|735404485|Jeffrey Mcgee        |
|761846187|Michael Singh        |
|592745125|Gabriela Lawrence    |
|670533796|Matthew Hammond      |
|403585996|Jose Wade            |
|323878136|Emily Bennett        |
|538603116|Susan Thomas         |
|554585414|Mrs. Christina Thomas|
|388540069|Margaret Ruiz        |
|388540069|Margaret Ruiz        |
|388540069|Margaret Ruiz        |
|554585414|Mrs. Christina Thomas|
|887191301|Brian Nguyen         |
+---------+---------------------+

24/03/28 17:27:53 INFO SparkContext: Successfully stopped SparkContext
24/03/28 17:27:53 INFO Exchange$: 
>>>>>> exchange job finished, cost 16.57s 
>>>>>> total client batchSuccess:0 
>>>>>> total client recordsSuccess:0 
>>>>>> total client batchFailure:0 
>>>>>> total client recordsFailure:0 
>>>>>> total SST failure:0 
>>>>>> total SST Success:0
24/03/28 17:27:53 INFO Exchange$: >>>>>> exchange import qps: 0.00/s
24/03/28 17:27:53 INFO ShutdownHookManager: Shutdown hook called

If the prefix is removed, the import works and the data goes in; hash is also fine.

It looks as if configuring the prefix blocks the output of some step. I recompiled the Exchange Spark 3.0 code to print the prefix and policy at each step; without the prefix set, it prints normally:

SourcePrefix: null, TargetPrefix: null, SourcePolicy: hash, TargetPolicy: hash

After setting the prefix, nothing is printed either, yet the Spark job still reports success:

24/03/28 19:21:39 INFO SparkContext: Successfully stopped SparkContext
24/03/28 19:21:39 INFO Exchange$: 
>>>>>> exchange job finished, cost 13.17s 
>>>>>> total client batchSuccess:0 
>>>>>> total client recordsSuccess:0 
>>>>>> total client batchFailure:0 
>>>>>> total client recordsFailure:0 
>>>>>> total SST failure:0 
>>>>>> total SST Success:0
24/03/28 19:21:39 INFO Exchange$: >>>>>> exchange import qps: 0.00/s
24/03/28 19:21:39 INFO ShutdownHookManager: Shutdown hook called

24/03/29 17:12:00 INFO Configs$: Source Config Hive source exec: select ssn as `身份证号`,name as `姓名` from bigData.fakedata20240304_test
24/03/29 17:12:00 INFO Configs$: Sink Config Hive source exec: select ssn as `身份证号`,name as `姓名` from bigData.fakedata20240304_test
24/03/29 17:12:00 INFO Configs$: Edge Config: Edge name: 身份证关联人, source: Hive source exec: select ssn as `身份证号`,name as `姓名` from bigData.fakedata20240304_test, sink: Nebula sink addresses: [10.26.120.55:9669, 10.26.120.53:9669], writeMode: insert, source field: 身份证号, source policy: Some(hash), ranking: None, target field: 姓名, target policy: Some(hash), batch: 256, partition: 32, ignoreIndex: false, srcVertexUdf: NonedstVertexUdf: None.
24/03/29 17:12:00 INFO Exchange$: >>>>> Config Configs(DataBaseConfigEntry:{graphAddress:List(10.26.120.55:9669, 10.26.120.53:9669), space:test汉语, metaAddress:List(10.26.120.55:9559, 10.26.120.53:9559)},UserConfigEntry{user:root, password:xxxxx},cConnectionConfigEntry:{timeout:3000, retry:3},ExecutionConfigEntry:{timeout:2147483647, retry:3},ErrorConfigEntry:{errorPath:file:///tmp/errors, errorMaxSize:32},RateConfigEntry:{limit:1024, timeout:1000},SslConfigEntry:{enableGraph:false, enableMeta:false, signType:ca},,List(),List(Edge name: 身份证关联人, source: Hive source exec: select ssn as `身份证号`,name as `姓名` from bigData.fakedata20240304_test, sink: Nebula sink addresses: [10.26.120.55:9669, 10.26.120.53:9669], writeMode: insert, source field: 身份证号, source policy: Some(hash), ranking: None, target field: 姓名, target policy: Some(hash), batch: 256, partition: 32, ignoreIndex: false, srcVertexUdf: NonedstVertexUdf: None.),None)
24/03/29 17:12:00 INFO Exchange$: >>>>> you don't com.vesoft.exchange.common.config hive source, so using hive tied with spark.

From this log, the prefix setting does not appear to be recognized at all. The Exchange version used is nebula-exchange_spark_3.0-3.7.0.jar.
The UDF functionality works correctly; only setting the prefix produces no output.

I see the problem now: prefix only applies to string-typed VIDs. If your space's vid_type is int, you cannot use prefix, because the prefix is applied to the final ID.
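
To illustrate the point, here is a hypothetical sketch of why a prefix forces a string VID; the exact separator Exchange uses when joining the prefix and the value is an assumption here, not taken from this thread:

// Hypothetical illustration only: the prefix is concatenated in front of the field value.
val field  = "887191301"              // value read from the 身份证号 column
val prefix = "身份证"                  // prefix from the edge's source config
val vid    = s"${prefix}_${field}"    // e.g. "身份证_887191301" -- always a string
// A VID like this can only be written when the space was created with
// vid_type = FIXED_STRING(n); with vid_type = INT64 the insert cannot succeed.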


Hello, that solved the problem nicely. One more question: suppose a column can contain multiple values separated by commas, e.g. a phone number column with the value 13788768898,13455334456, and the number of values varies (in practice it may be dozens). Clearly each phone number should be its own vertex; they only ended up in one column for statistical reasons, and flattening this in Hive would add quite a lot of redundancy. Does Exchange provide a ready-made way to import data like this, or do I need to modify Exchange myself?

For this case we recommend using the Spark Connector: read the Hive data into a Spark DataFrame, do your custom preprocessing on that column, and then call the connector's write API to write to NebulaGraph.
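
A minimal sketch of that approach, assuming a hypothetical phone_numbers column with comma-separated values; the tag name, column names, and connector options below are placeholders, and the write API should be checked against the nebula-spark-connector version you use:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, split, trim}
import com.vesoft.nebula.connector.{NebulaConnectionConfig, WriteNebulaVertexConfig}
import com.vesoft.nebula.connector.connector.NebulaDataFrameWriter

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Split the comma-separated column so that each phone number becomes its own row.
val phones = spark.sql("select phone_numbers from bigData.fakedata20240304_test")
  .withColumn("phone", explode(split(col("phone_numbers"), ",")))
  .withColumn("phone", trim(col("phone")))
  .select("phone")
  .distinct()

// Write each phone number as a vertex through the Spark Connector
// (addresses taken from the logs above; space and tag names are placeholders).
val connConfig = NebulaConnectionConfig.builder()
  .withMetaAddress("10.26.120.55:9559")
  .withGraphAddress("10.26.120.55:9669")
  .build()
val writeConfig = WriteNebulaVertexConfig.builder()
  .withSpace("test汉语")
  .withTag("电话号码")
  .withVidField("phone")
  .withBatch(256)
  .build()
phones.write.nebula(connConfig, writeConfig).writeVertices()

The edges between a person and each of their phone numbers could then be written in a similar way using the connector's edge write config, if needed.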
