Importing data from Hive with spark-writer: vertex_key type and column misalignment issues

W_A · July 6, 2020 12:16

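Hi, I have recently been using spark-writer to import data from a Hive table into the nba space.
Taking the team table as an example, the Hive table schema is shown below; it holds test data of the form team_1, team_2 … team_n.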

CREATE TABLE test.nebula_test_tag_player (
  id bigint,
  name varchar
)

The tags configuration is:

tags: [
  {
    name: team
    type: hive
    exec: "select id, name from test.nebula_test_tag_team"
    fields: {
      name: name
    }
    vertex: id
  }
]

The first issue is column misalignment.
After running spark-submit, the executor generates this insert statement:

SparkClientGenerator$: Exec : INSERT VERTEX team(name) VALUES team_20000000: ("team_20000000")

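This appears to happen because SparkClientGenerator derives vertexIndex = 1 from the configuration, while the id column in the Hive SQL is at index 0, so the columns are misaligned when the insert statement is generated.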

After moving the id column after name, the Hive SQL becomes:

select name, id from test.nebula_test_tag_team

Running spark-submit then throws java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String on the driver, apparently because SparkClientGenerator calls row.getString(vertexIndex).
The job only runs successfully after the id column of the Hive table is changed to the string type.

darionyaphet · July 7, 2020 02:46

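There is already a PR for the column order issue, and it will be merged.

As for the ClassCastException: some users' VIDs are strings by default, so the code uses String uniformly. You can work around it with a SQL cast.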
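
A minimal sketch of that cast, as the query that could go in the exec field of the tag configuration. It assumes the reordered column list from the original post (name first, then id) and keeps the alias id so that vertex: id in the config would still match; neither assumption has been checked against the spark-writer source.

```sql
-- Cast the bigint id to string in the query itself, so the Hive table
-- schema does not have to change. The "as id" alias is an assumption:
-- it keeps the column name that the tag config refers to.
select name, cast(id as string) as id
from test.nebula_test_tag_team
```

Once the column order fix is merged, the original order (id, name) should presumably work with the same cast.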