Nebula 2.0.1 SST import failure

  • Nebula version: v2.0.1
  • Deployment: docker-compose
  • Production (Y/N): N
  • Detailed description of the problem
  • Relevant meta / storage / graph info log messages

I am currently testing loading data from CSV files and importing it into Nebula via SST. The following two problems cause the import to fail:

  1. The CSV fields and nebula.fields cannot be matched up, and the job reports "key not found: id". Comparing against the 1.2 code, NebulaUtils.scala L67 - L83 added vertexField (and related fields), but this was not added in the 2.0 nebula-spark-utils project.

  2. After I fixed the field-mapping issue locally, the import job reports a "File is not opened" error. The log is as follows:

21/04/22 01:57:39 INFO NebulaSSTWriter: Loading RocksDB successfully
21/04/22 01:57:39 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
org.rocksdb.RocksDBException: File is not opened
	at org.rocksdb.SstFileWriter.finish(Native Method)
	at org.rocksdb.SstFileWriter.finish(SstFileWriter.java:223)
	at com.vesoft.nebula.exchange.writer.NebulaSSTWriter.close(FileBaseWriter.scala:46)
	at com.vesoft.nebula.exchange.processor.VerticesProcessor$$anonfun$process$2.apply(VerticesProcessor.scala:196)
	at com.vesoft.nebula.exchange.processor.VerticesProcessor$$anonfun$process$2.apply(VerticesProcessor.scala:160)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:980)
	at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:980)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

The version that adds vertexField comes with a restriction: when a CSV column is used both as the NebulaGraph vid and as a property, the property is forced to have the same data type as the vid. That is, if the vid is a String, the property must also be a String. Removing vertexField in 2.0 lifts that restriction.
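
For anyone hitting this later, here is a rough self-contained sketch of that difference. The names, types, and schema below are made up for illustration and are not the actual NebulaUtils code:

object FieldTypeMapSketch {
  def main(args: Array[String]): Unit = {
    val fields       = List("firstName", "id")        // columns taken from the csv `fields` config
    val nebulaFields = List("first_name")             // tag properties from `nebula.fields`
    val schema       = Map("first_name" -> "string")  // property types fetched from metad
    val vertexField  = "id"
    val vidType      = "string"                       // vid type of the space

    // 2.0: only the mapped property columns get a type entry
    val base = fields.zip(nebulaFields).map { case (csv, neb) => csv -> schema(neb) }.toMap
    // 1.x additionally registered the vid column, which is what tied the property type to the vid type
    val withVertexField = base + (vertexField -> vidType)

    println(base)            // Map(firstName -> string)
    println(withVertexField) // Map(firstName -> string, id -> string)
  }
}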

Could you paste the log of your "key not found: id" error so we can take a look?

The error log is:

21/04/22 03:42:57 INFO CodeGenerator: Code generated in 5.605027 ms
21/04/22 03:42:57 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.util.NoSuchElementException: key not found: id
	at scala.collection.MapLike$class.default(MapLike.scala:228)
	at scala.collection.AbstractMap.default(Map.scala:59)
	at scala.collection.MapLike$class.apply(MapLike.scala:141)
	at scala.collection.AbstractMap.apply(Map.scala:59)
	at com.vesoft.nebula.exchange.processor.Processor$class.extraValueForSST(Processor.scala:77)

The cause is that the CSV fields include id, but fieldTypeMap does not contain id.

My current workaround is to take a copy of fieldKeys with id filtered out:

Code before the change (EdgeProcessor.scala#L199):

property <- fieldKeys if property.trim.length != 0

Code after the change:

property <- fieldKeys.filter(x=> !x.equals(tagConfig.vertexField)) if property.trim.length != 0
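
For anyone who wants to see the failure in isolation, here is a minimal made-up sketch of the lookup that throws; the map contents are illustrative and not what Exchange actually builds:

object KeyNotFoundSketch {
  def main(args: Array[String]): Unit = {
    // fieldTypeMap only covers columns mapped via nebula.fields; the vid-only column "id" is absent
    val fieldTypeMap = Map("firstName" -> "string", "gender" -> "string")
    val fieldKeys    = List("firstName", "gender", "id")    // from the csv `fields` config

    // filtering out the vertex field first, as in the change above, avoids the failing lookup
    println(fieldKeys.filter(_ != "id").map(fieldTypeMap))  // List(string, string)

    fieldKeys.map(fieldTypeMap)                             // NoSuchElementException: key not found: id
  }
}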

  1. After I fixed the field-mapping issue locally, the import job reports a "File is not opened" error.

Check whether the path configured as nebula.path.local in your conf exists on every node of the Spark cluster.

When generating SST files, the sst files are created under this path. If the path does not exist, file creation fails, and the error then shows up at close time. The PR that corrects the error message: fix rocksdb file close by Nicole00 · Pull Request #63 · vesoft-inc/nebula-spark-utils · GitHub
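
In case it helps, a minimal self-contained sketch of that failure mode with RocksJava's SstFileWriter; the directory and file names here are just examples, not what Exchange uses:

import java.nio.file.{Files, Paths}
import org.rocksdb.{EnvOptions, Options, RocksDB, SstFileWriter}

object SstLocalPathSketch {
  def main(args: Array[String]): Unit = {
    RocksDB.loadLibrary()
    val localDir = "/tmp/nebula-sst/1"            // derived from nebula.path.local in the conf
    Files.createDirectories(Paths.get(localDir))  // if this directory is missing on an executor node,
                                                  // open() fails and the later finish() throws
                                                  // "File is not opened"
    val writer = new SstFileWriter(new EnvOptions(), new Options())
    try {
      writer.open(s"$localDir/part-1.sst")
      writer.put("key".getBytes(), "value".getBytes())
      writer.finish()
    } finally {
      writer.close()
    }
  }
}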

fieldTypeMap stores <CSV field name, that field's data type in Nebula>. If fieldTypeMap does not contain id, check the fields entry in your configuration file.
There are similar "key not found" issues on the forum. Please paste your Exchange configuration file and I will take a look at your settings.

After changing the local path, that problem is solved.

The application.conf file is below. The vertex (person.csv) and edge (person_knows_person.csv) files are the person and knows files generated by https://github.com/ldbc/ldbc_snb_datagen.

{
  # Spark related configuration
  spark: {
    app: {
      name: Nebula Exchange 2.0
    }

    driver: {
      cores: 1
      maxResultSize: 1G
    }

    executor: {
        memory:1G
    }

    cores {
      max: 16
    }
  }

  nebula: {
    address:{
      graph:["172.28.0.6:9669"]
      meta:["172.28.0.4:9559"]
    }
    user: root
    pswd: nebula

    space: ldbc_snb_sf100_vid_string

    path:{
      local: "/tmp"
      remote: "/user/consumer/sst/"
      hdfs.namenode: "hdfs://127.0.0.1:9000"
   }

    connection {
      timeout: 3000
      retry: 3
    }

    execution {
      retry: 3
    }

    error: {
      max: 32
      output: /tmp/errors
    }

    rate: {
      limit: 1024
      timeout: 1000
    }
  }

  tags: [
    {
      name: person
      type: {
        source: csv
        sink: sst
      }

      path: "/exchange/social_network/dynamic/person.csv"
        
      fields: [firstName, lastName, gender, birthday, creationDate, locationIP, browserUsed, id]
      nebula.fields: [first_name, last_name, gender, birthday, birthday, ip, browser]
      
      vertex: {
        field: id
      }

      separator: "|"
      header: true
      batch: 256
      partition: 32
      isImplicit: true
    }
  ]
  edges: [
    {
      name: knows
      type: {
        source: csv
        sink: sst
      }

      path: "/exchange/social_network/dynamic/person_knows_person.csv"

      fields: [creationDate, Person.id0, Person.id1]
      nebula.fields: [time]

      source: {
        field: Person.id0
      }
      target: {
        field: Person.id1
      }

      batch: 256
      separator: "|"
      header: true
      partition: 32
      isImplicit: true
    }
  ]
}
fields: [firstName, lastName, gender, birthday, creationDate, locationIP, browserUsed, id]
nebula.fields: [first_name, last_name, gender, birthday, birthday, ip, browser]

fields has one extra entry, the trailing id. This setting only configures the CSV fields to be written as properties, and it must correspond one-to-one with nebula.fields.
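
So with the tag above, one possible fix, assuming the nebula.fields names really do match the person tag's properties in your space, is to drop the trailing id from fields and keep id only as vertex.field:

fields: [firstName, lastName, gender, birthday, creationDate, locationIP, browserUsed]
nebula.fields: [first_name, last_name, gender, birthday, birthday, ip, browser]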


After removing it, the import succeeded.
