Batch import tool SparkClientGenerator reports an error on insert vertex

I am importing data in bulk with the tool:
sh spark-submit --class com.vesoft.nebula.tools.generator.v2.SparkClientGenerator --master yarn --executor-memory XXG --num-executors XX --executor-cores X sst.generator-1.0.0-rc4.jar -c nebula_writer.conf -d

Mapping file:

{
  spark: {
    app: {
      name: Nebula_graph_importer
    }

    driver: {
      cores: 4
      maxResultSize: 1G
    }
  }

  nebula: {
    addresses: [ip_list]
    user: user
    pswd: password
    space: relation0
    connection {
      timeout: 10000
      retry: 3
    }
    execution {
      retry: 6
    }
  }
  tags: [
    {
      name: user
      type: parquet
      path: "XXXXXXX"
      fields: {
        user_id: user_id,
        ......
      }
      vertex: user_id
      batch : 100
    }
  ]
}

The insert batch is set to 100.

Import error log: ERROR AsyncGraphClientImpl: execute error: Insert vertex not complete, completeness: 94

Storage error log: E0422 15:47:36.724191 120110 Part.cpp:422] [Port: 44501, Space: 10, Part: 32] Consensus error -5

Other non-error logs:

I0422 15:36:05.621559 120102 SlowOpTracker.h:33] [Port: 44501, Space: 10, Part: 3] , total time:191ms, Write WAL, total 5
I0422 15:36:05.867715 120121 SlowOpTracker.h:33] [Port: 44501, Space: 10, Part: 4] , total time:72ms, Write WAL, total 8
I0422 15:36:06.034884 120121 SlowOpTracker.h:33] [Port: 44501, Space: 10, Part: 2] , total time:120ms, Write WAL, total 11
I0422 15:36:06.418220 120123 SlowOpTracker.h:33] [Port: 44501, Space: 10, Part: 62] , total time:69ms, Write WAL, total 15
I0422 15:36:06.554832 120104 SlowOpTracker.h:33] [Port: 44501, Space: 10, Part: 31] , total time:67ms, Write WAL, total 7
I0422 15:36:06.910140 120134 SlowOpTracker.h:33] [Port: 44501, Space: 10, Part: 61] , total time:80ms, Write WAL, total 10
I0422 15:36:06.911126 120134 SlowOpTracker.h:33] [Port: 44501, Space: 10, Part: 60] , total time:70ms, Total send logs: 35
I0422 15:36:07.817355 120133 SlowOpTracker.h:33] [Port: 44501, Space: 10, Part: 119] , total time:77ms, Write WAL, total 2
I0422 15:36:08.194954 120120 SlowOpTracker.h:33] [Port: 44501, Space: 10, Part: 118] , total time:52ms, Write WAL, total 5
I0422 15:36:09.135303 120133 SlowOpTracker.h:33] [Port: 44501, Space: 10, Part: 90] , total time:405ms, Write WAL, total 21
I0422 15:36:09.919193 120115 SlowOpTracker.h:33] [Port: 44501, Space: 10, Part: 120] , total time:72ms, Write WAL, total 3
W0422 15:36:10.012905 120134 RaftPart.cpp:576] [Port: 44501, Space: 10, Part: 33] The appendLog buffer is full. Please slow down the log appending rate.replicatingLogs_ :1

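The buffer is overflowing. Try a smaller batch.

I tried a smaller batch size, but the problem persists: the client still reports execute error.

Also, after reducing the batch size (to 50, for example), I confirmed that each insert batch contains only 50 rows. Yet the error log still reads execute error: Insert vertex not complete, completeness: 94, and the completeness value is still 94 or 95, which is larger than the batch size.

I also took a closer look at the storage and graph logs.
The storage log: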
E0423 11:27:07.900024 35264 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.900049 35258 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.900116 35248 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.900167 35264 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.900202 35257 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.900223 35264 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.900269 35255 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.900348 35271 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.900401 35258 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.900552 35257 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.900578 35264 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.900684 35264 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.900709 35258 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.900921 35271 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.900983 35271 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.901593 35258 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.901675 35264 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.901794 35257 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.901955 35264 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.902011 35257 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
E0423 11:27:07.902041 35271 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5

Questions:
1. This error is logged at a very high rate.
2. Consensus error -5: as far as I can tell, error code -5 means ERR_LEADER_CHANGED. Does a leader change cause writes to fail?
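
To put a number on "very high rate", the storage log can be tallied per partition. The following is a minimal sketch, not part of any Nebula tooling: it assumes the glog line layout shown in the excerpts above, and the function name `count_consensus_errors` is made up for illustration.

```python
import re
from collections import Counter

# Matches glog-style lines such as:
# E0423 11:27:07.900024 35264 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5
CONSENSUS_RE = re.compile(
    r"^E\d{4} \S+\s+\d+ Part\.cpp:\d+\] "
    r"\[Port: \d+, Space: (?P<space>\d+), Part: (?P<part>\d+)\] "
    r"Consensus error (?P<code>-?\d+)"
)


def count_consensus_errors(lines):
    """Return a Counter keyed by (space, part, error_code)."""
    counts = Counter()
    for line in lines:
        m = CONSENSUS_RE.match(line.strip())
        if m:
            key = (int(m["space"]), int(m["part"]), int(m["code"]))
            counts[key] += 1
    return counts


# Sample lines copied from the logs in this thread.
sample = [
    "E0423 11:27:07.900024 35264 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5",
    "E0423 11:27:07.900049 35258 Part.cpp:422] [Port: 44501, Space: 16, Part: 4] Consensus error -5",
    "I0422 15:36:05.621559 120102 SlowOpTracker.h:33] [Port: 44501, Space: 10, Part: 3] , total time:191ms, Write WAL, total 5",
]

for key, n in count_consensus_errors(sample).most_common():
    print(key, n)
```

Running it over the whole storage log (for example `count_consensus_errors(open(path))`, with `path` pointing at your own log file) shows whether the errors are concentrated on a few partitions or spread across all of them.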

The graph log:
E0423 12:05:41.643735 35732 ExecutionPlan.cpp:76] Execute failed: Insert edge `follow’ not complete, completeness: 96
E0423 12:05:41.670919 35743 InsertEdgeExecutor.cpp:263] Insert edge failed, error -16, part 36

What does error code -16 at the graph layer mean?

That value is 94%: completeness is a percentage, not a row count.
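
A small worked example of why a batch of 50 can still report 94. This is a sketch under an assumption that is not confirmed in this thread: that completeness is the integer percentage of sub-requests (for example, one per partition touched) that succeeded. The function below is illustrative and is not Nebula's actual code.

```python
def completeness(total_requests: int, failed_requests: int) -> int:
    """Integer percentage of sub-requests that succeeded (assumed definition)."""
    if total_requests <= 0:
        return 0
    return (total_requests - failed_requests) * 100 // total_requests


# The reported value does not depend on the absolute batch size,
# only on the ratio of failures to requests sent.
print(completeness(100, 6))
print(completeness(50, 3))
```

Under that assumption, halving the batch size leaves the percentage unchanged as long as the same fraction of sub-requests fails.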

-5 is buffer overflow.

If it is a buffer overflow, is there a parameter to increase the buffer size?

etc/nebula-storage.conf : --max_batch_size

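Which layer's buffer does "buffer overflow" refer to, exactly? Is it the cache that data is put into on the storage side (similar to HBase's memstore) before it is flushed to disk?

Most likely there are too few partitions, so each partition receives too many concurrent writes from the clients (Spark workers).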
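
A rough back-of-the-envelope model of that effect, as a sketch only. It assumes vertex IDs hash uniformly across partitions and that every Spark executor core acts as one concurrent writer; neither is confirmed in this thread, and all numbers below are hypothetical.

```python
def partition_hit_probability(num_partitions: int, batch_size: int) -> float:
    """Probability that one batch contains at least one row for a given partition,
    assuming rows are spread uniformly at random over the partitions."""
    return 1.0 - (1.0 - 1.0 / num_partitions) ** batch_size


def expected_writers_per_partition(num_writers: int, num_partitions: int,
                                   batch_size: int) -> float:
    """Expected number of in-flight batches that touch a given partition."""
    return num_writers * partition_hit_probability(num_partitions, batch_size)


# Hypothetical setup: 40 executors x 4 cores = 160 concurrent writers.
writers = 40 * 4
for parts in (120, 480):
    for batch in (100, 50):
        load = expected_writers_per_partition(writers, parts, batch)
        print(f"partitions={parts} batch={batch} writers/partition={load:.1f}")
```

In this model, the per-partition load drops more when the partition count grows than when the batch size is halved, which is consistent with a smaller batch alone not making the errors go away.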

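It is the Raft WAL, which has to cross the network. Think of it like HBase writing its WAL to HDFS, where 3 copies are shipped out.

What value is appropriate for --max_batch_size in etc/nebula-storage.conf?

There is no official per-scenario recommendation yet. Stick with the default for now.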


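Hm? Sorry to ask, but what was the final solution in the end?
With a replica count of 1, the import is very efficient.
With a replica count of 3, import throughput drops sharply.
Is there an officially recommended configuration ratio?
What batch size, and how many Spark executors?

The official defaults are generally the best choice, though it does depend on the scenario.

Turn off auto compact and compare the results.

If there are indexes, drop them first, then rebuild them after the import.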
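
The drop-then-rebuild order can be sketched as a small helper that only builds the statement strings. The nGQL syntax here is what I recall from Nebula 1.x and should be checked against the documentation for your version; the index name `user_id_index` is hypothetical, while the tag `user` and field `user_id` come from the mapping file above.

```python
def index_maintenance_statements(index_name, tag_name, fields):
    """Return (before_import, after_import) lists of nGQL statements.

    Assumed Nebula 1.x syntax; verify before running against a cluster.
    """
    before_import = [f"DROP TAG INDEX {index_name}"]
    after_import = [
        f"CREATE TAG INDEX {index_name} ON {tag_name}({', '.join(fields)})",
        f"REBUILD TAG INDEX {index_name} OFFLINE",
    ]
    return before_import, after_import


before, after = index_maintenance_statements("user_id_index", "user", ["user_id"])
print("-- run before the import")
for stmt in before:
    print(stmt + ";")
print("-- run after the import finishes")
for stmt in after:
    print(stmt + ";")
```

Keeping the statements as plain strings makes it easy to review them by hand before pasting them into the console.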