NebulaGraph 1.2.0: Spark job hangs during Exchange import (bug)

  • Nebula version: v1.2.0
  • Deployment (distributed / standalone / Docker / DBaaS): Docker
  • Hardware
    • Disk (must be SSD; HDD is not supported): SSD
    • CPU / memory: 24 cores, 48 threads / 378 GB
| ID  | Name          | Partition Number | Replica Factor | Charset | Collate  |
| --- | ------------- | ---------------- | -------------- | ------- | -------- |
| 133 | knowledge_2hm | 15               | 3              | utf8    | utf8_bin |
  • Problem description
    Some time after the Spark import job starts, the following error appears:
21/01/20 15:00:06 ERROR AbstractNebulaCallback: onError: java.util.concurrent.TimeoutException: Operation class com.vesoft.nebula.graph.GraphService$AsyncClient$execute_call timed out after 5005 ms.
java.util.concurrent.TimeoutException: Operation class com.vesoft.nebula.graph.GraphService$AsyncClient$execute_call timed out after 5005 ms.
	at com.facebook.thrift.async.TAsyncClientManager$SelectThread.timeoutMethods(TAsyncClientManager.java:157)
	at com.facebook.thrift.async.TAsyncClientManager$SelectThread.run(TAsyncClientManager.java:114)

After that, the whole Spark job hangs and makes no progress for hours.
The Exchange configuration file used for the import is as follows:

{
  spark: {
    app: {
      name: Spark Writer
    }

    driver: {
      cores: 2
      maxResultSize: 8G
    }

    cores {
      max: 8
    }
  }

  nebula: {
    address: {
      graph: ["10.38.16.87:3699", "10.38.16.87:3700", "10.38.16.87:3701"]
      meta: ["10.38.16.87:45501"]
    }

    user: user
    pswd: password

    space: knowledge_2hm

    connection {
      timeout: 5000
      retry: 3
    }

    execution {
      retry: 3
    }
    error: {
      max: 32
      output: /tmp/errors
    }
    rate: {
      limit: 1024
      timeout: 1000
    }
  }


  tags: [
    {
      name: Thing
      type: {
        source: hive
        sink: client
      }
      exec: "select thing_id, thing_name, thing_namech, thing_nameen, thing_abbreviation, thing_tag, thing_alias, thing_abstract, thing_image, thing_video, thing_audio, thing_gmtcreated, thing_gmtmodified, thing_popularity, thing_prior, thing_datasource, thing_urls from oppo_kg_dw.thing_20210103 where ds = '20210103'"
      fields: [thing_name, thing_namech, thing_nameen, thing_abbreviation, thing_tag, thing_alias, thing_abstract, thing_image, thing_video, thing_audio, thing_gmtcreated, thing_gmtmodified, thing_popularity, thing_prior, thing_datasource, thing_urls]
      nebula.fields: [Thing_name, Thing_nameCh, Thing_nameEn, Thing_abbreviation, Thing_tag, Thing_alias, Thing_abstract, Thing_image, Thing_video, Thing_audio, Thing_gmtCreated, Thing_gmtModified, Thing_popularity, Thing_prior, Thing_dataSource, Thing_urls]
      vertex: thing_id
      isImplicit: true
      batch: 128
      partition: 8
    }
    
  ]

  edges: [
    {
      name: Thing_type
      type: {
        source: hive
        sink: client
      }
      exec: "select src_id, dst_id from oppo_kg_dw.edge_20210103 where ds = '20210103' and edge_label = 'Thing_type'"
      fields: []
      nebula.fields: []
      source:  src_id
      target:  dst_id
      isImplicit: true
      batch: 256
      partition: 8
    }
  ]
}
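Worth noting: the "timed out after 5005 ms" in the log lines up with `connection.timeout: 5000` in the config above. A low-risk first step is to raise that client-side timeout so large batches have more headroom before the call is abandoned. A hedged sketch of just the fragment to change (the 30000 value is illustrative, not a verified fix):

```
  nebula: {
    connection {
      # Client-side RPC timeout in ms; the 5000 used above matches
      # the "timed out after 5005 ms" in the error log.
      timeout: 30000
      retry: 3
    }
  }
```

Lowering the per-tag/per-edge `batch` (e.g. 128 → 64) works in the same direction: smaller requests are more likely to finish within the timeout.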

What you're seeing is that graphd timed out during the import; the exception was raised, but the Spark job did not exit promptly. We need some more information from you:

  1. How much data are you importing?
  2. Run `show hosts` and check the storaged services.
  3. Please paste the graphd and storaged logs.

I'm importing 200 million vertices and 1 billion edges. My feeling is that the async client doesn't work well for this import; I'm going to switch to a sync client and try again.


Has this been solved? I'm running into the same problem.

@xiaopqr care to share? :grin:

Please paste the information Nicole asked for above (graphd and storaged details) so she can help you take a look.

After I reduced the partition count and a few other settings, the problem hasn't shown up again. I'll leave it for now; it may just have been a flaky network.

If it happens again I'll come back and post the information.

Sounds good, please keep this thread updated if you run into the problem again :partying_face:

Will do!

We solved this by modifying the source code: we changed Exchange's async client to a sync client. That fixes the hang, and since we no longer need to estimate a safe ingest rate ourselves, it actually helps improve import speed as well.
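The async → sync change described above can be sketched generically. This is not the actual Exchange or Nebula client code; `asyncExecute`, `executeSync`, and the statement string are hypothetical stand-ins. The point is that blocking on each batch before sending the next gives you backpressure for free, so the caller can't outrun the server:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SyncWriteDemo {
    // Hypothetical stand-in for an async client call that may time out.
    static CompletableFuture<String> asyncExecute(String stmt) {
        return CompletableFuture.supplyAsync(() -> "OK: " + stmt);
    }

    // Blocking wrapper: wait for each batch to finish before sending the
    // next, retrying a bounded number of times on timeout. This mirrors
    // the async -> sync change the poster describes.
    static String executeSync(String stmt, int retries, long timeoutMs)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return asyncExecute(stmt).get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                if (attempt >= retries) throw e; // give up after N retries
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Prints "OK: INSERT VERTEX ..."
        System.out.println(executeSync("INSERT VERTEX ...", 3, 5000));
    }
}
```

With the sync wrapper, a persistent graphd timeout surfaces as a thrown exception that fails the Spark task, instead of a stalled callback that leaves the job hanging.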
