NebulaGraph 1.2.0: Spark job hangs during Exchange import (bug)

  • Nebula version: v1.2.0
  • Deployment (distributed / standalone / Docker / DBaaS): Docker
  • Hardware
    • Disk (must be SSD; HDD is not supported): SSD
    • CPU / memory: 24 cores, 48 threads / 378 GB
| ID  | Name          | Partition Number | Replica Factor | Charset | Collate  |
| --- | ------------- | ---------------- | -------------- | ------- | -------- |
| 133 | knowledge_2hm | 15               | 3              | utf8    | utf8_bin |
  • Problem description
    Some time after the Spark import job starts, the following error appears:
21/01/20 15:00:06 ERROR AbstractNebulaCallback: onError: java.util.concurrent.TimeoutException: Operation class com.vesoft.nebula.graph.GraphService$AsyncClient$execute_call timed out after 5005 ms.
java.util.concurrent.TimeoutException: Operation class com.vesoft.nebula.graph.GraphService$AsyncClient$execute_call timed out after 5005 ms.
	at com.facebook.thrift.async.TAsyncClientManager$SelectThread.timeoutMethods(TAsyncClientManager.java:157)
	at com.facebook.thrift.async.TAsyncClientManager$SelectThread.run(TAsyncClientManager.java:114)

After that, the whole Spark job hangs and makes no progress for hours.
The Exchange configuration file used for the import is as follows:

{
  spark: {
    app: {
      name: Spark Writer
    }

    driver: {
      cores: 2
      maxResultSize: 8G
    }

    cores {
      max: 8
    }
  }

  nebula: {
    address: {
      graph: ["10.38.16.87:3699", "10.38.16.87:3700", "10.38.16.87:3701"]
      meta: ["10.38.16.87:45501"]
    }

    user: user
    pswd: password

    space: knowledge_2hm

    connection {
      timeout: 5000
      retry: 3
    }

    execution {
      retry: 3
    }
    error: {
      max: 32
      output: /tmp/errors
    }
    rate: {
      limit: 1024
      timeout: 1000
    }
  }


  tags: [
    {
      name: Thing
      type: {
        source: hive
        sink: client
      }
      exec: "select thing_id, thing_name, thing_namech, thing_nameen, thing_abbreviation, thing_tag, thing_alias, thing_abstract, thing_image, thing_video, thing_audio, thing_gmtcreated, thing_gmtmodified, thing_popularity, thing_prior, thing_datasource, thing_urls from oppo_kg_dw.thing_20210103 where ds = '20210103'"
      fields: [thing_name, thing_namech, thing_nameen, thing_abbreviation, thing_tag, thing_alias, thing_abstract, thing_image, thing_video, thing_audio, thing_gmtcreated, thing_gmtmodified, thing_popularity, thing_prior, thing_datasource, thing_urls]
      nebula.fields: [Thing_name, Thing_nameCh, Thing_nameEn, Thing_abbreviation, Thing_tag, Thing_alias, Thing_abstract, Thing_image, Thing_video, Thing_audio, Thing_gmtCreated, Thing_gmtModified, Thing_popularity, Thing_prior, Thing_dataSource, Thing_urls]
      vertex: thing_id
      isImplicit: true
      batch: 128
      partition: 8
    }
    
  ]

  edges: [
    {
      name: Thing_type
      type: {
        source: hive
        sink: client
      }
      exec: "select src_id, dst_id from oppo_kg_dw.edge_20210103 where ds = '20210103' and edge_label = 'Thing_type'"
      fields: []
      nebula.fields: []
      source:  src_id
      target:  dst_id
      isImplicit: true
      batch: 256
      partition: 8
    }
  ]
}
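Worth noting: the "timed out after 5005 ms" in the log lines up with `connection.timeout: 5000` in the config above. A low-risk first step is to raise that client-side timeout so large batches have more headroom before the call is abandoned. A hedged sketch of just the fragment to change (the 30000 value is illustrative, not a verified fix):

```
  nebula: {
    connection {
      # Client-side RPC timeout in ms; the 5000 used above matches
      # the "timed out after 5005 ms" in the error log.
      timeout: 30000
      retry: 3
    }
  }
```

Lowering the per-tag/per-edge `batch` (e.g. 128 → 64) works in the same direction: smaller requests are more likely to finish within the timeout.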

What you're seeing is that graphd timed out during the import; the exception was raised, but the Spark job did not exit promptly. We need some more information from you:

  1. How much data are you importing?
  2. Run `show hosts` and check the storaged services.
  3. Please paste the graphd and storaged logs.

I'm importing 200 million vertices and 1 billion edges. My feeling is that the async client doesn't work well for this import; I'm going to switch to a sync client and try again.


Has this been solved? I'm running into the same problem.

@xiaopqr care to share? :grin:

Please paste the information Nicole asked for above (graphd and storaged details) so she can help you take a look.

After I reduced the partition count and a few other settings, the problem hasn't shown up again. I'll leave it for now; it may just have been a flaky network.

If it happens again I'll come back and post the information.

Sounds good, please keep this thread updated if you run into the problem again :partying_face:

Will do!

We solved this by modifying the source code: we changed Exchange's async client to a sync client. That fixes the hang, and since we no longer need to estimate a safe ingest rate ourselves, it actually helps improve import speed as well.
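The async → sync change described above can be sketched generically. This is not the actual Exchange or Nebula client code; `asyncExecute`, `executeSync`, and the statement string are hypothetical stand-ins. The point is that blocking on each batch before sending the next gives you backpressure for free, so the caller can't outrun the server:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SyncWriteDemo {
    // Hypothetical stand-in for an async client call that may time out.
    static CompletableFuture<String> asyncExecute(String stmt) {
        return CompletableFuture.supplyAsync(() -> "OK: " + stmt);
    }

    // Blocking wrapper: wait for each batch to finish before sending the
    // next, retrying a bounded number of times on timeout. This mirrors
    // the async -> sync change the poster describes.
    static String executeSync(String stmt, int retries, long timeoutMs)
            throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return asyncExecute(stmt).get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                if (attempt >= retries) throw e; // give up after N retries
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Prints "OK: INSERT VERTEX ..."
        System.out.println(executeSync("INSERT VERTEX ...", 3, 5000));
    }
}
```

With the sync wrapper, a persistent graphd timeout surfaces as a thrown exception that fails the Spark task, instead of a stalled callback that leaves the job hanging.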
