图库报错E_LEADER_LEASE_FAILED

  • nebula 版本:3.1.0
  • 部署方式:分布式
  • 安装方式: RPM
  • 是否为线上版本: N
  • 硬件信息
    • 磁盘( 推荐使用 SSD)SSD
    • CPU、内存信息 500G
  • 问题的具体描述
  • 相关的 meta / storage / graph info 日志信息(尽量使用文本形式方便检索)

使用exchange导入数据,数据量较大,在一个任务完成后,启动另一个导数任务,但是发现该任务无法建立连接,全是ReadTimeOut,于是关闭任务

22/06/07 13:50:25 WARN scheduler.TaskSetManager: Lost task 19.1 in stage 2.0 (TID 84, dn4, executor 1): com.vesoft.nebula.client.graph.exception.IOErrorException: java.net.SocketTimeoutException: Read timed out
        at com.vesoft.nebula.client.graph.net.SyncConnection.executeWithParameter(SyncConnection.java:189)
        at com.vesoft.nebula.client.graph.net.Session.executeWithParameter(Session.java:113)
        at com.vesoft.nebula.client.graph.net.Session.execute(Session.java:78)
        at com.vesoft.exchange.common.GraphProvider.submit(GraphProvider.scala:78)
        at com.vesoft.exchange.common.writer.NebulaGraphClientWriter.writeEdges(ServerBaseWriter.scala:153)
        at com.vesoft.nebula.exchange.processor.EdgeProcessor$$anonfun$com$vesoft$nebula$exchange$processor$EdgeProcessor$$processEachPartition$1.apply(EdgeProcessor.scala:71)
        at com.vesoft.nebula.exchange.processor.EdgeProcessor$$anonfun$com$vesoft$nebula$exchange$processor$EdgeProcessor$$processEachPartition$1.apply(EdgeProcessor.scala:69)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at com.vesoft.nebula.exchange.processor.EdgeProcessor.com$vesoft$nebula$exchange$processor$EdgeProcessor$$processEachPartition(EdgeProcessor.scala:69)
        at com.vesoft.nebula.exchange.processor.EdgeProcessor$$anonfun$process$3.apply(EdgeProcessor.scala:178)
        at com.vesoft.nebula.exchange.processor.EdgeProcessor$$anonfun$process$3.apply(EdgeProcessor.scala:178)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:980)
        at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:980)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2107)
        at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2107)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:411)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:417)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

22/06/07 13:50:25 INFO scheduler.TaskSetManager: Starting task 19.2 in stage 2.0 (TID 121, dn4, executor 1, partition 19, NODE_LOCAL, 7778 bytes)
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Lost task 14.1 in stage 2.0 (TID 82) on dn4, executor 1: com.vesoft.nebula.client.graph.exception.IOErrorException (java.net.SocketTimeoutException: Read timed out) [duplicate 1]
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Starting task 14.2 in stage 2.0 (TID 122, dn4, executor 1, partition 14, NODE_LOCAL, 7778 bytes)
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Lost task 29.1 in stage 2.0 (TID 83) on nd2, executor 3: com.vesoft.nebula.client.graph.exception.IOErrorException (java.net.SocketTimeoutException: Read timed out) [duplicate 2]
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Starting task 29.2 in stage 2.0 (TID 123, nd2, executor 3, partition 29, NODE_LOCAL, 7778 bytes)
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Lost task 25.1 in stage 2.0 (TID 81) on nd2, executor 3: com.vesoft.nebula.client.graph.exception.IOErrorException (java.net.SocketTimeoutException: Read timed out) [duplicate 3]
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Starting task 25.2 in stage 2.0 (TID 124, nd2, executor 3, partition 25, NODE_LOCAL, 7778 bytes)
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Lost task 15.1 in stage 2.0 (TID 86) on dn5, executor 4: com.vesoft.nebula.client.graph.exception.IOErrorException (java.net.SocketTimeoutException: Read timed out) [duplicate 4]
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Starting task 15.2 in stage 2.0 (TID 125, dn5, executor 4, partition 15, NODE_LOCAL, 7778 bytes)
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Lost task 22.1 in stage 2.0 (TID 88) on dn4, executor 1: com.vesoft.nebula.client.graph.exception.IOErrorException (java.net.SocketTimeoutException: Read timed out) [duplicate 5]

此时在console进行查询,就无法查询了,全是如下报错。

(root@nebula) [trans]> match (v) return v limit 10

[ERROR (-1005)]: Storage Error: part: 22, error: E_LEADER_LEASE_FAILED(-3531).

Tue, 07 Jun 2022 22:01:46 CST

日志

E20220607 14:15:37.818251 108569 StorageAccessExecutor.h:39] AppendVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 27
E20220607 14:15:37.818378 108569 StorageAccessExecutor.h:39] AppendVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 26
E20220607 14:15:37.818416 108569 StorageAccessExecutor.h:39] AppendVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 36
E20220607 14:15:37.818450 108569 StorageAccessExecutor.h:136] Storage Error: part: 27, error: E_LEADER_LEASE_FAILED(-3531).
E20220607 14:15:37.818590 108557 QueryInstance.cpp:137] Storage Error: part: 27, error: E_LEADER_LEASE_FAILED(-3531).

进行balance leader ,一直FAILED,目前图库完全不可用,但是进程都正常。
但是info日志一直在load leader,也一直没恢复

I20220607 14:35:49.519730 109114 MetaClient.cpp:3085] Load leader ok
I20220607 14:35:59.584862 109114 MetaClient.cpp:3079] Load leader of "10.100.2.243":9779 in 2 space
I20220607 14:35:59.584939 109114 MetaClient.cpp:3079] Load leader of "10.100.2.244":9779 in 2 space
I20220607 14:35:59.585009 109114 MetaClient.cpp:3079] Load leader of "10.100.2.245":9779 in 2 space
I20220607 14:35:59.585021 109114 MetaClient.cpp:3085] Load leader ok

有开Toss吗

一开始开了toss导入,发现导入比较慢,然后把toss关掉了,继续导入剩下的数据,导入完成后再导入另一个图库的数据,就变成这样了

跑了多久挂的?storage cpu 大概有多少?

麻烦多发一些Storage的LOG。不止ERROR,INFO也需要。这个问题比较难查

没有挂,只是查询语句都报错,cpu使用很少的,1核都没有

日志请看 20220607 15:00:00 之前的,发生的时间大概是下午13.50,链接:https://space.dingtalk.com/s/gwHOA8gw0QLOEVwinwPaACBjN2QzYTMyMTI2NWM0NTM5OWM0YmViOTIyNDY3Y2M5NA 密码: Z95U

补充点信息,我执行了一次compact任务,完成后,部分可以查询了,但是还有部分点报错

[ERROR (-1005)]: Storage Error: Not the leader of 41. Please retry later.

你可能得再传点日志 之前的日志级别不够详细 导致很多信息没打印出来 把storage里面配置–v改成1
最好先不要开toss,先确认下前面导入的数据是否能够正常读出来

我这边为了能够使用把图库给重启了,重启之后是没问题的,出现这个情况是因为我一个space导入了大批量的数据之后立马向另一个space导入大批量数据。
之前我是做完compact之后,没有报错E_LEADER_LEASE_FAILED,但是某些查询还会报错 Not the leader of 17. Please retry later.
我先把日志打开吧, 后续我再导数看能否复现。

1 个赞

使用exchange导数,导入跟图库相同的数据,
命令spark-submit --master local[20] --class com.vesoft.nebula.exchange.Exchange \ --conf spark.drivewr.memory=20g \ nebula-exchange_spark_2.4-3.0-SNAPSHOT.jar -c application.conf
配置如下:

{
      name: address
      type: {
        source: csv
        sink: client
      }
      path: "hdfs://10.100.2.90:8020/user/graph_dev/req1000w/workaddress.csv"
      fields: [_c1,_c2,_c3]
      nebula.fields: [word,name,num]
      vertex: {
        field:_c0
        # policy:hash
      }
      separator: ","
      header: false
      batch: 2560
      partition: 40

    }

jmter查询图库数据,并发500线程,进行查询。
storage 日志没有记录
graph日志 出现大量报错

E20220615 17:56:51.745271 122314 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:51.745450 122328 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:51.749065 122334 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:51.749302 122335 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:51.786526 122267 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:51.786799 122306 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:51.791447 122306 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:51.791711 122278 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:51.827911 122278 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:51.828207 122328 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:51.834795 122335 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:51.835050 122325 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:51.837221 122325 StorageAccessExecutor.h:39] GetVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 30
E20220615 17:56:51.837306 122325 StorageAccessExecutor.h:136] Storage Error: part: 30, error: E_LEADER_LEASE_FAILED(-3531).
E20220615 17:56:51.837471 122335 QueryInstance.cpp:137] Storage Error: part: 30, error: E_LEADER_LEASE_FAILED(-3531).
E20220615 17:56:51.864755 122328 StorageAccessExecutor.h:39] GetVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 30
E20220615 17:56:51.864847 122328 StorageAccessExecutor.h:136] Storage Error: part: 30, error: E_LEADER_LEASE_FAILED(-3531).
E20220615 17:56:51.865048 122267 QueryInstance.cpp:137] Storage Error: part: 30, error: E_LEADER_LEASE_FAILED(-3531).
E20220615 17:56:51.892114 122325 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:51.892468 122315 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:51.897945 122325 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:51.898176 122267 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:52.035589 122276 StorageAccessExecutor.h:39] GetVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 30
E20220615 17:56:52.035668 122276 StorageAccessExecutor.h:136] Storage Error: part: 30, error: E_LEADER_LEASE_FAILED(-3531).
E20220615 17:56:52.035867 122278 QueryInstance.cpp:137] Storage Error: part: 30, error: E_LEADER_LEASE_FAILED(-3531).
E20220615 17:56:52.049520 122267 StorageAccessExecutor.h:39] GetVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 30
E20220615 17:56:52.049585 122267 StorageAccessExecutor.h:136] Storage Error: part: 30, error: E_LEADER_LEASE_FAILED(-3531).
E20220615 17:56:52.049786 122281 QueryInstance.cpp:137] Storage Error: part: 30, error: E_LEADER_LEASE_FAILED(-3531).
E20220615 17:56:52.075212 122315 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:52.075570 122334 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:52.079699 122315 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:52.079977 122334 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:52.092885 122328 StorageAccessExecutor.h:39] GetVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 30
E20220615 17:56:52.092958 122328 StorageAccessExecutor.h:136] Storage Error: part: 30, error: E_LEADER_LEASE_FAILED(-3531).
E20220615 17:56:52.093106 122328 QueryInstance.cpp:137] Storage Error: part: 30, error: E_LEADER_LEASE_FAILED(-3531).
E20220615 17:56:52.094175 122278 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:52.094476 122335 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:52.097288 122278 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:52.097551 122276 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:52.114856 122322 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:52.115108 122314 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:52.118309 122314 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:52.118633 122315 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:52.127970 122243 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:52.128180 122315 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.
E20220615 17:56:52.177963 122315 StorageAccessExecutor.h:39] GetVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 30
E20220615 17:56:52.178033 122315 StorageAccessExecutor.h:136] Storage Error: part: 30, error: E_LEADER_LEASE_FAILED(-3531).
E20220615 17:56:52.178167 122267 QueryInstance.cpp:137] Storage Error: part: 30, error: E_LEADER_LEASE_FAILED(-3531).
E20220615 17:56:52.189901 122278 StorageAccessExecutor.h:39] GetVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 30
E20220615 17:56:52.189973 122278 StorageAccessExecutor.h:136] Storage Error: part: 30, error: E_LEADER_LEASE_FAILED(-3531).
E20220615 17:56:52.190116 122278 QueryInstance.cpp:137] Storage Error: part: 30, error: E_LEADER_LEASE_FAILED(-3531).
E20220615 17:56:52.209879 122322 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_LEADER_CHANGED, part 30
E20220615 17:56:52.210256 122246 QueryInstance.cpp:137] Storage Error: Not the leader of 30. Please retry later.

同时导数任务也出现错误,task均执行失败


此时将导数任务关掉,查询仍然没有恢复正常,仍继续报错。

先把batch改小到默认值256试试看哈

请问,现在解决了吗

暂时没有找到根本原因,但是可以通过避免高并发查询和写入同时进行,然后导入数据的batch调小一点,线程数也不要太多,这样是可以避免的。

1 个赞