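- Nebula version: 3.1.0
- Deployment: distributed
- Installation method: RPM
- Production environment: N
- Hardware information
- Disk (SSD recommended): SSD
- CPU and memory information: 500G
- Detailed description of the problem
- Relevant meta / storage / graph info logs (preferably as text, so they are easy to search)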
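I am importing a large amount of data with Exchange. After one import job finished, I started a second one, but that job could not establish a connection: every request failed with a Read timed out error, so I stopped the job.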
22/06/07 13:50:25 WARN scheduler.TaskSetManager: Lost task 19.1 in stage 2.0 (TID 84, dn4, executor 1): com.vesoft.nebula.client.graph.exception.IOErrorException: java.net.SocketTimeoutException: Read timed out
at com.vesoft.nebula.client.graph.net.SyncConnection.executeWithParameter(SyncConnection.java:189)
at com.vesoft.nebula.client.graph.net.Session.executeWithParameter(Session.java:113)
at com.vesoft.nebula.client.graph.net.Session.execute(Session.java:78)
at com.vesoft.exchange.common.GraphProvider.submit(GraphProvider.scala:78)
at com.vesoft.exchange.common.writer.NebulaGraphClientWriter.writeEdges(ServerBaseWriter.scala:153)
at com.vesoft.nebula.exchange.processor.EdgeProcessor$$anonfun$com$vesoft$nebula$exchange$processor$EdgeProcessor$$processEachPartition$1.apply(EdgeProcessor.scala:71)
at com.vesoft.nebula.exchange.processor.EdgeProcessor$$anonfun$com$vesoft$nebula$exchange$processor$EdgeProcessor$$processEachPartition$1.apply(EdgeProcessor.scala:69)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at com.vesoft.nebula.exchange.processor.EdgeProcessor.com$vesoft$nebula$exchange$processor$EdgeProcessor$$processEachPartition(EdgeProcessor.scala:69)
at com.vesoft.nebula.exchange.processor.EdgeProcessor$$anonfun$process$3.apply(EdgeProcessor.scala:178)
at com.vesoft.nebula.exchange.processor.EdgeProcessor$$anonfun$process$3.apply(EdgeProcessor.scala:178)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:980)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:980)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2107)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2107)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:411)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:417)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Starting task 19.2 in stage 2.0 (TID 121, dn4, executor 1, partition 19, NODE_LOCAL, 7778 bytes)
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Lost task 14.1 in stage 2.0 (TID 82) on dn4, executor 1: com.vesoft.nebula.client.graph.exception.IOErrorException (java.net.SocketTimeoutException: Read timed out) [duplicate 1]
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Starting task 14.2 in stage 2.0 (TID 122, dn4, executor 1, partition 14, NODE_LOCAL, 7778 bytes)
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Lost task 29.1 in stage 2.0 (TID 83) on nd2, executor 3: com.vesoft.nebula.client.graph.exception.IOErrorException (java.net.SocketTimeoutException: Read timed out) [duplicate 2]
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Starting task 29.2 in stage 2.0 (TID 123, nd2, executor 3, partition 29, NODE_LOCAL, 7778 bytes)
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Lost task 25.1 in stage 2.0 (TID 81) on nd2, executor 3: com.vesoft.nebula.client.graph.exception.IOErrorException (java.net.SocketTimeoutException: Read timed out) [duplicate 3]
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Starting task 25.2 in stage 2.0 (TID 124, nd2, executor 3, partition 25, NODE_LOCAL, 7778 bytes)
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Lost task 15.1 in stage 2.0 (TID 86) on dn5, executor 4: com.vesoft.nebula.client.graph.exception.IOErrorException (java.net.SocketTimeoutException: Read timed out) [duplicate 4]
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Starting task 15.2 in stage 2.0 (TID 125, dn5, executor 4, partition 15, NODE_LOCAL, 7778 bytes)
22/06/07 13:50:25 INFO scheduler.TaskSetManager: Lost task 22.1 in stage 2.0 (TID 88) on dn4, executor 1: com.vesoft.nebula.client.graph.exception.IOErrorException (java.net.SocketTimeoutException: Read timed out) [duplicate 5]
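For anyone triaging a similar driver log, here is a minimal sketch that tallies the timed-out tasks per Spark worker host, to check whether the timeouts are spread over all workers or concentrated on one. The function name `lost_tasks_per_host` and the sample list are illustrative assumptions; the regular expression only covers the two "Lost task" layouts visible in the log above.

```python
import re
from collections import Counter

# The two "Lost task" layouts that appear in the Spark driver log above:
#   Lost task 19.1 in stage 2.0 (TID 84, dn4, executor 1): ...
#   Lost task 14.1 in stage 2.0 (TID 82) on dn4, executor 1: ...
LOST_TASK = re.compile(
    r"Lost task [\d.]+ in stage [\d.]+ \(TID \d+"
    r"(?:, (?P<host_a>[^,]+), executor \d+\)"
    r"|\) on (?P<host_b>[^,]+), executor \d+)"
)


def lost_tasks_per_host(lines):
    """Count 'Read timed out' task losses per Spark worker host."""
    counts = Counter()
    for line in lines:
        match = LOST_TASK.search(line)
        if match and "Read timed out" in line:
            counts[match.group("host_a") or match.group("host_b")] += 1
    return counts


# A few lines copied from the log above; the "Starting task" line is ignored.
SAMPLE = [
    "22/06/07 13:50:25 WARN scheduler.TaskSetManager: Lost task 19.1 in stage 2.0 "
    "(TID 84, dn4, executor 1): com.vesoft.nebula.client.graph.exception.IOErrorException: "
    "java.net.SocketTimeoutException: Read timed out",
    "22/06/07 13:50:25 INFO scheduler.TaskSetManager: Starting task 19.2 in stage 2.0 "
    "(TID 121, dn4, executor 1, partition 19, NODE_LOCAL, 7778 bytes)",
    "22/06/07 13:50:25 INFO scheduler.TaskSetManager: Lost task 14.1 in stage 2.0 "
    "(TID 82) on dn4, executor 1: com.vesoft.nebula.client.graph.exception.IOErrorException "
    "(java.net.SocketTimeoutException: Read timed out) [duplicate 1]",
    "22/06/07 13:50:25 INFO scheduler.TaskSetManager: Lost task 29.1 in stage 2.0 "
    "(TID 83) on nd2, executor 3: com.vesoft.nebula.client.graph.exception.IOErrorException "
    "(java.net.SocketTimeoutException: Read timed out) [duplicate 2]",
]

for host, lost in lost_tasks_per_host(SAMPLE).most_common():
    print(host, lost)
```

In the excerpt above the losses occur on dn4, nd2 and dn5 alike, which points at the Nebula side rather than at a single Spark worker.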
After that, queries from the console stopped working as well; every query returns the following error.
(root@nebula) [trans]> match (v) return v limit 10
[ERROR (-1005)]: Storage Error: part: 22, error: E_LEADER_LEASE_FAILED(-3531).
Tue, 07 Jun 2022 22:01:46 CST
Logs:
E20220607 14:15:37.818251 108569 StorageAccessExecutor.h:39] AppendVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 27
E20220607 14:15:37.818378 108569 StorageAccessExecutor.h:39] AppendVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 26
E20220607 14:15:37.818416 108569 StorageAccessExecutor.h:39] AppendVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 36
E20220607 14:15:37.818450 108569 StorageAccessExecutor.h:136] Storage Error: part: 27, error: E_LEADER_LEASE_FAILED(-3531).
E20220607 14:15:37.818590 108557 QueryInstance.cpp:137] Storage Error: part: 27, error: E_LEADER_LEASE_FAILED(-3531).
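To see how many partitions are affected, the error lines can be reduced to a sorted list of part IDs. This is a minimal sketch; `lease_failed_parts` and the sample list are illustrative names, and the two patterns only match the message layouts shown in the excerpts above.

```python
import re

# The two layouts in which the output above reports a failed part:
#   ... AppendVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 27
#   ... Storage Error: part: 27, error: E_LEADER_LEASE_FAILED(-3531).
EXECUTOR_FORM = re.compile(r"error E_LEADER_LEASE_FAILED, part (\d+)")
SUMMARY_FORM = re.compile(r"part: (\d+), error: E_LEADER_LEASE_FAILED")


def lease_failed_parts(lines):
    """Return the sorted, de-duplicated part IDs that report E_LEADER_LEASE_FAILED."""
    parts = set()
    for line in lines:
        for pattern in (EXECUTOR_FORM, SUMMARY_FORM):
            match = pattern.search(line)
            if match:
                parts.add(int(match.group(1)))
    return sorted(parts)


# Lines copied from the console output and the error log above.
SAMPLE = [
    "[ERROR (-1005)]: Storage Error: part: 22, error: E_LEADER_LEASE_FAILED(-3531).",
    "E20220607 14:15:37.818251 108569 StorageAccessExecutor.h:39] "
    "AppendVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 27",
    "E20220607 14:15:37.818378 108569 StorageAccessExecutor.h:39] "
    "AppendVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 26",
    "E20220607 14:15:37.818416 108569 StorageAccessExecutor.h:39] "
    "AppendVerticesExecutor failed, error E_LEADER_LEASE_FAILED, part 36",
    "E20220607 14:15:37.818450 108569 StorageAccessExecutor.h:136] "
    "Storage Error: part: 27, error: E_LEADER_LEASE_FAILED(-3531).",
]

print(lease_failed_parts(SAMPLE))
```

A list that covers many different parts, as it does here, suggests the failure is not tied to a single partition.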
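I then ran BALANCE LEADER, but it always ends in FAILED. The graph database is currently completely unavailable, even though all processes are running normally.
Meanwhile the INFO log keeps printing "Load leader" messages, and the cluster has still not recovered.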
I20220607 14:35:49.519730 109114 MetaClient.cpp:3085] Load leader ok
I20220607 14:35:59.584862 109114 MetaClient.cpp:3079] Load leader of "10.100.2.243":9779 in 2 space
I20220607 14:35:59.584939 109114 MetaClient.cpp:3079] Load leader of "10.100.2.244":9779 in 2 space
I20220607 14:35:59.585009 109114 MetaClient.cpp:3079] Load leader of "10.100.2.245":9779 in 2 space
I20220607 14:35:59.585021 109114 MetaClient.cpp:3085] Load leader ok