LDBC-SNB complex query fails with: Storage Error: Not the leader of xx. Please retry later

  • NebulaGraph version: 3.4.0
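  • Deployment: distributed, k8s
  • Installation method: NebulaGraph Operator
  • In production: N
  • Hardware
    • SSD
    • 32C 64G * 3, one 2 TB SSD per machine
  • Service layout
    Machine A: 1 graph service, 1 meta service, 1 storage service
    Machine B: 1 graph service, 1 meta service, 1 storage service
    Machine C: 1 graph service, 1 meta service, 1 storage service
  • Space
    CREATE SPACE IF NOT EXISTS mytest (partition_num=60,replica_factor=3,vid_type=INT);
  • Data size
    LDBC SNB sf100 data: 280 million vertices and 1.8 billion edges
    Imported with nebula-importer, following nebula-bench
  • Problem description
    Running the following nGQL statement returns: Storage Error: Not the leader of xx. Please retry later. Retrying the request a few times sometimes succeeds.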
MATCH (root:Person)-[:KNOWS*1..2]-(friend:Person) \
WHERE id(root)==21990232978350 and friend <> root \
WITH collect(distinct friend) as friends \
UNWIND friends as friend \
    MATCH (friend)<-[:HAS_CREATOR]-(message) \
    WHERE coalesce(message.`Comment`.creationDate,message.Post.creationDate) < datetime(timestamp(1324080000)) \
RETURN \
    id(friend) AS personId, \
    friend.Person.firstName AS personFirstName, \
    friend.Person.lastName AS personLastName, \
    id(message) AS commentOrPostId, \
    coalesce(message.`Comment`.content,message.Post.content,message.Post.imageFile) AS commentOrPostContent, \
    timestamp(coalesce(message.`Comment`.creationDate,message.Post.creationDate)) AS commentOrPostCreationDate \
ORDER BY \
    commentOrPostCreationDate DESC, \
    commentOrPostId ASC \
LIMIT 20
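
Since the error text itself says to retry and repeated requests sometimes succeed, a client-side retry with backoff is one way to ride out a leader change. The following is a minimal Python sketch, not part of the original report: `run_with_leader_retry` and the `execute` adapter are hypothetical names, and `execute` is assumed to wrap whatever client call is in use and return `(succeeded, error_message)`.

```python
import time
from typing import Callable, Tuple

# Substring taken from the error text quoted in this report.
LEADER_ERROR = "Not the leader of"


def run_with_leader_retry(
    execute: Callable[[str], Tuple[bool, str]],
    query: str,
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, str, int]:
    """Retry `query` while the error looks like a leader change.

    `execute` is a hypothetical adapter around the real client; it returns
    (succeeded, error_message). Returns (succeeded, last_error, tries).
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    error = ""
    for attempt in range(1, attempts + 1):
        ok, error = execute(query)
        if ok:
            return True, "", attempt
        if LEADER_ERROR not in error:
            # Some other failure: retrying is unlikely to help.
            return False, error, attempt
        if attempt < attempts:
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return False, error, attempts
```

A wrapper like this only hides short-lived leader changes. A persistent error such as the one described here still needs its root cause fixed.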

storaged log

I20230310 04:22:53.857503    43 AdminProcessor.h:44] Receive transfer leader for space 10, part 34, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.857626    22 AdminProcessor.h:44] Receive transfer leader for space 10, part 23, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.857734    26 AdminProcessor.h:44] Receive transfer leader for space 10, part 44, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.857762    46 AdminProcessor.h:44] Receive transfer leader for space 10, part 49, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.857942    32 AdminProcessor.h:44] Receive transfer leader for space 10, part 19, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.857964    34 AdminProcessor.h:44] Receive transfer leader for space 10, part 56, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.858060    28 AdminProcessor.h:44] Receive transfer leader for space 10, part 5, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.858078    35 AdminProcessor.h:44] Receive transfer leader for space 10, part 33, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.858247    42 AdminProcessor.h:44] Receive transfer leader for space 10, part 43, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.858278    38 AdminProcessor.h:44] Receive transfer leader for space 10, part 48, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.858278    19 AdminProcessor.h:44] Receive transfer leader for space 10, part 3, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.858299    45 AdminProcessor.h:44] Receive transfer leader for space 10, part 9, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.858328    47 AdminProcessor.h:44] Receive transfer leader for space 10, part 45, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.858350    36 AdminProcessor.h:44] Receive transfer leader for space 10, part 26, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.858392    39 AdminProcessor.h:44] Receive transfer leader for space 10, part 21, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.858507    24 AdminProcessor.h:44] Receive transfer leader for space 10, part 39, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
I20230310 04:22:53.858566    44 AdminProcessor.h:44] Receive transfer leader for space 10, part 1, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]
E20230310 04:22:53.861686    46 Serializer.h:43] Thrift serialization is only defined for structs and unions, not containers thereof. Attemping to serialize a value of type `nebula::HostAddr`.
I20230310 04:22:53.867838   230 AdminProcessor.h:115] Can't find leader for space 10 part 5 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.868073   231 AdminProcessor.h:115] Can't find leader for space 10 part 34 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.868309   232 AdminProcessor.h:115] Can't find leader for space 10 part 56 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.868633   233 AdminProcessor.h:115] Can't find leader for space 10 part 19 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.868933   234 AdminProcessor.h:115] Can't find leader for space 10 part 45 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.869252   235 AdminProcessor.h:115] Can't find leader for space 10 part 23 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.869545   236 AdminProcessor.h:115] Can't find leader for space 10 part 39 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.869768   237 AdminProcessor.h:115] Can't find leader for space 10 part 3 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.870025   238 AdminProcessor.h:115] Can't find leader for space 10 part 33 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.870244   239 AdminProcessor.h:115] Can't find leader for space 10 part 21 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.870512   240 AdminProcessor.h:115] Can't find leader for space 10 part 49 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.870765   241 AdminProcessor.h:115] Can't find leader for space 10 part 1 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.871013   242 AdminProcessor.h:115] Can't find leader for space 10 part 48 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.871284   243 AdminProcessor.h:115] Can't find leader for space 10 part 9 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.871695   244 AdminProcessor.h:115] Can't find leader for space 10 part 44 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.872277   245 AdminProcessor.h:104] Found new leader of space 10 part 43: "nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:53.872334   246 AdminProcessor.h:115] Can't find leader for space 10 part 26 on "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:58.867951   230 AdminProcessor.h:104] Found new leader of space 10 part 5: "nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:58.868182   231 AdminProcessor.h:104] Found new leader of space 10 part 34: "nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:58.868403   232 AdminProcessor.h:104] Found new leader of space 10 part 56: "nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:58.868718   233 AdminProcessor.h:104] Found new leader of space 10 part 19: "nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:58.869024   234 AdminProcessor.h:104] Found new leader of space 10 part 45: "nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:58.869343   235 AdminProcessor.h:104] Found new leader of space 10 part 23: "nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:58.869640   236 AdminProcessor.h:104] Found new leader of space 10 part 39: "nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:58.869879   237 AdminProcessor.h:104] Found new leader of space 10 part 3: "nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:58.870115   238 AdminProcessor.h:104] Found new leader of space 10 part 33: "nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:58.870332   239 AdminProcessor.h:104] Found new leader of space 10 part 21: "nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:58.870606   240 AdminProcessor.h:104] Found new leader of space 10 part 49: "nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:22:58.870851   241 AdminProcessor.h:104] Found new leader of space 10 part 1: "nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local":9779
I20230310 04:23:06.284849    73 MetaClient.cpp:3259] Load leader of "nebula-storaged-0.nebula-storaged-headless.nebula.svc.cluster.local":9779 in 1 space
I20230310 04:23:06.284891    73 MetaClient.cpp:3259] Load leader of "nebula-storaged-1.nebula-storaged-headless.nebula.svc.cluster.local":9779 in 1 space
I20230310 04:23:06.284910    73 MetaClient.cpp:3259] Load leader of "nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local":9779 in 1 space
I20230310 04:23:06.284915    73 MetaClient.cpp:3265] Load leader ok
E20230310 05:22:00.235684    66 ThriftClientManager-inl.h:70] Failed to resolve address for 'nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local': Name or service not known (error=-2): Unknown error -2
I20230310 05:22:01.252557    66 ThriftClientManager-inl.h:67] resolve "nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local":9559 as "192.168.0.186":9559
I20230310 05:22:02.251830    66 ThriftClientManager-inl.h:67] resolve "nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local":9559 as "192.168.52.208":9559
I20230310 05:22:04.254729    66 ThriftClientManager-inl.h:67] resolve "nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local":9559 as "192.168.52.208":9559
E20230310 05:22:05.253751    66 MetaClient.cpp:772] Send request to "nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local":9559, exceed retry limit
E20230310 05:22:05.253793    66 MetaClient.cpp:773] RpcResponse exception: apache::thrift::transport::TTransportException: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: write timed out during connection, type = Timed out
E20230310 05:22:05.253834    73 MetaClient.cpp:192] Heartbeat failed, status:RPC failure in MetaClient: apache::thrift::transport::TTransportException: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: write timed out during connection, type = Timed out
I20230310 05:22:15.264613    71 ThriftClientManager-inl.h:67] resolve "nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local":9559 as "192.168.52.208":9559
I20230310 05:22:17.271698    71 ThriftClientManager-inl.h:67] resolve "nebula-metad-2.nebula-metad-headless.nebula.svc.cluster.local":9559 as "192.168.244.30":9559
I20230310 05:22:18.274253    71 ThriftClientManager-inl.h:67] resolve "nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local":9559 as "192.168.52.208":9559
I20230310 05:22:20.272725    71 ThriftClientManager-inl.h:67] resolve "nebula-metad-1.nebula-metad-headless.nebula.svc.cluster.local":9559 as "192.168.0.186":9559
E20230310 05:22:20.273692    73 MetaClient.cpp:192] Heartbeat failed, status:Machine not existed!
I20230310 05:22:30.284572    72 ThriftClientManager-inl.h:67] resolve "nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local":9559 as "192.168.52.213":9559
I20230310 05:22:40.288447    65 ThriftClientManager-inl.h:67] resolve "nebula-metad-0.nebula-metad-headless.nebula.svc.cluster.local":9559 as "192.168.52.213":9559
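
To see how long each partition's leader transfer took, the `Receive transfer leader` and `Found new leader` lines in the excerpt above can be paired by space and part. This is a rough Python sketch that assumes the glog line layout shown here; the function name `transfer_latencies` is made up for illustration.

```python
import re
from datetime import datetime

# Patterns copied from the two AdminProcessor.h messages in the log above.
STAMP = r"^\s*[IWEF](\d{8} \d{2}:\d{2}:\d{2}\.\d{6})"
RECEIVE = re.compile(STAMP + r".*Receive transfer leader for space (\d+), part (\d+),")
FOUND = re.compile(STAMP + r".*Found new leader of space (\d+) part (\d+):")


def _when(stamp):
    # glog timestamps look like "20230310 04:22:53.857503".
    return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")


def transfer_latencies(lines):
    """Return ({(space, part): seconds}, pending) for leader transfers.

    `pending` lists the (space, part) pairs that were asked to transfer but
    have no "Found new leader" line in the given lines.
    """
    requested, latency = {}, {}
    for line in lines:
        m = RECEIVE.search(line)
        if m:
            requested[(int(m.group(2)), int(m.group(3)))] = _when(m.group(1))
            continue
        m = FOUND.search(line)
        if m:
            key = (int(m.group(2)), int(m.group(3)))
            if key in requested and key not in latency:
                delta = _when(m.group(1)) - requested[key]
                latency[key] = delta.total_seconds()
    pending = sorted(k for k in requested if k not in latency)
    return latency, pending
```

On the excerpt above, part 43 finds its new leader about 14 ms after the request and part 5 takes about 5 s. Parts 9, 26, 44 and 48 have no `Found new leader` line, which may only mean the pasted log is incomplete.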

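After making the following changes, the problem no longer occurs:
1. Deleted the graphService configmap that I had edited directly and switched to the one generated automatically by the operator
2. Added the timezone_name parameter to the services
3. Added the TZ environment variable to the services
4. Stopped and restarted the services by changing the statefulset replicas, and only sent requests once all of the cluster's workloads were available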

You ran balance data or balance leader or a similar command, didn't you?

No. If balance data or balance leader had been run, the error should only show up for a short time and then go away.
In my case, though, the error is persistent.

If you had never run balance data or balance leader, the log would not contain entries like this:

Receive transfer leader for space 10, part 26, to [nebula-storaged-2.nebula-storaged-headless.nebula.svc.cluster.local, 9779]

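Yes, I did run balance leader.
The error appeared both before and after running it.
Also, I'd like to study the source code. Is there a recommended way to get started? There is a lot of code and I don't know where to begin. I have only just learned what cmake and make are, and I have a C/Java background :dotted_line_face: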

Then balance leader probably took a fairly long time to finish.

Pick a part you are interested in and follow it through the unit tests.
