Cluster deployment: system becomes abnormal after stopping one of the storaged processes

  • nebula version: v1.1.0

  • Deployment (distributed / standalone / Docker / DBaaS): distributed (172.16.2.62, 172.16.2.42)

  • Hardware

    • Disk (SSD / HDD): SSD
    • CPU and memory: 32 cores, 128 GB
  • How the affected space was created: CREATE SPACE IF NOT EXISTS ldbc_snb(PARTITION_NUM = 24, REPLICA_FACTOR = 2)

  • Problem description
    After stopping the storaged process on 172.16.2.62, queries fail:

(a@127.0.0.1:3699) [ldbc_snb]> go from 933 over knows;
[ERROR (-8)]: Get neighbors failed
Mon Oct 12 15:19:41 2020
=============
(a@127.0.0.1:3699) [ldbc_snb]> show parts
====================================================================================
| Partition ID | Leader | Peers                                | Losts             |
====================================================================================
| 1            |        | 172.16.2.62:44500, 172.16.2.42:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 2            |        | 172.16.2.42:44500, 172.16.2.62:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 3            |        | 172.16.2.62:44500, 172.16.2.42:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 4            |        | 172.16.2.42:44500, 172.16.2.62:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 5            |        | 172.16.2.62:44500, 172.16.2.42:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 6            |        | 172.16.2.42:44500, 172.16.2.62:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 7            |        | 172.16.2.62:44500, 172.16.2.42:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 8            |        | 172.16.2.42:44500, 172.16.2.62:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 9            |        | 172.16.2.62:44500, 172.16.2.42:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 10           |        | 172.16.2.42:44500, 172.16.2.62:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 11           |        | 172.16.2.62:44500, 172.16.2.42:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 12           |        | 172.16.2.42:44500, 172.16.2.62:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 13           |        | 172.16.2.62:44500, 172.16.2.42:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 14           |        | 172.16.2.42:44500, 172.16.2.62:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 15           |        | 172.16.2.62:44500, 172.16.2.42:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 16           |        | 172.16.2.42:44500, 172.16.2.62:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 17           |        | 172.16.2.62:44500, 172.16.2.42:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 18           |        | 172.16.2.42:44500, 172.16.2.62:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 19           |        | 172.16.2.62:44500, 172.16.2.42:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 20           |        | 172.16.2.42:44500, 172.16.2.62:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 21           |        | 172.16.2.62:44500, 172.16.2.42:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 22           |        | 172.16.2.42:44500, 172.16.2.62:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 23           |        | 172.16.2.62:44500, 172.16.2.42:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
| 24           |        | 172.16.2.42:44500, 172.16.2.62:44500 | 172.16.2.62:44500 |
------------------------------------------------------------------------------------
Got 24 rows (Time spent: 846/1549 us)


===========
(a@127.0.0.1:3699) [ldbc_snb]> show hosts
===============================================================================================
| Ip          | Port  | Status  | Leader count | Leader distribution | Partition distribution |
===============================================================================================
| 172.16.2.42 | 44500 | online  | 24           | ldbc_snb: 24        | ldbc_snb: 24           |
-----------------------------------------------------------------------------------------------
| 172.16.2.62 | 44500 | offline | 0            |                     | ldbc_snb: 24           |

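Take a look at the storaged log.

Why is the replica factor 2? It is usually an odd number.

Test resources are limited; I will add more nodes later.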


[root@graphdb_test-1 logs]# tail nebula-storaged.WARNING
W1012 16:20:58.504233 12804 RaftPart.cpp:953] [Port: 44501, Space: 37, Part: 10] Only 0 hosts succeeded, Need to try again
W1012 16:20:58.504696 12797 RaftPart.cpp:953] [Port: 44501, Space: 37, Part: 23] Only 0 hosts succeeded, Need to try again
W1012 16:20:58.504243 12774 RaftPart.cpp:953] [Port: 44501, Space: 37, Part: 14] Only 0 hosts succeeded, Need to try again
W1012 16:20:58.503921 12796 RaftPart.cpp:953] [Port: 44501, Space: 37, Part: 12] Only 0 hosts succeeded, Need to try again
W1012 16:20:58.504287 12803 RaftPart.cpp:953] [Port: 44501, Space: 37, Part: 7] Only 0 hosts succeeded, Need to try again
W1012 16:20:58.504415 12781 RaftPart.cpp:953] [Port: 44501, Space: 37, Part: 24] Only 0 hosts succeeded, Need to try again
W1012 16:20:58.504418 12800 RaftPart.cpp:953] [Port: 44501, Space: 37, Part: 2] Only 0 hosts succeeded, Need to try again
W1012 16:20:58.504478 12806 RaftPart.cpp:953] [Port: 44501, Space: 37, Part: 21] Only 0 hosts succeeded, Need to try again
W1012 16:20:58.504482 12795 RaftPart.cpp:953] [Port: 44501, Space: 37, Part: 1] Only 0 hosts succeeded, Need to try again
W1012 16:20:58.504510 12783 RaftPart.cpp:953] [Port: 44501, Space: 37, Part: 18] Only 0 hosts succeeded, Need to try again

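Are you sure this is v1.1.0? The source line numbers in this log do not match.

This looks like a quorum problem: with a two-node Raft group, both nodes must be online.

The majority of 2 is still 2... so there is no way to fail over.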
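
To make the majority arithmetic concrete, here is a minimal sketch in Python. It assumes the usual Raft rule that a quorum is a strict majority of the replicas; the function names are made up for illustration and are not part of Nebula Graph.

```python
def quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """How many replicas can be down while a majority is still reachable."""
    return replicas - quorum(replicas)


# Compare a few replica factors, including the REPLICA_FACTOR = 2 used here.
for n in (1, 2, 3, 5):
    print(f"replicas={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

With 2 replicas the quorum is 2, so losing either storaged leaves every partition without a majority. With 3 replicas the quorum is still 2, so one node can be lost.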

Got it, thanks.

While importing data with nebula-bench, storaged hit a fatal error, and now every GO statement fails:

(a@127.0.0.1:3699) [ldbc_snb]> GO 1 STEP FROM 933 OVER knows
[ERROR (-8)]: Get neighbors failed
Tue Oct 13 16:05:52 2020

===================================
storaged fatal error:

Log file created at: 2020/10/13 14:42:36
Running on machine: graphdb_test-1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
F1013 14F42:36.01 57367642:10157 489 141 4212590t06. 8R00] :[Port: 44501, Space: 69, Part: 8] ^@^@^@^@^@12607
F1013 14F42:36.01 57367642:10157 489 141 4212590t06. 8R00] :[Port: 44501, Space: 69, Part: 8] ^@^@^@^@^@12607
RaftPart.cpp:885ce: 69, Part: 20]
Fa] FPort: 44501, Space: 69, Part: 22] Failed to commit logs
1013 14:42:36.573490 F 10FaftPart.cpp1885^@] 101410^@^@ [Port: 44501, Space: 69, Part: 14] 4213iled to commit logs
3 130 :5734914.^@:^@10F21F 510607^@3613 10 :14542488^@^@1:3600 RaftPaRa361: 851 2[Port: ::501, Space: 69, Part: 12] ^@^@^@^@^@F] 88593 c36[Port 457348985a[Port: 44501[Port: 44501, Space: 69, Part: 2] ^@ F885ed to commit logs^@
^@^@ ] [Port: 44501, Space: 69, Part: 6] 12605 RaftPart.cpp:885] [Port: 44501, Space: 69, Part: 18] Failed to commit logs
F1013 14F42:36.01 57367642:10157 489 141 4212590t06. 8R00] :[Port: 44501, Space: 69, Part: 8] ^@^@^@^@^@12607
RaftPart.cpp:885ce: 69, Part: 20]
Fa] FPort: 44501, Space: 69, Part: 22] Failed to commit logs
1013 14:42:36.573490 F 10FaftPart.cpp1885^@] 101410^@^@ [Port: 44501, Space: 69, Part: 14] 4213iled to commit logs
3 130 :5734914.^@:^@10F21F 510607^@3613 10 :14542488^@^@1:3600 RaftPaRa361: 851 2[Port: ::501, Space: 69, Part: 12] ^@^@^@^@^@F] 88593 c36[Port 457348985a[Port: 44501[Port: 44501, Space: 69, Part: 2] ^@ F885ed to commit logs^@
^@^@ ] [Port: 44501, Space: 69, Part: 6] 12605 RaftPart.cpp:885] [Port: 44501, Space: 69, Part: 18] Failed to commit logs
F1013 14F42:36.01 57367642:10157 489 141 4212590t06. 8R00] :[Port: 44501, Space: 69, Part: 8] ^@^@^@^@^@12607
RaftPart.cpp:885ce: 69, Part: 20]
Fa] FPort: 44501, Space: 69, Part: 22] Failed to commit logs
1013 14:42:36.573490 F 10FaftPart.cpp1885^@] 101410^@^@ [Port: 44501, Space: 69, Part: 14] 4213iled to commit logs
3 130 :5734914.^@:^@10F21F 510607^@3613 10 :14542488^@^@1:3600 RaftPaRa361: 851 2[Port: ::501, Space: 69, Part: 12] ^@^@^@^@^@F] 88593 c36[Port 457348985a[Port: 44501[Port: 44501, Space: 69, Part: 2] ^@ F885ed to commit logs^@
^@^@ ] [Port: 44501, Space: 69, Part: 6] 12605 RaftPart.cpp:885] [Port: 44501, Space: 69, Part: 18] Failed to commit logs
F1013 14:42:36.573482 12597 RaftPart.cpp:885] [Port: 44501, Space: 69, Part: 10] Failed to commit logs
F1013 14F42:36.01 57367642:10157 489 141 4212590t06. 8R00] :[Port: 44501, Space: 69, Part: 8] ^@^@^@^@^@12607
RaftPart.cpp:885ce: 69, Part: 20]
Fa] FPort: 44501, Space: 69, Part: 22] Failed to commit logs
1013 14:42:36.573490 F 10FaftPart.cpp1885^@] 101410^@^@ [Port: 44501, Space: 69, Part: 14] 4213iled to commit logs
3 130 :5734914.^@:^@10F21F 510607^@3613 10 :14542488^@^@1:3600 RaftPaRa361: 851 2[Port: ::501, Space: 69, Part: 12] ^@^@^@^@^@F] 88593 c36[Port 457348985a[Port: 44501[Port: 44501, Space: 69, Part: 2] ^@ F885ed to commit logs^@
^@^@ ] [Port: 44501, Space: 69, Part: 6] 12605 RaftPart.cpp:885] [Port: 44501, Space: 69, Part: 18] Failed to commit logs
F1013 14F42:36.01 57367642:10157 489 141 4212590t06. 8R00] :[Port: 44501, Space: 69, Part: 8] ^@^@^@^@^@12607
RaftPart.cpp:885ce: 69, Part: 20]
Fa] FPort: 44501, Space: 69, Part: 22] Failed to commit logs
1013 14:42:36.573490 F 10FaftPart.cpp1885^@] 101410^@^@ [Port: 44501, Space: 69, Part: 14] 4213iled to commit logs
nebula-stor

==========
Client error:
2020/10/13 16:04:37 driver.go:142: Statement: INSERT EDGE has_member(time) VALUES 39582464812214 -> 24189256220946:(“2011-12-27T01:06:19.603+0000”), 39582464812214 -> 24189256282051:(“2011-11-20T17:21:19.463+0000”), 39582464812214 -> 26388279128627:(“2012-02-23T21:55:34.303+0000”), 39582464812214 -> 26388279234141:(“2012-03-11T12:19:55.032+0000”), 39582464812214 -> 26388279263728:(“2012-03-28T09:22:45.969+0000”), 39582464812214 -> 26388279273980:(“2012-03-18T11:23:56.342+0000”), 39582464812214 -> 26388279286648:(“2012-02-08T03:17:14.231+0000”), 39582464812214 -> 26388279294251:(“2012-03-07T23:28:33.071+0000”), 39582464812214 -> 26388279320035:(“2012-02-19T23:01:57.323+0000”), 39582464812214 -> 26388279329003:(“2012-01-25T19:09:06.557+0000”), 39582464812214 -> 26388279382711:(“2012-01-31T07:08:42.399+0000”), 39582464812214 -> 26388279386337:(“2012-01-27T18:18:52.770+0000”), 39582464812214 -> 26388279405538:(“2012-02-23T05:04:26.587+0000”), 39582464812214 -> 26388279416595:(“2012-02-19T11:21:35.509+0000”), 39582464812214 -> 26388279528576:(“2012-02-10T15:02:09.357+0000”), 39582464812214 -> 26388279539413:(“2012-01-19T08:15:40.493+0000”), 39582464812214 -> 26388279542075:(“2012-03-07T21:35:14.809+0000”), 39582464812214 -> 26388279550465:(“2012-03-16T15:02:49.522+0000”), 39582464812214 -> 26388279556833:(“2012-02-16T18:02:24.166+0000”), 39582464812214 -> 28587302337528:(“2012-03-27T17:33:39.990+0000”), 39582464812214 -> 28587302357858:(“2012-04-16T17:50:08.616+0000”), 39582464812214 -> 28587302384355:(“2012-04-01T09:19:01.806+0000”), 39582464812214 -> 28587302392469:(“2012-05-02T17:46:38.959+0000”), 39582464812214 -> 28587302403668:(“2012-03-29T19:41:37.057+0000”), 39582464812214 -> 28587302490648:(“2012-04-08T22:01:43.994+0000”), 39582464812214 -> 28587302513103:(“2012-04-18T11:42:51.223+0000”), 39582464812214 -> 28587302550257:(“2012-04-18T08:07:39.999+0000”), 39582464812214 -> 28587302570169:(“2012-04-25T10:55:06.309+0000”), 39582464812214 -> 
28587302672289:(“2012-04-01T01:54:08.745+0000”), 39582464812214 -> 28587302693632:(“2012-05-04T06:15:53.867+0000”), 39582464812214 -> 28587302700828:(“2012-04-07T07:40:19.415+0000”), 39582464812214 -> 28587302729029:(“2012-05-01T04:00:10.232+0000”), 39582464812214 -> 28587302730481:(“2012-05-29T05:08:24.575+0000”), 39582464812214 -> 28587302775594:(“2012-05-19T08:28:39.651+0000”), 39582464812214 -> 28587302775992:(“2012-04-07T14:10:16.988+0000”), 39582464812214 -> 28587302781726:(“2012-04-16T13:57:10.321+0000”), 39582464812214 -> 30786325606690:(“2012-07-04T21:11:52.492+0000”), 39582464812214 -> 30786325669866:(“2012-06-05T04:06:25.603+0000”), 39582464812214 -> 30786325740461:(“2012-07-12T19:03:22.504+0000”), 39582464812214 -> 30786325798472:(“2012-07-17T06:34:17.684+0000”), 39582464812214 -> 30786325808985:(“2012-06-15T18:03:13.598+0000”), 39582464812214 -> 30786325825271:(“2012-06-20T15:09:17.184+0000”), 39582464812214 -> 30786325855854:(“2012-05-29T08:53:21.003+0000”), 39582464812214 -> 30786325878757:(“2012-07-04T10:51:10.530+0000”), 39582464812214 -> 30786325911758:(“2012-06-17T09:07:36.801+0000”), 39582464812214 -> 30786325932296:(“2012-06-29T07:18:54.825+0000”), 39582464812214 -> 30786325937927:(“2012-06-09T16:32:46.618+0000”), 39582464812214 -> 30786326013417:(“2012-06-28T19:49:32.585+0000”), ErrorCode: E_EXECUTION_ERROR, ErrorMsg: Insert edge `has_member’ not complete, completeness: 0

show parts
show hosts
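Check whether any machine is unhealthy or whether a leader re-election has happened. If there was a re-election, the request needs to be retried.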
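
A retry after a re-election could look roughly like the sketch below. It is only an illustration: `execute` stands for whatever function your client uses to send a statement, and `LeaderChangedError` is a hypothetical exception type, not something provided by the Nebula client libraries.

```python
import time


class LeaderChangedError(Exception):
    """Hypothetical error raised by a client wrapper for a retryable failure."""


def run_with_retry(execute, statement, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call execute(statement), retrying with exponential backoff on LeaderChangedError."""
    last_error = None
    for attempt in range(attempts):
        try:
            return execute(statement)
        except LeaderChangedError as err:
            last_error = err
            # Back off before the next try, but not after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    raise last_error
```

Retrying only helps when a new leader can actually be elected; if the partition has lost its majority, every attempt will keep failing.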

The write to RocksDB failed. You need to check the environment; the most common cause is a full disk.
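
As a quick way to check the disk, here is a small sketch using Python's standard library. The path and the 90% threshold are assumptions; point it at the directory where your storaged actually keeps its data.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def is_nearly_full(path: str, threshold_percent: float = 90.0) -> bool:
    """True when usage is at or above the given threshold."""
    return disk_usage_percent(path) >= threshold_percent


# Assumption: "." is a placeholder; replace it with the storaged data directory.
data_path = "."
print(f"{data_path}: {disk_usage_percent(data_path):.1f}% used, "
      f"nearly full: {is_nearly_full(data_path)}")
```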