[Production] storaged service fails to start, port 9779 keeps flapping, no obvious errors in the logs. Urgent help needed!

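  • nebula version: V3.2.0
  • Deployment: distributed
  • Installation method: RPM
  • Production environment: Y
  • Hardware: SSD

    • CPU and memory: 32 cores, 256 GB
  • Problem description
    The storaged service fails to start. Its log scrolls nonstop, the process uses a lot of resources, and data cannot be queried normally.
  • Relevant meta / storage / graph INFO logs (text preferred so they are searchable)
    meta.INFO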
I20221105 17:07:25.881327 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.203":9779, role = STORAGE
I20221105 17:07:27.209878 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.204":9669, role = GRAPH
I20221105 17:07:27.344532 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.203":9669, role = GRAPH
I20221105 17:07:27.422540 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.204":9779, role = STORAGE
I20221105 17:07:28.925192 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.202":9669, role = GRAPH
I20221105 17:07:29.435111 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.205":9669, role = GRAPH
I20221105 17:07:31.797423 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.206":9779, role = STORAGE
I20221105 17:07:32.872483 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.202":9779, role = STORAGE
I20221105 17:07:33.106631 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.205":9779, role = STORAGE
I20221105 17:07:34.733525 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.206":9669, role = GRAPH
I20221105 17:07:35.892315 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.203":9779, role = STORAGE
I20221105 17:07:37.220553 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.204":9669, role = GRAPH
I20221105 17:07:37.345732 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.203":9669, role = GRAPH
I20221105 17:07:37.427868 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.204":9779, role = STORAGE
I20221105 17:07:38.928171 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.202":9669, role = GRAPH
I20221105 17:07:39.446223 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.205":9669, role = GRAPH
I20221105 17:07:41.798938 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.206":9779, role = STORAGE
I20221105 17:07:42.873999 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.202":9779, role = STORAGE
I20221105 17:07:43.117722 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.205":9779, role = STORAGE
I20221105 17:07:44.744496 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.206":9669, role = GRAPH
I20221105 17:07:45.893570 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.203":9779, role = STORAGE
I20221105 17:07:47.231570 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.204":9669, role = GRAPH
I20221105 17:07:47.356362 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.203":9669, role = GRAPH
I20221105 17:07:47.433125 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.204":9779, role = STORAGE
I20221105 17:07:48.929932 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.202":9669, role = GRAPH
I20221105 17:07:49.448105 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.205":9669, role = GRAPH
I20221105 17:07:51.809860 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.206":9779, role = STORAGE
I20221105 17:07:52.875859 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.202":9779, role = STORAGE
I20221105 17:07:53.121052 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.205":9779, role = STORAGE
I20221105 17:07:54.755455 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.206":9669, role = GRAPH
I20221105 17:07:55.904532 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.203":9779, role = STORAGE
I20221105 17:07:57.237556 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.204":9669, role = GRAPH
I20221105 17:07:57.360690 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.203":9669, role = GRAPH
I20221105 17:07:57.443522 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.204":9779, role = STORAGE
I20221105 17:07:58.940050 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.202":9669, role = GRAPH
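
The meta.INFO excerpt above contains heartbeats from all five storaged endpoints (202 through 206 on port 9779), which suggests the processes were at least alive and reporting to metad around 17:07. As a rough aid for checking this on a longer log, here is a minimal Python sketch that records the last heartbeat time per endpoint and role. The function name `last_heartbeats` and the regular expression are illustrative assumptions based only on the line format shown here.

```python
import re
from collections import defaultdict

# Matches metad lines such as:
#   I20221105 17:07:25.881327 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.203":9779, role = STORAGE
HEARTBEAT = re.compile(
    r'^I\d{8} (\d{2}:\d{2}:\d{2})\.\d+ .*'
    r'Receive heartbeat from "([\d.]+)":(\d+), role = (\w+)'
)

def last_heartbeats(lines):
    """Return {role: {endpoint: last HH:MM:SS seen}} from meta.INFO lines."""
    seen = defaultdict(dict)
    for line in lines:
        m = HEARTBEAT.search(line)
        if m:
            ts, host, port, role = m.groups()
            # Later lines overwrite earlier ones, so the newest time wins.
            seen[role][f"{host}:{port}"] = ts
    return {role: dict(endpoints) for role, endpoints in seen.items()}

# Three lines copied from the excerpt above.
sample = [
    'I20221105 17:07:25.881327 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.203":9779, role = STORAGE',
    'I20221105 17:07:27.209878 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.204":9669, role = GRAPH',
    'I20221105 17:07:35.892315 11351 HBProcessor.cpp:33] Receive heartbeat from "172.17.126.203":9779, role = STORAGE',
]

for role, endpoints in last_heartbeats(sample).items():
    print(role, endpoints)
```

A storaged endpoint that is missing from the STORAGE group, or whose last heartbeat is old, would be the first one to look at.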

graph.INFO

E20221105 16:57:42.552389 12045 StorageAccessExecutor.h:136] Storage Error: part: 43, error: E_RPC_FAILURE(-3).
E20221105 16:57:42.552417 12045 QueryInstance.cpp:137] Storage Error: part: 43, error: E_RPC_FAILURE(-3).
E20221105 16:57:42.916901 12071 StorageClientBase-inl.h:206] Request to "172.17.126.203":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221105 16:57:42.916960 12046 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221105 16:57:42.917009 12046 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_RPC_FAILURE, part 43
E20221105 16:57:42.917023 12046 StorageAccessExecutor.h:136] Storage Error: part: 43, error: E_RPC_FAILURE(-3).
E20221105 16:57:42.917052 12046 QueryInstance.cpp:137] Storage Error: part: 43, error: E_RPC_FAILURE(-3).
E20221105 16:57:43.278789 12072 StorageClientBase-inl.h:206] Request to "172.17.126.206":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221105 16:57:43.278867 12044 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221105 16:57:43.278918 12047 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_RPC_FAILURE, part 43
E20221105 16:57:43.278936 12047 StorageAccessExecutor.h:136] Storage Error: part: 43, error: E_RPC_FAILURE(-3).
E20221105 16:57:43.278964 12044 QueryInstance.cpp:137] Storage Error: part: 43, error: E_RPC_FAILURE(-3).
I20221105 16:58:29.115856 12043 GraphService.cpp:76] Authenticating user root from 172.17.126.202:44862
I20221105 16:58:33.207428 12043 SwitchSpaceExecutor.cpp:37] Graph switched to `rmzk_data', space id: 120
E20221105 16:58:42.125770 12087 StorageClientBase-inl.h:206] Request to "172.17.126.203":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221105 16:58:42.125833 12044 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221105 16:58:42.125888 12046 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_RPC_FAILURE, part 43
E20221105 16:58:42.125921 12046 StorageAccessExecutor.h:136] Storage Error: part: 43, error: E_RPC_FAILURE(-3).
E20221105 16:58:42.125962 12044 QueryInstance.cpp:137] Storage Error: part: 43, error: E_RPC_FAILURE(-3).
I20221105 16:59:06.850670 12043 GraphService.cpp:76] Authenticating user root from 172.17.126.202:44882
I20221105 16:59:08.927286 12043 SwitchSpaceExecutor.cpp:37] Graph switched to `rmzk_data', space id: 120
E20221105 16:59:54.781452 12078 StorageClientBase-inl.h:206] Request to "172.17.126.206":9779 failed: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221105 16:59:54.781533 12046 StorageClientBase-inl.h:135] There some RPC errors: RPC failure in StorageClient: Failed to write to remote endpoint. Wrote 0 bytes. AsyncSocketException: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused)
E20221105 16:59:54.781581 12043 StorageAccessExecutor.h:39] GetNeighborsExecutor failed, error E_RPC_FAILURE, part 43
E20221105 16:59:54.781610 12043 StorageAccessExecutor.h:136] Storage Error: part: 43, error: E_RPC_FAILURE(-3).
E20221105 16:59:54.781641 12044 QueryInstance.cpp:137] Storage Error: part: 43, error: E_RPC_FAILURE(-3).
I20221105 17:04:56.599304 12075 GraphSessionManager.cpp:219] ClientSession 1667638755631405 has expired
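
In the graph.INFO excerpt above, the "Connection refused" errors point at 172.17.126.203:9779 and 172.17.126.206:9779. To see which storaged endpoints graphd could not reach across a whole log file, a small Python sketch like the following may help. The function name `count_refused` and the regular expression are assumptions derived from the line format shown here, not part of NebulaGraph.

```python
import re
from collections import Counter

# Matches the endpoint in graphd lines such as:
#   Request to "172.17.126.203":9779 failed: ... errno = 111 (Connection refused)
REFUSED = re.compile(r'Request to "([\d.]+)":(\d+) failed:.*Connection refused')

def count_refused(lines):
    """Count connection-refused errors per storaged endpoint."""
    counts = Counter()
    for line in lines:
        m = REFUSED.search(line)
        if m:
            counts[f"{m.group(1)}:{m.group(2)}"] += 1
    return counts

# Lines copied from the excerpt above; the first one is not a connect failure.
TAIL = (
    " failed: Failed to write to remote endpoint. Wrote 0 bytes. "
    "AsyncSocketException: AsyncSocketException: connect failed, "
    "type = Socket not open, errno = 111 (Connection refused)"
)
sample = [
    "E20221105 16:57:42.552389 12045 StorageAccessExecutor.h:136] Storage Error: part: 43, error: E_RPC_FAILURE(-3).",
    'E20221105 16:57:42.916901 12071 StorageClientBase-inl.h:206] Request to "172.17.126.203":9779' + TAIL,
    'E20221105 16:57:43.278789 12072 StorageClientBase-inl.h:206] Request to "172.17.126.206":9779' + TAIL,
    'E20221105 16:58:42.125770 12087 StorageClientBase-inl.h:206] Request to "172.17.126.203":9779' + TAIL,
    'E20221105 16:59:54.781452 12078 StorageClientBase-inl.h:206] Request to "172.17.126.206":9779' + TAIL,
]

for endpoint, n in sorted(count_refused(sample).items()):
    print(endpoint, n)
```

To run it on a real log, pass an open file object instead of `sample`, since the function only needs an iterable of lines.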

storaged.INFO

I20221105 17:09:41.743626 12393 NebulaSnapshotManager.cpp:67] Space 120 Part 44 start send snapshot of commitLogId 31671622 commitLogTerm 13, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.743930 12394 NebulaSnapshotManager.cpp:67] Space 120 Part 37 start send snapshot of commitLogId 31666870 commitLogTerm 15, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.744560 12395 NebulaSnapshotManager.cpp:67] Space 120 Part 9 start send snapshot of commitLogId 31671214 commitLogTerm 17, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.744801 12396 NebulaSnapshotManager.cpp:67] Space 120 Part 29 start send snapshot of commitLogId 31659323 commitLogTerm 14, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.744918 12393 NebulaSnapshotManager.cpp:67] Space 120 Part 44 start send snapshot of commitLogId 31671622 commitLogTerm 13, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.745200 12394 NebulaSnapshotManager.cpp:67] Space 120 Part 37 start send snapshot of commitLogId 31666870 commitLogTerm 15, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.745821 12395 NebulaSnapshotManager.cpp:67] Space 120 Part 9 start send snapshot of commitLogId 31671214 commitLogTerm 17, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.746040 12396 NebulaSnapshotManager.cpp:67] Space 120 Part 29 start send snapshot of commitLogId 31659323 commitLogTerm 14, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.746230 12393 NebulaSnapshotManager.cpp:67] Space 120 Part 44 start send snapshot of commitLogId 31671622 commitLogTerm 13, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.746451 12394 NebulaSnapshotManager.cpp:67] Space 120 Part 37 start send snapshot of commitLogId 31666870 commitLogTerm 15, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.747088 12395 NebulaSnapshotManager.cpp:67] Space 120 Part 9 start send snapshot of commitLogId 31671214 commitLogTerm 17, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.747303 12396 NebulaSnapshotManager.cpp:67] Space 120 Part 29 start send snapshot of commitLogId 31659323 commitLogTerm 14, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.747522 12393 NebulaSnapshotManager.cpp:67] Space 120 Part 44 start send snapshot of commitLogId 31671622 commitLogTerm 13, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.747690 12394 NebulaSnapshotManager.cpp:67] Space 120 Part 37 start send snapshot of commitLogId 31666870 commitLogTerm 15, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.748342 12395 NebulaSnapshotManager.cpp:67] Space 120 Part 9 start send snapshot of commitLogId 31671214 commitLogTerm 17, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.748872 12396 NebulaSnapshotManager.cpp:67] Space 120 Part 44 start send snapshot of commitLogId 31671622 commitLogTerm 13, rate limited to 10485760, batch size is 524288
I20221105 17:09:41.748912 12393 NebulaSnapshotManager.cpp:67] Space 120 Part 37 start send snapshot of commitLogId 31666870 commitLogTerm 15, rate limited to 10485760, batch size is 524288
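
In the storaged.INFO excerpt above, the same four parts (44, 37, 9 and 29 of space 120) log "start send snapshot" several times within roughly 5 ms, each time with an unchanged commitLogId. To quantify that kind of repetition in a full log, a minimal Python sketch could look like this. The function name `snapshot_starts` and the regular expression are assumptions based only on the line format shown here.

```python
import re
from collections import Counter

# Matches storaged lines such as:
#   Space 120 Part 44 start send snapshot of commitLogId 31671622 commitLogTerm 13, ...
SNAPSHOT = re.compile(
    r"Space (\d+) Part (\d+) start send snapshot of commitLogId (\d+)"
)

def snapshot_starts(lines):
    """Count 'start send snapshot' events per (space, part, commitLogId)."""
    counts = Counter()
    for line in lines:
        m = SNAPSHOT.search(line)
        if m:
            space, part, log_id = (int(g) for g in m.groups())
            counts[(space, part, log_id)] += 1
    return counts

# The first five lines of the excerpt above, with the common suffix factored out.
SUFFIX = ", rate limited to 10485760, batch size is 524288"
PREFIX = "NebulaSnapshotManager.cpp:67] Space 120 Part "
sample = [
    "I20221105 17:09:41.743626 12393 " + PREFIX + "44 start send snapshot of commitLogId 31671622 commitLogTerm 13" + SUFFIX,
    "I20221105 17:09:41.743930 12394 " + PREFIX + "37 start send snapshot of commitLogId 31666870 commitLogTerm 15" + SUFFIX,
    "I20221105 17:09:41.744560 12395 " + PREFIX + "9 start send snapshot of commitLogId 31671214 commitLogTerm 17" + SUFFIX,
    "I20221105 17:09:41.744801 12396 " + PREFIX + "29 start send snapshot of commitLogId 31659323 commitLogTerm 14" + SUFFIX,
    "I20221105 17:09:41.744918 12393 " + PREFIX + "44 start send snapshot of commitLogId 31671622 commitLogTerm 13" + SUFFIX,
]

for (space, part, log_id), n in snapshot_starts(sample).most_common():
    print(f"space {space} part {part} commitLogId {log_id}: {n} start(s)")
```

A part whose count keeps growing while its commitLogId never changes is restarting the same snapshot rather than making progress, which is worth mentioning when asking for help.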

Deployment:
172.17.126.202 metad:9559 graphd:9669 storaged:9779
172.17.126.203 metad:9559 graphd:9669 storaged:9779
172.17.126.204 metad:9559 graphd:9669 storaged:9779
172.17.126.205 graphd:9669 storaged:9779
172.17.126.206 graphd:9669 storaged:9779
All five storaged instances now fail to come up. I/O usage on the 202 server is fairly high; the other servers show almost none.
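
Since graphd reports "Connection refused" on port 9779, one quick check is whether each storaged port accepts TCP connections at all. The following is a minimal Python sketch using only the standard library. The host list and port are copied from the deployment list above, while the helper names `is_port_open` and `report` are made up for illustration.

```python
import socket

# Hosts and port taken from the deployment list above.
STORAGED_HOSTS = [
    "172.17.126.202",
    "172.17.126.203",
    "172.17.126.204",
    "172.17.126.205",
    "172.17.126.206",
]
STORAGED_PORT = 9779

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

def report(hosts=STORAGED_HOSTS, port=STORAGED_PORT):
    """Print one line per host saying whether the storaged port accepts connections."""
    for host in hosts:
        state = "open" if is_port_open(host, port) else "closed or unreachable"
        print(f"{host}:{port} {state}")
```

Calling `report()` from a machine inside the cluster network prints the state of each endpoint. A port that alternates between open and closed across runs would match the "flapping" described in the title.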

OOM is now occurring as well: the service gets killed during startup.


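Current situation: after storaged starts, the log keeps scrolling with snapshot-related messages, and the nodes seem to be copying data to one another. CPU and memory usage are also very high. After a while a large compaction (full) starts, and then the process runs out of memory and is killed by the system. What should I do in this situation?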

What exactly do the snapshot-related messages say? Turn on storage logging with trace_raft=1 and v=3, then restart and check the log output.
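
The two settings named in the reply above would normally go into the storaged configuration file before the restart. The following minimal Python sketch rewrites them in a config text, assuming the file uses one `--name=value` flag per line. The helper name `set_flags` and the sample config content are hypothetical; only the flag names and values come from the reply.

```python
def set_flags(conf_text, flags):
    """Return conf_text with each --name=value line set to the given value.

    Existing lines for a flag are rewritten in place, missing flags are
    appended at the end, and comment lines are left untouched.
    """
    seen = set()
    out = []
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("--") and "=" in stripped:
            name = stripped[2:].split("=", 1)[0]
            if name in flags:
                line = f"--{name}={flags[name]}"
                seen.add(name)
        out.append(line)
    # Append any requested flag that was not already present in the file.
    out.extend(f"--{n}={v}" for n, v in flags.items() if n not in seen)
    return "\n".join(out) + "\n"

# Hypothetical config fragment, not copied from a real installation.
sample_conf = "# Verbose log level\n--v=0\n--port=9779\n"

# Flag names and values as given in the reply above.
print(set_flags(sample_conf, {"v": "3", "trace_raft": "1"}))
```

Verbose logging at this level can grow the log quickly on a node that is already short on resources, so it may be worth reverting the flags once the relevant messages have been captured.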