nebula 日志报错

使用nebula过程中报错
0220505 10:33:11.313828 130285 StorageClientBase-inl.h:278] Request to “IP”:9779 time out: TTransportException: Timed out
E20220505 10:33:11.313988 130146 UpdateExecutor.cpp:70] RPC failure in StorageClient: TTransportException: Timed out
E20220505 10:33:11.314059 130148 QueryInstance.cpp:131] RPC failure in StorageClient: TTransportException: Timed out
重启storage后恢复
出现报错
Storage Error: Invalid data, may be wrong value type
过了大概5分钟 该报错消失

想咨询下第一个报错以及第二个报错出现的原因

来,补充下版本号,:thinking: 第一个报错也许是数据过于频繁插入(你是不是那时候在导入数据)

3.0.2

单独跑一条upsert也报错 重启storage后恢复

机器配置和 storage 的日志贴一下,我让存储的同学看看。

SSD 2T盘 内存256 64核cpu storage日志没报错 但是 graph 日志里面有Request to “IP”:9779 time out: TTransportException: Timed out 连接存储超时的报警

Timeout可能是因为网络原因,存储服务意外重启等?这个错误照理来说只要网络正常会自动恢复的。
Storage Error: Invalid data, may be wrong value type是因为插入的数据跟schema不匹配。

我检查日志,从三号开始就有报错了,由于业务需求,我们一直在导入数据,报错信息如下,直到5号重启storage后恢复,期间执行过show host storage 检查,应该没问题的。我们应该如何避免出现类似的情况,请指导下

E20220503 11:39:54.293435 130247 StorageClientBase-inl.h:278] Request to "1.77":9779 time out: TTransportException: Timed out
E20220503 11:39:54.293614 130148 UpdateExecutor.cpp:70] RPC failure in StorageClient: TTransportException: Timed out
E20220503 11:39:54.293701 130155 QueryInstance.cpp:131] RPC failure in StorageClient: TTransportException: Timed out
E20220503 11:40:12.755578 130265 StorageClientBase-inl.h:183] Request to "1.78":9779 time out: TTransportException: Timed out
E20220503 11:40:12.755803 130163 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 4
E20220503 11:40:12.755861 130163 StorageAccessExecutor.h:138] Storage Error: part: 4, error: E_RPC_FAILURE(-3).
E20220503 11:40:12.755980 130155 QueryInstance.cpp:131] Storage Error: part: 4, error: E_RPC_FAILURE(-3).
E20220503 11:40:16.271100 130275 StorageClientBase-inl.h:183] Request to "1.77":9779 time out: TTransportException: Timed out
E20220503 11:40:16.271265 130155 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 7
E20220503 11:40:16.271323 130155 StorageAccessExecutor.h:138] Storage Error: part: 7, error: E_RPC_FAILURE(-3).
E20220503 11:40:16.271399 130155 QueryInstance.cpp:131] Storage Error: part: 7, error: E_RPC_FAILURE(-3).

问一下,这个timeout error是从3号到5号一直存在的吗?另外问一下你们graph.conf里storage_client_timeout_ms设置成了多少?

storage的日志能否也给一下,想看一下这段时间storage在做什么。你们所导入数据的value一般有多大?如果一直在导入数据,有可能storage在忙于compaction,导致延迟过高。

1 个赞

–storage_client_timeout_ms=60000 导入数据大概500条每秒 storage没有报错

E20220507 10:39:06.329780 183848 IndexScanRule.cpp:436] No valid index found
E20220507 10:39:15.727277 183891 GraphSessionManager.cpp:222] Update sessions failed: Session not existed!
E20220507 10:39:15.727313 183887 GraphSessionManager.cpp:246] Update sessions failed: Update sessions failed: Session not existed!
E20220507 10:40:11.342757 183881 StorageClientBase-inl.h:183] Request to "10.79":9779 time out: TTransportException: Timed out
E20220507 10:40:11.557068 183880 StorageClientBase-inl.h:183] Request to "10.77":9779 time out: TTransportException: Timed out
E20220507 10:40:11.743132 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 14
E20220507 10:40:11.743202 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 13
E20220507 10:40:11.743219 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 8
E20220507 10:40:11.743234 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 1
E20220507 10:40:11.743264 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 6
E20220507 10:40:11.743279 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 15
E20220507 10:40:11.743292 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 12
E20220507 10:40:11.743306 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 11
E20220507 10:40:11.743319 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 10
E20220507 10:40:11.743333 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 9
E20220507 10:40:11.743346 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 3
E20220507 10:40:11.743360 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 2
E20220507 10:40:11.743386 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 7
E20220507 10:40:11.743404 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 4
E20220507 10:40:11.743417 183858 StorageAccessExecutor.h:41] AppendVerticesExecutor failed, error E_RPC_FAILURE, part 5
E20220507 10:40:11.743435 183858 StorageAccessExecutor.h:138] Storage Error: part: 14, error: E_RPC_FAILURE(-3).
E20220507 10:40:11.743494 183848 QueryInstance.cpp:131] Storage Error: part: 14, error: E_RPC_FAILURE(-3).

现在又一直再报这个

能否导出一下现在storaged的日志呢?我们想知道storage现在在做啥?最好设置storaged.conf中
-v=2 (甚至更高,比如3或者4)。然后你tail -f xxxxx/nebula-storaged.INFO。我们想知道这时候storage端在进行的操作是什么。另外,可否确认一下当前storaged的状态:scripts/nebula.service status all?

E20220505 03:59:59.171658 130379 UpdateNode.h:476] edge conflict 7:11:Ncdxm_210506077495528:40:0:Nzrr_610329198709160618
E20220505 04:02:01.071924 130393 UpdateNode.h:476] edge conflict 7:13:Ncdxm_201106077387772:40:0:Nzrr_320922197203183331
E20220505 04:02:13.927969 130390 UpdateNode.h:183] vertex conflict 7:11:9:Nzrr_320922197203183331
E20220505 04:02:13.942790 130390 UpdateNode.h:183] vertex conflict 7:11:17:Nzrr_320922197203183331
E20220505 04:02:13.951030 130390 UpdateNode.h:183] vertex conflict 7:11:32:Nzrr_320922197203183331
E20220505 04:02:13.959132 130390 UpdateNode.h:183] vertex conflict 7:11:31:Nzrr_320922197203183331
E20220505 04:02:13.991355 130390 UpdateNode.h:476] edge conflict 7:13:Ncdxm_201106077387772:40:0:Nzrr_320922197203183331
E20220505 04:02:14.007308 130390 UpdateNode.h:476] edge conflict 7:13:Ncdxm_201106077387772:40:0:Nzrr_320922197203183331
E20220505 04:02:19.256301 130390 UpdateNode.h:183] vertex conflict 7:2:9:Nzrr_13240119750911167X
E20220505 04:02:55.494436 130372 UpdateNode.h:183] vertex conflict 7:11:9:Nzrr_320922197203183331
E20220505 04:02:55.509428 130372 UpdateNode.h:183] vertex conflict 7:11:17:Nzrr_320922197203183331
E20220505 04:02:59.398957 130372 UpdateNode.h:183] vertex conflict 7:11:32:Nzrr_320922197203183331
E20220505 04:02:59.974485 130372 UpdateNode.h:183] vertex conflict 7:11:31:Nzrr_320922197203183331
E20220505 04:03:00.006119 130372 UpdateNode.h:476] edge conflict 7:13:Ncdxm_201106077387772:40:0:Nzrr_320922197203183331
E20220505 04:03:00.024173 130372 UpdateNode.h:476] edge conflict 7:13:Ncdxm_201106077387772:40:0:Nzrr_320922197203183331
E20220505 04:03:00.068651 130372 UpdateNode.h:183] vertex conflict 7:2:9:Nzrr_13240119750911167X
E20220505 04:03:00.085866 130372 UpdateNode.h:183] vertex conflict 7:2:17:Nzrr_13240119750911167X
E20220505 04:03:01.723855 130385 UpdateNode.h:183] vertex conflict 7:2:17:Nzrr_13240119750911167X
E20220505 04:03:01.731515 130385 UpdateNode.h:183] vertex conflict 7:2:32:Nzrr_13240119750911167X
E20220505 04:10:36.975025 130367 UpdateNode.h:183] vertex conflict 7:2:32:Nzrr_13240119750911167X
E20220505 04:10:36.975108 130367 UpdateNode.h:183] vertex conflict 7:2:31:Nzrr_13240119750911167X
E20220505 04:10:36.975136 130377 UpdateNode.h:183] vertex conflict 7:2:31:Nzrr_13240119750911167X
E20220505 04:19:00.967234 130381 UpdateNode.h:476] edge conflict 7:2:Ncdxm_220328077649251:40:0:Nzrr_450981198709223277
E20220505 04:19:19.546137 130393 UpdateNode.h:476] edge conflict 7:2:Ncdxm_220328077649251:40:0:Nzrr_450981198709223277
E20220505 04:25:08.571955 130383 UpdateNode.h:476] edge conflict 7:12:Ncdxm_200924077363222:40:0:Nzrr_510902196501145312
E20220505 04:30:54.483660 130360 UpdateNode.h:476] edge conflict 7:11:Nzrr_372524198003270015:37:0:Ndh_15269993527
E20220505 06:07:39.849887 130368 UpdateNode.h:183] vertex conflict 7:13:25:Ndh_18865295966
E20220505 06:07:39.853845 130371 UpdateNode.h:183] vertex conflict 7:2:33:Ncdxm_220328077649234
E20220505 06:07:40.925901 130392 UpdateNode.h:183] vertex conflict 7:2:33:Ncdxm_220328077649234
E20220505 06:07:40.973513 130358 UpdateNode.h:183] vertex conflict 7:2:12:Ncdxm_220328077649234
E20220505 06:07:41.054414 130391 UpdateNode.h:183] vertex conflict 7:2:12:Ncdxm_220328077649234
E20220505 06:07:41.267524 130389 UpdateNode.h:183] vertex conflict 7:11:9:Nzrr_371202197207202110
E20220505 06:07:41.925748 130389 UpdateNode.h:183] vertex conflict 7:13:25:Ndh_13555491999
E20220505 06:07:43.132889 130363 UpdateNode.h:183] vertex conflict 7:12:9:Nzrr_232302198103021331
E20220505 06:07:43.479748 130365 UpdateNode.h:183] vertex conflict 7:10:25:Ndh_13727086454
E20220505 06:07:47.560051 130367 UpdateNode.h:183] vertex conflict 7:12:9:Nzrr_372926197709163932
E20220505 06:07:49.405622 130360 UpdateNode.h:183] vertex conflict 7:2:25:Ndh_15615599266
E20220505 06:07:52.215273 130376 UpdateNode.h:183] vertex conflict 7:2:25:Ndh_18045575522
E20220505 06:07:54.643631 130373 UpdateNode.h:183] vertex conflict 7:10:25:Ndh_15664401788
E20220505 06:07:56.358098 130383 UpdateNode.h:183] vertex conflict 7:6:9:Nzrr_620502198201273630
E20220505 06:07:57.826728 130372 UpdateNode.h:183] vertex conflict 7:13:25:Ndh_13355056656
E20220505 06:08:00.380738 130393 UpdateNode.h:183] vertex conflict 7:6:9:Nzrr_370124197112154015

以上 storge info日志提示信息

W20220503 02:48:12.826232 130220 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 2] Only 0 hosts succeeded, Need to try again
W20220503 02:48:43.855551 130214 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 12] Only 0 hosts succeeded, Need to try again
W20220503 02:49:09.025434 130211 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 11] Only 0 hosts succeeded, Need to try again
W20220503 02:49:42.331527 130220 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 2] Only 0 hosts succeeded, Need to try again
W20220503 02:50:10.321497 130202 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 11] Only 0 hosts succeeded, Need to try again
W20220503 02:51:04.971928 130200 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 13] Only 0 hosts succeeded, Need to try again
W20220503 02:51:29.052318 130212 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 8] Only 0 hosts succeeded, Need to try again
W20220503 02:52:00.326612 130216 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 7] Only 0 hosts succeeded, Need to try again
W20220503 02:52:35.933336 130216 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 9] Only 0 hosts succeeded, Need to try again
W20220503 02:53:03.739686 130217 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 6] Only 0 hosts succeeded, Need to try again
W20220503 02:53:32.510792 130214 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 12] Only 0 hosts succeeded, Need to try again
W20220503 02:54:15.732470 130218 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 8] Only 0 hosts succeeded, Need to try again
W20220503 02:54:53.486709 130206 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 9] Only 0 hosts succeeded, Need to try again
W20220503 02:55:17.375309 130212 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 11] Only 0 hosts succeeded, Need to try again
W20220503 02:55:53.128180 130204 RaftPart.cpp:1001] [Port: 9780, Space: 7, Part: 8] Only 0 hosts succeeded, Need to try again

以上 是warning提示信息

[INFO] nebula-metad(4805f8c): Running as 129811, Listening on 9559
[INFO] nebula-graphd(4805f8c): Running as 183777, Listening on 9669
[INFO] nebula-storaged(4805f8c): Running as 184420, Listening on 9779 这个状态没问题的

Storage.INFO中最后一条信息是6:08,但是Graphd.INFO中当前出错时间是10:40。两边时间是同步的吧?如果是这样的话,说明storage端没有任何日志?那是否有可能storaged根本没有收到请求?你可以tail -f storaged.INFO确保没有新的日志输出。如果可以的话,最好打开-v=4,然后重启storaged,再tail -f storaged.INFO,看有什么有意思的信息。

上面的info 和warning是3到6号出现的,先看下这个出现的问题是啥吧,graph报错就是最上面的信息

目前你以上graphd的报错还是timeout错误。可能的猜想有:
(1) storaged不响应,服务suspend
(2) storaged相应太慢,可能因为a) 磁盘load太大, b) 机器其他资源不够。
以上这两条需要你通过实时打印更详细的storaged日志才能确定。需要确定a) storaged是否有实时日志产生, b)如果有,能否看一下日志内容。

至于UpdateNode.h:183] vertex conflict 7:10:25:Ndh_13727086454,如果两台机器时间戳同步的话,storaged日志目前最新的时间是6:08,比当前graphd显示的timeout错误的时间10:40早太多,对不上。因此暂且认为跟timeout错误不相关。另外,这个错误的原因是同时更新同一个vertex key。

时间问题先忽略 前后是两个报错,服务suspend 这个原因大概有什么

(1)网络不通。
(2)storaged out of memory,最好查看一下进程的内存占用情况。
(3) 磁盘响应太慢,比如磁盘IO太大。
(4) contention。有大量其他对于storaged的查询。
(5)其他physical server自己的原因导致响应慢。

如果有用dashboard的话,很多指标会更加一目了然。

2 个赞