MATCH reports Storage Error: part: 20, error: E_RPC_FAILURE(-3).

  • nebula version: 2.0.1
  • Deployment: distributed
  • Production environment: (not specified)
  • Hardware
    • Disk: HDD
    • CPU; memory: 512 GB

With 1 billion+ vertices, both MATCH (v:mobile)--(v2:mobile) RETURN v, v2 LIMIT 5 and MATCH (v:mobile) RETURN count(v)
fail with an E_RPC_FAILURE(-3) error, and occasionally with a network error.
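
For clarity, the two failing statements. In 2.0.x a MATCH on a bare tag has no filter to push down, so both of these effectively scan the whole tag, which on a 1-billion-vertex graph means a lot of RPC traffic per partition:

MATCH (v:mobile)--(v2:mobile) RETURN v, v2 LIMIT 5;
MATCH (v:mobile) RETURN count(v);

If the second statement is only meant to count vertices, a stats job is much cheaper than a full scan (both statements exist in 2.0.x):

SUBMIT JOB STATS;
SHOW STATS;  // run this after the stats job finishes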

storage.conf has the following set:
--rebuild_index_batch_num=16
--storage_client_timeout_ms=300000

Take a look at this post: MATCH 执行失败 Storage Error
Also try searching the forum with your error message; there are several related threads. See if those solve your problem first, and if not, come back and let us know :handshake:

I changed it to --storage_client_timeout_ms=3000000 and also set --local_config=true in nebula-storaged.conf,
but the query stops well before that timeout is reached.
Checking graphd.ERROR, it still looks like a timeout error.
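
One way to confirm which timeout the running graphd actually uses is to query its gflags over the web-service port (a sketch; 19669 is the default graphd ws_http_port, and the /flags handler is part of the service's web interface):

curl -s http://127.0.0.1:19669/flags | grep storage_client_timeout_ms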

I just accidentally posted the reply to myself :joy:

Could you paste the logs?


The graphd.ERROR log (screenshot):


The graphd.INFO log (screenshot); it looks like the timeout is still the default 60s.

Could you paste the complete configuration?

--local_config=true
--rebuild_index_batch_num=16
--storage_client_timeout_ms=3000000
########## basics ##########
# Whether to run as a daemon process
--daemonize=true
# The file to host the process id
--pid_file=pids/nebula-storaged.pid

########## logging ##########
# The directory to host logging files
--log_dir=/data1/nebula/logs
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=0
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=0
# Maximum seconds to buffer the log messages
--logbufsecs=0
# Whether to redirect stdout and stderr to separate output files
--redirect_stdout=true
# Destination filename of stdout and stderr, which will also reside in log_dir.
--stdout_log_file=storaged-stdout.log
--stderr_log_file=storaged-stderr.log
# Copy log messages at or above this level to stderr in addition to logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are 0, 1, 2, and 3, respectively.
--stderrthreshold=2

########## networking ##########
# Comma separated Meta server addresses
--meta_server_addrs=10.142.158.75:9559,10.142.158.76:9559,10.142.158.77:9559,10.142.158.78:9559
# Local IP used to identify the nebula-storaged process.
# Change it to an address other than loopback if the service is distributed or
# will be accessed remotely.
--local_ip=10.142.158.77
# Storage daemon listening port
--port=9779
# HTTP service ip
--ws_ip=0.0.0.0
# HTTP service port
--ws_http_port=19779
# HTTP2 service port
--ws_h2_port=19780
# heartbeat with meta service
--heartbeat_interval_secs=10

######### Raft #########
# Raft election timeout
--raft_heartbeat_interval_secs=30
# RPC timeout for raft client (ms)
--raft_rpc_timeout_ms=500
## recycle Raft WAL
--wal_ttl=14400

########## Disk ##########
# Root data path. Split by comma. e.g. --data_path=/disk1/path1/,/disk2/path2/
# One path per Rocksdb instance.
--data_path=/data2/nebula/data/storage,/data3/nebula/data/storage,/data4/nebula/data/storage,/data5/nebula/data/storage,/data6/nebula/data/storage

# The default reserved bytes for one batch operation
--rocksdb_batch_size=4096
# The default block cache size used in BlockBasedTable.
# The unit is MB.
--rocksdb_block_cache=102400
# The type of storage engine, `rocksdb', `memory', etc.
--engine_type=rocksdb

# Compression algorithm, options: no,snappy,lz4,lz4hc,zlib,bzip2,zstd
# For the sake of binary compatibility, the default value is snappy.
# Recommend to use:
#   * lz4 to gain more CPU performance, with the same compression ratio with snappy
#   * zstd to occupy less disk space
#   * lz4hc for the read-heavy write-light scenario
--rocksdb_compression=lz4

# Set different compressions for different levels
# For example, if --rocksdb_compression is snappy,
# "no:no:lz4:lz4::zstd" is identical to "no:no:lz4:lz4:snappy:zstd:snappy"
# In order to disable compression for level 0/1, set it to "no:no"
--rocksdb_compression_per_level=

# Whether or not to enable rocksdb's statistics, disabled by default
--enable_rocksdb_statistics=false

# Statslevel used by rocksdb to collection statistics, optional values are
#   * kExceptHistogramOrTimers, disable timer stats, and skip histogram stats
#   * kExceptTimers, Skip timer stats
#   * kExceptDetailedTimers, Collect all stats except time inside mutex lock AND time spent on compression.
#   * kExceptTimeForMutex, Collect all stats except the counters requiring to get time inside the mutex lock.
#   * kAll, Collect all stats
--rocksdb_stats_level=kExceptHistogramOrTimers

# Whether or not to enable rocksdb's prefix bloom filter, disabled by default.
--enable_rocksdb_prefix_filtering=false
# Whether or not to enable the whole key filtering.
--enable_rocksdb_whole_key_filtering=true
# The prefix length for each key to use as the filter value.
# can be 12 bytes(PartitionId + VertexID), or 16 bytes(PartitionId + VertexID + TagID/EdgeType).
--rocksdb_filtering_prefix_length=12

############## rocksdb Options ##############
# rocksdb DBOptions in json, each name and value of option is a string, given as "option_name":"option_value" separated by comma
--rocksdb_db_options={}
# rocksdb ColumnFamilyOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_column_family_options={"write_buffer_size":"67108864","max_write_buffer_number":"4","max_bytes_for_level_base":"268435456"}
# rocksdb BlockBasedTableOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_block_based_table_options={"block_size":"8192"}

It's a cluster deployment with 4 nodes, and the settings were added on all of them.
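
As an aside, a minimal sketch for rolling a conf change out to all four nodes, assuming root SSH access and the default install path:

# hypothetical helper: copy the edited conf to each host, then restart that service
for h in 10.142.158.75 10.142.158.76 10.142.158.77 10.142.158.78; do
  scp /usr/local/nebula/etc/nebula-storaged.conf root@$h:/usr/local/nebula/etc/
  ssh root@$h "/usr/local/nebula/scripts/nebula.service restart storaged"
done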

Sorry, I only just noticed that you made the change in storaged.conf. --storage_client_timeout_ms is a client-side setting, so it needs to go into graphd.conf.
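
Concretely, with the values used earlier in this thread, the graphd side would look something like this (a sketch; keep --local_config=true so the local file takes precedence, and restart graphd afterwards for the flag to take effect):

# nebula-graphd.conf -- graphd is the client side of the storage RPC
--local_config=true
--storage_client_timeout_ms=3000000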


I had never noticed that either :joy: Thanks a lot, I'll give it a try.

Now it runs for a while and then reports a network error.

OK, so it now gets past the client timeout, right? Could you paste some more logs?

Did storaged crash?

storaged did not crash.

This is the graphd.INFO log:

Log file created at: 2021/07/02 14:22:56
Running on machine: A5-306-HW-2488HV5-2019-011
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0702 14:22:56.807940 69400 GraphDaemon.cpp:110] Starting Graph HTTP Service
I0702 14:22:56.818940 69407 WebService.cpp:131] Web service started on HTTP[19669], HTTP2[19670]
I0702 14:22:56.819001 69400 GraphDaemon.cpp:124] Number of networking IO threads: 112
I0702 14:22:56.819015 69400 GraphDaemon.cpp:133] Number of worker threads: 112
I0702 14:22:56.819888 69400 MetaClient.cpp:50] Create meta client to "10.142.158.76":9559
I0702 14:22:57.933521 69400 MetaClient.cpp:99] Register time task for heartbeat!
I0702 14:22:58.031224 69400 GraphDaemon.cpp:164] Starting nebula-graphd on 0.0.0.0:9669
I0702 14:23:20.918087 69690 GraphService.cpp:33] Authenticating user root from 10.142.158.78:51928
I0702 14:23:24.969338 69690 SwitchSpaceExecutor.cpp:43] Graph switched to `mobile_imei_all', space id: 60
I0702 14:25:13.966357 69689 MetaClient.cpp:3053] Load leader of "10.142.158.75":9779 in 5 space
I0702 14:25:13.966413 69689 MetaClient.cpp:3053] Load leader of "10.142.158.76":9779 in 5 space
I0702 14:25:13.966449 69689 MetaClient.cpp:3053] Load leader of "10.142.158.77":9779 in 5 space
I0702 14:25:13.966468 69689 MetaClient.cpp:3053] Load leader of "10.142.158.78":9779 in 5 space
I0702 14:25:13.966473 69689 MetaClient.cpp:3056] Load leader ok
I0702 14:27:21.728935 69689 GraphService.cpp:33] Authenticating user root from 10.142.158.78:52048
I0702 14:27:30.903767 69689 SwitchSpaceExecutor.cpp:43] Graph switched to `mobile_imei_all', space id: 60
I0702 14:28:00.060885 69687 SwitchSpaceExecutor.cpp:43] Graph switched to `mobile_imei', space id: 32

This is part of the storaged.INFO log. After the restart the leaders were re-balanced first; the query ran at around 14:25. (A quick leader-distribution check is sketched right after this log.)

I0702 14:23:45.162096 69540 Host.cpp:149] [Port: 9780, Space: 60, Part: 10] [Host: 10.142.158.75:9780] This is the first time to send the logs to this host, lastLogIdSent = 15907614, lastLogTermSent = 31
I0702 14:23:45.162003 69587 RaftPart.cpp:1248] [Port: 9780, Space: 60, Part: 6] The partition is elected as the leader
I0702 14:23:45.162073 69581 RaftPart.cpp:1152] [Port: 9780, Space: 60, Part: 17] Partition is elected as the new leader for term 49
I0702 14:23:45.162195 69541 Host.cpp:149] [Port: 9780, Space: 60, Part: 9] [Host: 10.142.158.76:9780] This is the first time to send the logs to this host, lastLogIdSent = 15838164, lastLogTermSent = 41
I0702 14:23:45.162263 69581 RaftPart.cpp:1248] [Port: 9780, Space: 60, Part: 17] The partition is elected as the leader
I0702 14:23:45.162279 69542 Host.cpp:149] [Port: 9780, Space: 60, Part: 6] [Host: 10.142.158.77:9780] This is the first time to send the logs to this host, lastLogIdSent = 15776524, lastLogTermSent = 44
I0702 14:23:45.162279 69587 Part.cpp:191] [Port: 9780, Space: 60, Part: 1] Find the new leader "10.142.158.76":9780
I0702 14:23:45.162333 69561 Host.cpp:149] [Port: 9780, Space: 60, Part: 17] [Host: 10.142.158.76:9780] This is the first time to send the logs to this host, lastLogIdSent = 15707326, lastLogTermSent = 48
I0702 14:23:45.162320 69541 Host.cpp:149] [Port: 9780, Space: 60, Part: 9] [Host: 10.142.158.77:9780] This is the first time to send the logs to this host, lastLogIdSent = 15838164, lastLogTermSent = 41
I0702 14:23:45.162345 69482 RaftPart.cpp:422] [Port: 9780, Space: 32, Part: 2] Commit transfer leader to "10.142.158.78":9780
I0702 14:23:45.162400 69482 RaftPart.cpp:436] [Port: 9780, Space: 32, Part: 2] I am already the leader!
I0702 14:23:45.162351 69542 Host.cpp:149] [Port: 9780, Space: 60, Part: 6] [Host: 10.142.158.75:9780] This is the first time to send the logs to this host, lastLogIdSent = 15776524, lastLogTermSent = 44
I0702 14:23:45.162423 69561 Host.cpp:149] [Port: 9780, Space: 60, Part: 17] [Host: 10.142.158.77:9780] This is the first time to send the logs to this host, lastLogIdSent = 15707326, lastLogTermSent = 48
I0702 14:23:45.162468 69482 RaftPart.cpp:422] [Port: 9780, Space: 60, Part: 10] Commit transfer leader to "10.142.158.78":9780
I0702 14:23:45.162483 69482 RaftPart.cpp:436] [Port: 9780, Space: 60, Part: 10] I am already the leader!
I0702 14:23:45.162629 69482 RaftPart.cpp:422] [Port: 9780, Space: 60, Part: 17] Commit transfer leader to "10.142.158.78":9780
I0702 14:23:45.162639 69482 RaftPart.cpp:436] [Port: 9780, Space: 60, Part: 17] I am already the leader!
I0702 14:23:45.162648 69483 RaftPart.cpp:422] [Port: 9780, Space: 60, Part: 9] Commit transfer leader to "10.142.158.78":9780
I0702 14:23:45.162667 69483 RaftPart.cpp:436] [Port: 9780, Space: 60, Part: 9] I am already the leader!
I0702 14:23:45.162659 69481 RaftPart.cpp:422] [Port: 9780, Space: 60, Part: 6] Commit transfer leader to "10.142.158.78":9780
I0702 14:23:45.162694 69481 RaftPart.cpp:436] [Port: 9780, Space: 60, Part: 6] I am already the leader!
I0702 14:23:58.640892 69481 RaftPart.cpp:422] [Port: 9780, Space: 32, Part: 1] Commit transfer leader to "10.142.158.76":9780
I0702 14:23:58.640935 69481 RaftPart.cpp:442] [Port: 9780, Space: 32, Part: 1] I am Follower, just wait for the new leader!
I0702 14:23:58.955364 69481 RaftPart.cpp:422] [Port: 9780, Space: 60, Part: 1] Commit transfer leader to "10.142.158.76":9780
I0702 14:23:58.955391 69481 RaftPart.cpp:442] [Port: 9780, Space: 60, Part: 1] I am Follower, just wait for the new leader!
I0702 14:23:59.421893 69481 RaftPart.cpp:422] [Port: 9780, Space: 32, Part: 3] Commit transfer leader to "10.142.158.76":9780
I0702 14:23:59.421919 69481 RaftPart.cpp:442] [Port: 9780, Space: 32, Part: 3] I am Follower, just wait for the new leader!
I0702 14:33:06.375084 69585 FileBasedWal.cpp:738] [Port: 9780, Space: 60, Part: 15] Clean wals number 1
I0702 14:33:06.381273 69585 FileBasedWal.cpp:738] [Port: 9780, Space: 60, Part: 7] Clean wals number 1
I0702 14:33:06.386870 69585 FileBasedWal.cpp:738] [Port: 9780, Space: 60, Part: 2] Clean wals number 1
I0702 14:33:06.389310 69585 FileBasedWal.cpp:738] [Port: 9780, Space: 60, Part: 3] Clean wals number 1
I0702 14:33:06.395366 69585 FileBasedWal.cpp:738] [Port: 9780, Space: 60, Part: 10] Clean wals number 1
I0702 14:33:06.399804 69585 FileBasedWal.cpp:738] [Port: 9780, Space: 60, Part: 17] Clean wals number 1
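
As referenced above, a quick sanity check in nebula-console that leader election has settled before timing queries (both statements exist in 2.0.x):

SHOW HOSTS;      // the Leader count / Leader distribution columns should be even and non-zero
BALANCE LEADER;  // re-balance if some hosts hold no leaders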

There's another symptom after the network error: the stop command reports success, but when starting the services again it turns out storaged is still running:

[root@A5-306-HW-2488HV5-2019-011 ~]# /usr/local/nebula/scripts/nebula.service stop all
[INFO] Stopping nebula-metad...
[INFO] Done
[INFO] Stopping nebula-graphd...
[INFO] Done
[INFO] Stopping nebula-storaged...
[INFO] Done
[root@A5-306-HW-2488HV5-2019-011 ~]# /usr/local/nebula/scripts/nebula.service start all
[INFO] Starting nebula-metad...
[INFO] Done
[ERROR] nebula-graphd already running: 69400
[ERROR] nebula-storaged already running: 69436
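
The stop script most likely just signals the processes and returns without waiting for them to exit, so a slow shutdown looks like a successful stop. A way to verify and recover (a sketch; the PIDs are taken from the output above):

# check whether the old processes are actually still alive
ps -p 69400,69436 -o pid,etime,cmd
# if they are stuck, force-kill via the service script (or kill -9 the PIDs directly)
/usr/local/nebula/scripts/nebula.service kill all
/usr/local/nebula/scripts/nebula.service start all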

By network error, do you mean the closed network connection error?

No, it literally just shows this:
[screenshot]