Nebula cluster fails to start after the servers' OS was reinstalled

  • NebulaGraph version: 2.6.1
  • Deployment: distributed
  • Installation: compiled from source
  • In production: N
  • Hardware
    • Disk: 8 TB HDD
    • Nebula is installed on an SSD under /usr/local; data is stored on the HDD

Problem description
The cluster runs on three machines. After the SSD partitions went bad, the OS was reinstalled on all three; the data on the HDDs was left untouched. After reinstalling the Nebula binaries, the cluster will not come up with sudo /usr/local/nebula/scripts/nebula.service start all.
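For what it's worth, the same service script can also report which daemons actually stayed up after the failed start:

# Report the status of all three daemons on this machine:
sudo /usr/local/nebula/scripts/nebula.service status all
# Double-check at the process level:
ps aux | grep -E 'nebula-(metad|graphd|storaged)' | grep -v grep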

  • nebula-graphd.ERROR
Log file created at: 2023/06/30 10:06:20
Running on machine: nebula01
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0630 10:06:20.243712  4303 MetaClient.cpp:624] Send request to "nebula02":9559, exceed retry limit
E0630 10:06:20.244683  4285 MetaClient.cpp:94] RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0630 10:06:20.244791  4285 GraphService.cpp:43] Failed to wait for meta service ready synchronously.
E0630 10:06:20.244849  4285 GraphDaemon.cpp:158] Failed to wait for meta service ready synchronously.
  • nebula-metad.ERROR
Running on machine: nebula01
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0630 10:23:40.856523  4277 RaftPart.cpp:1050] [Port: 9560, Space: 0, Part: 0] Receive response about askForVote from "nebula02":9560, error code is E_UNKNOWN_PART
E0630 10:23:41.535290  4273 RaftPart.cpp:1050] [Port: 9560, Space: 0, Part: 0] Receive response about askForVote from "nebula02":9560, error code is E_UNKNOWN_PART
E0630 10:23:42.586930  4857 ActiveHostsMan.cpp:246] Get last update time failed, error: E_LEADER_CHANGED
E0630 10:23:44.600821  4857 ActiveHostsMan.cpp:246] Get last update time failed, error: E_LEADER_CHANGED
E0630 10:23:58.620523  4857 ActiveHostsMan.cpp:246] Get last update time failed, error: E_LEADER_CHANGED
  • nebula-storaged.ERROR
Log file created at: 2023/06/30 10:06:20
Running on machine: nebula01
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0630 10:06:20.343221  4372 MetaClient.cpp:624] Send request to "nebula02":9559, exceed retry limit
E0630 10:06:20.344065  4331 MetaClient.cpp:94] RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0630 10:06:20.344156  4331 StorageServer.cpp:163] waitForMetadReady error!
E0630 10:06:20.344187  4331 StorageDaemon.cpp:161] Storage server start failed
  • graphd-stderr.log
E0630 10:04:15.845803  3664 MetaClient.cpp:624] Send request to "nebula02":9559, exceed retry limit
E0630 10:04:15.875241  3651 MetaClient.cpp:94] RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0630 10:04:15.875334  3651 GraphService.cpp:43] Failed to wait for meta service ready synchronously.
E0630 10:04:15.875391  3651 GraphDaemon.cpp:158] Failed to wait for meta service ready synchronously.
E0630 10:04:33.862599  3996 MetaClient.cpp:624] Send request to "nebula03":9559, exceed retry limit
E0630 10:04:33.863549  3975 MetaClient.cpp:94] RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0630 10:04:33.863656  3975 GraphService.cpp:43] Failed to wait for meta service ready synchronously.
E0630 10:04:33.863729  3975 GraphDaemon.cpp:158] Failed to wait for meta service ready synchronously.
E0630 10:06:20.243712  4303 MetaClient.cpp:624] Send request to "nebula02":9559, exceed retry limit
E0630 10:06:20.244683  4285 MetaClient.cpp:94] RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0630 10:06:20.244791  4285 GraphService.cpp:43] Failed to wait for meta service ready synchronously.
E0630 10:06:20.244849  4285 GraphDaemon.cpp:158] Failed to wait for meta service ready synchronously.
  • metad-stderr.log
*** Aborted at 1688090667 (Unix time, try 'date -d @1688090667') ***
*** Signal 15 (SIGTERM) (0xf18) received by PID 3579 (pthread TID 0x7fe33ecee980) (linux TID 3579) (maybe from PID 3864, UID 0) (code: 0), stack trace: ***
/usr/local/nebula/bin/nebula-metad(_ZN5folly10symbolizer17getStackTraceSafeEPmm+0x31)[0x20ef2b1]
/usr/local/nebula/bin/nebula-metad(_ZN5folly10symbolizer21SafeStackTracePrinter15printStackTraceEb+0x1b)[0x20e711b]
/usr/local/nebula/bin/nebula-metad[0x20e5457]
/lib64/libpthread.so.0(+0xf62f)[0x7fe33e1bc62f]
/lib64/libc.so.6(nanosleep+0x2d)[0x7fe33dea49fd]
/lib64/libc.so.6(sleep+0xd3)[0x7fe33dea4893]
/usr/local/nebula/bin/nebula-metad(_Z6initKVSt6vectorIN6nebula8HostAddrESaIS1_EES1_+0xa0d)[0xf4ad8d]
/usr/local/nebula/bin/nebula-metad(main+0x79c)[0xf0e7fc]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x7fe33de01554]
/usr/local/nebula/bin/nebula-metad[0xf49a9d]
(safe mode, symbolizer not available)
*** Aborted at 1688090772 (Unix time, try 'date -d @1688090772') ***
*** Signal 15 (SIGTERM) (0x1049) received by PID 3919 (pthread TID 0x7fc0f9b00980) (linux TID 3919) (maybe from PID 4169, UID 0) (code: 0), stack trace: ***
/usr/local/nebula/bin/nebula-metad(_ZN5folly10symbolizer17getStackTraceSafeEPmm+0x31)[0x20ef2b1]
/usr/local/nebula/bin/nebula-metad(_ZN5folly10symbolizer21SafeStackTracePrinter15printStackTraceEb+0x1b)[0x20e711b]
/usr/local/nebula/bin/nebula-metad[0x20e5457]
/lib64/libpthread.so.0(+0xf62f)[0x7fc0f8fce62f]
/lib64/libc.so.6(nanosleep+0x2d)[0x7fc0f8cb69fd]
/lib64/libc.so.6(sleep+0xd3)[0x7fc0f8cb6893]
/usr/local/nebula/bin/nebula-metad(_Z6initKVSt6vectorIN6nebula8HostAddrESaIS1_EES1_+0xa0d)[0xf4ad8d]
/usr/local/nebula/bin/nebula-metad(main+0x79c)[0xf0e7fc]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x7fc0f8c13554]
/usr/local/nebula/bin/nebula-metad[0xf49a9d]
(safe mode, symbolizer not available)
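Reading the excerpts together: graphd and storaged die because they cannot reach any metad ("Connection refused", errno 111, means nothing is listening on 9559 at all), metad itself starts but its Raft group never elects a leader (E_UNKNOWN_PART and E_LEADER_CHANGED point at peers that are unreachable or not running), and the SIGTERM traces are not crashes: metad is still sleeping inside initKV, waiting for meta quorum, when the service script kills it. A few quick checks that narrow this down, as a sketch; hostnames and ports are taken from the configs below:

# Decode the abort timestamps, as the trace itself suggests (local timezone):
date -d @1688090667
date -d @1688090772

# "Connection refused" means nothing is listening on the meta port;
# check metad reachability from this machine to every peer:
for h in nebula01 nebula02 nebula03; do
  nc -zv "$h" 9559
done

# On each machine, confirm metad actually bound its port:
ss -ltnp | grep 9559

# Each service also exposes a status endpoint on its ws_http_port (19559 for
# metad per the configs below); a reply shows the daemon is alive even while
# the meta Raft group still has no leader:
curl -s http://nebula01:19559/status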

Cluster configuration

  • nebula-graphd.conf
########## basics ##########
# Whether to run as a daemon process
--daemonize=true
# The file to host the process id
--pid_file=pids/nebula-graphd.pid
# Whether to enable optimizer
--enable_optimizer=true
# The default charset when a space is created
--default_charset=utf8
# The default collate when a space is created
--default_collate=utf8_bin
# Whether to use the configuration obtained from the configuration file
--local_config=true

########## logging ##########
# The directory to host logging files
--log_dir=/data/nebula/logs
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=0
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=0
# Maximum seconds to buffer the log messages
--logbufsecs=0
# Whether to redirect stdout and stderr to separate output files
--redirect_stdout=true
# Destination filename of stdout and stderr, which will also reside in log_dir.
--stdout_log_file=graphd-stdout.log
--stderr_log_file=graphd-stderr.log
# Copy log messages at or above this level to stderr in addition to logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are 0, 1, 2, and 3, respectively.
--stderrthreshold=2

########## query ##########
# Whether to treat partial success as an error.
# This flag is only used for Read-only access, and Modify access always treats partial success as an error.
--accept_partial_success=false
# Maximum sentence length, unit byte
--max_allowed_query_size=4194304

########## networking ##########
# Comma separated Meta Server Addresses
--meta_server_addrs=nebula01:9559,nebula02:9559,nebula03:9559
# Local IP used to identify the nebula-graphd process.
# Change it to an address other than loopback if the service is distributed or
# will be accessed remotely.
--local_ip=nebula01
# Network device to listen on
--listen_netdev=any
# Port to listen on
--port=9669
# To turn on SO_REUSEPORT or not
--reuse_port=false
# Backlog of the listen socket, adjust this together with net.core.somaxconn
--listen_backlog=1024
# Seconds before the idle connections are closed, 0 for never closed
--client_idle_timeout_secs=0
# Seconds before the idle sessions are expired, 0 for no expiration
--session_idle_timeout_secs=0
# The number of threads to accept incoming connections
--num_accept_threads=1
# The number of networking IO threads, 0 for # of CPU cores
--num_netio_threads=0
# The number of threads to execute user queries, 0 for # of CPU cores
--num_worker_threads=0
# HTTP service ip
--ws_ip=0.0.0.0
# HTTP service port
--ws_http_port=19669
# HTTP2 service port
--ws_h2_port=19670
# storage client timeout
--storage_client_timeout_ms=60000
# Port to listen on Meta with HTTP protocol, it corresponds to ws_http_port in metad's configuration file
--ws_meta_http_port=19559

########## authentication ##########
# Enable authorization
--enable_authorize=false
# User login authentication type, password for nebula authentication, ldap for ldap authentication, cloud for cloud authentication
--auth_type=password

########## memory ##########
# System memory high watermark ratio
--system_memory_high_watermark_ratio=0.8

########## experimental feature ##########
# if use experimental features
--enable_experimental_feature=false

  • nebula-metad.conf
########## basics ##########
# Whether to run as a daemon process
--daemonize=true
# The file to host the process id
--pid_file=pids/nebula-metad.pid

########## logging ##########
# The directory to host logging files
--log_dir=/data/nebula/logs
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=0
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=0
# Maximum seconds to buffer the log messages
--logbufsecs=0
# Whether to redirect stdout and stderr to separate output files
--redirect_stdout=true
# Destination filename of stdout and stderr, which will also reside in log_dir.
--stdout_log_file=metad-stdout.log
--stderr_log_file=metad-stderr.log
# Copy log messages at or above this level to stderr in addition to logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are 0, 1, 2, and 3, respectively.
--stderrthreshold=2

########## networking ##########
# Comma separated Meta Server addresses
--meta_server_addrs=nebula01:9559,nebula02:9559,nebula03:9559
# Local IP used to identify the nebula-metad process.
# Change it to an address other than loopback if the service is distributed or
# will be accessed remotely.
--local_ip=nebula01
# Meta daemon listening port
--port=9559
# HTTP service ip
--ws_ip=0.0.0.0
# HTTP service port
--ws_http_port=19559
# HTTP2 service port
--ws_h2_port=19560
# Port to listen on Storage with HTTP protocol, it corresponds to ws_http_port in storage's configuration file
--ws_storage_http_port=19779

########## storage ##########
# Root data path, here should be only single path for metad
--data_path=/data/nebula/meta

########## Misc #########
# The default number of parts when a space is created
--default_parts_num=100
# The default replica factor when a space is created
--default_replica_factor=1

--heartbeat_interval_secs=10
  • nebula-storaged.conf
########## basics ##########
# Whether to run as a daemon process
--daemonize=true
# The file to host the process id
--pid_file=pids/nebula-storaged.pid
# Whether to use the configuration obtained from the configuration file
--local_config=true

########## logging ##########
# The directory to host logging files
--log_dir=/data/nebula/logs
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=0
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=0
# Maximum seconds to buffer the log messages
--logbufsecs=0
# Whether to redirect stdout and stderr to separate output files
--redirect_stdout=true
# Destination filename of stdout and stderr, which will also reside in log_dir.
--stdout_log_file=storaged-stdout.log
--stderr_log_file=storaged-stderr.log
# Copy log messages at or above this level to stderr in addition to logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are 0, 1, 2, and 3, respectively.
--stderrthreshold=2

########## networking ##########
# Comma separated Meta server addresses
--meta_server_addrs=nebula01:9559,nebula02:9559,nebula03:9559
# Local IP used to identify the nebula-storaged process.
# Change it to an address other than loopback if the service is distributed or
# will be accessed remotely.
--local_ip=nebula01
# Storage daemon listening port
--port=9779
# HTTP service ip
--ws_ip=0.0.0.0
# HTTP service port
--ws_http_port=19779
# HTTP2 service port
--ws_h2_port=19780
# heartbeat with meta service
--heartbeat_interval_secs=10

######### Raft #########
# Raft election timeout
--raft_heartbeat_interval_secs=30
# RPC timeout for raft client (ms)
--raft_rpc_timeout_ms=500
## recycle Raft WAL
--wal_ttl=14400

########## Disk ##########
# Root data path. Split by comma. e.g. --data_path=/disk1/path1/,/disk2/path2/
# One path per Rocksdb instance.
--data_path=/data/nebula/storage

# Minimum reserved bytes of each data path
--minimum_reserved_bytes=268435456

# The default reserved bytes for one batch operation
--rocksdb_batch_size=4096
# The default block cache size used in BlockBasedTable.
# The unit is MB.
--rocksdb_block_cache=4
# The type of storage engine, `rocksdb', `memory', etc.
--engine_type=rocksdb

# Compression algorithm, options: no,snappy,lz4,lz4hc,zlib,bzip2,zstd
# For the sake of binary compatibility, the default value is snappy.
# Recommend to use:
#   * lz4 to gain more CPU performance, with the same compression ratio with snappy
#   * zstd to occupy less disk space
#   * lz4hc for the read-heavy write-light scenario
--rocksdb_compression=lz4

# Set different compressions for different levels
# For example, if --rocksdb_compression is snappy,
# "no:no:lz4:lz4::zstd" is identical to "no:no:lz4:lz4:snappy:zstd:snappy"
# In order to disable compression for level 0/1, set it to "no:no"
--rocksdb_compression_per_level=

# Whether or not to enable rocksdb's statistics, disabled by default
--enable_rocksdb_statistics=false

# Statslevel used by rocksdb to collection statistics, optional values are
#   * kExceptHistogramOrTimers, disable timer stats, and skip histogram stats
#   * kExceptTimers, Skip timer stats
#   * kExceptDetailedTimers, Collect all stats except time inside mutex lock AND time spent on compression.
#   * kExceptTimeForMutex, Collect all stats except the counters requiring to get time inside the mutex lock.
#   * kAll, Collect all stats
--rocksdb_stats_level=kExceptHistogramOrTimers

# Whether or not to enable rocksdb's prefix bloom filter, enabled by default.
--enable_rocksdb_prefix_filtering=true
# Whether or not to enable rocksdb's whole key bloom filter, disabled by default.
--enable_rocksdb_whole_key_filtering=false

############## rocksdb Options ##############
# rocksdb DBOptions in json, each name and value of option is a string, given as "option_name":"option_value" separated by comma
--rocksdb_db_options={}
# rocksdb ColumnFamilyOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_column_family_options={"write_buffer_size":"67108864","max_write_buffer_number":"4","max_bytes_for_level_base":"268435456"}
# rocksdb BlockBasedTableOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_block_based_table_options={"block_size":"8192"}
  • The three machines can ping each other
  • The firewall is disabled
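Two things worth re-checking after an OS reinstall, since the configs address peers by hostname: whether nebula01/02/03 still resolve on every machine (a fresh OS may have lost the /etc/hosts entries), and whether --local_ip in each machine's own copy of the configs names that machine rather than nebula01 (the files above are evidently the nebula01 copies). A sketch, assuming the default install path and passwordless SSH between the hosts:

# 1) Hostname resolution: every host must resolve every peer consistently.
for h in nebula01 nebula02 nebula03; do
  getent hosts "$h"
done

# 2) local_ip: each machine's configs must carry its own name.
for h in nebula01 nebula02 nebula03; do
  ssh "$h" 'hostname; grep -h -- "--local_ip" /usr/local/nebula/etc/nebula-*.conf'
done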

One more data point: if I remove nebula02 and nebula03 from the configs and start the node standalone, it comes up normally, and through port 7001 I can still open the old graph spaces, but every query fails with: -1005:Storage Error: The leader has changed. Try again later
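For the -1005 error it is worth checking, from the console, whether storaged has re-registered with meta and holds partition leaders; a sketch using nebula-console (address and port per the configs above, credentials assumed to be the defaults):

./nebula-console -addr nebula01 -port 9669 -u root -p nebula
SHOW HOSTS;
# Each storaged should be ONLINE with a non-zero "Leader count" before
# queries against existing spaces can succeed.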

I also tried deleting the cluster.id file and the nebula/data directory; that didn't help either.
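Deleting cluster.id is risky, by the way: it ties the local data to the cluster identity issued by metad, so removing it can make a daemon register as a fresh node. Before removing anything, it is worth confirming that the reinstalled OS can still read and write the old directories on the HDD, since a reinstall can change users and permissions; a sketch:

# Check that the surviving data directories are intact and owned by the user
# the services run as (root here, given the sudo start command):
ls -ld /data/nebula/meta /data/nebula/storage /data/nebula/logs
# Restore ownership if the reinstall changed UIDs:
# sudo chown -R root:root /data/nebula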

Follow-up: solved. After the OS reinstall, ports 9559, 9669, 9560, 19559, 19669, 9777, 9778, 9779, 19779 and 9780 have to be opened on the machines, and the start command has to be run on all three meta-service machines; after a while of restarting, everything came back up…
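For anyone hitting the same thing: on a freshly installed system the firewall typically comes back enabled, so "firewall disabled" has to be re-done on every machine. A sketch of opening the ports listed above, assuming firewalld is what the new OS runs (disabling the firewall outright works too):

# Open Nebula's RPC/HTTP ports on every machine, then reload the rules:
for p in 9559 9560 9669 9777 9778 9779 9780 19559 19669 19779; do
  sudo firewall-cmd --permanent --add-port=${p}/tcp
done
sudo firewall-cmd --reload
# Or disable the firewall again, as in the original setup:
# sudo systemctl disable --now firewalld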
