Nebula query error

  • Nebula version: v1.1.0
  • Deployment (distributed / standalone / Docker / DBaaS)
    • Distributed, on Kubernetes
    • 1 meta node, 1 graphd node, 3 storage nodes
  • Hardware
    • Disk (SSD recommended)
      - 1 x 1 TB SSD per storage node, 3 SSDs in total
    • CPU and memory
      - Each storage node is 4C/128G; the meta node is 2C/4G; the graphd node is 2C/4G
  • Description of the problem:
    • When running queries, graphd sometimes fails to connect to storage, and the failure does not reproduce reliably.
    • When the error occurs, storage has no logs for that period, the meta logs look normal, and graphd has error logs.
    • Below are the graphd error logs, with the storage IPs replaced by xxxx:
E0414 21:55:05.723985  1203 StorageClient.cpp:316] Request to [xxxx22509] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
E0414 21:55:05.725611  1195 ScanExecutor.cpp:267] ScanVertex error RPC failure in StorageClient: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
E0414 21:55:05.726768  1195 ExecutionPlan.cpp:80] Execute failed: RPC failure in StorageClient: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
E0414 21:55:10.988481  1204 StorageClient.cpp:126] Request to [xxxx:22509] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
  • Related meta / storage / graph INFO logs
    • Below are the graphd, meta, and storage INFO logs, respectively
    • graphd
E0415 00:04:14.374117  1078 StorageClient.cpp:126] Request to [xxxxx:22509] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
E0415 00:04:14.378227  1196 InsertVertexExecutor.cpp:250] Insert vertices failed, error -3, part 3
E0415 00:04:14.379381  1196 ExecutionPlan.cpp:80] Execute failed: Insert vertex not complete, completeness: 0
E0415 00:04:14.409829  1079 StorageClient.cpp:126] Request to [xxxx:22509] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
E0415 00:04:14.411213  1194 InsertVertexExecutor.cpp:250] Insert vertices failed, error -3, part 511
E0415 00:04:14.412364  1194 ExecutionPlan.cpp:80] Execute failed: Insert vertex not complete, completeness: 0
E0415 00:04:14.442636  1080 StorageClient.cpp:126] Request to [xxxxxx:22509] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
E0415 00:04:14.443828  1080 StorageClient.cpp:126] Request to [xxxxx:22509] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
E0415 00:04:14.445029  1194 InsertEdgeExecutor.cpp:261] Insert edge failed, error -3, part 511
E0415 00:04:14.446199  1194 InsertEdgeExecutor.cpp:261] Insert edge failed, error -3, part 3
E0415 00:04:14.447266  1194 ExecutionPlan.cpp:80] Execute failed: Insert edge `edge_trans_user_order_ip' not complete, completeness: 0
E0415 00:04:14.477629  1081 StorageClient.cpp:126] Request to [xxxxx:22509] failed: N6apache6thrift9transport19TTransportExceptionE: Channel got EOF. Check for server hitting connection limit, server connection idle timeout, and server crashes.
E0415 00:04:14.479244  1194 ExecutionPlan.cpp:80] Execute failed: Get neighbors failed
  • meta
I0415 00:04:03.337513  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:04.318336  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:09.763253  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:13.352507  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:14.332794  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:19.776901  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:23.368639  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:24.347168  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:29.791584  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:33.382860  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:34.361493  1933 HBProcessor.cpp:31] Receive heartbeat from [1xxx:22509]
I0415 00:04:39.804872  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:43.397475  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:44.375896  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:49.819263  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:53.411648  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:54.390313  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
I0415 00:04:59.833978  1933 HBProcessor.cpp:31] Receive heartbeat from [xxx:22509]
  • storage has no logs for this time window
  • Ask
    • What are the possible causes of this problem, and what should I adjust to prevent it from happening again?

Take a look at the storage config: is --client_idle_timeout_secs or --session_idle_timeout_secs set there?

If it's convenient, could you post the sanitized config files? Thanks.
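
For reference, when these flags are set they look like the lines below in the conf file (this is just what to look for; a value of 0 disables the timeout, i.e. connections and sessions never expire):

# Seconds before the idle connections are closed, 0 for never closed
--client_idle_timeout_secs=0
# Seconds before the idle sessions are expired, 0 for no expiration
--session_idle_timeout_secs=0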

I don't see these two settings in the storage config, but I did find them in graphd.
What do these two settings mean?
Below are the storage, graphd, and meta configs, respectively.

storage1

########## basics ##########
# Whether to run as a daemon process
--daemonize=false
# The file to host the process id
--pid_file=/tmp/nebula-storaged.pid

########## logging ##########
# The directory to host logging files, which must already exists
--log_dir=/data/graphdb/storage/log
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=0
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=0
# Maximum seconds to buffer the log messages
--logbufsecs=0
# Whether to redirect stdout and stderr to separate output files
--redirect_stdout=true
# Destination filename of stdout and stderr, which will also reside in log_dir.
--stdout_log_file=storaged-stdout.log
--stderr_log_file=storaged-stderr.log
# Copy log messages at or above this level to stderr in addition to logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are 0, 1, 2, and 3, respectively.
--stderrthreshold=3

########## networking ##########
# Meta server address
--meta_server_addrs=xxxx:22343
# Local ip
--local_ip=xxxxx
# Storage daemon listening port
--port=xxx09
# HTTP service ip
--ws_ip=xxxxxx
# HTTP service port
--ws_http_port=12000
# HTTP2 service port
--ws_h2_port=12002

--ws_meta_http_port=11000
--ws_meta_h2_port=11002
--ws_storage_http_port=12000
--ws_storage_h2_port=12002

# heartbeat with meta service
--heartbeat_interval_secs=10


########## Disk ##########
# Root data path. split by comma. e.g. --data_path=/disk1/path1/,/disk2/path2/
# One path per Rocksdb instance. 
--data_path=/data/graphdb/storage/1,/data/graphdb/storage/2,/data/graphdb/storage/3,/data/graphdb/storage/4,/data/graphdb/storage/5,/data/graphdb/storage/6,/data/graphdb/storage/7,/data/graphdb/storage/8,/data/graphdb/storage/9,/data/graphdb/storage/10,/data/graphdb/storage/11,/data/graphdb/storage/12,/data/graphdb/storage/13,/data/graphdb/storage/14,/data/graphdb/storage/15,/data/graphdb/storage/16,/data/graphdb/storage/17,/data/graphdb/storage/18,/data/graphdb/storage/19,/data/graphdb/storage/20,/data/graphdb/storage/21,/data/graphdb/storage/22,/data/graphdb/storage/23,/data/graphdb/storage/24,/data/graphdb/storage/25,/data/graphdb/storage/26,/data/graphdb/storage/27,/data/graphdb/storage/28,/data/graphdb/storage/29,/data/graphdb/storage/30,/data/graphdb/storage/31,/data/graphdb/storage/32,

######### Raft #########
# Raft election timeout
--raft_heartbeat_interval_secs=15
# storage not check leader
--check_leader=false
# RPC timeout for raft client (ms)
--raft_rpc_timeout_ms=15000
## recycle Raft WAL 
--wal_ttl=604800

############## Rocksdb Options ##############

# The default reserved bytes for one batch operation
--rocksdb_batch_size=4096

# The default block cache size used in BlockBasedTable.
# The unit is MB.
--rocksdb_block_cache=8192

--rocksdb_disable_wal=false

--enable_partitioned_index_filter=true

--local_config=true

--num_compaction_threads=2

--rate_limit=64

--enable_rocksdb_prefix_filtering=true

--enable_rocksdb_whole_key_filtering=false

--rocksdb_filtering_prefix_length=16

# rocksdb DBOptions in json, each name and value of option is a string, given as "option_name":"option_value" separated by comma
--rocksdb_db_options={"max_background_jobs":"4","max_background_compactions":"4","max_subcompactions":"4","compaction_readahead_size":"32K","writable_file_max_buffer_size":"16M","allow_concurrent_memtable_write":"false","stats_dump_period_sec":"300","stats_persist_period_sec":"300","dump_malloc_stats":"true","table_cache_numshardbits":"12","atomic_flush":"false","write_thread_slow_yield_usec":"5","write_thread_max_yield_usec":"100","persist_stats_to_disk":"false","skip_log_error_on_recovery":"true","log_readahead_size":"4K","error_if_exists":"false","create_if_missing":"true","paranoid_checks":"false"}
# rocksdb ColumnFamilyOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_column_family_options={"prefix_extractor":"rocksdb.FixedPrefix.16","memtable_factory":"SkipListFactory","memtable_insert_with_hint_prefix_extractor":"rocksdb.FixedPrefix.16","compaction_style":"kCompactionStyleUniversal","num_levels":"7","compression_per_level":"kNoCompression:kNoCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kZSTD","memtable_whole_key_filtering":"false","write_buffer_size":"64M","max_write_buffer_number":"4","max_bytes_for_level_base":"256M","level0_file_num_compaction_trigger":"4","level0_slowdown_writes_trigger":"1K","level0_stop_writes_trigger":"4K", "soft_pending_compaction_bytes_limit":"4T", "hard_pending_compaction_bytes_limit":"16T","min_write_buffer_number_to_merge":"1","max_write_buffer_number_to_maintain":"1","compression":"kSnappyCompression","compaction_options_universal":"{stop_style=kCompactionStopStyleTotalSize;compression_size_percent=-1;allow_trivial_move=true;max_merge_width=4294967295;max_size_amplification_percent=50;min_merge_width=2;size_ratio=1;}","table_factory":"BlockBasedTable","bottommost_compression":"kZSTD","optimize_filters_for_hits":"false","level_compaction_dynamic_level_bytes":"true","periodic_compaction_seconds":"604800","disable_auto_compactions":"true","report_bg_io_stats":"true"}
# rocksdb BlockBasedTableOptions in json, each name and value of option is string, given as "option_name":"option_value" separated by comma
--rocksdb_block_based_table_options={"block_size":"16K","block_restart_interval":"32","format_version":"5","enable_index_compression":"true","partition_filters":"true","whole_key_filtering":"false","data_block_index_type":"kDataBlockBinarySearch","index_type":"kTwoLevelIndexSearch","pin_l0_filter_and_index_blocks_in_cache":"true","block_restart_interval":"64","cache_index_and_filter_blocks_with_high_priority":"true","cache_index_and_filter_blocks":"true"}

############### misc ####################

--storage_http_thread_num=5
--storage_client_timeout_ms=600000
--enable_multi_versions=true
--num_io_threads=16
--num_worker_threads=16
--cluster_id_path=/data/graphdb/cluster.id
--gflags_mode_json=/data/graphdb/etc/gflags.json

############# edge samplings ##############
# --enable_reservoir_sampling=false
# --max_edge_returned_per_vertex=2147483647

--storage_remove_future_data=true

--enable_vertex_cache=true

--vertex_cache_num=67108864

--vertex_cache_ttl=3600

--vertex_cache_bucket_exp=12

--reader_handlers=36

--max_handlers_per_req=8

--min_vertices_per_bucket=64

--reserved_edges_one_vertex=4096

--max_edge_returned_per_vertex=262144

--slow_op_threshhold_ms=50

--enable_auto_repair=true

--max_scan_block_size=268435456

--conn_timeout_ms=15

graphd

########## basics ##########
# Whether to run as a daemon process
--daemonize=false
# The file to host the process id
--pid_file=/tmp/nebula-graphd.pid

########## logging ##########
# The directory to host logging files, which must already exists
--log_dir=/data/graphdb/log
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=0
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=0
# Maximum seconds to buffer the log messages
--logbufsecs=0
# Whether to redirect stdout and stderr to separate output files
--redirect_stdout=true
# Destination filename of stdout and stderr, which will also reside in log_dir.
--stdout_log_file=graphd-stdout.log
--stderr_log_file=graphd-stderr.log
# Copy log messages at or above this level to stderr in addition to logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are 0, 1, 2, and 3, respectively.
--stderrthreshold=3

########## networking ##########
# Meta Server Address
--meta_server_addrs=xxxxxxxxx:22343
# Local ip
--local_ip=xxxxxxxxxxxx
# Network device to listen on
--listen_netdev=any
# Port to listen on
--port=13708
# To turn on SO_REUSEPORT or not
--reuse_port=false
# Backlog of the listen socket, adjust this together with net.core.somaxconn
--listen_backlog=1024
# Seconds before the idle connections are closed, 0 for never closed
--client_idle_timeout_secs=0
# Seconds before the idle sessions are expired, 0 for no expiration
--session_idle_timeout_secs=0
# The number of threads to accept incoming connections
--num_accept_threads=1
# The number of networking IO threads, 0 for # of CPU cores
--num_netio_threads=36
# The number of threads to execute user queries, 0 for # of CPU cores
--num_worker_threads=36
# HTTP service ip
--ws_ip=xxxxxxx
# HTTP service port
--ws_http_port=13000
# HTTP2 service port
--ws_h2_port=13002

--ws_meta_http_port=11000
--ws_meta_h2_port=11002
--ws_storage_http_port=12000
--ws_storage_h2_port=12002

--heartbeat_interval_secs=10

--storage_client_timeout_ms=600000

########## authorization ##########
# Enable authorization
--enable_authorize=true

--whitelist=

--enable_local_authorize=false

--cluster_id_path=/data/graphdb/cluster.id

--gflags_mode_json=/data/graphdb/etc/gflags.json

--use_executor_access_cache=true

--conn_timeout_ms=15

meta

########## basics ##########
# Whether to run as a daemon process
--daemonize=false
# The file to host the process id
--pid_file=/tmp/nebula-metad.pid

########## logging ##########
# The directory to host logging files, which must already exists
--log_dir=/data/graphdb/meta/log
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=0
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=0
# Maximum seconds to buffer the log messages
--logbufsecs=0
# Whether to redirect stdout and stderr to separate output files
--redirect_stdout=true
# Destination filename of stdout and stderr, which will also reside in log_dir.
--stdout_log_file=metad-stdout.log
--stderr_log_file=metad-stderr.log
# Copy log messages at or above this level to stderr in addition to logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are 0, 1, 2, and 3, respectively.
--stderrthreshold=3

########## networking ##########
# Meta Server Address
--meta_server_addrs=xxxxxxxxx:22343
# Local ip
--local_ip=xxxxx
# Meta daemon listening port
--port=22343
# HTTP service ip
--ws_ip=xxxxxxxxx
# HTTP service port
--ws_http_port=11000
# HTTP2 service port
--ws_h2_port=11002

--ws_meta_http_port=11000
--ws_meta_h2_port=11002
--ws_storage_http_port=12000
--ws_storage_h2_port=12002

# heartbeat from storage with meta service
--heartbeat_interval_secs=10

########## storage ##########
# Root data path, here should be only single path for metad
--data_path=/data/graphdb/meta

############## rocksdb Options ##############
--rocksdb_disable_wal=false

--rocksdb_wal_sync=true

########## Misc #########

--meta_http_thread_num=256

--dispatch_thread_num=256

# The default number of parts when a space is created
--default_parts_num=512
# The default replica factor when a space is created
--default_replica_factor=1

--cluster_id_path=/data/graphdb/cluster.id

--gflags_mode_json=/data/graphdb/etc/gflags.json

--conn_timeout_ms=15

--storage_client_timeout_ms=600000

--task_concurrency=48

Hmm, I mistyped earlier; I meant the graph config.

--whitelist=

--enable_local_authorize=false

--cluster_id_path=/data/graphdb/cluster.id

--gflags_mode_json=/data/graphdb/etc/gflags.json

--use_executor_access_cache=true

--conn_timeout_ms=15

Did you set this last section of config yourself? Try changing conn_timeout_ms to 1000.
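
For example, the change to the graph config would just be this one line (a sketch of the suggested edit; 1000 is the flag's default, as noted further down):

--conn_timeout_ms=1000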

Below is the gflags config

{
    "MUTABLE": [
        "max_edge_returned_per_vertex",
        "minloglevel",
        "v",
        "slow_op_threshhold_ms",
        "wal_ttl"
    ]
}

Also, could I ask what causes the error above? Why might changing this parameter fix it?

--conn_timeout_ms=15

Some details on reproducing the problem

  1. The problem just happened again. Before it occurred I had imported a fairly large dataset, roughly 30 million records, but that import finished at 12:08. The error appeared when I ran operations through graphd at 14:10.

conn_timeout_ms is the timeout for creating a socket. Our docs never suggest changing it; the default is 1000 ms.
From your logs this looks like a thrift communication problem, so since you had changed this setting, the suggestion was to revert it to the default first.

If it shows up again after an import, check the limit settings, check whether storage is compacting and the load is high, or check the network between graph and storage.

Sorry, I just noticed that you are on v1.1.0. Earlier versions had a bug that was fixed in the recent v1.2.1 release.

https://github.com/vesoft-inc/nebula-common/pull/431


1. I changed the conn_timeout_ms setting in graphd, but the problem still occurs. Which limit are you referring to?

2. Based on the error log, this is the source code I traced it to:

It doesn't seem to include the changes from the PR linked above; I marked the line where the log is printed. Is this the thrift communication issue you mentioned?

template <
    class Request,
    class Response = typename decltype(processOne(
        *(storage::cpp2::StorageServiceAsyncClient*)(nullptr),
        *(const Request*)(nullptr))
    )::value_type
>
struct StorageReqsContext : public std::enable_shared_from_this<StorageReqsContext<Request, Response>> {
public:
    StorageReqsContext(
        StorageClient& client,
        std::unordered_map<HostAddr, Request> requests,
        folly::EventBase* evb) :
        client_(client),
        requests_(std::move(requests)),
        resp_(requests_.size()),
        evb_(evb == nullptr ? client.ioThreadPool_->getEventBase() : evb) {
    }

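    // Note added for clarity: process() sends one request per storage host over
    // thrift. If an RPC future completes with an exception (for example the
    // TTransportException "Channel got EOF" seen in the logs above), the
    // hasException() branch below logs that error and marks every part of the
    // request with ErrorCode::E_RPC_FAILURE (or queues the request for a retry).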
    folly::SemiFuture<StorageRpcResponse<Response>> process(bool retry = false) {
        std::vector<folly::Future<folly::Unit>> results;
        for (auto& pair : requests_) {
            const HostAddr& host = pair.first;
            Request& req = pair.second;

            results.emplace_back(folly::via(evb_, [this, &host, &req] () mutable {
                return processOne(*getClient(host), req).via(evb_);
            }).thenTry([this, &host, &req, retry] (folly::Try<Response>&& reqTry) {
                if (reqTry.hasException()) {
                    // This is the line that prints the error log shown above
                    LOG(ERROR) << "Request to " << host << " failed: " << reqTry.exception().what();
                    if (!retry) {
                        for (auto& part : req.parts) {
                            PartitionID partId = getPartId(part);
                            resp_.failedParts().emplace(partId, storage::cpp2::ErrorCode::E_RPC_FAILURE);
                            client_.invalidLeader(req.get_space_id(), partId);
                        }
                        resp_.markFailure();
                    } else {
                        requestsRetry_[host] = std::move(req);
                    }
                } else {
                    auto resp = std::move(reqTry).value();
                    auto& result = resp.get_result();
                    for (auto& code : result.get_failed_codes()) {
                        VLOG(3) << "Failure! Failed part " << code.get_part_id()
                                << ", failed code " << static_cast<int32_t>(code.get_code());
                        if (code.get_code() == storage::cpp2::ErrorCode::E_LEADER_CHANGED) {
                            auto* leader = code.get_leader();
                            if (leader != nullptr
                                && leader->get_ip() != 0
                                && leader->get_port() != 0) {
                                client_.updateLeader(req.get_space_id(),
                                                     code.get_part_id(),
                                                     HostAddr(leader->get_ip(), leader->get_port()));
                            } else {
                                client_.invalidLeader(req.get_space_id(), code.get_part_id());
                            }
                        } else if (code.get_code() == storage::cpp2::ErrorCode::E_PART_NOT_FOUND
                                   || code.get_code() == storage::cpp2::ErrorCode::E_SPACE_NOT_FOUND) {
                            client_.invalidLeader(req.get_space_id(), code.get_part_id());
                        } else {
                            // Simply keep the result
                            resp_.failedParts().emplace(code.get_part_id(), code.get_code());
                        }
                    }
                    if (result.__isset.leader_hints) {
                        for (const cpp2::LeaderHint& hint : *result.get_leader_hints()) {
                            if (hint.get_leader().get_ip() != 0
                                && hint.get_leader().get_port() != 0) {
                                client_.updateLeader(
                                    req.get_space_id(), hint.get_part_id(),
                                    HostAddr(hint.get_leader().get_ip(), hint.get_leader().get_port()));
                            }
                        }
                    }

                    if (!result.get_failed_codes().empty()) {
                        resp_.markFailure();
                    }

                    // Adjust the latency
                    auto latency = result.get_latency_in_us();
                    resp_.setLatency(host, latency, duration_.elapsedInUSec());
                    // Keep the response
                    resp_.responses().emplace_back(std::move(resp));
                }
            }));
        }

3. Also, I just checked: in graphd there are a lot of TCP connections to storage in the CLOSE_WAIT state. Could that be related?

I adjusted conn_timeout_ms on my side, but the problem still shows up. This time there was no data import going on.

A few things to confirm:

  1. Your logs above show both Scan (I see ScanExecutor in the log; where does that come from?) and Insert. Please describe the query scenario in as much detail as possible.
  2. Please post a screenshot of the CLOSE_WAIT connections.