A question about monitoring metrics in Nebula 1.1.0

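laughing · September 27, 2021, 08:15

Nebula version: v1.1.0 (an old release, but our production workload still runs on it).

We recently started using Nebula's monitoring.
Two of the metrics are:
nebula_graphd_metaClient_latency_p99_60 (P99 latency of the requests the graph service sent through metaclient over the last 60 s)
nebula_graphd_metaClient_qps_count_60 (total count of the requests the graph service sent through metaclient over the last 60 s)

The charts relate roughly as shown in the figure.
The latency of the requests sent by the graph service's metaclient is [inversely related] to the number of requests the graph service sends to meta:

the more requests the graph service sends to meta, the lower the P99 latency;

the fewer requests the graph service sends to meta, the higher the P99 latency.

Normally, shouldn't a higher QPS mean a higher P99 latency? Why does the monitoring show the opposite relationship? I read the metric descriptions, and they say essentially what I wrote above.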

steam · September 27, 2021, 08:15

Which tool is this? Please also post the version you are using.

laughing · September 27, 2021, 08:18

We use nebula-stats-exporter, the v1 release.

It converts the graph database's raw metrics into the corresponding Prometheus metrics.

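laughing · September 27, 2021, 08:21

This issue matters a lot to us. Our production environment is still on 1.1.0 for now. I read through the meta metrics code in 1.1.0.
The latency and QPS accounting in metaclient looks correct to me. A value is recorded only when the request succeeds.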

github.com

vesoft-inc/nebula/blob/v1.1.0/src/meta/client/MetaClient.cpp

/* Copyright (c) 2018 vesoft inc. All rights reserved.
 *
 * This source code is licensed under Apache 2.0 License,
 * attached with Common Clause Condition 1.0, found in the LICENSES directory.
 */

#include "time/Duration.h"
#include "meta/common/MetaCommon.h"
#include "meta/client/MetaClient.h"
#include "network/NetworkUtils.h"
#include "meta/NebulaSchemaProvider.h"
#include "meta/ClusterIdMan.h"
#include "meta/GflagsManager.h"
#include "base/Configuration.h"
#include "stats/StatsManager.h"
#include <folly/ScopeGuard.h>


DEFINE_int32(heartbeat_interval_secs, 3, "Heartbeat interval");
DEFINE_int32(meta_client_retry_times, 3, "meta client retry times, 0 means no retry");

(This file has been truncated.)

The class that does the statistics is:

github.com

vesoft-inc/nebula/blob/v1.1.0/src/common/stats/Stats.cpp

/* Copyright (c) 2019 vesoft inc. All rights reserved.
 *
 * This source code is licensed under Apache 2.0 License,
 * attached with Common Clause Condition 1.0, found in the LICENSES directory.
 */

#include "stats/StatsManager.h"
#include "stats/Stats.h"

DEFINE_int32(histogram_bucketSize, 1000, "The width of each bucket");
DEFINE_uint32(histogram_min, 1, "The smallest value for the bucket range");
DEFINE_uint32(histogram_max, 1000 * 1000, "The largest value for the bucket range");

namespace nebula {
namespace stats {

Stats::Stats(const std::string& serverName, const std::string& moduleName) {
    qpsStatId_ = StatsManager::registerStats(serverName + "_" + moduleName + "_qps");
    errorQpsStatId_ = StatsManager::registerStats(serverName + "_" + moduleName + "_error_qps");
    latencyStatId_ = StatsManager::registerHisto(serverName + "_" + moduleName + "_latency",

(This file has been truncated.)
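The three flags visible in the truncated Stats.cpp excerpt describe a bucketed histogram: bucket width 1000, range 1 to 1,000,000. The sketch below only illustrates what bucketing implies for a percentile readout. It assumes a simple estimator that reports the upper edge of the bucket containing the target rank. `BucketHistogram` and `p99OfConstant` are hypothetical names, not Nebula classes, and the real StatsManager may compute percentiles differently (for example, by interpolating inside a bucket). The excerpt also does not state the unit of the recorded values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-width histogram; NOT Nebula's StatsManager implementation.
struct BucketHistogram {
    std::uint32_t bucketSize;
    std::uint32_t lower;
    std::uint32_t upper;
    std::vector<std::uint64_t> counts;
    std::uint64_t total = 0;

    BucketHistogram(std::uint32_t size, std::uint32_t lo, std::uint32_t hi)
        : bucketSize(size), lower(lo), upper(hi),
          counts((hi - lo + size - 1) / size, 0) {}

    // Clamp the value into [lower, upper) and count it in its bucket.
    void add(std::uint32_t value) {
        if (value < lower) value = lower;
        if (value >= upper) value = upper - 1;
        ++counts[(value - lower) / bucketSize];
        ++total;
    }

    // Assumed estimator: upper edge of the bucket that contains the
    // nearest-rank position of the requested percentile.
    std::uint64_t percentileUpperBound(double pct) const {
        if (total == 0) return 0;
        std::uint64_t target = static_cast<std::uint64_t>(
            std::ceil(pct / 100.0 * static_cast<double>(total)));
        if (target == 0) target = 1;
        std::uint64_t seen = 0;
        for (std::size_t i = 0; i < counts.size(); ++i) {
            seen += counts[i];
            if (seen >= target) {
                std::uint64_t edge = static_cast<std::uint64_t>(lower) +
                                     (i + 1) * static_cast<std::uint64_t>(bucketSize);
                return std::min<std::uint64_t>(edge, upper);
            }
        }
        return upper;
    }
};

// Fill a histogram (flag defaults from the excerpt) with n identical samples
// and return the estimated P99.
std::uint64_t p99OfConstant(std::uint32_t value, std::size_t n) {
    BucketHistogram h(1000, 1, 1000 * 1000);
    for (std::size_t i = 0; i < n; ++i) h.add(value);
    return h.percentileUpperBound(99.0);
}
```

Under these assumptions, a window whose samples are all 300 and one whose samples are all 900 report the same estimate, because both values fall into the first bucket. Differences smaller than one bucket width are invisible to this kind of readout.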

laughing · September 27, 2021, 08:35

Could the developers take a look at this issue?

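critical27 · September 27, 2021, 10:10

It's hard to say right now, mainly because the metric is too coarse-grained: every RPC interface that goes through MetaClient calls addStatsValue.

Either refine the metric and look at latency per interface, or try measuring latency with no operations running, when only the default heartbeat is active.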
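To make the coarse-granularity point concrete, here is a minimal sketch of how a P99 computed over a mix of call types could fall as the request count rises. This is one possible mechanism, not a confirmed diagnosis of this cluster. The function names and the latency values (200 and 50000) are invented for illustration, and the percentile is a plain nearest-rank calculation over raw samples, not Nebula's histogram-based estimate.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile over raw samples (illustrative, not Nebula's code).
double percentile(std::vector<double> samples, double pct) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    // 1-based rank of the requested percentile.
    std::size_t rank = static_cast<std::size_t>(
        std::ceil(pct / 100.0 * static_cast<double>(samples.size())));
    if (rank == 0) rank = 1;
    return samples[rank - 1];
}

// One 60 s window: `cheapCalls` fast requests plus `costlyCalls` slow ones,
// all recorded into the same aggregated latency series.
std::vector<double> makeWindow(std::size_t cheapCalls, std::size_t costlyCalls,
                               double cheapLatency, double costlyLatency) {
    std::vector<double> window(cheapCalls, cheapLatency);
    window.insert(window.end(), costlyCalls, costlyLatency);
    return window;
}

// Aggregated P99 for a window with one slow call and a varying number of
// fast calls (hypothetical latencies).
double aggregatedP99(std::size_t cheapCalls) {
    return percentile(makeWindow(cheapCalls, 1, 200.0, 50000.0), 99.0);
}
```

With 20 fast calls and one slow call, the slow call is the P99 sample. With 500 fast calls and the same single slow call, the slow call is pushed above the 99th percentile and the reported P99 drops to the fast-call latency, even though no individual request got faster. Per-interface metrics would show whether something like this is happening.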

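laughing · September 27, 2021, 11:16

But this metric is computed from the requests that graph makes through metaclient. Even if the granularity is too coarse, QPS and latency should not end up inversely related.

As above:
The charts relate roughly as shown in the figure.
The latency of the requests sent by the graph service's metaclient is [inversely related] to the number of requests the graph service sends to meta:

the more requests the graph service sends to meta, the lower the P99 latency;

the fewer requests the graph service sends to meta, the higher the P99 latency.

Normally, shouldn't a higher QPS mean a higher P99 latency? Why does the monitoring show the opposite relationship? I read the metric descriptions, and they say essentially what I wrote above.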

laughing · September 28, 2021, 01:35

My question is: shouldn't a higher QPS mean a higher P99 latency? Why does the monitoring currently show the opposite?

critical27 · September 28, 2021, 03:55

I'd like to know too, but the current data isn't enough to explain why, so I think refining the metric is necessary.

laughing · September 28, 2021, 06:43

Does refining mean changing the source code?
This metric is one that open-source Nebula provides.

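steam · October 8, 2021, 03:39

The refinement mentioned above means that our metrics could be made finer-grained. You can think of it as an optimization (not a modification of the existing source, but rather an addition).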
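As a rough picture of what finer-grained metrics could look like, the sketch below keeps a separate latency series per interface instead of one shared series. Everything in it is hypothetical: `PerMethodLatency`, `sampleWindow`, and the labels "cheapCall" and "costlyCall" are invented for illustration and do not correspond to Nebula's StatsManager API or to any metric Nebula actually exposes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-interface latency recorder (not part of Nebula).
class PerMethodLatency {
public:
    // Store one latency sample under the interface name that produced it.
    void record(const std::string& method, double latency) {
        samples_[method].push_back(latency);
    }

    // Number of samples recorded for one interface.
    std::size_t count(const std::string& method) const {
        auto it = samples_.find(method);
        return it == samples_.end() ? 0 : it->second.size();
    }

    // Nearest-rank P99 for one interface; 0.0 if it has no samples.
    double p99(const std::string& method) const {
        auto it = samples_.find(method);
        if (it == samples_.end() || it->second.empty()) return 0.0;
        std::vector<double> v = it->second;  // copy so the stored data stays unsorted
        std::size_t rank = static_cast<std::size_t>(
            std::ceil(0.99 * static_cast<double>(v.size())));
        if (rank == 0) rank = 1;
        auto nth = v.begin() + static_cast<std::ptrdiff_t>(rank - 1);
        std::nth_element(v.begin(), nth, v.end());
        return *nth;
    }

private:
    std::map<std::string, std::vector<double>> samples_;
};

// Example window with invented numbers: many fast calls, one slow call.
PerMethodLatency sampleWindow() {
    PerMethodLatency stats;
    for (int i = 0; i < 500; ++i) stats.record("cheapCall", 200.0);
    stats.record("costlyCall", 50000.0);
    return stats;
}
```

Split this way, each interface reports its own P99 and request count, so a change in the traffic mix no longer moves a single combined percentile.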

system · November 7, 2021, 03:40

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.