The nebula-stats-exporter service of Dashboard Community Edition (v1.1.1) fails to start. After modifying the config file and starting it again, it reports:
```
I0819 20:36:33.590030 21467 main.go:92] Providing metrics at :9200/metrics
I0819 20:36:33.828367 21467 exporter.go:90] Start collect
I0819 20:36:33.828391 21467 exporter.go:155] Collect nebula:METAD metad0:19559 Metrics
I0819 20:36:33.828399 21467 exporter.go:155] Collect nebula:GRAPHD graphd0:19669 Metrics
I0819 20:36:33.828403 21467 exporter.go:155] Collect nebula:STORAGED storaged0:19779 Metrics
panic: label value "\xf8\xe7d\x02\x00\x00\x00\x00\x02\x00\x00\x001976@Aa\x9e)\u007f\x00\x00PAa\x9e)\u007f\x00\x00PAa\x9e)\u007f\x00\x009" is not valid UTF-8
goroutine 37 [running]:
github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
	/Users/liuxinchao/go/pkg/mod/github.com/prometheus/client_golang@v1.9.0/prometheus/value.go:107
github.com/vesoft-inc/nebula-stats-exporter/exporter.(*NebulaExporter).CollectMetrics(0x0, {0xc000332eb8, 0x7}, {0xc000332f20, 0x0}, {0x13442dc, 0xe}, {0xc000332e18, 0x6}, {0xc00066c000, …}, …)
	/Users/liuxinchao/github.com/kqzh/nebula-stats-exporter/exporter/exporter.go:133 +0x7a5
github.com/vesoft-inc/nebula-stats-exporter/exporter.(*NebulaExporter).collect.func1()
	/Users/liuxinchao/github.com/kqzh/nebula-stats-exporter/exporter/exporter.go:167 +0x25a
created by github.com/vesoft-inc/nebula-stats-exporter/exporter.(*NebulaExporter).collect
	/Users/liuxinchao/github.com/kqzh/nebula-stats-exporter/exporter/exporter.go:160 +0x391
```
kqzh · Aug 22, 2022 06:00 · #2
Hi, could you paste your exporter config.yaml? It looks like a special character in the config is causing this.
```
[root@10-62-0-52 nebula-stats-exporter]# cat config.yaml
clusters:
name: nebula
instances:
name: metad0
endpointIP: 10.62.0.52
endpointPort: 19559
componentType: metad
name: graphd0
endpointIP: 10.62.0.227
endpointPort: 19669
componentType: graphd
name: storaged0
endpointIP: 10.62.0.244
endpointPort: 19779
componentType: storaged
```
kqzh · Aug 22, 2022 08:09 · #5
Could you try a config in this format? I tested it locally and it works fine:
```yaml
clusters:
  - name: nebula
    instances:
      - name: metad0
        endpointIP: 10.62.0.52
        endpointPort: 19559
        componentType: metad
```
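For completeness, the same list format applied to all three instances from the paste above would presumably look like this (host and port values copied verbatim from that config; adjust if your actual file differs):

```yaml
clusters:
  - name: nebula
    instances:
      - name: metad0
        endpointIP: 10.62.0.52
        endpointPort: 19559
        componentType: metad
      - name: graphd0
        endpointIP: 10.62.0.227
        endpointPort: 19669
        componentType: graphd
      - name: storaged0
        endpointIP: 10.62.0.244
        endpointPort: 19779
        componentType: storaged
```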
The following is from a related node_exporter GitHub issue (opened 24 Feb 2021, closed 3 Mar 2021, labeled "bug"):
Disclaimer: I always find the weirdest bugs and apologize in advance. :)
### Host operating system: output of `uname -a`
Linux hho 5.10.18 #3 SMP Wed Feb 24 14:19:12 CET 2021 x86_64 AMD Ryzen 7 PRO 4750U with Radeon Graphics AuthenticAMD GNU/Linux
### node_exporter version: output of `node_exporter --version`
node_exporter, version 1.1.1 (branch: non-git, revision: 4e837d4)
build user: portage@localhost
build date: 20210223-15:25:02
go version: go1.16
platform: linux/amd64
### node_exporter command line flags
--collector.textfile.directory=/run/node_exporter/ --collector.ntp --collector.ntp.server-is-local --collector.qdisc --no-collector.bcache --no-collector.bonding --no-collector.conntrack --no-collector.infiniband --no-collector.ipvs --no-collector.mdadm --no-collector.nfs --no-collector.nfsd --no-collector.powersupplyclass --no-collector.schedstat --no-collector.time --no-collector.uname --no-collector.xfs --no-collector.zfs --log.level=warn
### Are you running node_exporter in Docker?
No
### What did you do that produced an error?
Enabled collector.powersupplyclass (disabled above)
### What did you expect to see?
Valid output of battery status
### What did you see instead?
```
panic: label value "LNV-5B10W13894\xc0\xc0" is not valid UTF-8
goroutine 28 [running]:
github.com/prometheus/client_golang/prometheus.MustNewConstMetric(...)
/tmp/portage/app-metrics/node_exporter-1.1.1/homedir/go/pkg/mod/github.com/prometheus/client_golang@v1.9.0/prometheus/value.go:107
github.com/prometheus/node_exporter/collector.(*powerSupplyClassCollector).Update(0xc00035e5a0, 0xc00007f6e0, 0x10afc20, 0x0)
/tmp/portage/app-metrics/node_exporter-1.1.1/work/node_exporter-1.1.1/collector/powersupplyclass.go:166 +0x27f9
github.com/prometheus/node_exporter/collector.execute(0xbf759e, 0x10, 0xccec60, 0xc00035e5a0, 0xc00007f6e0, 0xcce520, 0xc0001219b0)
```
### Analysis
This is not strictly a bug in node_exporter, but since it crashes (due to the intentional panic in MustNewConstMetric) I figured I report it here anyway. Let me know if it's better suited for client_golang. It's not clear that anybody except the hardware is at fault here.
The problem is caused by garbage bytes returned by a sysfs entry and can be reduced to inspecting the `model_name` attribute of the battery in question:
```
$cat /sys/class/power_supply/BAT0/model_name
LNV-5B10W13894��
```
I traced this back into the drivers/acpi/battery driver and indeed the two extraneous `0xc0` bytes are returned by the hardware (see [here](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/acpi/battery.c#n295); battery is populated in line 204), so AFAICT it's not just random byte garbage due to an uninitialized char[] or something like that. The two bytes are always the same.
My XFCE desktop environment can display the value correctly in xfce-power-manager, presumably because it normalizes the string value (probably by accident).
### Possible Mitigations
It's not clear how to 'best' fix this since this seems a problem with this particular battery model and/or my BIOS (1.16, updated a few days ago). I don't know whether it happened before.
My first idea was to sanitize all characters always and everywhere, but I realized that would be overly drastic, regardless of whether they are sanitized in the kernel, in client_golang or in node_exporter.
Instead I believe it makes more sense to be less strict in client_golang when reading an attribute value in NewConstMetric() (called by MustNewConstMetric [here](https://github.com/prometheus/client_golang/blob/master/prometheus/value.go#L87)). If the "invalid UTF-8" panic can be avoided - possibly by catching it and simply ignoring the bogus characters - the (completely unnecessary) performance impact of unconditional sanitization would be mitigated. I don't know enough about Go character set conversions to tell whether that is feasible.
It seems someone suggested using strings.ToValidUTF8 to fix this:
“Yea, that seems like something we should sanitize between the kernel and client_golang. We could use strings.ToValidUTF8 to wrap the label values.”
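For illustration, here is a minimal, self-contained sketch (not the exporter's or node_exporter's actual code; the metric and label names are invented) of how an invalid UTF-8 label value makes `MustNewConstMetric` panic, and how wrapping the value with `strings.ToValidUTF8` avoids it:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Hypothetical metric with one variable label, purely for demonstration.
	desc := prometheus.NewDesc(
		"demo_space_info",
		"Demo metric with a space-name label.",
		[]string{"space_name"},
		nil,
	)

	// A label value carrying trailing garbage bytes, similar to the panics above.
	raw := "LNV-5B10W13894\xc0\xc0"

	// This call would panic with: label value "..." is not valid UTF-8
	// prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, 1, raw)

	// Replacing invalid byte sequences first keeps client_golang happy.
	clean := strings.ToValidUTF8(raw, "\uFFFD")
	m := prometheus.MustNewConstMetric(desc, prometheus.GaugeValue, 1, clean)
	fmt.Printf("sanitized label value: %q, metric desc: %s\n", clean, m.Desc())
}
```

The replacement string here is the Unicode replacement character; passing an empty string instead would simply drop the bad bytes.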
kqzh · Aug 22, 2022 11:13 · #8
So it doesn't look like a config problem. Could you share the NebulaGraph version you're using, and the storaged stats → 10.62.0.244:19779/stats?
Version: nebula-graph-3.1.0.el7.x86_64.rpm
Dashboard version: nebula-dashboard-1.1.1.x86_64.tar.gz
The strange thing is that I deployed two environments (staging and production) with this version. Production works fine; the problem only occurs in the staging environment.
kqzh · Aug 29, 2022 06:48 · #11
Could you provide the data from the storaged endpointIP:endpointPort/stats, e.g. 10.62.0.244:19779/stats?
I suspect that some of the label values the exporter generates automatically are the problem.
kqzh · Aug 29, 2022 07:02 · #13
The stats look fine. You could try upgrading the exporter; it is backward compatible, so you can use the latest release directly: nebula-stats-exporter-linux-amd64-v3.1.0.
The graphd stats seem to be the problem, but the garbled space IDs are not spaces that we created.
graphd_stat.txt (66.6 KB)
After setting --enable_space_level_metrics to false on graphd, the stats-exporter starts fine; it was set to true before.
Is this a bug?
kqzh · Aug 29, 2022 07:30 · #18
Got it. What follows `space` should be the space_name, so this looks like a character-set issue. As a workaround, you can set graphd's --enable_space_level_metrics config to false so that graphd stops exposing space-level metrics; that should resolve it for now.
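For reference, this flag is normally set in the graphd configuration file (typically nebula-graphd.conf; the exact path depends on your installation), and graphd usually needs a restart for the change to take effect. A minimal sketch, assuming that file layout:

```
# nebula-graphd.conf (assumed path; adjust to your deployment)
# Stop exposing per-space metrics so no space-name label values are emitted
--enable_space_level_metrics=false
```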
kqzh · Aug 29, 2022 07:32 · #19
Yes, exactly. Could you also share the spaces that currently exist in the cluster? We can ask @jievince to take a look.
Yes, that's what I'm doing now.
As for the character-set issue, does it need to be configured on graphd? As far as I can tell, the default character set for all these services is utf8, right?