graphd service crashes unexpectedly

kerman · December 14, 2022, 09:17

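Nebula version: v2.6.1
Deployment: distributed
Installation method: RPM
Production environment: Y
Hardware
6 machines
Each with:
Disk: 1 TB HDD
CPU / memory: 16 cores, 64 GB
Problem description
Symptom: the graphd service crashed suddenly in the middle of the night and I don't know why. I checked nebula-graphd.ERROR and stderr.log but found nothing useful; they contain only the error "More than one request trying to add/update/delete one edge/vertex at the same time". At that time only the system's automated queries and inserts should have been running; nobody was running commands by hand.
The dmp file is attached:
err.dmp (1.3 MB)
Could you help find the cause? Also, how do I read a dmp file?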

min.wu · December 15, 2022, 03:51

Try searching the forum for "core dump"; someone has probably run into this before.

spw · December 16, 2022, 01:59

Could you convert the core dump to text and paste it here?

kerman · December 20, 2022, 08:06

I just don't know how to convert it. The file is all garbled characters and I can't make sense of it.

kerman · December 20, 2022, 08:07

We still need to read the dump file to pin down the actual problem before it can be fixed; I have no idea where the problem is.

HarrisChu · December 20, 2022, 12:08

#0  0x00007f9de58fecf5 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x0000000000eb2299 in void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_construct<char*>(char*, char*, std::forward_iterator_tag) ()
#2  0x0000000000ef71a7 in nebula::meta::cpp2::QueryDesc::QueryDesc(nebula::meta::cpp2::QueryDesc const&) ()
#3  0x0000000000ef7214 in ?? ()
#4  0x0000000000ef727f in _ZNSt10_HashtableIlSt4pairIKlN6nebula4meta4cpp29QueryDescEESaIS6_ENSt8__detail10_Select1stESt8equal_toIlESt4hashIlENS8_18_Mod_range_hashingENS8_20_Default_ranged_hashENS8_20_Prime_rehash_policyENS8_17_Hashtable_traitsILb0ELb0ELb1EEEE9_M_assignIZNSJ_C4ERKSJ_EUlPKNS8_10_Hash_nodeIS6_Lb0EEEE_EEvSM_RKT_ ()
#5  0x0000000000ef1c0a in nebula::graph::GraphSessionManager::updateSessionsToMeta() ()
#6  0x0000000000ef2c65 in nebula::graph::GraphSessionManager::threadFunc() ()
#7  0x0000000000ef80bd in std::enable_if<std::is_void<std::result_of<void (nebula::graph::GraphSessionManager::*(nebula::graph::GraphSessionManager*))()>::type>::value, folly::SemiFuture<folly::Unit> >::type nebula::thread::GenericWorker::addDelayTask<void (nebula::graph::GraphSessionManager::*)(), nebula::graph::GraphSessionManager*>(unsigned long, void (nebula::graph::GraphSessionManager::*&&)(), nebula::graph::GraphSessionManager*&&)::{lambda()#1}::operator()() const ()
#8  0x000000000192bc2b in ?? ()
#9  0x0000000001e28f13 in ?? ()
#10 0x0000000001e295e7 in event_base_loop ()
#11 0x000000000192c1ad in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (*)(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::function<void ()> const&), std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::_Bind<void (nebula::thread::GenericWorker::*(nebula::thread::GenericWorker*))()> > > >::_M_run() ()
#12 0x00000000022dec20 in ?? ()
#13 0x00007f9de5b7fdd5 in __nptl_deallocate_tsd () from /lib64/libpthread.so.0
#14 0x00007f9d991ff700 in ?? ()
#15 0x494b2233b17aa79e in ?? ()
#16 0x494bdb63d080a79e in ?? ()
#17 0x0000000000000000 in ?? ()
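Frame #4 is printed in mangled form, which is part of why the trace is hard to read. Below is a minimal C++ sketch for demangling such names, assuming a GCC or Clang toolchain that provides `abi::__cxa_demangle` in `<cxxabi.h>`. The helper names `demangle` and `demangleExample` are made up for illustration and are not part of Nebula.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <memory>
#include <string>

// Hypothetical helper: returns the demangled form of an Itanium C++ ABI
// symbol name, or the input unchanged if it cannot be demangled.
std::string demangle(const std::string& mangled) {
    int status = 0;
    // __cxa_demangle returns a malloc'ed buffer that the caller must free.
    std::unique_ptr<char, void (*)(void*)> out(
        abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status),
        std::free);
    return (status == 0 && out) ? std::string(out.get()) : mangled;
}

// Usage example: the mangled spelling of the function shown in frame #6.
void demangleExample() {
    assert(demangle("_ZN6nebula5graph19GraphSessionManager10threadFuncEv") ==
           "nebula::graph::GraphSessionManager::threadFunc()");
}
```

Passing the frame #4 symbol to `demangle` should yield a `std::_Hashtable<long, std::pair<long const, nebula::meta::cpp2::QueryDesc>, ...>::_M_assign` instantiation, which is what copying a hash map keyed by `long` with `QueryDesc` values would produce.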

kerman · December 21, 2022, 00:03

I can't make sense of it. Does it look like a query caused this?

HarrisChu · December 21, 2022, 00:42

Did you run kill query?
Let's have the graph developers take a look.

yee · December 21, 2022, 02:43

This looks like memory corruption. The corresponding code needs to be reviewed for a possible concurrency problem.
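To illustrate the kind of concurrency problem meant here, below is a minimal C++ sketch. It is not a diagnosis of the actual Nebula code. It assumes a map of running queries that one thread modifies while a background thread copies it for reporting. `QueryDesc` here is a hypothetical stand-in for the type in the backtrace, and `Session`, `addQuery`, `removeQuery`, `snapshot`, and `snapshotExample` are invented names.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for nebula::meta::cpp2::QueryDesc.
struct QueryDesc {
    std::string query;
};

// Hypothetical session whose query map is shared between threads.
class Session {
public:
    void addQuery(int64_t id, std::string text) {
        std::lock_guard<std::mutex> guard(lock_);
        queries_[id] = QueryDesc{std::move(text)};
    }

    void removeQuery(int64_t id) {
        std::lock_guard<std::mutex> guard(lock_);
        queries_.erase(id);
    }

    // Copying the map walks every node and copies every string, so it takes
    // the same lock as the writers. An unlocked copy that races with
    // addQuery/removeQuery is undefined behavior and can read freed memory.
    std::unordered_map<int64_t, QueryDesc> snapshot() const {
        std::lock_guard<std::mutex> guard(lock_);
        return queries_;
    }

private:
    mutable std::mutex lock_;
    std::unordered_map<int64_t, QueryDesc> queries_;
};

// Usage example: a snapshot is an independent copy of the map.
void snapshotExample() {
    Session s;
    s.addQuery(1, "example query");
    auto snap = s.snapshot();
    s.removeQuery(1);
    assert(snap.size() == 1);
    assert(s.snapshot().empty());
}
```

The design choice in this sketch is to copy under the lock and do any slow work, such as sending the data elsewhere, on the copy after the lock is released.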

kerman · December 22, 2022, 00:52

This happened in the middle of the night, so kill query should not have been run.

kerman · December 26, 2022, 00:34

Does this refer to the Nebula source code? Are there any findings yet?

yee · December 26, 2022, 02:09

If it can't be reproduced reliably, all we can do for now is file an issue; it will be prioritized and fixed later.

github.com/vesoft-inc/nebula

graphd crash when executing updateSessionsToMeta

opened 02:08AM - 26 Dec 22 UTC

yixinglu

type/bug severity/none affects/none

**Please check the FAQ documentation before raising an issue**

**Describe the bug (__required__)**
https://discuss.nebula-graph.com.cn/t/topic/11670/6

**Your Environments (__required__)**
* OS: `uname -a`
* Compiler: `g++ --version` or `clang++ --version`
* CPU: `lscpu`
* Commit id (e.g. `a3ffc7d8`)

**How To Reproduce(__required__)**
Steps to reproduce the behavior:
1. Step 1
2. Step 2
3. Step 3

**Expected behavior**

**Additional context**

kerman · December 26, 2022, 03:36

OK. It is intermittent.

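kerman · January 5, 2023, 03:33

The problem occurred again. I've found that it shows up after a large-scale data import, but I can't tell how many days later; the delay varies. Whenever I run an import, though, graphd eventually crashes. The data volume is usually in the hundreds of millions of records.
After each import I run compact and stats.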

steam · January 5, 2023, 03:39

If you have more information, you can click through to yee's issue above and add it there.

yee · January 6, 2023, 02:17

@HarrisChu Have you run into or tested this kind of problem before?

HarrisChu · January 6, 2023, 02:28

I haven't run into it.

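xtcyclist · January 6, 2023, 15:02

Hi, could you describe the process in more detail? After the import finished, did you run queries right away, and was graphd healthy at that point? You said you "don't know how many days later" it happens; what was done during those days? Was the cluster left untouched? Did you do anything with sessions, or were they kept alive the whole time? Most importantly, was anything running on graphd when the crash happened?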

xtcyclist · January 6, 2023, 15:04

Also, does it crash in the same place every time? If it happens again, please provide more core dumps. Thanks!

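kerman · January 9, 2023, 10:40

Our application consumes data from Kafka, so after an import finishes there are continuous queries and inserts, and graphd is healthy at that point. During those days it just keeps consuming from Kafka and running queries and inserts (roughly 100 million operations per day). There are no other operations. The crashes usually happen in the middle of the night, when only the Kafka consumption should be running and nothing else touches graphd. We wrote a session pool to manage sessions; as long as nothing goes wrong, sessions are reused indefinitely.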