Graph process crashes with a CPU spike (not OOM)

Question template:

  • nebula version: v2.0.1
  • Deployment (distributed / standalone / Docker / DBaaS): distributed
  • Production environment: Y
  • Hardware
    • Disk (SSD recommended): SSD
    • CPU and memory: 16 cores, 128
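  • Problem description
    In testing so far, with 9 concurrent requests Graph goes down and CPU spikes, though memory has not reached OOM.
    With a single thread, CPU still spikes, but the Graph process does not necessarily go down.
    The main statement is
    FIND SHORTEST PATH , UPTO 10 STEPS
    In between, fetch prop statements are used to get entity information.
  • Relevant meta / storage / graph info logs (text form preferred, so they are searchable)
  • core information
    core_error0909.log (12.5 KB)
  • graph stderr
    0909.log (24.9 KB)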

Could you please take a look at what the cause is and how this can be optimized?

Please post the statements used in the concurrency test.

It is called through an API. Inside the API, only two statements interact with nebula:
1. FIND SHORTEST PATH FROM a TO b OVER c,d,e,f,g UPTO 10 STEPS
2. fetch prop Vertex / Edge

Was a core dump file produced when graph went down?

Yes. The core_error0909.log above is the information I saved.

Is this always reproducible? The core file shows only folly stack frames. Roughly how much data do you have?

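Yes, it always reproduces. I just tried again: sending 20 requests one after another also raises CPU, but it is not as severe as graph going down.
As for the data volume, I checked yesterday: the five-hop expansion from one request's start vertex yields more than 1M, while the end vertex yields far fewer, only 20.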

A CPU rise is normal; graph going down is not.

If find path is given 10 steps, does each endpoint search five levels, or do both search up to 10? Is this very CPU-intensive?

It is a bidirectional BFS, so each side searches five levels. It is definitely CPU-intensive; the workload is compute-bound.
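To put rough numbers on "five levels from each side", here is a small back-of-the-envelope sketch. It assumes a uniform branching factor (every vertex contributes the same number of new neighbours), which real graphs with super vertices do not satisfy, and the function names are made up for illustration.

```python
def vertices_within(branching, depth):
    """Vertices within `depth` hops, assuming each vertex adds `branching` new neighbours."""
    return sum(branching ** k for k in range(depth + 1))


def compare(branching, total_steps=10):
    """Return (one-sided search size, two-sided search size) for a step budget."""
    half = total_steps // 2
    one_sided = vertices_within(branching, total_steps)   # a single BFS going 10 levels
    two_sided = 2 * vertices_within(branching, half)      # two BFS runs going 5 levels each
    return one_sided, two_sided


if __name__ == "__main__":
    one_sided, two_sided = compare(branching=10)
    print(f"one-sided: {one_sided}, two-sided: {two_sided}")
```

Even with the two-sided search, the work grows with the fifth power of the branching factor, so a start vertex whose five-hop neighbourhood exceeds 1M is still expensive to expand.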

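Understood. Is there any room for optimization here?
Also, this request then gets a number of vertices and runs fetch prop on * vid… for them. That can be hundreds or thousands of vertices or edges. Could this cause graph to go down?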

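Does each side really have to search five levels? I am thinking: if A is a super vertex and B is sparse, then by the time A has expanded three levels, B might already have finished all ten. Checking whether A appears in B's result set would then be faster, wouldn't it?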
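The idea in this question can be sketched as a bidirectional BFS that, at each step, expands whichever side currently has the smaller frontier instead of strictly alternating. This is only an in-memory illustration of the technique, not how Nebula Graph implements FIND SHORTEST PATH; `build_adjacency`, `bidirectional_bfs`, and the `strategy` parameter are hypothetical names. Edges are treated as undirected, similar in spirit to a BIDIRECT traversal.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build undirected adjacency sets from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def bidirectional_bfs(adj, src, dst, max_steps=10, strategy="alternate"):
    """Return (shortest path length or None, number of vertices expanded).

    strategy="alternate": the two sides take turns, one full level each.
    strategy="smaller":   always expand the side with the smaller frontier.
    """
    if src == dst:
        return 0, 0
    # Each side tracks distances from its endpoint, its frontier, and its depth.
    sides = [
        {"dist": {src: 0}, "frontier": [src], "depth": 0},
        {"dist": {dst: 0}, "frontier": [dst], "depth": 0},
    ]
    expanded = 0
    turn = 0
    # The combined depth of both sides never exceeds the step budget.
    while sides[0]["depth"] + sides[1]["depth"] < max_steps:
        if strategy == "smaller":
            smaller_is_src = len(sides[0]["frontier"]) <= len(sides[1]["frontier"])
            turn = 0 if smaller_is_src else 1
        me, other = sides[turn], sides[1 - turn]
        if not me["frontier"]:
            return None, expanded  # this side is exhausted, so no path exists
        next_frontier = []
        for u in me["frontier"]:
            expanded += 1
            for v in adj.get(u, ()):
                if v in other["dist"]:
                    # Whole levels are expanded, so the first meeting is a shortest path.
                    return me["dist"][u] + 1 + other["dist"][v], expanded
                if v not in me["dist"]:
                    me["dist"][v] = me["dist"][u] + 1
                    next_frontier.append(v)
        me["frontier"] = next_frontier
        me["depth"] += 1
        if strategy == "alternate":
            turn = 1 - turn
    return None, expanded
```

On a graph where one endpoint is a hub with many neighbours and the other sits at the end of a sparse chain, the "smaller" strategy avoids expanding the hub's neighbours a second time. Whether the sparse side can really be explored deeper within the same step budget depends on the data, so treat this as a heuristic to evaluate rather than a guaranteed win.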

core0910.log (8.2 KB)
0910.txt (21.9 KB)

These are newly collected logs from another graph down. Could you please take a look?

@jmq2020 Hello, could you please take a look? I ran it again today and it went down again. The graph stderr looks basically the same as the one above.

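Collected information:
The error occurs when running the statement below. In this test the second vertex is a dense vertex, with 190,000 records within five levels.
After this error the leader changed, and then nebula-go kept failing to get a session. Could this cause graph to go down?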

(root@nebula) [ProdRelation]> FIND SHORTEST PATH FROM -7848017109177591853 TO -5777877239014968534 OVER Invest, Legal, Employ, Branch, HisLegal, HisInvest, HisEmploy BIDIRECT UPTO 10 Steps
[ERROR (-8)]: Storage Error: part: 3, error: E_RPC_FAILURE(-3).
2021/09/13 10:25:32 main.go:30: stmt: Use ProdRelation; FIND SHORTEST PATH FROM -7848017109177591853 TO -5777877239014968534 OVER Invest, Legal, Employ, Branch, HisLegal, HisInvest, HisEmploy BIDIRECT UPTO 10 Steps
2021/09/13 10:27:19 logger.go:31: [ERROR] Use ProdRelation; FIND SHORTEST PATH FROM -7848017109177591853 TO -5777877239014968534 OVER Invest, Legal, Employ, Branch, HisLegal, HisInvest, HisEmploy BIDIRECT UPTO 10 Steps, ErrorCode: -8, ErrorMsg: Storage Error: part: 24, error: E_RPC_FAILURE(-3).
2021/09/13 10:27:19 main.go:43: Runstmt:  checkResultSet error
2021/09/13 10:27:19 main.go:30: stmt: Use ProdRelation; FIND SHORTEST PATH FROM -7848017109177591853 TO -5363303998853914166 OVER Invest, Legal, Employ, Branch, HisLegal, HisInvest, HisEmploy BIDIRECT UPTO 10 Steps
2021/09/13 10:27:19 logger.go:31: [ERROR] Use ProdRelation; FIND SHORTEST PATH FROM -7848017109177591853 TO -5363303998853914166 OVER Invest, Legal, Employ, Branch, HisLegal, HisInvest, HisEmploy BIDIRECT UPTO 10 Steps, ErrorCode: -8, ErrorMsg: Storage Error: The leader has changed. Try again later
2021/09/13 10:27:19 main.go:43: Runstmt:  checkResultSet error
2021/09/13 10:27:19 main.go:30: stmt: Use ProdRelation; FIND SHORTEST PATH FROM -7848017109177591853 TO -3150026031308625707 OVER Invest, Legal, Employ, Branch, HisLegal, HisInvest, HisEmploy BIDIRECT UPTO 10 Steps
2021/09/13 10:27:19 logger.go:31: [ERROR] Use ProdRelation; FIND SHORTEST PATH FROM -7848017109177591853 TO -3150026031308625707 OVER Invest, Legal, Employ, Branch, HisLegal, HisInvest, HisEmploy BIDIRECT UPTO 10 Steps, ErrorCode: -8, ErrorMsg: Storage Error: The leader has changed. Try again later
2021/09/13 10:27:19 main.go:43: Runstmt:  checkResultSet error
2021/09/13 10:27:19 main.go:30: stmt: Use ProdRelation; FIND SHORTEST PATH FROM -7848017109177591853 TO -5252414605977103781 OVER Invest, Legal, Employ, Branch, HisLegal, HisInvest, HisEmploy BIDIRECT UPTO 10 Steps
2021/09/13 10:27:19 logger.go:31: [ERROR] Use ProdRelation; FIND SHORTEST PATH FROM -7848017109177591853 TO -5252414605977103781 OVER Invest, Legal, Employ, Branch, HisLegal, HisInvest, HisEmploy BIDIRECT UPTO 10 Steps, ErrorCode: -8, ErrorMsg: Storage Error: The leader has changed. Try again later
2021/09/13 10:27:19 main.go:43: Runstmt:  checkResultSet error
2021/09/13 10:27:19 main.go:30: stmt: Use ProdRelation; FIND SHORTEST PATH FROM -7848017109177591853 TO 2887011443973716723 OVER Invest, Legal, Employ, Branch, HisLegal, HisInvest, HisEmploy BIDIRECT UPTO 10 Steps
2021/09/13 10:27:19 logger.go:31: [ERROR] Use ProdRelation; FIND SHORTEST PATH FROM -7848017109177591853 TO 2887011443973716723 OVER Invest, Legal, Employ, Branch, HisLegal, HisInvest, HisEmploy BIDIRECT UPTO 10 Steps, ErrorCode: -8, ErrorMsg: Storage Error: The leader has changed. Try again later
2021/09/13 10:27:19 main.go:43: Runstmt:  checkResultSet error
2021/09/13 10:27:19 main.go:30: stmt: Use ProdRelation; FIND SHORTEST PATH FROM -7848017109177591853 TO 4922439781533261677 OVER Invest, Legal, Employ, Branch, HisLegal, HisInvest, HisEmploy BIDIRECT UPTO 10 Steps
2021/09/13 10:27:24 logger.go:31: [ERROR] Error info: read tcp 172.16.188.142:33540->10.0.7.251:9669: read: connection reset by peer
2021/09/13 10:27:24 logger.go:31: [ERROR] session.Execute err: read tcp 172.16.188.142:33540->10.0.7.251:9669: read: connection reset by peer
2021/09/13 10:27:24 logger.go:27: [WARNING] Sign out failed, write tcp 172.16.188.142:33540->10.0.7.251:9669: write: broken pipe
2021/09/13 10:27:24 main.go:43: Runstmt:  read tcp 172.16.188.142:33540->10.0.7.251:9669: read: connection reset by peer
2021/09/13 10:27:24 main.go:30: stmt: Use ProdRelation; FIND SHORTEST PATH FROM -7848017109177591853 TO 7872708422411373386 OVER Invest, Legal, Employ, Branch, HisLegal, HisInvest, HisEmploy BIDIRECT UPTO 10 Steps

This will not cause graph down. The leader change problem has been mitigated in version 2.5.0; how about upgrading and trying again?
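Separately from the graph down itself, the client log above shows each following statement being sent in the same second as the "leader has changed" failure. A minimal sketch of backing off before retrying is given below. It assumes a hypothetical `execute(stmt)` callable that returns `(error_code, error_msg, rows)` with `0` meaning success; that contract is made up for illustration and is not the nebula-go API. The logic is language-agnostic and is shown in Python only for brevity.

```python
import time

# Message text taken from the client log above; matching on it is an assumption.
RETRYABLE_MARKER = "The leader has changed"


def run_with_retry(execute, stmt, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run stmt via the hypothetical execute(); retry leader-change errors with backoff."""
    for attempt in range(attempts):
        code, msg, rows = execute(stmt)
        if code == 0:
            return rows
        is_last = attempt == attempts - 1
        if RETRYABLE_MARKER not in msg or is_last:
            raise RuntimeError(f"statement failed: code={code} msg={msg}")
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** attempt)
```

Other errors, such as E_RPC_FAILURE, are deliberately not retried here, since resending a heavy find path statement immediately may add load to an already busy storage service.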

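Upgrading to 2.5 involves many services and needs a long test cycle. We will test both in parallel, but this graph down still needs to be debugged.
The 0910 graph down logs were provided above. Do they help locate the cause of the graph down?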
core0910.log (8.2 KB)
0910.txt (21.9 KB)

OK. When the problem occurs, is the main operation still the find path statement?

Yes, it is mainly this statement.

There is no out-degree cache, so there is no way to know which vertex is the super vertex.
