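After importing Hive data with Exchange, the log still shows a large number of Spark tasks running once the vertex import has finished, and they would take many days to complete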

System information

  • nebula version: V3.4.1

  • Deployment: distributed

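  • Detailed description of the problem
    I am using Exchange to import Hive data into the graph database, with more than 1 billion vertices and more than 1 billion edges. The log shows that after the last vertex TAG was imported, a large number of tasks were started (partial log below). They are very slow, and after several days the job still has not fully finished. The data can already be queried normally in the graph database, though.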

23/09/10 16:39:43 INFO TaskSetManager: Starting task 35393.0 in stage 142.0 (TID 9702) (cnsz26plhejt, executor 24, partition 35393, NODE_LOCAL, 4877 bytes) taskResourceAssignments Map()
23/09/10 16:39:43 INFO TaskSetManager: Finished task 11988.0 in stage 142.0 (TID 9643) in 282761 ms on cnsz26plhejt (executor 24) (1/111320)
23/09/10 16:39:50 INFO TaskSetManager: Starting task 23439.0 in stage 142.0 (TID 9703) (cnsz26pllxha, executor 12, partition 23439, NODE_LOCAL, 4876 bytes) taskResourceAssignments Map()
23/09/10 16:39:50 INFO TaskSetManager: Finished task 8516.0 in stage 142.0 (TID 9666) in 289319 ms on cnsz26pllxha (executor 12) (2/111320)
23/09/10 16:39:53 INFO TaskSetManager: Starting task 23308.0 in stage 142.0 (TID 9704) (cnsz26plhjwg, executor 40, partition 23308, NODE_LOCAL, 4876 bytes) taskResourceAssignments Map()
23/09/10 16:39:53 INFO TaskSetManager: Finished task 10201.0 in stage 142.0 (TID 9651) in 292233 ms on cnsz26plhjwg (executor 40) (3/111320)
23/09/10 16:39:55 INFO TaskSetManager: Starting task 23440.0 in stage 142.0 (TID 9705) (cnsz26pllxha, executor 15, partition 23440, NODE_LOCAL, 4876 bytes) taskResourceAssignments Map()
23/09/10 16:39:55 INFO TaskSetManager: Finished task 3641.0 in stage 142.0 (TID 9641) in 294162 ms on cnsz26pllxha (executor 15) (4/111320)
23/09/10 16:39:56 INFO TaskSetManager: Starting task 23309.0 in stage 142.0 (TID 9706) (cnsz26plhjwg, executor 36, partition 23309, NODE_LOCAL, 4876 bytes) taskResourceAssignments Map()
23/09/10 16:39:56 INFO TaskSetManager: Finished task 782.0 in stage 142.0 (TID 9624) in 295458 ms on cnsz26plhjwg (executor 36) (5/111320)
23/09/10 16:39:57 INFO TaskSetManager: Starting task 36573.0 in stage 142.0 (TID 9707) (cnsz26plhejt, executor 17, partition 36573, NODE_LOCAL, 4875 bytes) taskResourceAssignments Map()
23/09/10 16:39:57 INFO TaskSetManager: Finished task 13213.0 in stage 142.0 (TID 9649) in 295933 ms on cnsz26plhejt (executor 17) (6/111320)
23/09/10 16:39:58 INFO TaskSetManager: Starting task 27329.0 in stage 142.0 (TID 9708) (cnsz26pllxha, executor 11, partition 27329, NODE_LOCAL, 4876 bytes) taskResourceAssignments Map()
23/09/10 16:39:58 INFO TaskSetManager: Finished task 5114.0 in stage 142.0 (TID 9655) in 297378 ms on cnsz26pllxha (executor 11) (7/111320)
23/09/10 16:40:01 INFO TaskSetManager: Starting task 27330.0 in stage 142.0 (TID 9709) (cnsz26pllxha, executor 10, partition 27330, NODE_LOCAL, 4876 bytes) taskResourceAssignments Map()
23/09/10 16:40:01 INFO TaskSetManager: Finished task 7611.0 in stage 142.0 (TID 9660) in 300462 ms on cnsz26pllxha (executor 10) (8/111320)
23/09/10 16:40:03 INFO TaskSetManager: Starting task 36574.0 in stage 142.0 (TID 9710) (cnsz26plhejt, executor 20, partition 36574, NODE_LOCAL, 4875 bytes) taskResourceAssignments Map()
23/09/10 16:40:03 INFO TaskSetManager: Finished task 18243.0 in stage 142.0 (TID 9674) in 302296 ms on cnsz26plhejt (executor 20) (9/111320)
23/09/10 16:40:05 INFO TaskSetManager: Starting task 36575.0 in stage 142.0 (TID 9711) (cnsz26plhejt, executor 27, partition 36575, NODE_LOCAL, 4875 bytes) taskResourceAssignments Map()
23/09/10 16:40:05 INFO TaskSetManager: Finished task 17952.0 in stage 142.0 (TID 9667) in 304402 ms on cnsz26plhejt (executor 27) (10/111320)
23/09/10 16:40:08 INFO TaskSetManager: Starting task 27331.0 in stage 142.0 (TID 9712) (cnsz26pllxha, executor 13, partition 27331, NODE_LOCAL, 4876 bytes) taskResourceAssignments Map()
23/09/10 16:40:08 INFO TaskSetManager: Finished task 16199.0 in stage 142.0 (TID 9690) in 307408 ms on cnsz26pllxha (executor 13) (11/111320)
23/09/10 16:40:09 INFO TaskSetManager: Starting task 23310.0 in stage 142.0 (TID 9713) (cnsz26plhjwg, executor 39, partition 23310, NODE_LOCAL, 4876 bytes) taskResourceAssignments Map()
23/09/10 16:40:09 INFO TaskSetManager: Finished task 783.0 in stage 142.0 (TID 9625) in 308460 ms on cnsz26plhjwg (executor 39) (12/111320)
23/09/10 16:40:10 INFO TaskSetManager: Starting task 36576.0 in stage 142.0 (TID 9714) (cnsz26plhejt, executor 19, partition 36576, NODE_LOCAL, 4875 bytes) taskResourceAssignments Map()
23/09/10 16:40:10 INFO TaskSetManager: Finished task 3747.0 in stage 142.0 (TID 9633) in 309272 ms on cnsz26plhejt (executor 19) (13/111320)
23/09/10 16:40:10 INFO TaskSetManager: Starting task 23311.0 in stage 142.0 (TID 9715) (cnsz26plhjwg, executor 33, partition 23311, NODE_LOCAL, 4876 bytes) taskResourceAssignments Map()
23/09/10 16:40:10 INFO TaskSetManager: Finished task 17914.0 in stage 142.0 (TID 9698) in 309490 ms on cnsz26plhjwg (executor 33) (14/111320)
23/09/10 16:40:11 INFO TaskSetManager: Starting task 23884.0 in stage 142.0 (TID 9716) (cnsz26plhjwg, executor 29, partition 23884, NODE_LOCAL, 4876 bytes) taskResourceAssignments Map()
23/09/10 16:40:11 INFO TaskSetManager: Finished task 11337.0 in stage 142.0 (TID 9661) in 310091 ms on cnsz26plhjwg (executor 29) (15/111320)

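The log shows that stage 142.0 has 111320 tasks to run. What are these tasks doing, and why do they take so long? Also, if I stop the program manually at this point, will that affect queries on the graph data?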
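To put a rough number on "how long", the completion rate can be read off the `Finished task` lines themselves. The sketch below is only an illustration: it uses three lines copied from the excerpt above, the helper names `parse_finished` and `estimate_remaining_seconds` are made up for this post, and it assumes the completion rate in this short window stays constant, which the first wave of tasks may not reflect.

```python
import re
from datetime import datetime

# Three "Finished task" lines copied from the log excerpt above.
LOG = """\
23/09/10 16:39:43 INFO TaskSetManager: Finished task 11988.0 in stage 142.0 (TID 9643) in 282761 ms on cnsz26plhejt (executor 24) (1/111320)
23/09/10 16:40:01 INFO TaskSetManager: Finished task 7611.0 in stage 142.0 (TID 9660) in 300462 ms on cnsz26pllxha (executor 10) (8/111320)
23/09/10 16:40:11 INFO TaskSetManager: Finished task 11337.0 in stage 142.0 (TID 9661) in 310091 ms on cnsz26plhjwg (executor 29) (15/111320)
"""

# Captures: timestamp, task duration in ms, tasks finished so far, total tasks.
PATTERN = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO TaskSetManager: "
    r"Finished task \S+ in stage \S+ \(TID \d+\) in (\d+) ms .* \((\d+)/(\d+)\)$"
)


def parse_finished(log):
    """Return (timestamp, duration_ms, finished, total) for each matching line."""
    rows = []
    for line in log.splitlines():
        m = PATTERN.match(line.strip())
        if m:
            ts = datetime.strptime(m.group(1), "%y/%m/%d %H:%M:%S")
            rows.append((ts, int(m.group(2)), int(m.group(3)), int(m.group(4))))
    return rows


def estimate_remaining_seconds(rows):
    """Extrapolate remaining time from the first and last parsed lines.

    Assumption: tasks keep finishing at the rate seen in this window.
    """
    first, last = rows[0], rows[-1]
    elapsed = (last[0] - first[0]).total_seconds()
    rate = (last[2] - first[2]) / elapsed  # tasks finished per second
    remaining_tasks = last[3] - last[2]
    return remaining_tasks / rate


if __name__ == "__main__":
    rows = parse_finished(LOG)
    seconds = estimate_remaining_seconds(rows)
    print(f"about {seconds / 86400:.1f} days remaining at the observed rate")
```

For this excerpt, 14 tasks finish in 28 seconds, so the remaining 111305 tasks would need roughly 2.6 days if nothing changed. That matches the "several days" observed, and each task taking about 300 seconds is what makes the stage so slow.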

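I am not sure whether this is related to the VIDs, for example Chinese characters, special characters, empty values, or excessive length?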

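It is not related to that. If something were wrong there, your job would throw an exception and terminate right away.
How did you submit the Spark job? I have not run into this before. Where do the 100,000+ partitions come from?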

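If you are sure the last tag has finished importing, and the counts reported by stats match what you expect, you can kill the job. It will not affect queries.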

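I also ran tests with some other data, and the problem is reproducible.
The tests show that the problem occurs when the VIDs are fairly long strings that contain characters other than digits and letters: part of the data fails to import, and then well over a hundred thousand tasks are generated and the program cannot finish normally.
When I switched the VIDs to unique integers, the import finished quickly and everything worked normally.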
My questions are:

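  1. Are there any characters in a VID that need special handling? Are characters other than digits and letters properly supported?
  2. What happens if a VID is longer than the length configured for the graph space?
  3. Where can I see the data that Exchange failed to import? The log for the failed import is as follows: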

23/09/19 16:08:00 INFO Exchange$: import for tag specialmobile, data count: 226232, cost time: 12.49s
23/09/19 16:08:00 INFO Exchange$: Client-Import: batchSuccess.specialmobile: 197
23/09/19 16:08:00 INFO Exchange$: Client-Import: batchFailure.specialmobile: 3
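The counters in these lines can be collected per tag to see how large the failed share is. This is a small sketch that parses only the `Client-Import` lines shown above. The `summarize` helper is made up for illustration, and, judging by the counter names, the numbers appear to count batches rather than individual rows.

```python
import re

# The three Exchange log lines quoted above.
LOG = """\
23/09/19 16:08:00 INFO Exchange$: import for tag specialmobile, data count: 226232, cost time: 12.49s
23/09/19 16:08:00 INFO Exchange$: Client-Import: batchSuccess.specialmobile: 197
23/09/19 16:08:00 INFO Exchange$: Client-Import: batchFailure.specialmobile: 3
"""

# Captures: "Success" or "Failure", the tag/edge name, and the counter value.
COUNTER = re.compile(r"Client-Import: batch(Success|Failure)\.(\S+): (\d+)")


def summarize(log):
    """Return {name: {"Success": n, "Failure": m}} from Client-Import lines."""
    summary = {}
    for kind, name, count in COUNTER.findall(log):
        entry = summary.setdefault(name, {"Success": 0, "Failure": 0})
        entry[kind] += int(count)
    return summary


if __name__ == "__main__":
    for name, c in summarize(LOG).items():
        total = c["Success"] + c["Failure"]
        print(f"{name}: {c['Failure']}/{total} batches failed")
```

For the lines above, 3 of 200 batches for `specialmobile` failed, which is 1.5%.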
