Errors in the storaged log when importing with Spark Writer

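To help locate and resolve the problem faster, please follow the template below when asking ^ ^

Question template:

  • nebula version: 1.1.0
  • Deployment (distributed /):
  • Hardware
    • Disk: HDD
    • CPU and memory: 12C 32G
  • How the affected Space was created: run describe space xxx;
  • Detailed description of the problem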

I used Spark Writer to import 70 million vertices and 1.4 billion edges, and the storaged log contains a small number of errors:
E1020 19:12:39.159327 97232 RaftPart.cpp:365] [Port: 44501, Space: 599, Part: 1] The partition is not a leader
E1020 19:12:39.169883 97232 RaftPart.cpp:635] [Port: 44501, Space: 599, Part: 1] Cannot append logs, clean the buffer
E1020 20:10:13.538108 97427 RaftPart.cpp:1075] [Port: 44501, Space: 277, Part: 67] Receive response about askForVote from [10.57.36.18:44501], error code is -6
E1020 20:10:13.722595 97427 RaftPart.cpp:1075] [Port: 44501, Space: 277, Part: 67] Receive response about askForVote from [10.57.36.19:44501], error code is -6
E1020 20:10:14.225545 97426 RaftPart.cpp:1075] [Port: 44501, Space: 313, Part: 97] Receive response about askForVote from [10.57.36.18:44501], error code is -6
E1020 20:10:14.225662 97426 RaftPart.cpp:1075] [Port: 44501, Space: 313, Part: 97] Receive response about askForVote from [10.57.36.19:44501], error code is -6
E1020 20:10:17.223654 97426 RaftPart.cpp:1075] [Port: 44501, Space: 567, Part: 88] Receive response about askForVote from [10.57.36.18:44501], error code is -6
E1020 20:10:17.223712 97426 RaftPart.cpp:1075] [Port: 44501, Space: 567, Part: 88] Receive response about askForVote from [10.57.36.19:44501], error code is -6
E1020 20:57:55.531424 97237 RaftPart.cpp:773] [Port: 44501, Space: 233, Part: 59] Replicate logs failed
E1020 20:58:08.849247 97229 RaftPart.cpp:773] [Port: 44501, Space: 133, Part: 4] Replicate logs failed
E1020 20:58:09.345399 97227 RaftPart.cpp:773] [Port: 44501, Space: 567, Part: 98] Replicate logs failed
E1020 20:58:09.522493 97251 RaftPart.cpp:773] [Port: 44501, Space: 548, Part: 84] Replicate logs failed
E1020 21:28:58.720558 97224 RaftPart.cpp:773] [Port: 44501, Space: 277, Part: 24] Replicate logs failed
E1020 21:59:29.273088 97220 RaftPart.cpp:773] [Port: 44501, Space: 83, Part: 25] Replicate logs failed
E1020 22:43:04.400045 97425 RaftPart.cpp:1075] [Port: 44501, Space: 344, Part: 31] Receive response about askForVote from [10.57.36.18:44501], error code is -6
E1020 22:43:04.507824 97425 RaftPart.cpp:1075] [Port: 44501, Space: 344, Part: 31] Receive response about askForVote from [10.57.36.19:44501], error code is -6
E1020 23:42:12.147251 97222 RaftPart.cpp:773] [Port: 44501, Space: 313, Part: 52] Replicate logs failed
E1021 00:33:49.931308 97221 RaftPart.cpp:773] [Port: 44501, Space: 567, Part: 81] Replicate logs failed
E1021 00:33:50.026232 97248 RaftPart.cpp:773] [Port: 44501, Space: 246, Part: 8] Replicate logs failed
E1021 01:25:10.910989 97241 RaftPart.cpp:773] [Port: 44501, Space: 233, Part: 39] Replicate logs failed

Import configuration

{
spark: {
app: {
name: Nebula Spark Writer
}

driver: {
  cores: 4
  memory: 4G
}

executor: {
  memory: 12G
  cores: 8
}

cores: {
  max: 64
}

default: {
  parallelism:600
}

}

nebula: {
addresses: ["xxxx"]

user: user
pswd: password

space: twitter_test

connection {
  timeout: 3000
  retry: 3
}

execution {
  retry: 3
}

}

tags= [
{
name: follow
type: csv
path: "hdfs://ns1/data/twitter.csv"
fields: {
_c0: uid
}
vertex.field: _c0
vertex.policy: hash
batch: 128
},
{
name: followed
type: csv
path: "hdfs://ns1/data/twitter.csv"
fields: {
_c1: uid
}
vertex.field: _c1
vertex.policy: hash
batch: 128
}
]

edges= [
{
name: link
type: csv
path: "hdfs://ns1/data/twitter.csv"
fields: {

  }
  source.field:  _c0
  source.policy: hash
  target.field:  _c1
  target.policy: hash
  batch: 128
}

]
}
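The imported twitter_test.csv has 1.4 billion records. Spark Writer does not deduplicate vertices on import, so the vertex import should also amount to 1.4 billion writes.
The whole import took about 9 hours.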
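
For a rough sense of scale, the figures above can be turned into a back-of-the-envelope estimate. This is only a sketch: it assumes each of the two tag blocks and the one edge block in the config issues one write per CSV row, and it ignores retries and per-partition batch boundaries, so the real numbers will differ somewhat. The function name `import_estimate` is made up for illustration.

```python
# Figures taken from the post; the per-block write model is an assumption.
CSV_ROWS = 1_400_000_000   # records in the CSV
TAG_BLOCKS = 2             # "follow" and "followed" in the config
EDGE_BLOCKS = 1            # "link" in the config
BATCH = 128                # batch size used in every block
HOURS = 9                  # reported total import time


def import_estimate(rows, tag_blocks, edge_blocks, batch, hours):
    """Estimate total writes, batch count, and average write rate."""
    vertex_writes = rows * tag_blocks      # no dedup: one vertex write per row per tag
    edge_writes = rows * edge_blocks
    total = vertex_writes + edge_writes
    batches = -(-total // batch)           # ceiling division
    return {
        "vertex_writes": vertex_writes,
        "edge_writes": edge_writes,
        "total_writes": total,
        "batches": batches,
        "rows_per_sec": total / (hours * 3600),
    }


est = import_estimate(CSV_ROWS, TAG_BLOCKS, EDGE_BLOCKS, BATCH, HOURS)
for name, value in est.items():
    print(f"{name}: {value:,.0f}")
```

Under these assumptions most of the vertex writes are repeats of the 70 million distinct vertices, so deduplicating the ID columns before the import would remove a large share of the work.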

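If only these few lines are printed, and they are not printed frequently, it should be fine. While nebula is running, the raft layer in storage may switch leaders. Some requests arriving at that moment fail because of the leader change, but they are all retried.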
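
One way to check whether the errors are "frequent" is to count E-level lines per hour and per source location. The sketch below parses the glog line format (`[IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg`); the helper name `tally_errors` is made up for illustration, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# glog line: level + mmdd, time, thread id, "file:line]", then the message.
GLOG_RE = re.compile(
    r"^(?P<level>[IWEF])(?P<mmdd>\d{4}) (?P<hour>\d{2}):\d{2}:\d{2}\.\d{6}\s+"
    r"(?P<tid>\d+) (?P<src>[^\]\s]+:\d+)\] (?P<msg>.*)$"
)


def tally_errors(lines):
    """Count E-level lines per (mmdd, hour, source location)."""
    counts = Counter()
    for line in lines:
        m = GLOG_RE.match(line.strip())
        if m and m.group("level") == "E":
            counts[(m.group("mmdd"), m.group("hour"), m.group("src"))] += 1
    return counts


sample = [
    "E1020 19:12:39.159327 97232 RaftPart.cpp:365] [Port: 44501, Space: 599, Part: 1] The partition is not a leader",
    "E1020 20:57:55.531424 97237 RaftPart.cpp:773] [Port: 44501, Space: 233, Part: 59] Replicate logs failed",
    "E1020 20:58:08.849247 97229 RaftPart.cpp:773] [Port: 44501, Space: 133, Part: 4] Replicate logs failed",
]

for key, n in sorted(tally_errors(sample).items()):
    print(key, n)
```

To run it on a real log, pass an open file object to `tally_errors` instead of `sample`. A handful of hits per hour over a 9-hour import matches the "not frequent" case described above.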

OK.

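After the import finished, a query counting 2-hop neighbors failed:
[ERROR (-8)]: Get neighbors failed
show hosts reports one machine as offline. How does an offline host normally come back online? For now I have restarted the service on that machine.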

Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E1021 11:18:49.638280 3908 StorageClient.inl:123] Request to [10.57.36.18:44500] failed: N6apache6thrift9transport19TTransportExceptionE: Timed Out
E1021 11:18:49.658414 3908 StorageClient.inl:123] Request to [10.57.36.19:44500] failed: N6apache6thrift9transport19TTransportExceptionE: Timed Out
E1021 11:18:49.658751 3936 ExecutionPlan.cpp:80] Execute failed: Get neighbors failed
E1021 11:21:29.655000 3908 StorageClient.inl:123] Request to [10.57.36.18:44500] failed: N6apache6thrift9transport19TTransportExceptionE: Timed Out
E1021 11:21:29.655578 3935 ExecutionPlan.cpp:80] Execute failed: Get neighbors failed

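If the service is still running, the most likely cause is that after the import, the rocksdb instance underneath nebula starts compaction.

Compaction uses a lot of resources, which can keep the metaclient from sending its heartbeat.

It usually recovers once compaction finishes, but since you are on HDD, that may take a while.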