- Nebula version: 2.6.1
- Deployment: distributed cluster
- Installation method: RPM
- Production environment: Y
- Hardware:
  - Disk: 1 TB SSD
  - CPU: 30 cores
  - Memory: 252 GB
- Problem description: with automatic balance enabled, I used Exchange to import Hive data into Nebula as property-less edges. Inserting 2.5 billion edges in one direction took over 2 hours. Can this be optimized? Nebula and Hadoop run on different clusters, so the data is transferred over the network.
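For reference, the throughput of the run described above works out to roughly 350 K edges/s (treating "a bit over 2 hours" as a 2-hour lower bound, so this is an upper-bound estimate):

```python
# Rough throughput of the reported run (numbers from the post).
edges = 2_500_000_000
seconds = 2 * 3600  # lower bound: the post says "a bit over 2 hours"
rate = edges / seconds
print(f"{rate:,.0f} edges/s")  # ≈ 347,222 edges/s (upper-bound estimate)
```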
- Configuration file
{
  # Spark configuration
  spark: {
    app: {
      name: Exchange
    }
    driver: {
      cores: 2
      maxResultSize: 2G
    }
  }

  # Only needed when Spark and Hive are deployed in different clusters; otherwise ignore.
  #hive: {
  #  waredir: "hdfs://NAMENODE_IP:9000/apps/svr/hive-xxx/warehouse/"
  #  connectionURL: "jdbc:mysql://your_ip:3306/hive_spark?characterEncoding=UTF-8"
  #  connectionDriverName: "com.mysql.jdbc.Driver"
  #  connectionUserName: "user"
  #  connectionPassword: "password"
  #}

  # Nebula Graph configuration
  nebula: {
    address: {
      graph: ["-"]
      meta: ["-"]
    }
    user: *
    pswd: *
    space: test
    connection {
      timeout: 30000
      retry: 3
    }
    execution {
      retry: 3
    }
    error: {
      max: 32
      output: /home/nebula/error2
    }
    rate: {
      limit: 1024
      timeout: 10000
    }
  }

  # Edge configuration
  edges: [
    # Forward edge edge_md5_url_http
    {
      name: edge_md5_url_http
      type: {
        source: hive
        sink: client
      }
      exec: "select md5,url from test_table where day = '20220101'"
      fields: []
      nebula.fields: []
      source: {
        field: md5
      }
      target: {
        field: url
      }
      batch: 512
      partition: 300
    }
  ]
}
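For context, the knobs most often adjusted in an Exchange config when chasing import throughput are `batch` (statements bundled per INSERT) and `rate.limit`. The values below are purely illustrative, not measured recommendations for this workload:

```
# Illustrative only: larger batches mean fewer INSERT round trips per edge.
batch: 2000        # up from 512; watch graphd latency and the error output dir
partition: 300     # kept aligned with total executor cores (100 * 3)

# In the nebula block, rate.limit caps in-flight requests; raising it can help
# if storaged is not already the bottleneck.
rate: {
  limit: 2048      # up from 1024, illustrative
  timeout: 10000
}
```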
- Submit command
${SPARK_HOME2}/bin/spark-submit --class com.vesoft.nebula.exchange.Exchange \
--master yarn \
--deploy-mode cluster \
--driver-memory 4G \
--num-executors 100 \
--executor-cores 3 \
--executor-memory 10G \
--conf spark.sql.shuffle.partitions=300 \
--conf spark.dynamicAllocation.enabled=false \
--queue root.cloud \
--files hive_test.conf \
--jars ./guava-14.0.1.jar \
--conf spark.driver.extraClassPath=./guava-14.0.1.jar \
--conf spark.executor.extraClassPath=./guava-14.0.1.jar \
nebula-exchange-2.6.1.jar \
-c hive_test.conf -h
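As a sanity check on the sizing above: 100 executors with 3 cores each give 300 concurrent tasks, which exactly matches `partition: 300` in the edge config, so the import runs in a single full wave of tasks:

```python
# Sanity-check the Spark sizing used in the submit command above.
num_executors = 100
executor_cores = 3
partitions = 300  # matches `partition: 300` in the edge config

concurrent_tasks = num_executors * executor_cores
print(concurrent_tasks)                     # 300
print(partitions % concurrent_tasks == 0)   # True: one full wave of tasks
```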
Thanks!