exchange fails when importing from Hive: is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [56, 54, 125, 10]

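  • nebula version: 2.0
  • Deployment: distributed
  • Hardware
    • Disk: 2T SSD
    • CPU and memory: 12 cores + 128 memory
  • Detailed description of the problem
    When importing data with exchange 2.0, the Hive source reports the error below. Does exchange only support Parquet files?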
Caused by: java.lang.RuntimeException: hdfs://mycluster/real-time/out/hive/ods/ns/year=2021/month=01/day=24/hour=09/part-116-124 is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [56, 54, 125, 10]
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:476)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:445)
        at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:401)
        at org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:106)
        at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:133)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:404)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(ParquetFileFormat.scala:345)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:128)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
        at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:109)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
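
The two byte lists in the message are easier to interpret once decoded as ASCII. The short sketch below only decodes the numbers quoted in the error; it does not touch the file on HDFS. The variable names are illustrative.

```python
# Decode the two 4-byte sequences quoted in the error message as ASCII.
expected_tail = bytes([80, 65, 82, 49])   # what the Parquet reader wants at the end of the file
found_tail = bytes([56, 54, 125, 10])     # what it actually read from part-116-124

expected_text = expected_tail.decode("ascii")
found_text = found_tail.decode("ascii")

print(repr(expected_text))  # 'PAR1'
print(repr(found_text))     # '86}\n'
```

A Parquet file ends with the 4-byte magic `PAR1`. The file here ends with `86}` followed by a newline, which looks like the end of a plain-text line rather than a Parquet footer.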

Please first confirm whether Spark can read the data in your Hive table normally. There is a similar question on CSDN:
https://blog.csdn.net/qq_20488317/article/details/80368674?utm_medium=distribute.pc_relevant.none-task-blog-OPENSEARCH-2.control&dist_request_id=1328593.10330.16147465873841189&depth_1-utm_source=distribute.pc_relevant.none-task-blog-OPENSEARCH-2.control
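
One way to run that check is a small PySpark probe, sketched below under some assumptions: pyspark is installed, the Spark session can reach the Hive metastore, and `my_db` / `my_table` are hypothetical placeholders for the real database and table names. The helper names `build_probe_query` and `probe_hive_table` are made up for this example.

```python
def build_probe_query(database, table, limit=10):
    """Build a small SELECT that forces Spark to actually read some rows."""
    return f"SELECT * FROM {database}.{table} LIMIT {limit}"


def probe_hive_table(database, table):
    """Try to read a few rows of a Hive table through Spark SQL."""
    # Imported here so build_probe_query can be used without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("hive-read-probe")
        .enableHiveSupport()   # use the Hive metastore for table definitions
        .getOrCreate()
    )
    spark.sql(build_probe_query(database, table)).show()


# Hypothetical names; replace with the real database and table:
# probe_hive_table("my_db", "my_table")
```

If this probe raises the same "is not a Parquet file" exception, the problem is likely in how the table is defined or read by Spark, not in exchange itself.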

This looks like a Hive configuration problem. Perhaps the data type was not specified when the table was created.

My question is: does nebula exchange 2.0 have any requirement on the storage format used by Hive?
Our Hive data is stored on HDFS as textfile, but the error demands Parquet.

There is no such requirement. Check whether the file type was left unspecified when the table was created.
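
Besides inspecting the table definition (for example with Hive's `SHOW CREATE TABLE`), the data files themselves can be checked. The sketch below assumes one part file has already been copied to the local disk (for instance with `hdfs dfs -get`) and tests whether it ends with the Parquet magic bytes, which is the same check the reader in the stack trace performs. The function names are made up for this example.

```python
import os

PARQUET_MAGIC = b"PAR1"  # 4-byte magic at the end of a Parquet file


def read_tail(path, length=4):
    """Return the last `length` bytes of a local file (fewer if the file is shorter)."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        f.seek(max(size - length, 0))
        return f.read(length)


def looks_like_parquet(path):
    """True if the file ends with the Parquet magic bytes."""
    return read_tail(path, len(PARQUET_MAGIC)) == PARQUET_MAGIC
```

If `looks_like_parquet` returns False for the part files while the table is declared (or defaulted) as Parquet, the table definition and the files on HDFS disagree, which would explain the error above.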
