Accessing metaAddress via Spark fails with: java.lang.ClassNotFoundException: Failed to find data source: com.vesoft.nebula.connector.NebulaDataSource. Please find packages at http://spark.apache.org/third-party-projects.html

  • nebula version: 3.4
  • Deployment: single host
  • Installation: Docker (nebula up)
  • In production: N

When connecting to the metaAddress from Spark in PyCharm to read data from the database, I get the following error:

To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "D:/PycharmProjects/pythonProject2/main.py", line 16, in <module>
    "partitionNumber", 1).load()
  File "D:\spark\spark-2.4.5-bin-hadoop2.7\spark-2.4.5-bin-hadoop2.7\python\pyspark\sql\readwriter.py", line 172, in load
    return self._df(self._jreader.load())
  File "E:\anaconda\envs\successful\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "D:\spark\spark-2.4.5-bin-hadoop2.7\spark-2.4.5-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "E:\anaconda\envs\successful\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o40.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.vesoft.nebula.connector.NebulaDataSource. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.vesoft.nebula.connector.NebulaDataSource.DefaultSource
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
	at scala.util.Try$.apply(Try.scala:192)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
	at scala.util.Try.orElse(Try.scala:84)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
	... 13 more

Entering the container, the jar files are indeed present under /root/download.

Is this the same problem from last time, still unresolved?

Not yet :sweat_smile:

Please continue in your original thread. Don't open multiple threads for the same problem; the people replying may have to ask for the background information all over again (they won't necessarily remember what problems a given poster ran into before).

This is a generic pyspark question: how to include the dependency, the same thing you would do in Scala. It's equivalent to passing the driver class path and the jars at submit time. You can look into it; I haven't tried it myself.

The paths below are different from yours, but this is the idea:

spark-submit --master spark://master:7077 \
    --driver-class-path /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar:/opt/nebulagraph/ngdi/package/nebula-algo.jar \
    --jars /opt/nebulagraph/ngdi/package/nebula-spark-connector.jar,/opt/nebulagraph/ngdi/package/nebula-algo.jar \
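
If the session is built from Python directly (as in PyCharm) instead of through spark-submit, the same flags can be passed via the PYSPARK_SUBMIT_ARGS environment variable. A minimal sketch, with a placeholder jar path; it must run before the first SparkSession is created:

import os

# Set this before any SparkContext/SparkSession exists, because these arguments
# are used when the JVM is launched. The trailing "pyspark-shell" token is required.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-class-path /path/to/nebula-spark-connector.jar "
    "--jars /path/to/nebula-spark-connector.jar "
    "pyspark-shell"
)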



Do those jar files have to be on the server? And does my local PyCharm have to use the Python interpreter from the remote Spark cluster?

I think so, and in a distributed setup they would also need to be on distributed storage.
cc @nicole

So with a local Python interpreter I can't connect to the host:7077 you exposed? If that's the case, is this approach a dead end for real jobs in a production environment?

You can. It was mentioned before; see nebulagraph-di/Environment_Setup.md at main · wey-gu/nebulagraph-di · GitHub

Essentially pyspark still executes in parallel (remotely). Forcing it to run in the local process also seems possible; you can search the docs for it, but it's not recommended.
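
If you do want to force a purely local run for testing (not recommended, as said above), a minimal sketch would be a local-master session with spark.jars pointing at a local copy of the connector jar; the path here is a placeholder:

from pyspark.sql import SparkSession

# Local mode: the driver and executors run in this process, so a jar on the
# local filesystem is enough for both.
spark = (SparkSession.builder
         .master("local[*]")
         .config("spark.jars", "/path/to/nebula-spark-connector.jar")
         .getOrCreate())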

Could you take a look at whether my connection approach is wrong?

from pyspark.sql import SparkSession, Row
spark = SparkSession.builder \
    .master('spark://192.168.1.230:7077') \
    .config("spark.dynamicAllocation.enabled","false") \
    .config("spark.executor.memory","2g") \
    .config("spark.executor.memoryOverhead","12288") \
    .config("spark.executor.instances","4") \
    .config("--driver-class-path", "/root/download/nebula-spark-connector.jar") \
    .config("--driver-class-path", "/root/download/nebula-algo.jar") \
    .config("--jars", "/root/download/nebula-spark-connector.jar") \
    .config("--jars", "/root/download/nebula-algo.jar") \
    .appName("test1") \
    .getOrCreate()

df = spark.read.format(
  "com.vesoft.nebula.connector.NebulaDataSource").option(
    "type", "vertex").option(
    "spaceName", "demo").option(
    "label", "player").option(
    "returnCols", "name,age").option(
    "metaAddress", "192.168.1.230:33473").option(
    "partitionNumber", 1).load()
df.show(n=2)

Output:

E:\anaconda\envs\successful\python.exe D:/PycharmProjects/pythonProject2/1.py
Warning: Ignoring non-Spark config property: --jars
Warning: Ignoring non-Spark config property: --driver-class-path
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "D:/PycharmProjects/pythonProject2/1.py", line 43, in <module>
    "partitionNumber", 1).load()
  File "D:\spark\spark-2.4.5-bin-hadoop2.7\spark-2.4.5-bin-hadoop2.7\python\pyspark\sql\readwriter.py", line 172, in load
    return self._df(self._jreader.load())
  File "E:\anaconda\envs\successful\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "D:\spark\spark-2.4.5-bin-hadoop2.7\spark-2.4.5-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "E:\anaconda\envs\successful\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o75.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.vesoft.nebula.connector.NebulaDataSource. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.vesoft.nebula.connector.NebulaDataSource.DefaultSource
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
	at scala.util.Try$.apply(Try.scala:192)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
	at scala.util.Try.orElse(Try.scala:84)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
	... 13 more


Process finished with exit code 1

My jar files can only be found inside the container, so the paths I wrote are the in-container paths.

The stack trace is on your local side, so I suspect you need the spark connector jar on your local machine as well. Try also adding a path on the machine where your Python interpreter runs and including the jar there?

What do you mean?

Alongside that path, add another one pointing into your Windows environment, and put a copy of the jar there as well.

And this one too.

Like this? It still doesn't seem to work.

from pyspark.sql import SparkSession, Row
spark = SparkSession.builder \
    .master('spark://192.168.1.230:7077') \
    .config("spark.dynamicAllocation.enabled","false") \
    .config("spark.executor.memory","2g") \
    .config("spark.executor.memoryOverhead","12288") \
    .config("spark.executor.instances","4") \
    .config("--driver-class-path", "/root/download/nebula-spark-connector.jar") \
    .config("--driver-class-path", r"D:\PycharmProjects\pythonProject2\download\nebula-spark-connector.jar") \
    .config("--driver-class-path", "/root/download/nebula-algo.jar") \
    .config("--jars", "/root/download/nebula-spark-connector.jar") \
    .config("--jars", r"D:\PycharmProjects\pythonProject2\download\nebula-spark-connector.jar") \
    .config("--jars", "/root/download/nebula-algo.jar") \
    .appName("test1") \
    .getOrCreate()
df = spark.read.format(
  "com.vesoft.nebula.connector.NebulaDataSource").option(
    "type", "vertex").option(
    "spaceName", "demo").option(
    "label", "player").option(
    "returnCols", "name,age").option(
    "metaAddress", "192.168.1.230:33473").option(
    "partitionNumber", 1).load()
df.show(n=2)

Output:

E:\anaconda\envs\successful\python.exe D:/PycharmProjects/pythonProject2/1.py
Warning: Ignoring non-Spark config property: --jars
Warning: Ignoring non-Spark config property: --driver-class-path
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "D:/PycharmProjects/pythonProject2/1.py", line 45, in <module>
    "partitionNumber", 1).load()
  File "D:\spark\spark-2.4.5-bin-hadoop2.7\spark-2.4.5-bin-hadoop2.7\python\pyspark\sql\readwriter.py", line 172, in load
    return self._df(self._jreader.load())
  File "E:\anaconda\envs\successful\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "D:\spark\spark-2.4.5-bin-hadoop2.7\spark-2.4.5-bin-hadoop2.7\python\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "E:\anaconda\envs\successful\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o75.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.vesoft.nebula.connector.NebulaDataSource. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.vesoft.nebula.connector.NebulaDataSource.DefaultSource
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
	at scala.util.Try$.apply(Try.scala:192)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
	at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
	at scala.util.Try.orElse(Try.scala:84)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
	... 13 more


Process finished with exit code 1

@cuihangrui If you're only testing locally, you can set the master to local and add the spark.jars config when building your SparkSession, as in the code below. If you set the master to standalone or yarn, you'll need to put the jars on the Spark or YARN cluster.

spark = SparkSession.builder.appName(
        "PythonWordCount").master(
        "local").config(
        "spark.jars","/Users/nicole/Desktop/nebula-spark-connector_3.0-3.5.0-jar-with-dependencies.jar").config(
        "spark.driver.extraClassPath","/Users/nicole/Desktop/nebula-spark-connector_3.0-3.5.0-jar-with-dependencies.jar").getOrCreate()
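
Putting that together with the read from earlier in the thread, an end-to-end sketch could look like the following; the space, tag, metaAddress, and the Windows jar path are all taken from the posts above, so adjust them to your setup:

from pyspark.sql import SparkSession

# Local copy of the connector jar (path from the earlier post).
jar = r"D:\PycharmProjects\pythonProject2\download\nebula-spark-connector.jar"

spark = (SparkSession.builder
         .appName("test1")
         .master("local[*]")
         .config("spark.jars", jar)
         .config("spark.driver.extraClassPath", jar)
         .getOrCreate())

df = (spark.read.format("com.vesoft.nebula.connector.NebulaDataSource")
      .option("type", "vertex")
      .option("spaceName", "demo")
      .option("label", "player")
      .option("returnCols", "name,age")
      .option("metaAddress", "192.168.1.230:33473")
      .option("partitionNumber", 1)
      .load())
df.show(n=2)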

I need to read the data through nebula meta, and the jars are present in my container.

But I keep getting the same error as above.

The jar wasn't picked up. Check the logs from your program run: the config didn't take effect.
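
A quick way to confirm whether the jar configuration actually took effect is to print the live conf right after getOrCreate(), using the standard getConf().getAll() API; a small sketch:

# spark.jars / spark.driver.extraClassPath should list the connector jar.
# If they are missing, the builder keys were wrong ("--jars" is not a Spark
# config key, hence the "Ignoring non-Spark config property" warnings above).
for key, value in spark.sparkContext.getConf().getAll():
    if "jars" in key or "ClassPath" in key:
        print(key, "=", value)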

Is this caused by my code or by the container? I've been trying to solve this for a long time and it's always the same error; I'm out of ideas :smiling_face_with_tear: