I was setting up my Spark development environment with the Anaconda distribution on my Windows 10 desktop. I had this same setup working fine on this machine earlier; after doing some cleanup and reinstalling from scratch, I now get an error whenever I ask Spark to show the data. Initializing Spark, importing the libraries, and loading data into a DataFrame all work fine, but calling the show() action fails. It seems to be something in my environment settings. What am I doing wrong?
Environment:
spark-3.1.2-bin-hadoop2.7 (SPARK_HOME & HADOOP_HOME)
jdk1.8.0_281 (JAVA_HOME)
Anaconda Spyder IDE
winutils.exe (for Hadoop 2.7.7)
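For reference, PySpark launches its worker processes with the executable named in PYSPARK_PYTHON (or python3 when that variable is not set). Below is a minimal sketch of pinning both the driver and the workers to one interpreter before building the session; using sys.executable here is an assumption on my part, not something from the original setup.

import os
import sys
from pyspark.sql import SparkSession

# Make the workers use the same interpreter as the driver (e.g. the Anaconda python.exe).
# sys.executable is an assumption; any full path to a working python.exe would do.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

spark = SparkSession.builder.getOrCreate()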
Python 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)] Type "copyright", "credits" or "license" for more information.
IPython 7.29.0 -- An enhanced Interactive Python.
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql.types import StringType, StructType, StructField
import pyspark.ml

spark = SparkSession.builder.getOrCreate()

data2 = [(1, "James Smith"),
         (2, "Michael Rose"),
         (3, "Robert Williams"),
         (4, "Rames Rose"),
         (5, "Rames rose")]

df2 = spark.createDataFrame(data=data2, schema=["id", "name"])
df2.printSchema()
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
root
|-- id: long (nullable = true)
|-- name: string (nullable = true)
df2.show()
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[Stage 0:> (0 + 1) / 1]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Python was not found; run without arguments to install from the Microsoft Store, or disable this shortcut from Settings > Manage App Execution Aliases.
Traceback (most recent call last):
File "C:\Users\***\AppData\Local\Temp/ipykernel_25396/2272422252.py", line 1, in <module>
df2.show()
File "C:\Users\***\anaconda3\lib\site-packages\pyspark\sql\dataframe.py", line 484, in show
print(self._jdf.showString(n, 20, vertical))
File "C:\Users\***\anaconda3\lib\site-packages\py4j\java_gateway.py", line 1309, in __call__
return_value = get_return_value(
File "C:\Users\***\anaconda3\lib\site-packages\pyspark\sql\utils.py", line 111, in deco
return f(*a, **kw)
File "C:\Users\***\anaconda3\lib\site-packages\py4j\protocol.py", line 326, in get_return_value
raise Py4JJavaError(
Py4JJavaError: An error occurred while calling o36.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (DellXPS executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:182)
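A quick way to check, from the same session, which executable the workers will be launched with is shown below; pythonExec is an internal PySpark attribute, so treat this purely as a debugging aid.

import os
import sys

print(sys.executable)                          # interpreter running the driver (Spyder/Anaconda)
print(os.environ.get("PYSPARK_PYTHON"))        # None means Spark falls back to "python3"
print(os.environ.get("PYSPARK_DRIVER_PYTHON"))
print(spark.sparkContext.pythonExec)           # internal attribute; the command used to start workers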
Source: https://stackoverflow.com/questions/70174906/pyspark-action-df-show-returns-java-error