I have a dataset that I have loaded into a Spark DataFrame. On this dataset I do the following:
(
    df
    .repartition(10, col)
    .write
    .mode("overwrite")
    .parquet(save_path)
)
# .parquet() returns None, so the written data has to be read back
# to get a DataFrame to experiment on
df_spark_bucketed = spark.read.parquet(save_path)
(
    df
    .repartition(10, col)
    .write
    .bucketBy(10, col)
    .mode("overwrite")
    .saveAsTable(table_name)
)
# saveAsTable() also returns None; the bucketed table is read back
# through the catalog so its bucketing metadata is available
df_hive_bucketed = spark.table(table_name)
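As a sanity check, the saved table's bucketing metadata can be inspected through SQL; a minimal sketch, assuming the table_name variable from above:

# A bucketed table should report "Num Buckets" and "Bucket Columns"
# in its detailed table information
spark.sql(f"DESCRIBE TABLE EXTENDED {table_name}").show(100, truncate=False)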
I then construct the following join query:
import pyspark.sql.functions as sf

def dummy_join(df, col):
    # keep only rows where column "B" equals the given literal,
    # then self-join the filtered result on column "A"
    temp = df.filter(sf.col("B") == col)
    return (
        temp
        .select("A")
        .join(temp, "A", "inner")
    )
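One assumption behind the experiment is worth verifying: Spark only uses bucketing-aware scans when the corresponding flag is on (it is by default). A quick check:

# Bucketed scans are controlled by this standard Spark SQL flag
# (default "true"); with it off, even a bucketed table is shuffled
# like any other source
print(spark.conf.get("spark.sql.sources.bucketing.enabled"))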
and finally run

# explain() prints the plan to stdout and returns None,
# so there is nothing useful to assign
dummy_join(df_spark_bucketed, "C").explain()
dummy_join(df_hive_bucketed, "C").explain()
and observe that the two query plans are identical. They both look like the plan in the figure below.

[figure: the identical query plan printed by explain() for both DataFrames]
I find this odd because I'd expect the second plan to be different: specifically, it should show that fewer (or maybe even no) shuffles occur, thanks to the bucketing. I am running this experiment to demonstrate how Hive bucketing can improve performance in operations that require shuffling, such as joins and aggregations.
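Rather than eyeballing the printed plans, one can also capture what explain() writes to stdout and search it for Exchange operators, which mark shuffles. A minimal sketch, assuming Spark 3.x, where explain() accepts a mode string:

import io
from contextlib import redirect_stdout

def plan_text(df):
    # explain() prints to stdout and returns None, so capture the output
    buf = io.StringIO()
    with redirect_stdout(buf):
        df.explain("formatted")
    return buf.getvalue()

# True means the physical plan contains at least one shuffle
print("Exchange" in plan_text(dummy_join(df_spark_bucketed, "C")))
print("Exchange" in plan_text(dummy_join(df_hive_bucketed, "C")))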
Why are they identical?
Source: https://stackoverflow.com/questions/76686427/spark-bucketing-and-hive-bucketing-generating-same-logical-plan-for-a-join-query