I have a dataset that I have loaded into a Spark DataFrame. On this dataset I do the following:
(
    df
    .repartition(10, col)
    .write
    .mode("overwrite")
    .parquet(save_path)
)
# .parquet() returns None, so the written data has to be read back
# to get a DataFrame to experiment on
df_spark_bucketed = spark.read.parquet(save_path)
(
    df
    .repartition(10, col)
    .write
    .bucketBy(10, col)
    .mode("overwrite")
    .saveAsTable(table_name)
)
# saveAsTable() also returns None; the bucketed table is read back
# through the catalog so its bucketing metadata is available
df_hive_bucketed = spark.table(table_name)
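As a sanity check, the saved table's bucketing metadata can be inspected through SQL; a minimal sketch, assuming the table_name variable from above:

# A bucketed table should report "Num Buckets" and "Bucket Columns"
# in its detailed table information
spark.sql(f"DESCRIBE TABLE EXTENDED {table_name}").show(100, truncate=False)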
I then construct the following join query:
import pyspark.sql.functions as sf

def dummy_join(df, col):
    # keep only rows where column "B" equals the given literal,
    # then self-join the filtered result on column "A"
    temp = df.filter(sf.col("B") == col)
    return (
        temp
        .select("A")
        .join(temp, "A", "inner")
    )
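One assumption behind the experiment is worth verifying: Spark only uses bucketing-aware scans when the corresponding flag is on (it is by default). A quick check:

# Bucketed scans are controlled by this standard Spark SQL flag
# (default "true"); with it off, even a bucketed table is shuffled
# like any other source
print(spark.conf.get("spark.sql.sources.bucketing.enabled"))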
and finally run

# explain() prints the plan to stdout and returns None,
# so there is nothing useful to assign
dummy_join(df_spark_bucketed, "C").explain()
dummy_join(df_hive_bucketed, "C").explain()
and observe that the two query plans are identical. They both look like the plan in the figure below.

[figure: the identical query plan printed by explain() for both DataFrames]
I find this odd because I'd expect the second plan to be different: specifically, it should show that fewer (or maybe even no) shuffles occur, thanks to the bucketing. I am running this experiment to demonstrate how Hive bucketing can improve performance in operations that require shuffling, such as joins and aggregations.
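Rather than eyeballing the printed plans, one can also capture what explain() writes to stdout and search it for Exchange operators, which mark shuffles. A minimal sketch, assuming Spark 3.x, where explain() accepts a mode string:

import io
from contextlib import redirect_stdout

def plan_text(df):
    # explain() prints to stdout and returns None, so capture the output
    buf = io.StringIO()
    with redirect_stdout(buf):
        df.explain("formatted")
    return buf.getvalue()

# True means the physical plan contains at least one shuffle
print("Exchange" in plan_text(dummy_join(df_spark_bucketed, "C")))
print("Exchange" in plan_text(dummy_join(df_hive_bucketed, "C")))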
Why are they identical?
Source: https://stackoverflow.com/questions/76686427/spark-bucketing-and-hive-bucketing-generating-same-logical-plan-for-a-join-query