I have a pyspark dataframe like this:
+-------+-------+
| level | value |
+-------+-------+
| 1 | 4 |
| 1 | 5 |
| 2 | 2 |
| 2 | 6 |
| 2 | 3 |
+-------+-------+
I have to create a value for every group in level column and save this in lable column. This value for every group must be unique, so I use ObjectId Mongo function to create that. Next dataframe is like this:
+-------+--------+-------+
| level | lable| value |
+-------+--------+-------+
| 1 | bb76 | 4 |
| 1 | bb76 | 5 |
| 2 | cv86 | 2 |
| 2 | cv86 | 6 |
| 2 | cv86 | 3 |
+-------+--------+-------+
Then I must create a dataframe as following:
+-------+-------+
| lable | value |
+-------+-------+
| bb76 | 9 |
| cv86 | 11 |
+-------+-------+
To do that, first I used spark groupby
:
def create_objectid():
a = str(ObjectId())
return a
def add_lable(df):
df = df.cache()
df.count()
grouped_df = df.groupby('level').agg(sum(df.value).alias('temp'))
grouped_df = grouped_df.withColumnRenamed('level', 'level_temp')
grouped_df = grouped_df.withColumn('lable', udf_create_objectid())
grouped_df = grouped_df.drop('temp')
df = df.join(grouped_df.select('level_temp','lable'), col('level') == col('level_temp'), how="left").drop(grouped_df.level_temp)
return df
When I used the above code on spark dataframe with 2 millions records, it takes about 155
seconds to finish. I searched and found that spark window
has better performance. Then, I changed the last function to this one. Because pandas_udf
needs arg
, so I just pass one and print it:
@f.pandas_udf("string")
def create_objectid_on_window(v: pd.Series) -> str:
print('v:',v)
return str(ObjectId())
def add_lable(df):
w = Window.partitionBy('level')
df = df.withColumn('lable', create_objectid_on_window('level').over(w))
return df
But after running the program, I receive this error:
AttributeError: 'NoneType' object has no attribute '_jvm'
Would you please guide me how to solve the problem?
Any help is really appreciated.
source https://stackoverflow.com/questions/76300991/pandas-udf-error-attributeerror-nonetype-object-has-no-attribute-jvm
Comments
Post a Comment