I'm starting a project to adapt our data lake to support targeted purging of data, in order to comply with data privacy legislation.
Basically, the data owner opens a ticket requesting the deletion of records for a specific user, and I need to sweep all AWS S3 buckets, check every Parquet file, and delete that user's records from all Parquet files in my data lake.
Has anyone developed a similar project in Python or PySpark?
Can you suggest what would be good industry practice for this case?
Today what I'm doing is reading all the Parquet files, loading them into a DataFrame, filtering that DataFrame to exclude the record in question, and rewriting the partition where that record was (roughly like the sketch below). This solution does work, but for a purge that has to scan a 5-year history the processing is very heavy.
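For context, my current approach looks roughly like this. It is only a minimal sketch: the bucket path, the `user_id` column, and the `event_date` partition column are placeholders, and it assumes dynamic partition overwrite so only the touched partitions get rewritten.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("data-purge")
    # Overwrite only the partitions present in the output, not the whole table
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

# Placeholder path and values -- in practice these come from the purge ticket
table_path = "s3://my-datalake-bucket/curated/events/"
user_id_to_purge = "12345"

# Read the whole partitioned Parquet dataset
df = spark.read.parquet(table_path)

# Find which partitions actually contain the user's records
affected_parts = (
    df.filter(F.col("user_id") == user_id_to_purge)
      .select("event_date")
      .distinct()
)

# Keep only the affected partitions, minus the user's rows
cleaned = (
    df.join(affected_parts, on="event_date", how="inner")
      .filter(F.col("user_id") != user_id_to_purge)
)

# Rewrite just those partitions (dynamic overwrite).
# Caveat: a partition holding only the purged user's rows would produce no
# output and therefore would not be overwritten by this sketch.
(
    cleaned.write
           .mode("overwrite")
           .partitionBy("event_date")
           .parquet(table_path)
)
```

Even limited to the affected partitions, this still requires scanning the full dataset to locate the records, which is what makes the 5-year history so expensive.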
Can anyone suggest a more practical solution?
Keep in mind that my Parquet files are in AWS S3, they are exposed as Athena tables, and my script will run on EMR (PySpark).
Thanks
source https://stackoverflow.com/questions/75584567/how-to-search-and-delete-specific-lines-from-a-parquet-file-in-pyspark-data-pu