I have a DataFrame with a column that contains lists of strings. I want to filter the DataFrame so that rows whose list value duplicates that of an earlier row are dropped.
For example,
import polars as pl

# Create a DataFrame with a list[str] type column
data = pl.DataFrame({
    "id": [1, 2, 3, 4],
    "values": [
        ["a", "a", "a"],  # first two rows are duplicated
        ["a", "a", "a"],
        ["b", "b", "b"],
        ["c", "d", "e"],
    ],
})
print(data)
shape: (4, 2)
┌─────┬─────────────────┐
│ id  ┆ values          │
│ --- ┆ ---             │
│ i64 ┆ list[str]       │
╞═════╪═════════════════╡
│ 1   ┆ ["a", "a", "a"] │
│ 2   ┆ ["a", "a", "a"] │
│ 3   ┆ ["b", "b", "b"] │
│ 4   ┆ ["c", "d", "e"] │
└─────┴─────────────────┘
Desired result:
shape: (3, 2)
┌─────┬─────────────────┐
│ id  ┆ values          │
│ --- ┆ ---             │
│ i64 ┆ list[str]       │
╞═════╪═════════════════╡
│ 1   ┆ ["a", "a", "a"] │
│ 3   ┆ ["b", "b", "b"] │
│ 4   ┆ ["c", "d", "e"] │
└─────┴─────────────────┘
Using the unique method doesn't work for type list[str] (it works when the list contains numeric types, though):
data.unique(subset="values")
ComputeError: grouping on list type is only allowed if the inner type is numeric
source https://stackoverflow.com/questions/76017680/remove-duplicated-rows-of-a-liststr-type-column-in-polars