I have a DataFrame with a column that contains lists of strings. I want to filter the DataFrame to drop rows whose list-column value duplicates that of an earlier row.
For example,
import polars as pl

# Create a DataFrame with a list[str] column
data = pl.DataFrame({
    "id": [1, 2, 3, 4],
    "values": [
        ["a", "a", "a"],  # first two rows are duplicates
        ["a", "a", "a"],
        ["b", "b", "b"],
        ["c", "d", "e"],
    ],
})
print(data)
shape: (4, 2)
┌─────┬─────────────────┐
│ id ┆ values │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═════╪═════════════════╡
│ 1 ┆ ["a", "a", "a"] │
│ 2 ┆ ["a", "a", "a"] │
│ 3 ┆ ["b", "b", "b"] │
│ 4 ┆ ["c", "d", "e"] │
└─────┴─────────────────┘
Desired result:
shape: (3, 2)
┌─────┬─────────────────┐
│ id ┆ values │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═════╪═════════════════╡
│ 1 ┆ ["a", "a", "a"] │
│ 3 ┆ ["b", "b", "b"] │
│ 4 ┆ ["c", "d", "e"] │
└─────┴─────────────────┘
Using the unique method doesn't work for a list[str] column (it does work when the lists contain numeric types, though):
data.unique(subset="values")
ComputeError: grouping on list type is only allowed if the inner type is numeric
Source: https://stackoverflow.com/questions/76017680/remove-duplicated-rows-of-a-liststr-type-column-in-polars