I have a DataFrame with a column that contains lists of strings. I want to filter the DataFrame to drop rows whose list-column value duplicates that of an earlier row.
For example,
import polars as pl

# Create a DataFrame with a list[str] column
data = pl.DataFrame({
    "id": [1, 2, 3, 4],
    "values": [
        ["a", "a", "a"],  # first two rows are duplicates
        ["a", "a", "a"],
        ["b", "b", "b"],
        ["c", "d", "e"],
    ],
})
print(data)
shape: (4, 2)
┌─────┬─────────────────┐
│ id ┆ values │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═════╪═════════════════╡
│ 1 ┆ ["a", "a", "a"] │
│ 2 ┆ ["a", "a", "a"] │
│ 3 ┆ ["b", "b", "b"] │
│ 4 ┆ ["c", "d", "e"] │
└─────┴─────────────────┘
Desired result:
shape: (3, 2)
┌─────┬─────────────────┐
│ id ┆ values │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═════╪═════════════════╡
│ 1 ┆ ["a", "a", "a"] │
│ 3 ┆ ["b", "b", "b"] │
│ 4 ┆ ["c", "d", "e"] │
└─────┴─────────────────┘
Using the unique method doesn't work for a list[str] column (it does work when the lists contain numeric types, though):
data.unique(subset="values")
ComputeError: grouping on list type is only allowed if the inner type is numeric
Source: https://stackoverflow.com/questions/76017680/remove-duplicated-rows-of-a-liststr-type-column-in-polars