filter polars DataFrame based on when rows whose specific columns contain pairs from a list of pairs
In this example, on columns ["foo", "ham"]
, I want rows 1 and 4 to be removed since they match a pair in the list
df = pl.DataFrame(
{
"foo": [1, 1, 2, 2, 3, 3, 4],
"bar": [6, 7, 8, 9, 10, 11, 12],
"ham": ["a", "b", "c", "d", "e", "f", "b"]
}
)
pairs = [(1,"b"),(3,"e"),(4,"g")]
The following worked for me but I think this will be problematic when the dataframe and list of pairs are large.
for a, b in pairs:
df = df.filter(~(pl.col('foo') == a) | ~(pl.col('ham') == b))
I think this is the pandas implementation for this problem Pandas: How to remove rows from a dataframe based on a list of tuples representing values in TWO columns?
I am not sure what the Polars implementation of it is.
(I think this problem can be generalized to any number of selected columns and any number of elements in a group. For instance, rather than a list of pairs, it can be another dataframe. You get the 'set difference', in terms of rows, of the two dataframes based on specific columns.)
source https://stackoverflow.com/questions/75568011/filter-polars-dataframe-based-on-when-rows-whose-specific-columns-contain-pairs
Comments
Post a Comment