I have an example
+─────────────+────────+────────+────────+
| main_group | COL_A | COL_B | COL_C |
+─────────────+────────+────────+────────+
| 0 | TXT1 | | None |
| 0 | TXT2 | | None |
| 0 | 5 | | None |
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
| 1 | TXT11 | | None |
| 1 | TXT12 | | None |
| 1 | 0.50 | | None |
| 1 | 0.45 | 0.45 | 0 |
| 1 | 0.31 | 0.31 | 1 |
| 1 | 0.35 | 0.35 | 1 |
| 1 | 0.73 | 0.73 | 1 |
| 2 | 0.5 | | None |
| 2 | 4.15 | 4.15 | 0 |
| 2 | 2.98 | 2.98 | 0 |
| 2 | 1.53 | 1.53 | 0 |
| 3 | 4.46 | | None |
| 3 | 4.00 | 4.00 | 0 |
| 3 | 0.95 | 0.95 | 1 |
| 3 | 1.35 | 1.35 | 1 |
| 3 | 1.79 | 1.79 | 1 |
+─────────────+────────+────────+────────+
I would like to move the value from COL_A
for the last occurrence of a row with None
in column COL_C
per main_group
. This value should be moved to the first element in group for COL_B
column and then the previous contents should be removed.
Here is how it should look like for main_group
== 0:
+─────────────+────────+────────+────────+
| main_group | COL_A | COL_B | COL_C |
+─────────────+────────+────────+────────+
| 0 | TXT1 | 5 | None | <--- value "5" from the last row with "None" in `COL_C` in `main_group` == 0 was moved to the first row in the same group
| 0 | TXT2 | | None |
| 0 | 5 | | None | <--- After that this row should be deleted
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
+─────────────+────────+────────+────────+
+─────────────+────────+────────+────────+
| main_group | COL_A | COL_B | COL_C |
+─────────────+────────+────────+────────+
| 2 | 0.5 | 0.5 | None | <--- value in column `COL_B` should be same as in column `COL_A` because there are no other rows in the same `main_group` with "None" in column `COL_C`
| 2 | 4.15 | 4.15 | 0 |
| 2 | 2.98 | 2.98 | 0 |
| 2 | 1.53 | 1.53 | 0 |
| 3 | 4.46 | 4.46 | None | <--- value in column `COL_B` should be same as in column `COL_A` because there are no other rows in the same `main_group` with "None" in column `COL_C`
| 3 | 4.00 | 4.00 | 0 |
| 3 | 0.95 | 0.95 | 1 |
| 3 | 1.35 | 1.35 | 1 |
| 3 | 1.79 | 1.79 | 1 |
+─────────────+────────+────────+────────+
After this operation, df
should looks like:
+─────────────+────────+────────+────────+
| main_group | COL_A | COL_B | COL_C |
+─────────────+────────+────────+────────+
| 0 | TXT1 | 5 | None |
| 0 | TXT2 | | None |
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
| 1 | TXT11 | 0.50 | None |
| 1 | TXT12 | | None |
| 1 | 0.45 | 0.45 | 0 |
| 1 | 0.31 | 0.31 | 1 |
| 1 | 0.35 | 0.35 | 1 |
| 1 | 0.73 | 0.73 | 1 |
| 2 | 0.5 | 0.5 | None |
| 2 | 4.15 | 4.15 | 0 |
| 2 | 2.98 | 2.98 | 0 |
| 2 | 1.53 | 1.53 | 0 |
| 3 | 4.46 | 4.46 | None |
| 3 | 4.00 | 4.00 | 0 |
| 3 | 0.95 | 0.95 | 1 |
| 3 | 1.35 | 1.35 | 1 |
| 3 | 1.79 | 1.79 | 1 |
+─────────────+────────+────────+────────+
In the last step, I would like to concatenating selected COL_A
column in each main_group
where value in COL_C
is None
.
Example:
+─────────────+────────+────────+────────+
| main_group | COL_A | COL_B | COL_C |
+─────────────+────────+────────+────────+
| 0 | TXT1 | 5 | None |
| 0 | TXT2 | | None |
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
↓↓↓↓↓↓↓↓↓↓
+─────────────+────────────+────────+────────+
| main_group | COL_A | COL_B | COL_C |
+─────────────+────────────+────────+────────+
| 0 | TXT1 TXT2 | 5 | None | <--- If there are more than 1 row with "None" in column `COL_C` in each group, then values in column `COL_A` should be "merged" into one row, and all others should be deleted
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
The first solution I have is to .loc
those rows in the group that have a value "None" in COL_C
column and then assign to the first element (.iloc
) value from the last row. However, this solution is not quite correct, and I'm additionally convinced that this can be done using .groupby
instead of iterating and searching for elements after each group but I can't do it correctly.
The solution I was able to get this way:
+─────────────+────────────+────────+────────+
| main_group | COL_A | COL_B | COL_C |
+─────────────+────────────+────────+────────+
| 0 | TXT1 TXT2 | 5 | None |
| NaN | NaN | NaN | NaN |
| NaN | NaN | NaN | NaN |
| 0 | 1.93 | 1.93 | 0 |
| 0 | 7.60 | 7.60 | 1 |
| 0 | 2.46 | 2.46 | 1 |
This is partially transferred correctly, but there are still NaN
values in the rows that should not be there anymore. Of course I can delete those rows and re-index df
but this solution relies on loops, which will certainly not be efficient for large df
.
How can I do it with a workaround of these loops over individual groups and swapping values using .loc
and .iloc
?
source https://stackoverflow.com/questions/72368280/pandas-concatenating-selected-rows-from-several-groups-with-additional-conditi
Comments
Post a Comment