I have a dataframe like this:
+--------+------------+----------+--------------+--------------+----------------+--------------+--------------+
| company|          id|ann_rtn_dt|share_class_nb|shrhldr_seq_nb|shrhldr_first_nm|shrhldr_mid_nm|shrhldr_sur_nm|
+--------+------------+----------+--------------+--------------+----------------+--------------+--------------+
|SYNTHE01|SYNTHE01_1_1|2022-11-28|             1|             1|            NIEL|        ANDREW|        HOPSON|
|SYNTHE01|SYNTHE01_3_1|2022-11-28|             3|             1|          NICOLE|        CLAIRE|          MORE|
|SYNTHE01|SYNTHE01_1_2|2022-11-28|             1|             2|               N|             C|          MORE|
|SYNTHE01|SYNTHE01_2_1|2022-11-28|             2|             1|            NEIL|        ANDREW|        HOPSON|
|SYNTHE01|SYNTHE01_3_1|2022-11-28|             3|             1|          NICOLE|        CLAIRE|          MORE|
|SYNTHE02|SYNTHE02_1_1|2022-11-28|             1|             1|            MIKE|              |        LOPSON|
|SYNTHE02|SYNTHE02_3_1|2022-11-28|             3|             1|          NIMIKE|              |        LOPSON|
|SYNTHE02|SYNTHE02_1_2|2022-11-28|             1|             2|            MIKE|              |        LOPSON|
|SYNTHE02|SYNTHE02_2_1|2022-11-28|             2|             1|            MIKE|              |        LOPSON|
+--------+------------+----------+--------------+--------------+----------------+--------------+--------------+
The whole dataframe can be grouped by the company column into 2 distinct groups, i.e. SYNTHE01 and SYNTHE02.
My use case is to do the matching within each company.
STATUS_1 is set to the min of id when there is a full match of shrhldr_first_nm, shrhldr_mid_nm and shrhldr_sur_nm within the group.
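One possible way to express the STATUS_1 rule in pandas is a groupby on company plus the three name columns, followed by a transform. This is only a sketch on a small made-up sample that reuses the question's column names. It assumes blank middle names should compare equal to each other, that the output should be an empty string where nothing matches, and that "min of id" means the lexicographic minimum of the id strings.

```python
import pandas as pd

# Small made-up sample using the question's column names (not the full table).
df = pd.DataFrame({
    "company": ["SYNTHE01", "SYNTHE01", "SYNTHE01", "SYNTHE02", "SYNTHE02"],
    "id": ["SYNTHE01_2_1", "SYNTHE01_1_1", "SYNTHE01_1_2",
           "SYNTHE02_1_2", "SYNTHE02_1_1"],
    "shrhldr_first_nm": ["NEIL", "NEIL", "N", "MIKE", "MIKE"],
    "shrhldr_mid_nm": ["ANDREW", "ANDREW", "C", None, None],
    "shrhldr_sur_nm": ["HOPSON", "HOPSON", "MORE", "LOPSON", "LOPSON"],
})

name_cols = ["shrhldr_first_nm", "shrhldr_mid_nm", "shrhldr_sur_nm"]

# Assumption: missing name parts count as empty strings. Without this,
# groupby would drop rows whose key contains a missing value.
keys = [df["company"]] + [df[c].fillna("").str.strip() for c in name_cols]

grouped = df.groupby(keys)["id"]
min_id = grouped.transform("min")       # smallest id string in each group
group_size = grouped.transform("size")  # number of rows in each group

# Keep the group's minimum id only where at least two rows share the full name.
df["STATUS_1"] = min_id.where(group_size > 1, "")

print(df[["company", "id", "STATUS_1"]])
```

Because the ids are strings, the minimum is taken character by character, so an id such as SYNTHE01_10_1 would sort before SYNTHE01_2_1. If that is not what is wanted, the numeric parts would have to be split out first.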
STATUS_2 is set to the min of id when the first byte of shrhldr_first_nm and the first byte of shrhldr_mid_nm match within the group and shrhldr_sur_nm matches exactly.
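The STATUS_2 rule could be sketched the same way, grouping on the initials instead of the full first and middle names. The sample below is made up, the helper first_char is hypothetical, and it takes the first character rather than the first byte (the same thing for plain ASCII names). It applies the rule literally as stated, so rows that already match on the full name would also get a STATUS_2 value; the expected output appears to leave some of those blank, which would need an extra condition.

```python
import pandas as pd

# Small made-up sample using the question's column names (not the full table).
df = pd.DataFrame({
    "company": ["SYNTHE01", "SYNTHE01", "SYNTHE01", "SYNTHE02", "SYNTHE02"],
    "id": ["SYNTHE01_3_1", "SYNTHE01_1_2", "SYNTHE01_1_1",
           "SYNTHE02_1_1", "SYNTHE02_3_1"],
    "shrhldr_first_nm": ["NICOLE", "N", "NEIL", "MIKE", "NIMIKE"],
    "shrhldr_mid_nm": ["CLAIRE", "C", "ANDREW", None, None],
    "shrhldr_sur_nm": ["MORE", "MORE", "HOPSON", "LOPSON", "LOPSON"],
})


def first_char(col: pd.Series) -> pd.Series:
    """Hypothetical helper: first character of each value, '' if missing."""
    return col.fillna("").str.strip().str[:1]


# Group on company, both initials, and the exact surname.
keys = [
    df["company"],
    first_char(df["shrhldr_first_nm"]).rename("first_initial"),
    first_char(df["shrhldr_mid_nm"]).rename("mid_initial"),
    df["shrhldr_sur_nm"].fillna("").str.strip(),
]

grouped = df.groupby(keys)["id"]

# Keep the group's minimum id only where at least two rows share the key.
df["STATUS_2"] = grouped.transform("min").where(grouped.transform("size") > 1, "")

print(df[["company", "id", "STATUS_2"]])
```

Here NICOLE CLAIRE MORE and N C MORE land in the same group, so both rows get SYNTHE01_1_2, while MIKE and NIMIKE differ in their first initial and stay blank.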
For example, in company SYNTHE01, NIEL ANDREW HOPSON in row 1 matches NIEL ANDREW HOPSON in row 4, so the STATUS_1 column is set to the min of the id column for both.
For example, in company SYNTHE01, the first bytes of NICOLE CLAIRE MORE in row 2 match N C MORE in row 3, so the STATUS_2 column is set to the min of the id column for both.
My expected output dataframe would look like this:
+--------+------------+----------+--------------+--------------+----------------+--------------+--------------+-------------+-------------+
| company|          id|ann_rtn_dt|share_class_nb|shrhldr_seq_nb|shrhldr_first_nm|shrhldr_mid_nm|shrhldr_sur_nm|     STATUS_1|     STATUS_2|
+--------+------------+----------+--------------+--------------+----------------+--------------+--------------+-------------+-------------+
|SYNTHE01|SYNTHE01_1_1|2022-11-28|             1|             1|            NIEL|        ANDREW|        HOPSON| SYNTHE01_1_1|             |
|SYNTHE01|SYNTHE01_3_1|2022-11-28|             3|             1|          NICOLE|        CLAIRE|          MORE| SYNTHE01_3_1| SYNTHE01_1_2|
|SYNTHE01|SYNTHE01_1_2|2022-11-28|             1|             2|               N|             C|          MORE|             | SYNTHE01_1_2|
|SYNTHE01|SYNTHE01_2_1|2022-11-28|             2|             1|            NEIL|        ANDREW|        HOPSON| SYNTHE01_1_1|             |
|SYNTHE01|SYNTHE01_3_2|2022-11-28|             3|             1|          NICOLE|        CLAIRE|          MORE| SYNTHE01_3_1| SYNTHE01_1_2|
|SYNTHE02|SYNTHE02_1_1|2022-11-28|             1|             1|            MIKE|              |        LOPSON| SYNTHE02_1_1|             |
|SYNTHE02|SYNTHE02_3_1|2022-11-28|             3|             1|          NIMIKE|              |        LOPSON|             |             |
|SYNTHE02|SYNTHE02_1_2|2022-11-28|             1|             2|            MIKE|              |        LOPSON| SYNTHE02_1_1|             |
|SYNTHE02|SYNTHE02_2_1|2022-11-28|             2|             1|            MIKE|              |        LOPSON| SYNTHE02_1_1|             |
+--------+------------+----------+--------------+--------------+----------------+--------------+--------------+-------------+-------------+
We tried this in PySpark but could not achieve it. We are now trying to do it in pandas. Please suggest any possible approach. Thank you.
source https://stackoverflow.com/questions/76256407/match-the-string-data-inside-a-group-pandas