I'd like to do a manual train-test-split for a random forest or linear regression on a dataframe called whatever_df based on the Date column. I would use this column to select all rows that have dates earlier than 3 months before the most recent one in the column to make a new dataframe called train_df with those older dates, and a test_df with all the dates within the latest 3 months. In raw format the dataframe looks like:
PC1 PC2 Date
0 -0.319258 -0.042817 2019-05-24
1 -0.246079 0.131233 2019-05-24
2 -0.037325 0.562841 2019-05-24
3 -0.080725 0.594007 2019-05-24
4 0.133341 0.322822 2019-05-24
... ... ... ...
3607 -3.583419 3.766158 2022-06-26
3608 -3.305263 4.019327 2022-06-26
3609 -2.913036 4.854316 2022-06-26
3610 -2.755733 4.873996 2022-06-26
3611 -2.535929 4.582312 2022-06-26
So what I'd want for train_df would be all the rows where Date is up through March 2022 inclusive and test_df would be all the rows for March 2022-June 2022. I know I could just hardcode this, but I would like a dynamic way to select rows based on month values. I know that with datetime formatted columns I could find the newest date with just max(df['Date']), but I'm not sure how to say in Python "subtract 3 months from that".
I have tried this:
from datetime import datetime
from dateutil.relativedelta import relativedelta
train_df = whatever_df[whatever_df['Date'] - relativedelta(months=3)]
But I get
TypeError: unsupported operand type(s) for -: 'DatetimeArray' and 'relativedelta'
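A minimal sketch of one way around this error, assuming the Date column already has a datetime64 dtype: take the newest date as a single Timestamp first, subtract the offset from that scalar, and only then compare the column against the result. The sample rows below are made up for illustration; pd.DateOffset is used here in place of relativedelta.

```python
import pandas as pd

# Hypothetical stand-in for whatever_df (values invented for illustration)
whatever_df = pd.DataFrame({
    "PC1": [-0.319258, -0.246079, 0.133341, -3.583419, -2.535929],
    "PC2": [-0.042817, 0.131233, 0.322822, 3.766158, 4.582312],
    "Date": pd.to_datetime(
        ["2019-05-24", "2021-11-02", "2022-03-20", "2022-04-15", "2022-06-26"]
    ),
})

# Subtract from the scalar maximum, not from the whole column
cutoff = whatever_df["Date"].max() - pd.DateOffset(months=3)

# Build boolean masks by comparing the column with the cutoff
train_df = whatever_df[whatever_df["Date"] < cutoff]
test_df = whatever_df[whatever_df["Date"] >= cutoff]
```

The original attempt also used the result of the subtraction directly as the row selector; a boolean comparison such as the one above is what the indexing needs.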
Part of the problem is that each month's day in the Date column is somewhat arbitrary, so there really is no row with 2022-03-26 in the Date column, unless by sheer coincidence. Therefore, I need a way to select for just the month of the date in the Date column, and not exactly 3 months earlier to the day.
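Selecting by month rather than by exact day could be sketched as follows, assuming Date is a datetime64 column. Each date is converted to a monthly period, so the day of the month no longer matters. The sample rows are invented, and placing the boundary month (March 2022 here) in the test set is an assumption; swapping the comparison operators to <= and > would move it to the training set.

```python
import pandas as pd

# Hypothetical stand-in for whatever_df (values invented for illustration)
whatever_df = pd.DataFrame({
    "PC1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "Date": pd.to_datetime(
        ["2019-05-24", "2022-02-28", "2022-03-03", "2022-03-30", "2022-06-26"]
    ),
})

# Reduce every date to its calendar month, e.g. 2022-03-30 -> 2022-03
months = whatever_df["Date"].dt.to_period("M")

# Subtracting an integer from a monthly Period steps back whole months
cutoff_month = months.max() - 3

train_df = whatever_df[months < cutoff_month]   # months before the cutoff month
test_df = whatever_df[months >= cutoff_month]   # cutoff month and later
```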
source https://stackoverflow.com/questions/73534022/typeerror-unsupported-operand-types-for-datetimearray-and-relativedelta