I have a dataframe df1 as follows:
words_separated
0 [lorem, ipsum]
1 [dolor, sit, amet]
2 [lorem, ipsum, dolor, sit, lorem]
So each row contains an array of words. I would like to get something like this dataframe df2:
lorem, ipsum, dolor, sit, amet
0 1, 1, 0, 0, 0
1 0, 0, 1, 1, 1
2 2, 1, 1, 1, 0
So df2 would have a column for each unique word that appeared in df1. The rows of df2 would correspond to the rows in df1 and record the number of times a word appeared in the corresponding row of df1. This is referred to as Count Vectorization.
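To make the definition concrete, here is a minimal plain-Python sketch of the counting itself, with no pandas or scikit-learn involved. The names rows, vocab and matrix are made up for illustration, and the data is just the three sample rows from df1.

```python
# Toy data mirroring the three rows of df1["words_separated"].
rows = [
    ["lorem", "ipsum"],
    ["dolor", "sit", "amet"],
    ["lorem", "ipsum", "dolor", "sit", "lorem"],
]

# Vocabulary in first-seen order (dict.fromkeys drops duplicates, keeps order).
vocab = list(dict.fromkeys(word for row in rows for word in row))

# One count per (row, word) pair: how often each vocabulary word occurs in the row.
matrix = [[row.count(word) for word in vocab] for row in rows]

print(vocab)
for counts in matrix:
    print(counts)
```

Each inner list of matrix corresponds to one row of the desired df2, with the columns ordered as in vocab.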
I thought about using MultiLabelBinarizer like this:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
count_vec = MultiLabelBinarizer()
mlb = count_vec.fit(df1["words_separated"])
pd.DataFrame(mlb.transform(df1["words_separated"]), columns=mlb.classes_)
lorem, ipsum, dolor, sit, amet
0 1, 1, 0, 0, 0
1 0, 0, 1, 1, 1
2 1, 1, 1, 1, 0
But this only indicates whether a word exists in a row, not how many times it appears, which is what I need.
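For comparison, one possible way to get actual counts instead of presence flags is to build a collections.Counter per row and let pandas line the columns up. This is only a sketch under assumptions: df1 is rebuilt here from the sample rows, the column is assumed to be named words_separated, and df2 is the name used above for the desired result.

```python
from collections import Counter

import pandas as pd

# Rebuild the sample df1 (column name assumed to be "words_separated").
df1 = pd.DataFrame({"words_separated": [
    ["lorem", "ipsum"],
    ["dolor", "sit", "amet"],
    ["lorem", "ipsum", "dolor", "sit", "lorem"],
]})

# One Counter per row maps each word to its number of occurrences in that row.
counts = [Counter(words) for words in df1["words_separated"]]

# Words missing from a row come out as NaN, so fill with 0 and cast back to int.
df2 = pd.DataFrame(counts, index=df1.index).fillna(0).astype(int)

print(df2)
```

Unlike the MultiLabelBinarizer output, the lorem column for row 2 should now hold 2 rather than 1.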
source https://stackoverflow.com/questions/72820935/how-to-get-count-vectorization-of-dataframe-of-arrays-of-strings