How to Get Count Vectorization of Dataframe of Arrays of Strings

I have a dataframe df1 as follows:

    words_separated
0   [lorem, ipsum]
1   [dolor, sit, amet]
2   [lorem, ipsum, dolor, sit, lorem]

So each row contains an array of words. I would like to get something like this dataframe df2:

    lorem, ipsum, dolor, sit, amet
0   1,     1,     0,     0,   0
1   0,     0,     1,     1,   1
2   2,     1,     1,     1,   0

So df2 would have a column for each unique word that appeared in df1. The rows of df2 would correspond to the rows in df1 and record the number of times a word appeared in the corresponding row of df1. This is referred to as Count Vectorization.
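A minimal sketch of that definition, assuming df1 is built as shown above with the column named words_separated. It counts the words in each row with collections.Counter and lets pandas align the counts into columns; the variable names are illustrative only.

```python
from collections import Counter

import pandas as pd

# Assumed reconstruction of df1 from the example above.
df1 = pd.DataFrame(
    {
        "words_separated": [
            ["lorem", "ipsum"],
            ["dolor", "sit", "amet"],
            ["lorem", "ipsum", "dolor", "sit", "lorem"],
        ]
    }
)

# One Counter per row: maps each word to how often it occurs in that row.
row_counts = [Counter(words) for words in df1["words_separated"]]

# pandas treats each Counter as a dict, creating one column per unique word.
# Words missing from a row come out as NaN, so fill with 0 and cast to int.
df2 = pd.DataFrame(row_counts, index=df1.index).fillna(0).astype(int)

print(df2)
```

Row 2 gets a count of 2 for lorem because the word occurs twice in that row's list.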

I thought about using MultiLabelBinarizer like this:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Fit on the column of word lists; mlb.classes_ holds the unique words.
mlb = MultiLabelBinarizer()
mlb.fit(df1["words_separated"])
pd.DataFrame(mlb.transform(df1["words_separated"]), columns=mlb.classes_)

    lorem, ipsum, dolor, sit, amet
0   1,     1,     0,     0,   0
1   0,     0,     1,     1,   1
2   1,     1,     1,     1,   0

But this only records whether a word exists in a row, not how many times it appears, which is what I need.

source https://stackoverflow.com/questions/72820935/how-to-get-count-vectorization-of-dataframe-of-arrays-of-strings
