I have a dataframe df1 as follows:

       words_separated
    0  [lorem, ipsum]
    1  [dolor, sit, amet]
    2  [lorem, ipsum, dolor, sit, lorem]

So each row contains an array of words. I would like to get something like this dataframe df2:

       lorem  ipsum  dolor  sit  amet
    0      1      1      0    0     0
    1      0      0      1    1     1
    2      2      1      1    1     0

So df2 would have one column for each unique word that appears in df1. The rows of df2 would correspond to the rows of df1, and each cell would record the number of times that word appears in the corresponding row of df1. This is referred to as count vectorization.

I thought about using MultiLabelBinarizer like this:

    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer

    count_vec = MultiLabelBinarizer()
    # fit() learns the set of unique words and returns the fitted binarizer
    mlb = count_vec.fit(df1["words_separated"])
    # classes_ is passed directly (not wrapped in a list) so the columns are a flat index
    pd.DataFrame(mlb.transform(df1["words_separated"]), columns=mlb.classes_)

       lorem  ipsum  dolor  sit  amet
    0      1      1      0    0     0
    1      0      0      1    1     1
    2  ...
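
For illustration, the count vectorization described above could be sketched roughly as follows. This is only one possible approach, not necessarily the best one: it swaps MultiLabelBinarizer (which records only whether a word is present, as 0 or 1) for collections.Counter from the standard library, which counts repeats. The names row_counts and df2 are chosen here for the example, and df1 is rebuilt from the sample data in the question.

```python
from collections import Counter

import pandas as pd

# Sample data taken from the question.
df1 = pd.DataFrame({
    "words_separated": [
        ["lorem", "ipsum"],
        ["dolor", "sit", "amet"],
        ["lorem", "ipsum", "dolor", "sit", "lorem"],
    ]
})

# One plain dict per row, mapping each word to its number of occurrences in that row.
row_counts = [dict(Counter(words)) for words in df1["words_separated"]]

# A list of dicts becomes one column per distinct key.
# Words absent from a row come out as NaN, so fill with 0 and cast back to int.
df2 = pd.DataFrame(row_counts, index=df1.index).fillna(0).astype(int)

print(df2)
```

With this sketch, the repeated "lorem" in the last row is counted twice rather than collapsed to 1, which is the difference from the binarizer attempt.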