I am building a text classification model for sentiment analysis. The data contains a text column and a sentiment label (Positive, Neutral, Negative).
As a first step, I clean and normalize the data, then create Doc2Vec embeddings:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
# Convert the data to TaggedDocument format for Doc2Vec
documents = [TaggedDocument(words=text.split(), tags=[label]) for text, label in zip(data["text"], data["sentiment"])]
print(documents)
model = Doc2Vec(vector_size=10, window=2, min_count=1, workers=4, epochs=100)
model.build_vocab(documents)
model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
Then I prepare the training data by inferring a vector for each text and one-hot encoding the labels:
X_train = [model.infer_vector(text.split()) for text in data["text"]]
print(X_train)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder = LabelEncoder()
y_trainEmbedding = label_encoder.fit_transform(data['sentiment'])
onehot_encoder = OneHotEncoder(sparse=False)
y_trainEmbedding = onehot_encoder.fit_transform(y_trainEmbedding.reshape(-1, 1))
Then I build the LSTM model:
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
num_classes = len(np.unique(data["sentiment"]))
model_lstm = Sequential()
model_lstm.add(LSTM(64, input_shape=(10, 1)))
model_lstm.add(Dense(32, activation="relu"))
model_lstm.add(Dense(num_classes, activation="softmax"))
model_lstm.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
X_train_lstm = np.array(X_train).reshape(-1, 10, 1)
y_train_lstm = np.array(y_trainEmbedding)
model_lstm.fit(X_train_lstm, y_train_lstm, epochs=100, batch_size=32)
The result is good: the training accuracy is 0.99.
But when I try to predict the label of a new text, as below:
# Use the trained model to predict the sentiment of new texts
text = "هذا البيت جميل "
text=remove_punctuations(text)
text=remove_repeating_char(text)
text=remove_english_char(text)
text=remove_diacritics(text)
text=remove_noise_char(text)
text=tokenizer(text)
text=remove_stop_word(text)
text=stemming(text)
new_embedding = model.infer_vector(text.split())
print(new_embedding)
new_embedding_lstm = np.array(new_embedding).reshape(-1, 10, 1)
print(new_embedding)
y_pred = model_lstm.predict(new_embedding_lstm)
print(y_pred)
predicted_label = label_encoder.inverse_transform(np.argmax(y_pred))
print(predicted_label)
this error occurred:
18
---> 19 predicted_label = label_encoder.inverse_transform(np.argmax(y_pred))
20 print(predicted_label)
1 frames
/usr/local/lib/python3.10/dist-packages/sklearn/utils/validation.py in column_or_1d(y, dtype, warn)
1200 return _asarray_with_order(xp.reshape(y, -1), order="C", xp=xp)
1201
-> 1202 raise ValueError(
1203 "y should be a 1d array, got an array of shape {} instead.".format(shape)
1204 )
ValueError: y should be a 1d array, got an array of shape () instead.
Is my process correct? Can anyone help me solve this error?
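A likely cause, going by the traceback above: np.argmax(y_pred) called without an axis returns a single scalar of shape (), while LabelEncoder.inverse_transform expects a 1-D array of class indices. The sketch below illustrates one possible fix. The hard-coded y_pred and the label names are hypothetical stand-ins for the output of model_lstm.predict and for data["sentiment"], so it is an illustration under those assumptions rather than a verified fix for the full pipeline.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for data["sentiment"].
label_encoder = LabelEncoder()
label_encoder.fit(["Positive", "Neutral", "Negative"])

# Hypothetical stand-in for model_lstm.predict(new_embedding_lstm):
# one row of softmax scores per input text, shape (1, num_classes).
y_pred = np.array([[0.1, 0.2, 0.7]])

# np.argmax(y_pred) with no axis collapses everything to a scalar of shape (),
# which inverse_transform rejects. Taking the argmax along axis=1 keeps a
# 1-D array holding one class index per input row.
class_indices = np.argmax(y_pred, axis=1)

# inverse_transform maps the integer indices back to the original label strings.
predicted_label = label_encoder.inverse_transform(class_indices)
print(predicted_label[0])
```

Wrapping the scalar in a list, as in label_encoder.inverse_transform([np.argmax(y_pred)]), should behave the same way for a single input; the axis=1 form also handles a batch of several texts.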
source https://stackoverflow.com/questions/76401941/sentiment-classification-using-doc2vec-and-lstm-models