I was working on a piece of code that relies heavily on the visual question answering demo. I mask the text input with the [MASK] token before feeding it to BERT, and provide a label that accompanies the mask. Visual embeddings are extracted with an R-CNN, which gives me 36 region vectors, and I take the mean of all 36 vectors as shown below:
features = torch.mean(output_dict.get("roi_features"), axis=1).reshape(1,1,2048)
This pooled feature is fed to the VisualBERT for pre-training model, which gives me prediction_logits. As you can see in the notebook (and here as well), taking the argmax of the prediction logits gives:
prediction_logits[0].argmax(-1)
>> tensor([1012, 1037, 6302, 1997, 1037, 5723, 1012, 2003])
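For context, here is a minimal sketch of how the logits are produced. The bert-base-uncased tokenizer, the uclanlp/visualbert-vqa-coco-pre checkpoint, the prompt "a photo of a [MASK]." and the target sentence used as the label are assumptions on my part; the notebook has the exact setup.

import torch
from transformers import BertTokenizer, VisualBertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# masked text input (assumed prompt)
inputs = tokenizer("a photo of a [MASK].", return_tensors="pt")

# `features` is the mean-pooled RoI tensor from above, shape (1, 1, 2048)
visual_attention_mask = torch.ones(features.shape[:-1], dtype=torch.float)
visual_token_type_ids = torch.ones(features.shape[:-1], dtype=torch.long)

# one way to supply the label that accompanies the mask: pad the target
# sentence to text length + visual length so it lines up with prediction_logits
total_len = inputs["input_ids"].shape[-1] + features.shape[-2]
labels = tokenizer("a photo of a cat.", return_tensors="pt",
                   padding="max_length", max_length=total_len)["input_ids"]

outputs = model(
    **inputs,
    visual_embeds=features,
    visual_attention_mask=visual_attention_mask,
    visual_token_type_ids=visual_token_type_ids,
    labels=labels,
)
prediction_logits = outputs.prediction_logits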
Now, when I look up the words for these predicted ids in the tokenizer's vocabulary, this is the output:
.
a
photo
of
a
bathroom
.
is
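The id-to-token lookup itself is essentially this (a sketch; tokenizer and prediction_logits are the objects from the sketch above):

predicted_ids = prediction_logits[0].argmax(-1)
# map each predicted id back to its token string via the tokenizer vocabulary
print(tokenizer.convert_ids_to_tokens(predicted_ids.tolist()))
# ['.', 'a', 'photo', 'of', 'a', 'bathroom', '.', 'is']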
Instead of bathroom I should have got cat, or at least something close to cat, but there is a gap of about 3 in the logit scores between bathroom (which scores highest in the output, at 9.5069) and cat (at 6.3830). Can we somehow raise the score of cat so that it becomes the most likely output?
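For completeness, this is roughly how the two scores quoted above can be compared at the masked position. The position index 5 is an assumption based on where bathroom appears in the decoded output, and the softmax view is just another way to look at the same gap.

import torch.nn.functional as F

mask_position = 5  # where "bathroom" shows up in the decoded output above
bathroom_id = tokenizer.convert_tokens_to_ids("bathroom")
cat_id = tokenizer.convert_tokens_to_ids("cat")

logits_at_mask = prediction_logits[0, mask_position]
print(logits_at_mask[bathroom_id].item())  # ~9.5069 in my run
print(logits_at_mask[cat_id].item())       # ~6.3830 in my run

# the same comparison viewed as probabilities over the vocabulary
probs = F.softmax(logits_at_mask, dim=-1)
print(probs[bathroom_id].item(), probs[cat_id].item())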
Source: https://stackoverflow.com/questions/72622277/masked-image-and-language-modelling-using-visualbert