I was working on a piece of code that relies heavily on the visual question answering demo. I mask the text input with the [MASK] token before feeding it to BERT, and provide a label that accompanies the mask. Visual embeddings are extracted with an R-CNN, which gives me 36 region vectors, and I take the mean of all 36 vectors as shown below:
features = torch.mean(output_dict.get("roi_features"), axis=1).reshape(1,1,2048)
This pooled feature is fed to the VisualBERT for pre-training model, which gives me prediction_logits. As you can see in the notebook (and here as well), taking the argmax of the prediction logits gives:
prediction_logits[0].argmax(-1)
>> tensor([1012, 1037, 6302, 1997, 1037, 5723, 1012, 2003])
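For context, here is a minimal sketch of how the logits are produced. The bert-base-uncased tokenizer, the uclanlp/visualbert-vqa-coco-pre checkpoint, the prompt "a photo of a [MASK]." and the target sentence used as the label are assumptions on my part; the notebook has the exact setup.

import torch
from transformers import BertTokenizer, VisualBertForPreTraining

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertForPreTraining.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# masked text input (assumed prompt)
inputs = tokenizer("a photo of a [MASK].", return_tensors="pt")

# `features` is the mean-pooled RoI tensor from above, shape (1, 1, 2048)
visual_attention_mask = torch.ones(features.shape[:-1], dtype=torch.float)
visual_token_type_ids = torch.ones(features.shape[:-1], dtype=torch.long)

# one way to supply the label that accompanies the mask: pad the target
# sentence to text length + visual length so it lines up with prediction_logits
total_len = inputs["input_ids"].shape[-1] + features.shape[-2]
labels = tokenizer("a photo of a cat.", return_tensors="pt",
                   padding="max_length", max_length=total_len)["input_ids"]

outputs = model(
    **inputs,
    visual_embeds=features,
    visual_attention_mask=visual_attention_mask,
    visual_token_type_ids=visual_token_type_ids,
    labels=labels,
)
prediction_logits = outputs.prediction_logits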
Now, when I look up the words for these predicted ids in the tokenizer's vocabulary, this is the output:
.
a
photo
of
a
bathroom
.
is
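The id-to-token lookup itself is essentially this (a sketch; tokenizer and prediction_logits are the objects from the sketch above):

predicted_ids = prediction_logits[0].argmax(-1)
# map each predicted id back to its token string via the tokenizer vocabulary
print(tokenizer.convert_ids_to_tokens(predicted_ids.tolist()))
# ['.', 'a', 'photo', 'of', 'a', 'bathroom', '.', 'is']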
Instead of bathroom I should have got cat, or at least something close to cat, but there is a gap of about 3 in the logit scores between bathroom (which scores highest in the output, at 9.5069) and cat (at 6.3830). Can we somehow raise the score of cat so that it becomes the most likely output?
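For completeness, this is roughly how the two scores quoted above can be compared at the masked position. The position index 5 is an assumption based on where bathroom appears in the decoded output, and the softmax view is just another way to look at the same gap.

import torch.nn.functional as F

mask_position = 5  # where "bathroom" shows up in the decoded output above
bathroom_id = tokenizer.convert_tokens_to_ids("bathroom")
cat_id = tokenizer.convert_tokens_to_ids("cat")

logits_at_mask = prediction_logits[0, mask_position]
print(logits_at_mask[bathroom_id].item())  # ~9.5069 in my run
print(logits_at_mask[cat_id].item())       # ~6.3830 in my run

# the same comparison viewed as probabilities over the vocabulary
probs = F.softmax(logits_at_mask, dim=-1)
print(probs[bathroom_id].item(), probs[cat_id].item())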
Source: https://stackoverflow.com/questions/72622277/masked-image-and-language-modelling-using-visualbert