For loop that finds and stores words present in a supplied word dataset raises NameError
I have a list named 'result', shown below:
>>> result
[[['apple'],['banana'],['green','grapes'],nan],[['orange'],['hat'],['party','hat','2'],nan],[['blue'],['navy'],['red','t'],['angry']]]
I'm using gensim to match my words against a pretrained word2vec model and get the corresponding vectors.
Given that pretrained_model.key_to_index is structured as shown below, I used the code below to keep only the words in 'result' that are present in the pretrained model (named 'pretrained_model')
and to filter out the words that are not in the pretrained model.
>>> pretrained_model.key_to_index
{'</s>': 0,
'in': 1,
'for': 2,
'that': 3,
'is': 4,
'on': 5,
'##': 6,
'The': 7,
'with': 8,
'said': 9,
'was': 10,
'the': 11,
'at': 12,
...}
import gensim
pretrained_model = gensim.models.KeyedVectors.load_word2vec_format('Downloads/GoogleNews-vectors-negative300.bin', binary=True)
vocabulary = pretrained_model.key_to_index
len(vocabulary)
3000000
documents = []
for x in result:
    document = [i for i in j for j in x if i in pretrained_model.key_to_index]
    documents.append(document)
After this, documents should contain only the words that are present in the pretrained model's vocabulary.
So the desired output documents might look like this:
[[['apple'],['banana'],['green','grapes']],[['orange'],['hat'],['party','hat']],[['blue'],['navy'],['red','t'],['angry']]]
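For reference, here is a minimal sketch of a filter that yields this shape. It is only an illustration under stated assumptions: the small `vocabulary` dict is a hypothetical stand-in for `pretrained_model.key_to_index` (so that '2' is treated as out of vocabulary), `nan` is assumed to be a float such as `float('nan')`, and `filter_documents` is a made-up helper name, not part of gensim.

```python
# Hypothetical stand-in for pretrained_model.key_to_index (word -> index).
vocabulary = {
    'apple': 0, 'banana': 1, 'green': 2, 'grapes': 3,
    'orange': 4, 'hat': 5, 'party': 6,
    'blue': 7, 'navy': 8, 'red': 9, 't': 10, 'angry': 11,
}

nan = float('nan')  # assumption: the nan entries in 'result' are floats

result = [
    [['apple'], ['banana'], ['green', 'grapes'], nan],
    [['orange'], ['hat'], ['party', 'hat', '2'], nan],
    [['blue'], ['navy'], ['red', 't'], ['angry']],
]


def filter_documents(result, vocabulary):
    """Keep the nesting, drop non-list entries (nan) and unknown words."""
    documents = []
    for x in result:
        document = []
        for j in x:
            if not isinstance(j, list):
                continue  # skip nan and anything else that is not a word list
            # Membership test against the dict keys, as with key_to_index.
            document.append([i for i in j if i in vocabulary])
        documents.append(document)
    return documents


documents = filter_documents(result, vocabulary)
```

In this sketch a sublist whose words are all unknown would be kept as an empty list; whether to drop such sublists is a design choice the desired output above does not settle.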
However, the code above raises a NameError, as shown below:
NameError                                 Traceback (most recent call last)
/var/folders/jd/lh_mnln92n17ysb4p01g000gn/T/ipykernel_2855/2806541.py in <module>
      1 documents = []
      2 for x in result:
----> 3     document = [i for i in j for j in x if i in pretrained_model.key_to_index]
      4     documents.append(document)
      5 #now this document have only those words which are present in our model's vocab

NameError: name 'j' is not defined
Can anyone help me with this, please? Any help would be greatly appreciated!
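A note on the likely cause, as a small sketch rather than a definitive diagnosis: in a Python list comprehension the `for` clauses are read left to right in the same order as nested for loops, so the clause that binds the inner list has to come before the clause that iterates over it. The names `nested`, `known`, `sub`, `word` and `reversed_order` below are illustrative only and do not come from the code above.

```python
nested = [['green', 'grapes'], ['party', 'hat', '2']]
known = {'green', 'grapes', 'party', 'hat'}  # illustrative vocabulary


def reversed_order():
    # Same clause order as in the question: the first iterable (`sub`)
    # is looked up before the second `for` clause has bound it.
    return [word for word in sub for sub in nested if word in known]


try:
    reversed_order()
    raised_name_error = False
except NameError:
    raised_name_error = True

# Outer loop first, inner loop second, filter last. This reads like:
#   for sub in nested:
#       for word in sub:
#           if word in known: ...
flat = [word for sub in nested for word in sub if word in known]
```

Note that the corrected clause order produces one flat list of words. Keeping the nested structure and skipping the nan entries, which are not iterable, would need an extra check on each element.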
source https://stackoverflow.com/questions/73805534/for-loop-for-finding-and-storing-words-that-is-present-in-supplied-word-dataset