I have a program that looks like this:

import numpy as np

bar = {
    1245896421: {1, 2, 3, 4},
    2598732155: {31, 32, 33, 34},
    4876552519: {11, 12, 13, 14},
}
# This emulates a previous step in the process, which yields around a million
# batches of rows. Can't do it all at once due to memory constraints.
batch_generator = (
    np.random.randint(0, 999999999999, size=(100, 5), dtype=np.int64)
    for i in range(1000000)
)
for foo in batch_generator:
    ## This is an example of what foo looks like:
    #foo = np.array([
    #    [ 1,  2,  3, 1245896421,  4],
    #    [ 5,  6,  7, 2598732155,  8],
    #    [ 9, 10, 11, 4876552519, 12],
    #    [13, 14, 15, 4876552519, 16],
    #    [17, 18, 19, 1245896421, 20],
    #])
    # Keep rows whose second column is in the set stored in bar under the
    # key from the fourth column; rows whose key is absent from bar are dropped.
    baz = np.array([
        row
        for row in foo
        if row[1] in bar.get(row[3], ())
    ])
What I'm trying to do is find all rows of foo where the value of the second column is in the set stored in bar under the key given by foo's fourth column. In this example, it should return rows 0 (2 is in bar[1245896421]) and 3 (14 is in bar[4876552519]).
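As a quick self-check (not part of the original post), running the comprehension on the example foo from the comment above, together with the bar defined earlier, picks out exactly those two rows:

foo = np.array([
    [ 1,  2,  3, 1245896421,  4],
    [ 5,  6,  7, 2598732155,  8],
    [ 9, 10, 11, 4876552519, 12],
    [13, 14, 15, 4876552519, 16],
    [17, 18, 19, 1245896421, 20],
])
baz = np.array([row for row in foo if row[1] in bar.get(row[3], ())])
# baz contains rows 0 and 3:
#   [[ 1,  2,  3, 1245896421,  4],
#    [13, 14, 15, 4876552519, 16]]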
The code in this example works, but it is far too slow for the scale of data I'm working with. My program doesn't run this process on one very large foo; it repeats the process millions of times with many small-to-medium sized foos, which I've emulated with a random generator in this example. I'm wondering if there's a way to achieve this efficiently with NumPy.
I'm able to change the data structures for both foo and bar if necessary.
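One approach that's commonly used for this kind of per-row set lookup (a sketch, not from the original post): pack each (key, value) pair from bar into a single int64 composite code, sort the codes once up front, and match each batch with a vectorized binary search. The names M, codes, query, and in_range below are mine, and M is an assumption: every value stored in the sets must be a nonnegative integer below M, and key * M + value must stay within the int64 range.

import numpy as np

# One-time preprocessing: flatten bar into a sorted array of composite codes.
# Assumes set values lie in [0, M) and key * M + value fits in int64
# (here keys < 1e12 and M = 2**20, so codes stay near 1e18 < 2**63).
M = np.int64(1 << 20)
codes = np.sort(np.array(
    [k * M + v for k, vs in bar.items() for v in vs],
    dtype=np.int64,
))

for foo in batch_generator:
    # The composite trick is only valid for rows whose second column is in
    # range; out-of-range rows can never match and are masked out.
    in_range = (foo[:, 1] >= 0) & (foo[:, 1] < M)
    # Build the same composite code for every row of the batch...
    query = foo[:, 3] * M + foo[:, 1]
    # ...and look it up in the sorted codes via binary search.
    idx = np.minimum(np.searchsorted(codes, query), len(codes) - 1)
    baz = foo[in_range & (codes[idx] == query)]

Pre-sorting codes once means each batch costs only a vectorized multiply-add plus one binary search, with no Python-level loop over rows. np.isin(query, codes) would also work and is simpler to read, but it redoes sort-based work on every call, which adds up when the loop runs millions of times.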
source https://stackoverflow.com/questions/70908325/numpy-indexing-all-rows-where-column-value-is-in-a-list-that-is-different-for-e