Skip to main content

Dask LocalCluster Fails to compute random.random above 300Mio data points

I wanted to create some random data for later benchmarking. The chunks need to be configured this way as I want to calculate the rfft later.

However, the sampling of the random data fails as soon as I am around (and above) 300 million data points. The code works fine in local mode. The code works fine when I store the samples directly into a zarr array. The size at which the code breaks is consistent across multiple shapes and chunk sizes. It also does not depend on initialising the cluster with different values.

Following is an minimal example producing the error, please be advised, that the code is working with an array of size=(60, 4_000_000). However, using the slightly bigger array, leads to error.

cluster = dd.LocalCluster(n_workers=1, threads_per_worker=10, memory_limit='30GB')
client = dd.Client(cluster)
# print(client)

RNG_da = da.random.RandomState(42)
_ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()

client.close()
cluster.close()

The same error occurs using LocalCluster() without parameters:

cluster = dd.LocalCluster() 
client = dd.Client(cluster)

RNG_da = da.random.RandomState(1212)
_ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
print(_.shape)
client.close()
cluster.close()

However, not specifying or only using the Client works. So all of the versions below work:

RNG_da = da.random.RandomState(1212)
_ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
print(_.shape)
client = dd.Client(processes=False)
RNG_da = da.random.RandomState(1212)
_ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
print(_.shape)
with dask.config.set(scheduler='processes'):
    RNG_da = da.random.RandomState(1212)
    _ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
    print(_.shape)
with dask.config.set(scheduler='threads'):
    RNG_da = da.random.RandomState(1212)
    _ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
    print(_.shape)
with dd.LocalCluster(n_workers=1, threads_per_worker=10, memory_limit='15GiB') as cluster, dd.Client(cluster) as client:
        RNG_da = da.random.RandomState(1212)
        _ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).persist()
        print(_.shape)

Can it have something to do, with calling the sampling in a multi-processes environment, since using client = Client(process=True) results in this [...]return self.socket.recv_into(buf, len(buf)) OSError: [Errno 22] Invalid argument error.


Here is the error trace, however, I interrupted the program, since it usually runs super long...:

<Client: 'tcp://127.0.0.1:53084' processes=1 threads=10, memory=27.94 GiB>
2023-02-11 18:54:44,007 - distributed.scheduler - ERROR - Couldn't gather keys {"('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 0)": [ā€˜tcp://127.0.0.1:53089'],

[...]

"('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 1)": ['tcp://127.0.0.1:53089'], "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 2, 1)": ['tcp://127.0.0.1:53089']} state: ['memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory'] workers: ['tcp://127.0.0.1:53089']
NoneType: None
2023-02-11 18:54:44,007 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:53089 -> None
Traceback (most recent call last):
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/tornado/iostream.py", line 973, in _handle_write
    num_bytes = self.write_to_fd(self._write_buffer.peek(size))
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/tornado/iostream.py", line 1146, in write_to_fd
    return self.socket.send(data)  # type: ignore
ConnectionResetError: [Errno 54] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/distributed/worker.py", line 1768, in get_data
    response = await comm.read(deserializers=serializers)
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/distributed/comm/tcp.py", line 241, in read
    convert_stream_closed_error(self, e)
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.\_\_class\_\_.\_\_name\_\_}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:53089 remote=tcp://127.0.0.1:53096>: ConnectionResetError: [Errno 54] Connection reset by peer
2023-02-11 18:54:44,009 - distributed.scheduler - ERROR - Shut down workers that don't have promised key: ['tcp://127.0.0.1:53089'], ('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 0)
NoneType: None
2023-02-11 18:54:44,009 - distributed.scheduler - ERROR - Shut down workers that don't have promised key: ['tcp://127.0.0.1:53089'], ('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 6, 2)

[...]

NoneType: None
2023-02-11 18:54:44,011 - distributed.scheduler - ERROR - Shut down workers that don't have promised key: ['tcp://127.0.0.1:53089'], ('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 2, 1)
NoneType: None
2023-02-11 18:54:44,013 - distributed.client - WARNING - Couldn't gather 21 keys, rescheduling {"('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 0)": ('tcp://127.0.0.1:53089',), "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 6, 2)": ('tcp://127.0.0.1:53089',), "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 5, 0)": (ā€˜tcp://127.0.0.1:53089',),

[...]
"('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 6, 1)": ('tcp://127.0.0.1:53089',), "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 1)": ('tcp://127.0.0.1:53089',), "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 2, 1)": ('tcp://127.0.0.1:53089',)}

^C


source https://stackoverflow.com/questions/75422337/dask-localcluster-fails-to-compute-random-random-above-300mio-data-points

Comments

Popular posts from this blog

ValueError: X has 10 features, but LinearRegression is expecting 1 features as input

So, I am trying to predict the model but its throwing error like it has 10 features but it expacts only 1. So I am confused can anyone help me with it? more importantly its not working for me when my friend runs it. It works perfectly fine dose anyone know the reason about it? cv = KFold(n_splits = 10) all_loss = [] for i in range(9): # 1st for loop over polynomial orders poly_order = i X_train = make_polynomial(x, poly_order) loss_at_order = [] # initiate a set to collect loss for CV for train_index, test_index in cv.split(X_train): print('TRAIN:', train_index, 'TEST:', test_index) X_train_cv, X_test_cv = X_train[train_index], X_test[test_index] t_train_cv, t_test_cv = t[train_index], t[test_index] reg.fit(X_train_cv, t_train_cv) loss_at_order.append(np.mean((t_test_cv - reg.predict(X_test_cv))**2)) # collect loss at fold all_loss.append(np.mean(loss_at_order)) # collect loss at order plt.plot(np.log(al...

Sorting large arrays of big numeric stings

I was solving bigSorting() problem from hackerrank: Consider an array of numeric strings where each string is a positive number with anywhere from to digits. Sort the array's elements in non-decreasing, or ascending order of their integer values and return the sorted array. I know it works as follows: def bigSorting(unsorted): return sorted(unsorted, key=int) But I didnt guess this approach earlier. Initially I tried below: def bigSorting(unsorted): int_unsorted = [int(i) for i in unsorted] int_sorted = sorted(int_unsorted) return [str(i) for i in int_sorted] However, for some of the test cases, it was showing time limit exceeded. Why is it so? PS: I dont know exactly what those test cases were as hacker rank does not reveal all test cases. source https://stackoverflow.com/questions/73007397/sorting-large-arrays-of-big-numeric-stings

How to load Javascript with imported modules?

I am trying to import modules from tensorflowjs, and below is my code. test.html <!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>Document</title </head> <body> <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@2.0.0/dist/tf.min.js"></script> <script type="module" src="./test.js"></script> </body> </html> test.js import * as tf from "./node_modules/@tensorflow/tfjs"; import {loadGraphModel} from "./node_modules/@tensorflow/tfjs-converter"; const MODEL_URL = './model.json'; const model = await loadGraphModel(MODEL_URL); const cat = document.getElementById('cat'); model.execute(tf.browser.fromPixels(cat)); Besides, I run the server using python -m http.server in my command prompt(Windows 10), and this is the error prompt in the console log of my browser: Failed to loa...