Dask LocalCluster Fails to compute random.random above 300Mio data points

I wanted to create some random data for later benchmarking. The chunks need to be configured this way as I want to calculate the rfft later.

However, the sampling of the random data fails as soon as I am around (and above) 300 million data points. The code works fine in local mode. The code works fine when I store the samples directly into a zarr array. The size at which the code breaks is consistent across multiple shapes and chunk sizes. It also does not depend on initialising the cluster with different values.

The following is a minimal example that produces the error. Note that the code works with an array of size=(60, 4_000_000); the slightly bigger array used below leads to the error.

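# Import aliases inferred from how dask, da and dd are used in the snippets
import dask
import dask.array as da
import dask.distributed as dd

cluster = dd.LocalCluster(n_workers=1, threads_per_worker=10, memory_limit='30GB')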
client = dd.Client(cluster)
# print(client)

RNG_da = da.random.RandomState(42)
_ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()

client.close()
cluster.close()
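For scale, here is a quick size check of the two shapes mentioned above. It is only a sketch, assuming the samples are float64 (8 bytes each), which is what random() produces by default. The 2**31-byte threshold is included purely for comparison; whether it has anything to do with the error is an assumption, not something confirmed here.

```python
from math import prod

ITEMSIZE = 8  # bytes per sample, assuming the default float64 dtype


def array_nbytes(shape):
    """Total size in bytes of a dense float64 array with the given shape."""
    return prod(shape) * ITEMSIZE


working_bytes = array_nbytes((60, 4_000_000))  # shape reported to work
failing_bytes = array_nbytes((60, 5_000_000))  # shape reported to fail
chunk_bytes = array_nbytes((1, 5_000_000))     # one chunk of the failing array
threshold = 2**31                              # 2 GiB, for comparison only

for label, size in [
    ("working array", working_bytes),
    ("failing array", failing_bytes),
    ("single chunk", chunk_bytes),
]:
    print(f"{label}: {size:,} bytes ({size / 2**30:.2f} GiB)")

# Does the failing size cross 2 GiB while the working size stays below it?
print(working_bytes < threshold < failing_bytes)
```

Each chunk on its own is small; only the full result gathered by compute() reaches the larger size.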

The same error occurs using LocalCluster() without parameters:

cluster = dd.LocalCluster() 
client = dd.Client(cluster)

RNG_da = da.random.RandomState(1212)
_ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
print(_.shape)
client.close()
cluster.close()

However, the computation works when no cluster is specified, when only a Client is used, or when the result is persisted instead of computed. All of the versions below work:

# Version 1: no client or cluster
RNG_da = da.random.RandomState(1212)
_ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
print(_.shape)

# Version 2: Client with processes=False
client = dd.Client(processes=False)
RNG_da = da.random.RandomState(1212)
_ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
print(_.shape)

# Version 3: scheduler='processes'
with dask.config.set(scheduler='processes'):
    RNG_da = da.random.RandomState(1212)
    _ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
    print(_.shape)

# Version 4: scheduler='threads'
with dask.config.set(scheduler='threads'):
    RNG_da = da.random.RandomState(1212)
    _ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).compute()
    print(_.shape)

# Version 5: LocalCluster, but with persist() instead of compute()
with dd.LocalCluster(n_workers=1, threads_per_worker=10, memory_limit='15GiB') as cluster, dd.Client(cluster) as client:
    RNG_da = da.random.RandomState(1212)
    _ = RNG_da.random((60, 5_000_000), chunks=(1, 5_000_000)).persist()
    print(_.shape)

Could it have something to do with calling the sampling in a multi-process environment? Using client = Client(processes=True) results in this error: [...]return self.socket.recv_into(buf, len(buf)) OSError: [Errno 22] Invalid argument


Here is the error trace. I interrupted the program, since it usually runs for a very long time:

<Client: 'tcp://127.0.0.1:53084' processes=1 threads=10, memory=27.94 GiB>
2023-02-11 18:54:44,007 - distributed.scheduler - ERROR - Couldn't gather keys {"('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 0)": ['tcp://127.0.0.1:53089'],

[...]

"('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 1)": ['tcp://127.0.0.1:53089'], "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 2, 1)": ['tcp://127.0.0.1:53089']} state: ['memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory', 'memory'] workers: ['tcp://127.0.0.1:53089']
NoneType: None
2023-02-11 18:54:44,007 - distributed.worker - ERROR - failed during get data with tcp://127.0.0.1:53089 -> None
Traceback (most recent call last):
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/tornado/iostream.py", line 973, in _handle_write
    num_bytes = self.write_to_fd(self._write_buffer.peek(size))
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/tornado/iostream.py", line 1146, in write_to_fd
    return self.socket.send(data)  # type: ignore
ConnectionResetError: [Errno 54] Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/distributed/worker.py", line 1768, in get_data
    response = await comm.read(deserializers=serializers)
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/distributed/comm/tcp.py", line 241, in read
    convert_stream_closed_error(self, e)
  File "/Users/me/opt/anaconda3/envs/zarr_benchmarking/lib/python3.10/site-packages/distributed/comm/tcp.py", line 142, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc.__class__.__name__}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed)  local=tcp://127.0.0.1:53089 remote=tcp://127.0.0.1:53096>: ConnectionResetError: [Errno 54] Connection reset by peer
2023-02-11 18:54:44,009 - distributed.scheduler - ERROR - Shut down workers that don't have promised key: ['tcp://127.0.0.1:53089'], ('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 0)
NoneType: None
2023-02-11 18:54:44,009 - distributed.scheduler - ERROR - Shut down workers that don't have promised key: ['tcp://127.0.0.1:53089'], ('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 6, 2)

[...]

NoneType: None
2023-02-11 18:54:44,011 - distributed.scheduler - ERROR - Shut down workers that don't have promised key: ['tcp://127.0.0.1:53089'], ('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 2, 1)
NoneType: None
2023-02-11 18:54:44,013 - distributed.client - WARNING - Couldn't gather 21 keys, rescheduling {"('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 0)": ('tcp://127.0.0.1:53089',), "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 6, 2)": ('tcp://127.0.0.1:53089',), "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 5, 0)": ('tcp://127.0.0.1:53089',),

[...]
"('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 6, 1)": ('tcp://127.0.0.1:53089',), "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 0, 1)": ('tcp://127.0.0.1:53089',), "('random_sample-aaf2531c59d5bd1381c467d7a0f0644c', 2, 1)": ('tcp://127.0.0.1:53089',)}

^C


source https://stackoverflow.com/questions/75422337/dask-localcluster-fails-to-compute-random-random-above-300mio-data-points
