1,150 questions
2
votes
0
answers
50
views
How to optimize NetCDF files and dask for processing long-term climataological indices with xclim (ex. SPI using 30-day rolling window)?
I am trying to analyze the 30 day standardized precipitation index for a multi-state range of the southeastern US for the year 2016. I'm using xclim to process a direct pull of gridded daily ...
0
votes
0
answers
29
views
Is it possible to use dask distributed to pandas with apply working with multiprocessing?
I need advice from you.
Right now i do some computation with pandas library.
Program is using multiprocessing and df.apply.
The simple example showing my idea is here:
import multiprocessing
import ...
0
votes
0
answers
26
views
Use Python streamz to parallel process many realtime updating files?
Using Python streamz and dask, I want to distribute the data of textfiles that are generated to threads. Which then will process every newline generated inside those files.
from streamz import Stream
...
0
votes
0
answers
38
views
Google Cloud Platform Dask RefreshError
import os
from dask_cloudprovider.gcp import GCPCluster
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]=r'C:\Users\Me\Documents\credentials\compute_engine_default_key\test-project123-...
0
votes
1
answer
72
views
Dask adaptive deployment in azure kubernetes
I am trying to deploy a dask cluster with 0 workers and 1 scheduler, based on the work load need to scale up the worker to required, i found that the adaptive deployment is the correct way, i am using ...
0
votes
1
answer
237
views
How to Set Dask Dashboard Address with SLURMRunner (Jobqueue) and Access It via SSH Port Forwarding?
I am trying to run a Dask Scheduler and Workers on a remote cluster using SLURMRunner from dask-jobqueue. I want to bind the Dask dashboard to 0.0.0.0 (so it’s accessible via port forwarding) and ...
0
votes
0
answers
120
views
dask_cuda problem with Local CUDA Cluster
I am trying to get this code to work and then use it to train various models on two gpu's:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
if __name__ == "__main__&...
1
vote
1
answer
55
views
How to nest dask.delayed functions within other dask.delayed functions
I am trying to learn dask, and have created the following toy example of a delayed pipeline.
+-----+ +-----+ +-----+
| baz +--+ bar +--+ foo |
+-----+ +-----+ +-----+
So baz has a dependency on ...
0
votes
1
answer
43
views
Use dask with numactl
I am using dask to parallelize an operation that is memory-bound. So, I want to ensure each dask worker has access to a single NUMA node and prevent cross-node memory access. I can do this in the ...
0
votes
0
answers
23
views
Dask worker running long calls
The code running on the dask worker calls asyncio.run() and proceeds to exectue a series of async calls (on the worker running event_loop) that gather data, and then run a small computation.
This ...
0
votes
1
answer
90
views
Understanding if current process is part of a multiprocessing pool with --multiprocessing-fork
I need to find a way for a python process to figure out if it was launched as part of a multiprocessing pool.
I am using dask to parallelize calculations, using dask.distributed.LocalCluster. For UX ...
0
votes
0
answers
108
views
How to fix memory errors merging large dask dataframes?
I am trying to read 23 CSV files into dask dataframes, merge them together using dask, and ouptut to parquet. However, it's failing due to memory issues.
I used to use pandas to join these together ...
0
votes
0
answers
56
views
Null values in log file from Python logging module when used with Dask
I have been trying to setup logging using the logging module in a Python script, and I have got it working properly. It can now log to both the console and a log file. But if fails when I setup a Dask ...
1
vote
1
answer
87
views
How do I modularize functions that work with dask?
I'm trying to modularize my functions that use Dask, but I keep encountering the error "No module named 'setup'". I can't import any local module that is related to Dask, and currently, ...
0
votes
1
answer
80
views
Dask - High CPU consumption unloading parallel workers
I’m using dask to make parallel processing of a simulation. It consists of a series of differential equations that are numerically solved using numpy arrays that are compiled using numba @jit ...
0
votes
0
answers
172
views
How to speed up interpolation in dask
I have a piece of data code that performs interpolation on a large number of arrays.
This is extremely quick with numpy, but:
The data the code will work with in reality will often not fit in memory
...
0
votes
1
answer
296
views
How to use user-defined fsspec filesystem with dask?
I made my own filesystem in the fsspec library and I am trying to read in dask dataframes from this filesystem object to open the dataframe file. However I am getting an error when I try to do this. ...
-1
votes
1
answer
192
views
Understanding task stream and speeding up Distributed Dask
I have implemented some data analysis in Dask using dask-distributed, but the performance is very far from the same analysis implemented in numpy/pandas and I am finding it difficult to understand the ...
0
votes
0
answers
139
views
Dask Can't synchronously read data shared on NFS
Running Dask Scheduler on system A and workers on system A and B. NFS volume from system A is shared on the network through NFS with system B, and contains the data files. This folder has a symbolic ...
1
vote
0
answers
43
views
Dask worker nodes and multiprocessing raising TypeError during object initialization
I have a program that I wrote. I define a class in this program that is a subclass of a class I import. If I run this code without Dask, I successfully run it. When I plug in Dask, I get an error ...
1
vote
0
answers
92
views
asyncio.exceptions.CancelledError when using Dask LocalCluster with processes=False and progress
This is an example:
import numpy as np
import zarr
from dask.distributed import Client, LocalCluster
from dask import array as da
from dask.distributed import progress
def same(x):
return x
x = ...
1
vote
0
answers
186
views
Using dask.distributed with rioxarray rio.to_raster results in `ValueError: Lock is not yet acquired`
I am trying to write some code using dask.distributed.Client and rioxarray to_raster that:
Concatenates two rasters (dask arrays)
Applies a function across all blocks in the concatenated array
Writes ...
0
votes
1
answer
70
views
How Dask manages file descriptors
How does Dask manage file descriptors?
For example when creating a dask.array from an hdf5 file. When the array is large enough to be chunked.
Do the created tasks inherit the file descriptor created ...
1
vote
1
answer
68
views
Why dask shows smaller size than the actual size of the data (numpy array)?
Dask shows slightly smaller size than the actual size of a numpy array. Here is an example of a numpy array that is exactly 32 Mb:
import dask as da
import dask.array
import numpy as np
shape = (1000,...
1
vote
0
answers
187
views
How to Read the Result of Query into a Dask Dataframe in a Distributed Client?
Trying to read the results of a query (from an AWS athena database) to a dask dataframe. Following the read_sql_query method of the official documentation.
Here is how I am calling it.
from dask ...