
I have set up Dask and JupyterHub on a Kubernetes cluster using Helm, following the Dask documentation: http://docs.dask.org/en/latest/setup/kubernetes.html.

Everything deployed fine and I can access JupyterLab. I then created a notebook and downloaded a CSV file from a Google Cloud Storage bucket:

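from google.cloud import storage

storage_client = storage.Client.from_service_account_json(CREDENTIALS)
bucket = storage_client.get_bucket(BUCKET)
# download_blob is my own helper; it saves the blob into the local data/ directory
download_blob(bucket, file="test-file", destination_dir="data/")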

I read in the CSV file:

import dask.dataframe as dd
df = dd.read_csv("/home/jovyan/data/*.csv")

I initialize a Dask Client so that I can monitor the computation:

from dask.distributed import Client, config
client = Client()

So far so good, until I try to interact with the DataFrame. For example, when I call df.head() I get the error:

[Errno 2] No such file or directory: '/home/jovyan/data/test-file.csv'

Why can't the other workers find the DataFrame? I thought the DataFrame was shared among the memory of all the workers.

Note: At first I was using df.head() without a Dask Client and that worked, but I didn't see any diagnostics, so I added client = Client().

1 Answer


You have downloaded the file to the node on which your client is running, but the workers, which run on other nodes in Kubernetes, do not have access to that file system and therefore cannot load the file.
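
One way to confirm this from the notebook is to ask every worker whether the path exists. The following is a minimal sketch, assuming `client` is the `Client` created in the question. `file_visibility` and `missing_on` are hypothetical helper names; `Client.run` is the distributed call that runs a function on every worker and returns a dict keyed by worker address.

```python
import os


def file_visibility(client, path):
    """Ask each worker whether `path` exists on its own file system.

    `client` is assumed to be a dask.distributed.Client connected to the
    cluster. Client.run executes the function on every worker and returns
    a dict mapping worker address -> result.
    """
    return client.run(os.path.exists, path)


def missing_on(client, path):
    """Return the sorted addresses of workers that cannot see `path`."""
    visibility = file_visibility(client, path)
    return sorted(addr for addr, ok in visibility.items() if not ok)


# Usage in the notebook (not executed here, it needs a running cluster):
#   missing_on(client, "/home/jovyan/data/test-file.csv")
```

If the list that comes back is non-empty, those workers have no copy of the file, which would match the error in the question.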

The simplest solution here is to use Dask's native ability to talk to GCS. You do not need a local copy of your data at all. Install gcsfs, and then try:

df = dd.read_csv("gcs://<BUCKET>/test-file.csv", storage_options={'token': CREDENTIALS})

(or you may wish to distribute credentials to your workers by other more secure means).
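
As a rough sketch of one such alternative, assuming the workers run on Google Cloud with a service account attached: gcsfs accepts `token="cloud"`, which takes credentials from the Google metadata service so that no key file has to be shipped to the workers. `gcs_read_args` is a hypothetical helper, and whether this option suits your cluster is an assumption to check.

```python
def gcs_read_args(bucket, pattern, token="cloud"):
    """Build the URL and storage_options to pass to dd.read_csv.

    token="cloud" asks gcsfs to use the Google metadata service. Passing
    the path of a service-account JSON file instead matches the example
    in the answer. Which one is appropriate depends on the cluster.
    """
    url = f"gcs://{bucket}/{pattern}"
    return url, {"token": token}


# Usage (needs dask and gcsfs installed on the client and on all workers):
#   import dask.dataframe as dd
#   url, opts = gcs_read_args("<BUCKET>", "test-file.csv")
#   df = dd.read_csv(url, storage_options=opts)
```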

If you did want a local copy of your data (some loaders cannot take advantage of remote file systems, for instance), then you would need a file system shared between the client and the workers of your Dask cluster, which takes some extra Kubernetes configuration to achieve.

Further information: http://docs.dask.org/en/latest/remote-data-services.html
