0 votes
1 answer
38 views

I have a base table A and a result table B in DolphinDB. Table B was initially empty and is used to store calculated results based on table A. When trying to insert the calculated results into table B,...
RORO's user avatar
  • 1
3 votes
1 answer
132 views

I’m working with Apache Ignite 2.17.0. I load database tables into Ignite caches and run SQL queries using the SQLFieldsQuery API. Recently, I modified the cache configuration for some tables to use ...
kushal Baldev's user avatar
0 votes
0 answers
62 views

I have the following code to test. I created a table on worker 1. Then I tried to read the table on worker 2 and it got TABLE_OR_VIEW_NOT_FOUND. Worker 2 is on the same computer as Master. I ran the ...
Rick C. Ferreira's user avatar
0 votes
0 answers
50 views

I’m optimizing a PySpark pipeline that processes records with a heavily skewed categorical column (category). The data has: A few high-frequency categories (e.g., 90% of records fall into 2-3 ...
Bilal Jamil's user avatar
1 vote
1 answer
114 views

I have a Spark DataFrame created from a Delta table, with one column of type STRUCT(JSON). For each row in this DataFrame, I need to make a REST API call using the JSON payload in the column. ...
uds0128's user avatar
  • 53
0 votes
0 answers
327 views

I am trying to run a multi-node training job using PyTorch's DistributedDataParallel (DDP) following this guide. However, when I launch the job with torchrun, I encounter the following NCCL error on ...
yunjeong's user avatar
1 vote
0 answers
91 views

I am training a model using TensorFlow 2.18.0 with the tf.distribute.MirroredStrategy across two GPUs. The training works fine on a single GPU, but when I try to run it on two GPUs, it ends with a ...
TGD's user avatar
  • 56
0 votes
0 answers
77 views

I am looking to fine-tune a pre-trained DeBERTa model on Vertex AI with PyTorch. I'm attempting to run a distributed job, making use of the Vertex AI reduction server. I'm following this notebook: ...
purpleFudge's user avatar
1 vote
0 answers
35 views

I have a custom ConstantLengthDataset class: class ConstantLengthDataset(IterableDataset): def __init__( self, tokenizer, dataset, infinite=False, ...
имя's user avatar
  • 11
0 votes
0 answers
78 views

I'm working with multiple GPUs handling large amounts of data. I want to create an out-of-memory (OOM) catch system that skips the current batch on all GPUs if any are out of memory. However, for ...
Zyzyx's user avatar
  • 534
0 votes
1 answer
477 views

It seems I'm unable to write using the Delta format from my Spark job, but I'm not sure what I'm missing. I'm using Spark 3.5.3 and deltalake 3.2.0. My error: Exception in thread "main" org....
William's user avatar
  • 141
0 votes
0 answers
374 views

Can someone help me with the following error? The code works fine on 2 T4 GPUs but fails when run on 4 L4 GPUs. I am extending the Gemma 2B model for a multi-label multi-class classification ...
Rakesh Jarupula's user avatar
1 vote
0 answers
20 views

I’m using GridDB for a distributed database setup and recently encountered the following error while performing operations across nodes in the cluster: from griddb_python import StoreFactory, ...
Samar Mohamed's user avatar
2 votes
2 answers
222 views

Imagine a 3-node Raft cluster. Each node is in sync and has log [1,2,3], and entry 3 is committed by the leader. Now the leader receives an entry 4 but fails to commit it because of an unreliable network and ...
Dumb_Pegasus's user avatar
0 votes
1 answer
133 views

I'm trying Apache Ignite and must say the Ignite documentation is incomplete. Anyway, I've set up a two-node cluster using Docker images 2.14.0-arm14 and exposed all ports for both Ignite containers, however ...
JUser's user avatar
  • 196
2 votes
0 answers
416 views

I’m working on a project that involves creating a vector search index for a massive dataset consisting of 1.3 trillion tokens. I want to use AutoFAISS in a distributed environment to handle the scale ...
Cauder's user avatar
  • 2,759
0 votes
1 answer
103 views

I have a workflow with multiple DAGs. Every DAG has multiple tasks. These tasks are simple ETL tasks involving geo data in the form of KML and CSV files. An example task: We have metadata of road ...
ShariqHameed's user avatar
0 votes
0 answers
35 views

Is there an existing algorithm or method to conduct lottery-like draws that ensures secure and truly random results without the need for auditing? Is there any library that does this? I searched the web ...
aguiadouro's user avatar
1 vote
2 answers
925 views

I'm trying to RELIABLY implement that pattern. For practical purposes, assume we have something similar to a Twitter clone (in Cassandra and Node.js). So, user A has 500k followers. When user A posts a ...
InglouriousBastard's user avatar
2 votes
0 answers
25 views

I'm using GridDB for managing a distributed database system and recently encountered the following error while trying to perform operations: 80000 LM_WRITE_LOG_FAILED ERROR Writing to log file failed. ...
omar esawy's user avatar
2 votes
0 answers
42 views

I'm working on a distributed system where I need to synchronize data across a cluster of nodes. However, I'm encountering an error during the synchronization process. The error message I get is: 20037 ...
omar esawy's user avatar
1 vote
0 answers
98 views

Suppose I have a Ray actor that can create a Ray object that associates with some non-serializable states. In the following example, the non-serializable state is a temporary directory. class MyObject:...
Yang Bo's user avatar
  • 3,773
2 votes
0 answers
99 views

I am trying to understand how XGBoost distributed training works. The best explanation I've found so far is in this paper: https://ml-pai-learn.oss-cn-beijing.aliyuncs.com/%E6%9C%BA%E5%99%A8%E5%AD%A6%...
Altamash Rafiq's user avatar
0 votes
1 answer
162 views

I have a complex product that runs like this: a parent Java process exposes an HTTP service. The parent process starts worker subprocesses (new JVMs) and manages their lifecycle. Worker ...
Joey Liu's user avatar
  • 510
1 vote
1 answer
2k views

I'm trying to do PyTorch Lightning Fabric distributed FSDP training with Hugging Face PEFT LoRA fine-tuning on Llama 2, but my code ends up failing with: `FlatParameter` requires uniform dtype but got ...
JobHunter69's user avatar
  • 2,376
