5,501 questions
0 votes, 1 answer, 322 views
Why does the training loop for 1 epoch take far longer than iterating over all batches of the trainloader?
I have the following specification:
AMD 32-core processors
NVIDIA Tesla V100 GPU (16 GB)
To debug/test, I have used only 1 GPU.
My dataset is ImageNet-1k 2012
Batch size: 32
using the following details:
...
0 votes, 1 answer, 532 views
Windows Failover Cluster Generic Script Resource: How to get the IP address?
I have a generic script resource (VBS). This script needs the cluster application IP address.
At the moment the IP is configured in the script, but I would like to get it from the cluster (maybe from the ...
1 vote, 0 answers, 25 views
WildFly 27: Cluster does not work when we have multiple server groups
I am trying to configure WildFly 27 in domain mode in a clustered environment.
I have 2 RHEL servers with WildFly installed on both (no HTTPD or anything else).
I have done the required settings on the primary and secondary.
Created a ...
0 votes, 1 answer, 3k views
How to get VS Code Server working in RHEL9?
I would like to use the remote development capabilities of VS Code in the context of an LSF compute cluster (but I think the same question would be relevant on e.g. a Slurm cluster). In particular, I ...
1 vote, 1 answer, 2k views
Can I do a K-Means cluster analysis based on only one variable (in R)?
I have a dataframe with 2 columns. The first column has the name of a meteorological station and the other column has a corresponding index. Can I do a K-Means cluster analysis in order to group the ...
1 vote, 0 answers, 69 views
Cluster failover generic application
Is there any way to monitor specific variables generated inside a generic application monitored by a failover cluster in Windows Server?
I would like to avoid an internal application failure not ...
0 votes, 0 answers, 91 views
How do I run Octave code on a cluster?
I recently set up 5 Debian-based computers with the purpose of running a series of tests to determine the effect of additional nodes on the amount of processing time. I am wondering how I should set ...
1 vote, 1 answer, 602 views
How to allow %run magic command in Azure Databricks cluster?
I am currently trying to work through a Databricks course and am getting the following error stating that "Your administrator has only allowed sql and python commands on this cluster. This ...
0 votes, 1 answer, 539 views
GKE cluster creation failed in health check
I tried to create a GKE standard cluster. Creation keeps failing after it reaches 83%; it fails in the health check stage. Is there any solution to this problem?
...
1 vote, 0 answers, 401 views
Slurm Cluster Python Script Not Running on Multiple Nodes using SBATCH
We recently set up a Slurm cluster with 2 nodes (1 head node + compute node and 1 compute node) for some HPC CFD simulations. Right now I am trying to run a Python script which is used for feature ...
1 vote, 0 answers, 224 views
Why do I keep getting the error Insufficient 'DISKS_TOTAL_GB' quota when trying to create a cluster in the Google Cloud console?
I keep getting the error Insufficient 'DISKS_TOTAL_GB' quota when I try to create a Compute Engine cluster in the Google Cloud console. It is strange because I have no disks that are active. All the disks ...
1 vote, 1 answer, 193 views
Storing slurmd node computation outputs in a database?
I have a Slurm cluster of 8 separate RHEL9 server nodes to serve as master, compute, and database nodes. The master and compute nodes are up and talking, but I have not activated the database node yet. ...
0 votes, 2 answers, 1k views
PySpark stuck and not processing. It shows more than 1K processes
I have a for loop running in Databricks and the first iterations run fast, then it gets slower and then it doesn't proceed at all. While I know that is common in for loops if data size is increasing ...
1 vote, 1 answer, 940 views
Running an Independent SLURM Job with mpirun Inside a Python Script - Recursive Call Error
I'm currently using a Python script that requires MPI to operate. This script runs on a SLURM system.
In order to run my Python script, I define the number of nodes to use and launch the following ...
2 votes, 1 answer, 95 views
How to unnest an object from `spatialsampling`'s `spatial_clustering_cv`
I would like to perform spatial clustering on a sf object and attach the fold IDs to my original dataframe, in a new column.
Here's what my input looks like (an sf object with points)
# A tibble: 6 × ...
0 votes, 1 answer, 145 views
Long-polling with message queues in a clustered environment
I have a system design question that I'm looking for some guidance on. I have two different systems that need to have a basic level of communication. This is abstracted via message queues.
For ...
2 votes, 2 answers, 1k views
Install and run containers on Slurm HPC
I have a Slurm HPC cluster on Ubuntu and I want to install Docker or rootless Docker. But I can't find anything on the official sites, so how could I install Docker on a Slurm cluster and run containers, or if you ...
1 vote, 1 answer, 91 views
How can I incorporate cluster-robust standard errors into a randomization test using the 'ritest' function?
I'm currently attempting to perform a randomization test following several regressions. However, I'm encountering difficulties in incorporating cluster-robust standard errors into my randomization ...
0 votes, 1 answer, 148 views
Data not available after neo4j-admin import (Causal Cluster / Neo4j Enterprise Edition)
I am using the neo4j-admin import command to import data into a Neo4j causal cluster with Neo4j Enterprise 4.4.7 installed on 9 Ubuntu VMs configured as CORE instances. The cluster is functional and works.
...
0 votes, 1 answer, 464 views
How to manually cluster rows for heatmap
I am trying to make a heatmap composed of several different "clusters" of gene types. For example, 5 genes are related to protein folding, 5 genes are related to ECM composition, etc. I ...
0 votes, 1 answer, 268 views
CDO application issue
I run a climate model on a cluster. The necessary modules have been loaded, including:
module load netcdf/c/4.6.1-intel-2013.1
module load netcdf/fortran/4.4.4-intel-2013.1
module load cdo/1.9.1
The ...
0 votes, 0 answers, 318 views
connection to the server X.X.X.X:6443 was refused - did you specify the right host or port?
I'm trying to configure a cluster with a master and 2 workers on EC2 machines.
I followed the Kubernetes guide and it gave the final result of
NAME STATUS ROLES AGE VERSION
...
1 vote, 1 answer, 539 views
Slurm ignores dependency on running job
Suppose that on a cluster with Slurm the job with ID 12345 is currently running. I want to submit another job that will start after this job finishes. I tried sbatch -d after:12345 job.script, but I ...
0 votes, 0 answers, 429 views
S2D Failover Cluster Permissions
I have just set up a new 3-node cluster with S2D using a storage pool of 6 drives per node, but when I reboot any of the nodes, one or more of the attached drives become detached from the pool and ...
0 votes, 1 answer, 286 views
Problems joining nodes 2 and 3 in a Galera cluster with MySQL on Ubuntu Server 20.04
My project consists of setting up a Galera cluster with MySQL on Ubuntu 20.04. The problem arises when joining the nodes (I have 3) to MySQL: the main node does not give me any problems, but ...