I am on a SLURM cluster and want to run the following code with multiprocessing. The tasks are fully parallelizable, but they still seem to run serially.
The code is:
import multiprocessing

# Load data (this is a df of files that need to be processed)
left = loadData()

processes = []
# Split the list of files into 22 groups based on the chrom column
for i in range(1, 23):
    left_chrom = left[left['chrom'] == i]
    # Pass each df of files to multiprocessing
    # (note: this function calls a subprocess to process the file)
    p_ins = multiprocessing.Process(target=ViewVCFConvert, args=(left_chrom,))
    processes.append(p_ins)
    p_ins.start()

for process in processes:
    process.join()
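For reference, here is a minimal self-contained sketch of the same pattern (loadData and ViewVCFConvert are stand-ins here, since their real implementations are not shown). With true parallelism, all groups should report their start almost simultaneously:

import multiprocessing
import time
import pandas as pd

def loadData():
    # Stand-in loader: a small DataFrame of files tagged by chromosome
    return pd.DataFrame({'file': [f'file{n}' for n in range(1, 7)],
                         'chrom': [1, 1, 2, 2, 3, 3]})

def ViewVCFConvert(df):
    # Stand-in worker: sleep in place of the real subprocess call
    for _, row in df.iterrows():
        print(f"start {row['file']}, chrom={row['chrom']}", flush=True)
        time.sleep(1)

if __name__ == '__main__':
    left = loadData()
    processes = []
    for i in range(1, 4):  # 1..22 in the real job
        left_chrom = left[left['chrom'] == i]
        p = multiprocessing.Process(target=ViewVCFConvert, args=(left_chrom,))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()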
My SLURM settings are:
#!/bin/bash
#SBATCH --job-name=VCF
#SBATCH --partition=abc
#SBATCH --nodes=1
#SBATCH --cpus-per-task=22
#SBATCH --mem=1G
#SBATCH --time=10:00:00
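One thing worth verifying inside the job is how many CPUs the Python process can actually use; if the script is launched in a way that pins it to a single core, every child process inherits that restriction. A quick check (Linux-only, as a sketch):

import os
# CPUs this process is actually allowed to run on (Linux)
print("usable CPUs:", len(os.sched_getaffinity(0)))
# What SLURM says it allocated to the task
print("SLURM_CPUS_PER_TASK:", os.environ.get('SLURM_CPUS_PER_TASK'))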
However, when I run this, the files are processed serially. I checked this by adding a print statement that shows when each file is processed. I would expect the output of those print statements to look something like:
file1, chrom=2
file4, chrom=5
file3, chrom=8
Instead the output I get is:
file1, chrom=4
file2, chrom=4
file3, chrom=4
This implies the files are being processed in order (although multiprocessing is clearly doing something, since the output does not always start with chrom=1 the way a plain for loop would).
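To make genuine overlap (or its absence) unambiguous, the diagnostic print could include the worker's PID and a timestamp. A hypothetical log_progress helper, to be called from inside ViewVCFConvert:

import os
import time

def log_progress(filename, chrom):
    # Hypothetical helper: tag each message with PID and wall-clock time,
    # so interleaved output from concurrent workers is easy to distinguish
    print(f"{time.strftime('%H:%M:%S')} pid={os.getpid()} {filename}, chrom={chrom}",
          flush=True)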
Edit: note that the for process in processes: join loop is outside the main for loop.