I am on a SLURM cluster and want to run the following multiprocessing code. The tasks are fully parallelizable, but they still seem to run serially.

Code is:

import multiprocessing

# Load data (a DataFrame of files that need to be processed)
left = loadData()

processes = []

# Split the list of files into 22 groups based on the chrom column
for i in range(1, 23):
    left_chrom = left[left['chrom'] == i]
    # Pass each DataFrame of files to multiprocessing
    # (note: ViewVCFConvert calls a subprocess to process each file)
    p_ins = multiprocessing.Process(target=ViewVCFConvert, args=(left_chrom,))
    processes.append(p_ins)
    p_ins.start()
    for process in processes:
        process.join()

My slurm settings are:

#!/bin/bash
#SBATCH --job-name=VCF
#SBATCH --partition=abc
#SBATCH --nodes=1
#SBATCH --cpus-per-task=22
#SBATCH --mem=1G
#SBATCH --time=10:00:00
   

However, when I run this, the files are processed serially. I checked this by adding a print call that fires when a file is processed. I would expect the output of those print statements to be interleaved across chromosomes, like:

file1, chrom=2
file4, chrom=5
file3, chrom=8

Instead the output I get is:

file1, chrom=4
file2, chrom=4
file3, chrom=4

This implies the files are being processed in order (although multiprocessing is doing something as it does not always start with chrom=1 as in a normal for loop).

  • The problem is that you are joining inside the loop, which means each process has to complete before the loop goes back to the top and starts the next one. Move the "for process in processes:" loop, together with its join, out of the outer loop. Commented May 29, 2022 at 23:19
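The fix described in this comment can be sketched as two separate phases: start every process first, then join them all. This is only a minimal sketch under assumptions: group_by_chrom, convert_group and run_all are hypothetical names, convert_group is a stand-in for ViewVCFConvert, and a plain list of (filename, chrom) pairs replaces the DataFrame returned by loadData.

```python
import multiprocessing
from collections import defaultdict


def group_by_chrom(files):
    """Group (filename, chrom) pairs into {chrom: [filenames]}."""
    groups = defaultdict(list)
    for filename, chrom in files:
        groups[chrom].append(filename)
    return dict(groups)


def convert_group(chrom, filenames):
    # Hypothetical stand-in for ViewVCFConvert: just report each file.
    for filename in filenames:
        print(f"{filename}, chrom={chrom}", flush=True)


def run_all(groups):
    processes = []
    # Phase 1: start every process without waiting for any of them.
    for chrom, filenames in groups.items():
        p = multiprocessing.Process(target=convert_group, args=(chrom, filenames))
        p.start()
        processes.append(p)
    # Phase 2: only after all processes are running, wait for each one.
    for p in processes:
        p.join()
    return [p.exitcode for p in processes]


if __name__ == "__main__":
    # Made-up sample data, assumed for illustration only.
    files = [("file1", 2), ("file2", 2), ("file4", 5), ("file3", 8)]
    run_all(group_by_chrom(files))
```

Because join is called only after the start loop has finished, no process blocks the launch of the next one, so the print output from different chromosomes may interleave.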

1 Answer

The solution came from another answer (Python multiprocessing pool inside a loop); the code is below. I needed to use Pool rather than Process, because I wanted to run ViewVCFConvert in parallel across the whole list of chromosomes. If I had several different functions and wanted to run them in parallel on one chrom at a time, I would use Process. With my original code, ViewVCFConvert was only running for one chrom at a time, which is why the work was still serial.

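from multiprocessing import Pool

def main():
    chrom = list(range(1, 23))
    # 22 worker processes, matching --cpus-per-task=22
    pool = Pool(22)
    # ViewVCFConvert (defined elsewhere) is called once per chromosome
    pool.map(ViewVCFConvert, chrom)
    pool.close()
    pool.join()

if __name__ == "__main__":
    main()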

See the documentation (https://docs.python.org/3/library/multiprocessing.html) for the difference between Pool and Process.
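Rather than hard-coding 22 in both the SLURM script and the Pool call, the worker count could be read from the job's environment. The sketch below assumes the job was submitted with --cpus-per-task, in which case SLURM sets the SLURM_CPUS_PER_TASK environment variable. The names allocated_cpus and convert_chrom are hypothetical, and convert_chrom is only a stand-in for ViewVCFConvert.

```python
import os
from multiprocessing import Pool


def allocated_cpus(environ=os.environ, default=1):
    """Return the CPU count SLURM granted to this task, or a default."""
    try:
        value = int(environ.get("SLURM_CPUS_PER_TASK", ""))
    except ValueError:
        # Variable missing or not a number (e.g. running outside SLURM).
        return default
    return value if value > 0 else default


def convert_chrom(chrom):
    # Hypothetical stand-in for ViewVCFConvert.
    return f"chrom={chrom} done"


def main():
    chroms = list(range(1, 23))
    # Never start more workers than there are tasks or allocated CPUs.
    workers = min(allocated_cpus(), len(chroms))
    with Pool(processes=workers) as pool:
        results = pool.map(convert_chrom, chroms)
    print(len(results), "chromosomes processed with", workers, "workers")


if __name__ == "__main__":
    main()
```

With this approach, changing --cpus-per-task in the batch script is enough to change the degree of parallelism. Outside a SLURM job the sketch falls back to a single worker.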
