I am transitioning a bash script to snakemake and I would like to parallelize a step I was previously handling with a for loop. The issue I am running into is that instead of running parallel processes, snakemake ends up trying to run one process with all parameters and fails.

My original bash script runs a program multiple times for a range of values of the parameter K.

for num in {1..3}
do
  structure.py -K $num --input=fileprefix --output=fileprefix
done

There are multiple input files that start with fileprefix, and each run produces two main outputs; for K=1 they are fileprefix.1.meanP and fileprefix.1.meanQ. My config and snakemake files are as follows.

Config:

cat config.yaml

infile: fileprefix
K:
  - 1
  - 2
  - 3

Snakemake:

configfile: 'config.yaml'

rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
            sample = config['infile'],
            K = config['K'],
            ext = ['meanQ', 'meanP'])

rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile'],
        K = config['K']
    threads: 3
    shell:
        """
        structure.py -K {params.K} \
        --input=output/{params.prefix} \
        --output=output/{params.prefix}
        """

This was executed with snakemake --cores 3. The problem persists when I only use one thread.

I expected the outputs described above for each value of K, but the run fails with this error:

RuleException:
CalledProcessError in line 84 of Snakefile:
Command ' set -euo pipefail;  structure.py -K 1 2 3 --input=output/fileprefix \
--output=output/fileprefix ' returned non-zero exit status 2.
  File "Snakefile", line 84, in __rule_Structure
  File "snake/lib/python3.6/concurrent/futures/thread.py", line 56, in run

When I set K to a single value such as K = ['1'], everything works. So the problem seems to be that {params.K} is being expanded to all values of K when the shell command is executed. I started teaching myself snakemake today, and it works really well, but I'm hitting a brick wall with this.
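The expansion can be reproduced outside Snakemake. The following is a minimal sketch, assuming that a list-valued parameter is rendered as its items joined by spaces (which is consistent with the `-K 1 2 3` in the error message above). The helpers `render` and `build_command` are hypothetical names used only for illustration; they are not part of Snakemake.

```python
# Hypothetical stand-in for placeholder formatting: a list-valued parameter
# is ASSUMED to be rendered as its items joined by single spaces.
def render(value):
    if isinstance(value, (list, tuple)):
        return " ".join(str(item) for item in value)
    return str(value)


# Same content as config.yaml in the question.
config = {"infile": "fileprefix", "K": [1, 2, 3]}


def build_command(k):
    """Build the shell command for a given value (or list of values) of K."""
    prefix = config["infile"]
    return (
        f"structure.py -K {render(k)} "
        f"--input=output/{prefix} --output=output/{prefix}"
    )


# Passing the whole list, as params.K = config['K'] does, gives one command.
all_at_once = build_command(config["K"])

# Passing one value at a time gives one command per K, as the bash loop did.
one_per_job = [build_command(k) for k in config["K"]]

print(all_at_once)
for command in one_per_job:
    print(command)
```

The first printed line matches the failing command; the remaining three are the commands the original for loop would have run.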

  • Doesn't affect my question, but in the params it should read prefix = config['infile'] and not prefix = config['invcf']. Commented Jan 29, 2019 at 14:17
  • Your question can be edited for corrections (as many times as you want, I think). Commented Jan 30, 2019 at 15:39

2 Answers

You need to take the argument for -K from the wildcards, not from the config file. The config object is a plain Python dictionary, so config['K'] simply returns your whole list of possible values.

configfile: 'config.yaml'

rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample = config['infile'],
               K = config['K'],
               ext = ['meanQ', 'meanP'])

rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile']
    threads: 3
    shell:
        "structure.py -K {wildcards.K} "
        "--input=output/{params.prefix} "
        "--output=output/{params.prefix}"
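To illustrate why `{wildcards.K}` holds a single value per job, here is a rough sketch of the idea in plain Python: each requested target file is matched against the rule's output pattern, and the matched pieces become that job's wildcards. This is a simplified illustration, not Snakemake's actual implementation; `match_wildcards` is a hypothetical helper, and it assumes every wildcard matches one or more arbitrary characters.

```python
import re


def match_wildcards(pattern, target):
    """Return the wildcard values that make `pattern` equal `target`, or None.

    Simplified: every {name} is ASSUMED to match one or more characters.
    """
    regex = ""
    pos = 0
    for m in re.finditer(r"\{(\w+)\}", pattern):
        # Literal text before the wildcard must match exactly.
        regex += re.escape(pattern[pos:m.start()])
        # The wildcard itself becomes a named capture group.
        regex += f"(?P<{m.group(1)}>.+)"
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, target)
    return found.groupdict() if found else None


output_pattern = "output/{sample}.{K}.meanQ"

# Targets as rule all would request them for K = 1, 2, 3.
targets = [f"output/fileprefix.{k}.meanQ" for k in [1, 2, 3]]

# One set of wildcards per target, hence one job per value of K.
jobs = [match_wildcards(output_pattern, t) for t in targets]

for job in jobs:
    print(f"structure.py -K {job['K']} --input=output/{job['sample']}")
```

Because each target yields its own wildcard values, the three jobs are independent and can run in parallel when enough cores are available.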

Note that there are more things to improve here. For example, the rule structure does not define any input file, although it uses one.


Snakemake now also has an option for parameter space exploration: https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#parameter-space-exploration
