0

Hello Snakemake community,

I am having quite some troubles to define correctly a function in Snakemake and call it in the params section. The output of the function is a list and my aim is to use each item of the list as a parameter of a shell command. In other words, I would like to run multiple jobs in parallel of the same shell command with a different parameter.

This is the function:

import os, glob
def get_scontigs_names(wildcards):
   scontigs = glob.glob(os.path.join("reference", "Supercontig*"))
   files = [os.path.basename(s) for s in scontigs]
   return name

The output is a list that looks like:

['Supercontig0', 'Supercontig100', 'Supercontig2', ...]

The Snakemake rules are:

rule all:
    input:
        "updated/all_supercontigs.sorted.vcf.gz"
rule update_vcf:
    input:
        len="genome/genome_contigs_len_cumsum.txt",
        vcf="filtered/all.vcf.gz"
    output:
        cat="updated/all_supercontigs.updated.list"
    params:
        scaf=get_scontigs_names
    shell:
        """
        python 3.7 scripts/update_genomic_reg.py -len {input.len} -vcf {input.vcf} -scaf {params.scaf}
        ls updated/*.updated.vcf.gz > {output.cat}
        """

This code is incorrect because all the items of the list are loaded into the shell command when I call {params.scaf}. The current shell commands looks like:

python 3.7 scripts/update_genomic_reg.py -len genome/genome_contigs_len_cumsum.txt -vcf filtered/all.vcf.gz -scaf Supercontig0 Supercontig100 Supercontig2 ...

What I would like to get is:*

python 3.7 scripts/update_genomic_reg.py -len genome/genome_contigs_len_cumsum.txt -vcf filtered/all.vcf.gz -scaf Supercontig0

python 3.7 scripts/update_genomic_reg.py -len genome/genome_contigs_len_cumsum.txt -vcf filtered/all.vcf.gz -scaf Supercontig100

and so on.

I have tried to use wildcards inside the function but I am failing to give it the correct attribute.

There are several posts about input functions and wildcards plus the snakemake docs but I could not really apply them to my case. Can somebody help me with this, please?

1
  • have you considered xargs or parallel to run multiple jobs within a rule? Commented Dec 19, 2018 at 23:32

2 Answers 2

1

What about this below? Note that your get_scontigs_names doesn't make use of wildcards.

import os, glob

def get_scontigs_names():
   scontigs = glob.glob(os.path.join("reference", "Supercontig*"))
   files = [os.path.basename(s) for s in scontigs]
   name = [i.split('_')[0] for i in files]
   return name

supercontigs= get_scontigs_names()

rule all:
    input:
        "updated/all_supercontigs.sorted.vcf.gz"

rule update_vcf:
    input:
        len="genome/genome_contigs_len_cumsum.txt",
        vcf="filtered/all.vcf.gz",
    output:
        upd= "updated/{supercontig}.updated.vcf.gz",
    shell:
        r"""
        python 3.7 scripts/update_genomic_reg.py -len {input.len} \
            -vcf {input.vcf} -scaf {wildcards.supercontig}
        """

rule list_updated: 
    input:
        expand("updated/{supercontig}.updated.vcf.gz", supercontig= supercontigs),
    output:
        "updated/all_supercontigs.sorted.vcf.gz",
    shell:
        r"""
        ls {input} > {output}
        """
Sign up to request clarification or add additional context in comments.

2 Comments

This suggestion did not really work because of this line supercontigs= get_scontigs_names() but I solved the problem by splitting the rule into two (see below).
@SamPer - Indeed, code edited according to your answer below.
0

I have found the solution to my question inspired by @dariober.

rule all:
input:
    "updated/all_supercontigs.updated.list"

import os, glob

def get_scontigs_names(wildcards):
    scontigs = glob.glob(os.path.join("reference", "Supercontig*"))
    files = [os.path.basename(s) for s in scontigs]
    name = [i.split('_')[0] for i in files]
    return name

rule update_vcf:
    input:
        len="genome/genome_contigs_len_cumsum.txt",
        vcf="filtered/all.vcf.gz"
    output:
        vcf="updated/all_{supercontig}.updated.vcf.gz"
    params:
        py3=config["modules"]["py3"],
        scaf=get_scontigs_names
    shell:
        """
        {params.py3} scripts/update_genomic_reg.py -len {input.len} -vcf 
        {input.vcf} -scaf {wildcards.supercontig}
        """


rule list_updated:
    input:
        expand("updated/all_{supercontig}.updated.vcf.gz", supercontig = 
        supercontigs)
    output:
        "updated/all_supercontigs.updated.list"
    shell:
        """
        ls {input} > {output}
        """

2 Comments

Good you found a solution. But I still don't get the point of having get_scontigs_names as an input function in the rule since its output doesn't depend on the wildcards. Personally, I find this makes to code a bit contrived but I may be wrong. (Also, for readability I would put the import statements at the top of the script)
I have not really understood how to use wildcards inside the input function. Does anyone have a very simple and clear example of such a function?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.