I asked an earlier question about running the same Snakemake pipeline for multiple datasets, and one of the solutions, suggested by @bli, was to use multiple config files. I am trying to implement it, but I get an error when I read in a file that contains the sample information. The error:

SyntaxError:
Not all output, log and benchmark files of rule fastqc contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file.
  File "Snakefile", line 64, in <module>

I have seen this error before, but I cannot figure out why it occurs in this case, when every input and output has sample as a wildcard. Any help is much appreciated!

My Snakefile looks like this:

import os
import pandas as pd
import yaml
configfile: "main_config.yaml"

all_keys = list(config.keys())
print(all_keys)



datasets  =config["datasets"]

print(datasets.items())


for p_id, p_info in datasets.items():
    for key in p_info:
     print(key + '------',p_info[key])
     conf_file=p_info["conf"]
     conf_fh=open(conf_file)
     dat_conf = yaml.safe_load(conf_fh)
     output = dat_conf["output_dir"]
     samples = dat_conf["sampletable"]
     R1 = dat_conf["R1"]
     R2 = dat_conf["R2"]
     print(samples)
     print(R1)


SampleTable = pd.read_table(samples,index_col=0)
SAMPLES = list(SampleTable.index)
print(SAMPLES)
PAIRED_END= ('R2' in SampleTable.columns)
FRACTIONS= ['R1']
if PAIRED_END: FRACTIONS+= ['R2']

qc = config["qc_only"]

def all_input_reads(qc):
    if config["qc_only"]:
        return expand("{output}/fastqc/{sample}" + config["R1"] + "_fastqc.html", sample=SAMPLES)
    else:
        return expand("{output}/fastqc/{sample}" + config["R1"] + "_fastqc.html", sample=SAMPLES)


rule all:
    input:
         all_input_reads


rule fastqc:
    input:
      unpack( lambda wc: dict(SampleTable.loc[wc.sample]))
    output:
      R1= "{output}/fastqc/{sample}{R1}" +"_fastqc.html",
      R2 ="{output}/fastqc/{sample}{R2}"  +"_fastqc.html"
    conda:
      "../envs/fastqc.yaml"
    log:
       "{output}/logs/qc/fastqc_{sample}_unfilt.log"
    shell: "fastqc -o {output}/fastqc {input.R1} {input.R2} >> {log}"

The main config file is:

  datasets:
    
     dat1:
          conf: "config_files/data1_config.yaml"
     dat2:
          conf: "config_files/data2_config.yaml"
     qc_only: FALSE

and the individual config files look like this (data1_config.yaml):

# List of files

sampletable: "samples_data1.tsv"
output_dir: "data1"
## Cutadapt
## IMPORTANT ****** If you want to remove primers uncomment line 51  in utils/rules/qc_cutadapt.smk which will allow for primers to be removed

primers:
# Illumina V3V4 protocol primers
fwd_primer: "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG"
rev_primer: "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC"
fwd_primer_rc: "CTGCWGCCNCCCGTAGGCTGTCTCTTATACACATCTGACGCTGCCGACGA"
rev_primer_rc: "GGATTAGATACCCBDGTAGTCCTGTCTCTTATACACATCTCCGAGCCCACGAGAC"


R1: "_R1"
R2: "_R2"

maxEE:
  - 2
  - 2
truncQ: 2

1 Answer

Here is the output section of your fastqc rule, lightly reformatted:

rule fastqc:
    output:
        R1 = "{output}/fastqc/{sample}{R1}" + "_fastqc.html",
        R2 = "{output}/fastqc/{sample}{R2}" + "_fastqc.html"

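Inside an output string, {R1} and {R2} are wildcards, not the values from your config file. That means output.R1 uses the wildcards output, sample and R1, while output.R2 uses output, sample and R2 (and the log uses only output and sample), which is exactly what the error complains about. The two wildcards are also interchangeable, so there is no way for Snakemake to differentiate them. For example, imagine that rule all requires this file to be created: output_dir/fastqc/sample_aR1_fastqc.html. Which one should Snakemake assign this file to, output.R1 or output.R2?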
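
To make the mismatch visible, here is a minimal sketch in plain Python that mimics the check behind the error message. It is not Snakemake's actual implementation: `wildcard_names` and the `WILDCARD` regex are hypothetical helpers written for illustration, and they only handle simple `{name}` and `{name,constraint}` wildcards.

```python
import re

# Hypothetical helper, not part of Snakemake: matches "{name}" and
# "{name,constraint}" and captures only the wildcard name.
WILDCARD = re.compile(r"\{\s*(\w+)\s*(?:,[^{}]*)?\}")

def wildcard_names(pattern):
    """Return the set of wildcard names used in a file pattern."""
    return set(WILDCARD.findall(pattern))

# The output and log patterns of rule fastqc from the question.
patterns = {
    "R1": "{output}/fastqc/{sample}{R1}_fastqc.html",
    "R2": "{output}/fastqc/{sample}{R2}_fastqc.html",
    "log": "{output}/logs/qc/fastqc_{sample}_unfilt.log",
}

names = {key: wildcard_names(p) for key, p in patterns.items()}

# The rule is only acceptable if every pattern uses the same wildcard set.
same = len({frozenset(s) for s in names.values()}) == 1

for key, found in names.items():
    print(key, sorted(found))
print("same wildcards everywhere:", same)
```

The three patterns yield three different sets of names, so a check of this kind rejects the rule.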

You need to separate these parameters with a non-wildcard, like this:

rule fastqc:
    output:
        R1 = "{output}/fastqc/{sample}R1" + "_fastqc.html",
        R2 = "{output}/fastqc/{sample}R2" + "_fastqc.html"
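
If the suffixes should still come from the per-dataset config, one option is to concatenate the config values into the pattern as ordinary strings, so that they become literal text rather than wildcards. The sketch below is plain Python under stated assumptions: `dat_conf` stands in for the dictionary loaded from data1_config.yaml, and the `WILDCARD` regex is a hypothetical helper, not Snakemake's parser.

```python
import re

# Hypothetical helper regex (not Snakemake's own parser): captures the
# name of each "{name}" or "{name,constraint}" wildcard.
WILDCARD = re.compile(r"\{\s*(\w+)\s*(?:,[^{}]*)?\}")

# Assumption: stand-in for the values read from data1_config.yaml.
dat_conf = {"R1": "_R1", "R2": "_R2"}

# The suffixes are concatenated as plain strings, so only {output} and
# {sample} remain as wildcards in every pattern.
out_r1 = "{output}/fastqc/{sample}" + dat_conf["R1"] + "_fastqc.html"
out_r2 = "{output}/fastqc/{sample}" + dat_conf["R2"] + "_fastqc.html"
log = "{output}/logs/qc/fastqc_{sample}_unfilt.log"

wildcard_sets = [set(WILDCARD.findall(p)) for p in (out_r1, out_r2, log)]

print(out_r1)  # {output}/fastqc/{sample}_R1_fastqc.html
print(all(s == {"output", "sample"} for s in wildcard_sets))  # True
```

With the suffixes fixed this way, both outputs and the log share the same two wildcards. Separately, inside the shell command `{output}` most likely refers to the rule's output files rather than to the wildcard of the same name, so `{wildcards.output}` (or a differently named wildcard) is probably what the command needs there.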