Due to time constraints, you are tasked with writing a nextflow script for the following processes:
I will provide a full nextflow script of the BASH workflow for you to run when you have completed the exercise.
In the BASH workflow, you may have noticed nearly every step required the reference genome or the exome interval list file. In nextflow
we typically intialise files with a channel, however usually channels can only be used once.
Save the script as non-reusable.nf
and run it -> nextflow run non-reusable.nf
.
#!/usr/bin/env nextflow
Channel
.fromPath("/data/MSc/2020/MA5112/Variant_Calling/reference/GRCh37.fasta")
.set{ fasta_ch }
process A{
echo true
input:
file(fasta) from fasta_ch
script:
"""
printf "Process A, fasta file: $fasta\n"
"""
}
process B{
echo true
input:
file(fasta) from fasta_ch
script:
"""
printf "Process B, fasta file: $fasta\n"
"""
}
Note the error message:
Channel
fasta_ch
has been used twice as an input by processB
and processA
.into{}
exampleOne way to overcome this is to put the file into multiple channels using .into{}
instead of .set{}
.
Save the script as into-reusable.nf
and run -> nextflow run into-reusable.nf
#!/usr/bin/env nextflow
Channel
.fromPath("/data/MSc/2020/MA5112/Variant_Calling/reference/GRCh37.fasta")
.into{ fasta_ch1; fasta_ch2 }
process A{
echo true
input:
file(fasta) from fasta_ch1
script:
"""
printf "Process A, fasta file: $fasta\n"
"""
}
process B{
echo true
input:
file(fasta) from fasta_ch2
script:
"""
printf "Process B, fasta file: $fasta\n"
"""
}
Process A and B can now run because the file has been put into two unique channels.
.value(file())
examplePlacing a file into multiple channels is fine but is considered poor practice when writing large pipelines. For the variant calling workflow, we would have to place the fasta reference genome file into multiple channels, with each process hardcoded to a specific fasta_ch(n)
. This is cumbersome to keep track of!
To create a reusable channel we are going to use Channel.value(file())
to tell nextflow to stage the file path as a value and intialise it as a file. We can reuse this throughout the pipeline.
Save the below script as value-reusable.nf
and run -> nextflow run value-reusable.nf
#!/usr/bin/env nextflow
Channel
.value(file("/data/MSc/2020/MA5112/Variant_Calling/reference/GRCh37.fasta"))
.set{ fasta_ch }
process A{
echo true
input:
file(fasta) from fasta_ch
script:
"""
printf "Process A, fasta file: $fasta\n"
"""
}
process B{
echo true
input:
file(fasta) from fasta_ch
script:
"""
printf "Process B, fasta file: $fasta\n"
"""
}
We can see that the channel fasta_ch
is reusable across processes.
I have started you off by staging the input files required for the variant calling workflow. The Index files (.amb
, .ann
, .bwt
, .pac
, .sa
, .fai
, .dict
) are all in a channel called index_ch
. You will never need to specifically call one of these files, they just have to be present in the nextflow workdir for certain processes like read alignment (requires BWA indices) or BQSR (requires SAMtools .fai
and Sequence Dictironary .dict
files).
Other files such as dbSNP
, Mills_KG
or exome_intervals
do need to be called specifically for flags. For this reason, they are in their own channel.
Save the script as stage_inputs.nf
and run -> nextflow run stage_inputs.nf
#!/usr/bin/env nextflow
/*
* Stage PE Sequencing Reads
*/
reads_ch = Channel.fromFilePairs("/data/MSc/2020/MA5112/Variant_Calling/reads/*_r{1,2}.fastq.gz")
/*
* Stage Input files for Variant Calling:
*/
fasta_ch = Channel.value(file("/data/MSc/2020/MA5112/Variant_Calling/reference/GRCh37.fasta"))
exome_interval_ch = Channel.value(file("/data/MSc/2020/MA5112/Variant_Calling/assets/exome.bed.interval_list"))
mills_ch = Channel.value(file("/data/MSc/2020/MA5112/Variant_Calling/assets/Mills_KG_gold.indels.b37.vcf.gz"))
dbsnp_ch = Channel.value(file("/data/MSc/2020/MA5112/Variant_Calling/assets/dbsnp_138.b37.vcf.gz"))
index_ch = Channel.value(file("/data/MSc/2020/MA5112/Variant_Calling/reference/*GRCh37.{dict,fasta.}*"))
/*
* I will show you how to call each of these files
*/
process A {
echo true
input:
tuple val(base), file(reads) from reads_ch
file(index) from index_ch
file(exome) from exome_interval_ch
file(dbsnp) from dbsnp_ch
file(mills) from mills_ch
script:
"""
echo $base, $reads
echo $index
echo $exome
echo $dbsnp
echo $mills
"""
}
/*
* Check they can be reused.. (not the fastq pairs)
*/
index_ch.view()
exome_interval_ch.view()
dbsnp_ch.view()
mills_ch.view()
You are required to build on the stage_inputs.nf
script and perform read alignment + markduplicates. Do not run this on the head node, submit the script with the -bg
flag to invoke the SLURM job scheduler settings in your config file.
Refer to the full pipeline we ran for guidance. Below is a skeleton script to help:
#!/usr/bin/env nextflow
params.outdir = "./"
/*
* Stage PE Sequencing Reads
*/
reads_ch = Channel.fromFilePairs("/data/MSc/2020/MA5112/Variant_Calling/reads/*_r{1,2}.fastq.gz")
/*
* Stage Input files for Variant Calling:
*/
fasta_ch = Channel.value(file("/data/MSc/2020/MA5112/Variant_Calling/reference/GRCh37.fasta"))
exome_interval_ch = Channel.value(file("/data/MSc/2020/MA5112/Variant_Calling/assets/exome.bed.interval_list"))
mills_ch = Channel.value(file("/data/MSc/2020/MA5112/Variant_Calling/assets/Mills_KG_gold.indels.b37.vcf.gz"))
dbsnp_ch = Channel.value(file("/data/MSc/2020/MA5112/Variant_Calling/assets/dbsnp_138.b37.vcf.gz"))
index_ch = Channel.value(file("/data/MSc/2020/MA5112/Variant_Calling/reference/*GRCh37.{dict,fasta.}*"))
/*
* ALIGN READS
*/
process bwa {
publishDir "${params.outdir}/bwa_aln", mode:'copy'
input:
output:
script:
"""
"""
}
/*
* MARK DUPLICATES
*/
// Use the output BAM file as an input for this process
process markduplicates{
publishDir "${params.outdir}/markdups", mode:'copy'
input:
output:
script:
"""
"""
}
If you want to see the full pipeline in nextflow, it is available (variant_calling.nf) at the following repository. Note I used .getval()
instead of Channel.value(file())
, slightly outdated syntax but it still works.
Save the script to your own directory and run it by calling:
nextflow -bg -q run variant_calling.nf \
--outDir $(pwd) \
--analysisDir "/data/MSc/2020/MA5112/Variant_Calling" \
-with-singularity /data/MSc/2020/MA5112/Variant_Calling/container/germline_vc.img