A detailed pipeline has been provided for you to run a germline variant calling analysis on LUGH before attempting to create a nextflow script.
Do not run this analysis on the head node!
Request computing resources on LUGH:
salloc -p MSC -n 1
Check which compute node your job is on:
bdigby@lugh:/data/bdigby$ squeue -u bdigby
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4529470 MSC bash bdigby R 0:07 1 compute01
ssh to the NODELIST listed:
ssh compute01
Due to time constraints, the reference genome files, SNP, INDEL and exome interval files have been prepared for you.
These files and sequencing reads are located under the assets/
, reads/
and reference/
directories at
/data/MSc/2020/MA5112/Variant_Calling
Do not copy these files to your directory, this will cause uneccessary bloat in data/
. We are going to create a shell variable $var_cal
as a shorthand reference to the path.
First, shell into the singularity container:
singularity shell -B /data \
/data/MSc/2020/MA5112/Variant_Calling/container/germline_vc.img
Now create the shorthand path variable:
var_cal="/data/MSc/2020/MA5112/Variant_Calling"
Test it quickly by running the following command:
head ${var_cal}/assets/exome.bed.interval_list
This is much faster than typing out head /data/MSc/2020/MA5112/Variant_Calling/assets/exome.bed.interval_list
and will save time when stepping through the pipeline.
Due to time constraints all indexing has been performed for you. Skip to step 2.
BWA requires building an index for your reference genome to allow computationally efficient searches of the genome during sequence alignment.
bwa index -a bwtsw GRCh37.fasta
-a
: Construction algorithm (bwtsw, is or rb2). For large genomes, specify -a bwtsw
. If you are unsure, omit this flag and bwa
will determine the correct algorithm to use.BWA indexing produces 5 files with the file extensions *.amb
, *.ann
, *btw
, *.pac
& *.sa
.
Indexes (or queries) the reference sequence.
samtools faidx GRCh37.fasta
Samtools fasta indexing generates a *.fai
file.
Creates a sequence dictionary of the reference fasta file specifying the chromosomes and chromosome sizes. Required for downstream analysis tools by GATK.
R
: Reference Genomepicard CreateSequenceDictionary \
R=GRCh37.fasta \
O=GRCh37.dict
O
: Dictionary file. You may name this whatever you want, however common convention dictates using *.dict
extensions.Map the reads to the reference genome using bwa mem
. The SAM files produced have reads in the order that the sequences occurred in the input FASTQ files i.e in randomn order. samtools sort
orders the reads by their leftmost coordinate i.e in ‘genome order’. This is a requirement for downstream tools.
We will also convert the SAM file to to its binary counterpart: a BAM file.
bwa mem \
-K 100000000 \
-R "@RG\tID:sample_1\tLB:sample_1\tPL:ILLUMINA\tPM:HISEQ\tSM:sample_1" \
-t 2 \
${var_cal}/reference/GRCh37.fasta \
${var_cal}/reads/subsample_r1.fastq.gz \
${var_cal}/reads/subsample_r2.fastq.gz \
| samtools sort - > subsample.bam
-K
: process INT input bases in each batch regardless of nThreads (for reproducibility)-R
: Read Group identifier. This information is key for downstream GATK functionality, GATK will not work without a read group tag.-t
: Number of threadsThe script passes the output of bwa mem
(subsample.sam
) directly to samtools sort
by using the pipe command (|
). A dash -
is used to indicate the incoming file from the previous process for samtools. When working with full sized samples, allocate more threads to samtools like so samtools sort --threads 8 - > subsample.bam
.
N.B: bwa mem
requires all bwa idx
output files & the reference genome to be present in the GRCh37.fasta
directory to run.
“Almost all statistical models for variant calling assume some sort of independence between measurements. The duplicates (if one assumes that they arise from PCR artifact) are not independent. This lack of independence will usually lead to a breakdown of the statistical model and measures of statistical significance that are incorrect” – Sean Davis.
--INPUT
: Sorted BAM file.gatk --java-options -Xmx2g \
MarkDuplicates \
--MAX_RECORDS_IN_RAM 50000 \
--INPUT subsample.bam \
--METRICS_FILE subsample.bam.metrics \
--TMP_DIR ./ \
--ASSUME_SORT_ORDER coordinate \
--CREATE_INDEX true \
--OUTPUT subsample.markdup.bam
--MAX_RECORDS_IN_RAM
: When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk.--ASSUME_SORT_ORDER
: Assume that the input file has this order (even if the header says otherwise). Default value: null. Possible values: {unsorted, queryname, coordinate, duplicate, unknown}.--METRICS_FILE
: Statistics of duplication events in sequencing reads.--OUTPUT
: A BAM file with collapsed duplicates.Base Quality Score Recalibration. A data pre-processing step that detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call.
To build the recalibration model, BaseRecalibrator
goes through all of the reads in the input BAM file and tabulates data about the following features of the bases:
For each bin, the tool counts the number of bases within the bin and how often such bases mismatch the reference base, excluding loci known to vary in the population, according to the known variants resource (typically dbSNP, Mills_KG_gold). This information is output to a recalibration file in GATKReport format.
-I
: Marked Duplicates BAM file-L
: Exome Interval list-R
: Reference Genome--known-sites
: Known variants (SNPs + INDELs)gatk --java-options -Xmx2g \
BaseRecalibrator \
-I subsample.markdup.bam \
-O subsample.recal.table \
-L ${var_cal}/assets/exome.bed.interval_list \
--tmp-dir ./ \
-R ${var_cal}/reference/GRCh37.fasta \
--known-sites ${var_cal}/assets/Mills_KG_gold.indels.b37.vcf.gz \
--known-sites ${var_cal}/assets/dbsnp_138.b37.vcf.gz
-O
: Recalibration table file. Required for next step.ApplyBQSR
goes through all the reads again, using the recalibration table file to adjust each base’s score based on which bins it falls in. So effectively the new quality score is:
Following recalibration, the read quality scores are much closer to their empirical scores than before. This means they can be used in a statistically robust manner for downstream processing, such as variant calling. In addition, by accounting for quality changes by cycle and sequence context, we can identify truly high quality bases in the reads, often finding a subset of bases that are Q30 even when no bases were originally labeled as such.
-I
: Marked Duplicates BAM file-L
: Exome Interval list-R
: Reference Genome--bqsr-recal-file
: Recalibration table file (generated in previous step).gatk --java-options -Xmx2g \
ApplyBQSR \
-I subsample.markdup.bam \
-O subsample.recal.bam \
-L ${var_cal}/assets/exome.bed.interval_list \
-R ${var_cal}/reference/GRCh37.fasta \
--bqsr-recal-file subsample.recal.table
-O
: Recalibrated, Marked Duplicate BAM file.Tidy up the indexed BAM file & retrieve the statistics of the recalibrated BAM file.
mv subsample.recal.bai subsample.recal.bam.bai
samtools stats subsample.recal.bam > subsample.recal.stats
N.B Both BQSR steps require the samtools reference index file *.fai
and picards dictionary *.dict
file to be present in the reference genome directory.
Identify germline short variants (SNPs and INDELs) in an individual, or in a cohort. This tutorial is focused on a single sample germline variant calling analysis. Not to be confused with somatic variant calling which uses matched tumour - normal samples, requiring a different workflow.
Performing a single sample analysis and a joint cohort analysis follow the same steps covered below.
Call germline SNPs and indels via local re-assembly of haplotypes by:
-R
: Reference Genome-I
: Recalibrated BAM-L
: Exome Interval list-D
: Known SNPs (dbSNP)gatk --java-options -Xmx2g \
HaplotypeCaller \
-R ${var_cal}/reference/GRCh37.fasta \
-I subsample.recal.bam \
-L ${var_cal}/assets/exome.bed.interval_list \
-D ${var_cal}/assets/dbsnp_138.b37.vcf.gz \
-O subsample.g.vcf \
-ERC GVCF
-ERC
: Mode for emitting reference confidence scores. The key difference between a regular VCF and a GVCF is that the GVCF has records for all sites, whether there is a variant call there or not. The goal is to have every site represented in the file in order to do joint analysis of a cohort in subsequent steps.-O
: GVCF as described above.GenotypeVCFs
is designed to perform joint genotyping on a single input, which may contain one or many samples. In any case, the input samples must possess genotype likelihoods produced by HaplotypeCaller with -ERC GVCF
or -ERC BP_RESOLUTION
.
-R
: Reference genome file.-L
: Exome interval list file.-D
: dbSNP annotation file.-V
: GVCF file generated from HaplotypeCaller
.gatk --java-options -Xmx2g \
GenotypeGVCFs \
-R ${var_cal}/reference/GRCh37.fasta \
-L ${var_cal}/assets/exome.bed.interval_list \
-D ${var_cal}/assets/dbsnp_138.b37.vcf.gz \
-V subsample.g.vcf \
-O subsample.vcf
-O
: VCF file containing variants.N.B Both Variant Calling steps require the samtools reference index file *.fai
and picards dictionary *.dict
file to be present in the reference genome directory.
Before applying filtering thresholds to the called variants, subset the VCF file for both SNPs and INDELs using SelectVariants
.
-R
: Reference Genome-V
: VCF file produced by GenotypeVCFs
gatk SelectVariants \
-R ${var_cal}/reference/GRCh37.fasta \
-V subsample.vcf \
-O subsample.snps.vcf.gz \
-select-type SNP
-select-type
: Specify variant type to subset (SNP/INDEL).-O
: VCF file containing SNPs-R
: Reference Genome-V
: VCF file produced by GenotypeVCFs
gatk SelectVariants \
-R ${var_cal}/reference/GRCh37.fasta \
-V subsample.vcf \
-O subsample.indels.vcf.gz \
-select-type INDEL
-select-type
: Specify variant type to subset (SNP/INDEL).-O
: VCF file containing INDELsN.B Both subsetting steps require the samtools reference index file *.fai
and picards dictionary *.dict
file to be present in the reference genome directory.
Apply filtering to selected variants generated in the previous step.
Please refer to the following blog post by GATK regarding hard-filtering SNPs.
-R
: Reference Genome-V
: VCF file containing SNPsgatk VariantFiltration \
-R ${var_cal}/reference/GRCh37.fasta \
-V subsample.snps.vcf.gz \
-O subsample.filt_snps.vcf \
--filter-expression "QD < 2.0" \
--filter-name "filterQD_lt2.0" \
--filter-expression "MQ < 25.0" \
--filter-name "filterMQ_lt25.0" \
--filter-expression "SOR > 3.0" \
--filter-name "filterSOR_gt3.0" \
--filter-expression "MQRankSum < -12.5" \
--filter-name "filterMQRankSum_lt-12.5" \
--filter-expression "ReadPosRankSum < -8.0" \
--filter-name "filterReadPosRankSum_lt-8.0"
--filter-expression
: Filtering to be applied to SNPs.--filter-name
: Annotation of SNPs failing filtering threshold written to VCF file.-O
: VCF file with annotated SNPs failing the selected filtering thresholds. SNPs that pass all thresholds will be marked with PASS
.Please refer to the following blog post by GATK regarding hard-filtering INDELs.
-R
: Reference Genome-V
: VCF file containing INDELsgatk VariantFiltration \
-R ${var_cal}/reference/GRCh37.fasta \
-V subsample.indels.vcf.gz \
-O subsample.filt_indels.vcf \
--filter-expression "QD < 2.0" \
--filter-name "filterQD" \
--filter-expression "SOR > 10.0" \
--filter-name "filterSOR_gt10.0" \
--filter-expression "ReadPosRankSum < -20.0" \
--filter-name "filterReadPosRankSum"
--filter-expression
: Filtering to be applied to INDELs.--filter-name
: Annotation of INDELs failing filtering threshold written to VCF file.-O
: VCF file with annotated INDELs failing the selected filtering thresholds. INDELs that pass all thresholds will be marked with PASS
.N.B Both filtering steps require the samtools reference index file *.fai
and picards dictionary *.dict
file to be present in the reference genome directory.
MergeVCFs
combines multiple variant files into a single variant file.
-I
: Filtered SNPs VCF file-I
: Filtered INDELs VCF filegatk MergeVcfs \
-I subsample.filt_indels.vcf \
-I subsample.filt_snps.vcf \
-O filtered_sample.vcf
-O
: The same VCF file output by GenotypeVCFs
, with added annotations denoting filtering thresholds failed or PASS
if the variant passed all thresholds.Please do not run this. You will not be able to complete this step with the compute resources we have requested (java.lang.OutOfMemoryError: GC overhead limit exceeded
).
I have included snpEff
in the full nextflow script, which I will provide at the end of the tutorial.
snpEff GRCh37.75 \
-csvStats subsample.snpEff.csv \
-nodownload \
-dataDir /data/snpEff/ \
-canon \
-v filtered_sample.vcf > subsample.snpEff.ann.vcf
I will show you a pre-prepared snpEff file to conclude the pipeline walkthrough.