Please do not run analyses in the /home/
directory on lugh. All analyses must be run in the /data/
directory.
Navigate to /data/MSc/2022
and make a directory for yourself there using the naming convention [first letter forename][surname]
eg: bdigby
.
If you have already created a directory like this with previous weeks tutorials, move it to this directory. Be sure to include the -R
flag to add all sub-directories recursively.
Please delete all analysis files created in the
/home/
directory
All files required for the tutorial are present in /data/MSc/2022/chip_scratch/files
:
SPT5_T0
samples (2 replicates)SPT5_Input
controls (2 replicates)The
fastq
files hold ChiP-Seq data generated from S.cerevisiae samples. The files have been aggressively sub-sampled (100,000 reads per file) to speed up the analysis.
In addition to the analysis files, the container we are going to use is located in the container cache on lugh:
/data/containers/nfcore-chipseq-1.2.2.img
Before you start, request resources on a MSC
compute node:
salloc -p MSC -n 1 -c 2
Check which node you have been assigned to using squeue
and your username:
squeue -u bdigby
Shell into the compute node using ssh
:
ssh compute[1/2/3]
Be sure to change to your directory on /data/MSc/2022/username
after this step.
Finally, shell into the container to access the suite of tools required for the tutorial:
module load singularity
singularity shell -B /data/ /data/containers/nfcore-chipseq-1.2.2.img
Copy the tutorial files to your directory on /data/MSc/2022/username
:
cp -R /data/MSc/2022/chip_scratch/files/ .
Copy the genome.fa
file to your current working directory and index the reference genome file using bwa index
:
cp files/genome.fa .
bwa index -a bwtsw genome.fa
mkdir -p BWAIndex
mv genome.fa* BWAIndex/
The fastq
files provided for the analysis are already of high quality with no contamination. Regardless, we will generate FastQC
reports for the sequencing reads as this is a step inherent in all genomic assays.
Make a directory for quality control files called fastqc
, and run fastqc
on the reads:
mkdir -p fastqc
fastqc files/*.fastq.gz --outdir fastqc/
We will align the sequencing reads to the reference genome using bwa
. We require the sequencing files in pairs (both R1 and R2 are passed simultaneously), the reference genome FASTA file and the genome index files we previously generated.
Recall there are 2 replicates for SPT5_T0
and the input controls SPT5_Input
. We will merge these replicates further downstream - it is not good practice to merge replicate fastq files prior to alignment (think about this critically).
We will also need to specify the @RG
header for bwa
to specify membership in bam
files i.e which replicates belong to SPT5_T0
or SPT5_Input
. This is crucial when merging bam
files downstream..
Create a directory called bwa
. This is where we will perform alignment.
We will use symbolic links to the reference index files and place them in the bwa
directory:
mkdir -p bwa && cd bwa
ln -s ../BWAIndex/* .
Inspect the symbolic links, ask questions if what we are doing here is unclear. (I’m using symbolic links here as this is the underlying mechanism nextflow uses in its processes).
Singularity germline_vc.img:/data/MSc/2022/bdigby/chip_seq/bwa> ls -la
total 0
drwxr-xr-x 2 bdigby bdigby 155 Feb 15 15:01 .
drwxrwxr-x 10 bdigby bdigby 230 Feb 15 15:00 ..
lrwxrwxrwx 1 bdigby bdigby 21 Feb 15 15:01 genome.fa -> ../BWAIndex/genome.fa
lrwxrwxrwx 1 bdigby bdigby 25 Feb 15 15:01 genome.fa.amb -> ../BWAIndex/genome.fa.amb
lrwxrwxrwx 1 bdigby bdigby 25 Feb 15 15:01 genome.fa.ann -> ../BWAIndex/genome.fa.ann
lrwxrwxrwx 1 bdigby bdigby 25 Feb 15 15:01 genome.fa.bwt -> ../BWAIndex/genome.fa.bwt
lrwxrwxrwx 1 bdigby bdigby 25 Feb 15 15:01 genome.fa.pac -> ../BWAIndex/genome.fa.pac
lrwxrwxrwx 1 bdigby bdigby 24 Feb 15 15:01 genome.fa.sa -> ../BWAIndex/genome.fa.sa
Please take your time with this, downstream processes will fail if you make an error here.
Using SPT5_T0_1_R1.fastq.gz
and SPT5_T0_1_R2.fastq.gz
as an example, the corresponding @RG
header will be:
@RG\tID:SPT5_T0_1\tSM:SPT5_T0\tPL:ILLUMINA\tLB:SPT5_T0_1\tPU:1
Notice the SM:
sample identifier specifies this sample belongs to SPT5_T0
of which there are two replicates.
The
ID
provided for each sample must be unique
Align the files using the following command:
bwa mem -t 2 -M -R '@RG\tID:SPT5_T0_1\tSM:SPT5_T0\tPL:ILLUMINA\tLB:SPT5_T0_1\tPU:1' genome.fa ../files/SPT5_T0_1_R1.fastq.gz ../files/SPT5_T0_1_R2.fastq.gz | samtools view -@ 2 -b -h -F 0x0100 -O BAM -o SPT5_T0_1.bam
genome.fa
is pointing to both the reference FASTA genome and the index filesgenome.fa.*
Repeat the above code for the 3 other samples SPT5_T0_2_*
, SPT5_Input_1_*
and SPT5_Input_2_*
. Update the input @RG
header, the input reads and output bam
name for each iteration.
When you are finished the following bam
files should be in your bwa
directory:
SPT5_Input_1.bam SPT5_Input_2.bam SPT5_T0_1.bam SPT5_T0_2.bam
Automating this process is slightly tricky but solution is broken down below:
for file in ../files/*_R1.fastq.gz; do R1=$file; R2=${file/_R1.fastq.gz/}_R2.fastq.gz; echo $R1 $R2; done
gives:
../files/SPT5_Input_1_R1.fastq.gz ../files/SPT5_Input_1_R2.fastq.gz
../files/SPT5_Input_2_R1.fastq.gz ../files/SPT5_Input_2_R2.fastq.gz
../files/SPT5_T0_1_R1.fastq.gz ../files/SPT5_T0_1_R2.fastq.gz
../files/SPT5_T0_2_R1.fastq.gz ../files/SPT5_T0_2_R2.fastq.gz
which is exactly the input we want to pass to bwa
, read pairs passed at the same time.
The next step is to extract the ID name and Sample name for the @RG
header. Build on your previous code (Indentation optional, helps readability..):
for file in ../files/*_R1.fastq.gz; do \
R1=$file; \
R2=${file/_R1.fastq.gz/}_R2.fastq.gz; \
id=$(basename $file _R1.fastq.gz); \
sample=${id%_*}; \
echo $R1 $R2 $id $sample; \
done
gives:
../files/SPT5_Input_1_R1.fastq.gz ../files/SPT5_Input_1_R2.fastq.gz SPT5_Input_1 SPT5_Input
../files/SPT5_Input_2_R1.fastq.gz ../files/SPT5_Input_2_R2.fastq.gz SPT5_Input_2 SPT5_Input
../files/SPT5_T0_1_R1.fastq.gz ../files/SPT5_T0_1_R2.fastq.gz SPT5_T0_1 SPT5_T0
../files/SPT5_T0_2_R1.fastq.gz ../files/SPT5_T0_2_R2.fastq.gz SPT5_T0_2 SPT5_T0
We have everything we need for bwa
automation, just wrap the for loop in the bwa
command.
Don’t worry about the code used - the important part is that you are able to identify patterns in the file names and think critically about how they map to the
bwa
command. Once you know what you want to achieve, googling the code becomes much easier.
for file in ../files/*_R1.fastq.gz; do \
R1=$file; \
R2=${file/_R1.fastq.gz/}_R2.fastq.gz; \
id=$(basename $file _R1.fastq.gz); \
sample=${id%_*}; \
bwa mem -t 2 \
-M \
-R "@RG\tID:${id}\tSM:${sample}\tPL:ILLUMINA\tLB:${id}\tPU:1" \
genome.fa \
$R1 $R2 | samtools view -@ 2 -b -h -F 0x0100 -O BAM -o ${id}.bam; \
done
You must place the
@RG
string in double quotes for the variable expansion to work.
Make sure that the @RG
headers are correctly formatted. You can use the command samtools view -H <bam>
to inspect the header information of a bam
file which contains chromosome information, @RG
headers, and the history of commands used to generate the bam
file. We will use grep
to isolate the @RG
headers:
for bam in *.bam; do samtools view -H $bam | grep "@RG"; done
Note: If you recieve
bam
files for an analysis and want to know how they were produced, inspect the@PG
lines in the header.
After generating bam
files for all of the samples the next step is to sort and index the files.
run this in the
bwa/
directory containing thebam
files generated in the previous step.
In the interest of time, use the for loop to do this step:
for bam in *.bam; do \
prefix=$(basename $bam .bam);\
samtools sort -@ 2 -o ${prefix}.sorted.bam $bam;\
samtools index ${prefix}.sorted.bam
done
The bam
files are now sorted by coordinates and indexed. This is done for computational efficiency and required for downstream steps.
Now we can merge our replicates into one bam
file before marking duplicates, the final preprocessing step prior to macs2
analysis.
At each step, we must index the output
bam
files
picard MergeSamFiles \
INPUT=SPT5_T0_1.sorted.bam \
INPUT=SPT5_T0_2.sorted.bam \
OUTPUT=SPT5_T0.sorted.bam \
SORT_ORDER=coordinate \
VALIDATION_STRINGENCY=LENIENT
samtools index SPT5_T0.sorted.bam
picard MarkDuplicates \
INPUT=SPT5_T0.sorted.bam \
OUTPUT=SPT5_T0.markdups.bam \
ASSUME_SORTED=true \
REMOVE_DUPLICATES=false \
VALIDATION_STRINGENCY=LENIENT \
METRICS_FILE=../fastqc/SPT5_T0.markdups.metrics.txt
samtools index SPT5_T0.markdups.bam
Repeat these 4 steps for the SPT5_Input
samples. We will be using the markdups.bam
files for peak calling.
This is seriously repetitive stuff….. . . . . . ….. . . . . and we only have 4 replicates 0_o
At long last….
move to the top of your directory tree for the analysis. make a new directory called macs2
which is where we will store the results.
mkdir -p macs2
run macs on our two bam files:
macs2 callpeak -t bwa/SPT5_T0.markdups.bam -c bwa/SPT5_Input.markdups.bam -f BAMPE --outdir macs2/ --name SPT5
annotatePeaks.pl
is a cool perl script that provides dense annotation of each peak using the input GTF
file:
annotatePeaks.pl macs2/SPT5_summits.bed files/genome.fa -gid -gtf files/genes.gtf > SPT5_annotated_peaks.txt
Check your output file is the same:
wc -l SPT5_annotated_peaks.txt
714
head SPT5_annotated_peaks.txt
PeakID (cmd=annotatePeaks.pl SPT5_summits.bed ../files/genome.fa -gid -gtf ../files/genes.gtf) Chr Start End Strand Peak Score Focus Ratio/Region SizeAnnotation Detailed Annotation Distance to TSS Nearest PromoterID Entrez ID Nearest Unigene Nearest Refseq Nearest Ensembl Gene Name Gene Alias Gene Description Gene Type
SPT5_peak_460 XII 460510 460510 + 838.196 NA TTS (RDN25-2) rRNA-TTS (RDN25-2) 834 RDN5-1 RDN5-1 RDN5-1 RDN5-1 rRNA
SPT5_peak_465 XII 490122 490122 + 218.423 NA promoter-TSS (YLR162W-A) protein_coding-promoter-TSS (YLR162W-A) -284 YLR162W-A YLR162W-A YLR162W-A RRT15 protein_coding
SPT5_peak_510 XII 1071609 1071609 + 139.342 NA promoter-TSS (YLR466C-A) protein_coding-promoter-TSS (YLR466C-A) 101 YLR466C-B YLR466C-B YLR466C-B YLR466C-B protein_coding
SPT5_peak_662 XV 1091237 1091237 + 134.376 NA promoter-TSS (YOR396C-A) protein_coding-promoter-TSS (YOR396C-A) -740 YOR396C-A YOR396C-A YOR396C-A YOR396C-A protein_coding
SPT5_peak_332 VIII 68 68 + 80.9434 NA promoter-TSS (YHL050W-A) protein_coding-promoter-TSS (YHL050W-A) -743 YHL050W-A YHL050W-A YHL050W-A YHL050W-A protein_coding
SPT5_peak_445 XII 5739 5739 + 77.7621 NA promoter-TSS (YLL066W-A) protein_coding-promoter-TSS (YLL066W-A) 134 YLL066W-B YLL066W-B YLL066W-B YLL066W-B protein_coding
SPT5_peak_664 XVI 6457 6457 + 51.9798 NA promoter-TSS (YPL283C) protein_coding-promoter-TSS (YPL283C) -450 YPL283C YPL283C YPL283C YRF1-7 protein_coding
SPT5_peak_446 XII 11181 11181 + 51.3152 NA promoter-TSS (YLL065W) protein_coding-promoter-TSS (YLL065W) -545 YLL065W YLL065W YLL065W YLL065W protein_coding
SPT5_peak_331 VII 1084549 1084549 + 48.2078 NA promoter-TSS (YGR296W) protein_coding-promoter-TSS (YGR296W) -315 YGR296W YGR296W YGR296W YRF1-3 protein_coding