Request compute resources on the MSC queue, ssh to the node and step through the Kallisto Bustools
pipeline in an interactive container shell.
Everything you need to run the pipeline is available at the following directory on lugh:
data_dir="/data/MSc/2020/MA5112/scRNA-Seq"
${data_dir}/reference/Homo_sapiens.cDNA.idx
${data_dir}/reads/*_R{1,2}_001.fastq.gz
${data_dir}/assets/10xv3_whitelist.txt
${data_dir}/container/scRNA.img
${data_dir}/assets/tx2gene.txt
Kallisto requires an indexed genome file for downstream quantification. This has been generated for you already and is available at /data/MSc/2020/MA5112/scRNA-Seq/reference/Homo_sapiens.cDNA.idx
kallisto index -i Homo_sapiens.cDNA.idx Homo_sapiens.GRCh38.cdna.all.fa
Kallisto index produces an indexed genome file with the filename specified by -i
kallisto bus
works with raw FASTQ files for single-cell RNA-Seq datasets. For each read the cell barcode, UMI information and the equivalence class resulting from pseudoalignment are stored in a BUS file output.bus
stored in the output directory specified by -o
, along with matrix.ec
and transcripts.txt
which store information about the equivalence classes and transcript names for downstream processing.
Run the following commands in your directory, and specify the path to the indexed transcriptome and fastq.gz files given at the top of the page.
kallisto bus \
-i Homo_sapiens.cDNA.idx \
-o bus_output/ \
-x 10xv3 \
-t 2 \
pbmc_1k_protein_v3_gex_S1_L001_R1_001.fastq.gz pbmc_1k_protein_v3_gex_S1_L001_R2_001.fastq.gz
-i
Genome Index file generated by kallisto index
-o
Directory to write output to-x
Single-cell technology used (string)-t
n threads to useKallisto bus
generates an output directory with the following files:
matrix.ec
output.bus
run_info.json
transcripts.txt
BUS files can be barcode error corrected with regards to a technology specific whitelist of barcodes. The bustools correct
command will correct all barcodes that are at Hamming distance 1 (i.e. one substitution) away from a single barcode in the whitelist.
To run the bustools correct command, you will need the 10xv3_whitelist.txt
file and the output.bus
files generated by kallisto bus
.
bus
bustools correct \
-w 10xv3_whitelist.txt \
bus_output/output.bus \
-o output_corrected.bus
-o
File for corrected bus output-w
File of whitelisted barcodes to correct againstbustools correct
outputs a corrected BUS file.
Raw BUS output from kallisto bus
pseudoalignment may be unsorted. To accelerate downstream processing BUS files can be sorted using bustools sort
. This will create a new BUS file where the BUS records are sorted by barcode first, UMI second, and equivalence class third.
Before running the command, make a tmp/
directory in your directory. Run the following commands, referecing the output_corrected.bus
file generated in the previous step.
bustools correct
bustools sort \
-T tmp/ \
-t 2 \
-o output_sort.bus \
output_corrected.bus
-t
Number of threads to use-m
Maximum memory used-T
Location for scratch files-o
Sorted output fileBustools sort
outputs a sorted BUS file.
BUS files can be converted to a tab-separated format for easy inspection and processing using shell scripts or high level languages. We will be using an R script on the output file generated.
Run the following commands, referencing the sorted bus file generated in the previous step.
bustools text \
-o output_sort.txt \
output.sort.bus
-o
File for text outputBustools text
produces a BUS file in text format.
Exit the interactive container for this step. Copy the tx2gene.txt
file to your directory where you ran the kallisto
analysis: cp /data/MSc/2020/MA5112/scRNA-Seq/assets/tx2gene.txt .
Copy and paste this python script into your directory as a new file, and call it format_bustools.py
. Open format_bustools.py
, and edit line 8 to your directory:
#!/usr/bin/env python
gene_min = 200
gene_max = 10000
#setup working directory
import os
os.chdir("ADD_YOUR_PATH_HERE") <- # change to os.chdir('/data/your_username/scRNA')
Before running the script, make a directory called seurat
to store the results and run the script:
mkdir seurat
python format_bustools.py
The output directory should look like this:
--seurat
|--barcodes.tsv
|--gene_hist.png
|--genes.tsv
|--matrix.mtx
There is an ENSEMBL gene ID in the genes.tsv
file that is missing a gene symbol. This will create an error in R
, we will manually add the gene symbol ourselves.
Check which ID has a blank 2nd column i.e no gene symbol:
awk 'NF!=2' genes.tsv
ENSG00000237235.2
Google the ENSEMBL ID, the gene ID is TRDD2
. Now use awk
to insert the gene in place of the empty cell. This is ok for one missing symbol, if you have multiple gene symbols missing you will have to come up with a better strategy for this
awk -F"\t" 'BEGIN {OFS="\t"}; $1 == "ENSG00000237235.2" {$2 = "TRDD2"};1' genes.tsv > tmp.tsv && rm genes.tsv && mv tmp.tsv genes.tsv