Kallisto bustools Pipeline

Request compute resources on the MSC queue, ssh to the node and step through the Kallisto Bustools pipeline in an interactive container shell.

Everything you need to run the pipeline is available at the following directory on lugh:

data_dir="/data/MSc/2020/MA5112/scRNA-Seq"

Reference file (indexed): ${data_dir}/reference/Homo_sapiens.cDNA.idx
FASTQ read pair: ${data_dir}/reads/*_R{1,2}_001.fastq.gz
Whitelist barcode file: ${data_dir}/assets/10xv3_whitelist.txt
Container: ${data_dir}/container/scRNA.img
tx2gene file: ${data_dir}/assets/tx2gene.txt

Indexing
kallisto bus
bustools correct
bustools sort
bustools text
Process outputs

1. Indexing (DO NOT RUN)

Kallisto requires an indexed genome file for downstream quantification. This has been generated for you already and is available at /data/MSc/2020/MA5112/scRNA-Seq/reference/Homo_sapiens.cDNA.idx

Inputs

cDNA Reference file

kallisto index -i Homo_sapiens.cDNA.idx Homo_sapiens.GRCh38.cdna.all.fa

Outputs

Kallisto index produces an indexed genome file with the filename specified by -i

2. Kallisto bus

kallisto bus works with raw FASTQ files for single-cell RNA-Seq datasets. For each read the cell barcode, UMI information and the equivalence class resulting from pseudoalignment are stored in a BUS file output.bus stored in the output directory specified by -o, along with matrix.ec and transcripts.txt which store information about the equivalence classes and transcript names for downstream processing.

Run the following commands in your directory, and specify the path to the indexed transcriptome and fastq.gz files given at the top of the page.

Inputs

Genome Index file
FASTQ reads

kallisto bus \
         -i Homo_sapiens.cDNA.idx \
         -o bus_output/ \
         -x 10xv3 \
         -t 2 \
         pbmc_1k_protein_v3_gex_S1_L001_R1_001.fastq.gz pbmc_1k_protein_v3_gex_S1_L001_R2_001.fastq.gz

Flags

-i Genome Index file generated by kallisto index
-o Directory to write output to
-x Single-cell technology used (string)
-t n threads to use

Outputs

Kallisto bus generates an output directory with the following files:

matrix.ec
output.bus
run_info.json
transcripts.txt

3. Bustools correct

BUS files can be barcode error corrected with regards to a technology specific whitelist of barcodes. The bustools correct command will correct all barcodes that are at Hamming distance 1 (i.e. one substitution) away from a single barcode in the whitelist.

To run the bustools correct command, you will need the 10xv3_whitelist.txt file and the output.bus files generated by kallisto bus.

Inputs

Whitelisted barcodes file
BUS file generated by kallisto bus

bustools correct \
         -w 10xv3_whitelist.txt \
         bus_output/output.bus \
         -o output_corrected.bus

Flags

-o File for corrected bus output
-w File of whitelisted barcodes to correct against

Outputs

bustools correct outputs a corrected BUS file.

4. Bustools sort

Raw BUS output from kallisto bus pseudoalignment may be unsorted. To accelerate downstream processing BUS files can be sorted using bustools sort. This will create a new BUS file where the BUS records are sorted by barcode first, UMI second, and equivalence class third.

Before running the command, make a tmp/ directory in your directory. Run the following commands, referecing the output_corrected.bus file generated in the previous step.

Inputs

corrected BUS file generated by bustools correct

bustools sort \
         -T tmp/ \
         -t 2 \
         -o output_sort.bus \
         output_corrected.bus

Flags

-t Number of threads to use
-m Maximum memory used
-T Location for scratch files
-o Sorted output file

Outputs

Bustools sort outputs a sorted BUS file.

5. Bustools text

BUS files can be converted to a tab-separated format for easy inspection and processing using shell scripts or high level languages. We will be using an R script on the output file generated.

Run the following commands, referencing the sorted bus file generated in the previous step.

Inputs

Corrected + Sorted BUS file
Transcripts to Gene mapping file

bustools text \
         -o output_sort.txt \
         output.sort.bus

Flags

-o File for text output

Outputs

Bustools text produces a BUS file in text format.

6. Process Outputs

Exit the interactive container for this step. Copy the tx2gene.txt file to your directory where you ran the kallisto analysis: cp /data/MSc/2020/MA5112/scRNA-Seq/assets/tx2gene.txt .

Copy and paste this python script into your directory as a new file, and call it format_bustools.py. Open format_bustools.py, and edit line 8 to your directory:

#!/usr/bin/env python

gene_min = 200
gene_max = 10000

#setup working directory
import os
os.chdir("ADD_YOUR_PATH_HERE")  <- # change to os.chdir('/data/your_username/scRNA')

Before running the script, make a directory called seurat to store the results and run the script:

mkdir seurat
python format_bustools.py

The output directory should look like this:

--seurat
    |--barcodes.tsv  
    |--gene_hist.png  
    |--genes.tsv  
    |--matrix.mtx

There is an ENSEMBL gene ID in the genes.tsv file that is missing a gene symbol. This will create an error in R, we will manually add the gene symbol ourselves.

Check which ID has a blank 2nd column i.e no gene symbol:

awk 'NF!=2' genes.tsv
ENSG00000237235.2

Google the ENSEMBL ID, the gene ID is TRDD2. Now use awk to insert the gene in place of the empty cell. This is ok for one missing symbol, if you have multiple gene symbols missing you will have to come up with a better strategy for this

awk -F"\t" 'BEGIN {OFS="\t"}; $1 == "ENSG00000237235.2" {$2 = "TRDD2"};1' genes.tsv > tmp.tsv && rm genes.tsv && mv tmp.tsv genes.tsv

Genomic Data Analysis II

Kallisto bustools Pipeline

Skip to

1. Indexing (DO NOT RUN)

Inputs

Outputs

2. Kallisto bus

Inputs

Flags

Outputs

3. Bustools correct

Inputs

Flags

Outputs

4. Bustools sort

Inputs

Flags

Outputs

5. Bustools text

Inputs

Flags

Outputs

6. Process Outputs