Kallisto bustools Pipeline

Request compute resources on the MSC queue, ssh to the node and step through the Kallisto Bustools pipeline in an interactive container shell.

Everything you need to run the pipeline is available at the following directory on lugh:

data_dir="/data/MSc/2020/MA5112/scRNA-Seq"
  1. Reference file (indexed): ${data_dir}/reference/Homo_sapiens.cDNA.idx
  2. FASTQ read pair: ${data_dir}/reads/*_R{1,2}_001.fastq.gz
  3. Whitelist barcode file: ${data_dir}/assets/10xv3_whitelist.txt
  4. Container: ${data_dir}/container/scRNA.img
  5. tx2gene file: ${data_dir}/assets/tx2gene.txt

Skip to
  1. Indexing
  2. kallisto bus
  3. bustools correct
  4. bustools sort
  5. bustools text
  6. Process outputs

1. Indexing (DO NOT RUN)

Kallisto requires an indexed genome file for downstream quantification. This has been generated for you already and is available at /data/MSc/2020/MA5112/scRNA-Seq/reference/Homo_sapiens.cDNA.idx

Inputs
  • cDNA Reference file
kallisto index -i Homo_sapiens.cDNA.idx Homo_sapiens.GRCh38.cdna.all.fa
Outputs

Kallisto index produces an indexed genome file with the filename specified by -i


2. Kallisto bus

kallisto bus works with raw FASTQ files for single-cell RNA-Seq datasets. For each read the cell barcode, UMI information and the equivalence class resulting from pseudoalignment are stored in a BUS file output.bus stored in the output directory specified by -o, along with matrix.ec and transcripts.txt which store information about the equivalence classes and transcript names for downstream processing.

Run the following commands in your directory, and specify the path to the indexed transcriptome and fastq.gz files given at the top of the page.

Inputs
  • Genome Index file
  • FASTQ reads
kallisto bus \
         -i Homo_sapiens.cDNA.idx \
         -o bus_output/ \
         -x 10xv3 \
         -t 2 \
         pbmc_1k_protein_v3_gex_S1_L001_R1_001.fastq.gz pbmc_1k_protein_v3_gex_S1_L001_R2_001.fastq.gz

Flags
  • -i Genome Index file generated by kallisto index
  • -o Directory to write output to
  • -x Single-cell technology used (string)
  • -t n threads to use
Outputs

Kallisto bus generates an output directory with the following files:

  • matrix.ec
  • output.bus
  • run_info.json
  • transcripts.txt

3. Bustools correct

BUS files can be barcode error corrected with regards to a technology specific whitelist of barcodes. The bustools correct command will correct all barcodes that are at Hamming distance 1 (i.e. one substitution) away from a single barcode in the whitelist.

To run the bustools correct command, you will need the 10xv3_whitelist.txt file and the output.bus files generated by kallisto bus.

Inputs
  • Whitelisted barcodes file
  • BUS file generated by kallisto bus
bustools correct \
         -w 10xv3_whitelist.txt \
         bus_output/output.bus \
         -o output_corrected.bus
Flags
  • -o File for corrected bus output
  • -w File of whitelisted barcodes to correct against
Outputs

bustools correct outputs a corrected BUS file.


4. Bustools sort

Raw BUS output from kallisto bus pseudoalignment may be unsorted. To accelerate downstream processing BUS files can be sorted using bustools sort. This will create a new BUS file where the BUS records are sorted by barcode first, UMI second, and equivalence class third.

Before running the command, make a tmp/ directory in your directory. Run the following commands, referecing the output_corrected.bus file generated in the previous step.

Inputs
  • corrected BUS file generated by bustools correct
bustools sort \
         -T tmp/ \
         -t 2 \
         -o output_sort.bus \
         output_corrected.bus
Flags
  • -t Number of threads to use
  • -m Maximum memory used
  • -T Location for scratch files
  • -o Sorted output file
Outputs

Bustools sort outputs a sorted BUS file.


5. Bustools text

BUS files can be converted to a tab-separated format for easy inspection and processing using shell scripts or high level languages. We will be using an R script on the output file generated.

Run the following commands, referencing the sorted bus file generated in the previous step.

Inputs
  • Corrected + Sorted BUS file
  • Transcripts to Gene mapping file
bustools text \
         -o output_sort.txt \
         output.sort.bus
Flags
  • -o File for text output
Outputs

Bustools text produces a BUS file in text format.


6. Process Outputs

Exit the interactive container for this step. Copy the tx2gene.txt file to your directory where you ran the kallisto analysis: cp /data/MSc/2020/MA5112/scRNA-Seq/assets/tx2gene.txt .

Copy and paste this python script into your directory as a new file, and call it format_bustools.py. Open format_bustools.py, and edit line 8 to your directory:

#!/usr/bin/env python

gene_min = 200
gene_max = 10000

#setup working directory
import os
os.chdir("ADD_YOUR_PATH_HERE")  <- # change to os.chdir('/data/your_username/scRNA')

Before running the script, make a directory called seurat to store the results and run the script:

mkdir seurat
python format_bustools.py

The output directory should look like this:

--seurat
    |--barcodes.tsv  
    |--gene_hist.png  
    |--genes.tsv  
    |--matrix.mtx

There is an ENSEMBL gene ID in the genes.tsv file that is missing a gene symbol. This will create an error in R, we will manually add the gene symbol ourselves.

Check which ID has a blank 2nd column i.e no gene symbol:

awk 'NF!=2' genes.tsv
ENSG00000237235.2

Google the ENSEMBL ID, the gene ID is TRDD2. Now use awk to insert the gene in place of the empty cell. This is ok for one missing symbol, if you have multiple gene symbols missing you will have to come up with a better strategy for this

awk -F"\t" 'BEGIN {OFS="\t"}; $1 == "ENSG00000237235.2" {$2 = "TRDD2"};1' genes.tsv > tmp.tsv && rm genes.tsv && mv tmp.tsv genes.tsv