NAb-seq bioinformatics tutorial (updated)

Introduction

Welcome to the updated (v0.2) NAb-seq tutorial! This new version is written in nextflow, a workflow language that enables portable and scalable computing, so it should be easy to run on most systems. You can find the GitHub page at https://github.com/kzeglinski/nabseq_nf

A number of changes have been made since the original version. It is recommended that you check out the changelog for detailed information.

Running the pipeline

Clone the repo (git clone https://github.com/kzeglinski/nabseq_nf.git).
Edit the nextflow config file to suit your environment.
Make sure you have nextflow installed on your computer.
Run the pipeline using nextflow run nabseq_nf/main.nf
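
Before working through the steps above, it can help to confirm that the tools they rely on are actually on your PATH. The following is a minimal shell sketch; missing_tools is a hypothetical helper name used for illustration and is not part of NAb-seq.

```shell
# Hypothetical helper (not part of NAb-seq): print the name of each
# requested tool that cannot be found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Prints nothing if both tools are installed, otherwise lists what is missing.
missing_tools git nextflow
```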

There is a nice guide on running nextflow pipelines for beginners here

Parameters

You can see all of the parameters using nextflow run nabseq_nf/main.nf --help:

Usage: nextflow run ./nabseq_nf/main.nf --fastq_dir [input path] --organism [organism name] --sample_sheet [sample sheet]
--help                : prints this help message

Required arguments:
--out_dir             : where the output files will be written to (default: "$projectDir/results")
--fastq_dir           : where the input fastq files are located
--sample_sheet        : location of the .csv sample sheet (format: barcode01,sample_x,rat,1)

Optional (only needed for advanced users)
--barcode_dirs        : whether the input fastq files are located within folders named barcode01 etc (default: false)
--num_consensus       : maximum number of consensus sequences to generate for each sample (default: 999)
--igblast_databases   : location of the igblast databases (default: "$projectDir/references/igblast/")
--reference_sequences : location of the reference sequences (default: "$projectDir/references/reference_sequences/")
--trim_3p             : pattern for cutadapt to trim off the 3' end (default: "A{20}N{90}")
--trim_5p             : pattern for cutadapt to trim off the 5' end (default: "N{90}T{20}")
--medaka_model        : model to use for medaka (depends on your basecalling model, default:"r941_min_sup_g507")
--report_title        : title to use for naming the report (default: "NAb-seq report")

Most of the above are self-explanatory, but a few need more detail:

Previously, the species/organism was specified as a command-line argument, but this has moved to the sample sheet (so you can analyse data from multiple different species at once). If you want to use the built-in mouse or rat references, then you don’t need to edit --igblast_databases or --reference_sequences.
--barcode_dirs can be used when your data is structured like reads/barcode01/{blah}.fastq, rather than the layout NAb-seq expects by default, which is reads/barcode01{any_random_text}.fastq. See the sample sheet/file structure section for more information.
For the number of consensus sequences, the default is 999, which means that NAb-seq will generate up to 999 consensus sequences for the heavy chain and up to 999 consensus sequences for the light chain in a single cell. In practice, there will never be this many, so the default effectively means "generate every possible consensus sequence for each chain in each cell" (note that a consensus sequence can only be generated for groups that have a count of at least 3). You can set this number lower to save time (consensus calling is the slowest part), but for hybridomas we recommend that you generate at least the top 20 or 30 consensus sequences, given their propensity for expressing multiple heavy and/or light chains.
For the adapter trimming, --trim_3p and --trim_5p should be cutadapt patterns. You only need to change the default values if you would like to trim custom adapter sequences. By default, NAb-seq will trim off the polyA tail.
For the medaka model, you can see a list of all models here. Choose one that matches your pore type (e.g. r941) and basecalling model/guppy version. Note: don’t choose a model with ‘variant’ in the name.
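
To show how these parameters fit together on one command line, here is a minimal shell sketch. run_nabseq and the DRY_RUN variable are hypothetical names introduced for illustration, the paths are placeholders, and the value 30 for --num_consensus simply follows the hybridoma suggestion above. All flags are taken from the help text.

```shell
# Hypothetical wrapper (not part of NAb-seq): assemble a command line from
# flags listed in the help text. With DRY_RUN=1 the command is only printed.
run_nabseq() {
    fastq_dir="$1"
    sample_sheet="$2"
    out_dir="$3"
    # --num_consensus 30: top 30 consensus sequences per chain
    # --medaka_model: the documented default; match it to your own basecalling
    set -- nextflow run nabseq_nf/main.nf \
        --fastq_dir "$fastq_dir" \
        --sample_sheet "$sample_sheet" \
        --out_dir "$out_dir" \
        --num_consensus 30 \
        --medaka_model r941_min_sup_g507
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Print the command without running it (placeholder paths):
DRY_RUN=1 run_nabseq reads/ sample_sheet.csv results/
```

Printing the command first makes it easy to check the flags before committing to a long run; unset DRY_RUN to launch the pipeline for real.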

Nextflow config files

The nextflow config file allows you to set NAb-seq’s parameters, as well as control other aspects of how nextflow runs this pipeline (for example, using SLURM or AWS).

The config file supplied with NAb-seq has been written to run on WEHI’s Milton HPC, and may not work on your system. You might want to try looking at the nf-core collection of institutional config files to see if they have one for your institute. Alternatively, you could try asking a nextflow expert at your workplace!

For more information about writing/editing nextflow config files, take a look at the official nextflow documentation.
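
As a rough illustration of what a custom config might contain, the shell sketch below writes a small config file. write_custom_config and the file name my_site.config are hypothetical, the paths are placeholders, and the 'slurm' executor and 'regular' queue are assumptions you would replace with whatever your own system uses. The params names are the command-line flags without the leading dashes.

```shell
# Hypothetical helper (not part of NAb-seq): write a small custom
# nextflow config to the path given as the first argument.
write_custom_config() {
    cat > "$1" <<'EOF'
// Pipeline parameters: same names as the command-line flags, without "--"
params {
    fastq_dir    = 'reads/'
    sample_sheet = 'sample_sheet.csv'
    out_dir      = 'results/'
}

// How nextflow runs each step; executor and queue are placeholders
process {
    executor = 'slurm'
    queue    = 'regular'
}
EOF
}

write_custom_config my_site.config
# The file can then be supplied with nextflow's -c option:
# nextflow run nabseq_nf/main.nf -c my_site.config
```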

Sample sheet & file structure

As of v0.2.2, the format of NAb-seq’s sample sheet has changed. Sample sheets should now have the structure barcode,sample_name,species,report_group. For example:

barcode,sample_name,species,report_group
barcode16,28-21-4E10-2-1,rat,1
barcode17,28-21-5H11-5-1,rat,1
barcode18,15-14-7H9-19-1,mouse,2
barcode19,42-11-A4-D1-2-1,mouse,2
barcode20,9-13-IB4-1-1,rat,3
barcode21,25-19-5A6-1-1,rat,4

Where:

barcode is the nanopore barcode used.
sample_name is the name of the sample. I have tested that dashes and underscores work OK; other special characters may cause issues, so use them at your own risk.
species is the name of the species for that sample, either rat or mouse (case sensitive) if you want to use NAb-seq’s built-in references.
report_group is used to create separate reports for different groups of samples. In the example above, there would be four reports created, one with the first two samples, one with the third and fourth samples and then the last two samples each in their own report. If you want all samples in the same report, just set all to 1.
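
A quick check of a sample sheet before launching the pipeline can catch formatting slips early. The sketch below is not part of NAb-seq: check_sample_sheet and count_report_groups are hypothetical helpers, and they assume the four-column layout with a header row shown in the example above.

```shell
# Hypothetical helpers (not part of NAb-seq), assuming a header row
# followed by barcode,sample_name,species,report_group lines.

# Fail on structural problems; only warn about species that have no
# built-in reference (those need custom references supplied).
check_sample_sheet() {
    awk -F, '
        NR == 1 {
            if ($0 != "barcode,sample_name,species,report_group") {
                print "unexpected header: " $0
                bad = 1
            }
            next
        }
        NF == 0 { next }    # ignore blank lines
        NF != 4 {
            print "line " NR ": expected 4 fields, found " NF
            bad = 1
            next
        }
        $3 != "rat" && $3 != "mouse" {
            print "line " NR ": species " $3 " has no built-in reference"
        }
        END { exit bad + 0 }
    ' "$1"
}

# Count the distinct report_group values, i.e. how many reports to expect.
count_report_groups() {
    awk -F, 'NR > 1 && NF == 4 { seen[$4] = 1 }
             END { n = 0; for (g in seen) n++; print n }' "$1"
}
```

Run as check_sample_sheet sample_sheet.csv followed by count_report_groups sample_sheet.csv; for the example sheet above, the second command prints 4.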

When looking for input, NAb-seq uses the sample sheet, the path provided to --fastq_dir and the switch --barcode_dirs to read in data from the following file structures:

Default behaviour: NAb-seq will look in the --fastq_dir path and identify FASTQ files (ending in .fastq, .fq, .fq.gz or .fastq.gz) that start with the barcodes in the sample sheet (e.g. for the above example, barcode16.fastq, barcode20_pass.fq.gz or barcode18_fail.fastq.gz). If there are multiple files for the same barcode, they will be concatenated, so don’t leave fail reads in the directory unless you want them used in the analysis. FASTQ files that don’t start with a barcode in the sample sheet (e.g. unclassified_pass.fastq) will be ignored, as will any files that don’t have a FASTQ file extension.
Using the --barcode_dirs parameter: NAb-seq will look in the --fastq_dir path for folders that have the names of the barcodes in the sample sheet (e.g. barcode16/ or barcode21/). Then, all FASTQ files in each of these folders (ending in .fastq, .fq, .fq.gz or .fastq.gz) will be concatenated. Note that file extensions can’t be mixed: make sure the FASTQs are either all gzipped or all uncompressed, otherwise the concatenation won’t work.
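
To preview which files each layout would pick up, the two lookups described above can be imitated with find. This is only a sketch of the described behaviour, not NAb-seq's own code; the function names are hypothetical, and -maxdepth assumes GNU or BSD find.

```shell
# Hypothetical helpers (not NAb-seq's own code) imitating the two lookups.
# Arguments: $1 = the --fastq_dir path, $2 = a barcode from the sample sheet.

# Default layout: files directly in fastq_dir whose names start with the barcode.
list_fastqs_default() {
    find "$1" -maxdepth 1 -type f \
        \( -name "$2*.fastq" -o -name "$2*.fq" \
           -o -name "$2*.fastq.gz" -o -name "$2*.fq.gz" \) | LC_ALL=C sort
}

# --barcode_dirs layout: every FASTQ file inside fastq_dir/<barcode>/.
list_fastqs_barcode_dirs() {
    find "$1/$2" -maxdepth 1 -type f \
        \( -name '*.fastq' -o -name '*.fq' \
           -o -name '*.fastq.gz' -o -name '*.fq.gz' \) | LC_ALL=C sort
}

# Succeeds if the file list on stdin mixes gzipped and uncompressed FASTQs.
has_mixed_compression() {
    awk '/\.gz$/ { gz = 1; next } { plain = 1 } END { exit !(gz && plain) }'
}
```

For example, list_fastqs_barcode_dirs reads barcode16 | has_mixed_compression && echo "barcode16 mixes gzipped and uncompressed files" would flag a folder that needs tidying before a run.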