Welcome to the updated (v0.2) NAb-seq tutorial! This new version is written in Nextflow, a workflow language that enables portable and scalable computing, which should make it easy to use. You can find the GitHub page at https://github.com/kzeglinski/nabseq_nf
A number of changes have been made since the original version. We recommend checking the changelog for detailed information.
git clone https://github.com/kzeglinski/nabseq_nf.git
nextflow run nabseq_nf/main.nf
There is a nice guide on running nextflow pipelines for beginners here
You can see all of the parameters using:
nextflow run nabseq_nf/main.nf --help
Usage: nextflow run ./nabseq_nf/main.nf --fastq_dir [input path] --organism [organism name] --sample_sheet [sample sheet]
--help : prints this help message
Required arguments:
--out_dir : where the output files will be written to (default: "$projectDir/results")
--fastq_dir : where the input fastq files are located
--sample_sheet : location of the .csv sample sheet (format: barcode01,sample_x,rat,1)
Optional (only needed for advanced users)
--barcode_dirs : whether the input fastq files are located within folders named barcode01 etc (default: false)
--num_consensus : maximum number of consensus sequences to generate for each sample (default: 999)
--igblast_databases : location of the igblast databases (default: "$projectDir/references/igblast/")
--reference_sequences : location of the reference sequences (default: "$projectDir/references/reference_sequences/")
--trim_3p : pattern for cutadapt to trim off the 3' end (default: "A{20}N{90}")
--trim_5p : pattern for cutadapt to trim off the 5' end (default: "N{90}T{20}")
--medaka_model : model to use for medaka (depends on your basecalling model, default:"r941_min_sup_g507")
--report_title : title to use for naming the report (default: "NAb-seq report")
Most of the above are self-explanatory but:
Previously, the species/organism was specified as a command-line argument, but this has moved to the sample sheet (so you can analyse data from multiple different species at once). If you want to use the built-in mouse or rat references, then you don’t need to edit --igblast_databases or --reference_sequences.
--barcode_dirs can be used when your data is structured like reads/barcode01/{blah}.fastq, as opposed to the default structure NAb-seq expects, which is reads/barcode01{any_random_text}.fastq. See the sample sheet/file structure section for more information.
For the number of consensus sequences, the default is 999, which means that NAb-seq will generate up to 999 consensus sequences for the heavy chain and up to 999 consensus sequences for the light chain in a single cell. In practice there will never be this many, so the default effectively means "generate all possible consensus sequences for each chain in each cell" (note that consensus sequences can only be generated for groups that have a count of at least 3). You can set this number lower to save time, since consensus calling is the slowest part, but for hybridomas we recommend generating at least the top 20 or 30 consensus sequences, given their propensity for expressing multiple heavy and/or light chains.
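The interaction between --num_consensus and the minimum count of 3 can be illustrated with a small Python sketch. This is not NAb-seq's actual code: the function name select_groups, the clone names and the input format (a mapping from group name to read count) are all hypothetical, chosen only to show the filtering and capping logic described above.

```python
def select_groups(counts, num_consensus=999, min_count=3):
    """Return the groups that would get a consensus sequence, most abundant first.

    counts: hypothetical mapping of group name -> number of reads in that group.
    """
    # Groups with fewer than min_count reads cannot yield a consensus.
    eligible = [(name, n) for name, n in counts.items() if n >= min_count]
    # Most abundant groups first, then keep at most num_consensus of them.
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [name for name, _ in eligible[:num_consensus]]


# Illustrative counts for the heavy chain of one sample (made-up numbers).
heavy_counts = {"H_clone_a": 120, "H_clone_b": 15, "H_clone_c": 2}

print(select_groups(heavy_counts))                   # ['H_clone_a', 'H_clone_b']
print(select_groups(heavy_counts, num_consensus=1))  # ['H_clone_a']
```

With the default of 999 the cap never bites, so only the count threshold matters; lowering the value keeps just the most abundant groups.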
For the adapter trimming, trim_3p and trim_5p should be cutadapt patterns. You only need to change the default values if you would like to trim custom adapter sequences. By default, NAb-seq will trim off the polyA tail.
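The default patterns use a repeat shorthand in which a base followed by {n} stands for n copies of that base. To make the defaults easier to read, here is a minimal Python sketch that expands this shorthand. The helper expand_repeats is hypothetical and only handles the simple single-base form seen in the defaults; it is not part of NAb-seq or cutadapt.

```python
import re


def expand_repeats(pattern):
    """Expand 'X{n}' into n copies of X (simple single-base form only)."""
    return re.sub(
        r"([A-Za-z])\{(\d+)\}",
        lambda m: m.group(1) * int(m.group(2)),
        pattern,
    )


trim_3p = expand_repeats("A{20}N{90}")
print(len(trim_3p))   # 110
print(trim_3p[:25])   # AAAAAAAAAAAAAAAAAAAANNNNN
```

So the default 3' pattern is 20 A bases followed by 90 N positions (N being the IUPAC code for any base), and the 5' default N{90}T{20} is its mirror image.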
For the medaka model, you can see a list of all models here. Choose one that matches your pore version and basecalling model/guppy version. Note: don’t choose a model with ‘variant’ in the name.
The nextflow config file allows you to set NAb-seq’s parameters, as well as control other aspects of how nextflow runs this pipeline (for example, using SLURM or AWS).
The config file supplied with NAb-seq has been written to run on WEHI’s Milton HPC, and may not work on your system. You might want to try looking at the nf-core collection of institutional config files to see if they have one for your institute. Alternatively, you could try asking a nextflow expert at your workplace!
For more information about writing/editing nextflow config files, take a look at the official nextflow documentation.
As of v0.2.2, the format of NAb-seq’s sample sheet has changed. Sample sheets should now have the structure barcode,sample_name,species,report_group. For example:
barcode,sample_name,species,report_group
barcode16,28-21-4E10-2-1,rat,1
barcode17,28-21-5H11-5-1,rat,1
barcode18,15-14-7H9-19-1,mouse,2
barcode19,42-11-A4-D1-2-1,mouse,2
barcode20,9-13-IB4-1-1,rat,3
barcode21,25-19-5A6-1-1,rat,4
Where:
barcode is the nanopore barcode used.
sample_name is the name of the sample. I have tested that dashes and underscores work; other special characters may cause issues, so use them at your own risk.
species is the name of the species for that sample, either rat or mouse (case sensitive) if you want to use NAb-seq’s built-in references.
report_group is used to create separate reports for different groups of samples. In the example above, four reports would be created: one with the first two samples, one with the third and fourth samples, and one each for the last two samples. If you want all samples in the same report, just set every report_group to 1.
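As an illustration of how report_group splits samples into reports, the following Python sketch parses the example sheet above and groups sample names by report group. It is a hedged example for checking a sheet before a run, not the parser NAb-seq itself uses; the function name group_samples is hypothetical.

```python
import csv
import io
from collections import defaultdict

EXPECTED_HEADER = ["barcode", "sample_name", "species", "report_group"]


def group_samples(sheet_text):
    """Map each report_group to the list of sample names it contains."""
    reader = csv.reader(io.StringIO(sheet_text.strip()))
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    groups = defaultdict(list)
    for row in reader:
        if len(row) != 4:
            raise ValueError(f"expected 4 columns, got: {row}")
        barcode, sample_name, species, report_group = row
        groups[report_group].append(sample_name)
    return dict(groups)


sheet = """barcode,sample_name,species,report_group
barcode16,28-21-4E10-2-1,rat,1
barcode17,28-21-5H11-5-1,rat,1
barcode18,15-14-7H9-19-1,mouse,2
barcode19,42-11-A4-D1-2-1,mouse,2
barcode20,9-13-IB4-1-1,rat,3
barcode21,25-19-5A6-1-1,rat,4
"""

reports = group_samples(sheet)
print(len(reports))   # 4
print(reports["1"])   # ['28-21-4E10-2-1', '28-21-5H11-5-1']
```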
When looking for input, NAb-seq uses the sample sheet, the path provided to --fastq_dir and the switch --barcode_dirs to read in data from the following file structures:
Without the --barcode_dirs parameter (the default): NAb-seq will look in the --fastq_dir path and identify FASTQ files (ending in .fastq, .fq, .fq.gz or .fastq.gz) that start with the barcodes in the sample sheet (e.g. for the above example barcode16.fastq or barcode20_pass.fq.gz or barcode18_fail.fastq.gz). If there are multiple files for the same barcode, they will be concatenated, so it’s not recommended to leave fail reads in the directory, or they will be used in the analysis. However, any FASTQ files that don’t start with a barcode in the sample sheet (e.g. unclassified_pass.fastq) won’t be used, nor will any files that don’t have a FASTQ file extension.
With the --barcode_dirs parameter: NAb-seq will look in the --fastq_dir path for folders that have the names of the barcodes in the sample sheet (e.g. barcode16/ or barcode21/). Then, all FASTQ files in each of these folders (ending in .fastq, .fq, .fq.gz or .fastq.gz) will be concatenated. Note that file extensions can’t be mixed, so make sure the FASTQs are either all gzipped or all uncompressed, otherwise the concatenation won’t work.
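The two lookup modes can be sketched in Python as follows. This is only an approximation of the matching rules described above, useful for previewing which files would be picked up for a barcode. It is not how the pipeline is implemented, the function name find_fastqs is hypothetical, and it only lists files without concatenating them.

```python
import tempfile
from pathlib import Path

FASTQ_SUFFIXES = (".fastq", ".fq", ".fq.gz", ".fastq.gz")


def find_fastqs(fastq_dir, barcode, barcode_dirs=False):
    """List FASTQ file names that would be used for one barcode."""
    root = Path(fastq_dir)
    if barcode_dirs:
        # Folder layout: <fastq_dir>/<barcode>/<anything>.fastq
        folder = root / barcode
        candidates = folder.iterdir() if folder.is_dir() else []
    else:
        # Flat layout: <fastq_dir>/<barcode><anything>.fastq
        candidates = root.glob(f"{barcode}*")
    return sorted(
        p.name for p in candidates
        if p.is_file() and p.name.endswith(FASTQ_SUFFIXES)
    )


# Small demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["barcode16.fastq", "barcode16_fail.fastq",
                 "unclassified_pass.fastq", "barcode16.txt"]:
        (Path(tmp) / name).touch()
    print(find_fastqs(tmp, "barcode16"))
    # ['barcode16.fastq', 'barcode16_fail.fastq']
```

The demonstration shows why fail reads should be moved out of the directory: barcode16_fail.fastq starts with the barcode and has a FASTQ extension, so it is matched alongside the pass reads.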