CoPTR Tutorial¶
This page contains a worked example for using CoPTR. It takes you through the steps in the Quick Start. The data used in this example is available here:
https://dl.dropboxusercontent.com/s/wrsxjimr6l96lcq/example-data.tar.gz
Contents
Building the reference database¶
The first step for running CoPTR is to create a reference database from fasta files. CoPTR is designed to run on both complete reference genomes and draft assemblies. Each genome should be in its own fasta file. A complete genome has a single sequence, while assemblies have multiple sequences—one for each contig.
Precomputed reference databases can be downloaded here: Reference Databases
If you want to construct your own reference database, please use the guidelines below.
Cluster genomes at 95% ANI¶
CoPTR estimates PTRs at the species level: estimates among strains within 95% ANI are consistent. This means you should cluster genomes by 95% ANI, an operational defintion of a species, and select one representative genome per species.
Select only high-quality assemblies or complete genomes¶
Metagenomic assembly is a challenging problem, and assemblies are not guaranteed to be correct. We set some basic criteria for using an assembly:
It is estimated to be >90% complete
It is estimated to have <5% contamination
The first two criteria are similar to the MIMAG criteria for high-quality assemblies.
When choosing between an assembly, and complete reference genome, preference should always be given to the complete genome.
Indexing the reference database¶
On the example data:
$ coptr index example-data/ref-db example-data/ref-db/example-index
[INFO] [Jun 28, 2021 15:46:57] [coptr.read_mapper] Found 2 files totaling 0.00611 GB.
[INFO] [Jun 28, 2021 15:46:57] [coptr.read_mapper] Copying FASTA files to coptr-fna-2021-06-28T19:46:57+00:00.fna with prepended genome ids (filenames).
[INFO] [Jun 28, 2021 15:46:57] [coptr.read_mapper] Writing 2 reference genome ids to example-data/ref-db/example-index.genomes.
[INFO] [Jun 28, 2021 15:46:57] [coptr.read_mapper] bowtie2-build coptr-fna-2021-06-28T19:46:57+00:00.fna example-data/ref-db/example-index --threads 1
...bowtie2 output...
[INFO] [Jun 28, 2021 15:47:00] [coptr.read_mapper] Indexed 2 FASTA files for the reference database.
[INFO] [Jun 28, 2021 15:47:00] [coptr.read_mapper] Cleaning up coptr-fna-2021-06-28T19:46:57+00:00.fna.
CoPTR provides a few utilities to make indexing the database with bowtie2 easier. It assumes the reference database is provided by a folder of fasta files, one per reference genome. Reference genomes are combined in a single file, with the filename prepended to each sequencing id. This is how CoPTR keeps track of contigs from the same assembly.
Indexing a database of fasta files can be accomplished by calling
coptr index
:
usage: coptr index [-h] [--bt2-bmax BT2_BMAX] [--bt2-dcv BT2_DCV] [--bt2-threads BT2_THREADS] [--bt2-packed] ref-fasta index-out
positional arguments:
ref_fasta File or folder containing fasta to index. If a folder,
the extension for each fasta must be one of [.fasta,
.fna, .fa]
index_out Filepath to store index.
optional arguments:
-h, --help show this help message and exit
--bt2-bmax BT2_BMAX Set the --bmax arguement for bowtie2-build. Used to
control memory useage.
--bt2-dcv BT2_DCV Set the --dcv argument for bowtie2-build. Used to
control memory usage.
--bt2-threads BT2_THREADS
Number of threads to pass to bowtie2-build.
--bt2-packed Set the --packed flag for bowtie2-build. Used to
control memory usage.
It takes two arguments. The first ref-fasta
is either a folder containing
fasta for the database. If it is a folder, CoPTR will scan the folder for
all files ending in .fasta
, .fna
, or .fa
. CoPTR will combine these
into a single fasta file, prepending the filename to each sequence id in order
to track contigs from the same reference genome. It then calls the bowtie2-build
to index this file.
coPTR additionally outputs an index_name.genomes
file with a list of ids for each
reference genome.
Notes on memory usage¶
Sometimes the database is too large for a single index. You can split the
reference database into multiple parts, and index each separately. After
read mapping, the resulting BAM files can be merged with `coptr merge`
.
Mapping reads¶
On the example data:
$ coptr map example-data/ref-db/example-index example-data/fastq example-data/bam
[INFO] [Jun 28, 2021 15:48:26] [coptr.read_mapper] Mapping example-data/fastq to example-data/bam/ERR969281.sam
[INFO] [Jun 28, 2021 15:48:26] [coptr.read_mapper] bowtie2 -x example-data/ref-db/example-index example-data/fastq/ERR969281.fastq.gz --no-unal -p 1 -k 10
10818 reads; of these:
10818 (100.00%) were unpaired; of these:
4071 (37.63%) aligned 0 times
6709 (62.02%) aligned exactly 1 time
38 (0.35%) aligned >1 times
62.37% overall alignment rate
[INFO] [Jun 28, 2021 15:48:27] [coptr.read_mapper] Converting example-data/bam/ERR969281.sam to example-data/bam/ERR969281.bam.
[INFO] [Jun 28, 2021 15:48:27] [coptr.read_mapper] Cleaning up example-data/bam/ERR969281.sam.
....
[INFO] [Jun 28, 2021 15:48:40] [coptr.read_mapper] Converting example-data/bam/ERR969430.sam to example-data/bam/ERR969430.bam.
[INFO] [Jun 28, 2021 15:48:40] [coptr.read_mapper] Cleaning up example-data/bam/ERR969430.sam.
Once you have indexed a reference database. You can then map reads against the database. CoPTR provides a wrapper around bowtie2 to make read mapping convenient:
usage: coptr map [-h] [--threads INT] [--paired] index input out-folder
positional arguments:
index Name of database index.
input File or folder containing fastq reads to map. If a
folder, the extension for each fastq must be one of
[.fastq, .fq, .fastq.gz, fq.gz]
out_folder Folder to save mapped reads. BAM files are output here.
optional arguments:
-h, --help show this help message and exit
--paired Set for paired end reads. Assumes fastq files end in _1.*
and _2.*
--threads THREADS Number of threads for bowtie2 mapping.
--bt2-k BT2_K (Default 10). Number of alignments to report. Passed to -k flag of
bowtie2.
The name of the database index corresponds to the name used from coptr index
.
The input can either be a single fastq file, or a folder of fastq files to map.
It also takes an optional --threads
argument that allows bowtie2 to use
multiple threads. Reads are output as bam
files to save space.
For paired-end sequencing, we recommand mapping reads from end only (e.g. the files ending in either _1.* or _2.*).
Merging mapped reads from multiple indexes¶
For large reference databases, it is sometimes necessary to create several
indexes for subsets of the data and map reads against each index. Results
from each index need to be merged to select reads with the best MAPQ across
indexes. You can use `coptr merge`
to merge multiple bam files.
usage: coptr merge [-h] in-bam1 in-bam2 ... in-bamN out-bam
positional arguments:
in-bams A space separateed list of BAM files to merge. Assumes same
reads were mapped against different indexes.
out-bam Path to merged BAM.
optional arguments:
-h, --help show this help message and exit
Computing coverage maps¶
On the example data:
$ coptr extract example-data/bam example-data/coverage-maps
[INFO] [Jun 28, 2021 15:49:48] [coptr.bam_processor] Processing example-data/bam/ERR969281.bam.
[INFO] [Jun 28, 2021 15:49:48] [coptr.bam_processor] Determining reference genomes.
[INFO] [Jun 28, 2021 15:49:48] [coptr.bam_processor] Collecting multi-mapped reads.
[INFO] [Jun 28, 2021 15:49:48] [coptr.bam_processor] Grouping reads by reference genome.
...
[INFO] [Jun 28, 2021 15:50:00] [coptr.cli] Found 190 reference sequences corresponding to 2 genomes.
Once reads have been mapped, the next step is to compute the coverage along each reference genome. In this step, starting positions of each read are extracted from each bam file, and reads from different contigs of the same assembly are collected.
usage: usage: coptr extract [-h] [--ref-genome-regex REF_GENOME_REGEX] [--check-regex]
in-folder out-folder
positional arguments:
in_folder Folder with BAM files.
out_folder Folder to store coverage maps.
optional arguments:
-h, --help show this help message and exit
--ref-genome-regex REF_GENOME_REGEX
Regular expression extracting a reference genome id
from the sequence id in a bam file.
--check-regex Check the regular expression by counting reference
genomes without processing
The important argument here is the --ref-genome-regex
. This is a regular
expression that extracts the reference genome id from a sequence id. The default
argument will work with the index created by `coptr index`
, and works by
prepending the name of the fasta file, and special character `|`
to each
sequence id.
Using other BAM files¶
If you already have BAM files that were not computed with CoPTR, you will need
to set the --ref-genome-regex
flag. This flag is a regular expression that
extracts a genome id from a sequence id in a fasta file. It is used to group
contigs together. The default argument [^\|]+
matches all characters up
to the first |
, and uses them as a genome id.
You can check your regular expression using the --check-regex
flag, which
skips the extract step and instead outputs a list of all genome ids.
Estimating PTRs¶
On the example data:
$ coptr estimate example-data/coverage-maps out --min-reads 2500
[INFO] [Jun 28, 2021 15:50:15] [coptr.cli] Grouping reads by reference genome.
[INFO] [Jun 28, 2021 15:50:15] [coptr.cli] Saving to example-data/coverage-maps/coverage-maps-genome:
[INFO] [Jun 28, 2021 15:50:15] [coptr.cli] ERR969281.cm.pkl
[INFO] [Jun 28, 2021 15:50:15] [coptr.cli] ERR969282.cm.pkl
[INFO] [Jun 28, 2021 15:50:15] [coptr.cli] ERR969283.cm.pkl
[INFO] [Jun 28, 2021 15:50:15] [coptr.cli] ERR969285.cm.pkl
[INFO] [Jun 28, 2021 15:50:15] [coptr.cli] ERR969286.cm.pkl
[INFO] [Jun 28, 2021 15:50:15] [coptr.cli] ERR969428.cm.pkl
[INFO] [Jun 28, 2021 15:50:15] [coptr.cli] ERR969429.cm.pkl
[INFO] [Jun 28, 2021 15:50:15] [coptr.cli] ERR969430.cm.pkl
[INFO] [Jun 28, 2021 15:50:15] [coptr.cli] Grouping by reference genome: Complete.
[INFO] [Jun 28, 2021 15:50:15] [coptr.cli] The --restart flag can be used to start from here.
[INFO] [Jun 28, 2021 15:50:15] [coptr.coptr_ref] Checking reference genomes.
[INFO] [Jun 28, 2021 15:50:24] [coptr.coptr_ref] Running l-gasseri-ref.
[INFO] [Jun 28, 2021 15:50:25] [coptr.coptr_ref] Finished l-gasseri-ref.
[INFO] [Jun 28, 2021 15:50:25] [coptr.coptr_contig] Checking reference genomes.
[INFO] [Jun 28, 2021 15:50:25] [coptr.coptr_contig] Running e-coli-mag.
[INFO] [Jun 28, 2021 15:50:25] [coptr.coptr_contig] Finished e-coli-mag.
[INFO] [Jun 28, 2021 15:50:25] [coptr.cli] Writing out.csv.
[INFO] [Jun 28, 2021 15:50:25] [coptr.cli] Done.
[INFO] [Jun 28, 2021 15:50:25] [coptr.cli] You may now remove the folder example-data/coverage-maps/coverage-maps-genome.
The final stage is to estimate PTR ratios from coverage maps. This is accomplished
with the estimate
command. It is strongly recommended that you perform this step
on all samples at once.
usage: usage: coptr estimate [-h] [--min-reads MIN_READS] [--min-cov MIN_COV] [--min-samples MIN_SAMPLES] [--threads THREADS] [--plot OUTFOLDER] [--restart] coverage-map-folder out-file
positional arguments:
coverage_map_folder Folder with coverage maps computed from 'extract'.
out_file Filename to store PTR table.
optional arguments:
-h, --help show this help message and exit
--min-reads MIN_READS
Minimum number of reads required to compute a PTR (default 5000).
--min-cov MIN_COV Fraction of nonzero bins required to compute a PTR (default 0.75).
--min-samples MIN_SAMPLES
CoPTRContig only. Minimum number of samples required to reorder bins (default 5).
--plot OUTFOLDER Plot model fit for each PTR.
--restart Restarts the estimation step using the genomes in the coverage-maps-genome folder.
This combines all coverage maps by species, then estimates PTRs for each species. We have tried to set sensible default parameters for PTR estimatation. We set the minimum number of reads for the example data to 2500 in order to keep the size of the example data small, but the default of 5000 reads is recommended.
The output is a CSV file where, the rows are reference genomes, and the columns are samples. Each entry is the estimated log2 PTR.