Modules¶
bam_processor.py¶
Extract coordinates along each reference sequence and group reference sequences by genome.
-
class
src.coptr.bam_processor.
BamProcessor
(regex='[^\\|]+', is_bam=True)¶ Extract coordinates from a bam file.
- Parameters
regex (str) – A regular expression that matches a reference genome id from a reference sequence name.
Notes
For assemblies, the reference database will contain several contigs per assembly, each with a reference sequence name. We need to map a reference sequence name to the assembly is came from. This is accomplished with a regular expression. The expression matches a reference genome id.
For example, consider an assembly with contigs named
ERS235517|65|k99_317
ERS235517|65|k99_655
ERS235517|65|k99_1708
ERS235517|65 gives the reference genome, while the k99_* identify each contig. We want a regular expression that matches ERS235517|65. The regular expression
\w+\|\d+
matches 1 or more letter, number, or underscore (
\w+
), followed by the|
character, followed by 1 or more number (\d+
). This extracts the right reference genome id.-
assign_multimapped_reads
(read_container, lengths, max_alignments)¶ Assign multiple mapped reads to a single genome.
- Parameters
reads (dict[str] -> Read) – A dictionary whose key is the read query_id and value is a Read object
- Returns
read_positions – A dictionary whose key is a sequence id, and value are the read positions along that sequence.
- Return type
dict[str] -> int
-
compute_bin_coverage
(genome_ids, read_container, lengths, ref_seq_genome_id)¶ Compute fraction of bins with reads for each reference genome.
- Parameters
genome_ids (set[str]) – A set of genome ids in the sample
read_container (ReadContainer) – Reads in the sample
lengths (dict[str -> int]) – Length of each reference sequence
- Returns
coverage_frac – Fraction of nonzero read count bins for each genome_id.
- Return type
dict[str -> float]
-
extract_reads
(bam_file)¶ Get the read positions from sam_file for each reference genome.
- Parameters
bam_file (str) – Path to a bam file.
- Returns
read_positions (dict[str -> list]) – A dictionary whose key is a reference sequence id, and the value is a list of read coordinates along that reference sequence
lengths (dict[str -> int]) – A dictionary whose key is a reference sequence id, and the value is the length of that reference sequence
-
get_ref_names
(bam_file)¶ Get the names of reference sequences and reference genomes.
- Parameters
bam_file (str) – Path to bam file.
- Returns
ref_seq (set[str]) – A set of reference sequences ids
ref_genomes (set[str]) – A set of reference genome ids
-
get_ref_seq_lengths
(bam_file)¶ Extract the lengths of each reference sequence from a bam file.
- Parameters
bam_file (str) – Path to a bam file.
- Returns
ref_seq_lengths – A dictionary whose key is a reference sequence id, and value is the length of that reference sequence.
- Return type
dict[str -> int]
-
merge
(bam_files, out_bam)¶ Merge many bam files from different indexes into one, taking the reads with the highest mapping quality from each bam.
- Parameters
bam_files (list[str]) – A list of bam files to merge.
out_bam (str) – Location to store the merged bam file.
-
process_bam
(bam_file, max_alignments=10)¶ Extract read coordinates along each reference sequence.
- Parameters
bam_file (str) – Path to a bam file.
max_alignments (int) – Bounds the number alignments to consider for multi-mapped reads. Reads greater than or equal to this threshold are discarded.
- Returns
coverage_maps – A dictionary whose key is a reference genome id, and value is a CoverageMap.
- Return type
dict[str -> CoverageMap]
-
class
src.coptr.bam_processor.
CoverageMap
(bam_file, genome_id, is_assembly)¶ Data structure to store read positions along reference genomes.
- Parameters
bam_file (str) – Bam file used to construct coverage map
genome_id (str) – Unique ID of the reference genome
is_assembly (bool) – If true, the reference genome is an assembly (and has multiple contigs). Otherwise is a complete reference genome with a single contig.
-
class
src.coptr.bam_processor.
CoverageMapContig
(bam_file, genome_id, contig_read_positions, contig_lengths)¶ Data structure to store read positions from assemblies.
- Parameters
bam_file (str) – Bam file used to construct coverage map
genome_id (str) – Unique ID of the reference genome
contig_read_positions (dict[str -> list]) – A dictionary whose key is a contig id (sequence id), and value is a list of coordinates from the reads along that contig.
contig_lengths (dict[str -> int]) – A dictionary whose key is a contig id (sequence id), and value is the length of that contig.
-
bin_reads
(contig_id)¶ Count reads in bins.
- Parameters
contig_id (str) – Reference sequence id of the contig
- Returns
binned_reads – Total number of reads in each 10Kb window
- Return type
numpy.array
-
compute_bin_size
()¶ Compute bin size for read counts.
- Returns
bin_size – Bin size for read counts
- Return type
int
-
count_reads
()¶ Return number of mapped reads.
- Returns
nreads – Number of reads
- Return type
int
-
get_length
(contig_id)¶ Get the length of a contig.
- Parameters
contig_id (str) – Reference sequence id of the contig
- Returns
length – The length of the contig
- Return type
int
-
get_reads
(contig_id)¶ Get the coordinates of all reads for a contig.
- Parameters
contig_id (str) – Reference sequence id of the contig
- Returns
reads – The coordinates of each read along the contig
- Return type
numpy.array
-
passed_qc
()¶ Run basic quality control checks. Sets the passed_gc_flag attribute.
-
class
src.coptr.bam_processor.
CoverageMapRef
(bam_file, genome_id, read_positions, length)¶ Data structure to store read positions from complete reference genomes.
- Parameters
bam_file (str) – Bam file used to construct coverage map
genome_id (str) – Unique ID of the reference genome
read_positions (list[int]) – A list of coordinates, one per read.
-
count_reads
()¶ Return number of mapped reads.
- Returns
nreads – Number of reads
- Return type
int
-
get_length
()¶ Get the lenth of the genome.
- Returns
length – The length of the reference genome
- Return type
int
-
get_reads
()¶ Get the read coordinates along the reference genome.
- Returns
reads – A numpy array containing the coordinates of each read
- Return type
numpy.array
-
class
src.coptr.bam_processor.
ReadContainer
¶ Container to store read positions and metadata.
-
check_add_mapping
(query_id, ref_name, ref_position, score)¶ Stores a mapping if it has an alignment score greater than or equal to the mappings seen so far.
- Parameters
query_id (str) – The name (identifier) of the read.
ref_name (str) – The name of the reference sequence the read maps to.
ref_position (int) – The location in the reference sequence the read maps to.
score (float) – The alignment score
-
get_mappings
(query_id)¶ Return the highest scoring mappings for a read.
- Parameters
query_id (str) – The name of the read.
- Returns
ref_names (list[str]) – A list of the reference sequences the read maps to.
ref_positions (list[int]) – A list of positions in each reference sequences.
-
coptr_contig.py¶
Estimate peak-to-trough ratios from assemblies.
-
class
src.coptr.coptr_contig.
CoPTRContig
(min_reads, min_samples)¶ Estimate PTRs from draft assemblies.
- Parameters
min_reads (float) – Minumum read count after filtering to estimate PTR
min_samples (float) – Minimum number of passing samples required to estimate a PTR
-
compute_genomewide_bounds
(binned_counts, crit_region=0.05)¶ Compute bounds on read counts in bins.
- Parameters
binned_counts (np.array) – Binned read counts
- Returns
lower_bound (float) – A lower bound on read counts in bins
upper_bound (float) – An upper bound on read counts in bins
-
compute_log_likelihood
(log2_ptr, binned_counts)¶ Compute the log likelihood of reordered bin probabilities given a log2(PTR).
- Parameters
log2_ptr (float) – The Log2(PTR)
binned_counts (np.array) – Binned read counts along the genome
- Returns
log_lk – The log likelihood
- Return type
float
-
construct_coverage_matrix
(coverage_maps)¶ Aggregate coverage maps into a bin by sample matrix.
- Parameters
coverage_maps (list[CoverageMapContig]) – A list of coverage maps per sample.
- Returns
A – The resulting coverage matrix.
- Return type
2D np.array of float
-
estimate_ptr_maximum_likelihood
(binned_counts)¶ Compute maximum likelihood estimates of log2(PTR) gived observed reads.
- Parameters
binned_counts (np.array) – Binned read counts for a genome
- Returns
log2_ptr (float) – The maximum likelihood log2(PTR) estimate
intercept (float) – Probability a reads is in the leftmost bin
log_lk (float) – The log likelihood
-
estimate_ptrs
(coverage_maps, return_bins=False)¶ Estimate PTRs across multiple samples of the same reference genome.
- Parameters
coverage_maps (list[CoverageMapContig]) – The coverage maps for the assembly
min_reads (int) – Minimum number of reads after filtering to estimate a PTR
min_samples (int) – Minimum number of samples required to reorder bins using PCA. No estimate will be produced if the number of passing samples is less than min_samples.
return_bins (bool) – If true, returns estimated slopes, intercepts, and binned read counts in addition to estimates
- Returns
estimates (list[CoPTRContigEstimate]) – A list of estimated PTRs
parameters (list[tuple(2)]) – If return_bins is true, this is a list of tuples with first coordinate the stimated slope, and second coordinate the intercept
binned_counts list[np.array] – If return_bins is true, this is a list of the binned 10Kb read counts after filtering steps.
-
estimate_slope_intercept
(bin_log_probs)¶ Estimate the slope along reordering bins using least squares.
- Parameters
bin_log_probs (np.array) – The observed probability that a read lands in a bin.
- Returns
m (float) – The slope
b (float) – The intercept
-
class
src.coptr.coptr_contig.
CoPTRContigEstimate
(bam_file, genome_id, sample_id, estimate, nreads, cov_frac)¶ Data structure to store CoPTRContig estimates.
- Parameters
bam_file (str) – The bam file from which the estimate was obtained
genome_id (str) – The reference genome id for the estimate
sample_id (str) – The sample id for the estimate
estimate (float or np.nan) – The estimated PTR. This will be set to np.nan if the sample did not pass filtering steps
nreads (int) – If the sample passing filtering steps (i.e. has a PTR estimate) this is the number of filtered reads used for the estimate. Otherwise it is the number of reads before filtering.
cov_frac (float) – The fraction of nonzero bins in 10Kb windows.
-
src.coptr.coptr_contig.
estimate_ptrs_coptr_contig
(assembly_genome_ids, grouped_coverage_map_folder, min_reads, min_samples, plot_folder=None)¶ Estimate Peak-to-Trough ratios across samples.
- Parameters
assembly_genome_ids (set[str]) – A list of genome ids to estimate PTRs.
grouped_coverage_map_folder (str) – Path to folder containing coverage maps grouped by genome.
min_reads (float) – Minumum read count after filtering to estimate PTR
min_samples (float) – Minimum number of passing samples required to estimate a PTR
threads (int) – Number of threads for parallel computation
plot_folder (str) – If specified, plots of fitted model will be saved here
- Returns
coptr_contig_estimates – A dictionary with key reference genome id and value a list of CoPTRContigEstimate for that reference genome.
- Return type
dict[str -> list[CoPTContigEstimate]]
coptr_ref.py¶
Estimate peak-to-trough ratios using complete reference genomes.
-
class
src.coptr.coptr_ref.
CoPTRRef
(min_reads, min_cov)¶ CoPTR estimator for complete reference genomes.
- Parameters
min_reads (float) – Minumum read count after filtering to estimate PTR
frac_nonzero (float) – Fraction of nonzero 10Kb bins required to estimate PTR
-
compute_log_likelihood
(ori_loc, ter_loc, log2_ptr, read_locs)¶ Model log-likelihood for one sample.
- Parameters
ori_loc (float) – Position of replication origin in the interval [0, 1]
ter_loc (float) – Position of terminus location in the interval [0, 1]
log2_ptr (float) – The log base 2 peak-to-trough ratio
read_locs (np.array of float) – A 1D numpy array giving read positions in [0, 1]
-
estimate_ptr
(read_positions, ref_genome_len, filter_reads=True, estimate_terminus=False)¶ Estimate the PTR for a single sample.
- Parameters
read_positions (np.array) – A 1D numpy array giving the starting coordinates of each read
ref_genome_length (int) – The length of the refernce genome
filter_reads (bool) – If true, reads a filtered before estimates
estimate_terminus – If true, the replication terminus is estimated in addition to the replication origin
- Returns
log2_ptr (float) – The estimated log base 2 PTR
ori (float) – The estimated origin in position in [0, 1]
ter (float) – The estimated replication terminus position in [0, 1]
log_lk (float) – The model log likelihood
-
estimate_ptrs
(coverage_maps)¶ Compute maximum likelihood PTR estimates across samples.
- Parameters
coverage_maps (list[CoverageMapRef]) – A list of coverage maps per sample
- Returns
estimates – A list of estimates per sample
- Return type
list[CoPTRefEstimate]
-
update_ori_ter
(log2_ptrs, read_locs_list)¶ Compute maximum likelihood ori estimates across samples given estimated log2(PTR)s across samples.
- Parameters
log2_ptrs (list[float]) – List of estimated log2(PTR)s across samples
list[np.array] (read_locs_list) – List of read positions in [0, 1] across samples
- Returns
ori (float) – The estimated replication origin in [0, 1]
ter (float) – The estimated replication terminus in [0, 1]
-
update_ptrs
(ori, ter, prv_log2_ptrs, read_locs_list)¶ Compute maximum likelihood PTR estimates across samples given an estimated replication origin and terminus.
- Parameters
ori (float) – The estimated origin in position in [0, 1]
ter (float) – The estimated replication terminus position in [0, 1]
prv_log2_ptrs (list[float]) – Estimated log2(PTR) across samples
list[np.array] (read_locs_list) – Read positions in [0, 1] across samples
- Returns
log2_ptrs – The updated log2(PTR) estimates
- Return type
list[float]
-
class
src.coptr.coptr_ref.
CoPTRRefEstimate
(bam_file, genome_id, sample_id, estimate, ori_estimate, ter_estimate, nreads, cov_frac)¶ Data structure to store CoPTRRef estimates.
- Parameters
bam_file (str) – The bam file from which the estimate was obtained
genome_id (str) – The reference genome id for the estimate
sample_id (str) – The sample id for the estimate
estimate (float or np.nan) – The estimated log2(PTR). This will be set to np.nan if the sample did not pass filtering steps
ori_estimate (float) – Estimated replication origin position in interval [0, 1]
ter_estimate (float) – Estimated replication terminus position in interval [0, 1]
nreads (int) – The number of reads remaining after filtering
cov_frac (float) – The fraction of nonzero bins in 10Kb windows.
-
class
src.coptr.coptr_ref.
QCResult
(frac_nonzero, nreads, frac_removed, passed_qc)¶ Data structure to store results of quality checking.
- Parameters
frac_nonzero (float) – Fraction of nonzero bins in 10Kb windows
nreads (int) – Number of reads after filtering
frac_removed (float) – Proportion of genome removed during filtering
passed_qc (bool) – Flag indicating if the sample passed quality thresholds
-
class
src.coptr.coptr_ref.
ReadFilterRef
(min_reads, min_cov)¶ Read filtering steps for CoPTR Ref.
- Parameters
min_reads (float) – Minumum read count after filtering
frac_nonzero (float) – Fraction of nonzero 10Kb bins required
-
bin_reads
(read_positions, genome_length, bin_size=10000)¶ Aggregate reads into bin_size windows and compute the read count in each window.
- Parameters
read_positions (np.array of int) – The coordinate of the start position of each read
genome_length (int) – The length of the reference genome
- Returns
bin_counts – An array of the read count in each bin
- Return type
np.array
-
compute_bin_bounds
(binned_counts)¶ Compute bounds on read counts in a smaller window along the genome.
- Parameters
binned_counts (np.array) – Binned read counts
- Returns
lower_bound (float) – A lower bound on read counts in bins
upper_bound (float) – An upper bound on read counts in bins
-
compute_bin_size
(genome_length)¶ Compute bin size for read counts.
- Parameters
genome_length (int) – Length of the reference genome
- Returns
bin_size – Bin size for read countns
- Return type
int
-
compute_genomewide_bounds
(binned_counts)¶ Compute bounds on read counts in bins.
- Parameters
binned_counts (np.array) – Binned read counts
- Returns
lower_bound (float) – A lower bound on read counts in bins
upper_bound (float) – An upper bound on read counts in bins
-
compute_rolling_sum
(read_positions, genome_length, bin_size)¶ Compute a rolling sum of read counts in 10Kb bins, sliding 1Kb at a time.
- Parameters
read_positions (np.array or list of int) – The coordinate of the start positions of each read along the genome.
genome_length (float) – The length of the reference genome
- Returns
rolling_counts (np.array of float) – The read count in each rolling bin
endpoints (np.array of tuple) – Each tuple gives the left (inclusive) and right (exclusive) end point of each bin.
-
filter_reads
(read_positions, genome_length)¶ Filter out reads in regions of the genome where the coverage is either too high or too low.
- Parameters
read_positions (np.array or list of int) – The starting coordinate of each read along the genome
genome_length (float) – The length of the reference genome
- Returns
filtered_read_positions (np.array of int) – The starting position of each read passing filtering criteria
filtered_genome_length (float) – The length of the genome with filtered regions removed
passed_qc (bool) – A flag indicating if the sample passed quality control metrics
-
filter_reads_phase1
(read_positions, genome_length, bin_size)¶ A coarse-grained genomewide filter that removes reads in ultra-high or ultra-low coverage regions.
- Parameters
read_positions (np.array of int) – The coordinate of the start position of each read
genome_length (int) –
- Returns
read_positions (np.array of int) – The starting position of each read in an unfiltered region, adjusted for the positions removed.
new_genome_length (float) – The length of the genome with filtered regions removed
-
filter_reads_phase2
(read_positions, genome_length, bin_size)¶ A fine-grained filter that removes reads in localized regions with too-high or too-low coverage. For each 10Kb, looks 6.25% of the genome length ahead and 6.25% of the genome length behind.
- Parameters
read_positions (np.array of int) – The coordinate of the start position of each read
genome_length (int) – The length of the reference genome
- Returns
read_positions (np.array of int) – The starting position of each read in an unfiltered region, adjusted for the positions removed.
new_genome_length (float) – The length of the genome with filtered regions removed
-
quality_check
(filtered_read_positions, filtered_genome_length, original_genome_length, bin_size, min_reads, frac_nonzero)¶ A basic quality check for required coverage of a genome in a sample.
- Parameters
filtered_read_positions (np.array or list of int) – The starting coordinate of each read along the genome after filtering
filtered_genome_length (float) – The length of the reference genome after filtering
original_genome_length (float) – The length of the reference genome before filtering
min_reads (float) – Minumum read count after filtering
frac_nonzero (float) – Fraction of nonzero 10Kb bins required
- Returns
qc_result – Object that stores the result of the quality check
- Return type
-
remove_reads_by_region
(read_positions, genome_length, regions)¶ Remove reads that overlap a region in regions.
- Parameters
read_positions (np.array of int) – The coordinate of the start position of each read
genome_length (int) – The length of the reference genome
regions (list of tuple of int) – Each tuple gives an interval in the genome to remove
- Returns
read_positions (np.array of int) – Update coordiantes for each read after removing the regions in region
new_genome_length (float) – The length of the genome once each region is regions is removed.
-
src.coptr.coptr_ref.
estimate_ptrs_coptr_ref
(ref_genome_ids, grouped_coverage_map_folder, min_reads, min_cov, plot_folder=None, mem_limit=None)¶ Estimate Peak-to-Trough ratios across samples.
- Parameters
ref_genome_ids (set[str]) – A list of genome ids to estimate PTRs.
grouped_coverage_map_folder (str) – Path to folder containing coverage maps grouped by genome.
min_reads (float) – Minumum read count after filtering to estimate PTR
frac_nonzero (float) – Fraction of nonzero 10Kb bins required to estimate PTR
threads (int) – Number of threads for parallel computation
plot_folder (str) – If not None, plots of fitted model are generated and saved here
mem_limit (int) – Limit amount of coverage maps loaded into memory at once. Breaks samples per genome into batches. Limit is specified in GB. Works only in single-threaded mode.
- Returns
coptr_ref_estimates – A dictionary with key reference genome id and value a list of CoPTRRefEstimate for that reference genome.
- Return type
dict[str -> list[CoPTRefEstimate]]
poisson_pca.py¶
Dimensionality reduction with Poisson loss function.
-
class
src.coptr.poisson_pca.
PoissonPCA
¶ Principal Component Analysis with Poisson observations.
-
grad_W
(X, W, V)¶ Compute the gradient of the log likelihood wrt W.
- Parameters
X (np.array) – An n x m data matrix
W (np.array) – An n x k matrix
V (np.array) – A k x m matrix
-
log_likelihood
(X, W, V)¶ Compute the log likelihood (minus a constant).
- Parameters
X (np.array) – An n x m data matrix
W (np.array) – An n x k matrix
V (np.array) – A k x m matrix
-
pca
(X, k, tol=0.001)¶ Run the PCA. For a data matrix X (sample by observations), finds two matrices W and V such that:
X_{ij} ~ Poisson(exp( [WV]_{ij} ))
Note that missing entries are possible. Missing entries should be denoted by np.nan.
- Parameters
X (np.array) – A sample by observation data matrix with observed counts
k (int) – Number of latent dimensions. Determines the rank of W and V.
-
update_V
(X, W, V)¶ Optimize the log-likelihood with respect to V.
- Parameters
X (np.array) – The data matrix
W (np.array) – The current estimate of W
V (np.array) – The current estimate of V
-
update_W
(X, W, V)¶ Optimize the log-likelihood with respect to W.
- Parameters
X (np.array) – The data matrix
W (np.array) – The current estimate of W
V (np.array) – The current estimate of V
-
read_mapper.py¶
Module to map reads to a reference database of high-quality assemblies and complete reference genomes.
-
class
src.coptr.read_mapper.
ReadMapper
¶ Wrapper around bowtie2.
-
index
(ref_fasta, index_out, bt2_bmax, bt2_dcv, bt2_threads, bt2_packed)¶ Build a bowtie2 index from ref_fasta.
- Parameters
ref_fasta (str) – Fasta file or folder containing fasta files to index. Valid extensions include ‘.fasta’, ‘.fna’, ‘.fa’
index_out (str) – Path to output the index.
bt2_bmax (str) – Parameter to pass to bowtie2-build –bmax argument.
bt2_dcv (str) – Parameter to pass to bowtie2-build –dcv argument.
bt2_threads (str) – Parameter to pass to bowtie2-build –threads argument.
bt2_packed (str) – Parameter to pass to bowtie2-build –packed argument.
-
map
(index, inputf, outfolder, paired, threads, bt2_k)¶ Map reads from infile against reference database using bowtie2, then convert to a bam file.
- Parameters
index (str) – Path of the database index to map against
inputf (str) – File or folder with the reads to map
outfolder (str) – Folder to save bam files.
paired (bool) – True for paired end sequencing.
threads (int) – Number of threads to use. Passed to the -p argument for bowtie2.
bt2_k (int) – Number of alignments to report
-
read_assigner.py¶
Assign multi-mapped reads to a single genome.
-
class
src.coptr.read_assigner.
ReadAssigner
(X, prior_counts)¶ Assign multi-mapped reads using a mixture model.
- Parameters
X (np.array) – A read by genome data matrix (0-1). X[i,j] = 1 if read i maps to genome j
-
assign_reads
()¶ Compute read assignments.
- Returns
assignments – A 1D array. Each entry gives a genome id that is the genome assignment for each read.
- Return type
np.array
-
compute_elbo
()¶ Compute the variational objective function.
-
run_vi
(tol=0.001)¶ Optimize variational parameters with respect to elbo.
-
update_eta
()¶ Optimize eta with fixed phi.
-
update_phi
()¶ Optimize phi with fixed eta.