Modules¶

bam_processor.py¶

Extract coordinates along each reference sequence and group reference sequences by genome.

class src.coptr.bam_processor.BamProcessor(regex='[^\\|]+', is_bam=True)¶

Extract coordinates from a bam file.

Parameters: regex (str) – A regular expression that matches a reference genome id from a reference sequence name.

Notes

For assemblies, the reference database will contain several contigs per assembly, each with a reference sequence name. We need to map a reference sequence name to the assembly is came from. This is accomplished with a regular expression. The expression matches a reference genome id.

For example, consider an assembly with contigs named

ERS235517|65|k99_317
ERS235517|65|k99_655
ERS235517|65|k99_1708

ERS235517|65 gives the reference genome, while the k99_* identify each contig. We want a regular expression that matches ERS235517|65. The regular expression

\w+\|\d+

matches 1 or more letter, number, or underscore (\w+), followed by the | character, followed by 1 or more number (\d+). This extracts the right reference genome id.

assign_multimapped_reads(read_container, lengths, max_alignments)¶

Assign multiple mapped reads to a single genome.

Parameters: reads (dict[str] -> Read) – A dictionary whose key is the read query_id and value is a Read object
Returns: read_positions – A dictionary whose key is a sequence id, and value are the read positions along that sequence.
Return type: dict[str] -> int

compute_bin_coverage(genome_ids, read_container, lengths, ref_seq_genome_id)¶

Compute fraction of bins with reads for each reference genome.

Parameters

genome_ids (set[str]) – A set of genome ids in the sample
read_container (ReadContainer) – Reads in the sample
lengths (dict[str -> int]) – Length of each reference sequence

Returns

coverage_frac – Fraction of nonzero read count bins for each genome_id.

Return type

dict[str -> float]

extract_reads(bam_file)¶

Get the read positions from sam_file for each reference genome.

Parameters

bam_file (str) – Path to a bam file.

Returns

read_positions (dict[str -> list]) – A dictionary whose key is a reference sequence id, and the value is a list of read coordinates along that reference sequence
lengths (dict[str -> int]) – A dictionary whose key is a reference sequence id, and the value is the length of that reference sequence

get_ref_names(bam_file)¶

Get the names of reference sequences and reference genomes.

Parameters

bam_file (str) – Path to bam file.

Returns

ref_seq (set[str]) – A set of reference sequences ids
ref_genomes (set[str]) – A set of reference genome ids

get_ref_seq_lengths(bam_file)¶

Extract the lengths of each reference sequence from a bam file.

Parameters: bam_file (str) – Path to a bam file.
Returns: ref_seq_lengths – A dictionary whose key is a reference sequence id, and value is the length of that reference sequence.
Return type: dict[str -> int]

merge(bam_files, out_bam)¶

Merge many bam files from different indexes into one, taking the reads with the highest mapping quality from each bam.

Parameters

bam_files (list[str]) – A list of bam files to merge.
out_bam (str) – Location to store the merged bam file.

process_bam(bam_file, max_alignments=10)¶

Extract read coordinates along each reference sequence.

Parameters

bam_file (str) – Path to a bam file.
max_alignments (int) – Bounds the number alignments to consider for multi-mapped reads. Reads greater than or equal to this threshold are discarded.

Returns

coverage_maps – A dictionary whose key is a reference genome id, and value is a CoverageMap.

Return type

dict[str -> CoverageMap]

class src.coptr.bam_processor.CoverageMap(bam_file, genome_id, is_assembly)¶

Data structure to store read positions along reference genomes.

Parameters

bam_file (str) – Bam file used to construct coverage map
genome_id (str) – Unique ID of the reference genome
is_assembly (bool) – If true, the reference genome is an assembly (and has multiple contigs). Otherwise is a complete reference genome with a single contig.

class src.coptr.bam_processor.CoverageMapContig(bam_file, genome_id, contig_read_positions, contig_lengths)¶

Data structure to store read positions from assemblies.

Parameters

bam_file (str) – Bam file used to construct coverage map
genome_id (str) – Unique ID of the reference genome
contig_read_positions (dict[str -> list]) – A dictionary whose key is a contig id (sequence id), and value is a list of coordinates from the reads along that contig.
contig_lengths (dict[str -> int]) – A dictionary whose key is a contig id (sequence id), and value is the length of that contig.

bin_reads(contig_id)¶

Count reads in bins.

Parameters: contig_id (str) – Reference sequence id of the contig
Returns: binned_reads – Total number of reads in each 10Kb window
Return type: numpy.array

compute_bin_size()¶

Compute bin size for read counts.

Returns: bin_size – Bin size for read counts
Return type: int

count_reads()¶

Return number of mapped reads.

Returns: nreads – Number of reads
Return type: int

get_length(contig_id)¶

Get the length of a contig.

Parameters: contig_id (str) – Reference sequence id of the contig
Returns: length – The length of the contig
Return type: int

get_reads(contig_id)¶

Get the coordinates of all reads for a contig.

Parameters: contig_id (str) – Reference sequence id of the contig
Returns: reads – The coordinates of each read along the contig
Return type: numpy.array

passed_qc()¶: Run basic quality control checks. Sets the passed_gc_flag attribute.

class src.coptr.bam_processor.CoverageMapRef(bam_file, genome_id, read_positions, length)¶

Data structure to store read positions from complete reference genomes.

Parameters

bam_file (str) – Bam file used to construct coverage map
genome_id (str) – Unique ID of the reference genome
read_positions (list[int]) – A list of coordinates, one per read.

count_reads()¶

Return number of mapped reads.

Returns: nreads – Number of reads
Return type: int

get_length()¶

Get the lenth of the genome.

Returns: length – The length of the reference genome
Return type: int

get_reads()¶

Get the read coordinates along the reference genome.

Returns: reads – A numpy array containing the coordinates of each read
Return type: numpy.array

class src.coptr.bam_processor.ReadContainer¶

Container to store read positions and metadata.

check_add_mapping(query_id, ref_name, ref_position, score)¶

Stores a mapping if it has an alignment score greater than or equal to the mappings seen so far.

Parameters

query_id (str) – The name (identifier) of the read.
ref_name (str) – The name of the reference sequence the read maps to.
ref_position (int) – The location in the reference sequence the read maps to.
score (float) – The alignment score

get_mappings(query_id)¶

Return the highest scoring mappings for a read.

Parameters

query_id (str) – The name of the read.

Returns

ref_names (list[str]) – A list of the reference sequences the read maps to.
ref_positions (list[int]) – A list of positions in each reference sequences.

coptr_contig.py¶

Estimate peak-to-trough ratios from assemblies.

class src.coptr.coptr_contig.CoPTRContig(min_reads, min_samples)¶

Estimate PTRs from draft assemblies.

Parameters

min_reads (float) – Minumum read count after filtering to estimate PTR
min_samples (float) – Minimum number of passing samples required to estimate a PTR

compute_genomewide_bounds(binned_counts, crit_region=0.05)¶

Compute bounds on read counts in bins.

Parameters

binned_counts (np.array) – Binned read counts

Returns

lower_bound (float) – A lower bound on read counts in bins
upper_bound (float) – An upper bound on read counts in bins

compute_log_likelihood(log2_ptr, binned_counts)¶

Compute the log likelihood of reordered bin probabilities given a log2(PTR).

Parameters

log2_ptr (float) – The Log2(PTR)
binned_counts (np.array) – Binned read counts along the genome

Returns

log_lk – The log likelihood

Return type

float

construct_coverage_matrix(coverage_maps)¶

Aggregate coverage maps into a bin by sample matrix.

Parameters: coverage_maps (list[CoverageMapContig]) – A list of coverage maps per sample.
Returns: A – The resulting coverage matrix.
Return type: 2D np.array of float

estimate_ptr_maximum_likelihood(binned_counts)¶

Compute maximum likelihood estimates of log2(PTR) gived observed reads.

Parameters

binned_counts (np.array) – Binned read counts for a genome

Returns

log2_ptr (float) – The maximum likelihood log2(PTR) estimate
intercept (float) – Probability a reads is in the leftmost bin
log_lk (float) – The log likelihood

estimate_ptrs(coverage_maps, return_bins=False)¶

Estimate PTRs across multiple samples of the same reference genome.

Parameters

coverage_maps (list[CoverageMapContig]) – The coverage maps for the assembly
min_reads (int) – Minimum number of reads after filtering to estimate a PTR
min_samples (int) – Minimum number of samples required to reorder bins using PCA. No estimate will be produced if the number of passing samples is less than min_samples.
return_bins (bool) – If true, returns estimated slopes, intercepts, and binned read counts in addition to estimates

Returns

estimates (list[CoPTRContigEstimate]) – A list of estimated PTRs
parameters (list[tuple(2)]) – If return_bins is true, this is a list of tuples with first coordinate the stimated slope, and second coordinate the intercept
binned_counts list[np.array] – If return_bins is true, this is a list of the binned 10Kb read counts after filtering steps.

estimate_slope_intercept(bin_log_probs)¶

Estimate the slope along reordering bins using least squares.

Parameters

bin_log_probs (np.array) – The observed probability that a read lands in a bin.

Returns

m (float) – The slope
b (float) – The intercept

class src.coptr.coptr_contig.CoPTRContigEstimate(bam_file, genome_id, sample_id, estimate, nreads, cov_frac)¶

Data structure to store CoPTRContig estimates.

Parameters

bam_file (str) – The bam file from which the estimate was obtained
genome_id (str) – The reference genome id for the estimate
sample_id (str) – The sample id for the estimate
estimate (float or np.nan) – The estimated PTR. This will be set to np.nan if the sample did not pass filtering steps
nreads (int) – If the sample passing filtering steps (i.e. has a PTR estimate) this is the number of filtered reads used for the estimate. Otherwise it is the number of reads before filtering.
cov_frac (float) – The fraction of nonzero bins in 10Kb windows.

src.coptr.coptr_contig.estimate_ptrs_coptr_contig(assembly_genome_ids, grouped_coverage_map_folder, min_reads, min_samples, plot_folder=None)¶

Estimate Peak-to-Trough ratios across samples.

Parameters

assembly_genome_ids (set[str]) – A list of genome ids to estimate PTRs.
grouped_coverage_map_folder (str) – Path to folder containing coverage maps grouped by genome.
min_reads (float) – Minumum read count after filtering to estimate PTR
min_samples (float) – Minimum number of passing samples required to estimate a PTR
threads (int) – Number of threads for parallel computation
plot_folder (str) – If specified, plots of fitted model will be saved here

Returns

coptr_contig_estimates – A dictionary with key reference genome id and value a list of CoPTRContigEstimate for that reference genome.

Return type

dict[str -> list[CoPTContigEstimate]]

coptr_ref.py¶

Estimate peak-to-trough ratios using complete reference genomes.

class src.coptr.coptr_ref.CoPTRRef(min_reads, min_cov)¶

CoPTR estimator for complete reference genomes.

Parameters

min_reads (float) – Minumum read count after filtering to estimate PTR
frac_nonzero (float) – Fraction of nonzero 10Kb bins required to estimate PTR

compute_log_likelihood(ori_loc, ter_loc, log2_ptr, read_locs)¶

Model log-likelihood for one sample.

Parameters

ori_loc (float) – Position of replication origin in the interval [0, 1]
ter_loc (float) – Position of terminus location in the interval [0, 1]
log2_ptr (float) – The log base 2 peak-to-trough ratio
read_locs (np.array of float) – A 1D numpy array giving read positions in [0, 1]

estimate_ptr(read_positions, ref_genome_len, filter_reads=True, estimate_terminus=False)¶

Estimate the PTR for a single sample.

Parameters

read_positions (np.array) – A 1D numpy array giving the starting coordinates of each read
ref_genome_length (int) – The length of the refernce genome
filter_reads (bool) – If true, reads a filtered before estimates
estimate_terminus – If true, the replication terminus is estimated in addition to the replication origin

Returns

log2_ptr (float) – The estimated log base 2 PTR
ori (float) – The estimated origin in position in [0, 1]
ter (float) – The estimated replication terminus position in [0, 1]
log_lk (float) – The model log likelihood

estimate_ptrs(coverage_maps)¶

Compute maximum likelihood PTR estimates across samples.

Parameters: coverage_maps (list[CoverageMapRef]) – A list of coverage maps per sample
Returns: estimates – A list of estimates per sample
Return type: list[CoPTRefEstimate]

update_ori_ter(log2_ptrs, read_locs_list)¶

Compute maximum likelihood ori estimates across samples given estimated log2(PTR)s across samples.

Parameters

log2_ptrs (list[float]) – List of estimated log2(PTR)s across samples
list[np.array] (read_locs_list) – List of read positions in [0, 1] across samples

Returns

ori (float) – The estimated replication origin in [0, 1]
ter (float) – The estimated replication terminus in [0, 1]

update_ptrs(ori, ter, prv_log2_ptrs, read_locs_list)¶

Compute maximum likelihood PTR estimates across samples given an estimated replication origin and terminus.

Parameters

ori (float) – The estimated origin in position in [0, 1]
ter (float) – The estimated replication terminus position in [0, 1]
prv_log2_ptrs (list[float]) – Estimated log2(PTR) across samples
list[np.array] (read_locs_list) – Read positions in [0, 1] across samples

Returns

log2_ptrs – The updated log2(PTR) estimates

Return type

list[float]

class src.coptr.coptr_ref.CoPTRRefEstimate(bam_file, genome_id, sample_id, estimate, ori_estimate, ter_estimate, nreads, cov_frac)¶

Data structure to store CoPTRRef estimates.

Parameters

bam_file (str) – The bam file from which the estimate was obtained
genome_id (str) – The reference genome id for the estimate
sample_id (str) – The sample id for the estimate
estimate (float or np.nan) – The estimated log2(PTR). This will be set to np.nan if the sample did not pass filtering steps
ori_estimate (float) – Estimated replication origin position in interval [0, 1]
ter_estimate (float) – Estimated replication terminus position in interval [0, 1]
nreads (int) – The number of reads remaining after filtering
cov_frac (float) – The fraction of nonzero bins in 10Kb windows.

class src.coptr.coptr_ref.QCResult(frac_nonzero, nreads, frac_removed, passed_qc)¶

Data structure to store results of quality checking.

Parameters

frac_nonzero (float) – Fraction of nonzero bins in 10Kb windows
nreads (int) – Number of reads after filtering
frac_removed (float) – Proportion of genome removed during filtering
passed_qc (bool) – Flag indicating if the sample passed quality thresholds

class src.coptr.coptr_ref.ReadFilterRef(min_reads, min_cov)¶

Read filtering steps for CoPTR Ref.

Parameters

min_reads (float) – Minumum read count after filtering
frac_nonzero (float) – Fraction of nonzero 10Kb bins required

bin_reads(read_positions, genome_length, bin_size=10000)¶

Aggregate reads into bin_size windows and compute the read count in each window.

Parameters

read_positions (np.array of int) – The coordinate of the start position of each read
genome_length (int) – The length of the reference genome

Returns

bin_counts – An array of the read count in each bin

Return type

np.array

compute_bin_bounds(binned_counts)¶

Compute bounds on read counts in a smaller window along the genome.

Parameters

binned_counts (np.array) – Binned read counts

Returns

lower_bound (float) – A lower bound on read counts in bins
upper_bound (float) – An upper bound on read counts in bins

compute_bin_size(genome_length)¶

Compute bin size for read counts.

Parameters: genome_length (int) – Length of the reference genome
Returns: bin_size – Bin size for read countns
Return type: int

compute_genomewide_bounds(binned_counts)¶

Compute bounds on read counts in bins.

Parameters

binned_counts (np.array) – Binned read counts

Returns

lower_bound (float) – A lower bound on read counts in bins
upper_bound (float) – An upper bound on read counts in bins

compute_rolling_sum(read_positions, genome_length, bin_size)¶

Compute a rolling sum of read counts in 10Kb bins, sliding 1Kb at a time.

Parameters

read_positions (np.array or list of int) – The coordinate of the start positions of each read along the genome.
genome_length (float) – The length of the reference genome

Returns

rolling_counts (np.array of float) – The read count in each rolling bin
endpoints (np.array of tuple) – Each tuple gives the left (inclusive) and right (exclusive) end point of each bin.

filter_reads(read_positions, genome_length)¶

Filter out reads in regions of the genome where the coverage is either too high or too low.

Parameters

read_positions (np.array or list of int) – The starting coordinate of each read along the genome
genome_length (float) – The length of the reference genome

Returns

filtered_read_positions (np.array of int) – The starting position of each read passing filtering criteria
filtered_genome_length (float) – The length of the genome with filtered regions removed
passed_qc (bool) – A flag indicating if the sample passed quality control metrics

filter_reads_phase1(read_positions, genome_length, bin_size)¶

A coarse-grained genomewide filter that removes reads in ultra-high or ultra-low coverage regions.

Parameters

read_positions (np.array of int) – The coordinate of the start position of each read
genome_length (int) –

Returns

read_positions (np.array of int) – The starting position of each read in an unfiltered region, adjusted for the positions removed.
new_genome_length (float) – The length of the genome with filtered regions removed

filter_reads_phase2(read_positions, genome_length, bin_size)¶

A fine-grained filter that removes reads in localized regions with too-high or too-low coverage. For each 10Kb, looks 6.25% of the genome length ahead and 6.25% of the genome length behind.

Parameters

read_positions (np.array of int) – The coordinate of the start position of each read
genome_length (int) – The length of the reference genome

Returns

read_positions (np.array of int) – The starting position of each read in an unfiltered region, adjusted for the positions removed.
new_genome_length (float) – The length of the genome with filtered regions removed

quality_check(filtered_read_positions, filtered_genome_length, original_genome_length, bin_size, min_reads, frac_nonzero)¶

A basic quality check for required coverage of a genome in a sample.

Parameters

filtered_read_positions (np.array or list of int) – The starting coordinate of each read along the genome after filtering
filtered_genome_length (float) – The length of the reference genome after filtering
original_genome_length (float) – The length of the reference genome before filtering
min_reads (float) – Minumum read count after filtering
frac_nonzero (float) – Fraction of nonzero 10Kb bins required

Returns

qc_result – Object that stores the result of the quality check

Return type

QCResult

remove_reads_by_region(read_positions, genome_length, regions)¶

Remove reads that overlap a region in regions.

Parameters

read_positions (np.array of int) – The coordinate of the start position of each read
genome_length (int) – The length of the reference genome
regions (list of tuple of int) – Each tuple gives an interval in the genome to remove

Returns

read_positions (np.array of int) – Update coordiantes for each read after removing the regions in region
new_genome_length (float) – The length of the genome once each region is regions is removed.

src.coptr.coptr_ref.estimate_ptrs_coptr_ref(ref_genome_ids, grouped_coverage_map_folder, min_reads, min_cov, plot_folder=None, mem_limit=None)¶

Estimate Peak-to-Trough ratios across samples.

Parameters

ref_genome_ids (set[str]) – A list of genome ids to estimate PTRs.
grouped_coverage_map_folder (str) – Path to folder containing coverage maps grouped by genome.
min_reads (float) – Minumum read count after filtering to estimate PTR
frac_nonzero (float) – Fraction of nonzero 10Kb bins required to estimate PTR
threads (int) – Number of threads for parallel computation
plot_folder (str) – If not None, plots of fitted model are generated and saved here
mem_limit (int) – Limit amount of coverage maps loaded into memory at once. Breaks samples per genome into batches. Limit is specified in GB. Works only in single-threaded mode.

Returns

coptr_ref_estimates – A dictionary with key reference genome id and value a list of CoPTRRefEstimate for that reference genome.

Return type

dict[str -> list[CoPTRefEstimate]]

poisson_pca.py¶

Dimensionality reduction with Poisson loss function.

class src.coptr.poisson_pca.PoissonPCA¶

Principal Component Analysis with Poisson observations.

grad_W(X, W, V)¶

Compute the gradient of the log likelihood wrt W.

Parameters

X (np.array) – An n x m data matrix
W (np.array) – An n x k matrix
V (np.array) – A k x m matrix

log_likelihood(X, W, V)¶

Compute the log likelihood (minus a constant).

Parameters

X (np.array) – An n x m data matrix
W (np.array) – An n x k matrix
V (np.array) – A k x m matrix

pca(X, k, tol=0.001)¶

Run the PCA. For a data matrix X (sample by observations), finds two matrices W and V such that:

X_{ij} ~ Poisson(exp( [WV]_{ij} ))

Note that missing entries are possible. Missing entries should be denoted by np.nan.

Parameters

X (np.array) – A sample by observation data matrix with observed counts
k (int) – Number of latent dimensions. Determines the rank of W and V.

update_V(X, W, V)¶

Optimize the log-likelihood with respect to V.

Parameters

X (np.array) – The data matrix
W (np.array) – The current estimate of W
V (np.array) – The current estimate of V

update_W(X, W, V)¶

Optimize the log-likelihood with respect to W.

Parameters

X (np.array) – The data matrix
W (np.array) – The current estimate of W
V (np.array) – The current estimate of V

read_mapper.py¶

Module to map reads to a reference database of high-quality assemblies and complete reference genomes.

class src.coptr.read_mapper.ReadMapper¶

Wrapper around bowtie2.

index(ref_fasta, index_out, bt2_bmax, bt2_dcv, bt2_threads, bt2_packed)¶

Build a bowtie2 index from ref_fasta.

Parameters

ref_fasta (str) – Fasta file or folder containing fasta files to index. Valid extensions include ‘.fasta’, ‘.fna’, ‘.fa’
index_out (str) – Path to output the index.
bt2_bmax (str) – Parameter to pass to bowtie2-build –bmax argument.
bt2_dcv (str) – Parameter to pass to bowtie2-build –dcv argument.
bt2_threads (str) – Parameter to pass to bowtie2-build –threads argument.
bt2_packed (str) – Parameter to pass to bowtie2-build –packed argument.

map(index, inputf, outfolder, paired, threads, bt2_k)¶

Map reads from infile against reference database using bowtie2, then convert to a bam file.

Parameters

index (str) – Path of the database index to map against
inputf (str) – File or folder with the reads to map
outfolder (str) – Folder to save bam files.
paired (bool) – True for paired end sequencing.
threads (int) – Number of threads to use. Passed to the -p argument for bowtie2.
bt2_k (int) – Number of alignments to report

read_assigner.py¶

Assign multi-mapped reads to a single genome.

class src.coptr.read_assigner.ReadAssigner(X, prior_counts)¶

Assign multi-mapped reads using a mixture model.

Parameters: X (np.array) – A read by genome data matrix (0-1). X[i,j] = 1 if read i maps to genome j

assign_reads()¶

Compute read assignments.

Returns: assignments – A 1D array. Each entry gives a genome id that is the genome assignment for each read.
Return type: np.array

compute_elbo()¶: Compute the variational objective function.

run_vi(tol=0.001)¶: Optimize variational parameters with respect to elbo.

update_eta()¶: Optimize eta with fixed phi.

update_phi()¶: Optimize phi with fixed eta.

util.py¶

Miscellaneous utility functions.

src.coptr.util.get_fastq_name(fpath)¶

Remove the path and extension from a fastq file.

Parameters: fpath (str) – path to .fastq, fastq.gz, .fq, or fq.gz file
Returns: fname – the name of the file with its path and extension removeed
Return type: str