Modules

bam_processor.py

Extract coordinates along each reference sequence and group reference sequences by genome.

class src.coptr.bam_processor.BamProcessor(regex='[^\\|]+', is_bam=True)

Extract coordinates from a bam file.

Parameters

regex (str) – A regular expression that matches a reference genome id from a reference sequence name.

Notes

For assemblies, the reference database will contain several contigs per assembly, each with a reference sequence name. We need to map a reference sequence name to the assembly is came from. This is accomplished with a regular expression. The expression matches a reference genome id.

For example, consider an assembly with contigs named

ERS235517|65|k99_317
ERS235517|65|k99_655
ERS235517|65|k99_1708

ERS235517|65 gives the reference genome, while the k99_* identify each contig. We want a regular expression that matches ERS235517|65. The regular expression

\w+\|\d+

matches 1 or more letter, number, or underscore (\w+), followed by the | character, followed by 1 or more number (\d+). This extracts the right reference genome id.

assign_multimapped_reads(read_container, lengths, max_alignments)

Assign multiple mapped reads to a single genome.

Parameters

reads (dict[str] -> Read) – A dictionary whose key is the read query_id and value is a Read object

Returns

read_positions – A dictionary whose key is a sequence id, and value are the read positions along that sequence.

Return type

dict[str] -> int

compute_bin_coverage(genome_ids, read_container, lengths, ref_seq_genome_id)

Compute fraction of bins with reads for each reference genome.

Parameters
  • genome_ids (set[str]) – A set of genome ids in the sample

  • read_container (ReadContainer) – Reads in the sample

  • lengths (dict[str -> int]) – Length of each reference sequence

Returns

coverage_frac – Fraction of nonzero read count bins for each genome_id.

Return type

dict[str -> float]

extract_reads(bam_file)

Get the read positions from sam_file for each reference genome.

Parameters

bam_file (str) – Path to a bam file.

Returns

  • read_positions (dict[str -> list]) – A dictionary whose key is a reference sequence id, and the value is a list of read coordinates along that reference sequence

  • lengths (dict[str -> int]) – A dictionary whose key is a reference sequence id, and the value is the length of that reference sequence

get_ref_names(bam_file)

Get the names of reference sequences and reference genomes.

Parameters

bam_file (str) – Path to bam file.

Returns

  • ref_seq (set[str]) – A set of reference sequences ids

  • ref_genomes (set[str]) – A set of reference genome ids

get_ref_seq_lengths(bam_file)

Extract the lengths of each reference sequence from a bam file.

Parameters

bam_file (str) – Path to a bam file.

Returns

ref_seq_lengths – A dictionary whose key is a reference sequence id, and value is the length of that reference sequence.

Return type

dict[str -> int]

merge(bam_files, out_bam)

Merge many bam files from different indexes into one, taking the reads with the highest mapping quality from each bam.

Parameters
  • bam_files (list[str]) – A list of bam files to merge.

  • out_bam (str) – Location to store the merged bam file.

process_bam(bam_file, max_alignments=10)

Extract read coordinates along each reference sequence.

Parameters
  • bam_file (str) – Path to a bam file.

  • max_alignments (int) – Bounds the number alignments to consider for multi-mapped reads. Reads greater than or equal to this threshold are discarded.

Returns

coverage_maps – A dictionary whose key is a reference genome id, and value is a CoverageMap.

Return type

dict[str -> CoverageMap]

class src.coptr.bam_processor.CoverageMap(bam_file, genome_id, is_assembly)

Data structure to store read positions along reference genomes.

Parameters
  • bam_file (str) – Bam file used to construct coverage map

  • genome_id (str) – Unique ID of the reference genome

  • is_assembly (bool) – If true, the reference genome is an assembly (and has multiple contigs). Otherwise is a complete reference genome with a single contig.

class src.coptr.bam_processor.CoverageMapContig(bam_file, genome_id, contig_read_positions, contig_lengths)

Data structure to store read positions from assemblies.

Parameters
  • bam_file (str) – Bam file used to construct coverage map

  • genome_id (str) – Unique ID of the reference genome

  • contig_read_positions (dict[str -> list]) – A dictionary whose key is a contig id (sequence id), and value is a list of coordinates from the reads along that contig.

  • contig_lengths (dict[str -> int]) – A dictionary whose key is a contig id (sequence id), and value is the length of that contig.

bin_reads(contig_id)

Count reads in bins.

Parameters

contig_id (str) – Reference sequence id of the contig

Returns

binned_reads – Total number of reads in each 10Kb window

Return type

numpy.array

compute_bin_size()

Compute bin size for read counts.

Returns

bin_size – Bin size for read counts

Return type

int

count_reads()

Return number of mapped reads.

Returns

nreads – Number of reads

Return type

int

get_length(contig_id)

Get the length of a contig.

Parameters

contig_id (str) – Reference sequence id of the contig

Returns

length – The length of the contig

Return type

int

get_reads(contig_id)

Get the coordinates of all reads for a contig.

Parameters

contig_id (str) – Reference sequence id of the contig

Returns

reads – The coordinates of each read along the contig

Return type

numpy.array

passed_qc()

Run basic quality control checks. Sets the passed_gc_flag attribute.

class src.coptr.bam_processor.CoverageMapRef(bam_file, genome_id, read_positions, length)

Data structure to store read positions from complete reference genomes.

Parameters
  • bam_file (str) – Bam file used to construct coverage map

  • genome_id (str) – Unique ID of the reference genome

  • read_positions (list[int]) – A list of coordinates, one per read.

count_reads()

Return number of mapped reads.

Returns

nreads – Number of reads

Return type

int

get_length()

Get the lenth of the genome.

Returns

length – The length of the reference genome

Return type

int

get_reads()

Get the read coordinates along the reference genome.

Returns

reads – A numpy array containing the coordinates of each read

Return type

numpy.array

class src.coptr.bam_processor.ReadContainer

Container to store read positions and metadata.

check_add_mapping(query_id, ref_name, ref_position, score)

Stores a mapping if it has an alignment score greater than or equal to the mappings seen so far.

Parameters
  • query_id (str) – The name (identifier) of the read.

  • ref_name (str) – The name of the reference sequence the read maps to.

  • ref_position (int) – The location in the reference sequence the read maps to.

  • score (float) – The alignment score

get_mappings(query_id)

Return the highest scoring mappings for a read.

Parameters

query_id (str) – The name of the read.

Returns

  • ref_names (list[str]) – A list of the reference sequences the read maps to.

  • ref_positions (list[int]) – A list of positions in each reference sequences.

coptr_contig.py

Estimate peak-to-trough ratios from assemblies.

class src.coptr.coptr_contig.CoPTRContig(min_reads, min_samples)

Estimate PTRs from draft assemblies.

Parameters
  • min_reads (float) – Minumum read count after filtering to estimate PTR

  • min_samples (float) – Minimum number of passing samples required to estimate a PTR

compute_genomewide_bounds(binned_counts, crit_region=0.05)

Compute bounds on read counts in bins.

Parameters

binned_counts (np.array) – Binned read counts

Returns

  • lower_bound (float) – A lower bound on read counts in bins

  • upper_bound (float) – An upper bound on read counts in bins

compute_log_likelihood(log2_ptr, binned_counts)

Compute the log likelihood of reordered bin probabilities given a log2(PTR).

Parameters
  • log2_ptr (float) – The Log2(PTR)

  • binned_counts (np.array) – Binned read counts along the genome

Returns

log_lk – The log likelihood

Return type

float

construct_coverage_matrix(coverage_maps)

Aggregate coverage maps into a bin by sample matrix.

Parameters

coverage_maps (list[CoverageMapContig]) – A list of coverage maps per sample.

Returns

A – The resulting coverage matrix.

Return type

2D np.array of float

estimate_ptr_maximum_likelihood(binned_counts)

Compute maximum likelihood estimates of log2(PTR) gived observed reads.

Parameters

binned_counts (np.array) – Binned read counts for a genome

Returns

  • log2_ptr (float) – The maximum likelihood log2(PTR) estimate

  • intercept (float) – Probability a reads is in the leftmost bin

  • log_lk (float) – The log likelihood

estimate_ptrs(coverage_maps, return_bins=False)

Estimate PTRs across multiple samples of the same reference genome.

Parameters
  • coverage_maps (list[CoverageMapContig]) – The coverage maps for the assembly

  • min_reads (int) – Minimum number of reads after filtering to estimate a PTR

  • min_samples (int) – Minimum number of samples required to reorder bins using PCA. No estimate will be produced if the number of passing samples is less than min_samples.

  • return_bins (bool) – If true, returns estimated slopes, intercepts, and binned read counts in addition to estimates

Returns

  • estimates (list[CoPTRContigEstimate]) – A list of estimated PTRs

  • parameters (list[tuple(2)]) – If return_bins is true, this is a list of tuples with first coordinate the stimated slope, and second coordinate the intercept

  • binned_counts list[np.array] – If return_bins is true, this is a list of the binned 10Kb read counts after filtering steps.

estimate_slope_intercept(bin_log_probs)

Estimate the slope along reordering bins using least squares.

Parameters

bin_log_probs (np.array) – The observed probability that a read lands in a bin.

Returns

  • m (float) – The slope

  • b (float) – The intercept

class src.coptr.coptr_contig.CoPTRContigEstimate(bam_file, genome_id, sample_id, estimate, nreads, cov_frac)

Data structure to store CoPTRContig estimates.

Parameters
  • bam_file (str) – The bam file from which the estimate was obtained

  • genome_id (str) – The reference genome id for the estimate

  • sample_id (str) – The sample id for the estimate

  • estimate (float or np.nan) – The estimated PTR. This will be set to np.nan if the sample did not pass filtering steps

  • nreads (int) – If the sample passing filtering steps (i.e. has a PTR estimate) this is the number of filtered reads used for the estimate. Otherwise it is the number of reads before filtering.

  • cov_frac (float) – The fraction of nonzero bins in 10Kb windows.

src.coptr.coptr_contig.estimate_ptrs_coptr_contig(assembly_genome_ids, grouped_coverage_map_folder, min_reads, min_samples, plot_folder=None)

Estimate Peak-to-Trough ratios across samples.

Parameters
  • assembly_genome_ids (set[str]) – A list of genome ids to estimate PTRs.

  • grouped_coverage_map_folder (str) – Path to folder containing coverage maps grouped by genome.

  • min_reads (float) – Minumum read count after filtering to estimate PTR

  • min_samples (float) – Minimum number of passing samples required to estimate a PTR

  • threads (int) – Number of threads for parallel computation

  • plot_folder (str) – If specified, plots of fitted model will be saved here

Returns

coptr_contig_estimates – A dictionary with key reference genome id and value a list of CoPTRContigEstimate for that reference genome.

Return type

dict[str -> list[CoPTContigEstimate]]

coptr_ref.py

Estimate peak-to-trough ratios using complete reference genomes.

class src.coptr.coptr_ref.CoPTRRef(min_reads, min_cov)

CoPTR estimator for complete reference genomes.

Parameters
  • min_reads (float) – Minumum read count after filtering to estimate PTR

  • frac_nonzero (float) – Fraction of nonzero 10Kb bins required to estimate PTR

compute_log_likelihood(ori_loc, ter_loc, log2_ptr, read_locs)

Model log-likelihood for one sample.

Parameters
  • ori_loc (float) – Position of replication origin in the interval [0, 1]

  • ter_loc (float) – Position of terminus location in the interval [0, 1]

  • log2_ptr (float) – The log base 2 peak-to-trough ratio

  • read_locs (np.array of float) – A 1D numpy array giving read positions in [0, 1]

estimate_ptr(read_positions, ref_genome_len, filter_reads=True, estimate_terminus=False)

Estimate the PTR for a single sample.

Parameters
  • read_positions (np.array) – A 1D numpy array giving the starting coordinates of each read

  • ref_genome_length (int) – The length of the refernce genome

  • filter_reads (bool) – If true, reads a filtered before estimates

  • estimate_terminus – If true, the replication terminus is estimated in addition to the replication origin

Returns

  • log2_ptr (float) – The estimated log base 2 PTR

  • ori (float) – The estimated origin in position in [0, 1]

  • ter (float) – The estimated replication terminus position in [0, 1]

  • log_lk (float) – The model log likelihood

estimate_ptrs(coverage_maps)

Compute maximum likelihood PTR estimates across samples.

Parameters

coverage_maps (list[CoverageMapRef]) – A list of coverage maps per sample

Returns

estimates – A list of estimates per sample

Return type

list[CoPTRefEstimate]

update_ori_ter(log2_ptrs, read_locs_list)

Compute maximum likelihood ori estimates across samples given estimated log2(PTR)s across samples.

Parameters
  • log2_ptrs (list[float]) – List of estimated log2(PTR)s across samples

  • list[np.array] (read_locs_list) – List of read positions in [0, 1] across samples

Returns

  • ori (float) – The estimated replication origin in [0, 1]

  • ter (float) – The estimated replication terminus in [0, 1]

update_ptrs(ori, ter, prv_log2_ptrs, read_locs_list)

Compute maximum likelihood PTR estimates across samples given an estimated replication origin and terminus.

Parameters
  • ori (float) – The estimated origin in position in [0, 1]

  • ter (float) – The estimated replication terminus position in [0, 1]

  • prv_log2_ptrs (list[float]) – Estimated log2(PTR) across samples

  • list[np.array] (read_locs_list) – Read positions in [0, 1] across samples

Returns

log2_ptrs – The updated log2(PTR) estimates

Return type

list[float]

class src.coptr.coptr_ref.CoPTRRefEstimate(bam_file, genome_id, sample_id, estimate, ori_estimate, ter_estimate, nreads, cov_frac)

Data structure to store CoPTRRef estimates.

Parameters
  • bam_file (str) – The bam file from which the estimate was obtained

  • genome_id (str) – The reference genome id for the estimate

  • sample_id (str) – The sample id for the estimate

  • estimate (float or np.nan) – The estimated log2(PTR). This will be set to np.nan if the sample did not pass filtering steps

  • ori_estimate (float) – Estimated replication origin position in interval [0, 1]

  • ter_estimate (float) – Estimated replication terminus position in interval [0, 1]

  • nreads (int) – The number of reads remaining after filtering

  • cov_frac (float) – The fraction of nonzero bins in 10Kb windows.

class src.coptr.coptr_ref.QCResult(frac_nonzero, nreads, frac_removed, passed_qc)

Data structure to store results of quality checking.

Parameters
  • frac_nonzero (float) – Fraction of nonzero bins in 10Kb windows

  • nreads (int) – Number of reads after filtering

  • frac_removed (float) – Proportion of genome removed during filtering

  • passed_qc (bool) – Flag indicating if the sample passed quality thresholds

class src.coptr.coptr_ref.ReadFilterRef(min_reads, min_cov)

Read filtering steps for CoPTR Ref.

Parameters
  • min_reads (float) – Minumum read count after filtering

  • frac_nonzero (float) – Fraction of nonzero 10Kb bins required

bin_reads(read_positions, genome_length, bin_size=10000)

Aggregate reads into bin_size windows and compute the read count in each window.

Parameters
  • read_positions (np.array of int) – The coordinate of the start position of each read

  • genome_length (int) – The length of the reference genome

Returns

bin_counts – An array of the read count in each bin

Return type

np.array

compute_bin_bounds(binned_counts)

Compute bounds on read counts in a smaller window along the genome.

Parameters

binned_counts (np.array) – Binned read counts

Returns

  • lower_bound (float) – A lower bound on read counts in bins

  • upper_bound (float) – An upper bound on read counts in bins

compute_bin_size(genome_length)

Compute bin size for read counts.

Parameters

genome_length (int) – Length of the reference genome

Returns

bin_size – Bin size for read countns

Return type

int

compute_genomewide_bounds(binned_counts)

Compute bounds on read counts in bins.

Parameters

binned_counts (np.array) – Binned read counts

Returns

  • lower_bound (float) – A lower bound on read counts in bins

  • upper_bound (float) – An upper bound on read counts in bins

compute_rolling_sum(read_positions, genome_length, bin_size)

Compute a rolling sum of read counts in 10Kb bins, sliding 1Kb at a time.

Parameters
  • read_positions (np.array or list of int) – The coordinate of the start positions of each read along the genome.

  • genome_length (float) – The length of the reference genome

Returns

  • rolling_counts (np.array of float) – The read count in each rolling bin

  • endpoints (np.array of tuple) – Each tuple gives the left (inclusive) and right (exclusive) end point of each bin.

filter_reads(read_positions, genome_length)

Filter out reads in regions of the genome where the coverage is either too high or too low.

Parameters
  • read_positions (np.array or list of int) – The starting coordinate of each read along the genome

  • genome_length (float) – The length of the reference genome

Returns

  • filtered_read_positions (np.array of int) – The starting position of each read passing filtering criteria

  • filtered_genome_length (float) – The length of the genome with filtered regions removed

  • passed_qc (bool) – A flag indicating if the sample passed quality control metrics

filter_reads_phase1(read_positions, genome_length, bin_size)

A coarse-grained genomewide filter that removes reads in ultra-high or ultra-low coverage regions.

Parameters
  • read_positions (np.array of int) – The coordinate of the start position of each read

  • genome_length (int) –

Returns

  • read_positions (np.array of int) – The starting position of each read in an unfiltered region, adjusted for the positions removed.

  • new_genome_length (float) – The length of the genome with filtered regions removed

filter_reads_phase2(read_positions, genome_length, bin_size)

A fine-grained filter that removes reads in localized regions with too-high or too-low coverage. For each 10Kb, looks 6.25% of the genome length ahead and 6.25% of the genome length behind.

Parameters
  • read_positions (np.array of int) – The coordinate of the start position of each read

  • genome_length (int) – The length of the reference genome

Returns

  • read_positions (np.array of int) – The starting position of each read in an unfiltered region, adjusted for the positions removed.

  • new_genome_length (float) – The length of the genome with filtered regions removed

quality_check(filtered_read_positions, filtered_genome_length, original_genome_length, bin_size, min_reads, frac_nonzero)

A basic quality check for required coverage of a genome in a sample.

Parameters
  • filtered_read_positions (np.array or list of int) – The starting coordinate of each read along the genome after filtering

  • filtered_genome_length (float) – The length of the reference genome after filtering

  • original_genome_length (float) – The length of the reference genome before filtering

  • min_reads (float) – Minumum read count after filtering

  • frac_nonzero (float) – Fraction of nonzero 10Kb bins required

Returns

qc_result – Object that stores the result of the quality check

Return type

QCResult

remove_reads_by_region(read_positions, genome_length, regions)

Remove reads that overlap a region in regions.

Parameters
  • read_positions (np.array of int) – The coordinate of the start position of each read

  • genome_length (int) – The length of the reference genome

  • regions (list of tuple of int) – Each tuple gives an interval in the genome to remove

Returns

  • read_positions (np.array of int) – Update coordiantes for each read after removing the regions in region

  • new_genome_length (float) – The length of the genome once each region is regions is removed.

src.coptr.coptr_ref.estimate_ptrs_coptr_ref(ref_genome_ids, grouped_coverage_map_folder, min_reads, min_cov, plot_folder=None, mem_limit=None)

Estimate Peak-to-Trough ratios across samples.

Parameters
  • ref_genome_ids (set[str]) – A list of genome ids to estimate PTRs.

  • grouped_coverage_map_folder (str) – Path to folder containing coverage maps grouped by genome.

  • min_reads (float) – Minumum read count after filtering to estimate PTR

  • frac_nonzero (float) – Fraction of nonzero 10Kb bins required to estimate PTR

  • threads (int) – Number of threads for parallel computation

  • plot_folder (str) – If not None, plots of fitted model are generated and saved here

  • mem_limit (int) – Limit amount of coverage maps loaded into memory at once. Breaks samples per genome into batches. Limit is specified in GB. Works only in single-threaded mode.

Returns

coptr_ref_estimates – A dictionary with key reference genome id and value a list of CoPTRRefEstimate for that reference genome.

Return type

dict[str -> list[CoPTRefEstimate]]

poisson_pca.py

Dimensionality reduction with Poisson loss function.

class src.coptr.poisson_pca.PoissonPCA

Principal Component Analysis with Poisson observations.

grad_W(X, W, V)

Compute the gradient of the log likelihood wrt W.

Parameters
  • X (np.array) – An n x m data matrix

  • W (np.array) – An n x k matrix

  • V (np.array) – A k x m matrix

log_likelihood(X, W, V)

Compute the log likelihood (minus a constant).

Parameters
  • X (np.array) – An n x m data matrix

  • W (np.array) – An n x k matrix

  • V (np.array) – A k x m matrix

pca(X, k, tol=0.001)

Run the PCA. For a data matrix X (sample by observations), finds two matrices W and V such that:

X_{ij} ~ Poisson(exp( [WV]_{ij} ))

Note that missing entries are possible. Missing entries should be denoted by np.nan.

Parameters
  • X (np.array) – A sample by observation data matrix with observed counts

  • k (int) – Number of latent dimensions. Determines the rank of W and V.

update_V(X, W, V)

Optimize the log-likelihood with respect to V.

Parameters
  • X (np.array) – The data matrix

  • W (np.array) – The current estimate of W

  • V (np.array) – The current estimate of V

update_W(X, W, V)

Optimize the log-likelihood with respect to W.

Parameters
  • X (np.array) – The data matrix

  • W (np.array) – The current estimate of W

  • V (np.array) – The current estimate of V

read_mapper.py

Module to map reads to a reference database of high-quality assemblies and complete reference genomes.

class src.coptr.read_mapper.ReadMapper

Wrapper around bowtie2.

index(ref_fasta, index_out, bt2_bmax, bt2_dcv, bt2_threads, bt2_packed)

Build a bowtie2 index from ref_fasta.

Parameters
  • ref_fasta (str) – Fasta file or folder containing fasta files to index. Valid extensions include ‘.fasta’, ‘.fna’, ‘.fa’

  • index_out (str) – Path to output the index.

  • bt2_bmax (str) – Parameter to pass to bowtie2-build –bmax argument.

  • bt2_dcv (str) – Parameter to pass to bowtie2-build –dcv argument.

  • bt2_threads (str) – Parameter to pass to bowtie2-build –threads argument.

  • bt2_packed (str) – Parameter to pass to bowtie2-build –packed argument.

map(index, inputf, outfolder, paired, threads, bt2_k)

Map reads from infile against reference database using bowtie2, then convert to a bam file.

Parameters
  • index (str) – Path of the database index to map against

  • inputf (str) – File or folder with the reads to map

  • outfolder (str) – Folder to save bam files.

  • paired (bool) – True for paired end sequencing.

  • threads (int) – Number of threads to use. Passed to the -p argument for bowtie2.

  • bt2_k (int) – Number of alignments to report

read_assigner.py

Assign multi-mapped reads to a single genome.

class src.coptr.read_assigner.ReadAssigner(X, prior_counts)

Assign multi-mapped reads using a mixture model.

Parameters

X (np.array) – A read by genome data matrix (0-1). X[i,j] = 1 if read i maps to genome j

assign_reads()

Compute read assignments.

Returns

assignments – A 1D array. Each entry gives a genome id that is the genome assignment for each read.

Return type

np.array

compute_elbo()

Compute the variational objective function.

run_vi(tol=0.001)

Optimize variational parameters with respect to elbo.

update_eta()

Optimize eta with fixed phi.

update_phi()

Optimize phi with fixed eta.

util.py

Miscellaneous utility functions.

src.coptr.util.get_fastq_name(fpath)

Remove the path and extension from a fastq file.

Parameters

fpath (str) – path to .fastq, fastq.gz, .fq, or fq.gz file

Returns

fname – the name of the file with its path and extension removeed

Return type

str