snpio.filtering.nremover2.NRemover2
- class snpio.filtering.nremover2.NRemover2(genotype_data)
A class for filtering alignments based on various criteria.
The class provides methods for filtering genetic data alignments by user-defined criteria. Sequences (samples) and loci (columns) can be filtered by missing data thresholds, and loci can also be filtered by minor allele frequency or minor allele count. It can also remove monomorphic sites, singletons, loci with more than two alleles, and linked loci. Additional functionality includes thinning loci within a specified distance, randomly subsetting loci, and plotting filtering results.
- Key features:
Search for optimal filtering thresholds and plot results using search_thresholds.
Thin loci within a specified distance using thin_loci.
Randomly subset loci using random_subset_loci.
Filter linked loci using the VCF CHROM field with filter_linked.
Plot a Sankey diagram of loci removed at each step using plot_sankey_filtering_report.
Print a summary of filtering results with print_filtering_report.
Remove monomorphic sites using filter_monomorphic.
Remove loci where the only variant is a singleton using filter_singletons.
Remove loci with more than two alleles using filter_biallelic.
Filter loci by minor allele frequency (filter_maf) or count (filter_mac).
Filter loci with excessive missing data using filter_missing.
Filter samples with excessive missing data using filter_missing_sample.
Filter loci with excessive missing data in specific populations using filter_missing_pop.
Example
>>> from snpio import VCFReader
>>> from snpio.filtering.nremover2 import NRemover2
>>>
>>> # Specify the genetic data from a VCF file
>>> vcf_file = "snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz"
>>>
>>> # Specify the population map file
>>> popmap_file = "snpio/example_data/popmaps/phylogen_nomx.popmap"
>>>
>>> # Read the genetic data from the VCF file
>>> gd = VCFReader(filename=vcf_file, popmapfile=popmap_file)
>>>
>>> # Initialize the NRemover2 class with the GenotypeData instance
>>> nrm = NRemover2(gd)
>>>
>>> # Filter samples and loci.
>>> (
...     nrm.filter_missing_sample(0.75)
...     .filter_missing(0.75)
...     .filter_missing_pop(0.75)
...     .filter_mac(2)
...     .filter_monomorphic(exclude_heterozygous=False)
...     .filter_singletons(exclude_heterozygous=False)
...     .filter_biallelic(exclude_heterozygous=False)
...     .resolve()
... )
>>>
>>> # Plot the Sankey diagram showing the number of loci removed at each filtering step.
>>> nrm.plot_sankey_filtering_report()
>>>
>>> # Run a threshold search and plot the results.
>>> nrm.search_thresholds(
...     thresholds=[0.1, 0.2, 0.3, 0.4, 0.5],
...     maf_thresholds=[0.01, 0.05, 0.1],
...     mac_thresholds=[2, 3, 4, 5],
...     filter_order=[
...         "filter_missing_sample",
...         "filter_missing",
...         "filter_missing_pop",
...         "filter_maf",
...         "filter_mac",
...         "filter_monomorphic",
...         "filter_singletons",
...         "filter_biallelic",
...     ],
... )
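The chained call in the example works because each filter method returns an object on which the next method can be called, and resolve() ends the chain. The following is a toy sketch of that pattern, not SNPio's implementation: the class ToyFilterChain and its methods keep_at_least and keep_at_most are hypothetical, and only the flag names _chain_active and _chain_resolved are borrowed from the attribute list. Whether NRemover2 applies each filter immediately or defers it until resolve() is not stated here; the sketch defers purely for illustration.

```python
class ToyFilterChain:
    """Toy stand-in for a chainable filter object (hypothetical, not SNPio code)."""

    def __init__(self, values):
        self.values = list(values)
        self._pending = []            # filters queued by the chain
        self._chain_active = False
        self._chain_resolved = False

    def keep_at_least(self, minimum):
        """Queue a filter that keeps values >= minimum, then return self."""
        self._chain_active = True
        self._pending.append(lambda v: v >= minimum)
        return self                   # returning self is what enables chaining

    def keep_at_most(self, maximum):
        """Queue a filter that keeps values <= maximum, then return self."""
        self._chain_active = True
        self._pending.append(lambda v: v <= maximum)
        return self

    def resolve(self):
        """Apply the queued filters in order and close the chain."""
        kept = self.values
        for predicate in self._pending:
            kept = [v for v in kept if predicate(v)]
        self._pending.clear()
        self._chain_active = False
        self._chain_resolved = True
        return kept


chain = ToyFilterChain([1, 5, 8, 12, 20])
result = chain.keep_at_least(5).keep_at_most(12).resolve()
print(result)  # [5, 8, 12]
```

Forgetting the final resolve() in this sketch leaves the filters queued and unapplied, which is the kind of state a flag such as _chain_active can be used to detect.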
- genotype_data
An instance of the GenotypeData class.
- Type:
GenotypeData
- _filtering_helper
An instance of the FilteringHelper class.
- Type:
FilteringHelper
- _filtering_methods
An instance of the FilteringMethods class.
- Type:
FilteringMethods
- df_sample_list
A list of DataFrames containing filtering results for samples.
- Type:
List[pd.DataFrame]
- df_global_list
A list of DataFrames containing global filtering results.
- Type:
List[pd.DataFrame]
- _chain_active
A boolean flag indicating whether a filtering chain is active.
- Type:
bool
- _chain_resolved
A boolean flag indicating whether a filtering chain has been resolved.
- Type:
bool
- _search_mode
A boolean flag indicating whether the search mode is active.
- Type:
bool
- debug
A boolean flag indicating whether to enable debug mode.
- Type:
bool
- verbose
A boolean flag indicating whether to enable verbose mode.
- Type:
bool
- logger
An instance of the Logger class for logging messages.
- Type:
Logger
- alignment
The input alignment to filter.
- Type:
np.ndarray
- populations
The population for each sequence in the alignment.
- Type:
List[str]
- samples
The sample IDs for each sequence in the alignment.
- Type:
List[str]
- prefix
The prefix for the output files.
- Type:
str
- popmap
A dictionary mapping sample IDs to population names.
- Type:
Dict[str, str | int]
- popmap_inverse
A dictionary mapping population names to lists of sample IDs.
- Type:
Dict[str | int, List[str]]
- sample_indices
A boolean array indicating which samples to keep.
- Type:
np.ndarray
- loci_indices
A boolean array indicating which loci to keep.
- Type:
np.ndarray
- loci_removed_per_step
A dictionary tracking the number of loci removed at each filtering step.
- Type:
Dict[str, Tuple[int, int]]
- samples_removed_per_step
A dictionary tracking the number of samples removed at each filtering step.
- Type:
Dict[str, Tuple[int, int]]
- kept_per_step
A dictionary tracking the number of loci or samples kept at each filtering step.
- Type:
Dict[str, Tuple[int, float]]
- step_index
The current step index in the filtering process.
- Type:
int
- current_threshold
The current threshold for missing data.
- Type:
float
- original_loci_count
The original number of loci in the alignment.
- Type:
int
- original_sample_count
The original number of samples in the alignment.
- Type:
int
- original_loci_indices
A boolean array indicating the original loci indices.
- Type:
np.ndarray
- original_sample_indices
A boolean array indicating the original sample indices.
- Type:
np.ndarray
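To make the shapes of popmap, popmap_inverse, sample_indices, and loci_indices concrete, here is a small library-free sketch. It uses plain Python lists instead of np.ndarray, and the sample IDs, population names, mask values, and the "N" missing-data code are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample-to-population map, shaped like the popmap attribute.
popmap = {"s1": "east", "s2": "east", "s3": "west", "s4": "west"}

# Invert it into population -> list of sample IDs (the popmap_inverse shape).
popmap_inverse = defaultdict(list)
for sample_id, population in popmap.items():
    popmap_inverse[population].append(sample_id)
popmap_inverse = dict(popmap_inverse)

# Toy alignment: one row per sample, one column per locus ("N" = missing, assumed).
samples = list(popmap)
alignment = [
    ["A", "C", "N", "G"],
    ["A", "N", "N", "G"],
    ["T", "C", "N", "G"],
    ["A", "C", "T", "N"],
]

# Boolean keep masks in the spirit of sample_indices and loci_indices.
sample_keep = [True, False, True, True]
loci_keep = [True, True, False, True]

# Apply the masks: drop unkept rows first, then unkept columns within each row.
kept_samples = [s for s, keep in zip(samples, sample_keep) if keep]
kept_alignment = [
    [call for call, keep in zip(row, loci_keep) if keep]
    for row, keep_row in zip(alignment, sample_keep)
    if keep_row
]

print(popmap_inverse)   # {'east': ['s1', 's2'], 'west': ['s3', 's4']}
print(kept_samples)     # ['s1', 's3', 's4']
print(kept_alignment)   # [['A', 'C', 'G'], ['T', 'C', 'G'], ['A', 'C', 'N']]
```

With NumPy arrays the same selection would typically be written with boolean indexing on rows and columns; the list version is used here only to keep the sketch dependency-free.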
- filter_missing()
Filters out loci (columns) from the alignment that have more than a given proportion of missing data.
- filter_missing_pop()
Filters out loci (columns) from the alignment based on the proportion of missing data per population.
- filter_missing_sample()
Filters out samples from the alignment that have more than a given proportion of missing data.
- filter_maf()
Filters out loci (columns) where the minor allele frequency is below the threshold.
- filter_monomorphic()
Filters out monomorphic sites.
- filter_singletons()
Filters out loci (columns) where the only variant is a singleton.
- filter_biallelic()
Filters out loci (columns) that have more than two alleles.
- filter_linked()
Filters out linked loci using the VCF CHROM field.
- thin_loci()
Thin out loci within a specified distance of each other.
- random_subset_loci()
Randomly subset the loci (columns) in the SNP dataset.
- filter_allele_depth()
Filters out loci based on the total allele depth.
- search_thresholds()
Plots the proportion of missing data against the filtering thresholds.
- plot_sankey_filtering_report()
Makes a Sankey plot showing the number of loci removed at each filtering step.
- print_filtering_report()
Prints a summary of the filtering results.
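The missing-data filters listed above share one rule: an item is removed when its proportion of missing calls is greater than the threshold. A minimal sketch of that rule on a toy alignment follows. The helper names missing_proportion, loci_to_keep, and samples_to_keep are hypothetical, and "N" as the missing-data code is an assumption of this sketch rather than a statement about SNPio's internals.

```python
MISSING = "N"  # assumed missing-data code for this sketch


def missing_proportion(calls):
    """Fraction of calls equal to the missing-data code."""
    return sum(call == MISSING for call in calls) / len(calls)


def loci_to_keep(alignment, threshold):
    """Keep a locus (column) unless its missing proportion exceeds threshold."""
    columns = zip(*alignment)  # transpose rows (samples) into columns (loci)
    return [missing_proportion(col) <= threshold for col in columns]


def samples_to_keep(alignment, threshold):
    """Keep a sample (row) unless its missing proportion exceeds threshold."""
    return [missing_proportion(row) <= threshold for row in alignment]


alignment = [
    ["A", "C", "N", "G"],
    ["A", "N", "N", "G"],
    ["T", "C", "N", "N"],
    ["A", "C", "T", "N"],
]

# Per-locus missing proportions are 0.0, 0.25, 0.75, 0.5.
print(loci_to_keep(alignment, 0.5))      # [True, True, False, True]

# Per-sample missing proportions are 0.25, 0.5, 0.5, 0.25.
print(samples_to_keep(alignment, 0.25))  # [True, False, False, True]
```

Note that a locus sitting exactly at the threshold (0.5 in the first call) is kept, because removal requires the proportion to be strictly greater than the threshold.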
- __init__(genotype_data)
Initializes the NRemover2 class.
This method initializes the NRemover2 class with the provided GenotypeData instance. It sets up the filtering state, including the alignment, sample indices, loci indices, and other relevant attributes. It also initializes the filtering helper and methods.
- Parameters:
genotype_data (GenotypeData) – An instance of the GenotypeData class containing the genetic data alignment, population map, populations, and other relevant data.
Methods
__init__(genotype_data): Initializes the NRemover2 class.
filter_allele_depth([min_total_depth]): Filter loci where total allele depth (AD) across all retained samples is below threshold.
filter_biallelic([exclude_heterozygous]): Retain only biallelic loci and remove those with more than two alleles.
filter_het(threshold): Filters out loci (columns) with missing data proportion greater than the specified threshold.
filter_het_pop(threshold): Filters loci (columns) based on missing data per population.
filter_linked([seed]): Filter to retain only one locus per chromosome/scaffold from a VCF file.
filter_mac(min_count[, exclude_heterozygous]): Filters loci where the minor allele count is below the given minimum count.
filter_maf(threshold[, exclude_heterozygous]): Filters loci where the minor allele frequency is below the threshold.
filter_missing(threshold): Filters out loci (columns) with missing data proportion greater than the specified threshold.
filter_missing_pop(threshold): Filters loci (columns) based on missing data per population.
filter_missing_sample(threshold): Remove samples with missing data proportion > threshold.
filter_monomorphic([exclude_heterozygous]): Filter out monomorphic loci with only one valid allele.
filter_singletons([exclude_heterozygous]): Filter out singleton loci (minor allele count == 1, with ≥2 alleles present).
plot_sankey_filtering_report([filename]): Plots a Sankey diagram showing the number of loci removed at each filtering step.
random_subset_loci(size[, seed]): Randomly subset loci from the current alignment.
resolve([benchmark_mode]): Resolve the method chain and finalize the filtering process.
search_thresholds([thresholds, ...]): Search across filtering thresholds and plot the proportions.
thin_loci(size[, remove_all, seed]): Thin loci that are within size base pairs of each other.
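Several of the methods above (filter_mac, filter_maf, filter_monomorphic, filter_singletons, filter_biallelic) depend on allele counts at a locus. The sketch below shows how those quantities relate on one toy locus. It is not SNPio's code: the function summarize_locus is hypothetical, genotypes are assumed to be diploid tuples with None for a missing call, the minor allele is taken to be the rarest observed allele, and the exclude_heterozygous option is not modeled.

```python
from collections import Counter


def summarize_locus(genotypes):
    """Count alleles at one locus; each genotype is a 2-tuple or None if missing."""
    counts = Counter(
        allele
        for genotype in genotypes
        if genotype is not None       # skip missing calls
        for allele in genotype
    )
    total = sum(counts.values())
    n_alleles = len(counts)
    # Minor allele count: rarest observed allele, or 0 if fewer than two alleles.
    mac = min(counts.values()) if n_alleles >= 2 else 0
    return {
        "n_alleles": n_alleles,
        "mac": mac,
        "maf": mac / total if total else 0.0,
        "monomorphic": n_alleles <= 1,
        "singleton": n_alleles >= 2 and mac == 1,
        "more_than_two_alleles": n_alleles > 2,
    }


# Four called samples and one missing call: allele A appears 7 times, G once.
locus = [("A", "A"), ("A", "G"), ("A", "A"), None, ("A", "A")]
summary = summarize_locus(locus)
print(summary["mac"], summary["maf"], summary["singleton"])  # 1 0.125 True
```

Under these assumptions, a minimum count of 2 (as in filter_mac(2)) would remove this locus because its minor allele count of 1 is below 2, and a singleton filter would flag it as well.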
Attributes
loci_indices: Gets the current loci_indices.
sample_indices: Gets the current sample_indices.
search_mode: Gets the current search mode status.