snpio.NRemover2

class snpio.NRemover2(genotype_data)

A class for filtering alignments based on various criteria.

The class provides methods for filtering genetic data alignments by user-defined criteria. Sequences (samples) and loci (columns) can be filtered by missing data thresholds, and loci can additionally be filtered by minor allele frequency or minor allele count. The class can also remove monomorphic sites, singletons, loci with more than two alleles, and linked loci. Additional functionality includes thinning loci within a specified distance, randomly subsetting loci, and plotting filtering results.

Key features:
  • Search for optimal filtering thresholds and plot results using search_thresholds.

  • Thin loci within a specified distance using thin_loci.

  • Randomly subset loci using random_subset_loci.

  • Filter linked loci using the VCF CHROM field with filter_linked.

  • Plot a Sankey diagram of loci removed at each step using plot_sankey_filtering_report.

  • Print a summary of filtering results with print_filtering_report.

  • Remove monomorphic sites using filter_monomorphic.

  • Remove loci where the only variant is a singleton using filter_singletons.

  • Remove loci with more than two alleles using filter_biallelic.

  • Filter loci by minor allele frequency (filter_maf) or count (filter_mac).

  • Filter loci with excessive missing data using filter_missing.

  • Filter samples with excessive missing data using filter_missing_sample.

  • Filter loci with excessive missing data in specific populations using filter_missing_pop.
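
The allele-based filters in the list above (filter_monomorphic, filter_singletons, filter_biallelic, filter_maf, filter_mac) all depend on per-locus allele counts. The following is a minimal sketch of that counting, not SNPio's implementation. The names allele_counts, classify_locus, and MISSING are hypothetical, each sample is assumed to contribute one unambiguous base, heterozygous IUPAC codes are not handled, and the second most common allele is treated as the minor allele.

```python
from collections import Counter

# Assumed missing-data codes for this sketch only.
MISSING = {"N", "-", "?", "."}


def allele_counts(column):
    """Count non-missing alleles in one locus (one column of the alignment)."""
    return Counter(base for base in column if base not in MISSING)


def classify_locus(column):
    """Return (n_alleles, minor_allele_count, minor_allele_frequency)."""
    counts = allele_counts(column)
    total = sum(counts.values())
    if len(counts) < 2:
        # Monomorphic or entirely missing: there is no minor allele.
        return len(counts), 0, 0.0
    # Treat the second most common allele as the minor allele.
    mac = counts.most_common(2)[1][1]
    return len(counts), mac, mac / total


monomorphic = classify_locus(list("AAAAAA"))   # one allele only
singleton = classify_locus(list("AAAAAG"))     # minor allele seen once
triallelic = classify_locus(list("AACCGN"))    # three alleles, one missing call
```

Under these assumptions, a monomorphic filter would drop loci with n_alleles < 2, a singleton filter would drop loci whose minor allele count is 1, a biallelic filter would drop loci with n_alleles > 2, and MAF or MAC filters would compare the last two values against a threshold.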

Example

>>> from snpio import NRemover2, VCFReader
>>>
>>> # Specify the genetic data from a VCF file
>>> vcf_file = "snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz"
>>>
>>> # Specify the population map file
>>> popmap_file = "snpio/example_data/popmaps/phylogen_nomx.popmap"
>>>
>>> # Read the genetic data from the VCF file
>>> gd = VCFReader(filename=vcf_file, popmapfile=popmap_file)
>>>
>>> # Initialize the NRemover2 class with the GenotypeData instance
>>> nrm = NRemover2(gd)
>>>
>>> # Filter samples and loci.
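>>> # resolve() finalizes the chain and returns the updated GenotypeData instance.
>>> gd_filt = (
...     nrm.filter_missing_sample(0.75)
...     .filter_missing(0.75)
...     .filter_missing_pop(0.75)
...     .filter_mac(2)
...     .filter_monomorphic(exclude_heterozygous=False)
...     .filter_singletons(exclude_heterozygous=False)
...     .filter_biallelic(exclude_heterozygous=False)
...     .resolve()
... )
>>>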
>>> # Plot the Sankey diagram showing the number of loci removed at each filtering step.
>>> nrm.plot_sankey_filtering_report()
>>>
>>> # Run a threshold search and plot the results.
>>> nrm.search_thresholds(
...     thresholds=[0.1, 0.2, 0.3, 0.4, 0.5],
...     maf_thresholds=[0.01, 0.05, 0.1],
...     mac_thresholds=[2, 3, 4, 5],
...     filter_order=[
...         "filter_missing_sample",
...         "filter_missing",
...         "filter_missing_pop",
...         "filter_maf",
...         "filter_mac",
...         "filter_monomorphic",
...         "filter_singletons",
...         "filter_biallelic",
...     ],
... )

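
As a rough illustration of what a threshold search summarizes, the sketch below sweeps a list of missing-data thresholds over made-up per-locus missing proportions and records the fraction of loci each threshold would retain. The function proportion_retained and the data are hypothetical, and this is not how search_thresholds is implemented.

```python
def proportion_retained(missing_per_locus, threshold):
    """Fraction of loci whose missing-data proportion is at most `threshold`."""
    kept = [m for m in missing_per_locus if m <= threshold]
    return len(kept) / len(missing_per_locus)


# Made-up missing-data proportions for six loci.
missing_per_locus = [0.0, 0.05, 0.15, 0.25, 0.45, 0.9]
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]

# Map each candidate threshold to the fraction of loci it would keep.
sweep = {t: proportion_retained(missing_per_locus, t) for t in thresholds}

for t, kept in sweep.items():
    print(f"threshold={t:.1f} retains {kept:.2f} of loci")
```

Plotting such a table against the thresholds makes it easier to see where a stricter cutoff starts discarding a large share of the data.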
genotype_data

An instance of the GenotypeData class.

Type:

GenotypeData

_filtering_helper

An instance of the FilteringHelper class.

Type:

FilteringHelper

_filtering_methods

An instance of the FilteringMethods class.

Type:

FilteringMethods

df_sample_list

A list of DataFrames containing filtering results for samples.

Type:

List[pd.DataFrame]

df_global_list

A list of DataFrames containing global filtering results.

Type:

List[pd.DataFrame]

_chain_active

A boolean flag indicating whether a filtering chain is active.

Type:

bool

_chain_resolved

A boolean flag indicating whether a filtering chain has been resolved.

Type:

bool

_search_mode

A boolean flag indicating whether the search mode is active.

Type:

bool

debug

A boolean flag indicating whether to enable debug mode.

Type:

bool

verbose

A boolean flag indicating whether to enable verbose mode.

Type:

bool

logger

An instance of the Logger class for logging messages.

Type:

Logger

alignment

The input alignment to filter.

Type:

np.ndarray

populations

The population for each sequence in the alignment.

Type:

List[str]

samples

The sample IDs for each sequence in the alignment.

Type:

List[str]

prefix

The prefix for the output files.

Type:

str

popmap

A dictionary mapping sample IDs to population names.

Type:

Dict[str, str | int]

popmap_inverse

A dictionary mapping population names to lists of sample IDs.

Type:

Dict[str | int, List[str]]

sample_indices

A boolean array indicating which samples to keep.

Type:

np.ndarray

loci_indices

A boolean array indicating which loci to keep.

Type:

np.ndarray

loci_removed_per_step

A dictionary tracking the number of loci removed at each filtering step.

Type:

Dict[str, Tuple[int, int]]

samples_removed_per_step

A dictionary tracking the number of samples removed at each filtering step.

Type:

Dict[str, Tuple[int, int]]

kept_per_step

A dictionary tracking the number of loci or samples kept at each filtering step.

Type:

Dict[str, Tuple[int, float]]

step_index

The current step index in the filtering process.

Type:

int

current_threshold

The current threshold for missing data.

Type:

float

original_loci_count

The original number of loci in the alignment.

Type:

int

original_sample_count

The original number of samples in the alignment.

Type:

int

original_loci_indices

A boolean array indicating the original loci indices.

Type:

np.ndarray

original_sample_indices

A boolean array indicating the original sample indices.

Type:

np.ndarray
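
The attributes above describe popmap_inverse as a population-to-samples mapping, and sample_indices and loci_indices as boolean keep-masks. The sketch below shows how such structures can be built and applied with NumPy. It uses a toy alignment with "N" as the assumed missing-data code and made-up thresholds, and it only mirrors the attribute names for readability. It does not reproduce NRemover2 internals.

```python
import numpy as np

# Toy alignment: rows are samples, columns are loci; "N" marks missing data.
alignment = np.array([
    list("ACGN"),
    list("ANGN"),
    list("NNNN"),
    list("ACGT"),
])
popmap = {"s1": "east", "s2": "east", "s3": "west", "s4": "west"}

# Invert the sample -> population map into population -> list of samples.
popmap_inverse = {}
for sample, pop in popmap.items():
    popmap_inverse.setdefault(pop, []).append(sample)

missing = alignment == "N"

# Keep samples whose proportion of missing loci is at most 0.75.
sample_indices = missing.mean(axis=1) <= 0.75

# Among the kept samples, keep loci with at most 0.5 missing data.
loci_indices = missing[sample_indices].mean(axis=0) <= 0.5

# Apply both masks: first select rows (samples), then columns (loci).
filtered = alignment[sample_indices][:, loci_indices]
```

Keeping masks rather than repeatedly slicing the alignment lets several filters be combined before any data are copied.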

filter_missing()

Filters out loci (columns) from the alignment that have more than a given proportion of missing data.

filter_missing_pop()

Filters out loci (columns) from the alignment that have more than a given proportion of missing data in a specific population.

filter_missing_sample()

Filters out samples from the alignment that have more than a given proportion of missing data.

filter_maf()

Filters out loci (columns) where the minor allele frequency is below the threshold.

filter_monomorphic()

Filters out monomorphic sites.

filter_singletons()

Filters out loci (columns) where the only variant is a singleton.

filter_biallelic()

Filters out loci (columns) that have more than two alleles.

filter_linked()

Filters out linked loci using the VCF CHROM field.

thin_loci()

Thins out loci within a specified distance of each other.

random_subset_loci()

Randomly subsets the loci (columns) in the SNP dataset.

filter_allele_depth()

Filters out loci based on the total allele depth.

search_thresholds()

Searches across filtering thresholds and plots the resulting proportions.

plot_sankey_filtering_report()

Makes a Sankey plot showing the number of loci removed at each filtering step.

print_filtering_report()

Prints a summary of the filtering results.

resolve()

Finalizes the method chain and returns the updated GenotypeData instance.
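
The chain-then-resolve pattern described above can be sketched with a small stand-in class. ToyFilterChain, filter_min, and filter_even are hypothetical names invented for this illustration. The sketch only shows the general idea of filter methods that return self and a final resolve() that materializes the result, and it makes no claim about how NRemover2 tracks its state.

```python
class ToyFilterChain:
    """Hypothetical stand-in illustrating a chain-then-resolve pattern."""

    def __init__(self, values):
        self._original = list(values)
        self._keep = [True] * len(self._original)
        self.removed_per_step = {}

    def _apply(self, name, predicate):
        # Narrow the keep-mask and record how many items this step removed.
        before = sum(self._keep)
        self._keep = [
            keep and predicate(value)
            for keep, value in zip(self._keep, self._original)
        ]
        self.removed_per_step[name] = before - sum(self._keep)
        return self  # returning self is what makes chaining possible

    def filter_min(self, minimum):
        return self._apply("filter_min", lambda value: value >= minimum)

    def filter_even(self):
        return self._apply("filter_even", lambda value: value % 2 == 0)

    def resolve(self):
        # Finalize the chain: return the values that survived every step.
        return [v for keep, v in zip(self._keep, self._original) if keep]


chain = ToyFilterChain([1, 2, 3, 4, 5, 6])
result = chain.filter_min(3).filter_even().resolve()
```

Recording the number of items removed per step, as removed_per_step does here, is the kind of bookkeeping a filtering report or Sankey diagram can be drawn from.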

__init__(genotype_data)

Initializes the NRemover2 class.

This method initializes the NRemover2 class with the provided GenotypeData instance. It sets up the filtering state, including the alignment, sample indices, loci indices, and other relevant attributes. It also initializes the filtering helper and methods.

Parameters:

genotype_data (GenotypeData) – An instance of the GenotypeData class containing the genetic data alignment, population map, populations, and other relevant data.

Methods

__init__(genotype_data)

Initializes the NRemover2 class.

filter_allele_depth([min_total_depth])

Filter loci where total allele depth (AD) across all retained samples is below threshold.

filter_biallelic([exclude_heterozygous])

Retain only biallelic loci and remove those with more than two alleles.

filter_het(threshold)

Filters out loci (columns) with missing data proportion greater than the specified threshold.

filter_het_pop(threshold)

Filters loci (columns) based on missing data per population.

filter_linked([seed])

Filter to retain only one locus per chromosome/scaffold from a VCF file.

filter_mac(min_count[, exclude_heterozygous])

Filters loci where the minor allele count is below the given minimum count.

filter_maf(threshold[, exclude_heterozygous])

Filters loci where the minor allele frequency is below the threshold.

filter_missing(threshold)

Filters out loci (columns) with missing data proportion greater than the specified threshold.

filter_missing_pop(threshold)

Filters loci (columns) based on missing data per population.

filter_missing_sample(threshold)

Remove samples with missing data proportion > threshold.

filter_monomorphic([exclude_heterozygous])

Filter out monomorphic loci with only one valid allele.

filter_singletons([exclude_heterozygous])

Filter out singleton loci (minor allele count == 1, with ≥2 alleles present).

plot_sankey_filtering_report([filename])

Plots a Sankey diagram showing the number of loci removed at each filtering step.

random_subset_loci(size[, seed])

Randomly subset loci from the current alignment.

resolve([benchmark_mode])

Resolve the method chain and finalize the filtering process.

search_thresholds([thresholds, ...])

Search across filtering thresholds and plot the proportions.

thin_loci(size[, remove_all, seed])

Thin loci that are within size base pairs of each other.
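
The position-based methods summarized above (thin_loci and filter_linked) can be pictured with the following sketch. It assumes loci are given as (chromosome, position) pairs and uses a simple greedy rule for thinning. The functions thin_positions and one_per_chromosome are hypothetical, the remove_all option is not modeled, and SNPio's actual selection rules may differ.

```python
import random


def thin_positions(loci, size):
    """Greedy thinning: keep a locus only if it lies more than `size` bp
    past the previously kept locus on the same chromosome."""
    kept = []
    last_kept = {}  # chromosome -> position of the most recently kept locus
    for chrom, pos in sorted(loci):
        if chrom not in last_kept or pos - last_kept[chrom] > size:
            kept.append((chrom, pos))
            last_kept[chrom] = pos
    return kept


def one_per_chromosome(loci, seed=None):
    """Pick one random locus per chromosome; `seed` makes it reproducible."""
    rng = random.Random(seed)
    by_chrom = {}
    for chrom, pos in loci:
        by_chrom.setdefault(chrom, []).append(pos)
    return [(chrom, rng.choice(positions))
            for chrom, positions in sorted(by_chrom.items())]


loci = [("chr1", 100), ("chr1", 150), ("chr1", 400), ("chr2", 50), ("chr2", 60)]
thinned = thin_positions(loci, size=100)
unlinked = one_per_chromosome(loci, seed=42)
```

Passing the same seed twice yields the same selection, which is the usual reason such methods expose a seed parameter.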

Attributes

loci_indices

Gets the current loci_indices.

sample_indices

Gets the current sample_indices.

search_mode

Gets the current search mode status.