Getting Started

Welcome to SNPio! This guide will help you get started with reading, filtering, encoding, analyzing, and visualizing genotype data for population genomic and phylogenetic analysis.

SNPio supports multiple standard formats — VCF, PHYLIP, STRUCTURE, and GENEPOP — and provides a unified GenotypeData object for downstream analysis using a consistent interface.

This guide covers:

Installation options (pip, conda, Docker)
File readers and formats
Core usage examples
Experimental modules like AMOVA and TreeParser

Installation

There are several ways to install SNPio:

Using pip (recommended with a virtual environment)
Using conda
Using Docker
(Advanced) Installing globally with pip (not recommended)

Installing with pip

python3 -m venv snpio-env
source snpio-env/bin/activate
pip install snpio

Tip

For reproducibility, consider saving your environment to a requirements file:

pip freeze > requirements.txt

Installing with conda

conda create -n snpio-env python=3.12
conda activate snpio-env
conda install -c btmartin721 snpio

Installing with Docker

docker pull btmartin721/snpio:latest
docker run -it btmartin721/snpio:latest

Tip

SNPio is designed for Unix-based systems (Linux, macOS). On Windows, use WSL (Windows Subsystem for Linux) for full compatibility.

Importing SNPio

Once installed, you can import SNPio modules as needed:

from snpio import (
    VCFReader, PhylipReader, StructureReader, GenePopReader,
    NRemover2, PopGenStatistics, GenotypeEncoder
)

Tip

Only import the readers or tools you plan to use. This helps avoid clutter, reduces memory usage, and avoids overwhelming your namespace with unused classes.

Reading Genotype Data

SNPio provides dedicated readers for each supported format. Each reader returns a GenotypeData object, which enables downstream filtering, encoding, and statistical analyses.

Here’s a minimal VCF loading example:

vcf = "snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz"
popmap = "snpio/example_data/popmaps/phylogen_nomx.popmap"

gd = VCFReader(
    filename=vcf,
    popmapfile=popmap,
    force_popmap=True,
    prefix="snpio_example",
    plot_format="png",
    verbose=True
)

The GenotypeData Object

All SNPio readers return a GenotypeData object, which serves as a central container for your genomic data and metadata. Once you load your data, you can access several useful attributes directly from this object.

Key Public Attributes

Attribute	Description
`snp_data`	A NumPy array `(n_samples, n_loci)` of IUPAC-encoded genotypes.
`samples`	A list of sample (individual) IDs.
`num_inds`	The total number of individuals.
`num_snps`	The total number of SNPs (loci).
`shape`	A `(n_samples, n_loci)` tuple representing the alignment dimensions.
`populations`	A list of population IDs corresponding to each sample.
`popmap`	A dictionary mapping each sample ID to its population ID.
`popmap_inverse`	A dictionary mapping each population ID to a list of its member sample IDs.
`num_pops`	The total number of unique populations.
`pop_sizes`	A dictionary mapping each population ID to its sample count.
`pop_to_indices`	A dictionary mapping each population ID to a list of its sample row indices.
`has_popmap`	A boolean indicating if population data is present.
`ref`	A list of reference alleles for each locus (VCF-derived).
`alt`	A list of alternate alleles for each locus (VCF-derived).
`locus_names`	A list of concrete names for each locus (e.g., “chr1:1001”).
`snpsdict`	A dictionary mapping sample IDs to their genotype sequences.
`inputs`	A dictionary of the keyword arguments used to initialize the object.
`is_empty`	A boolean that is `True` if the dataset has zero samples or loci.
`output_dir`	The root output directory path for generated files.
`plots_dir`	The dedicated directory path for plots.
`reports_dir`	The dedicated directory path for reports.
`missing_mask`	A boolean NumPy array where `True` marks a missing genotype.
`valid_mask`	A boolean NumPy array where `True` marks a non-missing genotype.
`het_mask`	A boolean NumPy array where `True` marks a heterozygous genotype.
`missing_rate`	The overall proportion of missing data in the alignment.
`per_locus_missing`	A pandas Series with the missing data proportion for each locus.
`per_individual_missing`	A pandas Series with the missing data proportion for each individual.
`per_locus_het_rate`	A pandas Series with the heterozygosity rate for each locus.
`per_individual_het_rate`	A pandas Series with the heterozygosity rate for each individual.
`is_missing_locus`	A boolean NumPy array that is `True` for loci missing in all samples.
`nbytes`	The approximate memory footprint of the `snp_data` array in bytes.
`sample_indices`	A boolean array indicating which samples are retained after filtering.
`loci_indices`	A boolean array indicating which loci are retained after filtering.

You can use these attributes to inspect your data at any stage of your analysis pipeline. For example:

# Load data
gd = VCFReader(filename="my_data.vcf.gz", popmapfile="popmap.txt")

# Inspect the data
print(f"Loaded {gd.num_inds} individuals and {gd.num_snps} SNPs.")
print(f"Populations found: {list(gd.popmap_inverse.keys())}")
print(f"First 5 samples: {gd.samples[:5]}")
print(f"Genotype of first sample at first SNP: {gd.snp_data[0, 0]}")

Specifying Populations

You can restrict analysis to specific populations using include_pops and/or exclude_pops:

gd = VCFReader(
    filename=vcf,
    popmapfile=popmap,
    prefix="snpio_example",
    include_pops=["EA", "GU", "ON"],
    exclude_pops=["MX", "YU"]
)

Tip

You can combine include_pops and exclude_pops to enforce fine-grained control. Just be sure they don’t overlap.

Note

The population IDs in include_pops and exclude_pops must not overlap. If they do, an error will be raised.

force_popmap Behavior

The force_popmap=True option attempts to align the popmap and alignment automatically by:

Removing unmatched samples
Sorting both to match order
Avoiding errors on slight mismatches

If set to False, exact correspondence is required between the popmap and alignment sample IDs.

Supported Missing Data Characters

All SNPio readers treat the following as missing genotypes:

“N”
“.”
“?”
“-”

Population Map File Format

Popmap files map sample IDs to population IDs. SNPio supports:

Whitespace- or comma-delimited files
Optional header row: SampleID, PopulationID

Example (no header):

Sample1,Population1
Sample2,Population1
Sample3,Population2

Example (with header):

SampleID,PopulationID
Sample1,Population1
Sample2,Population1
Sample3,Population2

Writing Genotype Data to File

All GenotypeData objects created by SNPio readers (e.g., VCFReader, PhylipReader, StructureReader, GenePopReader) support interoperable export methods. This allows you to convert between formats seamlessly, regardless of the original input type.

Available write methods:

write_vcf(filepath) - Export to bgzipped VCF format
write_phylip(filepath) - Export to sequential PHYLIP format
write_structure(filepath) - Export to STRUCTURE format
write_genepop(filepath) - Export to GENEPOP format
write_popmap(filepath) - Export population map file. Useful if samples were filtered or modified.

These methods are available directly from any GenotypeData object.

Examples

Export filtered data to a new VCF:

gd.write_vcf("output_data.vcf.gz")

Convert a VCF file to STRUCTURE format:

from snpio import VCFReader

gd = VCFReader(
    filename="input.vcf.gz",
    popmapfile="popmap.txt",
    force_popmap=True,
    prefix="converted_data"
)

gd.write_structure("converted_data.str")

Convert STRUCTURE → PHYLIP:

from snpio import StructureReader

gd = StructureReader(
    filename="input_structure.str",
    popmapfile="popmap.txt"
)

gd.write_phylip("converted.phy")

Convert GENEPOP → VCF:

from snpio import GenePopReader

gd = GenePopReader(
    filename="example.gen",
    popmapfile="popmap.txt"
)

gd.write_vcf("converted_from_genepop.vcf")

Notes

Output files can be compressed (e.g., .vcf.gz) or plain text.
The same GenotypeData object can be written to any supported format, regardless of input format.
SNPio handles internal data transformation automatically (e.g., allele formats, headers, sample ordering).

Tip

This interoperability is useful for converting legacy formats or preparing input for downstream tools that require a specific genotype format.

Next Steps

Now that you’ve successfully loaded data, you’re ready to filter, encode, and analyze genotype datasets using NRemover2, GenotypeEncoder, and PopGenStatistics.

Continue to the next section: Filtering with NRemover2.

Filtering Genotype Data with NRemover2

The NRemover2 class provides a flexible and efficient way to clean and preprocess your genotype dataset before downstream analysis. It supports filters for:

Missing data thresholds (per sample, locus, population)
Minor allele count (MAC) and frequency (MAF)
Singleton and monomorphic loci
Biallelic filtering
Locus thinning and linked loci pruning
Random subsetting

Filters can be chained and must be finalized with .resolve().

Basic Example

from snpio import NRemover2

nrm = NRemover2(genotype_data)

gd_filt = (
    nrm.filter_missing_sample(0.75)
      .filter_missing(0.75)
      .filter_missing_pop(0.75)
      .filter_mac(2)
      .filter_monomorphic(exclude_heterozygous=True)
      .filter_singletons(exclude_heterozygous=True)
      .filter_biallelic(exclude_heterozygous=True)
      .resolve()
)

# Save filtered VCF
gd_filt.write_vcf("filtered_output.vcf.gz")

Note

You must call .resolve() at the end of the filter chain to apply the filtering logic and return a new GenotypeData object.

Filtering with exclude_heterozygous

Many filters accept the exclude_heterozygous=True flag, which ensures that only homozygous genotypes are considered during filtering. For example:

filter_singletons(exclude_heterozygous=True) removes singletons where the minor allele is present in only one homozygous genotype. If the minor allele is present in a heterozygous genotype, it will not be counted as a singleton, but rather as a monomorphic site.
filter_biallelic(exclude_heterozygous=True) retains only sites with at least one homozygous reference and one homozygous alternate.
filter_monomorphic(exclude_heterozygous=True) removes sites with only one allele. A site with a heterozygous genotype and all homozygous reference genotypes will still be considered monomorphic if exclude_heterozygous=True is set, as it will not count the heterozygous genotype as a valid allele.

Tip

Use exclude_heterozygous=True when you want to apply conservative filters that rely only on clear homozygous calls.

Key Filtering Methods

Method	Description
`filter_missing_sample(thresh)`	Removes individuals with missing data above threshold.
`filter_missing(thresh)`	Removes loci with missing data above threshold.
`filter_missing_pop(thresh)`	Removes loci where any population exceeds threshold missingness.
`filter_mac(thresh)`	Removes loci with MAC below threshold.
`filter_maf(thresh)`	Removes loci with MAF below threshold.
`filter_monomorphic(...)`	Removes sites with only one allele.
`filter_singletons(...)`	Removes loci where the minor allele is present in only one genotype.
`filter_biallelic(...)`	Retains loci with exactly two alleles.
`thin_loci(size)`	Retains one SNP within every size bp window.
`filter_linked(size)`	Keeps only one SNP per scaffold/chromosome within size bp.
`random_subset_loci(size)`	Selects a random subset of size loci.
`resolve()`	Finalizes and applies all filters.

Note

thin_loci() and filter_linked() are only available for VCFReader input. They rely on chromosome and position metadata.

Threshold Grid Search

Use search_thresholds() to explore the effects of various filtering thresholds:

nrm = NRemover2(genotype_data)

nrm.search_thresholds(
    thresholds=[0.25, 0.5, 0.75, 1.0],
    maf_thresholds=[0.01, 0.05],
    mac_thresholds=[2, 5],
    filter_order=[
        "filter_missing_sample",
        "filter_missing",
        "filter_missing_pop",
        "filter_mac",
        "filter_monomorphic",
        "filter_singletons",
        "filter_biallelic"
    ]
)

Tip

Grid search helps identify thresholds that retain maximum informative loci while reducing noise from missing or rare variants.

Note

search_thresholds() is incompatible with thin_loci(), filter_linked(), and random_subset_loci().

Filtering Visualizations

Missingness Reports

After filtering, you can inspect missing data levels:

gd_filt.missingness_reports(prefix="filtered")

Missing Data Report — Figure: Per-sample, Per-locus, and Population-level missing data summaries.

Sankey Diagram

Visualize how many loci are retained vs. removed at each filter step:

nrm.plot_sankey_filtering_report()

Sankey plot of filtering steps — Figure: Sankey diagram showing filter effects at each stage. Bands are proportional to the number of loci retained (green) or removed (red) at each step.

Note

The Sankey diagram only visualizes loci, not samples. It also must be run after .resolve().

Note

plot_sankey_filtering_report() does not render thin_loci(), filter_linked(), random_subset_loci(), or filter_missing_samples().

Best Practices

Always inspect missingness and allele distributions before analysis.
Use search_thresholds() to identify optimal filter thresholds for your dataset.
Call .resolve() before using filtered data downstream.
Combine visual tools (Sankey, missingness plots) to guide decisions.

Next Steps

Continue to Section 3: Encoding Genotype Data for Machine Learning and AI using the GenotypeEncoder class.

Encoding Genotype Data

SNPio provides the GenotypeEncoder class for converting genotype data into formats suitable for AI, machine learning, and numerical analyses. It supports:

One-hot encoding (multi-label binary)
Integer encoding (categorical)
0-1-2 encoding (for additive genetic models)

These formats can be used directly in machine learning pipelines or population genomic models.

Importing GenotypeEncoder

from snpio import VCFReader, GenotypeEncoder

gd = VCFReader(
    filename="snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz",
    popmapfile="snpio/example_data/popmaps/phylogen_nomx.popmap",
    force_popmap=True,
    prefix="snpio_example",
    verbose=True
)

encoder = GenotypeEncoder(gd)

Encoding Formats

One-Hot Encoding

gt_ohe = encoder.genotypes_onehot

Each base or IUPAC code is represented as a vector:

{
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
    "N": [0.0, 0.0, 0.0, 0.0],
    "W": [1.0, 0.0, 0.0, 1.0],
    "R": [1.0, 0.0, 1.0, 0.0],
    "M": [1.0, 1.0, 0.0, 0.0],
    "K": [0.0, 0.0, 1.0, 1.0],
    "Y": [0.0, 1.0, 0.0, 1.0],
    "S": [0.0, 1.0, 1.0, 0.0],
}

Integer Encoding

gt_int = encoder.genotypes_int

Each genotype is mapped to a unique integer:

“A” = 0, “C” = 1, “G” = 2, “T” = 3
IUPAC heterozygotes (e.g., “W”, “R”) = 4-9
Missing values = -9

0-1-2 Encoding

gt_012 = encoder.genotypes_012

Encodes phased diploid data:

0 = Homozygous reference
1 = Heterozygous
2 = Homozygous alternate
-9 = Missing

Two-channel Allele Encoding

The two-channel alleles encoding provides separate matrices for each allele in diploid genotypes.

# Tuple of numpy arrays of shape (n_samples, n_variants)
gt_allele1, gt_allele2 = encoder.two_channel_alleles

Reversing the Encoding

All encodings are bi-directional. You can convert encoded data back to genotypes by setting the property:

encoder.genotypes_onehot = gt_ohe
encoder.genotypes_int = gt_int
encoder.genotypes_012 = gt_012
encoder.two_channel_alleles = (gt_allele1, gt_allele2)

This updates the original GenotypeData.snp_data matrix with the decoded genotype values.

Use Cases

AI and ML training data: Convert to numerical format for model input
Dimensionality reduction (e.g., PCA, t-SNE): Use 0-1-2 or one-hot formats
Visualization: Heatmaps, clustering, D-statistics, etc.

Tip

The encoder handles IUPAC ambiguity codes seamlessly and supports bidirectional conversion between genotypes and encoded formats.

Next Steps

Proceed to Section 4: Population Genetic Analyses with `PopGenStatistics` for computing Fst, heterozygosity, nucleotide diversity, D-statistics, PCA, and more.

Population Genetic Analysis

The PopGenStatistics class provides a unified interface to perform key population genetic analyses using your filtered GenotypeData object.

It supports:

Summary statistics: He, Ho, π, and pairwise Weir & Cockerham Fst (Weir and Cockerham, 1984)
Nei’s genetic distance (Nei, 1972) (observed, permutation p-values, bootstrap CIs)
D-statistics: Patterson’s (Green et al., 2010, Patterson et al., 2006), Partitioned (Eaton and Ree, 2013), and D-FOIL (Pease and Hahn, 2015)
Fst outlier detection (Beaumont and Nichols, 1996, Foll and Gaggiotti, 2008)
Principal Component Analysis (PCA)
(Experimental) AMOVA (Excoffier et al., 1992)

Basic Usage

from snpio import VCFReader, PopGenStatistics

gd = VCFReader(
    filename="example.vcf.gz",
    popmapfile="popmap.txt",
    force_popmap=True,
    prefix="analysis_output",
    verbose=True
)

pgs = PopGenStatistics(gd)

# New API (returns a tuple)
summary_stats, allele_stats_df = pgs.summary_statistics(
    fst_method="permutation",  # "observed", "permutation", or "bootstrap"
    n_reps=100,
    n_jobs=-1,
    save_plots=True
)

Calculates:

Observed heterozygosity (Ho)
Expected heterozygosity (He)
Nucleotide diversity (π)
Pairwise Fst (Weir & Cockerham, 1984)
Allele-frequency based summary statistics

Includes per-population tables and visualizations.

Allele Summary Statistics

SNPio provides a comprehensive allele-level diagnostic summary via the PopGenStatistics class. This module quantifies missingness, heterozygosity, allelic richness, minor allele frequencies (MAFs), and more — providing an overview of SNP dataset quality and diversity.

This is automatically called within the PopGenStatistics(…).summary_statistics() method.

Example

from snpio import VCFReader, PopGenStatistics

gd = VCFReader(
    filename="example.vcf.gz",
    popmapfile="popmap.txt",
    force_popmap=True,
    prefix="allele_stats"
)

pgs = PopGenStatistics(gd)

# New return order: (summary_stats, allele_stats_df)
summary_stats, allele_stats_df = pgs.summary_statistics(
    fst_method="permutation",
    n_reps=100,
    n_jobs=-1,
    save_plots=True
)

Output

Returns:

summary_stats: a dictionary of result DataFrames (e.g., Ho/He/π summaries, pairwise Fst matrices and, if applicable, p-values or bootstrap intervals).
allele_stats_df: a pd.DataFrame with >30 allele-level metrics.

Note

All proportions range from 0.0 to 1.0. The tuple return is ordered as (summary_stats, allele_stats_df).

Nei’s (1972) Genetic Distance

Computes pairwise Nei’s genetic distance between populations. Supports: observed distances, permutation-based p-values, and bootstrap confidence intervals.

nei_results = pgs.neis_genetic_distance(
    method="permutation",   # "observed", "permutation", or "bootstrap"
    n_reps=1000,
    n_jobs=-1
)

df_nei = nei_results["observed"]
df_pvals = nei_results.get("pvalues")  # present if method == "permutation"
df_lower = nei_results.get("lower_ci") # present if method == "bootstrap"
df_upper = nei_results.get("upper_ci") # present if method == "bootstrap"

Nei's Genetic Distance Heatmap between Pairwise Populations — Figure: Heatmap showing Nei’s (1972) genetic distance between pairwise populations.

Tip

When plotted via SNPio’s built-ins and in the MultiQC report, diagonals are shown as 0.0 for distances and 1.0 for permutation p-values for readability.

Weir & Cockerham Fst (Distance Matrix)

Compute pairwise Weir & Cockerham Fst between populations. Supports: observed Fst, permutation-based p-values, and bootstrap confidence intervals.

fst_results = pgs.fst_distance(
    method="permutation",   # "observed", "permutation", or "bootstrap"
    n_reps=1000,
    n_jobs=-1,
    palette="viridis",
    suppress_plot=False
)

df_fst = fst_results["observed"]
df_fst_p = fst_results.get("pvalues")      # if method == "permutation"
df_fst_l = fst_results.get("lower_ci")     # if method == "bootstrap"
df_fst_u = fst_results.get("upper_ci")     # if method == "bootstrap"

Pairwise Fst heatmap — Figure: Pairwise Fst matrix. Lighter colors indicate higher differentiation.

D-Statistics (Patterson, Partitioned, DFOIL)

Calculate D-statistics for testing introgression among four (Patterson’s D), or five (Partitioned D, D-FOIL) populations. Supports bootstrap resampling, jackknife resampling, overall and per-combination statistics, and plotting.

Also supported are randomized or deterministic individual selection per population, and limiting the maximum number of individuals per population for balanced sampling and/or speed.

Basic Example

from snpio import VCFReader, PopGenStatistics

gd = VCFReader(
    filename="example.vcf.gz",
    popmapfile="popmap.txt",
    prefix="dstat_analysis",
    force_popmap=True,
)

pgs = PopGenStatistics(gd)

df, summary = pgs.calculate_d_statistics(
    method="partitioned",
    population1="P1",
    population2="P2",
    population3="P3a",
    population4="P3b",
    outgroup="Out",
    num_bootstraps=1000,
    max_individuals_per_pop=5,
    individual_selection="random",
    per_combination=True,
    calc_overall=True,
    save_plot=True,
    use_jackknife=False,
    block_size=500,
    seed=42,
)

Fst Outlier Detection

Supports two methods:

Traditional permutations
DBSCAN clustering

# Permutation-based outliers (single DataFrame with unadjusted/adjusted
# p-values)
df_outliers = pgs.detect_fst_outliers(
    n_permutations=1000,
    correction_method="fdr_bh",  # or "bonferroni", "holm", etc.
    n_jobs=4,
    seed=123
)

# DBSCAN-based outliers
df_dbscan = pgs.detect_fst_outliers(
    use_dbscan=True,
    correction_method="fdr_bh",
    n_jobs=4,
    min_samples=5
)

Fst Outliers heatmap — Figure: Outlier SNPs based on Fst between population pairs.

Note

detect_fst_outliers now returns one DataFrame. If a correction method is provided, adjusted p-values are included as columns.

AMOVA (Experimental)

amova_result = pgs.amova(
    regionmap={
        "EA": "East",
        "GU": "East",
        "TT": "East",
        "DS": "West"
    },
    n_permutations=100,
    n_jobs=1
)

Performs Analysis of Molecular Variance based on hierarchical population structure.

PopGenStatistics Method Summary

Method	Description	Algorithm(s)
`summary_statistics(fst_method="observed\|permutation\|bootstrap", n_reps, n_jobs, save_plots)`	Calculates He, Ho, π, and pairwise Fst; returns `(summary_stats, allele_stats_df)`.	Standard formulas; permutation/bootstrapping for Fst if requested
`neis_genetic_distance(method, n_reps, n_jobs, palette, suppress_plot)`	Computes Nei’s pairwise distances. Returns dict with keys: `observed`, and (optionally) `pvalues`, `lower_ci`, `upper_ci`.	Nei (1972)
`fst_distance(method, n_reps, n_jobs, palette, suppress_plot)`	Computes Weir & Cockerham pairwise Fst. Returns dict like Nei’s: `observed` plus (optionally) `pvalues` or CIs.	Weir & Cockerham (1984)
`calculate_d_statistics(method, population1, population2, population3, population4=None, outgroup=None, num_bootstraps, max_individuals_per_pop, individual_selection, save_plot, seed, per_combination, calc_overall, use_jackknife, block_size)`	Calculates Patterson / Partitioned / D-FOIL with bootstrap or jackknife support.	ABBA–BABA framework with resampling
`detect_fst_outliers(correction_method=None, alpha=0.05, use_dbscan=False, n_permutations=1000, n_jobs=1, seed=None, min_samples=5, max_outliers_to_plot=None)`	Detects outlier loci via permutations or DBSCAN. Returns a single DataFrame (includes unadjusted/adjusted p-values when requested).	Permutations or DBSCAN
`pca(n_components=None, center=True, scale=False, n_axes=2\|3, seed=None, ..., plot_format=None)`	Runs PCA (KNN-imputes missing data, optional scaling, robust guards) and creates visualizations.	scikit-learn PCA
`amova(regionmap=None, n_permutations=0, n_jobs=1, random_seed=None)` (experimental)	Performs AMOVA with optional SNP-wise permutations for p-values.	Hierarchical variance partitioning

Next Steps

Continue to Section 5: MultiQC Report Generation if you want to explore SNPio’s support for generating interactive MultiQC reports.

MultiQC Report Generation

SNPio includes built-in support for generating interactive MultiQC reports that summarize results from all major analysis modules. These reports are ideal for data exploration, publication-ready figures, and batch comparisons.

The SNPioMultiQCReport class automatically collects results from:

Filtering (NRemover2)
Summary statistics and Fst estimates (PopGenStatistics)
Genetic distance matrices (Nei’s, Weir & Cockerham’s)
D-statistics (Patterson, Partitioned, D-FOIL)
Fst Outlier detection (DBSCAN, permutations)
Genotype encodings (GenotypeEncoder)
PCA
Input/output summary tables (e.g., number of samples/loci retained)

Basic Example

After completing your filtering and statistical analyses:

from snpio import SNPioMultiQC

report = SNPioMultiQC.build(
    prefix="my_report",
    output_dir="results/multiqc",
    overwrite=True,
)

Note

SNPioMultiQC should not be initialized directly. Instead, use it without parentheses followed by calling the build() method to generate the report from the current session’s results.

Tip

SNPioMultiQC should be called after all filtering and analysis steps are complete. It collects results from the current session and generates a comprehensive report of all filtering and analysis results.

This creates a self-contained .html report with interactive visualizations, significance summaries, and analysis metadata.

Tip

The MultiQC HTML report can be viewed in Firefox, Chrome, and possibly other web browsers. It is fully self-contained, meaning all data and visualizations are embedded in the HTML file.

Note

MultiQC support is available for all formats supported by SNPio (VCF, PHYLIP, STRUCTURE, GENEPOP).

Report Contents

The SNPio MultiQC report includes:

Filtering Summary (Sankey plot, missingness, sample/locus counts)
Summary Statistics (He, Ho, π)
Pairwise Fst Matrix (Weir & Cockerham)
Nei’s Distance Heatmap
D-statistics (histograms, significance bar plots)
Fst Outlier Heatmaps
PCA (colored by population)
Input Metadata Table (samples, loci, populations)

Custom Report Location

You can control the report path:

SNPioMultiQC.build(
      prefix="Example Report",
      output_dir="results/multiqc",
      overwrite=True,
  )

Tip

All generated reports are self-contained HTML files and require no internet connection to view.

Exporting Report Data

All visualizations in the report are backed by exportable CSV and JSON data, saved alongside the HTML file. This allows integration into downstream tools or publications.

Tip

The MultiQC HTML report is fully portable—just open the file in any browser or send it to collaborators.

Integration with Analysis Pipeline

You can integrate report generation as a final step in your SNPio pipeline:

# After all file I/O, filtering, encoding, and running PopGenStatistics
SNPioMultiQC.build(
      prefix="Example Report",
      output_dir="results/multiqc",
      overwrite=True,
  )

This ensures all results are captured in a single, interactive report for easy sharing and review.

Click Here to view an example MultiQC report generated by SNPio, showcasing the various analyses and visualizations available.

Next Steps

Continue to Section 6: Experimental Tree Parsing and Phylogenetic Analysis if you want to explore SNPio’s capabilities for working with phylogenetic trees and evolutionary models.

Tree Parsing and Phylogenetic Analysis (Experimental)

SNPio includes an experimental module, TreeParser, for working with phylogenetic trees in Newick or NEXUS formats. This class enables:

Reading and writing trees
Drawing and visualizing trees
Rerooting and pruning trees
Extracting subtrees by sample name
Getting distance matrices and rate models (Q-matrix, site rates)

Basic Example

from snpio import VCFReader, TreeParser

gd = VCFReader(
    filename="snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz",
    popmapfile="snpio/example_data/popmaps/phylogen_nomx.popmap",
    force_popmap=True,
    prefix="snpio_example"
)

tp = TreeParser(
    genotype_data=gd,
    treefile="snpio/example_data/trees/test.tre",
    siterates="snpio/example_data/trees/test14K.rates",
    qmatrix="snpio/example_data/trees/test.iqtree",
    verbose=True
)

Reading and Drawing Trees

tree = tp.read_tree()
tree.draw()  # Visualize the tree

Saving Trees to File

Save in Newick format:

tp.write_tree(tree, save_path="output_tree.tre")

Save in NEXUS format:

tp.write_tree(tree, save_path="output_tree.nex", nexus=True)

Return tree as a string:

newick_string = tp.write_tree(tree, save_path=None)

Tree Editing and Analysis

Reroot tree using samples matching a pattern:

tp.reroot_tree("~EA")  # Uses regex to match sample names containing 'EA'

Prune samples:

pruned = tp.prune_tree("~ON")

Extract subtree from a sample pattern:

subtree = tp.get_subtree("~EA")

Write pruned tree to file:

tp.write_tree(pruned, save_path="pruned_tree.tre")

Inspect tree statistics:

stats = tp.tree_stats()
print(stats)

Distance Matrix and Evolutionary Rates

Get full distance matrix between all nodes:

dist_matrix = tp.get_distance_matrix()

Access the rate matrix Q (from .iqtree file or Q-matrix):

print(tp.qmat)

Access site-specific substitution rates:

print(tp.site_rates)

Input File Notes

treefile: Required. Accepts Newick or NEXUS.
siterates: Optional. One-column file with rate per alignment site.
qmatrix: Optional. Can be parsed from .iqtree or provided as a CSV or TSV file with a 4×4 substitution rate matrix for A/C/G/T.

Experimental Status

Warning

TreeParser is still under active development. Interface and behavior may change in future versions.

Tip

Regular expressions can be used with ~ prefix in get_subtree() and prune_tree().

Example Use Case: Extracting a Subtree and Saving to Newick

subtree = tp.get_subtree("~EA")
newick = tp.write_tree(subtree, save_path=None)
print(newick)

Conclusion

TreeParser allows for interactive tree visualization and manipulation, fully integrated with SNPio’s genotype data. This feature is especially useful for combining genetic variation data with phylogenetic inference pipelines.

Thank you for using SNPio! We hope this guide helps you get started with reading, filtering, encoding, analyzing, and visualizing your genotype data effectively. For more advanced usage and examples, please refer to the documentation or explore the source code on GitHub.

For any questions or issues, feel free to open an issue on the SNPio GitHub repository.

References

[BN96]

Mark A. Beaumont and Richard A. Nichols. Evaluating loci for use in the genetic analysis of population structure. Proceedings of the Royal Society B: Biological Sciences, 263(1377):1619–1626, 1996. doi:10.1098/rspb.1996.0237.

[ER13]

David A. R. Eaton and Richard H. Ree. Inferring phylogeny and introgression using radseq data: an example from flowering plants (\textit Pedicularis: orobanchaceae). Systematic Biology, 62(5):689–706, 2013. doi:10.1093/sysbio/syt032.

[ESQ92]

Laurent Excoffier, Peter E. Smouse, and Joseph M. Quattro. Analysis of molecular variance inferred from metric distances among dna haplotypes: application to human mitochondrial dna restriction data. Genetics, 131(2):479–491, 1992. doi:10.1093/genetics/131.2.479.

[FG08]

Matthieu Foll and Oscar Gaggiotti. A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a bayesian perspective. Genetics, 180(2):977–993, 2008. doi:10.1534/genetics.108.092221.

[GKB10]

Richard E. Green, Johannes Krause, and \textit et al. Briggs, Adrian W. A draft sequence of the neandertal genome. Science, 328(5979):710–722, 2010. doi:10.1126/science.1188021.

[Nei72]

Masatoshi Nei. Genetic distance between populations. American Naturalist, 106(949):283–292, 1972. doi:10.1086/282771.

[PPR06]

Nick Patterson, Alkes L. Price, and David Reich. Population structure and eigenanalysis. PLoS Genetics, 2(12):e190, 2006. doi:10.1371/journal.pgen.0020190.

[PH15]

James B. Pease and Matthew W. Hahn. Detection and polarization of introgression in a five-taxon phylogeny. Systematic Biology, 64(4):651–662, 2015. doi:10.1093/sysbio/syv023.

[WC84]

Bruce S. Weir and C. Clark Cockerham. Estimating f-statistics for the analysis of population structure. Evolution, 38(6):1358–1370, 1984. doi:10.2307/2408641.