Getting Started

SNPio Logo

Welcome to SNPio! This guide will help you get started with reading, filtering, encoding, analyzing, and visualizing genotype data for population genomic and phylogenetic analysis.

SNPio supports multiple standard formats — VCF, PHYLIP, STRUCTURE, and GENEPOP — and provides a unified GenotypeData object for downstream analysis using a consistent interface.

This guide covers:

  • Installation options (pip, conda, Docker)

  • File readers and formats

  • Core usage examples

  • Experimental modules like AMOVA and TreeParser

Installation

There are several ways to install SNPio:

  • Using pip (recommended with a virtual environment)

  • Using conda

  • Using Docker

  • (Advanced) Installing globally with pip (not recommended)

Installing with pip

python3 -m venv snpio-env
source snpio-env/bin/activate
pip install snpio

Tip

For reproducibility, consider saving your environment to a requirements file:

pip freeze > requirements.txt

Installing with conda

conda create -n snpio-env python=3.12
conda activate snpio-env
conda install -c btmartin721 snpio

Installing with Docker

docker pull btmartin721/snpio:latest
docker run -it btmartin721/snpio:latest

Tip

SNPio is designed for Unix-based systems (Linux, macOS). On Windows, use WSL (Windows Subsystem for Linux) for full compatibility.

Importing SNPio

Once installed, you can import SNPio modules as needed:

from snpio import (
    VCFReader, PhylipReader, StructureReader, GenePopReader,
    NRemover2, PopGenStatistics, GenotypeEncoder
)

Tip

Only import the readers or tools you plan to use. This helps avoid clutter, reduces memory usage, and avoids overwhelming your namespace with unused classes.

Reading Genotype Data

SNPio provides dedicated readers for each supported format. Each reader returns a GenotypeData object, which enables downstream filtering, encoding, and statistical analyses.

Here’s a minimal VCF loading example:

vcf = "snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz"
popmap = "snpio/example_data/popmaps/phylogen_nomx.popmap"

gd = VCFReader(
    filename=vcf,
    popmapfile=popmap,
    force_popmap=True,
    prefix="snpio_example",
    plot_format="png",
    verbose=True
)

The GenotypeData Object

All SNPio readers return a GenotypeData object, which serves as a central container for your genomic data and metadata. Once you load your data, you can access several useful attributes directly from this object.

Key Public Attributes

Attribute

Description

snp_data

A NumPy array (n_samples, n_loci) of IUPAC-encoded genotypes.

samples

A list of sample (individual) IDs.

num_inds

The total number of individuals.

num_snps

The total number of SNPs (loci).

shape

A (n_samples, n_loci) tuple representing the alignment dimensions.

populations

A list of population IDs corresponding to each sample.

popmap

A dictionary mapping each sample ID to its population ID.

popmap_inverse

A dictionary mapping each population ID to a list of its member sample IDs.

num_pops

The total number of unique populations.

pop_sizes

A dictionary mapping each population ID to its sample count.

pop_to_indices

A dictionary mapping each population ID to a list of its sample row indices.

has_popmap

A boolean indicating if population data is present.

ref

A list of reference alleles for each locus (VCF-derived).

alt

A list of alternate alleles for each locus (VCF-derived).

locus_names

A list of concrete names for each locus (e.g., “chr1:1001”).

snpsdict

A dictionary mapping sample IDs to their genotype sequences.

inputs

A dictionary of the keyword arguments used to initialize the object.

is_empty

A boolean that is True if the dataset has zero samples or loci.

output_dir

The root output directory path for generated files.

plots_dir

The dedicated directory path for plots.

reports_dir

The dedicated directory path for reports.

missing_mask

A boolean NumPy array where True marks a missing genotype.

valid_mask

A boolean NumPy array where True marks a non-missing genotype.

het_mask

A boolean NumPy array where True marks a heterozygous genotype.

missing_rate

The overall proportion of missing data in the alignment.

per_locus_missing

A pandas Series with the missing data proportion for each locus.

per_individual_missing

A pandas Series with the missing data proportion for each individual.

per_locus_het_rate

A pandas Series with the heterozygosity rate for each locus.

per_individual_het_rate

A pandas Series with the heterozygosity rate for each individual.

is_missing_locus

A boolean NumPy array that is True for loci missing in all samples.

nbytes

The approximate memory footprint of the snp_data array in bytes.

sample_indices

A boolean array indicating which samples are retained after filtering.

loci_indices

A boolean array indicating which loci are retained after filtering.

You can use these attributes to inspect your data at any stage of your analysis pipeline. For example:

# Load data
gd = VCFReader(filename="my_data.vcf.gz", popmapfile="popmap.txt")

# Inspect the data
print(f"Loaded {gd.num_inds} individuals and {gd.num_snps} SNPs.")
print(f"Populations found: {list(gd.popmap_inverse.keys())}")
print(f"First 5 samples: {gd.samples[:5]}")
print(f"Genotype of first sample at first SNP: {gd.snp_data[0, 0]}")

Specifying Populations

You can restrict analysis to specific populations using include_pops and/or exclude_pops:

gd = VCFReader(
    filename=vcf,
    popmapfile=popmap,
    prefix="snpio_example",
    include_pops=["EA", "GU", "ON"],
    exclude_pops=["MX", "YU"]
)

Tip

You can combine include_pops and exclude_pops to enforce fine-grained control. Just be sure they don’t overlap.

Note

The population IDs in include_pops and exclude_pops must not overlap. If they do, an error will be raised.

force_popmap Behavior

The force_popmap=True option attempts to align the popmap and alignment automatically by:

  • Removing unmatched samples

  • Sorting both to match order

  • Avoiding errors on slight mismatches

If set to False, exact correspondence is required between the popmap and alignment sample IDs.

Supported Missing Data Characters

All SNPio readers treat the following as missing genotypes:

  • “N”

  • “.”

  • “?”

  • “-”

Population Map File Format

Popmap files map sample IDs to population IDs. SNPio supports:

  • Whitespace- or comma-delimited files

  • Optional header row: SampleID, PopulationID

Example (no header):

Sample1,Population1
Sample2,Population1
Sample3,Population2

Example (with header):

SampleID,PopulationID
Sample1,Population1
Sample2,Population1
Sample3,Population2

Writing Genotype Data to File

All GenotypeData objects created by SNPio readers (e.g., VCFReader, PhylipReader, StructureReader, GenePopReader) support interoperable export methods. This allows you to convert between formats seamlessly, regardless of the original input type.

Available write methods:

  • write_vcf(filepath) - Export to bgzipped VCF format

  • write_phylip(filepath) - Export to sequential PHYLIP format

  • write_structure(filepath) - Export to STRUCTURE format

  • write_genepop(filepath) - Export to GENEPOP format

  • write_popmap(filepath) - Export population map file. Useful if samples were filtered or modified.

These methods are available directly from any GenotypeData object.

Examples

Export filtered data to a new VCF:

gd.write_vcf("output_data.vcf.gz")

Convert a VCF file to STRUCTURE format:

from snpio import VCFReader

gd = VCFReader(
    filename="input.vcf.gz",
    popmapfile="popmap.txt",
    force_popmap=True,
    prefix="converted_data"
)

gd.write_structure("converted_data.str")

Convert STRUCTURE → PHYLIP:

from snpio import StructureReader

gd = StructureReader(
    filename="input_structure.str",
    popmapfile="popmap.txt"
)

gd.write_phylip("converted.phy")

Convert GENEPOP → VCF:

from snpio import GenePopReader

gd = GenePopReader(
    filename="example.gen",
    popmapfile="popmap.txt"
)

gd.write_vcf("converted_from_genepop.vcf")

Notes

  • Output files can be compressed (e.g., .vcf.gz) or plain text.

  • The same GenotypeData object can be written to any supported format, regardless of input format.

  • SNPio handles internal data transformation automatically (e.g., allele formats, headers, sample ordering).

Tip

This interoperability is useful for converting legacy formats or preparing input for downstream tools that require a specific genotype format.

Next Steps

Now that you’ve successfully loaded data, you’re ready to filter, encode, and analyze genotype datasets using NRemover2, GenotypeEncoder, and PopGenStatistics.

Continue to the next section: Filtering with NRemover2.

Filtering Genotype Data with NRemover2

The NRemover2 class provides a flexible and efficient way to clean and preprocess your genotype dataset before downstream analysis. It supports filters for:

  • Missing data thresholds (per sample, locus, population)

  • Minor allele count (MAC) and frequency (MAF)

  • Singleton and monomorphic loci

  • Biallelic filtering

  • Locus thinning and linked loci pruning

  • Random subsetting

Filters can be chained and must be finalized with .resolve().

Basic Example

from snpio import NRemover2

nrm = NRemover2(genotype_data)

gd_filt = (
    nrm.filter_missing_sample(0.75)
      .filter_missing(0.75)
      .filter_missing_pop(0.75)
      .filter_mac(2)
      .filter_monomorphic(exclude_heterozygous=True)
      .filter_singletons(exclude_heterozygous=True)
      .filter_biallelic(exclude_heterozygous=True)
      .resolve()
)

# Save filtered VCF
gd_filt.write_vcf("filtered_output.vcf.gz")

Note

You must call .resolve() at the end of the filter chain to apply the filtering logic and return a new GenotypeData object.

Filtering with exclude_heterozygous

Many filters accept the exclude_heterozygous=True flag, which ensures that only homozygous genotypes are considered during filtering. For example:

  • filter_singletons(exclude_heterozygous=True) removes singletons where the minor allele is present in only one homozygous genotype. If the minor allele is present in a heterozygous genotype, it will not be counted as a singleton, but rather as a monomorphic site.

  • filter_biallelic(exclude_heterozygous=True) retains only sites with at least one homozygous reference and one homozygous alternate.

  • filter_monomorphic(exclude_heterozygous=True) removes sites with only one allele. A site with a heterozygous genotype and all homozygous reference genotypes will still be considered monomorphic if exclude_heterozygous=True is set, as it will not count the heterozygous genotype as a valid allele.

Tip

Use exclude_heterozygous=True when you want to apply conservative filters that rely only on clear homozygous calls.

Key Filtering Methods

Method

Description

filter_missing_sample(thresh)

Removes individuals with missing data above threshold.

filter_missing(thresh)

Removes loci with missing data above threshold.

filter_missing_pop(thresh)

Removes loci where any population exceeds threshold missingness.

filter_mac(thresh)

Removes loci with MAC below threshold.

filter_maf(thresh)

Removes loci with MAF below threshold.

filter_monomorphic(...)

Removes sites with only one allele.

filter_singletons(...)

Removes loci where the minor allele is present in only one genotype.

filter_biallelic(...)

Retains loci with exactly two alleles.

thin_loci(size)

Retains one SNP within every size bp window.

filter_linked(size)

Keeps only one SNP per scaffold/chromosome within size bp.

random_subset_loci(size)

Selects a random subset of size loci.

resolve()

Finalizes and applies all filters.

Note

thin_loci() and filter_linked() are only available for VCFReader input. They rely on chromosome and position metadata.

Filtering Visualizations

Missingness Reports

After filtering, you can inspect missing data levels:

gd_filt.missingness_reports(prefix="filtered")
Missing Data Report

Figure: Per-sample, Per-locus, and Population-level missing data summaries.

Sankey Diagram

Visualize how many loci are retained vs. removed at each filter step:

nrm.plot_sankey_filtering_report()
Sankey plot of filtering steps

Figure: Sankey diagram showing filter effects at each stage. Bands are proportional to the number of loci retained (green) or removed (red) at each step.

Note

The Sankey diagram only visualizes loci, not samples. It also must be run after .resolve().

Note

plot_sankey_filtering_report() does not render thin_loci(), filter_linked(), random_subset_loci(), or filter_missing_samples().

Best Practices

  • Always inspect missingness and allele distributions before analysis.

  • Use search_thresholds() to identify optimal filter thresholds for your dataset.

  • Call .resolve() before using filtered data downstream.

  • Combine visual tools (Sankey, missingness plots) to guide decisions.

Next Steps

Continue to Section 3: Encoding Genotype Data for Machine Learning and AI using the GenotypeEncoder class.

Encoding Genotype Data

SNPio provides the GenotypeEncoder class for converting genotype data into formats suitable for AI, machine learning, and numerical analyses. It supports:

  • One-hot encoding (multi-label binary)

  • Integer encoding (categorical)

  • 0-1-2 encoding (for additive genetic models)

These formats can be used directly in machine learning pipelines or population genomic models.

Importing GenotypeEncoder

from snpio import VCFReader, GenotypeEncoder

gd = VCFReader(
    filename="snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz",
    popmapfile="snpio/example_data/popmaps/phylogen_nomx.popmap",
    force_popmap=True,
    prefix="snpio_example",
    verbose=True
)

encoder = GenotypeEncoder(gd)

Encoding Formats

One-Hot Encoding

gt_ohe = encoder.genotypes_onehot

Each base or IUPAC code is represented as a vector:

{
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
    "N": [0.0, 0.0, 0.0, 0.0],
    "W": [1.0, 0.0, 0.0, 1.0],
    "R": [1.0, 0.0, 1.0, 0.0],
    "M": [1.0, 1.0, 0.0, 0.0],
    "K": [0.0, 0.0, 1.0, 1.0],
    "Y": [0.0, 1.0, 0.0, 1.0],
    "S": [0.0, 1.0, 1.0, 0.0],
}

Integer Encoding

gt_int = encoder.genotypes_int

Each genotype is mapped to a unique integer:

  • “A” = 0, “C” = 1, “G” = 2, “T” = 3

  • IUPAC heterozygotes (e.g., “W”, “R”) = 4-9

  • Missing values = -9

0-1-2 Encoding

gt_012 = encoder.genotypes_012

Encodes phased diploid data:

  • 0 = Homozygous reference

  • 1 = Heterozygous

  • 2 = Homozygous alternate

  • -9 = Missing

Two-channel Allele Encoding

The two-channel alleles encoding provides separate matrices for each allele in diploid genotypes.

# Tuple of numpy arrays of shape (n_samples, n_variants)
gt_allele1, gt_allele2 = encoder.two_channel_alleles

Reversing the Encoding

All encodings are bi-directional. You can convert encoded data back to genotypes by setting the property:

encoder.genotypes_onehot = gt_ohe
encoder.genotypes_int = gt_int
encoder.genotypes_012 = gt_012
encoder.two_channel_alleles = (gt_allele1, gt_allele2)

This updates the original GenotypeData.snp_data matrix with the decoded genotype values.

Use Cases

  • AI and ML training data: Convert to numerical format for model input

  • Dimensionality reduction (e.g., PCA, t-SNE): Use 0-1-2 or one-hot formats

  • Visualization: Heatmaps, clustering, D-statistics, etc.

Tip

The encoder handles IUPAC ambiguity codes seamlessly and supports bidirectional conversion between genotypes and encoded formats.

Next Steps

Proceed to Section 4: Population Genetic Analyses with `PopGenStatistics` for computing Fst, heterozygosity, nucleotide diversity, D-statistics, PCA, and more.

Population Genetic Analysis

The PopGenStatistics class provides a unified interface to perform key population genetic analyses using your filtered GenotypeData object.

It supports:

Basic Usage

from snpio import VCFReader, PopGenStatistics

gd = VCFReader(
    filename="example.vcf.gz",
    popmapfile="popmap.txt",
    force_popmap=True,
    prefix="analysis_output",
    verbose=True
)

pgs = PopGenStatistics(gd)

# New API (returns a tuple)
summary_stats, allele_stats_df = pgs.summary_statistics(
    fst_method="permutation",  # "observed", "permutation", or "bootstrap"
    n_reps=100,
    n_jobs=-1,
    save_plots=True
)

Calculates:

  • Observed heterozygosity (Ho)

  • Expected heterozygosity (He)

  • Nucleotide diversity (π)

  • Pairwise Fst (Weir & Cockerham, 1984)

  • Allele-frequency based summary statistics

Includes per-population tables and visualizations.

Allele Summary Statistics

SNPio provides a comprehensive allele-level diagnostic summary via the PopGenStatistics class. This module quantifies missingness, heterozygosity, allelic richness, minor allele frequencies (MAFs), and more — providing an overview of SNP dataset quality and diversity.

This is automatically called within the PopGenStatistics(…).summary_statistics() method.

Example

from snpio import VCFReader, PopGenStatistics

gd = VCFReader(
    filename="example.vcf.gz",
    popmapfile="popmap.txt",
    force_popmap=True,
    prefix="allele_stats"
)

pgs = PopGenStatistics(gd)

# New return order: (summary_stats, allele_stats_df)
summary_stats, allele_stats_df = pgs.summary_statistics(
    fst_method="permutation",
    n_reps=100,
    n_jobs=-1,
    save_plots=True
)

Output

Returns:

  • summary_stats: a dictionary of result DataFrames (e.g., Ho/He/π summaries, pairwise Fst matrices and, if applicable, p-values or bootstrap intervals).

  • allele_stats_df: a pd.DataFrame with >30 allele-level metrics.

Note

All proportions range from 0.0 to 1.0. The tuple return is ordered as (summary_stats, allele_stats_df).

Nei’s (1972) Genetic Distance

Computes pairwise Nei’s genetic distance between populations. Supports: observed distances, permutation-based p-values, and bootstrap confidence intervals.

nei_results = pgs.neis_genetic_distance(
    method="permutation",   # "observed", "permutation", or "bootstrap"
    n_reps=1000,
    n_jobs=-1
)

df_nei = nei_results["observed"]
df_pvals = nei_results.get("pvalues")  # present if method == "permutation"
df_lower = nei_results.get("lower_ci") # present if method == "bootstrap"
df_upper = nei_results.get("upper_ci") # present if method == "bootstrap"
Nei's Genetic Distance Heatmap between Pairwise Populations

Figure: Heatmap showing Nei’s (1972) genetic distance between pairwise populations.

Tip

When plotted via SNPio’s built-ins and in the MultiQC report, diagonals are shown as 0.0 for distances and 1.0 for permutation p-values for readability.

Weir & Cockerham Fst (Distance Matrix)

Compute pairwise Weir & Cockerham Fst between populations. Supports: observed Fst, permutation-based p-values, and bootstrap confidence intervals.

fst_results = pgs.fst_distance(
    method="permutation",   # "observed", "permutation", or "bootstrap"
    n_reps=1000,
    n_jobs=-1,
    palette="viridis",
    suppress_plot=False
)

df_fst = fst_results["observed"]
df_fst_p = fst_results.get("pvalues")      # if method == "permutation"
df_fst_l = fst_results.get("lower_ci")     # if method == "bootstrap"
df_fst_u = fst_results.get("upper_ci")     # if method == "bootstrap"
Pairwise Fst heatmap

Figure: Pairwise Fst matrix. Lighter colors indicate higher differentiation.

D-Statistics (Patterson, Partitioned, DFOIL)

Calculate D-statistics for testing introgression among four (Patterson’s D), or five (Partitioned D, D-FOIL) populations. Supports bootstrap resampling, jackknife resampling, overall and per-combination statistics, and plotting.

Also supported are randomized or deterministic individual selection per population, and limiting the maximum number of individuals per population for balanced sampling and/or speed.

Basic Example

from snpio import VCFReader, PopGenStatistics

gd = VCFReader(
    filename="example.vcf.gz",
    popmapfile="popmap.txt",
    prefix="dstat_analysis",
    force_popmap=True,
)

pgs = PopGenStatistics(gd)

df, summary = pgs.calculate_d_statistics(
    method="partitioned",
    population1="P1",
    population2="P2",
    population3="P3a",
    population4="P3b",
    outgroup="Out",
    num_bootstraps=1000,
    max_individuals_per_pop=5,
    individual_selection="random",
    per_combination=True,
    calc_overall=True,
    save_plot=True,
    use_jackknife=False,
    block_size=500,
    seed=42,
)

Fst Outlier Detection

Supports two methods:

  • Traditional permutations

  • DBSCAN clustering

# Permutation-based outliers (single DataFrame with unadjusted/adjusted
# p-values)
df_outliers = pgs.detect_fst_outliers(
    n_permutations=1000,
    correction_method="fdr_bh",  # or "bonferroni", "holm", etc.
    n_jobs=4,
    seed=123
)

# DBSCAN-based outliers
df_dbscan = pgs.detect_fst_outliers(
    use_dbscan=True,
    correction_method="fdr_bh",
    n_jobs=4,
    min_samples=5
)
Fst Outliers heatmap

Figure: Outlier SNPs based on Fst between population pairs.

Note

detect_fst_outliers now returns one DataFrame. If a correction method is provided, adjusted p-values are included as columns.

AMOVA (Experimental)

amova_result = pgs.amova(
    regionmap={
        "EA": "East",
        "GU": "East",
        "TT": "East",
        "DS": "West"
    },
    n_permutations=100,
    n_jobs=1
)

Performs Analysis of Molecular Variance based on hierarchical population structure.

PopGenStatistics Method Summary

Method

Description

Algorithm(s)

summary_statistics(fst_method="observed|permutation|bootstrap", n_reps, n_jobs, save_plots)

Calculates He, Ho, π, and pairwise Fst; returns (summary_stats, allele_stats_df).

Standard formulas; permutation/bootstrapping for Fst if requested

neis_genetic_distance(method, n_reps, n_jobs, palette, suppress_plot)

Computes Nei’s pairwise distances. Returns dict with keys: observed, and (optionally) pvalues, lower_ci, upper_ci.

Nei (1972)

fst_distance(method, n_reps, n_jobs, palette, suppress_plot)

Computes Weir & Cockerham pairwise Fst. Returns dict like Nei’s: observed plus (optionally) pvalues or CIs.

Weir & Cockerham (1984)

calculate_d_statistics(method, population1, population2, population3, population4=None, outgroup=None, num_bootstraps, max_individuals_per_pop, individual_selection, save_plot, seed, per_combination, calc_overall, use_jackknife, block_size)

Calculates Patterson / Partitioned / D-FOIL with bootstrap or jackknife support.

ABBA–BABA framework with resampling

detect_fst_outliers(correction_method=None, alpha=0.05, use_dbscan=False, n_permutations=1000, n_jobs=1, seed=None, min_samples=5, max_outliers_to_plot=None)

Detects outlier loci via permutations or DBSCAN. Returns a single DataFrame (includes unadjusted/adjusted p-values when requested).

Permutations or DBSCAN

pca(n_components=None, center=True, scale=False, n_axes=2|3, seed=None, ..., plot_format=None)

Runs PCA (KNN-imputes missing data, optional scaling, robust guards) and creates visualizations.

scikit-learn PCA

amova(regionmap=None, n_permutations=0, n_jobs=1, random_seed=None) (experimental)

Performs AMOVA with optional SNP-wise permutations for p-values.

Hierarchical variance partitioning

Next Steps

Continue to Section 5: MultiQC Report Generation if you want to explore SNPio’s support for generating interactive MultiQC reports.

MultiQC Report Generation

SNPio includes built-in support for generating interactive MultiQC reports that summarize results from all major analysis modules. These reports are ideal for data exploration, publication-ready figures, and batch comparisons.

The SNPioMultiQCReport class automatically collects results from:

  • Filtering (NRemover2)

  • Summary statistics and Fst estimates (PopGenStatistics)

  • Genetic distance matrices (Nei’s, Weir & Cockerham’s)

  • D-statistics (Patterson, Partitioned, D-FOIL)

  • Fst Outlier detection (DBSCAN, permutations)

  • Genotype encodings (GenotypeEncoder)

  • PCA

  • Input/output summary tables (e.g., number of samples/loci retained)

Basic Example

After completing your filtering and statistical analyses:

from snpio import SNPioMultiQC

report = SNPioMultiQC.build(
    prefix="my_report",
    output_dir="results/multiqc",
    overwrite=True,
)

Note

SNPioMultiQC should not be initialized directly. Instead, use it without parentheses followed by calling the build() method to generate the report from the current session’s results.

Tip

SNPioMultiQC should be called after all filtering and analysis steps are complete. It collects results from the current session and generates a comprehensive report of all filtering and analysis results.

This creates a self-contained .html report with interactive visualizations, significance summaries, and analysis metadata.

Tip

The MultiQC HTML report can be viewed in Firefox, Chrome, and possibly other web browsers. It is fully self-contained, meaning all data and visualizations are embedded in the HTML file.

Note

MultiQC support is available for all formats supported by SNPio (VCF, PHYLIP, STRUCTURE, GENEPOP).

Report Contents

The SNPio MultiQC report includes:

  • Filtering Summary (Sankey plot, missingness, sample/locus counts)

  • Summary Statistics (He, Ho, π)

  • Pairwise Fst Matrix (Weir & Cockerham)

  • Nei’s Distance Heatmap

  • D-statistics (histograms, significance bar plots)

  • Fst Outlier Heatmaps

  • PCA (colored by population)

  • Input Metadata Table (samples, loci, populations)

Custom Report Location

You can control the report path:

SNPioMultiQC.build(
      prefix="Example Report",
      output_dir="results/multiqc",
      overwrite=True,
  )

Tip

All generated reports are self-contained HTML files and require no internet connection to view.

Exporting Report Data

All visualizations in the report are backed by exportable CSV and JSON data, saved alongside the HTML file. This allows integration into downstream tools or publications.

Tip

The MultiQC HTML report is fully portable—just open the file in any browser or send it to collaborators.

Integration with Analysis Pipeline

You can integrate report generation as a final step in your SNPio pipeline:

# After all file I/O, filtering, encoding, and running PopGenStatistics
SNPioMultiQC.build(
      prefix="Example Report",
      output_dir="results/multiqc",
      overwrite=True,
  )

This ensures all results are captured in a single, interactive report for easy sharing and review.

Click Here to view an example MultiQC report generated by SNPio, showcasing the various analyses and visualizations available.

Next Steps

Continue to Section 6: Experimental Tree Parsing and Phylogenetic Analysis if you want to explore SNPio’s capabilities for working with phylogenetic trees and evolutionary models.

Tree Parsing and Phylogenetic Analysis (Experimental)

SNPio includes an experimental module, TreeParser, for working with phylogenetic trees in Newick or NEXUS formats. This class enables:

  • Reading and writing trees

  • Drawing and visualizing trees

  • Rerooting and pruning trees

  • Extracting subtrees by sample name

  • Getting distance matrices and rate models (Q-matrix, site rates)

Basic Example

from snpio import VCFReader, TreeParser

gd = VCFReader(
    filename="snpio/example_data/vcf_files/phylogen_subset14K_sorted.vcf.gz",
    popmapfile="snpio/example_data/popmaps/phylogen_nomx.popmap",
    force_popmap=True,
    prefix="snpio_example"
)

tp = TreeParser(
    genotype_data=gd,
    treefile="snpio/example_data/trees/test.tre",
    siterates="snpio/example_data/trees/test14K.rates",
    qmatrix="snpio/example_data/trees/test.iqtree",
    verbose=True
)

Reading and Drawing Trees

tree = tp.read_tree()
tree.draw()  # Visualize the tree

Saving Trees to File

Save in Newick format:

tp.write_tree(tree, save_path="output_tree.tre")

Save in NEXUS format:

tp.write_tree(tree, save_path="output_tree.nex", nexus=True)

Return tree as a string:

newick_string = tp.write_tree(tree, save_path=None)

Tree Editing and Analysis

Reroot tree using samples matching a pattern:

tp.reroot_tree("~EA")  # Uses regex to match sample names containing 'EA'

Prune samples:

pruned = tp.prune_tree("~ON")

Extract subtree from a sample pattern:

subtree = tp.get_subtree("~EA")

Write pruned tree to file:

tp.write_tree(pruned, save_path="pruned_tree.tre")

Inspect tree statistics:

stats = tp.tree_stats()
print(stats)

Distance Matrix and Evolutionary Rates

Get full distance matrix between all nodes:

dist_matrix = tp.get_distance_matrix()

Access the rate matrix Q (from .iqtree file or Q-matrix):

print(tp.qmat)

Access site-specific substitution rates:

print(tp.site_rates)

Input File Notes

  • treefile: Required. Accepts Newick or NEXUS.

  • siterates: Optional. One-column file with rate per alignment site.

  • qmatrix: Optional. Can be parsed from .iqtree or provided as a CSV or TSV file with a 4×4 substitution rate matrix for A/C/G/T.

Experimental Status

Warning

TreeParser is still under active development. Interface and behavior may change in future versions.

Tip

Regular expressions can be used with ~ prefix in get_subtree() and prune_tree().

Example Use Case: Extracting a Subtree and Saving to Newick

subtree = tp.get_subtree("~EA")
newick = tp.write_tree(subtree, save_path=None)
print(newick)

Conclusion

TreeParser allows for interactive tree visualization and manipulation, fully integrated with SNPio’s genotype data. This feature is especially useful for combining genetic variation data with phylogenetic inference pipelines.

Thank you for using SNPio! We hope this guide helps you get started with reading, filtering, encoding, analyzing, and visualizing your genotype data effectively. For more advanced usage and examples, please refer to the documentation or explore the source code on GitHub.

For any questions or issues, feel free to open an issue on the SNPio GitHub repository.

References

[BN96]

Mark A. Beaumont and Richard A. Nichols. Evaluating loci for use in the genetic analysis of population structure. Proceedings of the Royal Society B: Biological Sciences, 263(1377):1619–1626, 1996. doi:10.1098/rspb.1996.0237.

[ER13]

David A. R. Eaton and Richard H. Ree. Inferring phylogeny and introgression using radseq data: an example from flowering plants (\textit Pedicularis: orobanchaceae). Systematic Biology, 62(5):689–706, 2013. doi:10.1093/sysbio/syt032.

[ESQ92]

Laurent Excoffier, Peter E. Smouse, and Joseph M. Quattro. Analysis of molecular variance inferred from metric distances among dna haplotypes: application to human mitochondrial dna restriction data. Genetics, 131(2):479–491, 1992. doi:10.1093/genetics/131.2.479.

[FG08]

Matthieu Foll and Oscar Gaggiotti. A genome-scan method to identify selected loci appropriate for both dominant and codominant markers: a bayesian perspective. Genetics, 180(2):977–993, 2008. doi:10.1534/genetics.108.092221.

[GKB10]

Richard E. Green, Johannes Krause, and \textit et al. Briggs, Adrian W. A draft sequence of the neandertal genome. Science, 328(5979):710–722, 2010. doi:10.1126/science.1188021.

[Nei72]

Masatoshi Nei. Genetic distance between populations. American Naturalist, 106(949):283–292, 1972. doi:10.1086/282771.

[PPR06]

Nick Patterson, Alkes L. Price, and David Reich. Population structure and eigenanalysis. PLoS Genetics, 2(12):e190, 2006. doi:10.1371/journal.pgen.0020190.

[PH15]

James B. Pease and Matthew W. Hahn. Detection and polarization of introgression in a five-taxon phylogeny. Systematic Biology, 64(4):651–662, 2015. doi:10.1093/sysbio/syv023.

[WC84]

Bruce S. Weir and C. Clark Cockerham. Estimating f-statistics for the analysis of population structure. Evolution, 38(6):1358–1370, 1984. doi:10.2307/2408641.