snpio.io.genepop_reader.GenePopReader

class snpio.io.genepop_reader.GenePopReader(filename, popmapfile=None, allele_encoding=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)

Reads GenePop-formatted files into a GenotypeData object.

Supports:

  • 2-digit and 3-digit allele codings

  • Mixed ploidy: haploid and diploid entries

  • Missing data encoded as 0000, 000000, or partial (e.g., 1000, 0010)

  • Flexible locus headers (comma-separated or newline-separated)
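The genotype codings listed above can be told apart by token length alone. The following is a minimal sketch of that idea, not snpio's actual parser; the helper name split_genotype is hypothetical, and it assumes an allele made up entirely of zeros means missing.

```python
def split_genotype(token):
    """Split one GenePop genotype token into a tuple of allele codes.

    Hypothetical helper for illustration only (not part of snpio).
    An allele coded as all zeros is returned as None (missing).
    """
    token = token.strip()
    if not token.isdecimal():
        raise ValueError(f"non-numeric genotype: {token!r}")

    # Token length determines (number of alleles, digits per allele):
    # 2 or 3 digits -> haploid; 4 or 6 digits -> diploid.
    layouts = {2: (1, 2), 3: (1, 3), 4: (2, 2), 6: (2, 3)}
    if len(token) not in layouts:
        raise ValueError(f"unsupported genotype length: {token!r}")

    n_alleles, width = layouts[len(token)]
    alleles = [token[i * width:(i + 1) * width] for i in range(n_alleles)]
    return tuple(None if int(a) == 0 else a for a in alleles)


print(split_genotype("0102"))    # diploid, 2-digit coding
print(split_genotype("100200"))  # diploid, 3-digit coding
print(split_genotype("1000"))    # partially missing diploid entry
print(split_genotype("001"))     # haploid, 3-digit coding
```

Keying on length works because the four supported layouts have distinct lengths, so no separate ploidy flag is needed in this sketch.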

Example

>>> gp = GenePopReader("example.gen", popmapfile="pops.txt")
>>> gp.snp_data
array([...])

__init__(filename, popmapfile=None, allele_encoding=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)

Initialize the GenePopReader.

This class reads a GenePop file and extracts genotype data, populations, and sample information.

Parameters:
  • filename (str) – Path to the GenePop file.

  • popmapfile (Optional[str]) – Path to the population map file.

  • allele_encoding (Optional[dict[str, str]]) – Mapping of allele codes to IUPAC symbols.

  • force_popmap (bool) – Whether to enforce population mapping.

  • exclude_pops (Optional[List[str]]) – List of populations to exclude.

  • include_pops (Optional[List[str]]) – List of populations to include.

  • plot_format (str) – Format for output plots (default: “png”).

  • prefix (str) – Prefix for output files (default: “snpio”).

  • verbose (bool) – Whether to enable verbose logging (default: False).

  • debug (bool) – Whether to enable debug mode (default: False).
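To illustrate what include_pops and exclude_pops are meant to express, here is a minimal sketch of population filtering on a sample-to-population dictionary. It is an assumption about the general idea, not the library's implementation; filter_popmap and invert_popmap are hypothetical helpers, and the sketch simply applies both filters when both are given.

```python
def filter_popmap(popmap, include_pops=None, exclude_pops=None):
    """Keep samples whose population passes both filters.

    Hypothetical helper: popmap maps sample ID -> population ID.
    """
    include = set(include_pops) if include_pops else None
    exclude = set(exclude_pops or [])
    return {
        sample: pop
        for sample, pop in popmap.items()
        if (include is None or pop in include) and pop not in exclude
    }


def invert_popmap(popmap):
    """Build population ID -> list of sample IDs, preserving sample order."""
    inverse = {}
    for sample, pop in popmap.items():
        inverse.setdefault(pop, []).append(sample)
    return inverse


popmap = {"s1": "A", "s2": "A", "s3": "B", "s4": "C"}
kept = filter_popmap(popmap, exclude_pops=["C"])
print(invert_popmap(kept))
```

The inverted dictionary has the same shape as the popmap_inverse attribute described on this page (population IDs mapped to lists of sample IDs).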

Methods

__init__(filename[, popmapfile, ...])

Initialize the GenePopReader.

bgzip_file(filepath)

BGZips a VCF file using pysam's BGZFile, preserving parent directories.

build_vcf_header()

Dynamically builds the VCF header using the stored header object.

calc_missing(df, *[, use_pops])

Compute missing-value statistics with proper locus + sample names.

copy()

Create a deep copy of the GenotypeData or subclass object.

encode_to_vcf_format(snp_data, ref_alleles, ...)

Vectorized encoding of IUPAC codes into VCF GT strings.

get_population_indices()

Create a mapping from population IDs to sample indices.

get_ref_alt_alleles(data)

Determine ref/alt alleles for each locus from IUPAC-encoded genotypes.

get_reverse_iupac_mapping()

Creates a reverse mapping from IUPAC codes -> allele tuples.

load_aln()

Load the GenePop file and parse genotype data.

missingness_reports([prefix, zoom, ...])

Generate missingness reports and plots.

read_popmap()

Read population map from file to map samples to populations.

refs_alts_from_snp_data(snp_matrix)

Determine REF/ALT per locus from a (samples x loci) IUPAC matrix.

replace_alleles(row, ref, alts)

Replace the alleles in the VCF row with the corresponding VCF genotype codes.

set_alignment(snp_data, samples, ...[, ...])

Set the alignment data and sample IDs after filtering.

subset_with_popmap(my_popmap, samples, force)

Subset popmap and samples based on population criteria.

tabix_index(filepath)

Creates a Tabix index for a bgzipped VCF file.

update_vcf_attributes(snp_data, ...)

Update VCF attributes after genotype data changes.

write_genepop(output_file[, genotype_data, ...])

Write the SNP data in GenePop format.

write_phylip(output_file[, genotype_data, ...])

Write the stored alignment as a PHYLIP file.

write_popmap(filename)

Write the population map to a file.

write_structure(output_file[, onerow, ...])

Write the stored alignment as a STRUCTURE file.

write_vcf(output_filename[, hdf5_file_path, ...])

Writes the GenotypeData object data to a VCF file in chunks.
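Several of the methods above translate between IUPAC codes and VCF genotype strings. The sketch below shows one way such a translation could work for diploid calls. It is not the code behind get_reverse_iupac_mapping() or encode_to_vcf_format(); the names IUPAC_TO_ALLELES and iupac_to_gt are hypothetical, and treating any unrecognized code as missing ("./.") is an assumption.

```python
# Standard IUPAC nucleotide codes expanded to unordered diploid allele pairs.
IUPAC_TO_ALLELES = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}


def iupac_to_gt(code, ref, alts):
    """Convert one IUPAC genotype code to an unphased VCF GT string.

    Hypothetical helper. ref is the reference allele and alts is the list
    of alternate alleles for the locus. Codes that cannot be expressed with
    those alleles are returned as missing ("./.").
    """
    alleles = IUPAC_TO_ALLELES.get(code.upper())
    if alleles is None:
        return "./."

    # REF is index 0; ALT alleles are numbered from 1 in list order.
    index = {ref: 0}
    for i, alt in enumerate(alts, start=1):
        index[alt] = i

    try:
        first, second = sorted(index[a] for a in alleles)
    except KeyError:
        return "./."
    return f"{first}/{second}"


print(iupac_to_gt("R", ref="A", alts=["G"]))
print(iupac_to_gt("N", ref="A", alts=["G"]))
```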

Attributes

alt

Get list of alternate alleles of length num_snps.

biallelic_mask

[n_loci] True where locus appears biallelic (A/C/G/T plus heterozygotes of those two).

has_multiallelic

True if any locus shows >2 unambiguous nucleotides.

has_popmap

True if population information is present.

het_mask

[n_samples, n_loci] True if genotype is heterozygous (IUPAC ambiguity codes).

inputs

Get GenotypeData keyword arguments as a dictionary.

is_empty

True if there are zero samples or loci.

is_missing_locus

[n_loci] True if an entire locus is missing across all samples.

loci_indices

Boolean array for retained loci in filtered alignment.

locus_names

Concrete locus names, generating defaults if absent.

missing_mask

Boolean mask [n_samples, n_loci] where True indicates a missing genotype.

missing_rate

Overall missing proportion in the alignment.

nbytes

Approximate RAM footprint of snp_data (bytes).

num_inds

Number of individuals (samples) in dataset.

num_pops

Number of populations in the dataset.

num_snps

Number of SNPs (loci) in the dataset.

observed_iupac_per_locus

Observed IUPAC codes per locus (excluding missing).

output_dir

Root output directory for this dataset.

per_individual_het_rate

Heterozygote proportion per individual (ignores missing).

per_individual_missing

Missing proportion per sample as a pandas Series indexed by sample name.

per_locus_het_rate

Heterozygote proportion per locus (ignores missing).

per_locus_missing

Missing proportion per locus as a pandas Series; uses marker names if present.

plot_kwargs

Backwards compatibility; convert PlotConfig to the old dict shape.

plots_dir

Standardized location for plots (pre/post-filtering aware).

pop_sizes

Population -> sample count.

pop_to_indices

Population -> list of sample indices (built from current popmap_inverse).

popmap

Dictionary mapping sample IDs to population IDs.

popmap_inverse

Dictionary mapping population IDs to lists of sample IDs.

populations

List of populations in the dataset.

ref

Get list of reference alleles of length num_snps.

reports_dir

Standardized location for reports (pre/post-filtering aware).

sample_index_map

Map sample ID -> row index (useful for subsetting).

sample_indices

Boolean array for retained samples in filtered alignment.

samples

List of sample IDs in the dataset.

shape

Tuple of (n_samples, n_loci) for the SNP data.

snp_data

Get the genotypes as a 2D list of shape (n_samples, n_loci).

snpsdict

Dictionary with Sample IDs as keys and lists of genotypes as values.

valid_mask

Boolean mask [n_samples, n_loci] where True = non-missing genotype.
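The mask and rate attributes above can be pictured with a small pure-Python sketch over a samples-by-loci matrix of IUPAC codes. This is an illustration under stated assumptions rather than the library's code: the function names mirror the attribute names but are standalone hypothetical helpers, and the set of characters treated as missing ("N", "-", "?", ".") is assumed.

```python
MISSING = {"N", "-", "?", "."}  # assumed missing-data symbols
HET_CODES = set("RYSWKM")       # two-base IUPAC ambiguity codes


def per_locus_missing(matrix):
    """Proportion of missing genotypes in each column (locus)."""
    n_samples = len(matrix)
    n_loci = len(matrix[0])
    return [
        sum(row[j] in MISSING for row in matrix) / n_samples
        for j in range(n_loci)
    ]


def per_individual_het_rate(matrix):
    """Heterozygote proportion per row (sample), ignoring missing calls."""
    rates = []
    for row in matrix:
        called = [g for g in row if g not in MISSING]
        if called:
            rates.append(sum(g in HET_CODES for g in called) / len(called))
        else:
            rates.append(float("nan"))  # no called genotypes for this sample
    return rates


# Rows are samples, columns are loci.
matrix = [
    ["A", "R", "N"],
    ["A", "G", "N"],
    ["N", "Y", "N"],
]
print(per_locus_missing(matrix))
print(per_individual_het_rate(matrix))
```

Dividing by the number of called genotypes rather than by the number of loci is what "ignores missing" means in this sketch.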