snpio.io.genepop_reader.GenePopReader

class snpio.io.genepop_reader.GenePopReader(filename, popmapfile=None, allele_encoding=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)

Reads GenePop-formatted files into a GenotypeData object.

Supports:

- 2-digit and 3-digit allele codings
- Mixed ploidy: haploid and diploid entries
- Missing data encoded as 0000, 000000, or partial (e.g., 1000, 0010)
- Flexible locus headers (comma-separated or newline-separated)
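The coding rules listed above can be illustrated with a small standalone sketch. It is not the reader's actual parsing code: the helper names `split_genotype` and `parse_locus_names` are hypothetical, and the choice to represent a missing allele as `None` is an assumption made for illustration.

```python
from typing import List, Optional, Tuple


def split_genotype(token: str) -> Tuple[Optional[str], ...]:
    """Split one GenePop genotype field into allele codes.

    Hypothetical helper: 2 or 3 digits are treated as a haploid call,
    4 or 6 digits as a diploid call. An all-zero allele is returned as None.
    """
    token = token.strip()
    if not token.isdigit():
        raise ValueError(f"non-numeric genotype field: {token!r}")

    # Map total field length to the width of a single allele code.
    widths = {2: 2, 3: 3, 4: 2, 6: 3}
    if len(token) not in widths:
        raise ValueError(f"unsupported genotype length: {token!r}")
    width = widths[len(token)]

    alleles = [token[i:i + width] for i in range(0, len(token), width)]
    # "00" or "000" marks a missing allele, so partial codes such as
    # "1000" keep the allele that was observed.
    return tuple(None if set(a) == {"0"} else a for a in alleles)


def parse_locus_names(header_lines: List[str]) -> List[str]:
    """Collect locus names from comma-separated or one-per-line headers."""
    names: List[str] = []
    for line in header_lines:
        names.extend(part.strip() for part in line.split(",") if part.strip())
    return names


if __name__ == "__main__":
    for field in ("0102", "120130", "0000", "1000", "0010", "01"):
        print(field, split_genotype(field))
    print(parse_locus_names(["loc1, loc2", "loc3"]))
```

Keeping the observed half of a partially missing genotype is one possible policy; the reader itself may instead treat the whole genotype as missing.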

Example

>>> gp = GenePopReader("example.gen", popmapfile="pops.txt")
>>> gp.snp_data
array([...])

__init__(filename, popmapfile=None, allele_encoding=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)

Initialize the GenePopReader.

This class reads a GenePop file and extracts genotype data, populations, and sample information.

Parameters:

filename (str) – Path to the GenePop file.
popmapfile (Optional[str]) – Path to the population map file.
allele_encoding (Optional[dict[str, str]]) – Mapping of allele codes to IUPAC symbols.
force_popmap (bool) – Whether to enforce population mapping.
exclude_pops (Optional[List[str]]) – List of populations to exclude.
include_pops (Optional[List[str]]) – List of populations to include.
plot_format (str) – Format for output plots (default: “png”).
prefix (str) – Prefix for output files (default: “snpio”).
verbose (bool) – Whether to enable verbose logging (default: False).
debug (bool) – Whether to enable debug mode (default: False).
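To make the role of `allele_encoding` more concrete, the sketch below shows how a mapping from allele codes to nucleotides could turn a diploid pair of codes into a single IUPAC symbol. The mapping values, the helper `to_iupac`, and the use of "N" for missing genotypes are assumptions for illustration; the library's own conversion may differ.

```python
from typing import Dict, Optional

# Standard IUPAC ambiguity codes for two-base heterozygotes.
IUPAC_HET = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}

# Hypothetical encoding; real datasets define their own allele codes.
allele_encoding: Dict[str, str] = {"01": "A", "02": "C", "03": "G", "04": "T"}


def to_iupac(
    a1: Optional[str],
    a2: Optional[str],
    encoding: Dict[str, str],
    missing: str = "N",
) -> str:
    """Convert two allele codes to one IUPAC symbol (assumed policy)."""
    # Assumption: a genotype with any missing allele is reported as missing.
    if a1 is None or a2 is None:
        return missing
    b1, b2 = encoding[a1], encoding[a2]
    if b1 == b2:
        return b1  # homozygote keeps the plain nucleotide
    return IUPAC_HET[frozenset((b1, b2))]


if __name__ == "__main__":
    print(to_iupac("01", "03", allele_encoding))
    print(to_iupac("02", "02", allele_encoding))
    print(to_iupac(None, "01", allele_encoding))
```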

Methods

`__init__`(filename[, popmapfile, ...])	Initialize the GenePopReader.
`bgzip_file`(filepath)	BGZips a VCF file using pysam's BGZFile, preserving parent directories.
`build_vcf_header`()	Dynamically builds the VCF header using the stored header object.
`calc_missing`(df, *[, use_pops])	Compute missing-value statistics with proper locus + sample names.
`copy`()	Create a deep copy of the GenotypeData or subclass object.
`encode_to_vcf_format`(snp_data, ref_alleles, ...)	Vectorized encoding of IUPAC codes into VCF GT strings.
`get_population_indices`()	Create a mapping from population IDs to sample indices.
`get_ref_alt_alleles`(data)	Determine ref/alt alleles for each locus from IUPAC-encoded genotypes.
`get_reverse_iupac_mapping`()	Creates a reverse mapping from IUPAC codes -> allele tuples.
`load_aln`()	Load the GenePop file and parse genotype data.
`missingness_reports`([prefix, zoom, ...])	Generate missingness reports and plots.
`read_popmap`()	Read population map from file to map samples to populations.
`refs_alts_from_snp_data`(snp_matrix)	Determine REF/ALT per locus from a (samples x loci) IUPAC matrix.
`replace_alleles`(row, ref, alts)	Replace the alleles in the VCF row with the corresponding VCF genotype codes.
`set_alignment`(snp_data, samples, ...[, ...])	Set the alignment data and sample IDs after filtering.
`subset_with_popmap`(my_popmap, samples, force)	Subset popmap and samples based on population criteria.
`tabix_index`(filepath)	Creates a Tabix index for a bgzipped VCF file.
`update_vcf_attributes`(snp_data, ...)	Update VCF attributes after genotype data changes.
`write_genepop`(output_file[, genotype_data, ...])	Write the SNP data in GenePop format.
`write_phylip`(output_file[, genotype_data, ...])	Write the stored alignment as a PHYLIP file.
`write_popmap`(filename)	Write the population map to a file.
`write_structure`(output_file[, onerow, ...])	Write the stored alignment as a STRUCTURE file.
`write_vcf`(output_filename[, hdf5_file_path, ...])	Writes the GenotypeData object data to a VCF file in chunks.
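Several of the methods above (`get_reverse_iupac_mapping`, `encode_to_vcf_format`) deal with translating IUPAC codes into allele pairs and VCF GT strings. The following is a minimal, non-vectorized sketch of that idea, not the library's implementation: `REVERSE_IUPAC` and `vcf_gt` are hypothetical names, and any code outside the table is treated as missing.

```python
from typing import Dict, List, Tuple

# Reverse mapping: IUPAC code -> pair of nucleotides.
REVERSE_IUPAC: Dict[str, Tuple[str, str]] = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}


def vcf_gt(code: str, ref: str, alts: List[str]) -> str:
    """Encode one IUPAC genotype as an unphased VCF GT string."""
    pair = REVERSE_IUPAC.get(code)
    if pair is None:
        return "./."  # unknown or missing code
    order = [ref, *alts]  # index 0 is REF, 1.. are ALT alleles
    if any(base not in order for base in pair):
        return "./."  # allele not declared for this locus
    indices = sorted(order.index(base) for base in pair)
    return "/".join(str(i) for i in indices)


if __name__ == "__main__":
    print(vcf_gt("A", "A", ["G"]))
    print(vcf_gt("R", "G", ["A"]))
    print(vcf_gt("N", "A", ["G"]))
```

Sorting the allele indices gives a stable unphased representation (for example `0/1` rather than `1/0`).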

Attributes

`alt`	Get list of alternate alleles of length num_snps.
`biallelic_mask`	[n_loci] True where locus appears biallelic (A/C/G/T plus heterozygotes of those two).
`has_multiallelic`	True if any locus shows >2 unambiguous nucleotides.
`has_popmap`	True if population information is present.
`het_mask`	[n_samples, n_loci] True if genotype is heterozygous (IUPAC ambiguity codes).
`inputs`	Get GenotypeData keyword arguments as a dictionary.
`is_empty`	True if there are zero samples or loci.
`is_missing_locus`	[n_loci] True if an entire locus is missing across all samples.
`loci_indices`	Boolean array for retained loci in filtered alignment.
`locus_names`	Concrete locus names, generating defaults if absent.
`missing_mask`	Boolean mask [n_samples, n_loci] where True indicates a missing genotype.
`missing_rate`	Overall missing proportion in the alignment.
`nbytes`	Approximate RAM footprint of snp_data (bytes).
`num_inds`	Number of individuals (samples) in dataset.
`num_pops`	Number of populations in the dataset.
`num_snps`	Number of SNPs (loci) in the dataset.
`observed_iupac_per_locus`	Observed IUPAC codes per locus (excluding missing).
`output_dir`	Root output directory for this dataset.
`per_individual_het_rate`	Heterozygote proportion per individual (ignores missing).
`per_individual_missing`	Missing proportion per sample as a pandas Series indexed by sample name.
`per_locus_het_rate`	Heterozygote proportion per locus (ignores missing).
`per_locus_missing`	Missing proportion per locus as a pandas Series; uses marker names if present.
`plot_kwargs`	Backwards compatibility; convert PlotConfig to the old dict shape.
`plots_dir`	Standardized location for plots (pre/post-filtering aware).
`pop_sizes`	Population -> sample count.
`pop_to_indices`	Population -> list of sample indices (built from current popmap_inverse).
`popmap`	Dictionary mapping sample IDs to population IDs.
`popmap_inverse`	Dictionary mapping population IDs to lists of sample IDs.
`populations`	List of populations in the dataset.
`ref`	Get list of reference alleles of length num_snps.
`reports_dir`	Standardized location for reports (pre/post-filtering aware).
`sample_index_map`	Map sample ID -> row index (useful for subsetting).
`sample_indices`	Boolean array for retained samples in filtered alignment.
`samples`	List of sample IDs in the dataset.
`shape`	Tuple of (n_samples, n_loci) for the SNP data.
`snp_data`	Get the genotypes as a 2D list of shape (n_samples, n_loci).
`snpsdict`	Dictionary with sample IDs as keys and lists of genotypes as values.
`valid_mask`	Boolean mask [n_samples, n_loci] where True = non-missing genotype.
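The population-related attributes above (`popmap`, `popmap_inverse`, `sample_index_map`, `pop_to_indices`, `pop_sizes`) are all views of the same sample-to-population assignment. The sketch below rebuilds them from plain dictionaries to show how they relate. The sample and population names are made up, and `invert_popmap` is a hypothetical helper, so this is an illustration of the data shapes rather than the library's code.

```python
from collections import defaultdict
from typing import Dict, List


def invert_popmap(popmap: Dict[str, str]) -> Dict[str, List[str]]:
    """Turn sample -> population into population -> list of samples."""
    inverse: Dict[str, List[str]] = defaultdict(list)
    for sample, pop in popmap.items():
        inverse[pop].append(sample)
    return dict(inverse)


# Made-up example data.
samples = ["s1", "s2", "s3", "s4", "s5"]
popmap = {"s1": "north", "s2": "north", "s3": "south", "s4": "south", "s5": "south"}

popmap_inverse = invert_popmap(popmap)

# Sample ID -> row index in the genotype matrix.
sample_index_map = {sample: i for i, sample in enumerate(samples)}

# Population -> row indices of its samples.
pop_to_indices = {
    pop: [sample_index_map[s] for s in members]
    for pop, members in popmap_inverse.items()
}

# Population -> sample count.
pop_sizes = {pop: len(members) for pop, members in popmap_inverse.items()}

if __name__ == "__main__":
    print(popmap_inverse)
    print(pop_to_indices)
    print(pop_sizes)
```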