snpio.io.structure_reader.StructureReader

class snpio.io.structure_reader.StructureReader(filename=None, popmapfile=None, has_popids=False, has_marker_names=False, allele_start_col=None, allele_encoding=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)

Read STRUCTURE file into a GenotypeData object.

This class reads STRUCTURE files in either one-row or two-row format. In one-row format, each genotype is a pair of consecutive alleles on the same line. In two-row format, each sample spans two lines: the first line holds the first allele at each locus and the second line holds the second allele (e.g., "1" and "1" on separate lines for a homozygous genotype). In two-row format, the sample ID and the population ID (if has_popids=True) should be repeated on each row of alleles.
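To make the two layouts concrete, here is a minimal sketch in plain Python that turns either layout into the same per-sample list of allele pairs. It is an illustration only, not StructureReader's actual parser; the helper name pair_alleles and its n_meta argument (number of leading non-allele columns) are hypothetical.

```python
def pair_alleles(lines, n_meta=1, two_row=True):
    """Return {sample: [(allele1, allele2), ...]} from STRUCTURE-style lines.

    Hypothetical helper for illustration; n_meta is the number of leading
    non-allele columns (sample ID, plus population ID if present).
    """
    rows_by_sample = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        sample, alleles = fields[0], fields[n_meta:]
        rows_by_sample.setdefault(sample, []).append(alleles)

    genotypes = {}
    for sample, rows in rows_by_sample.items():
        if two_row:
            if len(rows) != 2:
                raise ValueError(f"{sample}: expected 2 rows, found {len(rows)}")
            # Row 1 holds the first allele per locus, row 2 the second.
            genotypes[sample] = list(zip(rows[0], rows[1]))
        else:
            # One row: consecutive columns form the two alleles of a locus.
            row = rows[0]
            genotypes[sample] = [(row[i], row[i + 1]) for i in range(0, len(row), 2)]
    return genotypes


# The same two samples and three loci in both layouts.
two_row_lines = ["Sample1 1 2 3", "Sample1 1 4 3", "Sample2 2 2 4", "Sample2 2 3 4"]
one_row_lines = ["Sample1 1 1 2 4 3 3", "Sample2 2 2 2 3 4 4"]

print(pair_alleles(two_row_lines, two_row=True))
print(pair_alleles(one_row_lines, two_row=False))
```

Both calls yield identical genotype pairs, which is the sense in which the two formats carry the same information.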

The first column is always the sample name, and the second column must be the population ID if has_popids=True. If has_marker_names=True, the first line of the file must contain the marker names, which are stored in self.marker_names. The allele_start_col parameter specifies the zero-based column index where the alleles begin; the columns from that index onward are genotypes, which are converted to IUPAC codes.
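The column layout can be sketched as follows. The default rule, 1 + (1 if has_popids else 0), is the one documented for allele_start_col; the function names default_allele_start_col and split_row are hypothetical and exist only for this illustration, not in SNPio.

```python
def default_allele_start_col(has_popids):
    """Zero-based index of the first allele column when none is given."""
    return 1 + (1 if has_popids else 0)


def split_row(line, has_popids=False, allele_start_col=None):
    """Split one data line into (sample, population or None, allele columns)."""
    if allele_start_col is None:
        allele_start_col = default_allele_start_col(has_popids)
    fields = line.split()
    sample = fields[0]                       # column 0: sample name
    pop = fields[1] if has_popids else None  # column 1: population ID, if present
    return sample, pop, fields[allele_start_col:]


print(split_row("Sample1 pop1 1 1 2 4", has_popids=True))
# An explicit allele_start_col skips any extra columns before the alleles.
print(split_row("Sample1 pop1 0 1 2", has_popids=True, allele_start_col=3))
```

Passing allele_start_col explicitly is only needed when the file carries extra columns between the sample or population ID and the first allele.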

If no popmap filename is provided and has_popids=True, the class will create a default population map based on the population IDs in the STRUCTURE file, saved to {prefix}_output/alignments/popmap.txt or {prefix}_output/nremover/alignments/popmap.txt.
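As a rough illustration of deriving a population map from the population-ID column, the sketch below collects one sample-to-population entry per sample and renders it as text. The helper names and the tab-separated two-column layout are assumptions made for this example; the file SNPio actually writes may be formatted differently.

```python
def popmap_from_rows(rows):
    """Build {sample: population} from lines whose second column is a popID."""
    popmap = {}
    for line in rows:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sample, pop = fields[0], fields[1]
        # In two-row files each sample appears twice; both rows must agree.
        if popmap.setdefault(sample, pop) != pop:
            raise ValueError(f"{sample} is assigned to more than one population")
    return popmap


def format_popmap(popmap, sep="\t"):
    """Render the map as one 'sample<sep>population' line per sample (assumed layout)."""
    return "".join(f"{sample}{sep}{pop}\n" for sample, pop in popmap.items())


rows = [
    "Sample1 pop1 1 2",
    "Sample1 pop1 1 4",
    "Sample2 pop2 2 2",
    "Sample2 pop2 2 3",
]
popmap = popmap_from_rows(rows)
print(format_popmap(popmap), end="")
```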

__init__(filename=None, popmapfile=None, has_popids=False, has_marker_names=False, allele_start_col=None, allele_encoding=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)

Parameters:

filename (str) – path to STRUCTURE file.
popmapfile (str) – path to popmap file.
has_popids (bool) – if True, the file's second column holds population IDs and is skipped when reading alleles.
has_marker_names (bool) – if True, first line is marker names.
allele_start_col (int) – zero-based index where alleles begin; if None, defaults to 1 + (1 if has_popids else 0).
allele_encoding (dict) – dictionary mapping integer allele codes to nucleotides. If None, defaults to {1: "A", 2: "C", 3: "G", 4: "T"}. With the default, homozygous genotypes become the corresponding base (1/1 → A, 2/2 → C, 3/3 → G, 4/4 → T) and heterozygous genotypes become IUPAC ambiguity codes (1/2 → M, 1/3 → R, 1/4 → W, 2/3 → S, 2/4 → Y, 3/4 → K).
force_popmap (bool) – if True, force popmap even if not needed.
exclude_pops (list[str]) – list of populations to exclude.
include_pops (list[str]) – list of populations to include.
plot_format (str) – format for plots (png, pdf, jpg, svg).
prefix (str) – prefix for output directories ({prefix}_output) and log files.
verbose (bool) – if True, print verbose messages.
debug (bool) – if True, print debug messages.
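The genotype-to-IUPAC mapping listed under allele_encoding can be sketched in a few lines. This is an illustration of the mapping, not the library's implementation: the function name to_iupac is hypothetical, and returning "N" for any allele code absent from the encoding (for example a missing-data code) is an assumption of this sketch.

```python
DEFAULT_ENCODING = {1: "A", 2: "C", 3: "G", 4: "T"}

# Standard IUPAC ambiguity codes for heterozygous pairs of bases.
IUPAC_HET = {
    frozenset("AC"): "M",
    frozenset("AG"): "R",
    frozenset("AT"): "W",
    frozenset("CG"): "S",
    frozenset("CT"): "Y",
    frozenset("GT"): "K",
}


def to_iupac(allele1, allele2, encoding=None, missing="N"):
    """Convert one pair of integer allele codes to a single IUPAC character."""
    encoding = DEFAULT_ENCODING if encoding is None else encoding
    base1, base2 = encoding.get(allele1), encoding.get(allele2)
    if base1 is None or base2 is None:
        return missing  # assumption: unknown codes are treated as missing
    if base1 == base2:
        return base1    # homozygous: the base itself
    # Heterozygous: order of the two alleles does not matter.
    return IUPAC_HET.get(frozenset((base1, base2)), missing)


for pair in [(1, 1), (1, 2), (2, 1), (3, 4)]:
    print(pair, "->", to_iupac(*pair))
```

Using a frozenset as the lookup key makes 1/2 and 2/1 resolve to the same code, M.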

Raises:

AlignmentFileNotFoundError – if the file does not exist.
AlignmentFormatError – if the file format is incorrect.
StructureAlignmentSampleMismatch – if the number of samples does not match the number of genotypes.

Example

>>> gd = StructureReader(
...    filename="path/to/structure_file.txt",
...    popmapfile="path/to/popmap_file.txt",
...    has_popids=True,
...    has_marker_names=False,
...    allele_start_col=2,
...    force_popmap=False,
...    exclude_pops=["pop1"],
...    include_pops=["pop2"],
...    plot_format="png",
...    prefix="snpio",
...    verbose=True,
...    debug=False,
... )

>>> print(gd.snp_data)
[['A', 'C', 'G', 'T'],
 ['T', 'G', 'C', 'A'],
 ['C', 'A', 'T', 'G']]

>>> print(gd.marker_names)
['M1', 'M2', 'M3', 'M4']

>>> print(gd.samples)
['Sample1', 'Sample2', 'Sample3']

>>> print(gd.populations)
['pop1', 'pop2', 'pop3']

>>> print(gd.num_snps)
4

>>> print(gd.num_inds)
3

>>> print(gd.popmap)
{'Sample1': 'pop1', 'Sample2': 'pop2', 'Sample3': 'pop3'}

>>> print(gd.popmap_inverse)
{'pop1': ['Sample1'], 'pop2': ['Sample2'], 'pop3': ['Sample3']}
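The relationship between popmap (sample ID to population ID) and popmap_inverse (population ID to list of sample IDs) can be sketched with a small standalone function. The name invert_popmap is hypothetical and is not part of SNPio.

```python
from collections import defaultdict


def invert_popmap(popmap):
    """Turn {sample: population} into {population: [samples]}."""
    inverse = defaultdict(list)
    for sample, pop in popmap.items():
        inverse[pop].append(sample)  # keeps samples in their original order
    return dict(inverse)


popmap = {"Sample1": "pop1", "Sample2": "pop2", "Sample3": "pop3"}
print(invert_popmap(popmap))
```

Every sample appears under exactly one population in the inverted mapping, so the list lengths sum to the number of samples.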

Methods

`__init__`([filename, popmapfile, has_popids, ...])	Read STRUCTURE file into a GenotypeData object.
`bgzip_file`(filepath)	BGZips a VCF file using pysam's BGZFile, preserving parent directories.
`build_vcf_header`()	Dynamically builds the VCF header using the stored header object.
`calc_missing`(df, *[, use_pops])	Compute missing-value statistics with proper locus + sample names.
`copy`()	Create a deep copy of the GenotypeData or subclass object.
`encode_to_vcf_format`(snp_data, ref_alleles, ...)	Vectorized encoding of IUPAC codes into VCF GT strings.
`get_population_indices`()	Create a mapping from population IDs to sample indices.
`get_ref_alt_alleles`(data)	Determine ref/alt alleles for each locus from IUPAC-encoded genotypes.
`get_reverse_iupac_mapping`()	Creates a reverse mapping from IUPAC codes -> allele tuples.
`load_aln`()	Efficiently load a STRUCTURE file with optional header and population IDs.
`missingness_reports`([prefix, zoom, ...])	Generate missingness reports and plots.
`read_popmap`()	Read population map from file to map samples to populations.
`refs_alts_from_snp_data`(snp_matrix)	Determine REF/ALT per locus from a (samples x loci) IUPAC matrix.
`replace_alleles`(row, ref, alts)	Replace the alleles in the VCF row with the corresponding VCF genotype codes.
`set_alignment`(snp_data, samples, ...[, ...])	Set the alignment data and sample IDs after filtering.
`subset_with_popmap`(my_popmap, samples, force)	Subset popmap and samples based on population criteria.
`tabix_index`(filepath)	Creates a Tabix index for a bgzipped VCF file.
`update_vcf_attributes`(snp_data, ...)	Update VCF attributes after genotype data changes.
`write_genepop`(output_file[, genotype_data, ...])	Write the SNP data in GenePop format.
`write_phylip`(output_file[, genotype_data, ...])	Write the stored alignment as a PHYLIP file.
`write_popmap`(filename)	Write the population map to a file.
`write_structure`(output_file[, onerow, ...])	Write the stored alignment as a STRUCTURE file.
`write_vcf`(output_filename[, hdf5_file_path, ...])	Writes the GenotypeData object data to a VCF file in chunks.

Attributes

`alt`	Get list of alternate alleles of length num_snps.
`biallelic_mask`	[n_loci] True where locus appears biallelic (A/C/G/T plus heterozygotes of those two).
`has_multiallelic`	True if any locus shows >2 unambiguous nucleotides.
`has_popmap`	True if population information is present.
`het_mask`	[n_samples, n_loci] True if genotype is heterozygous (IUPAC ambiguity codes).
`inputs`	Get GenotypeData keyword arguments as a dictionary.
`is_empty`	True if there are zero samples or loci.
`is_missing_locus`	[n_loci] True if an entire locus is missing across all samples.
`loci_indices`	Boolean array for retained loci in filtered alignment.
`locus_names`	Concrete locus names, generating defaults if absent.
`missing_mask`	Boolean mask [n_samples, n_loci] where True indicates a missing genotype.
`missing_rate`	Overall missing proportion in the alignment.
`nbytes`	Approximate RAM footprint of snp_data (bytes).
`num_inds`	Number of individuals (samples) in dataset.
`num_pops`	Number of populations in the dataset.
`num_snps`	Number of SNPs (loci) in the dataset.
`observed_iupac_per_locus`	Observed IUPAC codes per locus (excluding missing).
`output_dir`	Root output directory for this dataset.
`per_individual_het_rate`	Heterozygote proportion per individual (ignores missing).
`per_individual_missing`	Missing proportion per sample as a pandas Series indexed by sample name.
`per_locus_het_rate`	Heterozygote proportion per locus (ignores missing).
`per_locus_missing`	Missing proportion per locus as a pandas Series; uses marker names if present.
`plot_kwargs`	Backwards compatibility; convert PlotConfig to the old dict shape.
`plots_dir`	Standardized location for plots (pre/post-filtering aware).
`pop_sizes`	Population -> sample count.
`pop_to_indices`	Population -> list of sample indices (built from current popmap_inverse).
`popmap`	Dictionary mapping sample IDs to population IDs.
`popmap_inverse`	Dictionary mapping population IDs to lists of sample IDs.
`populations`	List of populations in the dataset.
`ref`	Get list of reference alleles of length num_snps.
`reports_dir`	Standardized location for reports (pre/post-filtering aware).
`sample_index_map`	Map sample ID -> row index (useful for subsetting).
`sample_indices`	Boolean array for retained samples in filtered alignment.
`samples`	List of sample IDs in the dataset.
`shape`	Tuple of (n_samples, n_loci) for the SNP data.
`snp_data`	Get the genotypes as a 2D list of shape (n_samples, n_loci).
`snpsdict`	Dictionary with Sample IDs as keys and lists of genotypes as values.
`valid_mask`	Boolean mask [n_samples, n_loci] where True = non-missing genotype.