snpio.VCFReader

class snpio.VCFReader(filename=None, popmapfile=None, chunk_size=1000, store_format_fields=False, disable_progress_bar=False, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', plot_fontsize=18, plot_dpi=300, plot_despine=True, show_plots=False, prefix='snpio', verbose=False, sample_indices=None, loci_indices=None, debug=False)

A class to read VCF files into GenotypeData objects and write GenotypeData objects to VCF files.

VCFReader is the entry point for reading and writing VCF files. It subclasses GenotypeData and adds methods to read a VCF file, extract its attributes, and write GenotypeData objects back to VCF. The class uses the pysam library to read VCF files and the h5py library to write HDF5 files.

Example

>>> from snpio import VCFReader
>>>
>>> genotype_data = VCFReader(filename="example.vcf", popmapfile="popmap.txt", verbose=True)
>>> genotype_data.snp_data
array([["A", "A", "A"], ["T", "T", "T"], ["T", "T", "T"], ["A", "A", "A"]], dtype="<U1")
>>>
>>> genotype_data.samples
["sample1", "sample2", "sample3", "sample4"]
>>>
>>> genotype_data.num_inds
4
>>>
>>> genotype_data.num_snps
3
>>>
>>> genotype_data.populations
["pop1", "pop1", "pop2", "pop2"]
>>>
>>> genotype_data.popmap
{"sample1": "pop1", "sample2": "pop1", "sample3": "pop2", "sample4": "pop2"}
>>>
>>> genotype_data.popmap_inverse
{"pop1": ["sample1", "sample2"], "pop2": ["sample3", "sample4"]}
>>>
>>> genotype_data.loci_indices
array([True, True, True], dtype=bool)
>>>
>>> genotype_data.sample_indices
array([True, True, True, True], dtype=bool)
>>>
>>> genotype_data.ref
["A", "A", "A"]
>>>
>>> genotype_data.alt
["T", "T", "T"]
>>>
>>> genotype_data.missingness_reports()
>>>
>>> genotype_data.run_pca()
>>>
>>> genotype_data.write_vcf("output.vcf")
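
The popmap, popmap_inverse, and populations values shown in the example are three views of the same sample-to-population assignment. The following is a minimal sketch of how they relate, written in plain Python; invert_popmap is a hypothetical helper for illustration and is not part of snpio.

```python
from collections import defaultdict


def invert_popmap(popmap):
    """Group sample IDs by population ID (hypothetical helper, not snpio API)."""
    inverse = defaultdict(list)
    for sample, pop in popmap.items():
        inverse[pop].append(sample)
    return dict(inverse)


# Same sample-to-population assignment as in the example above.
popmap = {"sample1": "pop1", "sample2": "pop1", "sample3": "pop2", "sample4": "pop2"}

popmap_inverse = invert_popmap(popmap)  # population -> list of samples
populations = list(popmap.values())     # one population label per sample
```

Because Python dictionaries preserve insertion order, the per-sample population list comes out in the same order as the samples.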

filename

The name of the VCF file to read.

Type: str | None

popmapfile

The name of the population map file to read.

Type: str | None

chunk_size

The size of the chunks to read from the VCF file.

Type: int
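
To illustrate what reading in chunks means, here is a minimal sketch of chunked iteration over any sequence of records. It is not snpio's reader; iter_chunks is a hypothetical helper, and range(2500) stands in for VCF records.

```python
from itertools import islice


def iter_chunks(records, chunk_size=1000):
    """Yield successive lists of at most chunk_size records (hypothetical helper)."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# 2500 stand-in records processed 1000 at a time; the last chunk is shorter.
sizes = [len(chunk) for chunk in iter_chunks(range(2500), chunk_size=1000)]
```

Only one chunk is held in memory at a time, which is the usual reason to tune a chunk size: larger chunks mean fewer iterations but a larger peak memory footprint.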

store_format_fields

Whether to store FORMAT fields. Setting to True may result in an increase in runtime and memory usage.

Type: bool

force_popmap

Whether to force the use of the population map file.

Type: bool

exclude_pops

The populations to exclude.

Type: List[str] | None

include_pops

The populations to include.

Type: List[str] | None
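
The effect of include_pops and exclude_pops can be pictured as a filter over the population map. The sketch below is an assumption about the general idea, not snpio's implementation: subset_samples is a hypothetical helper, and how snpio behaves when both lists are given is not stated here, so this version simply applies both.

```python
def subset_samples(popmap, include_pops=None, exclude_pops=None):
    """Keep samples whose population passes both filters (hypothetical helper)."""
    kept = {}
    for sample, pop in popmap.items():
        if include_pops is not None and pop not in include_pops:
            continue  # population was not requested
        if exclude_pops is not None and pop in exclude_pops:
            continue  # population was explicitly excluded
        kept[sample] = pop
    return kept


popmap = {"sample1": "pop1", "sample2": "pop1", "sample3": "pop2", "sample4": "pop2"}

only_pop1 = subset_samples(popmap, include_pops=["pop1"])
without_pop1 = subset_samples(popmap, exclude_pops=["pop1"])
```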

plot_format

The format to save the plots in.

Type: Literal["png", "pdf", "jpg", "svg"] | None

plot_fontsize

The font size for the plots.

Type: int

plot_dpi

The DPI for the plots.

Type: int

plot_despine

Whether to remove the spines from the plots.

Type: bool

show_plots

Whether to show the plots.

Type: bool

prefix

The prefix to use for the output files.

Type: str

verbose

Whether to print verbose output.

Type: bool

sample_indices

The indices of the samples to read.

Type: np.ndarray

loci_indices

The indices of the loci to read.

Type: np.ndarray
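
In the class example, sample_indices and loci_indices appear as boolean arrays with one entry per sample or locus. The sketch below shows how such masks select rows and columns of a (samples x loci) matrix. It uses plain lists instead of NumPy arrays so it runs without dependencies; apply_masks is a hypothetical helper, and the genotype values are made up for illustration.

```python
from itertools import compress


def apply_masks(snp_data, sample_mask, loci_mask):
    """Keep rows where sample_mask is True and columns where loci_mask is True."""
    return [
        list(compress(row, loci_mask))          # select loci (columns)
        for row in compress(snp_data, sample_mask)  # select samples (rows)
    ]


# Illustrative matrix: 4 samples (rows) by 3 loci (columns).
snp_data = [
    ["A", "T", "T"],
    ["A", "A", "T"],
    ["T", "T", "A"],
    ["A", "T", "A"],
]
sample_mask = [True, False, True, True]  # drop the second sample
loci_mask = [True, True, False]          # drop the third locus

subset = apply_masks(snp_data, sample_mask, loci_mask)
```

With NumPy arrays the same selection is usually written as two boolean indexing steps, first over rows and then over columns.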

debug

Whether to enable debug mode.

Type: bool

num_records

The number of records in the VCF file.

Type: int

filetype

The type of the file.

Type: str

vcf_header

The VCF header.

Type: pysam.libcbcf.VariantHeader | None

info_fields

The VCF info fields.

Type: List[str] | None

resource_data

A dictionary to store resource data.

Type: dict

logger

The logger object.

Type: logging.Logger

snp_data

The SNP data.

Type: np.ndarray

samples

The sample names.

Type: np.ndarray

vcf_attributes_fn

The path to the HDF5 file containing the VCF attributes.

Type: Path

Note

If necessary, the VCF file is bgzipped, sorted, and indexed with Tabix to ensure efficient reading.
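
One way to decide whether compression is needed is to check whether a file already starts with a BGZF block header. The sketch below is an assumption about how such a check could look, not a description of what VCFReader does internally; looks_bgzipped is a hypothetical helper. It relies only on the BGZF layout from the SAM specification: a gzip header with the FEXTRA flag set and a "BC" extra subfield.

```python
import gzip
import os
import tempfile

# The standard 28-byte empty BGZF block that terminates every BGZF file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff 0600 4243 0200 1b00 0300 00000000 00000000"
)


def looks_bgzipped(path):
    """Return True if the file starts with a BGZF block header (hypothetical helper).

    Simplification: assumes the 'BC' subfield is the first extra subfield.
    """
    with open(path, "rb") as handle:
        header = handle.read(14)
    return (
        len(header) == 14
        and header[:4] == b"\x1f\x8b\x08\x04"  # gzip magic, deflate, FEXTRA set
        and header[12:14] == b"BC"             # BGZF subfield identifier
    )


with tempfile.TemporaryDirectory() as tmpdir:
    bgzf_path = os.path.join(tmpdir, "bgzf.vcf.gz")
    with open(bgzf_path, "wb") as handle:
        handle.write(BGZF_EOF)  # smallest valid BGZF file: just the EOF block

    gzip_path = os.path.join(tmpdir, "plain.vcf.gz")
    with gzip.open(gzip_path, "wb") as handle:
        handle.write(b"##fileformat=VCFv4.2\n")  # ordinary gzip, not BGZF

    bgzf_ok = looks_bgzipped(bgzf_path)
    gzip_ok = looks_bgzipped(gzip_path)
```

The distinction matters because Tabix indexing requires BGZF blocks; a file compressed with ordinary gzip has the same .gz extension but cannot be indexed.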

__init__(filename=None, popmapfile=None, chunk_size=1000, store_format_fields=False, disable_progress_bar=False, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', plot_fontsize=18, plot_dpi=300, plot_despine=True, show_plots=False, prefix='snpio', verbose=False, sample_indices=None, loci_indices=None, debug=False)

Initializes the VCFReader object.

This method sets up the VCFReader object with the provided parameters. It initializes the logger, sets the file paths, and prepares the output directory. The VCF file is bgzipped, sorted, and indexed using Tabix to ensure efficient reading.

Parameters:

filename (str | None) – The name of the VCF file to read. Defaults to None.
popmapfile (str | None) – The name of the population map file to read. Defaults to None.
chunk_size (int) – The size of the chunks to read from the VCF file. Defaults to 1000.
store_format_fields (bool) – Whether to store per-locus-per-sample FORMAT fields. Note that setting this parameter to True may result in an increase in runtime and memory usage. Defaults to False.
disable_progress_bar (bool) – Whether to disable the progress bar. Defaults to False.
force_popmap (bool) – Whether to force the use of the population map file. Defaults to False.
exclude_pops (List[str] | None) – The populations to exclude. Defaults to None.
include_pops (List[str] | None) – The populations to include. Defaults to None.
plot_format (Literal["png", "pdf", "jpg", "svg"] | None) – The format to save the plots in. Defaults to "png".
plot_fontsize (int) – The font size for the plots. Defaults to 18.
plot_dpi (int) – The DPI for the plots. Defaults to 300.
plot_despine (bool) – Whether to remove the spines from the plots. Defaults to True.
show_plots (bool) – Whether to show the plots. Defaults to False.
prefix (str) – The prefix to use for the output files. Defaults to "snpio".
verbose (bool) – Whether to print verbose output. Defaults to False.
sample_indices (np.ndarray | None) – The indices of the samples to read. Defaults to None.
loci_indices (np.ndarray | None) – The indices of the loci to read. Defaults to None.
debug (bool) – Whether to enable debug mode. Defaults to False.

Notes

Setting store_format_fields to True slows down VCFReader. If you don’t need the per-sample-per-locus metadata, leave this at False.

Methods

`__init__`([filename, popmapfile, chunk_size, ...])	Initializes the VCFReader object.
`bgzip_file`(filepath)	BGZips a VCF file using pysam's BGZFile, preserving parent directories.
`build_vcf_header`()	Dynamically builds the VCF header using the stored header object.
`calc_missing`(df, *[, use_pops])	Compute missing-value statistics with proper locus + sample names.
`copy`()	Create a deep copy of the GenotypeData or subclass object.
`encode_to_vcf_format`(snp_data, ref_alleles, ...)	Vectorized encoding of IUPAC codes into VCF GT strings.
`get_population_indices`()	Create a mapping from population IDs to sample indices.
`get_ref_alt_alleles`(data)	Determine ref/alt alleles for each locus from IUPAC-encoded genotypes.
`get_reverse_iupac_mapping`()	Creates a reverse mapping from IUPAC codes -> allele tuples.
`get_vcf_attributes`(vcf[, chunk_size])	Extract VCF attributes with pysam, chunked HDF5 writes, and vectorized GT→IUPAC.
`load_aln`()	Loads the alignment from the VCF file into the VCFReader object.
`missingness_reports`([prefix, zoom, ...])	Generate missingness reports and plots.
`read_popmap`()	Read population map from file to map samples to populations.
`refs_alts_from_snp_data`(snp_matrix)	Determine REF/ALT per locus from a (samples x loci) IUPAC matrix.
`replace_alleles`(row, ref, alts)	Replace the alleles in the VCF row with the corresponding VCF genotype codes.
`set_alignment`(snp_data, samples, ...[, ...])	Set the alignment data and sample IDs after filtering.
`subset_with_popmap`(my_popmap, samples, force)	Subset popmap and samples based on population criteria.
`tabix_index`(filepath)	Creates a Tabix index for a bgzipped VCF file.
`update_vcf_attributes`(snp_data, ...)	Updates the VCF attributes with new data in chunks.
`write_genepop`(output_file[, genotype_data, ...])	Write the SNP data in GenePop format.
`write_phylip`(output_file[, genotype_data, ...])	Write the stored alignment as a PHYLIP file.
`write_popmap`(filename)	Write the population map to a file.
`write_structure`(output_file[, onerow, ...])	Write the stored alignment as a STRUCTURE file.
`write_vcf`(output_filename[, hdf5_file_path, ...])	Writes the GenotypeData object data to a VCF file in chunks.
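
Several of the methods above convert between VCF GT strings and IUPAC codes. The following is a minimal sketch of that conversion for diploid genotypes, using the standard IUPAC nucleotide ambiguity codes. It is not snpio's vectorized implementation: IUPAC_CODES and gt_to_iupac are hypothetical names, and encoding missing or unrecognized calls as "N" is an assumption.

```python
# Sorted allele pair -> IUPAC code (homozygotes map to the base itself).
IUPAC_CODES = {
    ("A", "A"): "A", ("C", "C"): "C", ("G", "G"): "G", ("T", "T"): "T",
    ("A", "G"): "R", ("C", "T"): "Y", ("C", "G"): "S",
    ("A", "T"): "W", ("G", "T"): "K", ("A", "C"): "M",
}

# Reverse lookup: IUPAC code -> allele pair.
REVERSE_IUPAC = {code: pair for pair, code in IUPAC_CODES.items()}


def gt_to_iupac(gt, ref, alts, missing="N"):
    """Convert a diploid GT string such as '0/1' to an IUPAC code (sketch)."""
    alleles = [ref, *alts]                    # index 0 is REF, 1.. are ALTs
    calls = gt.replace("|", "/").split("/")   # treat phased and unphased alike
    if len(calls) != 2 or not all(c.isdigit() for c in calls):
        return missing                        # './.', haploid, or malformed
    if any(int(c) >= len(alleles) for c in calls):
        return missing                        # allele index not in REF/ALT
    pair = tuple(sorted(alleles[int(c)] for c in calls))
    return IUPAC_CODES.get(pair, missing)


het = gt_to_iupac("0/1", "A", ["T"])       # A/T heterozygote
hom_alt = gt_to_iupac("1|1", "A", ["T"])   # phased homozygous ALT
no_call = gt_to_iupac("./.", "A", ["T"])   # missing genotype
```

Sorting the allele pair before the lookup makes "0/1" and "1/0" map to the same code, which is why the table only needs one entry per heterozygote.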

Attributes

`alt`	Get list of alternate alleles of length num_snps.
`biallelic_mask`	[n_loci] True where locus appears biallelic (A/C/G/T plus heterozygotes of those two).
`has_multiallelic`	True if any locus shows >2 unambiguous nucleotides.
`has_popmap`	True if population information is present.
`het_mask`	[n_samples, n_loci] True if genotype is heterozygous (IUPAC ambiguity codes).
`inputs`	Get GenotypeData keyword arguments as a dictionary.
`is_empty`	True if there are zero samples or loci.
`is_missing_locus`	[n_loci] True if an entire locus is missing across all samples.
`loci_indices`	Boolean array for retained loci in filtered alignment.
`locus_names`	Concrete locus names, generating defaults if absent.
`missing_mask`	Boolean mask [n_samples, n_loci] where True indicates a missing genotype.
`missing_rate`	Overall missing proportion in the alignment.
`nbytes`	Approximate RAM footprint of snp_data (bytes).
`num_inds`	Number of individuals (samples) in dataset.
`num_pops`	Number of populations in the dataset.
`num_snps`	Number of SNPs (loci) in the dataset.
`observed_iupac_per_locus`	Observed IUPAC codes per locus (excluding missing).
`output_dir`	Root output directory for this dataset.
`per_individual_het_rate`	Heterozygote proportion per individual (ignores missing).
`per_individual_missing`	Missing proportion per sample as a pandas Series indexed by sample name.
`per_locus_het_rate`	Heterozygote proportion per locus (ignores missing).
`per_locus_missing`	Missing proportion per locus as a pandas Series; uses marker names if present.
`plot_kwargs`	Backwards compatibility; convert PlotConfig to the old dict shape.
`plots_dir`	Standardized location for plots (pre/post-filtering aware).
`pop_sizes`	Population -> sample count.
`pop_to_indices`	Population -> list of sample indices (built from current popmap_inverse).
`popmap`	Dictionary mapping sample IDs to population IDs.
`popmap_inverse`	Dictionary mapping population IDs to lists of sample IDs.
`populations`	List of populations in the dataset.
`ref`	Get list of reference alleles of length num_snps.
`reports_dir`	Standardized location for reports (pre/post-filtering aware).
`sample_index_map`	Map sample ID -> row index (useful for subsetting).
`sample_indices`	Boolean array for retained samples in filtered alignment.
`samples`	List of sample IDs in the dataset.
`shape`	Tuple of (n_samples, n_loci) for the SNP data.
`snp_data`	Get the genotypes as a 2D list of shape (n_samples, n_loci).
`snpsdict`	Dictionary with Sample IDs as keys and lists of genotypes as values.
`valid_mask`	Boolean mask [n_samples, n_loci] where True = non-missing genotype.
`vcf_attributes_fn`	The path to the HDF5 file containing the VCF attributes.
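
The missingness and heterozygosity attributes above are summaries over the (n_samples, n_loci) IUPAC matrix. The sketch below shows one plausible way to compute them with plain lists. It is not snpio's code: the function names mirror the attribute names for readability only, and the set of symbols treated as missing ("N", "-", "?", ".") is an assumption.

```python
MISSING = {"N", "-", "?", "."}  # assumed missing-data symbols
HET = set("RYSWKM")             # two-base IUPAC ambiguity codes


def per_locus_missing(snp_data):
    """Proportion of missing genotypes in each column (locus)."""
    if not snp_data:
        return []
    n_samples = len(snp_data)
    n_loci = len(snp_data[0])
    return [
        sum(row[j] in MISSING for row in snp_data) / n_samples
        for j in range(n_loci)
    ]


def per_individual_het_rate(snp_data):
    """Heterozygote proportion per row (sample), ignoring missing calls."""
    rates = []
    for row in snp_data:
        called = [g for g in row if g not in MISSING]
        rates.append(sum(g in HET for g in called) / len(called) if called else float("nan"))
    return rates


# Illustrative matrix: 4 samples (rows) by 3 loci (columns).
snp_data = [
    ["A", "W", "N"],
    ["A", "T", "N"],
    ["N", "W", "T"],
    ["A", "T", "T"],
]

locus_missing = per_locus_missing(snp_data)
het_rates = per_individual_het_rate(snp_data)
missing_rate = sum(g in MISSING for row in snp_data for g in row) / (len(snp_data) * len(snp_data[0]))
```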