snpio.PhylipReader

class snpio.PhylipReader(filename=None, popmapfile=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)

Class to read and write PHYLIP files.

This class inherits from the GenotypeData class and provides methods to read and write PHYLIP files. The PHYLIP format is a simple text format for representing multiple sequence alignments. The first line of a PHYLIP file contains the number of samples and the number of loci. Each subsequent line contains the sample ID followed by the sequence data.

Example

>>> from snpio import PhylipReader
>>>
>>> genotype_data = PhylipReader(filename="example.phy", popmapfile="example.popmap", verbose=True)
>>>
>>> genotype_data.snp_data
array([["A", "T", "T", "A"], ["C", "G", "G", "C"], ["A", "T", "T", "A"]], dtype="<U1")
>>>
>>> genotype_data.samples
["Sample1", "Sample2", "Sample3", "Sample4"]
>>>
>>> genotype_data.populations
["Pop1", "Pop1", "Pop2", "Pop2"]
>>>
>>> genotype_data.num_snps
3
>>>
>>> genotype_data.num_inds
4
>>>
>>> genotype_data.popmap
{"Sample1": "Pop1", "Sample2": "Pop1", "Sample3": "Pop2", "Sample4": "Pop2"}
>>>
>>> genotype_data.popmap_inverse
{"Pop1": ["Sample1", "Sample2"], "Pop2": ["Sample3", "Sample4"]}
>>>
>>> genotype_data.ref
["A", "C", "A"]
>>>
>>> genotype_data.alt
["T", "G", "T"]
>>>
>>> genotype_data.missingness_reports()
>>>
>>> genotype_data.run_pca()
>>>
>>> genotype_data.write_phylip("output.phy")

filename

Name of the PHYLIP file.

Type: str

popmapfile

Name of the population map file.

Type: str

force_popmap

If True, the population map file is required.

Type: bool

exclude_pops

List of populations to exclude.

Type: List[str]

include_pops

List of populations to include.

Type: List[str]

plot_format

Format for saving plots. Default is ‘png’.

Type: str

prefix

Prefix for output files.

Type: str

verbose

If True, status updates are printed.

Type: bool

samples

List of sample IDs.

Type: List[str]

snp_data

SNP genotypes as a 2D list of shape (n_samples, n_loci).

Type: List[List[str]]

num_inds

Number of individuals.

Type: int

num_snps

Number of SNPs.

Type: int

logger

Logger instance.

Type: Logger

debug

If True, debug messages are printed.

Type: bool

__init__(filename=None, popmapfile=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)

Initialize the PhylipReader class.

This method sets up the logger and initializes the list of missing values. It also takes a filename and a population map file to read the data. The PHYLIP format is a simple text format for representing multiple sequence alignments. The first line of a PHYLIP file contains the number of samples and the number of loci. Each subsequent line contains the sample ID followed by the sequence data.

For example:

    4 4
    Sample1 ATTA
    Sample2 CGGC
    Sample3 ATTA
    Sample4 CGGC
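
The layout described above can be read with a few lines of plain Python. The following is a minimal sketch, not snpio's implementation: `parse_phylip` is a hypothetical helper, and it assumes sequential (non-interleaved) PHYLIP with whitespace between the sample ID and the sequence. It returns the sample IDs and the genotypes as a (samples x loci) list of lists.

```python
from typing import List, Tuple


def parse_phylip(text: str) -> Tuple[List[str], List[List[str]]]:
    """Parse sequential PHYLIP text (hypothetical helper, not part of snpio)."""
    # Drop blank lines so trailing newlines do not count as records.
    lines = [ln for ln in text.splitlines() if ln.strip()]

    # Header line: number of samples, then number of loci.
    n_samples, n_loci = (int(tok) for tok in lines[0].split())

    samples: List[str] = []
    matrix: List[List[str]] = []
    for line in lines[1:]:
        # First token is the sample ID; the remainder is sequence data.
        sample_id, seq = line.split(maxsplit=1)
        seq = "".join(seq.split())  # tolerate spaces inside the sequence
        if len(seq) != n_loci:
            raise ValueError(
                f"{sample_id}: expected {n_loci} loci, found {len(seq)}"
            )
        samples.append(sample_id)
        matrix.append(list(seq))

    if len(samples) != n_samples:
        raise ValueError(
            f"header declares {n_samples} samples, found {len(samples)}"
        )
    return samples, matrix


example = """4 4
Sample1 ATTA
Sample2 CGGC
Sample3 ATTA
Sample4 CGGC
"""

samples, matrix = parse_phylip(example)
print(samples)       # sample IDs in file order
print(len(matrix), len(matrix[0]))  # rows = samples, columns = loci
```

Validating both header counts against the parsed records catches truncated files early, which is usually cheaper than debugging a shape mismatch later.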

Parameters:

filename (str | None) – Name of the PHYLIP file. Defaults to None.
popmapfile (str | None) – Name of the population map file. Defaults to None.
force_popmap (bool) – If True, the population map file is required. Defaults to False.
exclude_pops (List[str] | None) – List of populations to exclude. Defaults to None.
include_pops (List[str] | None) – List of populations to include. Defaults to None.
plot_format (str | None) – Format for saving plots. Defaults to ‘png’.
prefix (str) – Prefix for output files. Defaults to ‘snpio’.
verbose (bool) – If True, status updates are printed. Defaults to False.
debug (bool) – If True, debug messages are printed. Defaults to False.
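
How `include_pops` and `exclude_pops` interact is not spelled out in this reference. As an illustration only, the sketch below assumes that `include_pops` keeps just the listed populations, that `exclude_pops` drops the listed populations, and that naming the same population in both is an error. `filter_popmap` is a hypothetical helper and does not reproduce `subset_with_popmap`.

```python
from typing import Dict, List, Optional


def filter_popmap(
    popmap: Dict[str, str],
    include_pops: Optional[List[str]] = None,
    exclude_pops: Optional[List[str]] = None,
) -> Dict[str, str]:
    """Return the sample -> population entries that survive filtering.

    Hypothetical helper; the semantics are assumptions, not snpio's rules.
    """
    overlap = set(include_pops or []) & set(exclude_pops or [])
    if overlap:
        raise ValueError(f"populations both included and excluded: {sorted(overlap)}")

    kept: Dict[str, str] = {}
    for sample, pop in popmap.items():
        if include_pops is not None and pop not in include_pops:
            continue  # not on the allow-list
        if exclude_pops is not None and pop in exclude_pops:
            continue  # explicitly dropped
        kept[sample] = pop
    return kept


popmap = {"Sample1": "Pop1", "Sample2": "Pop1", "Sample3": "Pop2", "Sample4": "Pop2"}

only_pop1 = filter_popmap(popmap, exclude_pops=["Pop2"])
print(only_pop1)  # {'Sample1': 'Pop1', 'Sample2': 'Pop1'}
```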


Methods

`__init__`([filename, popmapfile, ...])	Initialize the PhylipReader class.
`bgzip_file`(filepath)	BGZips a VCF file using pysam's BGZFile, preserving parent directories.
`build_vcf_header`()	Dynamically builds the VCF header using the stored header object.
`calc_missing`(df, *[, use_pops])	Compute missing-value statistics with proper locus + sample names.
`copy`()	Create a deep copy of the GenotypeData or subclass object.
`encode_to_vcf_format`(snp_data, ref_alleles, ...)	Vectorized encoding of IUPAC codes into VCF GT strings.
`get_population_indices`()	Create a mapping from population IDs to sample indices.
`get_ref_alt_alleles`(data)	Determine ref/alt alleles for each locus from IUPAC-encoded genotypes.
`get_reverse_iupac_mapping`()	Creates a reverse mapping from IUPAC codes -> allele tuples.
`load_aln`()	Load the PHYLIP file and populate SNP data, samples, and alleles.
`missingness_reports`([prefix, zoom, ...])	Generate missingness reports and plots.
`read_popmap`()	Read population map from file to map samples to populations.
`refs_alts_from_snp_data`(snp_matrix)	Determine REF/ALT per locus from a (samples x loci) IUPAC matrix.
`replace_alleles`(row, ref, alts)	Replace the alleles in the VCF row with the corresponding VCF genotype codes.
`set_alignment`(snp_data, samples, ...[, ...])	Set the alignment data and sample IDs after filtering.
`subset_with_popmap`(my_popmap, samples, force)	Subset popmap and samples based on population criteria.
`tabix_index`(filepath)	Creates a Tabix index for a bgzipped VCF file.
`update_vcf_attributes`(snp_data, ...)	Update VCF attributes after genotype data changes.
`write_genepop`(output_file[, genotype_data, ...])	Write the SNP data in GenePop format.
`write_phylip`(output_file[, genotype_data, ...])	Write the stored alignment as a PHYLIP file.
`write_popmap`(filename)	Write the population map to a file.
`write_structure`(output_file[, onerow, ...])	Write the stored alignment as a STRUCTURE file.
`write_vcf`(output_filename[, hdf5_file_path, ...])	Writes the GenotypeData object data to a VCF file in chunks.
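
Several of the methods above work on IUPAC-encoded genotypes, where a single character stands for a diploid call (for example, `R` is an A/G heterozygote). The sketch below illustrates the general idea behind a reverse IUPAC mapping and a per-locus REF/ALT call. It is not the library's code: `IUPAC_TO_ALLELES` and `ref_alt_per_locus` are hypothetical names, and choosing the most frequent allele as REF is an assumption made for the example.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Standard IUPAC nucleotide codes for homozygous and two-allele heterozygous calls.
IUPAC_TO_ALLELES: Dict[str, Tuple[str, str]] = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}


def ref_alt_per_locus(
    matrix: List[List[str]],
) -> Tuple[List[Optional[str]], List[List[str]]]:
    """Call REF/ALT per locus from a (samples x loci) IUPAC matrix.

    Assumption: REF is the most frequent allele and the ALTs follow in
    descending frequency. Codes not in the table (e.g. 'N') are ignored.
    """
    refs: List[Optional[str]] = []
    alts: List[List[str]] = []
    for column in zip(*matrix):  # iterate over loci
        counts: Counter = Counter()
        for code in column:
            counts.update(IUPAC_TO_ALLELES.get(code, ()))
        ranked = [allele for allele, _ in counts.most_common()]
        refs.append(ranked[0] if ranked else None)
        alts.append(ranked[1:])
    return refs, alts


# Toy data (not from the documentation): 4 samples x 2 loci.
toy = [
    ["A", "C"],
    ["A", "Y"],  # Y = C/T heterozygote
    ["R", "C"],  # R = A/G heterozygote
    ["N", "N"],  # missing calls are skipped
]
refs, alts = ref_alt_per_locus(toy)
print(refs)  # ['A', 'C']
print(alts)  # [['G'], ['T']]
```

Ties between equally frequent alleles are resolved here by first appearance, which is arbitrary; a real implementation would need an explicit rule.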

Attributes

`alt`	Get list of alternate alleles of length num_snps.
`biallelic_mask`	[n_loci] True where locus appears biallelic (A/C/G/T plus heterozygotes of those two).
`has_multiallelic`	True if any locus shows >2 unambiguous nucleotides.
`has_popmap`	True if population information is present.
`het_mask`	[n_samples, n_loci] True if genotype is heterozygous (IUPAC ambiguity codes).
`inputs`	Get GenotypeData keyword arguments as a dictionary.
`is_empty`	True if there are zero samples or loci.
`is_missing_locus`	[n_loci] True if an entire locus is missing across all samples.
`loci_indices`	Boolean array for retained loci in filtered alignment.
`locus_names`	Concrete locus names, generating defaults if absent.
`missing_mask`	Boolean mask [n_samples, n_loci] where True indicates a missing genotype.
`missing_rate`	Overall missing proportion in the alignment.
`nbytes`	Approximate RAM footprint of snp_data (bytes).
`num_inds`	Number of individuals (samples) in dataset.
`num_pops`	Number of populations in the dataset.
`num_snps`	Number of SNPs (loci) in the dataset.
`observed_iupac_per_locus`	Observed IUPAC codes per locus (excluding missing).
`output_dir`	Root output directory for this dataset.
`per_individual_het_rate`	Heterozygote proportion per individual (ignores missing).
`per_individual_missing`	Missing proportion per sample as a pandas Series indexed by sample name.
`per_locus_het_rate`	Heterozygote proportion per locus (ignores missing).
`per_locus_missing`	Missing proportion per locus as a pandas Series; uses marker names if present.
`plot_kwargs`	Backwards compatibility; convert PlotConfig to the old dict shape.
`plots_dir`	Standardized location for plots (pre/post-filtering aware).
`pop_sizes`	Population -> sample count.
`pop_to_indices`	Population -> list of sample indices (built from current popmap_inverse).
`popmap`	Dictionary mapping sample IDs to population IDs.
`popmap_inverse`	Dictionary mapping population IDs to lists of sample IDs.
`populations`	List of populations in the dataset.
`ref`	Get list of reference alleles of length num_snps.
`reports_dir`	Standardized location for reports (pre/post-filtering aware).
`sample_index_map`	Map sample ID -> row index (useful for subsetting).
`sample_indices`	Boolean array for retained samples in filtered alignment.
`samples`	List of sample IDs in the dataset.
`shape`	Tuple of (n_samples, n_loci) for the SNP data.
`snp_data`	Get the genotypes as a 2D list of shape (n_samples, n_loci).
`snpsdict`	Dictionary with Sample IDs as keys and lists of genotypes as values.
`valid_mask`	Boolean mask [n_samples, n_loci] where True = non-missing genotype.
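
The mask and rate attributes above are all derived from the same (n_samples, n_loci) genotype matrix. The sketch below shows how such quantities can be computed in plain Python. It is illustrative only: the function names mirror the attribute names but are stand-alone helpers, the set of missing-data characters is an assumption, and returning 0.0 for a locus with no called genotypes is a choice made for the example.

```python
from typing import List

# Assumption: characters treated as missing genotypes.
MISSING_CODES = {"N", "-", "?", "."}
# IUPAC ambiguity codes for two-allele heterozygotes.
HET_CODES = set("RYSWKM")


def missing_mask(matrix: List[List[str]]) -> List[List[bool]]:
    """True where a genotype is missing; same shape as the input."""
    return [[g in MISSING_CODES for g in row] for row in matrix]


def missing_rate(matrix: List[List[str]]) -> float:
    """Overall proportion of missing genotypes."""
    flat = [cell for row in missing_mask(matrix) for cell in row]
    return sum(flat) / len(flat) if flat else 0.0


def is_missing_locus(matrix: List[List[str]]) -> List[bool]:
    """True for loci that are missing in every sample."""
    return [all(g in MISSING_CODES for g in col) for col in zip(*matrix)]


def per_locus_het_rate(matrix: List[List[str]]) -> List[float]:
    """Heterozygote proportion per locus, ignoring missing genotypes."""
    rates = []
    for col in zip(*matrix):
        called = [g for g in col if g not in MISSING_CODES]
        hets = sum(g in HET_CODES for g in called)
        rates.append(hets / len(called) if called else 0.0)
    return rates


# Toy data (not from the documentation): 3 samples x 3 loci.
toy = [list("ANT"), list("RNT"), list("A-Y")]

print(missing_rate(toy))        # 3 of 9 genotypes are missing
print(is_missing_locus(toy))    # [False, True, False]
print(per_locus_het_rate(toy))  # one heterozygote at loci 0 and 2
```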