snpio.PhylipReader

class snpio.PhylipReader(filename=None, popmapfile=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)

Class to read and write PHYLIP files.

This class inherits from the GenotypeData class and provides methods to read and write PHYLIP files. The PHYLIP format is a simple text format for representing multiple sequence alignments. The first line of a PHYLIP file contains the number of samples and the number of loci. Each subsequent line contains the sample ID followed by the sequence data.
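
To make the layout concrete, here is a minimal sketch that builds PHYLIP text from sample IDs and sequences in plain Python. The function name `format_phylip` is hypothetical and is not part of snpio. The sketch assumes one sample per line, with the ID and sequence separated by a single space, and it does not reproduce what `write_phylip` does internally.

```python
from typing import List, Sequence


def format_phylip(samples: Sequence[str], matrix: Sequence[Sequence[str]]) -> str:
    """Return PHYLIP-formatted text (hypothetical helper, not part of snpio)."""
    if len(samples) != len(matrix):
        raise ValueError("samples and matrix must have the same number of rows")

    # Every sample must have the same number of loci.
    n_loci = len(matrix[0]) if matrix else 0
    if any(len(row) != n_loci for row in matrix):
        raise ValueError("all rows must contain the same number of loci")

    # Header line: number of samples, then number of loci.
    lines: List[str] = [f"{len(samples)} {n_loci}"]

    # One line per sample: ID, a space, then the concatenated genotypes.
    for sample_id, row in zip(samples, matrix):
        lines.append(f"{sample_id} {''.join(row)}")

    return "\n".join(lines) + "\n"


phylip_text = format_phylip(["S1", "S2"], [list("ACG"), list("TCG")])
```

The header counts are derived from the data rather than passed in, so they cannot disagree with the rows that follow.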

Example

>>> from snpio import PhylipReader
>>>
>>> genotype_data = PhylipReader(filename="example.phy", popmapfile="example.popmap", verbose=True)
>>>
>>> genotype_data.snp_data
array([["A", "C", "A"], ["T", "G", "T"], ["T", "G", "T"], ["A", "C", "A"]], dtype="<U1")
>>>
>>> genotype_data.samples
["Sample1", "Sample2", "Sample3", "Sample4"]
>>>
>>> genotype_data.populations
["Pop1", "Pop1", "Pop2", "Pop2"]
>>>
>>> genotype_data.num_snps
3
>>>
>>> genotype_data.num_inds
4
>>>
>>> genotype_data.popmap
{"Sample1": "Pop1", "Sample2": "Pop1", "Sample3": "Pop2", "Sample4": "Pop2"}
>>>
>>> genotype_data.popmap_inverse
{"Pop1": ["Sample1", "Sample2"], "Pop2": ["Sample3", "Sample4"]}
>>>
>>> genotype_data.ref
["A", "C", "A"]
>>>
>>> genotype_data.alt
["T", "G", "T"]
>>>
>>> genotype_data.missingness_reports()
>>>
>>> genotype_data.run_pca()
>>>
>>> genotype_data.write_phylip("output.phy")

filename

Name of the PHYLIP file.

Type:

str

popmapfile

Name of the population map file.

Type:

str

force_popmap

If True, the population map file is required.

Type:

bool

exclude_pops

List of populations to exclude.

Type:

List[str]

include_pops

List of populations to include.

Type:

List[str]

plot_format

Format for saving plots. Default is ‘png’.

Type:

str

prefix

Prefix for output files.

Type:

str

verbose

If True, status updates are printed.

Type:

bool

samples

List of sample IDs.

Type:

List[str]

snp_data

SNP genotypes as a 2D list of shape (n_samples, n_loci).

Type:

List[List[str]]

num_inds

Number of individuals.

Type:

int

num_snps

Number of SNPs.

Type:

int

logger

Logger instance.

Type:

Logger

debug

If True, debug messages are printed.

Type:

bool

__init__(filename=None, popmapfile=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)

Initialize the PhylipReader class.

This method sets up the logger, initializes the list of missing values, and takes a PHYLIP filename and an optional population map file from which to read the data. The PHYLIP format is a simple text format for representing multiple sequence alignments. The first line of a PHYLIP file contains the number of samples and the number of loci. Each subsequent line contains the sample ID followed by the sequence data.

For example:

    4 4
    Sample1 ATTA
    Sample2 CGGC
    Sample3 ATTA
    Sample4 CGGC
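
The example file above can be parsed with a few lines of plain Python. This is only a sketch of the format, not the parser that PhylipReader uses. The name `parse_phylip_text` is hypothetical, and the code assumes the sequential layout shown here, with one sample per line and whitespace between the ID and the sequence.

```python
from typing import List, Tuple


def parse_phylip_text(text: str) -> Tuple[List[str], List[List[str]]]:
    """Parse sequential PHYLIP text (hypothetical helper, not part of snpio).

    Returns the sample IDs and a (samples x loci) matrix of characters.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty PHYLIP input")

    # Header line: number of samples, then number of loci.
    n_samples, n_loci = (int(tok) for tok in lines[0].split()[:2])

    samples: List[str] = []
    matrix: List[List[str]] = []
    for ln in lines[1:]:
        # Split off the sample ID; the remainder is the sequence.
        sample_id, seq = ln.split(maxsplit=1)
        seq = seq.replace(" ", "")
        if len(seq) != n_loci:
            raise ValueError(
                f"{sample_id}: expected {n_loci} sites, found {len(seq)}"
            )
        samples.append(sample_id)
        matrix.append(list(seq))

    if len(samples) != n_samples:
        raise ValueError(
            f"header declares {n_samples} samples, found {len(samples)}"
        )
    return samples, matrix


example = """4 4
Sample1 ATTA
Sample2 CGGC
Sample3 ATTA
Sample4 CGGC
"""
samples, matrix = parse_phylip_text(example)
```

Checking the parsed rows against the header counts catches truncated files early, before any downstream analysis runs.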

Parameters:
  • filename (str | None) – Name of the PHYLIP file. Defaults to None.

  • popmapfile (str | None) – Name of the population map file. Defaults to None.

  • force_popmap (bool) – If True, the population map file is required. Defaults to False.

  • exclude_pops (List[str] | None) – List of populations to exclude. Defaults to None.

  • include_pops (List[str] | None) – List of populations to include. Defaults to None.

  • plot_format (str | None) – Format for saving plots. Defaults to ‘png’.

  • prefix (str) – Prefix for output files. Defaults to ‘snpio’.

  • verbose (bool) – If True, status updates are printed. Defaults to False.

  • debug (bool) – If True, debug messages are printed. Defaults to False.
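
As a rough illustration of how `include_pops` and `exclude_pops` could narrow a population map, the sketch below filters a sample-to-population dictionary and then groups the remaining samples by population. The helper names `filter_popmap` and `invert_popmap` are hypothetical. The rule that exclusion takes precedence over inclusion is an assumption made for this sketch, not documented snpio behavior.

```python
from typing import Dict, Iterable, List, Optional


def filter_popmap(
    popmap: Dict[str, str],
    include_pops: Optional[Iterable[str]] = None,
    exclude_pops: Optional[Iterable[str]] = None,
) -> Dict[str, str]:
    """Keep samples whose population passes both filters (hypothetical helper)."""
    include = None if include_pops is None else set(include_pops)
    exclude = set(exclude_pops or ())
    return {
        sample: pop
        for sample, pop in popmap.items()
        # Assumption: a population listed in both filters is excluded.
        if (include is None or pop in include) and pop not in exclude
    }


def invert_popmap(popmap: Dict[str, str]) -> Dict[str, List[str]]:
    """Group sample IDs by population, preserving input order."""
    inverse: Dict[str, List[str]] = {}
    for sample, pop in popmap.items():
        inverse.setdefault(pop, []).append(sample)
    return inverse


popmap = {
    "Sample1": "Pop1",
    "Sample2": "Pop1",
    "Sample3": "Pop2",
    "Sample4": "Pop2",
}
kept = filter_popmap(popmap, exclude_pops=["Pop2"])
grouped = invert_popmap(popmap)
```

With the four-sample map from the class example, excluding `Pop2` leaves only the two `Pop1` samples, and the grouped result has the same shape as the `popmap_inverse` attribute shown there.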

Methods

__init__([filename, popmapfile, ...])

Initialize the PhylipReader class.

bgzip_file(filepath)

BGZips a VCF file using pysam's BGZFile, preserving parent directories.

build_vcf_header()

Dynamically builds the VCF header using the stored header object.

calc_missing(df, *[, use_pops])

Compute missing-value statistics with proper locus + sample names.

copy()

Create a deep copy of the GenotypeData or subclass object.

encode_to_vcf_format(snp_data, ref_alleles, ...)

Vectorized encoding of IUPAC codes into VCF GT strings.

get_population_indices()

Create a mapping from population IDs to sample indices.

get_ref_alt_alleles(data)

Determine ref/alt alleles for each locus from IUPAC-encoded genotypes.

get_reverse_iupac_mapping()

Creates a reverse mapping from IUPAC codes -> allele tuples.

load_aln()

Load the PHYLIP file and populate SNP data, samples, and alleles.

missingness_reports([prefix, zoom, ...])

Generate missingness reports and plots.

read_popmap()

Read population map from file to map samples to populations.

refs_alts_from_snp_data(snp_matrix)

Determine REF/ALT per locus from a (samples x loci) IUPAC matrix.

replace_alleles(row, ref, alts)

Replace the alleles in the VCF row with the corresponding VCF genotype codes.

set_alignment(snp_data, samples, ...[, ...])

Set the alignment data and sample IDs after filtering.

subset_with_popmap(my_popmap, samples, force)

Subset popmap and samples based on population criteria.

tabix_index(filepath)

Creates a Tabix index for a bgzipped VCF file.

update_vcf_attributes(snp_data, ...)

Update VCF attributes after genotype data changes.

write_genepop(output_file[, genotype_data, ...])

Write the SNP data in GenePop format.

write_phylip(output_file[, genotype_data, ...])

Write the stored alignment as a PHYLIP file.

write_popmap(filename)

Write the population map to a file.

write_structure(output_file[, onerow, ...])

Write the stored alignment as a STRUCTURE file.

write_vcf(output_filename[, hdf5_file_path, ...])

Writes the GenotypeData object data to a VCF file in chunks.
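
Several of the methods above deal with IUPAC-encoded genotypes: mapping codes back to allele pairs, choosing REF/ALT alleles per locus, and encoding genotypes as VCF GT strings. The sketch below shows one way these steps can fit together in plain Python. It is not snpio's implementation. The function names are hypothetical, and the rule used here (most frequent allele becomes REF, ties broken alphabetically) is an assumption made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

# Standard IUPAC codes for single bases and two-base ambiguities.
IUPAC_TO_ALLELES: Dict[str, Tuple[str, str]] = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}


def ref_alt_for_locus(column: Sequence[str]) -> Tuple[Optional[str], List[str]]:
    """Pick REF and ALT alleles for one locus (hypothetical helper)."""
    counts: Counter = Counter()
    for code in column:
        alleles = IUPAC_TO_ALLELES.get(code)
        if alleles is not None:  # anything else is treated as missing
            counts.update(alleles)
    if not counts:
        return None, []
    # Assumption: most frequent allele is REF; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda allele: (-counts[allele], allele))
    return ranked[0], ranked[1:]


def to_vcf_gt(code: str, ref: Optional[str], alts: Sequence[str]) -> str:
    """Encode one IUPAC code as an unphased VCF GT string."""
    alleles = IUPAC_TO_ALLELES.get(code)
    if alleles is None or ref is None:
        return "./."
    # REF is allele 0; ALT alleles are numbered from 1 in the order given.
    index = {ref: 0}
    index.update({alt: i for i, alt in enumerate(alts, start=1)})
    first, second = sorted(index[allele] for allele in alleles)
    return f"{first}/{second}"


# Rows are samples, columns are loci.
matrix = [list("ACA"), list("TGT"), list("WGN"), list("ACA")]
ref_alt = [ref_alt_for_locus(column) for column in zip(*matrix)]
gts = [
    [to_vcf_gt(code, *ref_alt[j]) for j, code in enumerate(row)]
    for row in matrix
]
```

Here `to_vcf_gt` expects `ref` and `alts` to come from the same locus as `code`; passing alleles from a different locus would raise a `KeyError`.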

Attributes

alt

Get list of alternate alleles of length num_snps.

biallelic_mask

[n_loci] True where locus appears biallelic (A/C/G/T plus heterozygotes of those two).

has_multiallelic

True if any locus shows >2 unambiguous nucleotides.

has_popmap

True if population information is present.

het_mask

[n_samples, n_loci] True if genotype is heterozygous (IUPAC ambiguity codes).

inputs

Get GenotypeData keyword arguments as a dictionary.

is_empty

True if there are zero samples or loci.

is_missing_locus

[n_loci] True if an entire locus is missing across all samples.

loci_indices

Boolean array for retained loci in filtered alignment.

locus_names

Concrete locus names, generating defaults if absent.

missing_mask

Boolean mask [n_samples, n_loci] where True indicates a missing genotype.

missing_rate

Overall missing proportion in the alignment.

nbytes

Approximate RAM footprint of snp_data (bytes).

num_inds

Number of individuals (samples) in dataset.

num_pops

Number of populations in the dataset.

num_snps

Number of SNPs (loci) in the dataset.

observed_iupac_per_locus

Observed IUPAC codes per locus (excluding missing).

output_dir

Root output directory for this dataset.

per_individual_het_rate

Heterozygote proportion per individual (ignores missing).

per_individual_missing

Missing proportion per sample as a pandas Series indexed by sample name.

per_locus_het_rate

Heterozygote proportion per locus (ignores missing).

per_locus_missing

Missing proportion per locus as a pandas Series; uses marker names if present.

plot_kwargs

Backwards compatibility; convert PlotConfig to the old dict shape.

plots_dir

Standardized location for plots (pre/post-filtering aware).

pop_sizes

Population -> sample count.

pop_to_indices

Population -> list of sample indices (built from current popmap_inverse).

popmap

Dictionary mapping sample IDs to population IDs.

popmap_inverse

Dictionary mapping population IDs to lists of sample IDs.

populations

List of populations in the dataset.

ref

Get list of reference alleles of length num_snps.

reports_dir

Standardized location for reports (pre/post-filtering aware).

sample_index_map

Map sample ID -> row index (useful for subsetting).

sample_indices

Boolean array for retained samples in filtered alignment.

samples

List of sample IDs in the dataset.

shape

Tuple of (n_samples, n_loci) for the SNP data.

snp_data

Get the genotypes as a 2D list of shape (n_samples, n_loci).

snpsdict

Dictionary with Sample IDs as keys and lists of genotypes as values.

valid_mask

Boolean mask [n_samples, n_loci] where True = non-missing genotype.
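
The mask and rate attributes above can be pictured with a small plain-Python sketch over a (samples x loci) matrix. This is not how snpio computes them. The function names, the set of missing-data symbols, and the choice to return `None` for a sample with no called genotypes are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence

MISSING_CODES = {"N", "-", "?", "."}          # assumed missing-data symbols
HET_CODES = {"R", "Y", "S", "W", "K", "M"}    # two-base IUPAC ambiguity codes


def missing_mask(matrix: Sequence[Sequence[str]]) -> List[List[bool]]:
    """True where a genotype is missing; same shape as the input."""
    return [[code in MISSING_CODES for code in row] for row in matrix]


def per_locus_missing(matrix: Sequence[Sequence[str]]) -> List[float]:
    """Proportion of missing genotypes in each column (locus)."""
    mask = missing_mask(matrix)
    n_samples = len(mask)
    return [sum(column) / n_samples for column in zip(*mask)]


def per_individual_het_rate(
    matrix: Sequence[Sequence[str]],
) -> List[Optional[float]]:
    """Heterozygote proportion per row, ignoring missing genotypes."""
    rates: List[Optional[float]] = []
    for row in matrix:
        called = [code for code in row if code not in MISSING_CODES]
        if called:
            rates.append(sum(code in HET_CODES for code in called) / len(called))
        else:
            rates.append(None)  # no called genotypes for this sample
    return rates


matrix = [list("AWN"), list("TNN"), list("RTA")]
mask = missing_mask(matrix)
locus_missing = per_locus_missing(matrix)
het_rates = per_individual_het_rate(matrix)
overall_missing = sum(map(sum, mask)) / (len(matrix) * len(matrix[0]))
```

For larger alignments the same logic is usually written with NumPy boolean arrays, which avoids the nested Python loops.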