snpio.PhylipReader
- class snpio.PhylipReader(filename=None, popmapfile=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)
Class to read and write PHYLIP files.
This class inherits from the GenotypeData class and provides methods to read and write PHYLIP files. The PHYLIP format is a simple text format for representing multiple sequence alignments. The first line of a PHYLIP file contains the number of samples and the number of loci. Each subsequent line contains the sample ID followed by the sequence data.
Example
>>> from snpio import PhylipReader
>>>
>>> genotype_data = PhylipReader(filename="example.phy", popmapfile="example.popmap", verbose=True)
>>>
>>> genotype_data.snp_data
array([["A", "T", "T", "A"],
       ["C", "G", "G", "C"],
       ["A", "T", "T", "A"]], dtype="<U1")
>>>
>>> genotype_data.samples
["Sample1", "Sample2", "Sample3", "Sample4"]
>>>
>>> genotype_data.populations
["Pop1", "Pop1", "Pop2", "Pop2"]
>>>
>>> genotype_data.num_snps
3
>>>
>>> genotype_data.num_inds
4
>>>
>>> genotype_data.popmap
{"Sample1": "Pop1", "Sample2": "Pop1", "Sample3": "Pop2", "Sample4": "Pop2"}
>>>
>>> genotype_data.popmap_inverse
{"Pop1": ["Sample1", "Sample2"], "Pop2": ["Sample3", "Sample4"]}
>>>
>>> genotype_data.ref
["A", "C", "A"]
>>>
>>> genotype_data.alt
["T", "G", "T"]
>>>
>>> genotype_data.missingness_reports()
>>>
>>> genotype_data.run_pca()
>>>
>>> genotype_data.write_phylip("output.phy")
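The file layout described above (a header line with the sample and locus counts, then one line per sample) can be illustrated with a small standalone parser. This is only a sketch in plain Python and not how PhylipReader itself reads files; the function name `parse_phylip` is hypothetical, and it assumes the sequential layout with whitespace between the sample ID and the sequence.

```python
from typing import List, Tuple


def parse_phylip(text: str) -> Tuple[List[str], List[List[str]]]:
    """Parse sequential PHYLIP text into sample IDs and a genotype matrix.

    Hypothetical helper for illustration; not part of snpio.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty PHYLIP input")

    # Header line: number of samples, then number of loci.
    n_samples, n_loci = (int(tok) for tok in lines[0].split())

    samples: List[str] = []
    matrix: List[List[str]] = []
    for line in lines[1:]:
        # The sample ID is the first whitespace-delimited token;
        # everything after it is sequence data.
        sample_id, seq = line.split(maxsplit=1)
        seq = "".join(seq.split())  # drop any spacing inside the sequence
        if len(seq) != n_loci:
            raise ValueError(f"{sample_id}: expected {n_loci} loci, got {len(seq)}")
        samples.append(sample_id)
        matrix.append(list(seq))

    if len(samples) != n_samples:
        raise ValueError(f"expected {n_samples} samples, got {len(samples)}")
    return samples, matrix


example = """4 4
Sample1 ATTA
Sample2 CGGC
Sample3 ATTA
Sample4 CGGC
"""
samples, snp_matrix = parse_phylip(example)
print(samples)
print(snp_matrix[0])
```

The header counts are used here to validate the body, so a truncated or misaligned file fails early instead of producing a ragged matrix.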
- filename
Name of the PHYLIP file.
- Type:
str
- popmapfile
Name of the population map file.
- Type:
str
- force_popmap
If True, the population map file is required.
- Type:
bool
- exclude_pops
List of populations to exclude.
- Type:
List[str]
- include_pops
List of populations to include.
- Type:
List[str]
- plot_format
Format for saving plots. Default is ‘png’.
- Type:
str
- prefix
Prefix for output files.
- Type:
str
- verbose
If True, status updates are printed.
- Type:
bool
- samples
List of sample IDs.
- Type:
List[str]
- snp_data
List of SNP data.
- Type:
List[List[str]]
- num_inds
Number of individuals.
- Type:
int
- num_snps
Number of SNPs.
- Type:
int
- logger
Logger instance.
- Type:
Logger
- debug
If True, debug messages are printed.
- Type:
bool
- __init__(filename=None, popmapfile=None, force_popmap=False, exclude_pops=None, include_pops=None, plot_format='png', prefix='snpio', verbose=False, debug=False)
Initialize the PhylipReader class.
This method sets up the logger and initializes the list of missing values. It also takes a filename and a population map file to read the data. The PHYLIP format is a simple text format for representing multiple sequence alignments. The first line of a PHYLIP file contains the number of samples and the number of loci. Each subsequent line contains the sample ID followed by the sequence data.
For example:
    4 4
    Sample1 ATTA
    Sample2 CGGC
    Sample3 ATTA
    Sample4 CGGC
- Parameters:
filename (str | None) – Name of the PHYLIP file. Defaults to None.
popmapfile (str | None) – Name of the population map file. Defaults to None.
force_popmap (bool) – If True, the population map file is required. Defaults to False.
exclude_pops (List[str] | None) – List of populations to exclude. Defaults to None.
include_pops (List[str] | None) – List of populations to include. Defaults to None.
plot_format (str | None) – Format for saving plots. Defaults to ‘png’.
prefix (str) – Prefix for output files. Defaults to ‘snpio’.
verbose (bool) – If True, status updates are printed. Defaults to False.
debug (bool) – If True, debug messages are printed. Defaults to False.
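To make the popmapfile, include_pops, and exclude_pops parameters more concrete, here is a minimal pure-Python sketch of filtering a sample-to-population mapping and inverting it. It is not snpio's implementation: the helper names `filter_popmap` and `invert_popmap` are hypothetical, and applying both filters together (include first, then exclude) is an assumption about precedence.

```python
from typing import Dict, List, Optional


def filter_popmap(
    popmap: Dict[str, str],
    include_pops: Optional[List[str]] = None,
    exclude_pops: Optional[List[str]] = None,
) -> Dict[str, str]:
    """Keep samples whose population passes the include/exclude filters.

    Hypothetical helper; the precedence shown here is an assumption.
    """
    kept: Dict[str, str] = {}
    for sample, pop in popmap.items():
        if include_pops is not None and pop not in include_pops:
            continue  # population was not requested
        if exclude_pops is not None and pop in exclude_pops:
            continue  # population was explicitly excluded
        kept[sample] = pop
    return kept


def invert_popmap(popmap: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each population ID to the list of its sample IDs."""
    inverse: Dict[str, List[str]] = {}
    for sample, pop in popmap.items():
        inverse.setdefault(pop, []).append(sample)
    return inverse


popmap = {"Sample1": "Pop1", "Sample2": "Pop1", "Sample3": "Pop2", "Sample4": "Pop2"}
print(filter_popmap(popmap, exclude_pops=["Pop2"]))
print(invert_popmap(popmap))
```

The inverted mapping has the same shape as the popmap_inverse attribute shown in the class example.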
Methods
__init__([filename, popmapfile, ...]) – Initialize the PhylipReader class.
bgzip_file(filepath) – BGZips a VCF file using pysam's BGZFile, preserving parent directories.
build_vcf_header() – Dynamically builds the VCF header using the stored header object.
calc_missing(df, *[, use_pops]) – Compute missing-value statistics with proper locus + sample names.
copy() – Create a deep copy of the GenotypeData or subclass object.
encode_to_vcf_format(snp_data, ref_alleles, ...) – Vectorized encoding of IUPAC codes into VCF GT strings.
get_population_indices() – Create a mapping from population IDs to sample indices.
get_ref_alt_alleles(data) – Determine ref/alt alleles for each locus from IUPAC-encoded genotypes.
get_reverse_iupac_mapping() – Creates a reverse mapping from IUPAC codes -> allele tuples.
load_aln() – Load the PHYLIP file and populate SNP data, samples, and alleles.
missingness_reports([prefix, zoom, ...]) – Generate missingness reports and plots.
read_popmap() – Read population map from file to map samples to populations.
refs_alts_from_snp_data(snp_matrix) – Determine REF/ALT per locus from a (samples x loci) IUPAC matrix.
replace_alleles(row, ref, alts) – Replace the alleles in the VCF row with the corresponding VCF genotype codes.
set_alignment(snp_data, samples, ...[, ...]) – Set the alignment data and sample IDs after filtering.
subset_with_popmap(my_popmap, samples, force) – Subset popmap and samples based on population criteria.
tabix_index(filepath) – Creates a Tabix index for a bgzipped VCF file.
update_vcf_attributes(snp_data, ...) – Update VCF attributes after genotype data changes.
write_genepop(output_file[, genotype_data, ...]) – Write the SNP data in GenePop format.
write_phylip(output_file[, genotype_data, ...]) – Write the stored alignment as a PHYLIP file.
write_popmap(filename) – Write the population map to a file.
write_structure(output_file[, onerow, ...]) – Write the stored alignment as a STRUCTURE file.
write_vcf(output_filename[, hdf5_file_path, ...]) – Writes the GenotypeData object data to a VCF file in chunks.
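Several of the methods above derive REF/ALT alleles from a (samples x loci) matrix of IUPAC codes. The sketch below shows one plausible way to do that in plain Python: expand each code into its two alleles, count alleles per locus, and take the most frequent as REF. It is an illustration only and not snpio's algorithm; the function name `ref_alt_per_locus`, the set of missing codes, and the alphabetical tie-break are all assumptions.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

# Standard IUPAC codes for homozygous and two-allele heterozygous genotypes.
IUPAC_TO_ALLELES = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
}
# Assumption: these characters denote a missing genotype.
MISSING_CODES = {"N", "-", "?", "."}


def ref_alt_per_locus(
    matrix: Sequence[Sequence[str]],
) -> Tuple[List[Optional[str]], List[List[str]]]:
    """Return (refs, alts) for a samples x loci matrix of IUPAC codes.

    Hypothetical helper: REF is the most frequent allele at a locus and
    ALT holds the remaining alleles by decreasing frequency.
    """
    n_loci = len(matrix[0]) if matrix else 0
    refs: List[Optional[str]] = []
    alts: List[List[str]] = []
    for j in range(n_loci):
        counts: Counter = Counter()
        for row in matrix:
            code = row[j].upper()
            if code in MISSING_CODES:
                continue
            # Each genotype contributes two allele copies.
            counts.update(IUPAC_TO_ALLELES.get(code, ()))
        # Sort by descending count, then alphabetically to break ties.
        ranked = sorted(counts, key=lambda allele: (-counts[allele], allele))
        refs.append(ranked[0] if ranked else None)
        alts.append(ranked[1:])
    return refs, alts


genotypes = [list("ARN"), list("AGC"), list("TGC")]
refs, alts = ref_alt_per_locus(genotypes)
print(refs)
print(alts)
```

A locus with only missing genotypes yields `None` as REF and an empty ALT list in this sketch; the library may handle that case differently.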
Attributes
alt – Get list of alternate alleles of length num_snps.
biallelic_mask – [n_loci] True where locus appears biallelic (A/C/G/T plus heterozygotes of those two).
has_multiallelic – True if any locus shows >2 unambiguous nucleotides.
has_popmap – True if population information is present.
het_mask – [n_samples, n_loci] True if genotype is heterozygous (IUPAC ambiguity codes).
inputs – Get GenotypeData keyword arguments as a dictionary.
is_empty – True if there are zero samples or loci.
is_missing_locus – [n_loci] True if an entire locus is missing across all samples.
loci_indices – Boolean array for retained loci in filtered alignment.
locus_names – Concrete locus names, generating defaults if absent.
missing_mask – Boolean mask [n_samples, n_loci] where True indicates a missing genotype.
missing_rate – Overall missing proportion in the alignment.
nbytes – Approximate RAM footprint of snp_data (bytes).
num_inds – Number of individuals (samples) in dataset.
num_pops – Number of populations in the dataset.
num_snps – Number of SNPs (loci) in the dataset.
observed_iupac_per_locus – Observed IUPAC codes per locus (excluding missing).
output_dir – Root output directory for this dataset.
per_individual_het_rate – Heterozygote proportion per individual (ignores missing).
per_individual_missing – Missing proportion per sample as a pandas Series indexed by sample name.
per_locus_het_rate – Heterozygote proportion per locus (ignores missing).
per_locus_missing – Missing proportion per locus as a pandas Series; uses marker names if present.
plot_kwargs – Backwards compatibility; convert PlotConfig to the old dict shape.
plots_dir – Standardized location for plots (pre/post-filtering aware).
pop_sizes – Population -> sample count.
pop_to_indices – Population -> list of sample indices (built from current popmap_inverse).
popmap – Dictionary mapping sample IDs to population IDs.
popmap_inverse – Dictionary mapping population IDs to lists of sample IDs.
populations – List of populations in the dataset.
ref – Get list of reference alleles of length num_snps.
reports_dir – Standardized location for reports (pre/post-filtering aware).
sample_index_map – Map sample ID -> row index (useful for subsetting).
sample_indices – Boolean array for retained samples in filtered alignment.
samples – List of sample IDs in the dataset.
shape – Tuple of (n_samples, n_loci) for the SNP data.
snp_data – Get the genotypes as a 2D list of shape (n_samples, n_loci).
snpsdict – Dictionary with Sample IDs as keys and lists of genotypes as values.
valid_mask – Boolean mask [n_samples, n_loci] where True = non-missing genotype.
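The mask and rate attributes above describe simple summaries of the (n_samples, n_loci) genotype matrix. As a rough illustration of what they measure, the following plain-Python sketch computes a missing mask, per-locus missing proportions, and per-individual heterozygote rates. It does not reproduce snpio's code or return types (the library reports some of these as NumPy arrays or pandas Series); the function names mirror the attribute names for readability only, and the missing-code set is an assumption.

```python
from typing import List, Sequence

# Assumption: these characters denote a missing genotype.
MISSING_CODES = {"N", "-", "?", "."}
# Two-allele IUPAC ambiguity codes, treated here as heterozygous calls.
HET_CODES = set("RYSWKM")


def missing_mask(matrix: Sequence[Sequence[str]]) -> List[List[bool]]:
    """True where a genotype is missing; same shape as the input."""
    return [[g.upper() in MISSING_CODES for g in row] for row in matrix]


def per_locus_missing(matrix: Sequence[Sequence[str]]) -> List[float]:
    """Proportion of samples with a missing genotype at each locus."""
    mask = missing_mask(matrix)
    n_samples = len(mask)
    # zip(*mask) iterates over columns, i.e. loci.
    return [sum(column) / n_samples for column in zip(*mask)]


def per_individual_het_rate(matrix: Sequence[Sequence[str]]) -> List[float]:
    """Heterozygote proportion per sample, ignoring missing genotypes."""
    rates: List[float] = []
    for row in matrix:
        called = [g.upper() for g in row if g.upper() not in MISSING_CODES]
        if called:
            rates.append(sum(g in HET_CODES for g in called) / len(called))
        else:
            rates.append(float("nan"))  # no called genotypes for this sample
    return rates


genotypes = [list("ARN"), list("NGC"), list("TGC")]
print(per_locus_missing(genotypes))
print(per_individual_het_rate(genotypes))
```

Excluding missing genotypes from the heterozygosity denominator matches the "ignores missing" wording in the attribute descriptions.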