Frequently Asked Questions

Installation

Which install method should I use?

For most users, install from Bioconda: mamba install -c conda-forge -c bioconda wasp2. This installs WASP2 and all dependencies (samtools, bcftools, bedtools) in one step.

Use pip install wasp2 if you are on a system without conda or want a specific Python environment. You will need to install samtools, bcftools, and bedtools separately.

Use Docker or Singularity on HPC clusters or when you need full reproducibility.

What Python versions are supported?

Python 3.10, 3.11, 3.12, and 3.13. Pre-built wheels are available for all four on Linux (x86_64, aarch64) and macOS (Intel, Apple Silicon).

Why do I get an error about missing samtools/bcftools/bedtools?

The PyPI wheel bundles the Rust extension and htslib but not the system binaries. Install them via conda (mamba install -c bioconda samtools bcftools bedtools) or your system package manager.
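
To see which of the three binaries is actually missing before reinstalling anything, a small check along these lines can help. This is only a sketch using the Python standard library; the function name missing_tools is made up for illustration and is not part of WASP2.

```python
import shutil

# The three external tools this FAQ says the PyPI wheel does not bundle.
REQUIRED_TOOLS = ("samtools", "bcftools", "bedtools")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

Run it inside the same environment you use for WASP2, since PATH differs between conda environments and the system shell.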

Input Data

Do I need phased genotypes?

Yes. WASP2 assigns reads to haplotypes using phased heterozygous variants. Without phase information, WASP2 cannot tell which haplotype a read came from. Use WhatsHap, SHAPEIT4, or Eagle2 to phase your VCF before running WASP2.

What variant file formats does WASP2 support?

  • VCF or BCF (bgzip-compressed + tabix-indexed: .vcf.gz + .tbi)

  • PLINK2 PGEN format (.pgen + .pvar + .psam)

Multi-sample VCFs are supported; use -s SAMPLE_ID to specify the target sample.

Can I use an unphased VCF?

The counting step (wasp2-count) will still run, but the allele assignments will be arbitrary, so the statistical results will have reduced power and more false positives. Always use phased genotypes when possible.

My BAM doesn’t have read groups. Will WASP2 work?

Yes, for counting: read groups are not required for allele counting. For the remapping step (wasp2-map), the sample ID is needed to look up variants in a multi-sample VCF, so pass it explicitly with -s SAMPLE_ID.

Running WASP2

How long does each step take?

Typical runtimes on a single core for a 30× whole-genome BAM (~100M reads):

  • wasp2-map make-reads: 2–4 hours

  • Re-alignment (external): depends on aligner

  • wasp2-map filter-remapped: 1–2 hours

  • wasp2-count count-variants: 30–60 minutes

  • wasp2-analyze find-imbalance: < 5 minutes

Use the Nextflow pipelines for automatic parallelization across chromosomes/samples.

Can I run WASP2 on multiple samples at once?

Yes, but not within a single command. The WASP2 CLI processes one sample at a time. To handle several samples, run them in parallel with a job scheduler (SLURM, PBS), or use the Nextflow pipelines, which handle parallelization automatically.

What is the --region flag for?

Restrict counting to a specific genomic region (e.g., chr1:1000000-2000000). Useful for testing on a subset of data or for chromosome-level parallelization.
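
For parallelization at a finer grain than whole chromosomes, the region strings can be generated programmatically. The sketch below assumes 1-based, inclusive coordinates in the same chrom:start-end form as the example above; the function name region_strings and the window size are illustrative choices, not WASP2 settings.

```python
def region_strings(chrom, length, window):
    """Split one chromosome into consecutive chrom:start-end strings.

    Coordinates are assumed to be 1-based and inclusive.
    """
    if length < 1 or window < 1:
        raise ValueError("length and window must be positive")
    regions = []
    for start in range(1, length + 1, window):
        # The last window is clipped to the chromosome length.
        end = min(start + window - 1, length)
        regions.append(f"{chrom}:{start}-{end}")
    return regions


if __name__ == "__main__":
    for region in region_strings("chr1", 2_500_000, 1_000_000):
        print(region)
```

Each string can then be passed to a separate job via --region. Chromosome lengths are not supplied by this sketch; they would typically come from the reference index or the BAM header.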

Single-Cell

What single-cell chemistries are supported?

All 10x Genomics Chromium chemistries (scRNA v1/v2/v3, scATAC v1/v2) and any other protocol with a cell barcode tag in the BAM (CB tag by default). See Single-Cell Analysis for barcode format details.

Do I need Cell Ranger output?

No, but it is the most common input. WASP2 needs:

  • A BAM with cell barcodes in a BAM tag (default: CB)

  • A whitelist of valid barcodes (optional but recommended)

  • A phased VCF

Any aligner that produces CB-tagged BAMs will work (STARsolo, Alevin-fry, etc.).

How do I get per-cell-type results?

Run WASP2 on the full BAM to get per-cell allele counts, then use the output with your cell type annotations in Python (AnnData/Scanpy) to aggregate by cell type. See Single-Cell Workflow (scRNA-seq / scATAC-seq) for an example.
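
The aggregation step itself is a group-and-sum over barcodes. The sketch below shows the idea with plain Python data structures. The input layout (tuples of barcode, variant ID, ref count, alt count) is a hypothetical stand-in, not the actual WASP2 output format, and in practice the same operation would usually be done on an AnnData object.

```python
from collections import defaultdict


def aggregate_by_cell_type(per_cell_counts, cell_types):
    """Sum ref/alt counts per (cell type, variant).

    per_cell_counts: iterable of (barcode, variant_id, ref_count, alt_count)
                     (hypothetical layout, for illustration only)
    cell_types:      dict mapping barcode -> cell type label
    Barcodes without an annotation are skipped.
    """
    totals = defaultdict(lambda: [0, 0])
    for barcode, variant_id, ref_count, alt_count in per_cell_counts:
        cell_type = cell_types.get(barcode)
        if cell_type is None:
            continue
        totals[(cell_type, variant_id)][0] += ref_count
        totals[(cell_type, variant_id)][1] += alt_count
    return {key: tuple(value) for key, value in totals.items()}


if __name__ == "__main__":
    counts = [
        ("AAAC-1", "chr1:1000", 3, 1),
        ("AAAG-1", "chr1:1000", 2, 2),
        ("AAAT-1", "chr1:1000", 0, 4),
    ]
    labels = {"AAAC-1": "T cell", "AAAG-1": "T cell", "AAAT-1": "B cell"}
    for key, (ref, alt) in sorted(aggregate_by_cell_type(counts, labels).items()):
        print(key, ref, alt)
```

Pooling counts within a cell type before testing gives each variant more reads per test than any single cell provides.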

Output and Results

What does the p-value in the output represent?

The p-value comes from a likelihood ratio test comparing the beta-binomial model under allelic imbalance vs. the null model of balanced expression. The test is calibrated for the overdispersion typical of RNA-seq count data.
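
To make the idea concrete, here is a simplified single-site likelihood ratio test. It is not WASP2's implementation: the mean/dispersion parameterization, the fixed dispersion value rho, and the grid search for the maximum-likelihood allelic ratio are all assumptions made to keep the sketch dependency-free.

```python
import math


def _log_beta(x, y):
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def betabinom_loglik(k, n, mu, rho):
    """Log-likelihood of k alt reads out of n under a beta-binomial
    with mean mu and overdispersion rho (0 < rho < 1)."""
    scale = (1.0 - rho) / rho
    a, b = mu * scale, (1.0 - mu) * scale
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + _log_beta(k + a, n - k + b) - _log_beta(a, b)


def lrt_pvalue(k, n, rho=0.05, grid=1000):
    """P-value for H0: mu = 0.5 versus H1: mu free, with rho held fixed."""
    null_ll = betabinom_loglik(k, n, 0.5, rho)
    # Crude maximum-likelihood estimate of mu by grid search over (0, 1).
    alt_ll = max(betabinom_loglik(k, n, i / grid, rho) for i in range(1, grid))
    stat = max(0.0, 2.0 * (alt_ll - null_ll))
    # Chi-square survival function with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


if __name__ == "__main__":
    print("balanced   (10 of 20):", lrt_pvalue(10, 20))
    print("imbalanced (45 of 50):", lrt_pvalue(45, 50))
```

A larger rho widens the null distribution, so the same read counts give a less extreme p-value. This is the sense in which a beta-binomial test accounts for overdispersion where a plain binomial test would not.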

What FDR threshold should I use?

The standard threshold is FDR < 0.05. For discovery analyses you may want FDR < 0.1. For validation or follow-up experiments, consider FDR < 0.01. See Analysis Module for the BH procedure and the NaN-propagation warning.
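
For reference, the Benjamini-Hochberg adjustment itself is short. The sketch below is a generic textbook version, not the code in the Analysis Module, and it assumes the input contains no NaN values (filter those out first, since NaN does not sort reliably).

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"p={p:.3f}  q={q:.3f}  significant at 0.05: {q < 0.05}")
```

A site is called significant at a given FDR threshold when its adjusted value falls below that threshold.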

My output has very few significant sites. What’s wrong?

Common causes:

  • Low coverage at heterozygous sites (increase --min_count)

  • Too few heterozygous variants in the VCF

  • VCF and BAM use different chromosome naming conventions (chr1 vs 1)

  • VCF is not phased

My output has too many significant sites (inflated FDR).

This typically means mapping bias is driving the signal. Run the WASP remapping step (wasp2-map) before counting. See Mapping Module.

For ATAC-seq, do I need to use WASP-remapped BAMs?

Yes. WASP2 counting applies only the unmapped filter (see WASP Mapping Bias Correction “Canonical Filter Contract”); it does not correct reference mapping bias on its own. You must first run wasp2-map make-reads, then re-align, then run wasp2-map filter-remapped, and pass the resulting *_wasp_filt_rmdup.bam to wasp2-count. Counting on raw BWA output leaves reference bias uncorrected, so reads carrying the alt allele are systematically under-represented.

The same requirement applies to RNA-seq and scATAC-seq. The only difference is the aligner used in the re-alignment step (STAR for RNA, BWA for ATAC).

Troubleshooting

I get “chromosome not found” errors.

VCF and BAM must use the same chromosome naming convention. If your VCF uses chr1 and your BAM uses 1 (or vice versa), use bcftools annotate --rename-chrs to harmonize the VCF.
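
The --rename-chrs option takes a mapping file with one "old_name new_name" pair per line. A minimal sketch for generating those lines is shown below; the function name and the simple add/strip-prefix rule are illustrative assumptions, and names that differ by more than the prefix (for example MT versus chrM) need to be mapped by hand.

```python
def rename_map_lines(contigs, add_prefix=True, prefix="chr"):
    """Build 'old new' lines for a chromosome rename map.

    add_prefix=True  maps 1 -> chr1; add_prefix=False maps chr1 -> 1.
    Names that already match the target convention map to themselves.
    """
    lines = []
    for name in contigs:
        if add_prefix:
            new = name if name.startswith(prefix) else prefix + name
        else:
            new = name[len(prefix):] if name.startswith(prefix) else name
        lines.append(f"{name} {new}")
    return lines


if __name__ == "__main__":
    contigs = [str(i) for i in range(1, 23)] + ["X", "Y"]
    with open("chr_map.txt", "w") as handle:
        handle.write("\n".join(rename_map_lines(contigs)) + "\n")
```

The resulting file can then be used as bcftools annotate --rename-chrs chr_map.txt -Oz -o renamed.vcf.gz variants.vcf.gz, followed by re-indexing the output with tabix -p vcf renamed.vcf.gz.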

The Rust extension fails to load.

This happens if the wheel was built for a different platform or Python version. Try reinstalling: pip install --force-reinstall wasp2. If building from source, run pixi run verify to rebuild.

WASP2 runs but produces an empty counts file.

Check that:

  • The BAM is coordinate-sorted and indexed (.bai file present)

  • The VCF overlaps the regions in your BAM

  • The sample name passed with -s matches a sample in the VCF

Use bcftools query -l variants.vcf.gz to list VCF sample names.
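
If bcftools is not at hand, the same sample list can be read directly from the VCF header, since bgzip-compressed files are readable with Python's gzip module. The sketch below assumes a standard VCF header in which sample names follow the nine fixed columns on the #CHROM line; the function name is made up for illustration.

```python
import gzip


def vcf_sample_names(path):
    """Return sample names from the #CHROM header line of a .vcf.gz file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Columns 1-9 are fixed (CHROM through FORMAT); samples follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                # Reached the records without finding a header line.
                break
    return []


def sample_in_vcf(path, sample_id):
    """True if sample_id matches a sample name in the VCF exactly."""
    return sample_id in vcf_sample_names(path)
```

The comparison is exact and case-sensitive, so a trailing suffix or a different capitalization in the value passed with -s counts as a mismatch here.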