Bulk Workflow (RNA-seq / ATAC-seq)#
End-to-end walkthrough of WASP2 on bulk sequencing data: WASP mapping filter, allele counting at regions, and the beta-binomial LRT for allelic imbalance. The same pipeline works for RNA-seq at genes (GTF) and ATAC-seq at peaks (BED); data-type differences are called out where they matter.
Inputs:
Aligned BAM (coordinate-sorted, indexed)
VCF with the donor’s heterozygous variants (phased for RNA-seq haplotypes)
Region file: GTF/GFF for RNA-seq genes, BED for ATAC-seq peaks
Step 1 — WASP mapping filter#
Reference alleles map more easily than alternate alleles, biasing ref counts upward. The WASP remap-and-filter step removes reads whose mapping position depends on which allele they carry. The test is “would this read map to the same location if we swapped ref↔alt at the SNP?” — only reads that do are kept. After filtering,
under the usual assumption that the aligner’s scoring is deterministic and position-dependent (see WASP Mapping Bias Correction and [vandeGeijn2015]).
# Step 1a: produce allele-swapped FASTQ + BAMs
wasp2-map make-reads sample.bam variants.vcf.gz \
--samples SAMPLE1 \
--out_dir wasp_output
# Step 1b: remap with the SAME aligner and parameters used originally
bwa mem -M -t 8 genome.fa \
wasp_output/sample_swapped_alleles_r1.fq \
wasp_output/sample_swapped_alleles_r2.fq \
| samtools sort -o wasp_output/sample_remapped.bam -
samtools index wasp_output/sample_remapped.bam
# Step 1c: drop reads whose swapped version mapped elsewhere
wasp2-map filter-remapped \
wasp_output/sample_remapped.bam \
--wasp_data_json wasp_output/sample_wasp_data_files.json \
--out_bam wasp_output/sample_wasp_filtered.bam
# Step 1d: merge with reads that never overlapped a SNP
samtools merge -f wasp_output/sample_final.bam \
wasp_output/sample_wasp_filtered.bam \
wasp_output/sample_keep.bam
samtools index wasp_output/sample_final.bam
Important
The remap step must use the same aligner, version, and parameters as the original alignment. A different scoring function invalidates the filter contract.
Filter rates are approximately 1–10% of SNP-overlapping reads dropped (developer experience; varies by sequencing platform, aligner, and variant density). Outlier rates should be investigated — they typically signal CNVs, aligner-version mismatch, or variant-call errors rather than a WASP failure.
Step 2 — Count alleles#
RNA-seq at genes (GTF):
wasp2-count count-variants \
wasp_output/sample_final.bam \
variants.vcf.gz \
--samples SAMPLE1 \
--region genes.gtf \
--out_file allele_counts.tsv
ATAC-seq at peaks (BED):
wasp2-count count-variants \
wasp_output/sample_final.bam \
variants.vcf.gz \
--samples SAMPLE1 \
--region peaks.bed \
--out_file allele_counts.tsv
Output columns: chrom, pos, ref, alt, ref_count,
alt_count, other_count, plus region (peak name) or
gene_id/gene_name (GTF). See Counting Module for region
filtering and multi-sample options.
Step 3 — Test for allelic imbalance#
wasp2-analyze find-imbalance \
allele_counts.tsv \
--min 10 \
--pseudocount 1 \
--phased \
--out_file ai_results.tsv
The beta-binomial LRT tests \(\mu = 0.5\) against \(\mu \neq 0.5\) per region, correcting for overdispersion and multiple testing via Benjamini–Hochberg (Analysis Module).
--phased applies only when the VCF GT field uses 0|1 / 1|0; pass
it off for unphased data.
Output columns: region, num_snps, aggregated ref_count /
alt_count, pval, fdr_pval, mu, effect_size
(\(\log_2(\text{ref/alt})\)).
Filtering and interpretation#
# FDR < 0.05 and |log2FC| > 1
awk -F'\t' 'NR==1 || ($5 < 0.05 && ($6 > 1 || $6 < -1))' \
ai_results.tsv > significant_ase.tsv
Monoallelic expression (ref fraction > 0.9 or < 0.1) is a common hallmark of genomic imprinting on RNA-seq and of strong cis-regulatory variants on ATAC.
eQTL integration (RNA-seq only)#
import pandas as pd
ase = pd.read_csv('ai_results.tsv', sep='\t')
eqtl = pd.read_csv('gtex_eqtl.tsv', sep='\t')
merged = ase.merge(eqtl, left_on='region', right_on='gene_id')
# ASE effect_size > 0 => REF allele more expressed.
# eQTL slope > 0 => ALT allele increases expression.
# Concordance = opposite signs.
merged['concordant'] = (merged['effect_size'] > 0) != (merged['slope'] > 0)
Troubleshooting#
Few SNPs found. Confirm the VCF sample column matches --samples, the
VCF has heterozygous genotypes for this donor, and BAM+VCF use the same
reference build.
Low power / few significant results. Increase depth, lower --min
(with caution), or aggregate at a coarser region level.
Too many significant results. Confirm the WASP filter was applied, check for batch effects, CNVs, or aligner-version drift, and tighten FDR.
See Also#
Mapping Module — full mapping CLI reference
Counting Module — counting CLI reference
Analysis Module — analysis CLI reference
WASP Mapping Bias Correction — canonical WASP filter contract
Analysis Module — analysis CLI + defaults + contracts
Single-Cell Workflow (scRNA-seq / scATAC-seq) — single-cell RNA-seq / ATAC-seq