Bulk Workflow (RNA-seq / ATAC-seq) =================================== End-to-end walkthrough of WASP2 on bulk sequencing data: WASP mapping filter, allele counting at regions, and the beta-binomial LRT for allelic imbalance. The same pipeline works for RNA-seq at genes (GTF) and ATAC-seq at peaks (BED); data-type differences are called out where they matter. Inputs: - Aligned BAM (coordinate-sorted, indexed) - VCF with the donor's heterozygous variants (phased for RNA-seq haplotypes) - Region file: GTF/GFF for RNA-seq genes, BED for ATAC-seq peaks Step 1 — WASP mapping filter ---------------------------- Reference alleles map more easily than alternate alleles, biasing ref counts upward. The WASP remap-and-filter step removes reads whose mapping position depends on which allele they carry. The test is "would this read map to the same location if we swapped ref↔alt at the SNP?" — only reads that do are kept. After filtering, .. math:: P(\text{map} \mid \text{ref}) \approx P(\text{map} \mid \text{alt}) under the usual assumption that the aligner's scoring is deterministic and position-dependent (see :doc:`/methods/mapping_filter` and [vandeGeijn2015]_). .. code-block:: bash # Step 1a: produce allele-swapped FASTQ + BAMs wasp2-map make-reads sample.bam variants.vcf.gz \ --samples SAMPLE1 \ --out_dir wasp_output # Step 1b: remap with the SAME aligner and parameters used originally bwa mem -M -t 8 genome.fa \ wasp_output/sample_swapped_alleles_r1.fq \ wasp_output/sample_swapped_alleles_r2.fq \ | samtools sort -o wasp_output/sample_remapped.bam - samtools index wasp_output/sample_remapped.bam # Step 1c: drop reads whose swapped version mapped elsewhere wasp2-map filter-remapped \ wasp_output/sample_remapped.bam \ --wasp_data_json wasp_output/sample_wasp_data_files.json \ --out_bam wasp_output/sample_wasp_filtered.bam # Step 1d: merge with reads that never overlapped a SNP samtools merge -f wasp_output/sample_final.bam \ wasp_output/sample_wasp_filtered.bam \ wasp_output/sample_keep.bam samtools index wasp_output/sample_final.bam .. important:: The remap step **must** use the same aligner, version, and parameters as the original alignment. A different scoring function invalidates the filter contract. Filter rates are **approximately** 1–10% of SNP-overlapping reads dropped (developer experience; varies by sequencing platform, aligner, and variant density). Outlier rates should be investigated — they typically signal CNVs, aligner-version mismatch, or variant-call errors rather than a WASP failure. Step 2 — Count alleles ---------------------- **RNA-seq at genes** (GTF): .. code-block:: bash wasp2-count count-variants \ wasp_output/sample_final.bam \ variants.vcf.gz \ --samples SAMPLE1 \ --region genes.gtf \ --out_file allele_counts.tsv **ATAC-seq at peaks** (BED): .. code-block:: bash wasp2-count count-variants \ wasp_output/sample_final.bam \ variants.vcf.gz \ --samples SAMPLE1 \ --region peaks.bed \ --out_file allele_counts.tsv Output columns: ``chrom``, ``pos``, ``ref``, ``alt``, ``ref_count``, ``alt_count``, ``other_count``, plus ``region`` (peak name) or ``gene_id``/``gene_name`` (GTF). See :doc:`/user_guide/counting` for region filtering and multi-sample options. Step 3 — Test for allelic imbalance ----------------------------------- .. code-block:: bash wasp2-analyze find-imbalance \ allele_counts.tsv \ --min 10 \ --pseudocount 1 \ --phased \ --out_file ai_results.tsv The beta-binomial LRT tests :math:`\mu = 0.5` against :math:`\mu \neq 0.5` per region, correcting for overdispersion and multiple testing via Benjamini–Hochberg (:doc:`/user_guide/analysis`). ``--phased`` applies only when the VCF GT field uses ``0|1`` / ``1|0``; pass it off for unphased data. Output columns: ``region``, ``num_snps``, aggregated ``ref_count`` / ``alt_count``, ``pval``, ``fdr_pval``, ``mu``, ``effect_size`` (:math:`\log_2(\text{ref/alt})`). Filtering and interpretation ---------------------------- .. code-block:: bash # FDR < 0.05 and |log2FC| > 1 awk -F'\t' 'NR==1 || ($5 < 0.05 && ($6 > 1 || $6 < -1))' \ ai_results.tsv > significant_ase.tsv Monoallelic expression (ref fraction > 0.9 or < 0.1) is a common hallmark of genomic imprinting on RNA-seq and of strong cis-regulatory variants on ATAC. eQTL integration (RNA-seq only) ------------------------------- .. code-block:: python import pandas as pd ase = pd.read_csv('ai_results.tsv', sep='\t') eqtl = pd.read_csv('gtex_eqtl.tsv', sep='\t') merged = ase.merge(eqtl, left_on='region', right_on='gene_id') # ASE effect_size > 0 => REF allele more expressed. # eQTL slope > 0 => ALT allele increases expression. # Concordance = opposite signs. merged['concordant'] = (merged['effect_size'] > 0) != (merged['slope'] > 0) Troubleshooting --------------- **Few SNPs found.** Confirm the VCF sample column matches ``--samples``, the VCF has heterozygous genotypes for this donor, and BAM+VCF use the same reference build. **Low power / few significant results.** Increase depth, lower ``--min`` (with caution), or aggregate at a coarser region level. **Too many significant results.** Confirm the WASP filter was applied, check for batch effects, CNVs, or aligner-version drift, and tighten FDR. See Also -------- - :doc:`/user_guide/mapping` — full mapping CLI reference - :doc:`/user_guide/counting` — counting CLI reference - :doc:`/user_guide/analysis` — analysis CLI reference - :doc:`/methods/mapping_filter` — canonical WASP filter contract - :doc:`/user_guide/analysis` — analysis CLI + defaults + contracts - :doc:`/tutorials/single_cell_workflow` — single-cell RNA-seq / ATAC-seq