Single-Cell Workflow (scRNA-seq / scATAC-seq)#

End-to-end allele-specific workflow for single-cell data — 10X Chromium scRNA-seq and 10X scATAC-seq. Pipeline is the same in both cases; the data-type difference shows up as GTF (for scRNA-seq genes) vs. BED (for scATAC-seq peaks) in the --feature argument.

Inputs#

  • Cell Ranger BAM with cell barcodes in the CB:Z:... tag + index

  • Phased VCF/BCF/PGEN for the donor

  • Barcode-to-group TSV (cell type or other assignment — see Single-Cell Analysis for Seurat/Scanpy export code and format)

  • scRNA-seq: GTF gene annotation

  • scATAC-seq: BED peak file (usually from Cell Ranger filtered_peak_bc_matrix or a consensus peak set)

Step 1 — Count alleles per cell#

scRNA-seq (genes):

wasp2-count count-variants-sc \
  cellranger_output/outs/possorted_genome_bam.bam \
  phased_variants.vcf.gz \
  barcodes_celltype.tsv \
  --feature genes.gtf \
  --samples SAMPLE_ID \
  --out_file allele_counts.h5ad

scATAC-seq (peaks):

wasp2-count count-variants-sc \
  cellranger_output/outs/possorted_bam.bam \
  phased_variants.vcf.gz \
  barcodes_celltype.tsv \
  --feature peaks.bed \
  --samples SAMPLE_ID \
  --out_file allele_counts.h5ad

Output: an AnnData .h5ad with ref / alt / other layers, genotype columns in .obs, and cell-type assignments in .var. See Single-Cell Analysis for the full schema.

Step 2 — Per-group imbalance#

wasp2-analyze find-imbalance-sc \
  allele_counts.h5ad \
  barcodes_celltype.tsv \
  --sample SAMPLE_ID \
  --phased --min 10 -z 3 \
  --out_file imbalance_by_celltype.tsv

Output columns: region, cell_type, aggregated ref_count / alt_count, pval, fdr_pval, effect_size (log₂ ref/alt).

Step 3 — Compare groups (optional)#

wasp2-analyze compare-imbalance \
  allele_counts.h5ad \
  barcodes_celltype.tsv \
  --groups "CD4_T_cell,CD8_T_cell" \
  --phased \
  --out_file differential_imbalance.tsv

Omit --groups to compare all available groups pairwise. See Analysis Module for the full CLI reference and output columns.

Per-cell vs. pseudo-bulk#

Single-cell ATAC data is especially sparse — most cells contribute zero reads to most peaks. Two analysis modes are common:

Aspect

Per-cell

Pseudo-bulk (per-cell-type)

Resolution

Single cell

Cell population

Power

Low (sparse)

High (aggregated)

Use case

Outlier cells

Population-level imbalance

Pseudo-bulk (the default, via the barcode-to-group TSV) is the right starting point for most scATAC experiments. Per-cell analysis is useful when investigating rare subpopulations or outlier effects.

Interpreting results#

import pandas as pd

results = pd.read_csv('imbalance_by_celltype.tsv', sep='\t')
sig = results[results['fdr_pval'] < 0.05]

top = (sig.groupby('cell_type')
          .apply(lambda x: x.nsmallest(10, 'fdr_pval'))
          .reset_index(drop=True))

print(top[['region', 'cell_type', 'effect_size', 'fdr_pval']])

Troubleshooting#

Zero barcodes matched. Confirm barcode format in the BAM vs. the TSV — the CB:Z:... tag often has a -1 suffix that your export must match:

samtools view your.bam | head -10000 | grep -o 'CB:Z:[^[:space:]]*' \
  | cut -d: -f3 | sort -u > bam_bc.txt
cut -f1 barcodes.tsv | sort -u > file_bc.txt
comm -12 bam_bc.txt file_bc.txt | wc -l   # should be > 0

Fix a missing suffix:

awk -F'\t' '{print $1"-1\t"$2}' barcodes_no_suffix.tsv > barcodes.tsv

Sparse counts / low power. Aggregate to pseudo-bulk by cell type, lower --min / --min_count, or focus on highly expressed genes (scRNA-seq) / high-coverage peaks (scATAC-seq).

Memory. For large cohorts, split the feature file by chromosome and process chunks:

for chr in chr{1..22}; do
  grep "^${chr}\s" peaks.bed > peaks_${chr}.bed
  wasp2-count count-variants-sc sample.bam variants.vcf.gz barcodes.tsv \
    --feature peaks_${chr}.bed --out_file counts_${chr}.h5ad
done

See Also#