{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ATAC-seq Allelic Imbalance Analysis Tutorial\n", "\n", "**Estimated time:** ~30 minutes\n", "\n", "This tutorial demonstrates a complete WASP2 workflow for detecting allelic imbalance in chromatin accessibility from ATAC-seq data. You will learn to:\n", "\n", "1. Prepare ATAC-seq peak files for analysis\n", "2. Count alleles at heterozygous SNPs within accessibility peaks\n", "3. Detect significant allelic imbalance using beta-binomial statistical testing\n", "4. Visualize results with volcano plots and effect size distributions\n", "5. Integrate with caQTL/eQTL data for biological interpretation\n", "\n", "## Background\n", "\n", "**Allelic Imbalance in Chromatin Accessibility**\n", "\n", "ATAC-seq (Assay for Transposase-Accessible Chromatin with sequencing) measures open chromatin regions genome-wide. When a heterozygous individual shows unequal accessibility between maternal and paternal alleles at a regulatory region, this indicates **allelic imbalance in chromatin accessibility**.\n", "\n", "Such imbalance often reflects:\n", "- *cis*-regulatory variants affecting transcription factor binding\n", "- Chromatin accessibility QTLs (caQTLs)\n", "- Haplotype-specific enhancer activity\n", "\n", "WASP2 uses a **beta-binomial model** to detect significant departures from the expected 50:50 allelic ratio while accounting for overdispersion in sequencing data." ] }, { "cell_type": "markdown", "metadata": {}, "source": "## Prerequisites\n\n### Software\n\n- **WASP2** (`pip install wasp2`)\n- **Python 3.10+** with pandas, matplotlib, numpy\n- **samtools** (for BAM operations)\n- **tabix** (for VCF indexing)\n\n### Input Data\n\n| File | Description | Format |\n|------|-------------|--------|\n| `sample.bam` | ATAC-seq aligned reads | BAM (indexed) |\n| `variants.vcf.gz` | Phased heterozygous variants | VCF (indexed) |\n| `peaks.bed` | ATAC-seq peaks from MACS2/SEACR | BED or narrowPeak |\n\n**Note:** For best results, use WASP-filtered BAM files to correct reference mapping bias. See the [mapping documentation](../user_guide/mapping.rst) for details." }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# Configure plotting\n", "plt.style.use(\"seaborn-v0_8-whitegrid\")\n", "plt.rcParams[\"figure.figsize\"] = (10, 6)\n", "plt.rcParams[\"font.size\"] = 11\n", "\n", "# Define paths (modify these for your data)\n", "DATA_DIR = Path(\"data\")\n", "RESULTS_DIR = Path(\"results\")\n", "RESULTS_DIR.mkdir(exist_ok=True)\n", "\n", "# Input files\n", "BAM_FILE = DATA_DIR / \"atac_sample.bam\"\n", "VCF_FILE = DATA_DIR / \"phased_variants.vcf.gz\"\n", "PEAKS_FILE = DATA_DIR / \"atac_peaks.narrowPeak\"\n", "SAMPLE_ID = \"SAMPLE1\" # Sample name in VCF\n", "\n", "print(\"WASP2 ATAC-seq Tutorial\")\n", "print(\"=\" * 40)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 1: Data Loading and Preparation\n", "\n", "### 1.1 Inspect Peak File Format\n", "\n", "ATAC-seq peaks from MACS2 are typically in **narrowPeak** format (BED6+4). WASP2 accepts both BED and narrowPeak formats." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load and inspect peaks\n", "peak_columns = [\n", " \"chr\",\n", " \"start\",\n", " \"end\",\n", " \"name\",\n", " \"score\",\n", " \"strand\",\n", " \"signalValue\",\n", " \"pValue\",\n", " \"qValue\",\n", " \"peak\",\n", "]\n", "peaks_df = pd.read_csv(PEAKS_FILE, sep=\"\\t\", header=None, names=peak_columns)\n", "\n", "print(f\"Total peaks: {len(peaks_df):,}\")\n", "print(\"\\nPeak size distribution:\")\n", "peaks_df[\"size\"] = peaks_df[\"end\"] - peaks_df[\"start\"]\n", "print(peaks_df[\"size\"].describe())\n", "\n", "print(\"\\nFirst 5 peaks:\")\n", "peaks_df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Verify BAM and VCF Files\n", "\n", "Check that your input files are properly formatted and indexed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash -s \"$BAM_FILE\" \"$VCF_FILE\" \"$SAMPLE_ID\"\n", "BAM_FILE=$1\n", "VCF_FILE=$2\n", "SAMPLE_ID=$3\n", "\n", "echo \"=== BAM File Check ===\"\n", "echo \"File: $BAM_FILE\"\n", "samtools view -H \"$BAM_FILE\" 2>/dev/null | head -5 || echo \"Note: Using example paths\"\n", "\n", "echo \"\"\n", "echo \"=== VCF File Check ===\"\n", "echo \"File: $VCF_FILE\"\n", "echo \"Checking for sample: $SAMPLE_ID\"\n", "bcftools query -l \"$VCF_FILE\" 2>/dev/null | head -5 || echo \"Note: Using example paths\"\n", "\n", "echo \"\"\n", "echo \"=== Index Check ===\"\n", "ls -la \"${BAM_FILE}.bai\" 2>/dev/null || echo \"BAM index (.bai): Using example paths\"\n", "ls -la \"${VCF_FILE}.tbi\" 2>/dev/null || echo \"VCF index (.tbi): Using example paths\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 2: Allele Counting at Peaks\n", "\n", "WASP2 counts reads supporting reference and alternate alleles at each heterozygous SNP within ATAC-seq peaks. The `--region` parameter restricts counting to SNPs overlapping your peaks.\n", "\n", "### 2.1 Run Allele Counting\n", "\n", "**Key Parameters:**\n", "- `--region`: Peak file to restrict SNPs to accessible regions\n", "- `--samples`: Sample ID for genotype filtering (heterozygous sites only)\n", "- `--out_file`: Output path for count results" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define output file\n", "COUNTS_FILE = RESULTS_DIR / \"atac_allele_counts.tsv\"\n", "\n", "# Build the command\n", "count_cmd = f\"\"\"\n", "wasp2-count count-variants \\\\\n", " {BAM_FILE} \\\\\n", " {VCF_FILE} \\\\\n", " --region {PEAKS_FILE} \\\\\n", " --samples {SAMPLE_ID} \\\\\n", " --out_file {COUNTS_FILE}\n", "\"\"\"\n", "\n", "print(\"Command to run:\")\n", "print(count_cmd)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%bash -s \"$BAM_FILE\" \"$VCF_FILE\" \"$PEAKS_FILE\" \"$SAMPLE_ID\" \"$COUNTS_FILE\"\n", "# Uncomment to run (requires actual data files)\n", "# wasp2-count count-variants \\\n", "# \"$1\" \\\n", "# \"$2\" \\\n", "# --region \"$3\" \\\n", "# --samples \"$4\" \\\n", "# --out_file \"$5\"\n", "\n", "echo \"Note: Uncomment the command above to run with your data\"\n", "echo \"For this tutorial, we'll use simulated example output.\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 Inspect Count Results\n", "\n", "The output contains per-SNP allele counts with peak annotations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# For demonstration, create example data\n", "# (Replace with: counts_df = pd.read_csv(COUNTS_FILE, sep='\\t') for real data)\n", "\n", "np.random.seed(42)\n", "n_snps = 5000\n", "\n", "# Simulate realistic ATAC-seq allele counts\n", "counts_df = pd.DataFrame(\n", " {\n", " \"chr\": np.random.choice([\"chr1\", \"chr2\", \"chr3\", \"chr4\", \"chr5\"], n_snps),\n", " \"pos\": np.random.randint(1e6, 2e8, n_snps),\n", " \"ref\": np.random.choice([\"A\", \"C\", \"G\", \"T\"], n_snps),\n", " \"alt\": np.random.choice([\"A\", \"C\", \"G\", \"T\"], n_snps),\n", " \"region_id\": [f\"peak_{i}\" for i in np.random.randint(0, 1500, n_snps)],\n", " }\n", ")\n", "\n", "# Generate counts with some true imbalanced regions\n", "total_depth = np.random.negative_binomial(5, 0.3, n_snps) + 5\n", "imbalance_prob = np.where(\n", " np.random.random(n_snps) < 0.1, # 10% truly imbalanced\n", " np.random.choice([0.3, 0.7], n_snps), # Imbalanced allele freq\n", " 0.5, # Balanced\n", ")\n", "counts_df[\"ref_count\"] = np.random.binomial(total_depth, imbalance_prob)\n", "counts_df[\"alt_count\"] = total_depth - counts_df[\"ref_count\"]\n", "counts_df[\"other_count\"] = 0\n", "\n", "print(f\"Total SNPs counted: {len(counts_df):,}\")\n", "print(f\"Unique peaks with SNPs: {counts_df['region_id'].nunique():,}\")\n", "print(\"\\nCount statistics:\")\n", "print(counts_df[[\"ref_count\", \"alt_count\"]].describe())\n", "counts_df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Quality Control: Count Distribution\n", "\n", "ATAC-seq typically has **lower coverage per peak** than RNA-seq genes. Check the distribution to set appropriate filtering thresholds." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculate total counts per SNP\n", "counts_df[\"total\"] = counts_df[\"ref_count\"] + counts_df[\"alt_count\"]\n", "\n", "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n", "\n", "# Total count distribution\n", "ax = axes[0]\n", "ax.hist(counts_df[\"total\"], bins=50, edgecolor=\"black\", alpha=0.7)\n", "ax.axvline(10, color=\"red\", linestyle=\"--\", label=\"min_count=10\")\n", "ax.axvline(5, color=\"orange\", linestyle=\"--\", label=\"min_count=5\")\n", "ax.set_xlabel(\"Total Read Count per SNP\")\n", "ax.set_ylabel(\"Number of SNPs\")\n", "ax.set_title(\"Read Depth Distribution\")\n", "ax.legend()\n", "ax.set_xlim(0, 100)\n", "\n", "# Allele ratio distribution\n", "ax = axes[1]\n", "ratio = counts_df[\"ref_count\"] / counts_df[\"total\"]\n", "ax.hist(ratio[counts_df[\"total\"] >= 10], bins=50, edgecolor=\"black\", alpha=0.7)\n", "ax.axvline(0.5, color=\"red\", linestyle=\"--\", label=\"Expected (0.5)\")\n", "ax.set_xlabel(\"Reference Allele Frequency\")\n", "ax.set_ylabel(\"Number of SNPs\")\n", "ax.set_title(\"Allele Ratio Distribution (depth ≥10)\")\n", "ax.legend()\n", "\n", "plt.tight_layout()\n", "plt.savefig(RESULTS_DIR / \"count_qc.png\", dpi=150)\n", "plt.show()\n", "\n", "# Summary statistics\n", "print(\n", " f\"\\nSNPs with depth ≥5: {(counts_df['total'] >= 5).sum():,} ({100 * (counts_df['total'] >= 5).mean():.1f}%)\"\n", ")\n", "print(\n", " f\"SNPs with depth ≥10: {(counts_df['total'] >= 10).sum():,} ({100 * (counts_df['total'] >= 10).mean():.1f}%)\"\n", ")\n", "print(\n", " f\"SNPs with depth ≥20: {(counts_df['total'] >= 20).sum():,} ({100 * (counts_df['total'] >= 20).mean():.1f}%)\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 3: Statistical Testing (Beta-Binomial)\n", "\n", "WASP2's `find-imbalance` command uses a **beta-binomial model** to test for allelic imbalance:\n", "\n", "- **Null hypothesis (H₀):** Reference allele frequency = 0.5 (balanced)\n", "- **Alternative (H₁):** Reference allele frequency ≠ 0.5 (imbalanced)\n", "\n", "The beta-binomial distribution accounts for **overdispersion** - the extra variability beyond binomial sampling that's common in sequencing data.\n", "\n", "### 3.1 Run Imbalance Detection" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define output file\n", "IMBALANCE_FILE = RESULTS_DIR / \"atac_imbalance_results.tsv\"\n", "\n", "# Build the command\n", "# Note: Using --min 5 for ATAC-seq (lower coverage than RNA-seq)\n", "# The --model single option uses a single dispersion parameter for all regions\n", "analysis_cmd = f\"\"\"\n", "wasp2-analyze find-imbalance \\\\\n", " {COUNTS_FILE} \\\\\n", " --min 5 \\\\\n", " --model single \\\\\n", " --output {IMBALANCE_FILE}\n", "\"\"\"\n", "\n", "print(\"Command to run:\")\n", "print(analysis_cmd)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "%%bash -s \"$COUNTS_FILE\" \"$IMBALANCE_FILE\"\n# Uncomment to run (requires actual count file)\n# wasp2-analyze find-imbalance \\\n# \"$1\" \\\n# --min 5 \\\n# --model single \\\n# --output \"$2\"\n\necho \"Note: Uncomment the command above to run with your data\"" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3.2 Simulate Results for Demonstration\n", "\n", "For this tutorial, we simulate realistic analysis results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Aggregate counts by peak (region)\n", "peak_counts = (\n", " counts_df.groupby(\"region_id\")\n", " .agg(\n", " {\n", " \"chr\": \"first\",\n", " \"pos\": [\"min\", \"max\"],\n", " \"ref_count\": \"sum\",\n", " \"alt_count\": \"sum\",\n", " }\n", " )\n", " .reset_index()\n", ")\n", "peak_counts.columns = [\"region\", \"chr\", \"start\", \"end\", \"ref_count\", \"alt_count\"]\n", "\n", "# Calculate statistics\n", "peak_counts[\"total\"] = peak_counts[\"ref_count\"] + peak_counts[\"alt_count\"]\n", "peak_counts[\"mu\"] = peak_counts[\"ref_count\"] / peak_counts[\"total\"]\n", "# Add pseudocount (+1) to avoid log(0) and stabilize ratios for low-count regions\n", "peak_counts[\"effect_size\"] = np.log2(\n", " (peak_counts[\"ref_count\"] + 1) / (peak_counts[\"alt_count\"] + 1)\n", ")\n", "\n", "# Simulate p-values (truly imbalanced peaks get low p-values)\n", "# Note: This simulation uses binomial for simplicity. Real ATAC-seq data exhibits\n", "# overdispersion, which is why WASP2 uses the beta-binomial model.\n", "np.random.seed(42)\n", "is_imbalanced = np.abs(peak_counts[\"mu\"] - 0.5) > 0.15\n", "peak_counts[\"p_value\"] = np.where(\n", " is_imbalanced,\n", " 10 ** (-np.random.uniform(2, 10, len(peak_counts))), # Significant\n", " np.random.uniform(0.05, 1, len(peak_counts)), # Not significant\n", ")\n", "\n", "# FDR correction (Benjamini-Hochberg)\n", "# Note: This manual BH implementation is for demonstration.\n", "# WASP2 internally uses scipy.stats.false_discovery_control()\n", "peak_counts = peak_counts.sort_values(\"p_value\")\n", "n_tests = len(peak_counts)\n", "peak_counts[\"rank\"] = range(1, n_tests + 1)\n", "peak_counts[\"fdr_pval\"] = np.minimum(peak_counts[\"p_value\"] * n_tests / peak_counts[\"rank\"], 1.0)\n", "peak_counts[\"fdr_pval\"] = peak_counts[\"fdr_pval\"][::-1].cummin()[::-1]\n", "\n", "# Filter to testable peaks\n", "results_df = peak_counts[peak_counts[\"total\"] >= 5].copy()\n", "results_df = results_df.drop(\"rank\", axis=1)\n", "\n", "print(f\"Peaks tested: {len(results_df):,}\")\n", "print(f\"Significant (FDR < 0.05): {(results_df['fdr_pval'] < 0.05).sum():,}\")\n", "print(f\"Significant (FDR < 0.01): {(results_df['fdr_pval'] < 0.01).sum():,}\")\n", "results_df.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 4: Result Interpretation and Visualization\n", "\n", "### 4.1 Volcano Plot\n", "\n", "The volcano plot shows effect size (x-axis) vs. statistical significance (y-axis), helping identify peaks with both strong and significant allelic imbalance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(10, 8))\n", "\n", "# Calculate -log10(p-value) for plotting\n", "results_df[\"neg_log10_pval\"] = -np.log10(results_df[\"p_value\"].clip(lower=1e-50))\n", "\n", "# Define significance thresholds\n", "sig_mask = results_df[\"fdr_pval\"] < 0.05\n", "effect_mask = np.abs(results_df[\"effect_size\"]) > 0.5 # |log2FC| > 0.5\n", "\n", "# Plot non-significant points\n", "ns = ~sig_mask\n", "ax.scatter(\n", " results_df.loc[ns, \"effect_size\"],\n", " results_df.loc[ns, \"neg_log10_pval\"],\n", " c=\"lightgray\",\n", " s=15,\n", " alpha=0.6,\n", " label=f\"Not significant (n={ns.sum():,})\",\n", ")\n", "\n", "# Plot significant but small effect\n", "sig_small = sig_mask & ~effect_mask\n", "ax.scatter(\n", " results_df.loc[sig_small, \"effect_size\"],\n", " results_df.loc[sig_small, \"neg_log10_pval\"],\n", " c=\"steelblue\",\n", " s=25,\n", " alpha=0.7,\n", " label=f\"FDR<0.05, small effect (n={sig_small.sum():,})\",\n", ")\n", "\n", "# Plot significant and large effect\n", "sig_large = sig_mask & effect_mask\n", "ax.scatter(\n", " results_df.loc[sig_large, \"effect_size\"],\n", " results_df.loc[sig_large, \"neg_log10_pval\"],\n", " c=\"firebrick\",\n", " s=40,\n", " alpha=0.8,\n", " label=f\"FDR<0.05, |log2FC|>0.5 (n={sig_large.sum():,})\",\n", ")\n", "\n", "# Add threshold lines\n", "ax.axhline(-np.log10(0.05), color=\"black\", linestyle=\"--\", alpha=0.3, linewidth=1)\n", "ax.axvline(0.5, color=\"gray\", linestyle=\":\", alpha=0.5)\n", "ax.axvline(-0.5, color=\"gray\", linestyle=\":\", alpha=0.5)\n", "ax.axvline(0, color=\"black\", linestyle=\"-\", alpha=0.2)\n", "\n", "ax.set_xlabel(\"Effect Size (log₂ Ref/Alt)\", fontsize=12)\n", "ax.set_ylabel(\"-log₁₀(p-value)\", fontsize=12)\n", "ax.set_title(\"ATAC-seq Allelic Imbalance\\nVolcano Plot\", fontsize=14)\n", "ax.legend(loc=\"upper right\", fontsize=9)\n", "\n", "plt.tight_layout()\n", "plt.savefig(RESULTS_DIR / \"volcano_plot.png\", dpi=150)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.2 Effect Size Distribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n", "\n", "# All peaks\n", "ax = axes[0]\n", "ax.hist(results_df[\"effect_size\"], bins=50, edgecolor=\"black\", alpha=0.7, color=\"steelblue\")\n", "ax.axvline(0, color=\"red\", linestyle=\"--\", linewidth=2)\n", "ax.set_xlabel(\"Effect Size (log₂ Ref/Alt)\")\n", "ax.set_ylabel(\"Number of Peaks\")\n", "ax.set_title(\"All Tested Peaks\")\n", "\n", "# Significant peaks only\n", "ax = axes[1]\n", "sig_effects = results_df.loc[sig_mask, \"effect_size\"]\n", "ax.hist(sig_effects, bins=30, edgecolor=\"black\", alpha=0.7, color=\"firebrick\")\n", "ax.axvline(0, color=\"black\", linestyle=\"--\", linewidth=2)\n", "ax.set_xlabel(\"Effect Size (log₂ Ref/Alt)\")\n", "ax.set_ylabel(\"Number of Peaks\")\n", "ax.set_title(f\"Significant Peaks (FDR < 0.05, n={sig_mask.sum():,})\")\n", "\n", "plt.tight_layout()\n", "plt.savefig(RESULTS_DIR / \"effect_size_distribution.png\", dpi=150)\n", "plt.show()\n", "\n", "# Summary statistics\n", "print(\"Effect size statistics (significant peaks):\")\n", "print(f\" Mean: {sig_effects.mean():.3f}\")\n", "print(f\" Median: {sig_effects.median():.3f}\")\n", "print(f\" Std: {sig_effects.std():.3f}\")\n", "print(f\" Ref-biased (log2FC > 0): {(sig_effects > 0).sum()}\")\n", "print(f\" Alt-biased (log2FC < 0): {(sig_effects < 0).sum()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4.3 Top Imbalanced Peaks" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get top hits by significance\n", "top_hits = results_df[results_df[\"fdr_pval\"] < 0.05].nsmallest(20, \"fdr_pval\")\n", "\n", "print(\"Top 20 Peaks with Allelic Imbalance\")\n", "print(\"=\" * 80)\n", "display_cols = [\n", " \"region\",\n", " \"chr\",\n", " \"ref_count\",\n", " \"alt_count\",\n", " \"mu\",\n", " \"effect_size\",\n", " \"p_value\",\n", " \"fdr_pval\",\n", "]\n", "top_hits[display_cols].round(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 5: QTL Overlap Analysis\n", "\n", "Peaks with allelic imbalance often harbor **chromatin accessibility QTLs (caQTLs)** or overlap with **expression QTLs (eQTLs)**. Integrating your results with published QTL databases helps validate findings and identify regulatory mechanisms.\n", "\n", "### 5.1 Prepare BED File for Overlap Analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Export significant peaks as BED for overlap analysis\n", "sig_peaks = results_df[results_df[\"fdr_pval\"] < 0.05][\n", " [\"chr\", \"start\", \"end\", \"region\", \"effect_size\", \"fdr_pval\"]\n", "].copy()\n", "sig_peaks.to_csv(RESULTS_DIR / \"significant_peaks.bed\", sep=\"\\t\", header=False, index=False)\n", "\n", "print(f\"Exported {len(sig_peaks)} significant peaks to: {RESULTS_DIR / 'significant_peaks.bed'}\")\n", "print(\"\\nUse this file for overlap analysis with:\")\n", "print(\" - GTEx eQTLs (https://gtexportal.org)\")\n", "print(\" - ENCODE cCREs (https://www.encodeproject.org)\")\n", "print(\" - Published caQTL datasets\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.2 Example: GTEx eQTL Overlap\n", "\n", "This example shows how to intersect your imbalanced peaks with GTEx eQTL SNPs to identify potential regulatory relationships." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Example overlap analysis with bedtools (requires eQTL BED file)\n", "overlap_cmd = \"\"\"\n", "# Download GTEx eQTLs for your tissue of interest\n", "# Example: Brain cortex significant eQTLs\n", "\n", "# Intersect imbalanced peaks with eQTL positions\n", "bedtools intersect \\\\\n", " -a results/significant_peaks.bed \\\\\n", " -b gtex_brain_cortex_eqtls.bed \\\\\n", " -wa -wb \\\\\n", " > results/peak_eqtl_overlap.bed\n", "\n", "# Count overlaps\n", "wc -l results/peak_eqtl_overlap.bed\n", "\"\"\"\n", "\n", "print(\"Example bedtools command for eQTL overlap:\")\n", "print(overlap_cmd)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Simulate overlap results for demonstration\n", "np.random.seed(123)\n", "\n", "# Create simulated eQTL overlap data\n", "n_overlap = 150\n", "overlap_df = pd.DataFrame(\n", " {\n", " \"peak\": [f\"peak_{i}\" for i in np.random.choice(range(1500), n_overlap, replace=False)],\n", " \"eqtl_gene\": [f\"GENE{i}\" for i in np.random.randint(1, 500, n_overlap)],\n", " \"eqtl_pval\": 10 ** (-np.random.uniform(3, 15, n_overlap)),\n", " \"tissue\": np.random.choice(\n", " [\"Brain_Cortex\", \"Brain_Hippocampus\", \"Liver\", \"Heart\"], n_overlap\n", " ),\n", " }\n", ")\n", "\n", "print(f\"Peaks overlapping eQTLs: {len(overlap_df)}\")\n", "print(\"\\nOverlap by tissue:\")\n", "print(overlap_df[\"tissue\"].value_counts())\n", "\n", "# Enrichment analysis\n", "n_sig_peaks = sig_mask.sum()\n", "n_total_peaks = len(results_df)\n", "n_eqtl_overlap = len(overlap_df)\n", "\n", "# Fisher's exact test for enrichment\n", "from scipy.stats import fisher_exact\n", "\n", "# Assume 10% of all peaks overlap eQTLs by chance\n", "expected_overlap = int(n_sig_peaks * 0.10)\n", "contingency = [\n", " [n_eqtl_overlap, n_sig_peaks - n_eqtl_overlap],\n", " [expected_overlap, n_sig_peaks - expected_overlap],\n", "]\n", "odds_ratio, p_value = fisher_exact(contingency)\n", "\n", "print(\"\\nEnrichment Analysis:\")\n", "print(f\" Imbalanced peaks: {n_sig_peaks}\")\n", "print(f\" Overlapping eQTLs: {n_eqtl_overlap}\")\n", "print(f\" Expected by chance: ~{expected_overlap}\")\n", "print(f\" Fold enrichment: {n_eqtl_overlap / max(expected_overlap, 1):.2f}x\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5.3 Visualization: eQTL Overlap" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n", "\n", "# Pie chart: Overlap proportion\n", "ax = axes[0]\n", "overlap_counts = [n_eqtl_overlap, n_sig_peaks - n_eqtl_overlap]\n", "labels = [\"Overlap with eQTL\", \"No eQTL overlap\"]\n", "colors = [\"#e74c3c\", \"#95a5a6\"]\n", "ax.pie(overlap_counts, labels=labels, colors=colors, autopct=\"%1.1f%%\", startangle=90)\n", "ax.set_title(\"Imbalanced Peaks Overlapping eQTLs\")\n", "\n", "# Bar chart: Overlap by tissue\n", "ax = axes[1]\n", "tissue_counts = overlap_df[\"tissue\"].value_counts()\n", "colors = plt.cm.Set2(range(len(tissue_counts)))\n", "bars = ax.bar(tissue_counts.index, tissue_counts.values, color=colors, edgecolor=\"black\")\n", "ax.set_xlabel(\"Tissue\")\n", "ax.set_ylabel(\"Number of Overlapping eQTLs\")\n", "ax.set_title(\"eQTL Overlap by Tissue\")\n", "ax.tick_params(axis=\"x\", rotation=45)\n", "\n", "plt.tight_layout()\n", "plt.savefig(RESULTS_DIR / \"eqtl_overlap.png\", dpi=150)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Section 6: Downstream Analysis Hints\n", "\n", "### 6.1 Motif Enrichment Analysis\n", "\n", "Imbalanced peaks may disrupt transcription factor binding sites. Use tools like HOMER or MEME-ChIP for motif analysis." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "motif_cmd = \"\"\"\n", "# Extract sequences around imbalanced SNPs\n", "bedtools slop -i significant_peaks.bed -g genome.chrom.sizes -b 50 | \\\\\n", "bedtools getfasta -fi genome.fa -bed - -fo imbalanced_seqs.fa\n", "\n", "# Run HOMER motif analysis\n", "findMotifsGenome.pl significant_peaks.bed hg38 motif_results/ -size 200\n", "\n", "# Alternative: MEME-ChIP\n", "meme-chip -oc meme_results imbalanced_seqs.fa\n", "\"\"\"\n", "\n", "print(\"Example commands for motif enrichment analysis:\")\n", "print(motif_cmd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.2 Gene Ontology Enrichment\n", "\n", "Identify biological processes associated with genes near imbalanced peaks." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "go_cmd = \"\"\"\n", "# Annotate peaks with nearest genes using GREAT or bedtools\n", "bedtools closest -a significant_peaks.bed -b genes.bed -d > peak_gene_assignments.bed\n", "\n", "# Extract gene list\n", "cut -f8 peak_gene_assignments.bed | sort -u > imbalanced_genes.txt\n", "\n", "# Use DAVID, Enrichr, or clusterProfiler for GO enrichment\n", "# Web interface: https://david.ncifcrf.gov/\n", "# Web interface: https://maayanlab.cloud/Enrichr/\n", "\"\"\"\n", "\n", "print(\"Example commands for GO enrichment analysis:\")\n", "print(go_cmd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 6.3 Single-Cell ATAC-seq Extension\n", "\n", "For single-cell ATAC-seq (scATAC-seq), use WASP2's single-cell workflow to detect cell-type-specific allelic imbalance." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sc_cmd = \"\"\"\n", "# Count alleles in single-cell ATAC-seq\n", "wasp2-count count-variants-sc \\\\\n", " scatac_possorted.bam \\\\\n", " variants.vcf.gz \\\\\n", " barcodes.tsv \\\\\n", " --samples SAMPLE_ID \\\\\n", " --feature peaks.bed \\\\\n", " --out_file scatac_counts.h5ad\n", "\n", "# Detect imbalance per cell type\n", "wasp2-analyze find-imbalance-sc \\\\\n", " scatac_counts.h5ad \\\\\n", " barcode_celltype_map.tsv \\\\\n", " --sample SAMPLE_ID \\\\\n", " --min 5 \\\\\n", " --phased\n", "\n", "# Compare imbalance between cell types\n", "wasp2-analyze compare-imbalance \\\\\n", " scatac_counts.h5ad \\\\\n", " barcode_celltype_map.tsv \\\\\n", " --groups \"excitatory,inhibitory\" \\\\\n", " --sample SAMPLE_ID\n", "\"\"\"\n", "\n", "print(\"Commands for single-cell ATAC-seq analysis:\")\n", "print(sc_cmd)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summary\n", "\n", "In this tutorial, you learned to:\n", "\n", "1. **Prepare ATAC-seq data** - Load peaks and verify input file formats\n", "2. **Count alleles** - Use `wasp2-count count-variants` with peak regions\n", "3. **Detect imbalance** - Apply beta-binomial testing with `wasp2-analyze find-imbalance`\n", "4. **Visualize results** - Create volcano plots and effect size distributions\n", "5. **Integrate with QTLs** - Overlap with eQTL databases for biological validation\n", "\n", "### Key Takeaways\n", "\n", "- ATAC-seq has **lower coverage per peak** than RNA-seq; use `--min-count 5` instead of 10\n", "- **FDR correction** is essential for multiple testing across thousands of peaks\n", "- Consider **effect size** alongside significance for biological relevance\n", "- **QTL overlap** helps validate findings and identify causal variants\n", "\n", "### Next Steps\n", "\n", "- [Comparative Imbalance Tutorial](./comparative_imbalance.rst) - Compare imbalance between conditions\n", "- [Single-Cell Tutorial](./scrna_seq.rst) - Cell-type-specific analysis\n", "- [Statistical Methods](../methods/statistical_models.rst) - Deep dive into the beta-binomial model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Troubleshooting\n", "\n", "### Common Issues\n", "\n", "**Low SNP counts in peaks:**\n", "- Ensure VCF contains heterozygous variants for your sample\n", "- Check that peak coordinates use the same reference genome as VCF\n", "- Verify `--samples` matches the sample name in VCF header\n", "\n", "**Memory errors with large datasets:**\n", "- Process chromosomes separately with `--region chr1_peaks.bed`, etc.\n", "- Use `WASP2_RUST_THREADS=4` to limit parallel processing\n", "\n", "**No significant results:**\n", "- Check read depth (may need deeper sequencing)\n", "- Verify WASP filtering was applied to remove mapping bias\n", "- Consider lowering `--min-count` threshold (with caution)\n", "\n", "### Diagnostic Commands\n", "\n", "```bash\n", "# Check VCF sample names\n", "bcftools query -l variants.vcf.gz\n", "\n", "# Count heterozygous SNPs in your sample\n", "bcftools view -s SAMPLE_ID variants.vcf.gz | bcftools view -g het | wc -l\n", "\n", "# Check BAM read depth at a peak\n", "samtools depth -r chr1:1000000-1001000 sample.bam | awk '{sum+=$3} END {print sum/NR}'\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save final results\n", "results_df.to_csv(RESULTS_DIR / \"final_imbalance_results.tsv\", sep=\"\\t\", index=False)\n", "print(f\"Results saved to: {RESULTS_DIR / 'final_imbalance_results.tsv'}\")\n", "print(\"\\nAnalysis complete!\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.0" } }, "nbformat": 4, "nbformat_minor": 4 }