{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ATAC-seq Allelic Imbalance Analysis Tutorial\n",
    "\n",
    "**Estimated time:** ~30 minutes\n",
    "\n",
    "This tutorial demonstrates a complete WASP2 workflow for detecting allelic imbalance in chromatin accessibility from ATAC-seq data. You will learn to:\n",
    "\n",
    "1. Prepare ATAC-seq peak files for analysis\n",
    "2. Count alleles at heterozygous SNPs within accessibility peaks\n",
    "3. Detect significant allelic imbalance using beta-binomial statistical testing\n",
    "4. Visualize results with volcano plots and effect size distributions\n",
    "5. Integrate with caQTL/eQTL data for biological interpretation\n",
    "\n",
    "## Background\n",
    "\n",
    "**Allelic Imbalance in Chromatin Accessibility**\n",
    "\n",
    "ATAC-seq (Assay for Transposase-Accessible Chromatin with sequencing) measures open chromatin regions genome-wide. When a heterozygous individual shows unequal accessibility between maternal and paternal alleles at a regulatory region, this indicates **allelic imbalance in chromatin accessibility**.\n",
    "\n",
    "Such imbalance often reflects:\n",
    "- *cis*-regulatory variants affecting transcription factor binding\n",
    "- Chromatin accessibility QTLs (caQTLs)\n",
    "- Haplotype-specific enhancer activity\n",
    "\n",
    "WASP2 uses a **beta-binomial model** to detect significant departures from the expected 50:50 allelic ratio while accounting for overdispersion in sequencing data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## Prerequisites\n\n### Software\n\n- **WASP2** (`pip install wasp2`)\n- **Python 3.10+** with pandas, matplotlib, numpy\n- **samtools** (for BAM operations)\n- **tabix** (for VCF indexing)\n\n### Input Data\n\n| File | Description | Format |\n|------|-------------|--------|\n| `sample.bam` | ATAC-seq aligned reads | BAM (indexed) |\n| `variants.vcf.gz` | Phased heterozygous variants | VCF (indexed) |\n| `peaks.bed` | ATAC-seq peaks from MACS2/SEACR | BED or narrowPeak |\n\n**Note:** For best results, use WASP-filtered BAM files to correct reference mapping bias. See the [mapping documentation](../user_guide/mapping.rst) for details."
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Configure plotting\n",
    "plt.style.use(\"seaborn-v0_8-whitegrid\")\n",
    "plt.rcParams[\"figure.figsize\"] = (10, 6)\n",
    "plt.rcParams[\"font.size\"] = 11\n",
    "\n",
    "# Define paths (modify these for your data)\n",
    "DATA_DIR = Path(\"data\")\n",
    "RESULTS_DIR = Path(\"results\")\n",
    "RESULTS_DIR.mkdir(exist_ok=True)\n",
    "\n",
    "# Input files\n",
    "BAM_FILE = DATA_DIR / \"atac_sample.bam\"\n",
    "VCF_FILE = DATA_DIR / \"phased_variants.vcf.gz\"\n",
    "PEAKS_FILE = DATA_DIR / \"atac_peaks.narrowPeak\"\n",
    "SAMPLE_ID = \"SAMPLE1\"  # Sample name in VCF\n",
    "\n",
    "print(\"WASP2 ATAC-seq Tutorial\")\n",
    "print(\"=\" * 40)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 1: Data Loading and Preparation\n",
    "\n",
    "### 1.1 Inspect Peak File Format\n",
    "\n",
    "ATAC-seq peaks from MACS2 are typically in **narrowPeak** format (BED6+4). WASP2 accepts both BED and narrowPeak formats."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load and inspect peaks\n",
    "peak_columns = [\n",
    "    \"chr\",\n",
    "    \"start\",\n",
    "    \"end\",\n",
    "    \"name\",\n",
    "    \"score\",\n",
    "    \"strand\",\n",
    "    \"signalValue\",\n",
    "    \"pValue\",\n",
    "    \"qValue\",\n",
    "    \"peak\",\n",
    "]\n",
    "peaks_df = pd.read_csv(PEAKS_FILE, sep=\"\\t\", header=None, names=peak_columns)\n",
    "\n",
    "print(f\"Total peaks: {len(peaks_df):,}\")\n",
    "print(\"\\nPeak size distribution:\")\n",
    "peaks_df[\"size\"] = peaks_df[\"end\"] - peaks_df[\"start\"]\n",
    "print(peaks_df[\"size\"].describe())\n",
    "\n",
    "print(\"\\nFirst 5 peaks:\")\n",
    "peaks_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 Verify BAM and VCF Files\n",
    "\n",
    "Check that your input files are properly formatted and indexed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash -s \"$BAM_FILE\" \"$VCF_FILE\" \"$SAMPLE_ID\"\n",
    "BAM_FILE=$1\n",
    "VCF_FILE=$2\n",
    "SAMPLE_ID=$3\n",
    "\n",
    "echo \"=== BAM File Check ===\"\n",
    "echo \"File: $BAM_FILE\"\n",
    "samtools view -H \"$BAM_FILE\" 2>/dev/null | head -5 || echo \"Note: Using example paths\"\n",
    "\n",
    "echo \"\"\n",
    "echo \"=== VCF File Check ===\"\n",
    "echo \"File: $VCF_FILE\"\n",
    "echo \"Checking for sample: $SAMPLE_ID\"\n",
    "bcftools query -l \"$VCF_FILE\" 2>/dev/null | head -5 || echo \"Note: Using example paths\"\n",
    "\n",
    "echo \"\"\n",
    "echo \"=== Index Check ===\"\n",
    "ls -la \"${BAM_FILE}.bai\" 2>/dev/null || echo \"BAM index (.bai): Using example paths\"\n",
    "ls -la \"${VCF_FILE}.tbi\" 2>/dev/null || echo \"VCF index (.tbi): Using example paths\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 2: Allele Counting at Peaks\n",
    "\n",
    "WASP2 counts reads supporting reference and alternate alleles at each heterozygous SNP within ATAC-seq peaks. The `--region` parameter restricts counting to SNPs overlapping your peaks.\n",
    "\n",
    "### 2.1 Run Allele Counting\n",
    "\n",
    "**Key Parameters:**\n",
    "- `--region`: Peak file to restrict SNPs to accessible regions\n",
    "- `--samples`: Sample ID for genotype filtering (heterozygous sites only)\n",
    "- `--out_file`: Output path for count results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define output file\n",
    "COUNTS_FILE = RESULTS_DIR / \"atac_allele_counts.tsv\"\n",
    "\n",
    "# Build the command\n",
    "count_cmd = f\"\"\"\n",
    "wasp2-count count-variants \\\\\n",
    "    {BAM_FILE} \\\\\n",
    "    {VCF_FILE} \\\\\n",
    "    --region {PEAKS_FILE} \\\\\n",
    "    --samples {SAMPLE_ID} \\\\\n",
    "    --out_file {COUNTS_FILE}\n",
    "\"\"\"\n",
    "\n",
    "print(\"Command to run:\")\n",
    "print(count_cmd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash -s \"$BAM_FILE\" \"$VCF_FILE\" \"$PEAKS_FILE\" \"$SAMPLE_ID\" \"$COUNTS_FILE\"\n",
    "# Uncomment to run (requires actual data files)\n",
    "# wasp2-count count-variants \\\n",
    "#     \"$1\" \\\n",
    "#     \"$2\" \\\n",
    "#     --region \"$3\" \\\n",
    "#     --samples \"$4\" \\\n",
    "#     --out_file \"$5\"\n",
    "\n",
    "echo \"Note: Uncomment the command above to run with your data\"\n",
    "echo \"For this tutorial, we'll use simulated example output.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 Inspect Count Results\n",
    "\n",
    "The output contains per-SNP allele counts with peak annotations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For demonstration, create example data\n",
    "# (Replace with: counts_df = pd.read_csv(COUNTS_FILE, sep='\\t') for real data)\n",
    "\n",
    "np.random.seed(42)\n",
    "n_snps = 5000\n",
    "\n",
    "# Simulate realistic ATAC-seq allele counts\n",
    "counts_df = pd.DataFrame(\n",
    "    {\n",
    "        \"chr\": np.random.choice([\"chr1\", \"chr2\", \"chr3\", \"chr4\", \"chr5\"], n_snps),\n",
    "        \"pos\": np.random.randint(1e6, 2e8, n_snps),\n",
    "        \"ref\": np.random.choice([\"A\", \"C\", \"G\", \"T\"], n_snps),\n",
    "        \"alt\": np.random.choice([\"A\", \"C\", \"G\", \"T\"], n_snps),\n",
    "        \"region_id\": [f\"peak_{i}\" for i in np.random.randint(0, 1500, n_snps)],\n",
    "    }\n",
    ")\n",
    "\n",
    "# Generate counts with some true imbalanced regions\n",
    "total_depth = np.random.negative_binomial(5, 0.3, n_snps) + 5\n",
    "imbalance_prob = np.where(\n",
    "    np.random.random(n_snps) < 0.1,  # 10% truly imbalanced\n",
    "    np.random.choice([0.3, 0.7], n_snps),  # Imbalanced allele freq\n",
    "    0.5,  # Balanced\n",
    ")\n",
    "counts_df[\"ref_count\"] = np.random.binomial(total_depth, imbalance_prob)\n",
    "counts_df[\"alt_count\"] = total_depth - counts_df[\"ref_count\"]\n",
    "counts_df[\"other_count\"] = 0\n",
    "\n",
    "print(f\"Total SNPs counted: {len(counts_df):,}\")\n",
    "print(f\"Unique peaks with SNPs: {counts_df['region_id'].nunique():,}\")\n",
    "print(\"\\nCount statistics:\")\n",
    "print(counts_df[[\"ref_count\", \"alt_count\"]].describe())\n",
    "counts_df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 Quality Control: Count Distribution\n",
    "\n",
    "ATAC-seq typically has **lower coverage per peak** than RNA-seq genes. Check the distribution to set appropriate filtering thresholds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate total counts per SNP\n",
    "counts_df[\"total\"] = counts_df[\"ref_count\"] + counts_df[\"alt_count\"]\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
    "\n",
    "# Total count distribution\n",
    "ax = axes[0]\n",
    "ax.hist(counts_df[\"total\"], bins=50, edgecolor=\"black\", alpha=0.7)\n",
    "ax.axvline(10, color=\"red\", linestyle=\"--\", label=\"min_count=10\")\n",
    "ax.axvline(5, color=\"orange\", linestyle=\"--\", label=\"min_count=5\")\n",
    "ax.set_xlabel(\"Total Read Count per SNP\")\n",
    "ax.set_ylabel(\"Number of SNPs\")\n",
    "ax.set_title(\"Read Depth Distribution\")\n",
    "ax.legend()\n",
    "ax.set_xlim(0, 100)\n",
    "\n",
    "# Allele ratio distribution\n",
    "ax = axes[1]\n",
    "ratio = counts_df[\"ref_count\"] / counts_df[\"total\"]\n",
    "ax.hist(ratio[counts_df[\"total\"] >= 10], bins=50, edgecolor=\"black\", alpha=0.7)\n",
    "ax.axvline(0.5, color=\"red\", linestyle=\"--\", label=\"Expected (0.5)\")\n",
    "ax.set_xlabel(\"Reference Allele Frequency\")\n",
    "ax.set_ylabel(\"Number of SNPs\")\n",
    "ax.set_title(\"Allele Ratio Distribution (depth ≥10)\")\n",
    "ax.legend()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig(RESULTS_DIR / \"count_qc.png\", dpi=150)\n",
    "plt.show()\n",
    "\n",
    "# Summary statistics\n",
    "print(\n",
    "    f\"\\nSNPs with depth ≥5: {(counts_df['total'] >= 5).sum():,} ({100 * (counts_df['total'] >= 5).mean():.1f}%)\"\n",
    ")\n",
    "print(\n",
    "    f\"SNPs with depth ≥10: {(counts_df['total'] >= 10).sum():,} ({100 * (counts_df['total'] >= 10).mean():.1f}%)\"\n",
    ")\n",
    "print(\n",
    "    f\"SNPs with depth ≥20: {(counts_df['total'] >= 20).sum():,} ({100 * (counts_df['total'] >= 20).mean():.1f}%)\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 3: Statistical Testing (Beta-Binomial)\n",
    "\n",
    "WASP2's `find-imbalance` command uses a **beta-binomial model** to test for allelic imbalance:\n",
    "\n",
    "- **Null hypothesis (H₀):** Reference allele frequency = 0.5 (balanced)\n",
    "- **Alternative (H₁):** Reference allele frequency ≠ 0.5 (imbalanced)\n",
    "\n",
    "The beta-binomial distribution accounts for **overdispersion** - the extra variability beyond binomial sampling that's common in sequencing data.\n",
    "\n",
    "### 3.1 Run Imbalance Detection"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define output file\n",
    "IMBALANCE_FILE = RESULTS_DIR / \"atac_imbalance_results.tsv\"\n",
    "\n",
    "# Build the command\n",
    "# Note: Using --min 5 for ATAC-seq (lower coverage than RNA-seq)\n",
    "# The --model single option uses a single dispersion parameter for all regions\n",
    "analysis_cmd = f\"\"\"\n",
    "wasp2-analyze find-imbalance \\\\\n",
    "    {COUNTS_FILE} \\\\\n",
    "    --min 5 \\\\\n",
    "    --model single \\\\\n",
    "    --output {IMBALANCE_FILE}\n",
    "\"\"\"\n",
    "\n",
    "print(\"Command to run:\")\n",
    "print(analysis_cmd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": "%%bash -s \"$COUNTS_FILE\" \"$IMBALANCE_FILE\"\n# Uncomment to run (requires actual count file)\n# wasp2-analyze find-imbalance \\\n#     \"$1\" \\\n#     --min 5 \\\n#     --model single \\\n#     --output \"$2\"\n\necho \"Note: Uncomment the command above to run with your data\""
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Simulate Results for Demonstration\n",
    "\n",
    "For this tutorial, we simulate realistic analysis results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Aggregate counts by peak (region)\n",
    "peak_counts = (\n",
    "    counts_df.groupby(\"region_id\")\n",
    "    .agg(\n",
    "        {\n",
    "            \"chr\": \"first\",\n",
    "            \"pos\": [\"min\", \"max\"],\n",
    "            \"ref_count\": \"sum\",\n",
    "            \"alt_count\": \"sum\",\n",
    "        }\n",
    "    )\n",
    "    .reset_index()\n",
    ")\n",
    "peak_counts.columns = [\"region\", \"chr\", \"start\", \"end\", \"ref_count\", \"alt_count\"]\n",
    "\n",
    "# Calculate statistics\n",
    "peak_counts[\"total\"] = peak_counts[\"ref_count\"] + peak_counts[\"alt_count\"]\n",
    "peak_counts[\"mu\"] = peak_counts[\"ref_count\"] / peak_counts[\"total\"]\n",
    "# Add pseudocount (+1) to avoid log(0) and stabilize ratios for low-count regions\n",
    "peak_counts[\"effect_size\"] = np.log2(\n",
    "    (peak_counts[\"ref_count\"] + 1) / (peak_counts[\"alt_count\"] + 1)\n",
    ")\n",
    "\n",
    "# Simulate p-values (truly imbalanced peaks get low p-values)\n",
    "# Note: This simulation uses binomial for simplicity. Real ATAC-seq data exhibits\n",
    "# overdispersion, which is why WASP2 uses the beta-binomial model.\n",
    "np.random.seed(42)\n",
    "is_imbalanced = np.abs(peak_counts[\"mu\"] - 0.5) > 0.15\n",
    "peak_counts[\"p_value\"] = np.where(\n",
    "    is_imbalanced,\n",
    "    10 ** (-np.random.uniform(2, 10, len(peak_counts))),  # Significant\n",
    "    np.random.uniform(0.05, 1, len(peak_counts)),  # Not significant\n",
    ")\n",
    "\n",
    "# FDR correction (Benjamini-Hochberg)\n",
    "# Note: This manual BH implementation is for demonstration.\n",
    "# WASP2 internally uses scipy.stats.false_discovery_control()\n",
    "peak_counts = peak_counts.sort_values(\"p_value\")\n",
    "n_tests = len(peak_counts)\n",
    "peak_counts[\"rank\"] = range(1, n_tests + 1)\n",
    "peak_counts[\"fdr_pval\"] = np.minimum(peak_counts[\"p_value\"] * n_tests / peak_counts[\"rank\"], 1.0)\n",
    "peak_counts[\"fdr_pval\"] = peak_counts[\"fdr_pval\"][::-1].cummin()[::-1]\n",
    "\n",
    "# Filter to testable peaks\n",
    "results_df = peak_counts[peak_counts[\"total\"] >= 5].copy()\n",
    "results_df = results_df.drop(\"rank\", axis=1)\n",
    "\n",
    "print(f\"Peaks tested: {len(results_df):,}\")\n",
    "print(f\"Significant (FDR < 0.05): {(results_df['fdr_pval'] < 0.05).sum():,}\")\n",
    "print(f\"Significant (FDR < 0.01): {(results_df['fdr_pval'] < 0.01).sum():,}\")\n",
    "results_df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 4: Result Interpretation and Visualization\n",
    "\n",
    "### 4.1 Volcano Plot\n",
    "\n",
    "The volcano plot shows effect size (x-axis) vs. statistical significance (y-axis), helping identify peaks with both strong and significant allelic imbalance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(figsize=(10, 8))\n",
    "\n",
    "# Calculate -log10(p-value) for plotting\n",
    "results_df[\"neg_log10_pval\"] = -np.log10(results_df[\"p_value\"].clip(lower=1e-50))\n",
    "\n",
    "# Define significance thresholds\n",
    "sig_mask = results_df[\"fdr_pval\"] < 0.05\n",
    "effect_mask = np.abs(results_df[\"effect_size\"]) > 0.5  # |log2FC| > 0.5\n",
    "\n",
    "# Plot non-significant points\n",
    "ns = ~sig_mask\n",
    "ax.scatter(\n",
    "    results_df.loc[ns, \"effect_size\"],\n",
    "    results_df.loc[ns, \"neg_log10_pval\"],\n",
    "    c=\"lightgray\",\n",
    "    s=15,\n",
    "    alpha=0.6,\n",
    "    label=f\"Not significant (n={ns.sum():,})\",\n",
    ")\n",
    "\n",
    "# Plot significant but small effect\n",
    "sig_small = sig_mask & ~effect_mask\n",
    "ax.scatter(\n",
    "    results_df.loc[sig_small, \"effect_size\"],\n",
    "    results_df.loc[sig_small, \"neg_log10_pval\"],\n",
    "    c=\"steelblue\",\n",
    "    s=25,\n",
    "    alpha=0.7,\n",
    "    label=f\"FDR<0.05, small effect (n={sig_small.sum():,})\",\n",
    ")\n",
    "\n",
    "# Plot significant and large effect\n",
    "sig_large = sig_mask & effect_mask\n",
    "ax.scatter(\n",
    "    results_df.loc[sig_large, \"effect_size\"],\n",
    "    results_df.loc[sig_large, \"neg_log10_pval\"],\n",
    "    c=\"firebrick\",\n",
    "    s=40,\n",
    "    alpha=0.8,\n",
    "    label=f\"FDR<0.05, |log2FC|>0.5 (n={sig_large.sum():,})\",\n",
    ")\n",
    "\n",
    "# Add threshold lines\n",
    "ax.axhline(-np.log10(0.05), color=\"black\", linestyle=\"--\", alpha=0.3, linewidth=1)\n",
    "ax.axvline(0.5, color=\"gray\", linestyle=\":\", alpha=0.5)\n",
    "ax.axvline(-0.5, color=\"gray\", linestyle=\":\", alpha=0.5)\n",
    "ax.axvline(0, color=\"black\", linestyle=\"-\", alpha=0.2)\n",
    "\n",
    "ax.set_xlabel(\"Effect Size (log₂ Ref/Alt)\", fontsize=12)\n",
    "ax.set_ylabel(\"-log₁₀(p-value)\", fontsize=12)\n",
    "ax.set_title(\"ATAC-seq Allelic Imbalance\\nVolcano Plot\", fontsize=14)\n",
    "ax.legend(loc=\"upper right\", fontsize=9)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig(RESULTS_DIR / \"volcano_plot.png\", dpi=150)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 Effect Size Distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
    "\n",
    "# All peaks\n",
    "ax = axes[0]\n",
    "ax.hist(results_df[\"effect_size\"], bins=50, edgecolor=\"black\", alpha=0.7, color=\"steelblue\")\n",
    "ax.axvline(0, color=\"red\", linestyle=\"--\", linewidth=2)\n",
    "ax.set_xlabel(\"Effect Size (log₂ Ref/Alt)\")\n",
    "ax.set_ylabel(\"Number of Peaks\")\n",
    "ax.set_title(\"All Tested Peaks\")\n",
    "\n",
    "# Significant peaks only\n",
    "ax = axes[1]\n",
    "sig_effects = results_df.loc[sig_mask, \"effect_size\"]\n",
    "ax.hist(sig_effects, bins=30, edgecolor=\"black\", alpha=0.7, color=\"firebrick\")\n",
    "ax.axvline(0, color=\"black\", linestyle=\"--\", linewidth=2)\n",
    "ax.set_xlabel(\"Effect Size (log₂ Ref/Alt)\")\n",
    "ax.set_ylabel(\"Number of Peaks\")\n",
    "ax.set_title(f\"Significant Peaks (FDR < 0.05, n={sig_mask.sum():,})\")\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig(RESULTS_DIR / \"effect_size_distribution.png\", dpi=150)\n",
    "plt.show()\n",
    "\n",
    "# Summary statistics\n",
    "print(\"Effect size statistics (significant peaks):\")\n",
    "print(f\"  Mean: {sig_effects.mean():.3f}\")\n",
    "print(f\"  Median: {sig_effects.median():.3f}\")\n",
    "print(f\"  Std: {sig_effects.std():.3f}\")\n",
    "print(f\"  Ref-biased (log2FC > 0): {(sig_effects > 0).sum()}\")\n",
    "print(f\"  Alt-biased (log2FC < 0): {(sig_effects < 0).sum()}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.3 Top Imbalanced Peaks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get top hits by significance\n",
    "top_hits = results_df[results_df[\"fdr_pval\"] < 0.05].nsmallest(20, \"fdr_pval\")\n",
    "\n",
    "print(\"Top 20 Peaks with Allelic Imbalance\")\n",
    "print(\"=\" * 80)\n",
    "display_cols = [\n",
    "    \"region\",\n",
    "    \"chr\",\n",
    "    \"ref_count\",\n",
    "    \"alt_count\",\n",
    "    \"mu\",\n",
    "    \"effect_size\",\n",
    "    \"p_value\",\n",
    "    \"fdr_pval\",\n",
    "]\n",
    "top_hits[display_cols].round(4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 5: QTL Overlap Analysis\n",
    "\n",
    "Peaks with allelic imbalance often harbor **chromatin accessibility QTLs (caQTLs)** or overlap with **expression QTLs (eQTLs)**. Integrating your results with published QTL databases helps validate findings and identify regulatory mechanisms.\n",
    "\n",
    "### 5.1 Prepare BED File for Overlap Analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Export significant peaks as BED for overlap analysis\n",
    "sig_peaks = results_df[results_df[\"fdr_pval\"] < 0.05][\n",
    "    [\"chr\", \"start\", \"end\", \"region\", \"effect_size\", \"fdr_pval\"]\n",
    "].copy()\n",
    "sig_peaks.to_csv(RESULTS_DIR / \"significant_peaks.bed\", sep=\"\\t\", header=False, index=False)\n",
    "\n",
    "print(f\"Exported {len(sig_peaks)} significant peaks to: {RESULTS_DIR / 'significant_peaks.bed'}\")\n",
    "print(\"\\nUse this file for overlap analysis with:\")\n",
    "print(\"  - GTEx eQTLs (https://gtexportal.org)\")\n",
    "print(\"  - ENCODE cCREs (https://www.encodeproject.org)\")\n",
    "print(\"  - Published caQTL datasets\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 Example: GTEx eQTL Overlap\n",
    "\n",
    "This example shows how to intersect your imbalanced peaks with GTEx eQTL SNPs to identify potential regulatory relationships."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Example overlap analysis with bedtools (requires eQTL BED file)\n",
    "overlap_cmd = \"\"\"\n",
    "# Download GTEx eQTLs for your tissue of interest\n",
    "# Example: Brain cortex significant eQTLs\n",
    "\n",
    "# Intersect imbalanced peaks with eQTL positions\n",
    "bedtools intersect \\\\\n",
    "    -a results/significant_peaks.bed \\\\\n",
    "    -b gtex_brain_cortex_eqtls.bed \\\\\n",
    "    -wa -wb \\\\\n",
    "    > results/peak_eqtl_overlap.bed\n",
    "\n",
    "# Count overlaps\n",
    "wc -l results/peak_eqtl_overlap.bed\n",
    "\"\"\"\n",
    "\n",
    "print(\"Example bedtools command for eQTL overlap:\")\n",
    "print(overlap_cmd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Simulate overlap results for demonstration\n",
    "np.random.seed(123)\n",
    "\n",
    "# Create simulated eQTL overlap data\n",
    "n_overlap = 150\n",
    "overlap_df = pd.DataFrame(\n",
    "    {\n",
    "        \"peak\": [f\"peak_{i}\" for i in np.random.choice(range(1500), n_overlap, replace=False)],\n",
    "        \"eqtl_gene\": [f\"GENE{i}\" for i in np.random.randint(1, 500, n_overlap)],\n",
    "        \"eqtl_pval\": 10 ** (-np.random.uniform(3, 15, n_overlap)),\n",
    "        \"tissue\": np.random.choice(\n",
    "            [\"Brain_Cortex\", \"Brain_Hippocampus\", \"Liver\", \"Heart\"], n_overlap\n",
    "        ),\n",
    "    }\n",
    ")\n",
    "\n",
    "print(f\"Peaks overlapping eQTLs: {len(overlap_df)}\")\n",
    "print(\"\\nOverlap by tissue:\")\n",
    "print(overlap_df[\"tissue\"].value_counts())\n",
    "\n",
    "# Enrichment analysis\n",
    "n_sig_peaks = sig_mask.sum()\n",
    "n_total_peaks = len(results_df)\n",
    "n_eqtl_overlap = len(overlap_df)\n",
    "\n",
    "# Fisher's exact test for enrichment\n",
    "from scipy.stats import fisher_exact\n",
    "\n",
    "# Assume 10% of all peaks overlap eQTLs by chance\n",
    "expected_overlap = int(n_sig_peaks * 0.10)\n",
    "contingency = [\n",
    "    [n_eqtl_overlap, n_sig_peaks - n_eqtl_overlap],\n",
    "    [expected_overlap, n_sig_peaks - expected_overlap],\n",
    "]\n",
    "odds_ratio, p_value = fisher_exact(contingency)\n",
    "\n",
    "print(\"\\nEnrichment Analysis:\")\n",
    "print(f\"  Imbalanced peaks: {n_sig_peaks}\")\n",
    "print(f\"  Overlapping eQTLs: {n_eqtl_overlap}\")\n",
    "print(f\"  Expected by chance: ~{expected_overlap}\")\n",
    "print(f\"  Fold enrichment: {n_eqtl_overlap / max(expected_overlap, 1):.2f}x\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.3 Visualization: eQTL Overlap"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, axes = plt.subplots(1, 2, figsize=(12, 4))\n",
    "\n",
    "# Pie chart: Overlap proportion\n",
    "ax = axes[0]\n",
    "overlap_counts = [n_eqtl_overlap, n_sig_peaks - n_eqtl_overlap]\n",
    "labels = [\"Overlap with eQTL\", \"No eQTL overlap\"]\n",
    "colors = [\"#e74c3c\", \"#95a5a6\"]\n",
    "ax.pie(overlap_counts, labels=labels, colors=colors, autopct=\"%1.1f%%\", startangle=90)\n",
    "ax.set_title(\"Imbalanced Peaks Overlapping eQTLs\")\n",
    "\n",
    "# Bar chart: Overlap by tissue\n",
    "ax = axes[1]\n",
    "tissue_counts = overlap_df[\"tissue\"].value_counts()\n",
    "colors = plt.cm.Set2(range(len(tissue_counts)))\n",
    "bars = ax.bar(tissue_counts.index, tissue_counts.values, color=colors, edgecolor=\"black\")\n",
    "ax.set_xlabel(\"Tissue\")\n",
    "ax.set_ylabel(\"Number of Overlapping eQTLs\")\n",
    "ax.set_title(\"eQTL Overlap by Tissue\")\n",
    "ax.tick_params(axis=\"x\", rotation=45)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig(RESULTS_DIR / \"eqtl_overlap.png\", dpi=150)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Section 6: Downstream Analysis Hints\n",
    "\n",
    "### 6.1 Motif Enrichment Analysis\n",
    "\n",
    "Imbalanced peaks may disrupt transcription factor binding sites. Use tools like HOMER or MEME-ChIP for motif analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "motif_cmd = \"\"\"\n",
    "# Extract sequences around imbalanced SNPs\n",
    "bedtools slop -i significant_peaks.bed -g genome.chrom.sizes -b 50 | \\\\\n",
    "bedtools getfasta -fi genome.fa -bed - -fo imbalanced_seqs.fa\n",
    "\n",
    "# Run HOMER motif analysis\n",
    "findMotifsGenome.pl significant_peaks.bed hg38 motif_results/ -size 200\n",
    "\n",
    "# Alternative: MEME-ChIP\n",
    "meme-chip -oc meme_results imbalanced_seqs.fa\n",
    "\"\"\"\n",
    "\n",
    "print(\"Example commands for motif enrichment analysis:\")\n",
    "print(motif_cmd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2 Gene Ontology Enrichment\n",
    "\n",
    "Identify biological processes associated with genes near imbalanced peaks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "go_cmd = \"\"\"\n",
    "# Annotate peaks with nearest genes using GREAT or bedtools\n",
    "bedtools closest -a significant_peaks.bed -b genes.bed -d > peak_gene_assignments.bed\n",
    "\n",
    "# Extract gene list\n",
    "cut -f8 peak_gene_assignments.bed | sort -u > imbalanced_genes.txt\n",
    "\n",
    "# Use DAVID, Enrichr, or clusterProfiler for GO enrichment\n",
    "# Web interface: https://david.ncifcrf.gov/\n",
    "# Web interface: https://maayanlab.cloud/Enrichr/\n",
    "\"\"\"\n",
    "\n",
    "print(\"Example commands for GO enrichment analysis:\")\n",
    "print(go_cmd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.3 Single-Cell ATAC-seq Extension\n",
    "\n",
    "For single-cell ATAC-seq (scATAC-seq), use WASP2's single-cell workflow to detect cell-type-specific allelic imbalance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc_cmd = \"\"\"\n",
    "# Count alleles in single-cell ATAC-seq\n",
    "wasp2-count count-variants-sc \\\\\n",
    "    scatac_possorted.bam \\\\\n",
    "    variants.vcf.gz \\\\\n",
    "    barcodes.tsv \\\\\n",
    "    --samples SAMPLE_ID \\\\\n",
    "    --feature peaks.bed \\\\\n",
    "    --out_file scatac_counts.h5ad\n",
    "\n",
    "# Detect imbalance per cell type\n",
    "wasp2-analyze find-imbalance-sc \\\\\n",
    "    scatac_counts.h5ad \\\\\n",
    "    barcode_celltype_map.tsv \\\\\n",
    "    --sample SAMPLE_ID \\\\\n",
    "    --min 5 \\\\\n",
    "    --phased\n",
    "\n",
    "# Compare imbalance between cell types\n",
    "wasp2-analyze compare-imbalance \\\\\n",
    "    scatac_counts.h5ad \\\\\n",
    "    barcode_celltype_map.tsv \\\\\n",
    "    --groups \"excitatory,inhibitory\" \\\\\n",
    "    --sample SAMPLE_ID\n",
    "\"\"\"\n",
    "\n",
    "print(\"Commands for single-cell ATAC-seq analysis:\")\n",
    "print(sc_cmd)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this tutorial, you learned to:\n",
    "\n",
    "1. **Prepare ATAC-seq data** - Load peaks and verify input file formats\n",
    "2. **Count alleles** - Use `wasp2-count count-variants` with peak regions\n",
    "3. **Detect imbalance** - Apply beta-binomial testing with `wasp2-analyze find-imbalance`\n",
    "4. **Visualize results** - Create volcano plots and effect size distributions\n",
    "5. **Integrate with QTLs** - Overlap with eQTL databases for biological validation\n",
    "\n",
    "### Key Takeaways\n",
    "\n",
    "- ATAC-seq has **lower coverage per peak** than RNA-seq; use `--min-count 5` instead of 10\n",
    "- **FDR correction** is essential for multiple testing across thousands of peaks\n",
    "- Consider **effect size** alongside significance for biological relevance\n",
    "- **QTL overlap** helps validate findings and identify causal variants\n",
    "\n",
    "### Next Steps\n",
    "\n",
    "- [Comparative Imbalance Tutorial](./comparative_imbalance.rst) - Compare imbalance between conditions\n",
    "- [Single-Cell Tutorial](./scrna_seq.rst) - Cell-type-specific analysis\n",
    "- [Statistical Methods](../methods/statistical_models.rst) - Deep dive into the beta-binomial model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Troubleshooting\n",
    "\n",
    "### Common Issues\n",
    "\n",
    "**Low SNP counts in peaks:**\n",
    "- Ensure VCF contains heterozygous variants for your sample\n",
    "- Check that peak coordinates use the same reference genome as VCF\n",
    "- Verify `--samples` matches the sample name in VCF header\n",
    "\n",
    "**Memory errors with large datasets:**\n",
    "- Process chromosomes separately with `--region chr1_peaks.bed`, etc.\n",
    "- Use `WASP2_RUST_THREADS=4` to limit parallel processing\n",
    "\n",
    "**No significant results:**\n",
    "- Check read depth (may need deeper sequencing)\n",
    "- Verify WASP filtering was applied to remove mapping bias\n",
    "- Consider lowering `--min-count` threshold (with caution)\n",
    "\n",
    "### Diagnostic Commands\n",
    "\n",
    "```bash\n",
    "# Check VCF sample names\n",
    "bcftools query -l variants.vcf.gz\n",
    "\n",
    "# Count heterozygous SNPs in your sample\n",
    "bcftools view -s SAMPLE_ID variants.vcf.gz | bcftools view -g het | wc -l\n",
    "\n",
    "# Check BAM read depth at a peak\n",
    "samtools depth -r chr1:1000000-1001000 sample.bam | awk '{sum+=$3} END {print sum/NR}'\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save final results\n",
    "results_df.to_csv(RESULTS_DIR / \"final_imbalance_results.tsv\", sep=\"\\t\", index=False)\n",
    "print(f\"Results saved to: {RESULTS_DIR / 'final_imbalance_results.tsv'}\")\n",
    "print(\"\\nAnalysis complete!\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}