# PoPE-ViT
A Vision Transformer pipeline for **brain-tumor MRI classification** (glioma,
meningioma, pituitary, no-tumor), built to compare **Polar Coordinate Positional
Embeddings (PoPE)** against the **Rotary baseline (RoPE)** — and against strong
pretrained and convolutional references.
## Why
Positional encoding decides how a transformer reasons about *where* a patch is.
PoPE explicitly decouples content (magnitude) from position (angle); RoPE
entangles them. We ask whether that clean separation actually helps a clinically
meaningful task — distinguishing four tumor classes from 2D MRI slices.
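
To make the decoupling claim concrete, here is a minimal single-frequency sketch, not the repository's attention code. It assumes the usual complex-number view of RoPE (each 2-D feature pair is rotated by position × θ) and a PoPE-style score in which softplus turns each feature into a magnitude while the angle depends only on position plus an offset. The function names, `theta`, and `delta` are illustrative assumptions.

```python
import cmath
import math


def softplus(x: float) -> float:
    """Smooth, strictly positive map; used here to turn a feature into a magnitude."""
    return math.log1p(math.exp(x))


def rope_score(q: complex, k: complex, t: int, s: int, theta: float) -> float:
    """RoPE-style score for one 2-D feature pair viewed as a complex number.

    The query (position t) and key (position s) are each rotated by
    position * theta, so the result is |q||k| cos(arg q - arg k + (t - s) * theta):
    the content phases arg q and arg k sit inside the same cosine as the offset.
    """
    q_rot = q * cmath.exp(1j * t * theta)
    k_rot = k * cmath.exp(1j * s * theta)
    return (q_rot * k_rot.conjugate()).real


def pope_score(q_feat: float, k_feat: float, t: int, s: int, theta: float,
               delta: float = 0.0) -> float:
    """PoPE-style score (assumed form): content sets the magnitudes only,
    and the angle depends on the relative position plus an offset delta."""
    mu_q, mu_k = softplus(q_feat), softplus(k_feat)
    return mu_q * mu_k * math.cos((s - t) * theta + delta)
```

With equal-magnitude queries `1+0j` and `1j`, `rope_score` gives different values at the same relative offset because the query's own phase shifts the cosine. `pope_score` divided by its zero-offset value equals `cos((s - t) * theta)` whatever the features are.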
## How
- **Patch embedding** — 224×224 scans split into 16×16-pixel patches, each flattened
and linearly projected (768 → 512), with a CLS token prepended
- **Positional encoding** — PoPE (softplus-zeroed initial angles) vs. RoPE
(position-scaled rotation), swappable within the same backbone
- **Transformer** — 6 blocks (LayerNorm → attention → MLP, residual), mean-pooled
to a 4-class head
- **Training** — class-weighted cross-entropy (handles the no-tumor minority),
AdamW, cosine schedule with linear warm-up, early stopping
- **Evaluation** — per-class one-vs-rest AUROC, accuracy, confusion matrices, plus
a hyperparameter grid over learning rate, dropout, and patch size
- **Baselines** — pretrained DeiT-Small (with and without PoPE) and a from-scratch ResNet18
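
The training bullet mentions a cosine schedule with linear warm-up. As a rough illustration of that shape (not the project's scheduler), the sketch below computes the learning rate for a given step. The function name, the 0-indexed step convention, and the `min_lr` floor are assumptions made for this example.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               base_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate for a 0-indexed optimizer step (hypothetical helper)."""
    if step < warmup_steps:
        # Linear warm-up: ramp from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps, from base_lr down to min_lr.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In PyTorch, a function like this divided by `base_lr` could be passed as the `lr_lambda` of `torch.optim.lr_scheduler.LambdaLR` on top of AdamW. Whether the project does it that way is not stated here.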
## Results
The pretrained and convolutional references dominate at this data scale: ResNet18
(0.999 AUROC) and pretrained DeiT-Small (0.986) lead, while PoPE-ViT and RoPE-ViT
come out nearly identical (0.9644 vs. 0.9645). The PoPE advantage doesn't surface
when training from scratch on ~3.3k images.
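
The evaluation uses per-class one-vs-rest AUROC. For readers unfamiliar with the metric, the sketch below shows one way to compute it from predicted class probabilities, using the pairwise-ranking definition. It is a dependency-free illustration with hypothetical helper names, not the project's evaluation code.

```python
from typing import Dict, List, Sequence


def binary_auroc(scores: Sequence[float], is_positive: Sequence[bool]) -> float:
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auroc(probs: List[Sequence[float]], labels: Sequence[int],
                      class_names: Sequence[str]) -> Dict[str, float]:
    """Score each class against all others, using that class's probability column."""
    result = {}
    for c, name in enumerate(class_names):
        column = [row[c] for row in probs]
        result[name] = binary_auroc(column, [y == c for y in labels])
    return result
```

The pairwise loop is quadratic and only meant to expose the definition. scikit-learn's `roc_auc_score`, applied to each class's binarized labels, should give the same per-class values far faster.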
## Stack
- **Models** — PyTorch, custom PoPE/RoPE attention, einops tensor ops
- **Backbones** — timm (DeiT-Small), torchvision (ResNet18, transforms)
- **Data** — Brain Tumor MRI (Kaggle), stratified 80/10/10 split, RandomAffine +
flip augmentation
- **Eval** — scikit-learn (AUROC, confusion matrices), matplotlib
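
A stratified 80/10/10 split keeps each class's share roughly equal across train, validation, and test. The sketch below illustrates the idea with the standard library only. The helper name, the seed, and the rounding rule are assumptions, and the project may well use a library splitter instead.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[str],
                     fractions: Tuple[float, float, float] = (0.8, 0.1, 0.1),
                     seed: int = 0) -> Dict[str, List[int]]:
    """Return sample indices for train/val/test, split separately within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits: Dict[str, List[int]] = {"train": [], "val": [], "test": []}
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        # Whatever remains after train and val goes to test.
        splits["test"] += indices[n_train + n_val:]
    return splits
```

Splitting inside each class, instead of over the pooled data, is what keeps a minority class such as no-tumor represented in all three subsets.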