Medical RAG Assistant — Leonard Wagner

cat README.md

# Medical RAG Assistant

A privacy-first clinical decision support system that grounds an LLM in a curated medical knowledge base, with **all computation running locally**: no data leaves the machine.

## Why

Clinical AI is only useful if it's trustworthy and confidential. Off-the-shelf chat assistants hallucinate, can't cite sources, and ship patient data to third parties. This system answers clinical questions from authoritative literature with inline citations, while keeping every byte (records, audio, embeddings) on-device.

## How

- **Consultation pipeline**: upload audio or paste text → Whisper transcription → structured SOAP extraction → RAG-grounded recommendations, streamed via SSE
- **Hybrid retrieval**: dense (MedEmbed) + BM25 sparse, fused with Reciprocal Rank Fusion, then cross-encoder reranked to top-8
- **Curated knowledge base**: ingested from authoritative free APIs (FDA drug labels, PubMed abstracts/guidelines/systematic reviews, MedlinePlus), filtered by a 104-cluster / 462-keyword GP topic taxonomy spanning 14 specialties
- **Cited Q&A chat**: free-text clinical questions answered token-by-token with inline source citations
- **Rigorous evaluation**: ablation across embed × chat × reranker × retrieval-mode on 4 medical QA benchmarks (PubMedQA, MedQA-USMLE, MMLU, MedMCQA), plus end-to-end WER/CER, ROUGE-L, BERTScore, and RAGAs faithfulness/precision on PriMock57 and ACI-Bench

## Stack

- **Backend**: FastAPI (Python 3.12), PostgreSQL 16, SQLAlchemy + Pydantic
- **LLM & RAG**: Ollama (`qwen3:4b`), Qdrant hybrid vector store, MedEmbed-large embeddings, gte-reranker-modernbert cross-encoder
- **Transcription**: faster-whisper `large-v3-turbo`, in-process CPU/GPU
- **Frontend**: React 18 + Vite + Ant Design 5, SSE streaming
- **Infra**: Docker Compose (PostgreSQL + Qdrant), uv, Makefile task runner
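
The hybrid retrieval step fuses the dense and BM25 result lists with Reciprocal Rank Fusion before reranking. The snippet below is a minimal sketch of that fusion in plain Python, not the project's actual code: the function name `reciprocal_rank_fusion`, the document IDs, the smoothing constant `k = 60` (a commonly used default), and the `limit` of 20 candidates are all assumptions for illustration. The cross-encoder reranking that follows in the pipeline is left out.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 60,      # assumed smoothing constant; the project's value is not stated
    limit: int = 20,  # assumed number of fused candidates passed on to the reranker
) -> List[str]:
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank is its 1-based position in that list.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:limit]]


# Hypothetical result lists from the dense and sparse retrievers.
dense_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

Because the score depends only on rank positions, the dense similarity scores and BM25 scores never need to be normalized against each other, which is the usual reason for choosing this fusion method over a weighted score sum.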

ls -l

tech Python, FastAPI, React, PyTorch, Qdrant, Ollama, PostgreSQL, faster-whisper