This project is a prototype of a local Retrieval-Augmented Generation (RAG) system, built to explore how RAG pipelines work. It is a useful step before scaling or adapting the approach to production environments where proprietary information must stay behind a firewall. Everything runs entirely on a local PC and requires no cloud services or external data transfers. I used only open-source tools (Chroma, Sentence Transformers, and Ollama) with Python scripting.

You'll find the package on GitHub.

Prerequisites

RAG Pipeline Components

Privacy & Security

This prototype provides basic safeguards but is not production ready.

Summary of Python Scripts

1. config.py - Configuration Management

Central configuration file that defines all system settings:

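A minimal sketch of what this file might contain. Only the chunking and retrieval values (repeated in the Configuration section below), the gemma:2b model, and the data/ paths come from this post; the variable names and the embedding model are illustrative assumptions:

# config.py - illustrative sketch; names and defaults are assumptions,
# the chunking/retrieval values match the Configuration section below
from pathlib import Path

# Paths (based on the project layout shown in the setup steps)
DATA_DIR = Path("data")
INPUT_DIR = DATA_DIR / "input"        # documents to ingest
CHROMA_DIR = DATA_DIR / "chroma_db"   # persistent vector store
MODELS_DIR = DATA_DIR / "models"      # cached embedding models

# Models: Sentence Transformers for embeddings, Ollama's gemma:2b for generation
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # assumption: a common lightweight default
LLM_MODEL = "gemma:2b"

# Chunking and retrieval settings
CHUNK_SIZE = 512
CHUNK_OVERLAP = 50
TOP_K = 5
SIMILARITY_THRESHOLD = 0.7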

2. ingest.py - Document Processing & Indexing

Handles the ingestion pipeline that prepares documents for retrieval:

DocumentProcessor class:

VectorStore class:

Main workflow:

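As a rough sketch of how such an ingestion pipeline can be wired together with Sentence Transformers and Chroma (the function, chunking logic, and collection name here are illustrative assumptions, not the project's actual code):

# ingest.py - illustrative sketch of the ingestion flow (not the actual code)
import chromadb
from sentence_transformers import SentenceTransformer
from config import INPUT_DIR, CHROMA_DIR, EMBEDDING_MODEL, CHUNK_SIZE, CHUNK_OVERLAP

def chunk_text(text: str) -> list[str]:
    """Split text into overlapping chunks of roughly CHUNK_SIZE characters."""
    step = CHUNK_SIZE - CHUNK_OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), step)]

def ingest():
    model = SentenceTransformer(EMBEDDING_MODEL)
    client = chromadb.PersistentClient(path=str(CHROMA_DIR))
    collection = client.get_or_create_collection("documents")

    for path in INPUT_DIR.glob("*"):
        if not path.is_file():
            continue
        chunks = chunk_text(path.read_text(encoding="utf-8"))
        embeddings = model.encode(chunks).tolist()
        collection.add(
            ids=[f"{path.name}-{i}" for i in range(len(chunks))],
            documents=chunks,
            embeddings=embeddings,
            metadatas=[{"source": path.name}] * len(chunks),
        )

if __name__ == "__main__":
    ingest()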

3. query.py - Query Processing Engine

The core RAG logic that answers user questions:

RAGEngine class:

Query flow:

  1. Embeds the user's question.
  2. Finds top-K most similar document chunks.
  3. Builds context from retrieved chunks.
  4. Sends context + question to LLM.
  5. Returns answer with source citations.
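
A condensed sketch of that flow, reusing the assumed names from the config and ingestion sketches above and the Ollama Python client; the project's actual implementation may differ:

# query.py - illustrative sketch of the retrieve-then-generate flow
import chromadb
import ollama
from sentence_transformers import SentenceTransformer
from config import CHROMA_DIR, EMBEDDING_MODEL, LLM_MODEL, TOP_K

def answer(question: str) -> dict:
    # 1. Embed the user's question
    embedder = SentenceTransformer(EMBEDDING_MODEL)
    query_embedding = embedder.encode([question]).tolist()

    # 2. Find the top-K most similar document chunks
    collection = chromadb.PersistentClient(path=str(CHROMA_DIR)).get_collection("documents")
    results = collection.query(query_embeddings=query_embedding, n_results=TOP_K)
    chunks = results["documents"][0]
    sources = [m["source"] for m in results["metadatas"][0]]

    # 3. Build context from the retrieved chunks
    context = "\n\n".join(chunks)

    # 4. Send context + question to the LLM
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    response = ollama.chat(model=LLM_MODEL, messages=[{"role": "user", "content": prompt}])

    # 5. Return the answer with source citations
    return {"answer": response["message"]["content"], "sources": sorted(set(sources))}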

4. server.py - Web API Server

Flask-based web server that exposes the RAG system via HTTP:

Features:

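A minimal sketch of such a server. The /api/query route and X-API-Key header are assumptions; only the RAG_API_KEY environment variable and port 5000 come from this post:

# server.py - illustrative sketch (route and header names are assumptions)
import os
from flask import Flask, abort, jsonify, render_template, request
from query import answer  # the RAG engine sketched above

app = Flask(__name__)
API_KEY = os.environ.get("RAG_API_KEY", "")

@app.route("/")
def index():
    return render_template("index.html")

@app.route("/api/query", methods=["POST"])
def query_endpoint():
    # Minimal security layer: reject requests that lack the shared API key
    if request.headers.get("X-API-Key") != API_KEY:
        abort(401)
    question = request.get_json(force=True).get("question", "")
    return jsonify(answer(question))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)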

System Architecture

User Question → Flask Server → RAG Engine → ChromaDB (Vector Search)
                                    ↓
                            Retrieved Chunks
                                    ↓
                            Ollama LLM (gemma:2b)
                                    ↓
                            Answer + Sources → User

Installation & Setup

  1. Start Docker Desktop in Windows.

  2. Start WSL. A Bash terminal opens in your home directory.

  3. Copy the rag-local directory to your home directory.

    Project Structure:

    rag-local/
    ├── Dockerfile              # Container definition
    ├── docker-compose.yml      # Docker Compose configuration
    ├── requirements.txt        # Python dependencies
    ├── README.md              # This file
    ├── app/
    │   ├── config.py          # Configuration settings
    │   ├── ingest.py          # Document ingestion
    │   ├── query.py           # RAG query engine
    │   ├── server.py          # Flask web server
    │   ├── templates/
    │   │   └── index.html     # Web interface
    │   └── static/
    │       └── style.css      # CSS styling
    └── data/
        ├── input/             # Your documents (add files here)
        ├── chroma_db/         # Vector database (auto-created)
        └── models/            # Cached ML models (auto-created)
    

    The project contains a sample .md and .html file in the data/input directory.

  4. Generate an API Key.

openssl rand -base64 32

This key will add a minimal security layer.

  5. Copy the API key. You will need it in the steps that follow, and again when you log in to the web interface.

  6. Open docker-compose.yml and find the following line: RAG_API_KEY=change-this-to-a-strong-random-key

  7. Replace "change-this-to-a-strong-random-key" with your API key.

  8. Navigate to the project directory.

cd ~/rag-local

  9. Build the Docker image.

docker-compose build

    This may take 10-20 minutes as it downloads all required components.

  10. Start the container.

docker-compose up -d

    The first startup after a build may take 5-10 minutes as it downloads the Gemma 2B model (~1.7GB).

  11. Open a browser and enter:

http://localhost:5000

  12. At the prompt, enter the API key you copied earlier, then click OK.

    The RAG interface (Flask + HTML) opens, where you can run queries against the sample Markdown and HTML input files.

    The interface also provides commands to add and update input files.
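
If you would rather script queries than click through the browser, something along these lines should work, reusing the route and header names assumed in the server.py sketch above:

# ask.py - query the local RAG server from a script (endpoint and header are assumptions)
import requests

API_KEY = "paste-your-generated-key-here"
resp = requests.post(
    "http://localhost:5000/api/query",
    headers={"X-API-Key": API_KEY},
    json={"question": "What does the sample Markdown file describe?"},
    timeout=120,  # local LLM responses from a 2B model can take a while
)
resp.raise_for_status()
print(resp.json())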

Useful Docker Commands

# Start the container
docker-compose up

# Start in background (-d = detached mode)
docker-compose up -d

# Stop the container
docker-compose down

# View logs (-f = follow)
docker-compose logs -f

# Rebuild after code changes
docker-compose build --no-cache

Troubleshooting

Container won't start

Ollama errors

Out of memory

Slow responses

Configuration

Edit app/config.py to customize:

CHUNK_SIZE = 512          # Size of text chunks
CHUNK_OVERLAP = 50        # Overlap between chunks
TOP_K = 5                 # Number of chunks to retrieve
SIMILARITY_THRESHOLD = 0.7  # Minimum similarity score
