This project is a prototype of a local Retrieval-Augmented Generation (RAG) system, built to explore how RAG pipelines work. It is a useful step before scaling or adapting the approach to production environments where proprietary information must stay behind a firewall. Everything runs entirely on a local PC and requires no cloud services or external data transfers. I used only open-source tools (Chroma, Sentence Transformers, and Ollama) with Python scripting.

You'll find the package on GitHub.

Prerequisites

RAG Pipeline Components

Privacy & Security

This prototype provides basic safeguards but is not production ready.

Summary of Python Scripts

1. config.py - Configuration Management

Central configuration file that defines all system settings:

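A minimal sketch of what this file might contain. Only the chunking and retrieval values (repeated in the Configuration section below), the gemma:2b model, and the data/ paths come from this post; the variable names and the embedding model are illustrative assumptions:

# config.py - illustrative sketch; names and defaults are assumptions,
# the chunking/retrieval values match the Configuration section below
from pathlib import Path

# Paths (based on the project layout shown in the setup steps)
DATA_DIR = Path("data")
INPUT_DIR = DATA_DIR / "input"        # documents to ingest
CHROMA_DIR = DATA_DIR / "chroma_db"   # persistent vector store
MODELS_DIR = DATA_DIR / "models"      # cached embedding models

# Models: Sentence Transformers for embeddings, Ollama's gemma:2b for generation
EMBEDDING_MODEL = "all-MiniLM-L6-v2"  # assumption: a common lightweight default
LLM_MODEL = "gemma:2b"

# Chunking and retrieval settings
CHUNK_SIZE = 512
CHUNK_OVERLAP = 50
TOP_K = 5
SIMILARITY_THRESHOLD = 0.7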

2. ingest.py - Document Processing & Indexing

Handles the ingestion pipeline that prepares documents for retrieval:

DocumentProcessor class:

VectorStore class:

Main workflow:

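As a rough sketch of how such an ingestion pipeline can be wired together with Sentence Transformers and Chroma (the function, chunking logic, and collection name here are illustrative assumptions, not the project's actual code):

# ingest.py - illustrative sketch of the ingestion flow (not the actual code)
import chromadb
from sentence_transformers import SentenceTransformer
from config import INPUT_DIR, CHROMA_DIR, EMBEDDING_MODEL, CHUNK_SIZE, CHUNK_OVERLAP

def chunk_text(text: str) -> list[str]:
    """Split text into overlapping chunks of roughly CHUNK_SIZE characters."""
    step = CHUNK_SIZE - CHUNK_OVERLAP
    return [text[i:i + CHUNK_SIZE] for i in range(0, len(text), step)]

def ingest():
    model = SentenceTransformer(EMBEDDING_MODEL)
    client = chromadb.PersistentClient(path=str(CHROMA_DIR))
    collection = client.get_or_create_collection("documents")

    for path in INPUT_DIR.glob("*"):
        if not path.is_file():
            continue
        chunks = chunk_text(path.read_text(encoding="utf-8"))
        embeddings = model.encode(chunks).tolist()
        collection.add(
            ids=[f"{path.name}-{i}" for i in range(len(chunks))],
            documents=chunks,
            embeddings=embeddings,
            metadatas=[{"source": path.name}] * len(chunks),
        )

if __name__ == "__main__":
    ingest()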

3. query.py - Query Processing Engine

The core RAG logic that answers user questions:

RAGEngine class:

Query flow:

  1. Embeds the user's question.
  2. Finds top-K most similar document chunks.
  3. Builds context from retrieved chunks.
  4. Sends context + question to LLM.
  5. Returns answer with source citations.
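
A condensed sketch of that flow, reusing the assumed names from the config and ingestion sketches above and the Ollama Python client; the project's actual implementation may differ:

# query.py - illustrative sketch of the retrieve-then-generate flow
import chromadb
import ollama
from sentence_transformers import SentenceTransformer
from config import CHROMA_DIR, EMBEDDING_MODEL, LLM_MODEL, TOP_K

def answer(question: str) -> dict:
    # 1. Embed the user's question
    embedder = SentenceTransformer(EMBEDDING_MODEL)
    query_embedding = embedder.encode([question]).tolist()

    # 2. Find the top-K most similar document chunks
    collection = chromadb.PersistentClient(path=str(CHROMA_DIR)).get_collection("documents")
    results = collection.query(query_embeddings=query_embedding, n_results=TOP_K)
    chunks = results["documents"][0]
    sources = [m["source"] for m in results["metadatas"][0]]

    # 3. Build context from the retrieved chunks
    context = "\n\n".join(chunks)

    # 4. Send context + question to the LLM
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    response = ollama.chat(model=LLM_MODEL, messages=[{"role": "user", "content": prompt}])

    # 5. Return the answer with source citations
    return {"answer": response["message"]["content"], "sources": sorted(set(sources))}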

4. server.py - Web API Server

Flask-based web server that exposes the RAG system via HTTP:

Features:

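A minimal sketch of such a server. The /api/query route and X-API-Key header are assumptions; only the RAG_API_KEY environment variable and port 5000 come from this post:

# server.py - illustrative sketch (route and header names are assumptions)
import os
from flask import Flask, abort, jsonify, render_template, request
from query import answer  # the RAG engine sketched above

app = Flask(__name__)
API_KEY = os.environ.get("RAG_API_KEY", "")

@app.route("/")
def index():
    return render_template("index.html")

@app.route("/api/query", methods=["POST"])
def query_endpoint():
    # Minimal security layer: reject requests that lack the shared API key
    if request.headers.get("X-API-Key") != API_KEY:
        abort(401)
    question = request.get_json(force=True).get("question", "")
    return jsonify(answer(question))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)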

System Architecture

User Question → Flask Server → RAG Engine → ChromaDB (Vector Search)
                                    ↓
                            Retrieved Chunks
                                    ↓
                            Ollama LLM (gemma:2b)
                                    ↓
                            Answer + Sources → User

Installation & Setup

  1. Start Docker Desktop in Windows.

  2. Start WSL. A Bash terminal opens in your home directory.

  3. Copy the rag-local directory to your home directory.

    Project Structure:

    rag-local/
    ├── Dockerfile              # Container definition
    ├── docker-compose.yml      # Docker Compose configuration
    ├── requirements.txt        # Python dependencies
    ├── README.md              # This file
    ├── app/
    │   ├── config.py          # Configuration settings
    │   ├── ingest.py          # Document ingestion
    │   ├── query.py           # RAG query engine
    │   ├── server.py          # Flask web server
    │   ├── templates/
    │   │   └── index.html     # Web interface
    │   └── static/
    │       └── style.css      # CSS styling
    └── data/
        ├── input/             # Your documents (add files here)
        ├── chroma_db/         # Vector database (auto-created)
        └── models/            # Cached ML models (auto-created)
    

    The project contains a sample .md and .html file in the data/input directory.

  4. Generate an API Key.

openssl rand -base64 32

This key will add a minimal security layer.

  5. Copy the API key. You will need it in the steps that follow, and again when you log in to the web interface.

  6. Open docker-compose.yml and find the following line: RAG_API_KEY=change-this-to-a-strong-random-key

  7. Replace "change-this-to-a-strong-random-key" with your API key.

  8. Navigate to the project directory.

cd ~/rag-local

  9. Build the Docker image.

docker-compose build

    This may take 10-20 minutes as it downloads all required components.

  10. Start the container.

docker-compose up -d

    The first startup after a build may take 5-10 minutes as it downloads the Gemma 2B model (~1.7GB).

  11. Open a browser and enter:

http://localhost:5000

  12. At the prompt, enter the API key you copied earlier, then click OK.

    The RAG interface (Flask + HTML) opens, where you can run queries against the sample Markdown and HTML input files.

    The interface also provides commands to add and update input files.
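
If you would rather script queries than click through the browser, something along these lines should work, reusing the route and header names assumed in the server.py sketch above:

# ask.py - query the local RAG server from a script (endpoint and header are assumptions)
import requests

API_KEY = "paste-your-generated-key-here"
resp = requests.post(
    "http://localhost:5000/api/query",
    headers={"X-API-Key": API_KEY},
    json={"question": "What does the sample Markdown file describe?"},
    timeout=120,  # local LLM responses from a 2B model can take a while
)
resp.raise_for_status()
print(resp.json())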

Useful Docker Commands

# Start the container
docker-compose up

# Start in background (-d = detached mode)
docker-compose up -d

# Stop the container
docker-compose down

# View logs (-f = follow)
docker-compose logs -f

# Rebuild after code changes
docker-compose build --no-cache

Troubleshooting

Container won't start

Ollama errors

Out of memory

Slow responses

Configuration

Edit app/config.py to customize:

CHUNK_SIZE = 512          # Size of text chunks
CHUNK_OVERLAP = 50        # Overlap between chunks
TOP_K = 5                 # Number of chunks to retrieve
SIMILARITY_THRESHOLD = 0.7  # Minimum similarity score
