This repository contains a simple, five-cell Google Colab notebook demonstrating a Retrieval-Augmented Generation (RAG) pipeline built from scratch, without frameworks like LangChain. The prototype uses free (no subscription), open-source tools and Python libraries (faiss, sentence-transformers, google-genai).

The pipeline processes documents stored in Google Drive and uses the Gemini 2.5 Flash model to generate responses to queries.

For your own use, you'll find my Colab notebook cells (written in Python) on GitHub, as well as test source documents.

RAG Pipeline Components

  • google-genai (SDK): Google SDK used to call the Gemini 2.5 Flash model to generate a final answer based on retrieved context.
  • gemini-2.5-flash (Model): Large Language Model (LLM) from Google that receives the user query and retrieved context to generate the final answer.
  • faiss-cpu (FAISS): Vector database library used to store the vector embeddings and perform similarity search to retrieve relevant text chunks.
  • sentence-transformers: Python framework that loads the embedding model and converts text chunks and queries into vector embeddings.
  • all-MiniLM-L6-v2 (Model): Pre-trained Sentence Transformer model chosen to generate the 384-dimensional vector embeddings.
  • numpy: Core library for numerical operations on the vector embeddings during index creation, storage, and retrieval.
  • pickle: Python's built-in module that serializes text chunks and metadata so they can be saved to Google Drive and reloaded later.
  • os: Python standard library module for environment tasks such as setting the API key, defining file paths, and reading documents.
  • re: Python standard library module providing regular expressions used to clean source text during chunking.
  • google.colab (Utilities): Colab-specific utilities that handle mounting Google Drive to access source documents and retrieving the GEMINI_API_KEY from the Secrets manager.
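
The sketch below shows how two of these components fit together outside the notebook: it embeds a couple of sentences with all-MiniLM-L6-v2, confirms the 384-dimensional output, and runs a FAISS similarity search. The sentences and variable names are illustrative only and are not part of the notebook cells.

# Minimal sketch: sentence-transformers + FAISS similarity search (illustrative only)
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "FAISS performs fast similarity search over vector embeddings.",
    "Gemini generates the final answer from the retrieved context.",
]
# Each sentence becomes a 384-dimensional float32 vector
vectors = model.encode(sentences, convert_to_numpy=True).astype("float32")
print(vectors.shape)  # (2, 384)

# Store the vectors in an exact L2 index, then search with an embedded query
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)
query_vec = model.encode(["How is the final answer produced?"], convert_to_numpy=True).astype("float32")
distances, indices = index.search(query_vec, 1)
print(sentences[indices[0][0]])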

Getting Started

Follow these steps to set up and run your notebook.

Update: As of November 13, 2025, Google provides a VS Code extension that allows you to run Colab from within a local instance of VS Code. As VS Code is my IDE of choice, I took the extension for a test spin with high hopes. I found it to be buggy, so I'll still use the browser-based Colab platform for now, but I expect I'll be running my Colab notebook pipelines from VS Code when the extension matures.

1. Create Your Google Colab Notebook

Go to https://colab.research.google.com/, then create a notebook via the File menu.

2. Set Up API Key

The code requires a Gemini API Key to communicate with the model. Store this key securely in Colab's Secrets tool.

  1. Get your key from Google AI Studio: https://ai.google.dev/gemini-api/docs/api-key
  2. In the Colab Notebook, look for the Secrets tab in the left sidebar.
  3. Click the + icon to add a new secret.
  4. Set the Name exactly to: GEMINI_API_KEY
  5. Set the Value to the key you copied in Step 1.
  6. Ensure the "Notebook access" toggle is ON for this secret.

3. Prepare Source Folders and Documents

  1. Cell 2 configures the pipeline to read from and write to a specific location in your Google Drive. Create the following folder structure in your Google Drive:

    • My Drive/rag_docs_structured

      This is where you will upload DITA, Markdown, and HTML files.

    • My Drive/rag_index_gemini_faiss

      This is where the FAISS vector index will be saved.

  2. Upload source documents (.dita, .md, .html) to My Drive/rag_docs_structured.
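
If you want to confirm the pipeline will see your files, an optional check like the one below (run after Cell 2 has mounted Drive) lists the documents the chunker will pick up. It assumes the folder name above; adjust the path if you used a different one.

import os

DOCS_DIR = '/content/drive/MyDrive/rag_docs_structured'
# List the top-level files of the types the chunker processes
for name in sorted(os.listdir(DOCS_DIR)):
    if name.endswith(('.md', '.dita', '.html')):
        print(name)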

4. Run Cells

Run each cell in your notebook sequentially. Alternatively, concatenate the Python code into a single notebook cell and run it. (I prefer to run each functional block of code separately for troubleshooting purposes.)

Important: Be sure to modify the query in Cell 5 so it runs against the content of your uploaded source documents. Searching for "# NEW QUERY" will take you there.
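
For example, if your uploaded documents described an installation procedure, the line under the # NEW QUERY marker in Cell 5 might be changed to something like the following (the question itself is only a placeholder):

# NEW QUERY
query = "How do I install the product on Linux?"  # replace with a question your documents can answer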

Python Cells for Notebook

  1. Setup: Installs dependencies and loads the Gemini API Key.
  2. Environment: Mounts Google Drive and defines necessary file paths.
  3. Chunking: Manually loads and splits text documents (.dita, .md, .html) from Google Drive into small chunks.
  4. Indexing: Generates vector embeddings for each chunk and builds a persistent FAISS index in Drive.
  5. Query: Loads the FAISS index, retrieves the top context chunks for a given query, and uses the Gemini API to generate an answer.

Cell 1

Cell 1 installs the core dependencies and loads the Gemini API key from Colab's Secrets manager so the google-genai SDK can authenticate with the Gemini API.

# Cell 1: Setup and Dependencies

# Install core foundational libraries for RAG
!pip install -q faiss-cpu sentence-transformers google-genai numpy

import os
from google.colab import drive, userdata

# --- API Key Retrieval from Colab Secrets ---
# This ensures the official 'google-genai' SDK is authenticated
api_key = userdata.get('GEMINI_API_KEY')
if not api_key:
    raise ValueError("GEMINI_API_KEY not found in Colab Secrets.")
os.environ["GEMINI_API_KEY"] = api_key
print("Gemini API Key successfully loaded and dependencies installed.")

Cell 2

Cell 2 mounts Google Drive and defines the paths the pipeline uses: the folder that holds the source documents and the folder where the FAISS index will be persisted.

# Cell 2: Google Drive and Environment Setup

import os
from google.colab import drive

# Mount Google Drive to access documents and save the index
drive.mount('/content/drive')

# --- Configuration ---
# Define the path where your source documents are located on Google Drive
DOCS_DIR = '/content/drive/MyDrive/rag_docs_structured' 
# Define the path where the FAISS index will be saved. We'll add a specific file name later.
FAISS_INDEX_PATH = '/content/drive/MyDrive/rag_index_gemini_faiss'
FAISS_INDEX_FILE = 'my_faiss_index.bin' # Filename for the raw FAISS binary index

# Check/Create the document directory
if not os.path.exists(DOCS_DIR):
    os.makedirs(DOCS_DIR)
    print(f"Created document directory: {DOCS_DIR}")
else:
    print(f"Document directory confirmed: {DOCS_DIR}")


# Create the index parent directory if it doesn't exist 
if not os.path.exists(FAISS_INDEX_PATH):
    os.makedirs(FAISS_INDEX_PATH)
    print(f"Created directory for index: {FAISS_INDEX_PATH}")
else:
    print(f"Index directory confirmed: {FAISS_INDEX_PATH}")
    
print("\nGoogle Drive mounted and environment paths set.")

Cell 3

Cell 3 manually loads the structured documents (.md, .dita, .html) from Google Drive and splits their text into small, context-preserving chunks, attaching the source filename to each chunk as metadata.

# Cell 3: Manual Document Loading and Chunking

import os
import re # For simple text cleaning and paragraph splitting
from typing import List, Dict

# DOCS_DIR is defined in Cell 2
chunks: List[Dict] = [] # Stores the final, processed chunks with metadata

print("--- 1. Preparing Source Documents (Manual Loading & Splitting) ---")

def load_all_text_files(doc_dir: str) -> List[str]:
    """Manually reads text content from specified file types in the directory."""
    all_text = []
    # Only process files with these extensions to skip images/binaries
    target_extensions = ['.md', '.dita', '.html'] 
    
    for root, _, files in os.walk(doc_dir):
        for file in files:
            if any(file.endswith(ext) for ext in target_extensions):
                file_path = os.path.join(root, file)
                try:
                    with open(file_path, 'r', encoding='utf-8') as f:
                        # Store a tuple of (filename, content) for simplicity
                        all_text.append((file, f.read()))
                except Exception as e:
                    print(f"Warning: Could not read file {file}. Error: {e}")
    return all_text

def chunk_text(filename: str, text: str, max_chars: int = 512, overlap: int = 50) -> List[Dict]:
    """Splits text content into chunks based on a simple paragraph/sentence separator."""
    
    # 1. Strip simple XML/HTML tags
    # This is a minimalist approach; for production, a dedicated XML/HTML parser is better.
    text = re.sub(r'</?(html|body|h1|p|topic|title|filepath)>', '', text)
    text = re.sub(r'<\?xml[^>]*\?>', '', text).strip()
    
    # 2. Split on blank lines (paragraph breaks)
    elements = [t.strip() for t in re.split(r'\n\n', text) if t.strip()]
    
    final_chunks = []
    current_chunk = ""

    for element in elements:
        if len(current_chunk) + len(element) + 1 <= max_chars:
            # If element fits, append it
            current_chunk += (" " + element if current_chunk else element)
        else:
            # If element doesn't fit, finalize the current chunk
            if current_chunk:
                final_chunks.append({
                    "text": current_chunk,
                    "metadata": {"filename": filename}
                })
            
            # Start a new chunk, ensuring overlap if possible
            if len(current_chunk) >= overlap:
                current_chunk = current_chunk[-overlap:] + " " + element
            else:
                current_chunk = element
    
    # Add the last chunk
    if current_chunk:
        final_chunks.append({
            "text": current_chunk,
            "metadata": {"filename": filename}
        })
        
    return final_chunks


# --- Execution ---
raw_documents = load_all_text_files(DOCS_DIR)
print(f"Loaded {len(raw_documents)} source files.")

for filename, content in raw_documents:
    new_chunks = chunk_text(filename, content)
    chunks.extend(new_chunks)

print(f"\nSuccessfully split content into {len(chunks)} final chunks.")

# Example inspection
if chunks:
    print("\nExample Chunk 1:")
    print(f"  Content: '{chunks[0]['text'][:100]}...'")
    print(f"  Metadata: {chunks[0]['metadata']}")
else:
    print("Warning: No chunks were created.")

Cell 4

Cell 4 converts each chunk into a 384-dimensional vector embedding with the all-MiniLM-L6-v2 Sentence Transformer, stores the vectors in a FAISS index, and persists both the index and the chunk text and metadata to Google Drive.

# Cell 4: Embeddings and FAISS Indexing

import os
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
import pickle # Used to save complex Python objects like the list of chunk dictionaries

# Paths and variables defined in Cell 2 and 3
# chunks: list of dicts with 'text' and 'metadata'
# FAISS_INDEX_PATH and FAISS_INDEX_FILE are defined in Cell 2

print("--- 2. Creating Embeddings and Vectors ---")

# 1. Load the Sentence Transformer Model
model_name = "all-MiniLM-L6-v2"
model = SentenceTransformer(model_name)
print(f"Embedding model loaded: {model_name}")

# 2. Prepare Chunk Text for Embedding
chunk_texts = [chunk['text'] for chunk in chunks]

if not chunk_texts:
    raise ValueError("No text chunks found. Please check Cell 3 output and document loading.")

# 3. Generate Embeddings
print(f"Generating embeddings for {len(chunk_texts)} chunks...")
embeddings = model.encode(chunk_texts, convert_to_numpy=True).astype('float32')

# 4. Build FAISS Index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)
index.add(embeddings)
print(f"FAISS index built successfully with dimension: {dimension}")

# --- 3. Indexing and Metadata (Persistence) ---
print("\n--- 3. Indexing and Metadata (Persistence) ---")

# Save the raw FAISS index
full_index_path = os.path.join(FAISS_INDEX_PATH, FAISS_INDEX_FILE)
faiss.write_index(index, full_index_path)
print(f"FAISS index (Vectors) saved to: {full_index_path}")

# Save the entire original 'chunks' list using pickle.
# This preserves the text content and the metadata dictionary together.
chunk_data_file = os.path.join(FAISS_INDEX_PATH, 'chunk_data.pkl')
with open(chunk_data_file, 'wb') as f:
    pickle.dump(chunks, f)
print(f"Full Chunk Data (Text and Metadata) saved to: {chunk_data_file}")

print("Indexing and Persistence steps completed.")

Cell 5

Finally, Cell 5 embeds the user query, searches the FAISS index for the top n most relevant chunks, and combines that retrieved context with the question into a prompt for the Gemini 2.5 Flash model, which generates an answer based only on the provided source content, not on external knowledge.

# Cell 5: Retrieval and Response Generation (NEW QUERY)

import os
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from google import genai
import pickle 

# --- Configuration and Initialization ---
# Paths and variables defined in Cell 2
FAISS_INDEX_PATH = '/content/drive/MyDrive/rag_index_gemini_faiss'
FAISS_INDEX_FILE = 'my_faiss_index.bin'
full_index_path = os.path.join(FAISS_INDEX_PATH, FAISS_INDEX_FILE)
chunk_data_file = os.path.join(FAISS_INDEX_PATH, 'chunk_data.pkl') 
model_name = "all-MiniLM-L6-v2"

# 1. Load Components
print("--- 4. User Queries & Retrieval Setup (Loading Components) ---")

# Load the FAISS Index
try:
    index = faiss.read_index(full_index_path)
    print(f"FAISS index loaded successfully from {full_index_path}")
except Exception as e:
    raise FileNotFoundError(f"Could not load FAISS index: {e}. Ensure Cell 4 ran correctly.")

# Load the full chunk data (text and metadata) using pickle
try:
    with open(chunk_data_file, 'rb') as f:
        full_chunk_data = pickle.load(f)
    print(f"Full chunk data loaded for {len(full_chunk_data)} chunks.")
except Exception as e:
    raise FileNotFoundError(f"Could not load chunk data: {e}. Ensure Cell 4 ran correctly.")


# Load the Sentence Transformer Model (must be the same one used for embedding)
model = SentenceTransformer(model_name)

# Initialize the Gemini Client (API key is pulled from environment variables)
client = genai.Client()
GEMINI_MODEL = "gemini-2.5-flash"


# --- 2. Retrieval Function ---
def retrieve_context(query: str, k: int = 3) -> tuple[str, list]:
    """Embeds the query and searches the FAISS index for the top-k chunks."""
    
    # 1. Embed the query
    query_vector = model.encode(query).astype('float32').reshape(1, -1)
    
    # 2. Search the FAISS index
    D, I = index.search(query_vector, k)
    
    # 3. Extract the original chunk data using the indices
    retrieved_chunks = []
    
    for idx in I[0]:
        original_chunk = full_chunk_data[idx]
        
        # Access 'text' and 'filename' from the loaded dictionary
        context_text = f"Source File: {original_chunk['metadata'].get('filename', 'N/A')}\nContent: {original_chunk['text']}\n---\n"
        
        retrieved_chunks.append({
            "text": context_text,
            "metadata": original_chunk['metadata']
        })
        
    # Concatenate the text into a single context string for the LLM
    full_context = "".join([c["text"] for c in retrieved_chunks])
    
    return full_context, retrieved_chunks


# --- 3. Execute the Query and Generate Response ---
# NEW QUERY
query = "What are there similarities between a horse and a car?"
print(f"\nUser Query: {query}")

# Perform retrieval
context, source_docs = retrieve_context(query, k=3)

# 4. Construct the Final Prompt for grounding
system_prompt = f"""Use ONLY the following pieces of context to answer the user's question. 
Your answer MUST be based ONLY on the provided context. Do not use external knowledge or make up facts.
For every piece of information used, cite the source filename (from the 'Source File:' line in the context).

Context:
{context}

Question: {query}
Helpful Answer:"""

# 5. Call the Gemini API
print("Sending prompt to Gemini...")
response = client.models.generate_content(
    model=GEMINI_MODEL,
    contents=system_prompt
)

# --- Print Results ---
print("\n==================================")
print("✅ Final Gemini RAG Response:")
print(response.text.strip())

print("\n--- Provenance (Source Documents Used) ---")
for i, doc in enumerate(source_docs):
    print(f"* Chunk {i+1} Source File: **{doc['metadata'].get('filename', 'N/A')}**")
    # Display the first 70 characters of the text that was sent to the model
    print(f"  Snippet: {doc['text'][doc['text'].find('Content:') + len('Content:'):].strip()[:70]}...")
print("==================================")
