
Approximate Nearest Neighbor

Approximate Nearest Neighbor (ANN) algorithms are used to search large collections of vectors efficiently. Instead of finding the exact closest match, they quickly identify items that are close enough in meaning for practical use. By trading a small amount of precision for much faster search, ANN techniques make large-scale vector search fast and scalable.
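
As a point of reference, the sketch below performs the exact brute-force search that ANN methods approximate; libraries such as FAISS or Annoy reach similar results without scanning every stored vector.

```python
# Exact (brute-force) nearest-neighbor search with NumPy. ANN indexes such as
# HNSW or IVF approximate this result without comparing the query to every
# stored vector, which is what makes search over millions of vectors fast.
import numpy as np

def exact_nearest(query: np.ndarray, vectors: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k stored vectors closest to the query (cosine)."""
    # Normalize rows so that a dot product equals cosine similarity.
    vectors_n = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = vectors_n @ query_n
    return np.argsort(scores)[::-1][:k]

# Toy data: 1,000 random 128-dimensional "embeddings".
rng = np.random.default_rng(0)
stored = rng.normal(size=(1000, 128))
print(exact_nearest(rng.normal(size=128), stored, k=3))
```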

Chunks

Chunks are small, meaningful segments of text derived from larger documents. Each chunk represents a coherent piece of information suitable for analysis by the AI system. Chunks are converted into embeddings, allowing semantic comparison between pieces of text. Chunk size affects how well the system captures meaning. Chunking ensures that even large documents can be represented and retrieved efficiently.

Chunking

Chunking is the process of dividing large documents into smaller, coherent segments called chunks. Each chunk preserves enough context to be meaningful on its own but remains short enough for processing by an LLM. Chunking helps the system manage long documents efficiently and improves the precision of retrieval.
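
A minimal sketch of chunking, using fixed word windows with overlap (real systems typically count tokens and respect sentence or heading boundaries):

```python
# A simplified word-window chunker with overlap. The overlap gives neighboring
# chunks some shared context so meaning isn't cut off at a boundary.
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start:start + chunk_size])
        if piece:
            chunks.append(piece)
    return chunks

print(len(chunk_text("word " * 1000)))  # 7 chunks for a 1,000-word document
```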

Context Window

The context window is the limit on how much text an LLM can read at one time. The limit is measured in tokens, so every word or symbol the model processes counts against it. When the retrieved material exceeds this limit, the system must choose which chunks to include in the prompt. This choice affects output accuracy. Understanding context windows ensures that relevant material is prioritized for generation.
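
A rough sketch of fitting ranked chunks into a token budget; the 0.75 words-per-token figure is only a common rule of thumb for English text:

```python
# Keep the highest-ranked chunks until the token budget is spent. The budget is
# typically the context window minus room reserved for the prompt and answer.
def fit_to_context(ranked_chunks: list[str], budget_tokens: int = 3000) -> list[str]:
    selected, used = [], 0
    for chunk in ranked_chunks:
        # Rough estimate: about 0.75 words per token, so words / 0.75 ~ tokens.
        cost = int(len(chunk.split()) / 0.75)
        if used + cost > budget_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected
```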

Cosine Similarity

Cosine similarity is a metric used to measure how similar two vectors are by calculating the cosine of the angle between them in a multi-dimensional space. It ranges from -1 to 1, where 1 indicates the vectors point in exactly the same direction (most similar), 0 indicates they are orthogonal (unrelated), and -1 indicates they point in opposite directions (most dissimilar). In practice, particularly for text embeddings and natural language processing, cosine similarity is widely used because it focuses on the orientation of vectors rather than their magnitude, making it effective for comparing documents or word embeddings regardless of their length. This property makes it especially valuable in information retrieval systems and recommendation engines, where the goal is to find content that is semantically similar to a query or reference item.
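
As a minimal illustration, cosine similarity can be computed with a few lines of NumPy (the vectors below are made up):

```python
# Cosine similarity between two embedding vectors.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.2, 0.7, 0.1])
doc_vec = np.array([0.25, 0.6, 0.05])
print(cosine_similarity(query_vec, doc_vec))  # about 0.99: very similar direction
```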

Embedding Model

An embedding model is a neural network, typically smaller than an LLM, trained to convert text into numerical embeddings. It captures relationships between words and concepts in a way that reflects meaning. These models specialize in representing relationships between texts rather than generating new text as an LLM does. Embedding models are used to measure semantic similarity between text passages and are often separate from the larger LLM used for generation.
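
A short sketch of producing embeddings, assuming the open-source sentence-transformers package and its all-MiniLM-L6-v2 model; any embedding model with a similar encode-style API works the same way:

```python
# Encode two sentences and compare their embeddings. The model name and library
# are assumptions for this example, not requirements of the technique.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(["How do I reset my password?",
                        "Steps for recovering account credentials"])
a, b = vectors
# The two vectors point in a similar direction even though the wording differs.
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```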

Embeddings

Embeddings are numerical representations of text. Each chunk becomes an embedding that captures its semantic meaning as a vector. By comparing embeddings, the system can determine semantic similarity between different pieces of text. This enables retrieval of conceptually related information even if exact wording differs. Embeddings are central to connecting text meaning with mathematical structure.

Fine-Tuning

Fine-tuning is the process of further training a pretrained LLM on domain-specific examples to improve its performance for a particular task. Unlike retrieval-augmented generation (RAG), fine-tuning modifies the model’s parameters. It requires curated datasets and additional training runs, which is slower and more costly than updating a RAG system’s source content. RAG avoids retraining by retrieving relevant data dynamically, making updates faster and easier to maintain.

Generator

The generator is the stage where the LLM produces new text from the retrieved context. It synthesizes content from the prompt and organizes it into a coherent response. The generator applies its learned patterns of language to express retrieved facts accurately and clearly. This process completes the transformation of static data into new information products.

Grounding

Grounding means anchoring an AI model’s responses in real, verifiable data. In RAG systems, grounding occurs when the LLM uses retrieved chunks from trusted sources as the factual basis for its output. Grounded responses are less likely to contain hallucinations. Grounding improves trust in AI systems by ensuring outputs can be traced back to real sources.

Hallucination

Hallucination occurs when an LLM generates information not supported by retrieved context. It may result from irrelevant or incomplete data retrieval. While retrieval-based systems reduce this risk, human review remains essential for quality assurance. Hallucination highlights the importance of grounding generative output in verified data.

While each chunk includes provenance data, the LLM may sometimes blend retrieved facts with its learned linguistic patterns, producing statements that seem factual but lack an exact source match. These mismatches occur because the model generates plausible language rather than verifying data against the index. Continuous validation, feedback tuning, and human review are essential to reduce, but not eliminate, this risk of hallucinations.

Index

An index is the data structure built by the vector database that allows quick similarity searches among stored vectors. It organizes vectors in a way that enables efficient lookup using mathematical proximity rather than text matching. This structure powers the retrieval stage of a RAG system. Indexes are typically updated automatically when new content is added or existing files are modified.

Indexing

Indexing is the process of scanning files in the knowledge base and preparing them for processing. Each document is analyzed for text content, structure, and context. The result of indexing is a set of chunks that are later converted into embeddings. Metadata such as document name or modification date is also stored for future filtering. Indexing enables the AI system to organize data for efficient retrieval and reuse. The indexing process is often automated to ensure new or updated content is always reflected in the database.

Inference

Inference is the process of producing output from a trained LLM based on a prompt. It represents the execution phase where the model interprets embeddings and context to form a final text output. Inference turns the retrieved knowledge into language that users can read and use.

Knowledge Base

The knowledge base is the collection of documents available for AI processing. It can include structured files such as spreadsheets or unstructured files such as text documents or PDFs. During indexing, these files are scanned and analyzed so that their content can be represented as chunks. Each chunk carries contextual information through metadata, which helps track its source and meaning. The knowledge base provides the foundation for document retrieval and generation within a RAG system.

Large Language Model

A Large Language Model (LLM) is an AI system trained on vast amounts of text from the internet and other sources. It learns patterns in language that allow it to generate human-like text, answer questions, and assist with various writing tasks. The "large" refers to both the enormous dataset it learns from and the billions of parameters (adjustable settings) that make up its internal structure. These models predict what words should come next in a sequence, which enables them to have conversations and complete complex language tasks.

It's crucial to understand what the LLM does and doesn't do. The LLM is not a database. It doesn't store your company's documentation or "memorize" facts. Instead, it's like a highly skilled editor who can read source material you provide and rewrite it into different formats while maintaining accuracy.

When the LLM generates documentation, it's reading the chunks you retrieved and transforming them—similar to how you might read several related sections and synthesize them into a single document. The LLM has learned patterns of language from its training, so it knows how to structure sentences and organize sections. But every fact in its output should come from the context you provided.

This is why retrieval is so important. If the system retrieves the wrong chunks, the LLM will generate based on incorrect data. The LLM can also experience hallucination, where it confidently states information not present in the provided context.

Metadata Filtering

Metadata filtering uses attributes such as author, date, or document type to narrow retrieval results in a vector database. This ensures that only relevant or current information is used during generation, improving accuracy and control. For example, metadata filters can limit retrievals to documents authored in the last six months.
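
A toy illustration of the idea, with an invented record structure; real vector databases accept a filter expression alongside the similarity query:

```python
# Filter candidate records by a metadata field (modification date), then rank
# the survivors by their similarity score. The records here are made up.
from datetime import date, timedelta

records = [
    {"id": "a", "score": 0.91, "metadata": {"modified": date(2024, 1, 10), "type": "guide"}},
    {"id": "b", "score": 0.95, "metadata": {"modified": date(2022, 5, 3), "type": "guide"}},
]

cutoff = date.today() - timedelta(days=180)          # last six months only
recent = [r for r in records if r["metadata"]["modified"] >= cutoff]
recent.sort(key=lambda r: r["score"], reverse=True)  # best match first
```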

Natural Language Processing

Natural Language Processing (NLP) is a way for computers to work with human language. It helps machines read, understand, and respond to words the way people do. NLP looks at how we speak and write so computers can make sense of it. People use NLP when they write prompts for large language models, because the model interprets the wording, structure, and intent of the prompt using NLP techniques. The model also uses NLP to generate clear, relevant responses in natural-sounding language. NLP powers things like chatbots, translation tools, and apps that sort or summarize text. In short, it helps computers better understand people and communicate in a way that feels more natural.

Normalization

Normalization standardizes text before it is analyzed. It removes extra spaces, converts special characters, and ensures consistent encoding, such as UTF-8. This cleaning step makes data uniform across different file types and systems, improving downstream processes such as embedding and indexing.
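
A minimal normalization pass might look like the following sketch; production pipelines usually add steps such as stripping markup or repairing encoding errors:

```python
# Unicode normalization, quote cleanup, and whitespace collapsing.
import re
import unicodedata

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)                  # unify equivalent characters
    text = text.replace("\u201c", '"').replace("\u201d", '"')   # curly to straight quotes
    text = re.sub(r"\s+", " ", text)                            # collapse runs of whitespace
    return text.strip()

print(normalize("  \u201cSmart quotes\u201d   and \t odd   spacing "))
```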

Prompt

A prompt is the input given to the LLM. It includes the user's request and the most relevant chunks retrieved from the vector database. The quality of the prompt determines the accuracy and clarity of the generated result. A well-constructed prompt keeps context within the context window. Prompt construction is key to aligning retrieval output with generative performance.

Prompt Template

A prompt template is a predefined text pattern that arranges the user’s request and retrieved chunks into a complete prompt. Templates help maintain consistency in how information is presented to the LLM, improving both reliability and style of generated outputs.
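
A simple sketch of a prompt template; the wording of the template itself is illustrative only:

```python
# Slot retrieved chunks and the user's request into a fixed structure.
PROMPT_TEMPLATE = """You are a technical writer. Answer using only the context below.

Context:
{context}

Question:
{question}
"""

def build_prompt(chunks: list[str], question: str) -> str:
    context = "\n\n---\n\n".join(chunks)   # separate chunks so sources stay distinct
    return PROMPT_TEMPLATE.format(context=context, question=question)
```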

Query Embedding

A query embedding is the vector form of a user's request. It is created in the same way as document embeddings, allowing direct comparison in the vector database. The retriever uses this embedding to find document chunks most similar in meaning to the query. This mechanism ensures that the LLM receives contextually relevant information for generation.

Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) combines retrieval of stored knowledge with generative modeling. It ensures that the LLM bases its responses on relevant, factual data. The retriever finds matching chunks, and the generator creates new text using that context. This integration reduces hallucination and improves factual reliability. RAG connects storage, retrieval, and generation into a unified process.
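
In outline, the RAG loop can be sketched as below; embed, vector_db, and llm are placeholders for whatever embedding model, vector database, and LLM a given system uses:

```python
# The RAG control flow: embed the query, retrieve, assemble a prompt, generate.
def answer(question: str, vector_db, llm, embed, k: int = 5) -> str:
    query_vec = embed(question)                    # 1. query embedding
    chunks = vector_db.search(query_vec, top_k=k)  # 2. retrieve the top-k chunks
    prompt = ("Answer from this context:\n\n"
              + "\n\n".join(chunks)
              + f"\n\nQuestion: {question}")       # 3. assemble the prompt
    return llm.generate(prompt)                    # 4. generation, grounded in the chunks
```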

Retriever

The retriever searches the vector database for the document chunks most semantically similar to the query embedding. It selects the top-k chunks that best match the user’s request and passes them on for generation. The retriever forms the bridge between stored knowledge and generative processing.
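
A toy retriever might look like this sketch, which stores pre-computed chunk embeddings in a NumPy matrix and ranks them by cosine similarity:

```python
import numpy as np

class ToyRetriever:
    def __init__(self, chunks: list[str], embeddings: np.ndarray):
        self.chunks = chunks
        # Pre-normalize rows so dot products are cosine similarities.
        self.matrix = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    def retrieve(self, query_embedding: np.ndarray, k: int = 5) -> list[str]:
        q = query_embedding / np.linalg.norm(query_embedding)
        scores = self.matrix @ q
        best = np.argsort(scores)[::-1][:k]   # indices of the top-k chunks
        return [self.chunks[i] for i in best]
```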

Segmentation

Segmentation is the step that divides documents into logical sections before chunking. It identifies structural boundaries such as headings, lists, or paragraphs to keep related content together. Segmentation often uses document structure (like Markdown headings or HTML tags) to determine section boundaries. While segmentation organizes a document for readability, chunking later divides those sections into smaller units suitable for embedding.
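
A small sketch of segmentation for Markdown input, splitting at heading boundaries:

```python
# Split a Markdown document into sections (heading plus body). Each section can
# then be handed to the chunker.
import re

def segment_markdown(text: str) -> list[str]:
    # Split immediately before any line that starts with one to six '#' characters.
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]

doc = "# Install\nRun the installer.\n\n## Requirements\nPython 3.10+.\n"
print(segment_markdown(doc))
# ['# Install\nRun the installer.', '## Requirements\nPython 3.10+.']
```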

Semantic Similarity

Semantic similarity measures how closely two embeddings represent the same meaning. It is often computed using cosine similarity. By comparing embeddings numerically, the system can retrieve information related in meaning rather than identical in wording. Semantic similarity enables the retrieval step in the RAG process.

Temperature

Temperature is a model parameter that controls randomness in text generation. Lower temperatures (for example, 0.2) make responses more predictable and factual, while higher temperatures (for example, 0.8) encourage creativity and variation. Adjusting temperature helps balance precision and expressiveness in the generated output.
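
The sketch below applies temperature to a set of toy next-token logits with a standard softmax, showing how lower values sharpen the distribution:

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    scaled = logits / temperature
    exps = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exps / exps.sum()

logits = np.array([2.0, 1.0, 0.5, 0.1])
print(softmax_with_temperature(logits, 0.2))  # heavily favors the first token
print(softmax_with_temperature(logits, 0.8))  # probabilities spread out more
```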

Tokens

Tokens are the smallest text units an LLM processes, roughly corresponding to words or word fragments. The number of tokens determines how much content fits within the context window. Managing tokens helps control both response length and computational cost. Tokens are the measurement unit for how the model reads and generates text.
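
A quick way to see token counts, assuming the tiktoken library is available (other tokenizers produce different counts for the same text):

```python
# Count tokens for a short string using an OpenAI-style byte-pair encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Chunking splits large documents into smaller segments."
tokens = enc.encode(text)
print(len(tokens), tokens[:5])
```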

Top-k

Top-k refers to a sampling parameter used during text generation that limits the model’s next-word choices to the k most probable tokens. The model then selects one token from this reduced set according to their relative probabilities. A smaller k value (for example, 20) makes outputs more focused and predictable, while a larger k value allows greater diversity and creativity. In retrieval, top-k also describes the number of best-matching chunks the retriever returns.
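
A toy illustration of top-k sampling over a made-up probability distribution:

```python
import numpy as np

def sample_top_k(probs: np.ndarray, k: int, rng=np.random.default_rng()) -> int:
    top = np.argsort(probs)[::-1][:k]     # indices of the k most probable tokens
    kept = probs[top] / probs[top].sum()  # renormalize over the kept tokens
    return int(rng.choice(top, p=kept))

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(sample_top_k(probs, k=2))  # always returns index 0 or 1
```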

Top-p

Top-p, also known as nucleus sampling, is a probabilistic text generation method that selects from the smallest possible set of tokens whose cumulative probability reaches a threshold p (for example, 0.9). The model samples only from this dynamic subset, allowing a balance between predictability and variation. Lower p values produce more precise responses, while higher p values yield more creative or varied text.
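
And a matching toy illustration of nucleus sampling over the same kind of distribution:

```python
import numpy as np

def sample_top_p(probs: np.ndarray, p: float, rng=np.random.default_rng()) -> int:
    order = np.argsort(probs)[::-1]                   # tokens from most to least probable
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # first point where cumsum >= p
    nucleus = order[:cutoff]
    kept = probs[nucleus] / probs[nucleus].sum()      # renormalize over the nucleus
    return int(rng.choice(nucleus, p=kept))

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
print(sample_top_p(probs, p=0.9))  # samples from indices 0-3 (cumulative 0.95 >= 0.9)
```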

Vector

A vector is a list of numbers representing the meaning of text in mathematical form. Each embedding is stored as a vector in a high-dimensional space. The distance between two vectors indicates how similar their meanings are, often measured by cosine similarity. Vectors make it possible for computers to perform semantic comparisons across large document sets.

Vector Database

A vector database stores vectors and their associated metadata. It enables rapid search based on semantic similarity rather than keyword matching. When a user query is received, the database is searched for vectors most similar to the query embedding. The retrieved chunks are then passed to the LLM for generation. Vector databases are optimized for large-scale similarity searches.

Vector Metadata

Vector metadata refers to information stored alongside each vector, such as the original file name, section title, or timestamp. Metadata allows filtering or ranking retrieved results to ensure the system selects the most relevant and current content. This layer maintains traceability from generated text back to its original source.

Vector Search

Vector search is a method of finding related content by comparing vectors rather than text keywords. It identifies documents or chunks with similar meanings, even when the wording differs. Vector search compares numerical distances between vectors using measures such as cosine similarity. This process is central to retrieval-augmented generation, allowing the system to locate contextually relevant information efficiently.
