AI Glossary 2026: 100+ Essential AI Terms Explained

Plain-English definitions for 100+ AI terms, including LLMs, RAG, embeddings, agents, prompting, fine-tuning, safety, evals, and LLMOps.

2026-04-27 · Mahdy Hasan · AI & ML

An AI glossary is a reference guide to the language used when designing, building, and operating artificial intelligence products. This guide explains more than 100 terms across large language models, embeddings, RAG, prompting, fine-tuning, agents, safety, multimodal AI, machine learning, and production operations.

AI conversations move fast. In one meeting you can hear RAG, LoRA, MCP, MoE, and CoT, with each person using the terms a little differently. This glossary gives builders, product managers, and founders a shared reference. Use the topic list to jump to the section you need, or search the page for a specific term.

Each definition is written for people building products. The goal is enough precision to support a real decision, without turning a quick lookup into a research paper.

What Are the Core Building Blocks of a Large Language Model?

These are the terms that define what an LLM is and how it works at the most fundamental level. Understanding them gives you the conceptual frame for everything else in this glossary.

Large Language Model (LLM): A language model trained on large text collections to predict and generate sequences of tokens. LLMs can answer questions, summarize, classify, extract information, and produce text or code.
Transformer: A neural network architecture built around attention mechanisms. The 2017 'Attention Is All You Need' paper introduced the architecture that underpins most modern LLMs.
Foundation Model: A broadly trained model that can be adapted to many tasks instead of being built for one narrow use case.
Base Model: A pretrained model before task-specific instruction tuning or alignment. It predicts likely continuations but may not behave like a polished assistant.
Token: A unit of text processed by a language model. A token may be a word, part of a word, punctuation, or another text fragment, depending on the tokenizer.
Tokenization: The process of converting raw text into tokens before the model can process it. Different models use different tokenizers, which affects output behaviour.
Context Window: The maximum number of tokens a model can process in one request, including instructions, conversation history, retrieved context, tool results, and generated output. Limits vary by model and provider.
Parameters: The learned numerical values inside a model. Parameter count affects model size and compute requirements, but it does not by itself prove that one model is better than another.
Weights: The actual numeric values of a model's parameters after training. When you download a model, you are downloading its weights.
Inference: Running a trained model on an input to produce an output. Inference happens during testing and production, while training is when model weights are updated.
Training: The process of exposing a model to data and adjusting its weights to reduce a loss function. Training large models can require substantial computing resources, while smaller fine-tuning jobs may need much less.
Pre-training: The initial large-scale training on a broad dataset. It produces a base model before task-specific fine-tuning or alignment.

The 2017 Attention Is All You Need paper introduced the Transformer as an architecture based on attention mechanisms rather than recurrence or convolution.

Read the original Transformer paper

What Are Embeddings and How Do They Power AI Search?

Embeddings are how AI systems represent meaning mathematically. They are the foundation of semantic search, RAG pipelines, recommendation systems, and almost everything that involves finding relevant content at scale.

Embedding: A numerical representation of text (or image or audio) as a list of floating-point numbers. Captures semantic meaning so that similar concepts have similar numbers.
Vector: The actual array of numbers that makes up an embedding. 'Vector' and 'embedding' are often used interchangeably in AI contexts.
Vector Database (Vector Store): A system that stores embeddings and retrieves similar vectors efficiently. A dedicated vector database is one option; some relational and search databases also support vector search.
Semantic Search: Searching by meaning rather than keyword matching. Uses embeddings to find conceptually related content even when the exact words differ.
Dense Vector: An embedding where most dimensions contain non-zero values. Dense vectors are commonly used to represent semantic meaning.
Sparse Retrieval: Retrieval based mainly on token or term matches, often represented as vectors with many zero values. It remains useful for names, codes, and exact terminology.
Hybrid Search: Combining semantic vector search with keyword or sparse retrieval, then merging or reranking the results.
Cosine Similarity: A measure of the angle between two vectors. A higher score usually indicates more similar direction, although the exact score meaning depends on the embedding model and index.
Latent Space: The high-dimensional mathematical space where embeddings live. Similar concepts cluster together in this space, which is what makes semantic search work.
Approximate Nearest Neighbor (ANN) Search: A family of methods that finds likely nearest vectors quickly without comparing every stored vector exactly. Vector indexes use ANN to trade a small amount of recall for speed.

What Is RAG and What Terms Do You Need to Know?

RAG, or retrieval-augmented generation, helps a model answer with information retrieved at request time. It is useful when an AI feature needs current or private knowledge that should not depend only on model training.

RAG (Retrieval-Augmented Generation): A pattern that retrieves relevant information and gives it to a generative model as context for an answer. RAG can improve factual grounding, but the result still needs evaluation.
Chunking: Splitting a document into smaller pieces before indexing it. Chunks need enough context to be meaningful without becoming so large that retrieval loses precision.
Retrieval: The step in RAG where the system searches for relevant evidence. It may use vector similarity, keyword matching, metadata filters, or a hybrid of these methods.
Reranking: A second scoring step that reorders initially retrieved items using a more precise relevance model before context is sent to the generator.
Query Rewriting: Turning a user's question into one or more retrieval queries that are easier for the search system to match.
Metadata Filtering: Limiting retrieval by attributes such as customer, date, document type, language, or permission before similarity scoring.
Retrieval Evaluation: Measuring whether search returns the evidence needed to answer a question, often with metrics such as recall, precision, or ranking quality.
Grounding: Connecting a model's response to relevant, verifiable source material or tool results. Retrieval is one way to ground an answer, but grounded systems can still produce errors and need evaluation.
Knowledge Graph: A structured database of entities and their relationships. Can be used alongside RAG for more precise, relationship-aware retrieval.
Semantic Caching: Reusing a stored response when a new query is similar enough under a defined threshold and policy. It can reduce latency and cost, but teams must account for permissions, freshness, and false matches.

The original RAG paper described language generation that combines a model's learned parameters with information retrieved from an external index.

Read the original RAG paper

What Prompting Techniques Do AI Engineers Actually Use?

Prompt engineering is the practice of designing instructions, context, examples, and output requirements for a model. A prompt is only one part of reliability, alongside model choice, data, tools, guardrails, and evaluation.

Prompt: The instructions and context sent to a model. Prompts can include goals, examples, retrieved evidence, constraints, and a requested output format.
System Prompt: A higher-priority instruction block used by many chat systems to define behaviour, boundaries, and task context before user messages are processed.
Zero-shot Prompting: Asking the model to complete a task without examples. It can work well for familiar tasks, but performance should be tested against the required format and accuracy.
Few-shot Prompting: Providing one or more examples in the prompt to demonstrate the desired behaviour or format. Examples can improve consistency, but poor examples can also steer the model in the wrong direction.
Chain of Thought (CoT): A reasoning approach where a model works through intermediate steps before producing an answer. Product teams should evaluate final answer quality rather than depend on hidden reasoning text.
Prompt Injection: A vulnerability where instructions in user input or external content alter a model's behaviour in unintended ways. It can lead to policy violations, data exposure, or unsafe tool actions if system controls are weak.
Jailbreak: A prompt crafted to bypass a model's safety guardrails and make it produce disallowed content. An ongoing challenge for model providers and product teams alike.
Context Stuffing: Placing a large amount of source material directly into the prompt. It can be practical for small collections, but irrelevant context may reduce quality and increase cost.

OWASP defines prompt injection as input that changes a model's behaviour or output in unintended ways. Retrieval and fine-tuning do not remove this vulnerability on their own.

Read the OWASP prompt injection guidance

What Is Fine-tuning and When Should You Use It Instead of RAG?

Fine-tuning adapts a pretrained model's behaviour for a task or style by updating model weights. RAG supplies information at request time instead. They solve different problems and can also be used together, so choose based on the failure you need to fix.

Fine-tuning: Continuing the training of a pretrained model on task-specific examples to adapt its behaviour. The data, method, and model size determine the time, cost, and hardware required.
SFT (Supervised Fine-Tuning): Fine-tuning using labeled input/output pairs where a human has provided the correct answers. The most common fine-tuning approach.
RLHF (Reinforcement Learning from Human Feedback): A family of methods that uses human preference data to help train or align a model. A common pipeline trains a reward model from comparisons, then optimises the language model against that reward.
DPO (Direct Preference Optimization): A preference-training method that learns directly from pairs of preferred and rejected responses without training a separate reward model in the standard way.
PEFT (Parameter-Efficient Fine-Tuning): A family of techniques that updates a small portion of a model's parameters or adds a small set of trainable parameters instead of updating the full model.
LoRA (Low-Rank Adaptation): A PEFT method that freezes the base model and adds trainable low-rank matrices to selected layers. The resulting adapter can be stored separately from the base weights.
QLoRA: LoRA combined with a quantized base model. It reduces memory requirements and can make fine-tuning practical on smaller hardware than full-precision training would require.
Instruction Tuning: Fine-tuning a base model specifically to follow natural language instructions. Turns a raw next-token predictor into a useful assistant model.

The LoRA paper describes freezing pretrained weights and adding trainable low-rank matrices, which reduces the number of trainable parameters for a downstream task.

Read the LoRA paper

The DPO paper describes a preference-training objective that uses preferred and rejected responses without the separate reward-model training stage used in a common RLHF pipeline.

Read the DPO paper

How Do You Control What an LLM Outputs?

LLM outputs are probabilistic. These settings and concepts help teams control format, variation, and failure handling, although the exact controls differ across models and providers.

Temperature: A sampling control used by many models to adjust output variation. Lower values tend to be more focused; higher values tend to be more varied. Supported ranges and behaviour differ by provider.
Top-p (Nucleus Sampling): Controls output diversity by limiting token selection to the smallest set whose cumulative probability reaches p. Often used alongside temperature.
Top-k: Limits token selection to the k most probable next tokens at each step. A cruder form of output control compared to top-p.
Structured Output: Model output constrained to a defined structure such as JSON or a schema. It makes responses easier for software to validate and use.
Hallucination: When a model produces a claim that is unsupported, incorrect, or fabricated. Mitigation can include retrieval, tools, validation, citations, constrained outputs, and human review.
Perplexity: A measure of how well a language model predicts a sequence of tokens. Lower values mean the evaluated text was more predictable to that model, but scores are not directly comparable across different tokenizers or datasets and do not capture every aspect of quality.
Logprobs: The log probability scores a model assigns to candidate or generated tokens. They can support ranking and classification, but they are not automatically calibrated measures of confidence or truth.

What Architecture and Infrastructure Terms Do AI Engineers Use?

These are the terms that explain how models are built and how they run in production. You do not need to implement these, but understanding them helps you have informed conversations with engineering teams about performance, cost, and hardware requirements.

Attention Mechanism: The core innovation inside transformers. Allows the model to weigh how relevant each token in the input is to every other token. 'Attention is all you need.'
Multi-head Attention: Running several attention operations in parallel with different learned projections, then combining their outputs. This lets the model represent different relationships between tokens.
KV Cache (Key-Value Cache): A GPU memory optimisation that stores intermediate attention computations so they do not need to be recalculated for each new token. Critical for inference speed.
Flash Attention: A memory-efficient method for computing exact attention by reducing reads and writes between GPU memory levels. Compatible hardware and software can improve speed and reduce memory use.
Quantization: Representing model weights or activations with lower numerical precision to reduce memory use and sometimes improve speed. The effect on quality depends on the model, task, precision, and quantization method.
Distillation: Training a smaller 'student' model to mimic the outputs of a larger 'teacher' model. Produces compact, fast models that retain much of the original capability.
Mixture of Experts (MoE): An architecture that routes each token through a subset of specialized model components instead of activating every parameter for every token.
Speculative Decoding: A technique where a draft model proposes tokens that a target model verifies. It can improve generation speed when the models, workload, and serving setup are a good fit.
VRAM: Video memory available on a GPU. Model weights, context length, batch size, precision, and runtime design all affect how much VRAM inference or training needs.
Inference Engine: Specialised software for running models efficiently in production. Examples: vLLM, Ollama, TensorRT-LLM, llama.cpp.

What Is an AI Agent and How Do Multi-Agent Systems Work?

Agent systems let a model choose actions, call tools, inspect results, and continue through a workflow. The terminology is still evolving, so it helps to distinguish a controlled workflow from a system that can make more decisions on its own.

AI Agent: A model-driven system that can choose actions, use tools, observe results, and continue toward a defined objective within set permissions and stopping rules.
Tool Use / Function Calling: A model's ability to request a predefined function, such as an API call, database query, or calculation, and use the returned result.
MCP (Model Context Protocol): An open protocol for connecting AI applications to tools and data sources through a consistent client-server interface.
Orchestration: The application layer that coordinates model calls, tool use, state, permissions, retries, and workflow steps into a controlled process.
ReAct (Reason + Act): A pattern where a model alternates between deciding what to do, taking an action, and using the observation to choose the next step.
Memory (Short-term / Long-term): Short-term memory is information available in the current context. Long-term memory is persisted information that a system can retrieve across sessions.
Autopilot / Autonomous Mode: A configuration that lets an agent execute multiple steps with limited human confirmation. Broader permissions require stronger controls, logs, limits, and recovery paths.

The official MCP documentation defines MCP as an open standard for connecting AI applications to external data sources, tools, and workflows.

Read the MCP documentation

What Do AI Safety and Alignment Terms Mean for Product Builders?

Safety and alignment are not only research topics. Product teams need controls that match the users, data, permissions, and consequences of their AI feature. These terms explain the main ideas.

Alignment: The work of making an AI system behave in ways that match intended goals, rules, and human preferences.
Constitutional AI: A training approach introduced by Anthropic that uses a written set of principles to guide model critiques, revisions, and preference feedback, reducing reliance on direct human labels for every example.
Guardrails: Programmatic or model-based controls that check, filter, block, or modify model inputs, outputs, and actions. Guardrails can reduce risk, but they do not make an AI system perfectly safe.
NSFW Filter / Content Moderation: Rules or classifiers that detect and handle harmful, explicit, or policy-violating content in model inputs and outputs.
Red Teaming: Deliberately testing an AI system for misuse, unsafe behaviour, prompt attacks, data exposure, or other failures before and after release.

NIST's Generative AI Profile treats confident false content as a risk to manage across design, evaluation, deployment, and use. It does not present a single control as a complete fix.

Read the NIST Generative AI Profile

What Are Multimodal AI Terms and Why Do They Matter?

Multimodal systems work with more than one kind of information, such as text, images, audio, or video. These terms matter when a product needs to understand or generate media beyond plain text.

Multimodal: A model or system that can process or generate more than one type of data, such as text, images, audio, or video.
Vision-Language Model (VLM): A model that jointly processes images and text. Powers image captioning, visual Q&A, and document understanding from scanned files.
ASR (Automatic Speech Recognition): Converting spoken audio into text. ASR is commonly used for transcription, voice interfaces, call analysis, and accessibility.
TTS (Text-to-Speech): Converting written text into spoken audio. TTS is used in voice assistants, accessibility tools, media, and automated calls.
Diffusion Model: A generative model that learns to reverse a gradual noise-adding process. Diffusion methods are used in many systems that generate or edit images, audio, video, and other data.
CLIP: OpenAI's model that links images and text in a shared embedding space. Powers image search and is used inside many image generation pipelines.

What ML Fundamentals Should Every AI Builder Understand?

You do not need to be a machine learning researcher to build AI products. But these foundational terms come up in technical conversations, documentation, and when evaluating whether a model or fine-tune job worked.

Supervised Learning: Training with labeled data where the correct answer is provided. The model learns to map inputs to outputs. Most fine-tuning is supervised.
Unsupervised Learning: Learning patterns or structure from data without explicit target labels. Clustering and dimensionality reduction are common examples.
Reinforcement Learning (RL): Training through trial, error, and reward signals. The model learns by maximising a reward function. RLHF applies this to language model alignment.
Gradient Descent: An optimization method that updates model parameters in a direction intended to reduce the loss function.
Loss Function: A mathematical measure of how wrong the model's predictions are. Training is the process of minimising the loss function over many examples.
Overfitting: When a model memorises training data instead of learning generalizable patterns. Performs well on training data, poorly on new data. A common fine-tuning failure mode.
Epoch: One complete pass through a training dataset. The useful number of epochs depends on the data, model, task, and signs of overfitting.
Batch Size: The number of training examples used for one gradient update. Larger batches may improve hardware throughput, but they also require more memory and can change training behaviour.
Learning Rate: A hyperparameter that controls the size of weight updates during training. A value that is too high can destabilise training, while one that is too low can make learning slow or ineffective.
Benchmark: A standardised test used to compare model performance. Examples: MMLU (general knowledge), HumanEval (coding), HellaSwag (commonsense reasoning).

What Operational AI Terms Do You Need to Run LLMs in Production?

A useful demo is only the beginning. Production systems also need cost control, quality measurement, version management, monitoring, limits, and recovery paths.

Agentic AI: A broad label for AI systems that can choose actions, use tools, and progress through multi-step work with some degree of autonomy.
LLMOps: The practices used to deploy, evaluate, monitor, version, secure, and maintain LLM-based systems in production. It overlaps with MLOps and software operations while adding concerns such as prompts, retrieval, tool calls, and model-provider changes.
Prompt Versioning: Tracking prompt changes, linking them to evaluation results, and preserving the ability to compare or restore earlier versions.
Evals (Evaluations): Tests used to measure model or system behaviour against examples, rubrics, human judgments, or task-specific metrics.
Streaming: Returning output incrementally as it is generated instead of waiting for the complete response. This can make interactive interfaces feel more responsive.
Model Router: A system that dynamically selects which model to use per query based on complexity, cost, or latency requirements. Use a cheaper model for simple queries; a powerful one for complex ones.
Dual-model Setup: Using two models with different cost, speed, or capability profiles, then routing each task to the model that fits it.
Multi-tenant AI: An AI system where multiple customers share infrastructure while data, permissions, configuration, usage, and logs remain correctly isolated.
Prompt Caching: Reusing processing for a repeated prompt prefix so the same instructions or context do not need to be computed from scratch on every request.
Regression Eval: A repeatable test set used to detect whether a model, prompt, retrieval, or workflow change improved or harmed behaviour on important product tasks.
Latency: The time between sending a request and receiving a useful response. Teams may track time to first token and total response time separately.
Throughput: The amount of work a system can process over time, such as requests or generated tokens per second.
Rate Limit: A provider or application limit on requests, tokens, or compute within a time period. Products need retries, queues, or backpressure when limits are reached.
Observability: Logs, traces, metrics, and quality signals that help a team understand model calls, tool use, latency, cost, errors, and user outcomes.

What is the difference between an LLM and a transformer?

A transformer is a neural network architecture. An LLM is a language model trained at large scale, usually using transformer architecture. Transformers are also used for images, audio, biology, and other kinds of data, so not every transformer is an LLM.

What is the difference between RAG and fine-tuning?

RAG retrieves information at request time without changing model weights. Fine-tuning updates model weights to adapt behaviour or task performance. Use RAG for changing or private knowledge, fine-tuning for learned behaviour, and evaluate both against the real task before choosing.

What does hallucination mean in AI and how do you prevent it?

Hallucination is when a model produces a claim that is unsupported, incorrect, or fabricated. Teams can reduce the risk with retrieval, tools, output validation, citations, constrained tasks, evaluation, and human review. No single technique removes hallucinations completely.

What is a context window and why does it matter for my product?

The context window is the total number of tokens a model can process in one request, including instructions, conversation history, retrieved documents, tool results, and generated output. A larger window can hold more material, but relevant context, latency, and cost still need to be managed.

What is the difference between a vector database and a regular database?

A traditional database usually retrieves structured values through exact conditions, ranges, joins, or full-text search. A vector database stores embeddings and retrieves items by mathematical similarity. Many products combine relational, keyword, and vector search rather than choosing only one.

What is MCP and why is it relevant for AI builders?

MCP, or Model Context Protocol, is an open protocol for connecting AI applications to tools and data sources through a consistent client-server interface. It can reduce custom integration work when compatible clients and servers are available.

AI terminology keeps changing, but the useful test stays simple: can the term help your team make a better product decision? Bookmark this glossary for quick reference, then explore our AI software development guide when you are ready to turn the vocabulary into a working product.