AI & ML Glossary: 120 Terms Explained in Plain English

🧱 Foundations & Concepts

20 terms

AI

Artificial Intelligence

Software that mimics tasks we used to think only humans could do — recognising images, writing, planning, reasoning.

An umbrella term for systems that perform tasks normally requiring human intelligence. Modern AI is dominated by machine learning, especially deep neural networks trained on massive datasets.

ML

Machine Learning

Teaching computers patterns from examples instead of writing rules by hand.

A branch of AI where systems learn behaviour from data rather than from explicit programming. Includes supervised, unsupervised, and reinforcement learning.

Deep Learning

Machine learning using neural networks with many stacked layers — the engine behind modern AI.

A subfield of ML using neural networks with multiple hidden layers. Powers virtually every modern AI breakthrough — image recognition, speech, LLMs, protein folding.

Neural Network

A web of simple math units loosely inspired by brain cells, trained to turn inputs into useful outputs.

A computational model made of interconnected layers of nodes (neurons), each applying a weighted sum and a non-linear activation. Training adjusts the weights so the network maps inputs to desired outputs.
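As a rough illustration of the weighted sum and activation described above, here is a minimal sketch in plain Python. The sigmoid activation, the layer sizes, and every weight value are illustrative assumptions; training (adjusting the weights) is not shown.

```python
import math

def sigmoid(z: float) -> float:
    """Non-linear activation that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """One node: weighted sum of the inputs plus a bias, then the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(z)

def layer(inputs, weight_rows, biases):
    """A layer is several neurons reading the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Hypothetical two-input network: one hidden layer of two neurons, one output.
hidden = layer([1.0, 2.0], [[0.5, -0.25], [0.1, 0.2]], [0.0, -0.5])
output = neuron(hidden, [1.0, -1.0], 0.0)
```

Training would repeatedly nudge the numbers in the weight lists so that the output moves toward the desired value.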

LLM

Large Language Model

A very large neural network trained on huge amounts of text to read, write, and reason in language.

A transformer-based model with billions of parameters trained on internet-scale text. Generates language by predicting the next token. Examples: GPT, Claude, Gemini, Llama.

SLM

Small Language Model

A compact LLM small enough to run on a laptop or phone, trading some quality for speed and privacy.

A language model with roughly 1B to 10B parameters (the boundary is informal), designed for on-device or low-latency inference. Often distilled from larger models or trained on high-quality curated data.

Foundation Model

A general-purpose model pretrained on broad data that you can adapt to many specific tasks.

A large model trained on diverse, broad data that serves as a base for many downstream applications via fine-tuning or prompting. The term was coined in 2021 by Stanford's Center for Research on Foundation Models (CRFM).

Frontier Model

The most capable AI models in the world at any given moment — the cutting edge.

Refers to the most advanced general-purpose AI models available, typically those that push the limits of capability and scale. Subject to additional safety scrutiny.

Base Model

The raw pretrained model before it's taught to follow instructions or chat.

A language model after pre-training but before instruction tuning or RLHF. Continues text well but doesn't reliably follow user commands.

Instruct Model

A base model further trained to actually follow user instructions and answer questions properly.

A base model after supervised fine-tuning on instruction-following data, making it usable as an assistant.

Chat Model

An instruct model trained on multi-turn conversations, with a system/user/assistant turn format.

An instruct model further tuned on dialogue data and equipped with a conversational turn structure (system, user, assistant roles).

Multimodal Model

A model that can read and combine more than one kind of input — text, images, audio, video.

A model that processes multiple input modalities (text, images, audio, video) within a unified architecture, enabling tasks that span them.

Vision-Language Model

VLM

An AI that can look at an image and discuss it in language — describe, answer questions, reason about it.

A multimodal model jointly trained on images and text. Can caption, answer visual questions, do OCR, and reason over diagrams.

Generative AI

GenAI

AI that creates new content — text, images, audio, code — rather than just classifying or predicting.

A class of AI models that learn the distribution of training data and generate new samples from it. Includes LLMs, diffusion models, GANs.

Discriminative Model

A model that learns to tell categories apart instead of creating new examples.

A model that learns the boundary between classes: it answers 'which class?' rather than 'what does a sample look like?'. Examples: logistic regression, most classifiers, BERT fine-tuned for classification.

Supervised Learning

Training a model on labelled examples — every input comes paired with the right answer.

An ML paradigm where models learn from input-output pairs. The model adjusts its parameters to minimise the gap between predicted and true labels.

Unsupervised Learning

Finding patterns in data without anyone labelling what's what.

An ML paradigm where models discover structure in unlabelled data — clustering, dimensionality reduction, density estimation.

Reinforcement Learning

Training an agent by giving it rewards for good actions and penalties for bad ones — like training a dog with treats.

An ML paradigm where an agent learns by interacting with an environment and optimising a reward signal. Foundational to RLHF, game-playing AI, robotics.

Self-Supervised Learning

SSL

Training a model by hiding part of the input and asking it to predict the missing piece — no human labels needed.

Models generate their own training signal from the data itself, e.g. masking words and predicting them (BERT-style) or predicting the next token (GPT-style). The dominant paradigm for pre-training LLMs.

Transfer Learning

Taking a model trained on one task and adapting it to a different but related task — faster than starting fresh.

Reusing knowledge learned on one task to bootstrap performance on another, typically via fine-tuning a pretrained model.

🏗️ Models & Architectures

20 terms

Transformer

The neural network architecture behind nearly every modern LLM — fast, parallel, and built around attention.

A neural network architecture introduced in 2017 ('Attention Is All You Need') that replaced RNNs for sequence modelling. Uses self-attention to model relationships between all positions in parallel.

Attention

The mechanism that lets a model decide which parts of the input matter most when producing each output.

A weighted sum mechanism where each output position learns to focus on the most relevant input positions. The core building block of transformers.

Self-Attention

Attention applied within a single sequence so every word can look at every other word.

An attention mechanism where queries, keys, and values all come from the same sequence, letting each token model dependencies on every other token.
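A minimal sketch of scaled dot-product self-attention in plain Python. To keep it short, the token vectors are used directly as queries, keys, and values; a real transformer would first apply learned projection matrices, which are omitted here as a simplifying assumption. The `causal` flag illustrates the masking a decoder uses.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1 (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, causal=False):
    """Queries, keys and values all come from the same sequence `tokens`."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for i, query in enumerate(tokens):
        scores = []
        for j, key in enumerate(tokens):
            if causal and j > i:
                scores.append(float("-inf"))  # hide future positions
            else:
                dot = sum(q * k for q, k in zip(query, key))
                scores.append(dot / math.sqrt(dim))
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[d] for w, v in zip(row, tokens)) for d in range(dim)]
        )
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, causal=True)
```

With `causal=True` the first token can only attend to itself, so its attention row puts all weight on position 0.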

Cross-Attention

Attention that lets one sequence look at a different sequence — used when conditioning on outside info.

An attention mechanism where queries come from one sequence and keys/values from another. Common in encoder-decoder models and image-text models.

Multi-Head Attention

MHA

Running attention several times in parallel with different learned views, then combining the results.

Splits attention into multiple parallel heads, each learning a different relationship pattern (syntax, coreference, etc.). Outputs are concatenated and projected.

Encoder

The half of a model that reads the input and turns it into rich internal representations.

A stack of transformer layers that maps an input sequence into a sequence of contextual embeddings. Used standalone (BERT) or paired with a decoder (T5).

Decoder

The half of a model that generates output one token at a time, using what it's already produced as context.

A stack of transformer layers that produces output sequences autoregressively. Uses causal masking so each token only attends to previous positions.

Encoder-Decoder

Architecture that reads everything first, then writes everything — good for translation, summarisation.

A transformer architecture with separate encoder and decoder stacks. The decoder cross-attends to encoder outputs. T5, BART, original Transformer.

Decoder-Only

Architecture used by most modern LLMs: no separate reader, just one big writer that generates the next token.

A transformer with only a causal decoder stack. Used by GPT, Claude, Llama, Gemini. Treats every task as next-token prediction.

Encoder-Only

Architecture optimised for understanding text rather than generating it — used in classification, search.

A transformer with only an encoder stack. Produces bidirectional embeddings, used for classification, retrieval, NER. BERT, RoBERTa, DeBERTa.

MoE

Mixture of Experts

A model that activates only a small set of its 'experts' per question — more brainpower for less compute.

An architecture where a router selects a sparse subset of expert sub-networks to process each token. Enables huge total parameter counts while per-token compute stays roughly fixed, because only the selected experts run.
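A toy sketch of top-k expert routing. The experts here are stand-in scalar functions and the router logits are passed in by hand; in a real model the logits come from a learned layer applied to the token. Normalising the gate weights over only the selected experts is one common variant, assumed here for illustration.

```python
import math

def route_top_k(router_logits, k=2):
    """Return {expert_index: gate_weight} for the k highest-scoring experts."""
    ranked = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(token, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route_top_k(router_logits, k)
    return sum(gate * experts[i](token) for i, gate in gates.items())

# Four hypothetical experts; only two of them run for this token.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: -x, lambda x: x * 4]
gates = route_top_k([0.1, 2.0, -1.0, 2.0], k=2)
mixed = moe_forward(3.0, experts, [0.1, 2.0, -1.0, 2.0], k=2)
```

The other two experts are never called, which is where the compute saving comes from.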

Dense Model

A model where every parameter is used for every input — the traditional architecture.

A transformer where all parameters are active for every forward pass. Contrasts with MoE / sparse models. Llama, Mistral, most early LLMs.

Sparse Model

A model where only a fraction of parameters fire per input — often an MoE.

A model in which only a small subset of parameters is activated per input. MoE is the dominant sparse-model approach today.

CNN

Convolutional Neural Network

The architecture that revolutionised image recognition by sliding small filters across pixels.

A neural network using convolutional layers that share weights spatially. Dominant for vision before transformers; still strong on many image tasks.

RNN

Recurrent Neural Network

An older sequence model that processes tokens one at a time, passing along a hidden state.

A neural network that processes sequences step-by-step with a hidden state. Largely superseded by transformers, which parallelise across positions during training and handle long-range dependencies more reliably.

LSTM

Long Short-Term Memory

A type of RNN with gating that helps it remember things from much earlier in a sequence.

An RNN variant with input, output, and forget gates. Dominant for sequence modelling before transformers; still used in some specialised settings.

Diffusion Model

An image generator that starts with pure noise and gradually 'denoises' it into a coherent picture.

A generative model that learns to reverse a noising process. Powers Stable Diffusion, Midjourney, Sora, DALL·E. Now also used for audio and video.

GAN

Generative Adversarial Network

Two networks playing a cat-and-mouse game — one generates fakes, the other learns to spot them.

A generative framework where a generator and discriminator are trained adversarially. Dominant for image generation before diffusion models.

Autoencoder

A network that learns to compress its input then reconstruct it, capturing the essence in between.

A neural network trained to reproduce its input, with a narrow bottleneck in the middle that forces it to learn a compressed representation.

VAE

Variational Autoencoder

An autoencoder with a probabilistic bottleneck — useful for generating new samples, not just reconstructing.

An autoencoder whose latent space follows a probability distribution, enabling controlled sampling. Foundational to many generative models, including some diffusion variants.

🎓 Training & Fine-tuning

20 terms

Pre-training

The first, most expensive training phase — feeding the model billions of text examples to learn language.

The initial training run on massive unlabelled (or weakly labelled) data using a self-supervised objective like next-token prediction. Produces the base model.

Fine-tuning

Continuing to train a pretrained model on a smaller, more specific dataset to specialise it.

Adapting a pretrained model to a specific task or domain by continued training on a smaller curated dataset. Can be full fine-tuning or parameter-efficient (LoRA, adapters).

SFT

Supervised Fine-Tuning

Fine-tuning a base model on high-quality input-output examples so it learns to follow instructions.

Fine-tuning a pretrained model on (prompt, completion) pairs using next-token prediction. The first step in turning a base model into an instruct model.

RLHF

Reinforcement Learning from Human Feedback

Teaching a model what good answers look like using human ratings, not just example text.

A training pipeline where humans rank model outputs, a reward model is trained on those rankings, and the LLM is fine-tuned with RL (usually PPO) to maximise reward.

RLAIF

RL from AI Feedback

Like RLHF, but using a strong AI model to rate outputs instead of humans — cheaper and faster to scale.

A variant of RLHF where AI models provide the preference signal in place of human raters. Used in Anthropic's Constitutional AI.

DPO

Direct Preference Optimization

A simpler alternative to RLHF that skips the reward model and trains directly on human preferences.

An alignment method that fine-tunes a model directly on preference pairs (chosen, rejected) using a contrastive loss. Simpler and more stable than full RLHF/PPO.
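A minimal sketch of the per-pair DPO loss, assuming you already have the summed log-probabilities of the chosen and rejected responses under both the model being trained and a frozen reference model. The function and argument names are illustrative, and `beta=0.1` is just a commonly seen default, not a recommendation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Negative log-sigmoid of the scaled preference margin."""
    # How much more the policy prefers each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy identical to the reference: margin is zero, loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has shifted probability toward the chosen response: lower loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss falls as the policy raises the chosen response and lowers the rejected one relative to the reference, which is the sense in which no separate reward model is needed.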

PPO

Proximal Policy Optimization

A popular RL algorithm — the workhorse used in the RL step of RLHF.

An RL algorithm that keeps each policy update close to the previous policy by clipping the objective, a cheap approximation of a trust region that improves stability. The standard choice for the reinforcement-learning step in RLHF pipelines.

GRPO

Group Relative Policy Optimization

A newer RL algorithm that replaces PPO's value function with reward comparisons inside a group of sampled answers.

An RL algorithm introduced by DeepSeek that removes the critic from PPO by computing advantages relative to a group of sampled responses. More efficient for reasoning training.
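A small sketch of the group-relative advantage idea: each sampled response is scored against the mean and spread of its own group, so no critic network is needed. Using the population standard deviation and the small `eps` term are assumptions made here for a runnable example.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against the other responses in the same group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, rewarded 1.0 if correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers end up with positive advantages and incorrect ones with negative advantages, and these values then weight the policy update.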

Reward Model

A model that predicts how much a human would like a given response — used to guide RLHF training.

A model trained on human preference data to score arbitrary outputs. Used in the RL stage of RLHF to provide a scalar reward for each response.

Constitutional AI

CAI

Aligning a model using a written set of principles instead of relying purely on human ratings.

A training method introduced by Anthropic where a model critiques and revises its own outputs against a written 'constitution' of principles, then learns from those revisions.

LoRA

Low-Rank Adaptation

A fine-tuning trick that updates a tiny set of extra parameters while freezing the main model — fast and cheap.

A parameter-efficient fine-tuning method that injects low-rank trainable matrices into model layers while freezing the base weights. Drastically reduces compute and memory.
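A minimal sketch of a LoRA-adapted linear layer using plain Python lists. The names `W`, `A`, `B`, `alpha`, and `r` follow the usual LoRA notation; the tiny matrices are made-up values for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output W @ x plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # trainable: A is r x d_in, B is d_out x r
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

def lora_param_counts(d_in, d_out, r):
    """(full fine-tuning parameters, LoRA trainable parameters) for one matrix."""
    return d_in * d_out, r * (d_in + d_out)

W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, 0.5]]     # rank r = 1
B = [[0.0], [0.0]]   # B starts at zero, so the adapter is initially a no-op
y = lora_forward(W, A, B, [1.0, 1.0], alpha=2.0, r=1)
full, lora = lora_param_counts(4096, 4096, 8)
```

For a 4096 x 4096 matrix at rank 8, the trainable count drops from about 16.8 million to 65,536, which is where the memory saving comes from.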

QLoRA

Quantized LoRA

LoRA performed on top of a quantized (compressed) base model — fine-tune huge models on a single GPU.

Combines LoRA with 4-bit quantization of the base model, enabling fine-tuning of very large models on consumer hardware.

PEFT

Parameter-Efficient Fine-Tuning

An umbrella for methods that fine-tune only a small fraction of a model's parameters.

A family of fine-tuning techniques (LoRA, adapters, prefix tuning) that update only a small fraction of parameters, keeping most of the base model frozen.

Adapter

A small trainable module slotted into a frozen model — a precursor to LoRA.

Small bottleneck modules inserted between layers of a frozen pretrained model and trained for a specific task. A foundational PEFT technique.

Prefix Tuning

Learning a small 'fake prompt' of trainable vectors that conditions the model toward a task.

A PEFT method that learns continuous prefix vectors prepended to keys and values at every layer, while keeping model weights frozen.

Prompt Tuning

Learning a tiny continuous prompt that steers a frozen model toward a task — even smaller than prefix tuning.

A PEFT method that learns a small set of continuous embeddings prepended to the input. Cheaper than prefix tuning but typically less expressive.

Distillation

Knowledge Distillation

Training a small 'student' model to imitate a big 'teacher' model — shrinking models without losing too much.

Training a smaller model to match the outputs (or internal states) of a larger one. Produces fast, deployable models that retain much of the teacher's capability.

Quantization

Compressing a model by storing its weights in smaller number formats — 4 or 8 bits instead of 16 or 32.

Reducing the numerical precision of model weights (e.g. FP16 → INT8 → INT4) to shrink memory and speed up inference, usually with a small quality cost.
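A minimal sketch of symmetric INT8 quantization with a single scale per tensor. This "absmax" scheme is one simple approach chosen for illustration; production methods typically use per-channel or per-group scales and more careful calibration.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid a zero scale
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored weight differs from the original by at most half a quantization step, which is the "small quality cost" in exchange for storing 8 bits per weight instead of 16 or 32.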

Pruning

Removing redundant weights or neurons from a model to make it leaner.

Removing weights, neurons, or whole structures judged unimportant. Can be unstructured or structured; often paired with fine-tuning to recover performance.

Gradient Descent

The algorithm that nudges a model's weights in whichever direction reduces the error, over and over.

The core optimisation algorithm of deep learning. Computes the gradient of a loss with respect to parameters and updates them in the opposite direction, scaled by a learning rate.
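A minimal sketch of the update rule on a one-parameter problem. The loss (w - 3)^2, the learning rate, and the step count are all illustrative choices; real training computes gradients over millions of parameters with automatic differentiation.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

# Loss is (w - 3)^2, so its gradient is 2 * (w - 3) and the minimum is at w = 3.
minimum = gradient_descent(lambda w: 2 * (w - 3), start=0.0)
```

Each step shrinks the distance to the minimum by a constant factor here, so `minimum` ends up very close to 3.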

⚡ Inference & Serving

15 terms

Inference

Using a trained model to produce outputs — what happens every time you send a prompt.

The process of generating outputs from a trained model. Distinct from training; usually the dominant cost in production deployment.

Latency

How long it takes a model to respond — measured end-to-end or to the first character.

Time taken to produce a response. For LLMs, broken into TTFT (first token) and per-token decode latency.

TTFT

Time To First Token

How long you wait before the first character appears — what users perceive as 'responsiveness'.

The latency from request submission to the model emitting its first output token. Dominated by queueing and the prefill phase, in which the model processes the whole prompt and fills the KV cache.

TPS

Tokens Per Second

How fast a model generates after it starts — higher = more text per second.

The rate of output token generation. On modern hardware, decoding is usually limited by memory bandwidth rather than compute.

KV Cache

A memory of already-computed attention values so the model doesn't redo work for each new token.

The cached keys and values from prior tokens, reused on every decode step to avoid recomputing the full prompt. Grows linearly with context length.
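A back-of-the-envelope sketch of why the cache grows linearly with context length. The model shape below (32 layers, 32 key/value heads, head dimension 128, 2-byte values) is a hypothetical configuration chosen for round numbers, not a specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Memory for cached keys and values; the leading 2 covers K and V."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full_context = kv_cache_bytes(32, 32, 128, seq_len=4096)
```

Under these assumptions each token costs 512 KiB of cache, so a 4,096-token sequence needs 2 GiB before any batching.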

Paged Attention

An efficient way to manage KV cache memory, inspired by virtual memory in operating systems.

An inference optimisation introduced in vLLM that allocates KV cache in non-contiguous blocks, dramatically reducing memory fragmentation and enabling higher batch sizes.

Continuous Batching

Letting different requests be added and removed from a batch mid-generation, instead of waiting for the slowest one.

A serving technique where new requests join an in-flight batch as soon as slots free up, dramatically improving throughput over static batching.

Speculative Decoding

Using a small fast model to draft several tokens that the big model then verifies in one shot.

A decoding technique where a draft model proposes multiple tokens which the target model verifies in parallel. Speeds up generation when drafts are often correct.
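A toy sketch of one speculative decoding step using greedy matching: a drafted token is accepted only if it equals the target model's own greedy choice. Real systems verify all drafted positions in a single batched forward pass and often use a probabilistic acceptance rule; both are simplified away here. The two "models" are hypothetical lookup functions.

```python
def speculative_step(target_next, draft_next, context, k=4):
    """Draft k tokens, keep the agreeing prefix, then add one target token."""
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(context + proposal))

    accepted = []
    for token in proposal:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # target's correction replaces the miss
            return accepted
        accepted.append(token)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted

TARGET = ["the", "cat", "sat", "on", "the", "mat"]
DRAFT = ["the", "cat", "sat", "in", "a", "hat"]

def target_next(seq):
    return TARGET[len(seq)]

def draft_next(seq):
    return DRAFT[len(seq)]

step = speculative_step(target_next, draft_next, context=[], k=4)
```

Here three drafted tokens are accepted and the fourth is corrected, so one verification round yields four tokens, all identical to what the target alone would have produced.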

Beam Search

Generation that explores multiple promising candidate sequences in parallel and keeps the best ones.

A decoding algorithm that maintains the top-K partial sequences at each step. Common in machine translation; less used for chat LLMs.
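A minimal sketch of beam search over a hand-written next-token table. The table and its probabilities are invented so that the greedy path is not the best overall sequence; `next_probs` stands in for a real model.

```python
import math

def beam_search(next_probs, start, steps, beam_width):
    """Keep the beam_width highest-scoring partial sequences at each step."""
    beams = [([start], 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for token, p in next_probs(seq).items():
                candidates.append((seq + [token], logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Hypothetical model: probabilities depend only on the last token.
TABLE = {
    "START": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "y": 0.1},
}

def next_probs(seq):
    return TABLE[seq[-1]]

best = beam_search(next_probs, "START", steps=2, beam_width=2)[0]
greedy = beam_search(next_probs, "START", steps=2, beam_width=1)[0]
```

With a beam width of 1 the search is greedy and commits to "a" (total probability 0.30), while a width of 2 keeps "b" alive long enough to find "b z" (0.36).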

Greedy Decoding

Always pick the single most likely next token — fast and deterministic but boring.

Decoding that selects the highest-probability token at each step. Deterministic but often produces repetitive or generic text.

Temperature

A knob that controls randomness: low = predictable, high = creative and unpredictable.

A parameter that scales token logits before softmax. Lower values sharpen the distribution (more deterministic), higher values flatten it (more diverse).
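A minimal sketch of temperature scaling: logits are divided by the temperature before softmax. The logit values are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then normalise into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
```

The top token's probability is highest at temperature 0.5 and lowest at 2.0, matching the "sharpen" versus "flatten" description.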

Top-K Sampling

Restricting sampling to the K most likely next tokens — keeps output coherent without going full robot.

Truncates the sampling distribution to the K highest-probability tokens before renormalising and sampling.

Top-P (Nucleus) Sampling

Sampling only from the smallest set of tokens whose probabilities add up to P — adaptive top-K.

Truncates the distribution to the smallest set of tokens whose cumulative probability exceeds P. More adaptive than top-K to confident vs uncertain steps.
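A minimal sketch of both truncation schemes applied to an already-computed probability list. The function names and the example distribution are illustrative; the top-P cut-off here stops as soon as the cumulative probability reaches or exceeds P.

```python
def _renormalise(probs, keep):
    """Zero out everything outside `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalise(probs, ranked[:k])

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in ranked:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalise(probs, keep)

probs = [0.5, 0.3, 0.15, 0.05]
by_k = top_k_filter(probs, k=2)
by_p = top_p_filter(probs, p=0.9)
```

Top-K always keeps exactly two tokens here, while top-P keeps three because the first two only cover 0.8 of the probability mass. The next token would then be drawn from the filtered list, for example with `random.choices`.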

vLLM

A popular open-source library for serving LLMs fast, originally built around paged attention.

An open-source LLM inference engine known for high throughput via paged attention and continuous batching. Widely used for production serving.

Context Window

How much text a model can 'see' at once — its short-term memory, measured in tokens.

The maximum number of tokens, prompt plus generated output, that a model can attend to at once. Modern models range from 8K to 2M+ tokens.

🔍 Retrieval & RAG

12 terms

RAG

Retrieval-Augmented Generation

Letting an LLM look things up before answering, instead of guessing from memory.

A pattern where an LLM is given retrieved documents at inference time and uses them to ground its response. Reduces hallucination and lets models use information not in their training data.

Embeddings

Turning words or sentences into numbers so a computer can tell which ones mean similar things.

Dense vector representations of text (or other modalities) where semantically similar items are close in vector space. Foundational to search, RAG, and clustering.

Vector Database

A database optimised for finding the most similar embeddings to a query, fast.

A database designed to store and query high-dimensional vectors using approximate nearest neighbour search. Pinecone, Weaviate, Qdrant, pgvector.

Vector Search

ANN Search

Finding the closest embeddings to a query — used to retrieve relevant documents for RAG.

Searching for the nearest vectors to a query embedding, typically with approximate nearest neighbour algorithms (HNSW, IVF, ScaNN) for speed.

Semantic Search

Search that understands meaning — finds 'car repair' when you type 'fix my vehicle'.

Search that matches on meaning rather than exact keywords, typically powered by embedding similarity instead of (or alongside) lexical scoring.

Reranking

A second-pass step that takes the top retrieved results and re-orders them with a more accurate model.

A pipeline stage that re-scores an initial set of retrieval candidates using a more expensive but more accurate model (often a cross-encoder).

Cross-Encoder

A model that scores a (query, document) pair together — slow but accurate.

A model that takes a concatenated query-document pair and outputs a relevance score. More accurate than bi-encoders but cannot pre-index documents.

Bi-Encoder

A model that encodes queries and documents separately — fast because you can pre-index everything.

A model that encodes queries and documents independently into the same vector space, enabling fast retrieval via similarity search.

BM25

Best Matching 25

The classic keyword-based search algorithm — fast, simple, and still surprisingly strong.

A bag-of-words ranking function that scores documents by term frequency, inverse document frequency, and document length. The de-facto baseline for lexical retrieval.
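A compact sketch of BM25 scoring with whitespace tokenisation. The `k1=1.5` and `b=0.75` values are commonly used defaults, and the IDF form with `+ 1` inside the logarithm is one widespread variant; both are assumptions here, as are the sample documents.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query; higher means more relevant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = ["car repair shop", "fix my vehicle today", "car car car wash"]
scores = bm25_scores("car repair", docs)
```

The first document wins because it matches both query terms, and the rarer term "repair" carries more weight than "car". The second document scores zero since it shares no exact keywords, which is the gap semantic search tries to close.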

Hybrid Search

Combining keyword search and semantic search to get the best of both — exact matches plus meaning.

A retrieval approach that fuses lexical (BM25) and semantic (vector) scores, often producing better results than either alone.
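One simple way to fuse the two result lists is reciprocal rank fusion (RRF), sketched below. Note that RRF combines ranks rather than raw scores, so it is a stand-in for whatever fusion a given system actually uses; `k=60` is a conventional constant and the document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranked list."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

lexical = ["d1", "d2", "d3"]    # e.g. ordered by BM25
semantic = ["d3", "d1", "d4"]   # e.g. ordered by embedding similarity
combined = reciprocal_rank_fusion([lexical, semantic])
```

Documents that rank well in both lists rise to the top, while those found by only one retriever still make the list further down.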

Chunking

Splitting long documents into smaller pieces so each chunk can be embedded and retrieved on its own.

The pre-processing step that divides documents into retrieval-sized passages. Strategy (fixed-size, semantic, recursive) materially affects RAG quality.
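A minimal sketch of the fixed-size strategy with overlap, counting words rather than tokens for simplicity. The chunk size and overlap are illustrative values, not recommendations.

```python
def chunk_words(text, chunk_size=5, overlap=2):
    """Split text into word windows that share `overlap` words with the next."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and below chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

chunks = chunk_words("a b c d e f g h", chunk_size=5, overlap=2)
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.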

Cosine Similarity

A measure of how similar two embedding vectors are, based on the angle between them.

Measures similarity as the cosine of the angle between two vectors. Standard similarity metric for normalised text embeddings.
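A minimal sketch of the calculation in plain Python; the example vectors are arbitrary.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Vectors pointing the same way score 1, unrelated (perpendicular) vectors score 0, and opposite vectors score -1, regardless of their lengths. When embeddings are already normalised to length 1, the dot product alone gives the same result.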

📊 Evaluation & Quality

10 terms

Hallucination

When a model confidently makes up information that isn't true.

Generation of plausible-sounding but factually incorrect output. A core failure mode of LLMs, mitigated (not eliminated) by RAG, grounding, and better training.

Perplexity

A measure of how 'surprised' a model is by text — lower means it predicts the data better.

The exponential of the cross-entropy loss. Standard intrinsic metric for language model quality on a held-out corpus.
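A minimal sketch of the relationship, assuming you already have the probability the model assigned to each correct token in a held-out text. The probabilities below are invented.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the true tokens."""
    losses = [-math.log(p) for p in token_probs]
    cross_entropy = sum(losses) / len(losses)
    return math.exp(cross_entropy)

uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95])
```

A model that always spreads its belief evenly over four options has perplexity 4, which is why perplexity is often read as the effective number of choices the model is weighing per token.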

MMLU

Massive Multitask Language Understanding

A common benchmark of 57 subjects (math, law, medicine, etc.) used to compare LLM general knowledge.

A multiple-choice benchmark spanning 57 subjects from high-school to professional level, widely used to compare LLMs on general knowledge and reasoning.

HumanEval

A benchmark of 164 programming problems used to measure how well models can write code.

A benchmark of 164 Python programming problems with unit tests. Standard for measuring code generation capability.

Benchmark

A standardised test used to compare AI models fairly.

A standardised dataset and evaluation protocol used to measure and compare model capabilities. Examples: MMLU, GSM8K, HumanEval, MT-Bench, IFEval.

Eval Harness

Software that runs many benchmarks on a model in a consistent way, so results are comparable.

A framework that standardises how benchmarks are run (prompt format, scoring, sampling settings) so models can be compared fairly. EleutherAI's lm-evaluation-harness is a widely used example.

LLM-as-Judge

Using one LLM to score the outputs of another — fast, scalable, but biased.

An evaluation method where a capable LLM rates outputs from other models. Cheaper than human eval but suffers from position bias, verbosity bias, and self-preference bias (favouring outputs from its own model family).

BLEU

Bilingual Evaluation Understudy

A classic translation metric that compares n-gram overlap between machine output and reference text.

An n-gram precision metric originally developed for machine translation. Largely superseded by neural metrics and LLM judges for modern generation tasks.

ROUGE

Recall-Oriented Understudy for Gisting Evaluation

A summarisation metric that measures n-gram recall between generated and reference summaries.

A family of recall-oriented n-gram overlap metrics used historically for summarisation evaluation.

Truthfulness

Whether a model's confident statements are actually true — distinct from being fluent or persuasive.

Whether a model's assertions are factually correct. Distinct from fluency and helpfulness; evaluated by benchmarks like TruthfulQA and SimpleQA.

💬 Prompting & Agents

15 terms

Prompt

The text you send to an AI to ask it to do something.

The input text given to an LLM. Comprises any combination of system instructions, user messages, examples, and retrieved context.

System Prompt

A standing instruction that sets the model's role, tone, and rules before the conversation starts.

An initial instruction that conditions an LLM's behaviour across the conversation, typically containing role, constraints, format, and safety rules.

Chain-of-Thought

CoT

Asking a model to think step by step before answering — usually gives much better results on hard problems.

A prompting technique that elicits intermediate reasoning steps before a final answer. Significantly improves accuracy on math, reasoning, and multi-step tasks.

Few-Shot Learning

Showing the model a handful of input-output examples in the prompt so it understands the pattern.

Prompting an LLM with a small number of demonstrations of the desired input-output mapping, without any weight updates.

Zero-Shot Learning

Asking the model to do something with no examples — just a clear instruction.

Prompting an LLM to perform a task using only instructions, with no demonstrations. Relies on the model's pretraining and instruction tuning.

In-Context Learning

ICL

An LLM's ability to learn a new task purely from examples in the prompt, without retraining.

The ability of LLMs to adapt their behaviour based on demonstrations in the prompt, without any parameter updates. Emerges at scale.

Prompt Engineering

The craft of writing prompts that get the best behaviour out of a model.

The practice of designing prompts to elicit reliable, high-quality model behaviour. Covers structure, examples, format constraints, and role framing.

Prompt Injection

An attack where malicious text in the input hijacks the model's instructions.

A security vulnerability where untrusted input contains instructions that override the system prompt. Especially dangerous in agents that read web pages or emails.

Jailbreak

Tricking a model into ignoring its safety rules through clever prompting.

A prompt that bypasses an LLM's safety training and elicits content the model was tuned to refuse. Distinct from prompt injection: in a jailbreak the user is the attacker, whereas prompt injection arrives through third-party content the model reads.

Function Calling

When an LLM produces a structured request to call a tool — the basis of agents and integrations.

A feature where an LLM, given tool schemas, outputs structured JSON specifying which tool to call and with what arguments. The foundation of modern agent frameworks.
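A minimal sketch of the application side of function calling: parsing the model's structured output and dispatching it to a registered tool. The JSON shape (`name` and `arguments`), the `get_weather` tool, and its stubbed result are all hypothetical and do not follow any particular vendor's API.

```python
import json

def get_weather(city: str) -> str:
    """Stand-in tool; a real one would call a weather service."""
    return f"Sunny in {city}"

# Registry mapping tool names the model may emit to real functions.
TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Parse a tool call emitted by the model and run the matching function."""
    call = json.loads(model_output)
    name, arguments = call["name"], call["arguments"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**arguments)

result = dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
```

In an agent loop the returned string would be fed back to the model as a tool result so it can continue. Rejecting unknown tool names matters because the model's output is untrusted input.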

Tool Use

An LLM calling external tools — code, search, APIs — to do things it can't do on its own.

The general capability of an LLM to invoke external systems (code execution, search, APIs) via structured outputs, expanding its action space beyond text generation.

Agent

An LLM that doesn't just answer once — it plans, takes actions, observes results, and keeps going until done.

An LLM-driven system that autonomously plans, executes tools, observes results, and iterates toward a goal across multiple turns.

Agentic Workflow

A multi-step process where an AI agent makes decisions about what to do next at each step.

A workflow where an LLM agent makes routing or tool-selection decisions at each step, rather than following a fixed pipeline.

MCP

Model Context Protocol

An open standard from Anthropic for connecting LLMs to tools and data sources in a consistent way.

An open protocol introduced by Anthropic that standardises how LLM applications connect to external tools, resources, and prompts. Decouples agents from tool implementations.

ReAct

Reason + Act

A prompting pattern where the model alternates between thinking and taking actions until it solves the task.

A prompting pattern where the model interleaves reasoning ('Thought:') and acting ('Action:') steps, observing tool outputs between them. Foundational to early agent frameworks.

🔤 Tokens & Text

8 terms

Token

A chunk of text — roughly a word or a piece of one — that the model actually sees and processes.

The atomic unit of input/output for an LLM. Created by a tokenizer; typically 1 token ≈ 4 characters or ~0.75 English words.

Tokenizer

The component that splits text into tokens before feeding it to the model.

The component that converts raw text to and from token IDs. Modern LLMs use subword tokenizers (BPE, WordPiece, SentencePiece).

BPE

Byte Pair Encoding

A common tokenisation algorithm that merges frequent character pairs to build a vocabulary.

A subword tokenisation algorithm that iteratively merges the most frequent pair of adjacent symbols. Used by GPT, Llama, and most modern LLMs.
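A minimal sketch of a single BPE training step: count adjacent symbol pairs across a word-frequency table, then merge the most frequent pair everywhere. The tiny corpus and its counts are made up, and real tokenizers add details such as byte-level fallback and end-of-word markers.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to how often that word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the adjacent pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
}
pair = most_frequent_pair(corpus)
corpus_after = merge_pair(corpus, pair)
```

Here the pair ("w", "e") is the most frequent, so "we" becomes a new vocabulary symbol. Repeating these two steps until the vocabulary reaches a target size yields the full merge table.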

Vocabulary

The full set of tokens a model knows — typically tens or hundreds of thousands.

The complete set of token IDs a tokenizer can produce. Modern LLMs use vocabularies of 32K–256K tokens.

Sequence Length

The number of tokens in a particular input or output.

The token count of a specific input or output. Bounded by the model's context window.

Special Tokens

Reserved tokens that mark structure — start of message, end of turn, system role, etc.

Reserved tokens in a vocabulary used for structural purposes — beginning-of-sequence, end-of-turn, role markers, padding.

Stop Sequence

A string that tells the model to stop generating once it produces it.

A token or string that, when produced by the model, terminates generation. Used to enforce response boundaries in structured outputs and chat formats.
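A minimal sketch of how a client might enforce stop sequences on already-generated text by cutting at the earliest match. Serving stacks usually check during generation so that no extra tokens are produced; this post-hoc version is a simplification, and the example strings are invented.

```python
def apply_stop_sequences(text, stops):
    """Return text truncated just before the earliest stop sequence, if any."""
    cut = len(text)
    for stop in stops:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

raw = "Answer: 42\nUser: and the next one?"
clean = apply_stop_sequences(raw, ["\nUser:", "###"])
```

The model's attempt to continue the conversation as the user is trimmed away, leaving only its own turn.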

Sampling

How a model picks the next token from its probability distribution — random or deterministic.

The decoding strategy used to choose the next token from the model's predicted distribution. Includes greedy, top-K, top-P, and temperature-based methods.
