Concept

Context Window

concept context-window working-memory multi-source

Context Window

The context window is the finite sequence of tokens visible to the model during a single inference call. It is the model’s only working memory. Anything inside the context window is directly and precisely accessible; anything outside it does not exist from the model’s perspective.


Analogy

In the LLM OS framing:

  • Parameters (disk/long-term storage) = vague, probabilistic knowledge; like remembering something you read months ago.
  • Context window (RAM/working memory) = precise, immediate access; like having a document open in front of you.

What goes in the context window

A typical conversation context contains, in order:

  • A system message — hidden instructions from the operator (identity, tone, constraints, knowledge cutoff reminders).
  • Conversation history — all prior turns in the current chat, encoded with special role tokens.
  • Tool outputs — results of web searches, code execution, file reads dumped back in by the application.
  • The user’s current message.

Clicking “New Chat” clears the context window entirely — the model has no memory of previous conversations unless memory features (e.g., ChatGPT’s Memory) have saved summaries.


Implications for users

Put information in the window, don’t rely on recall. Parametric memory is vague and prone to Hallucination. If you need the model to work accurately with specific information (a paper, a book chapter, a dataset), paste it into the context rather than asking the model to recall it.

Context size is a precious resource. Irrelevant tokens in the context window can degrade output quality and slow the model down. Good practice: start a new chat when switching topics.

Long contexts have limits. Models have a maximum context length (e.g., 8,000 tokens for early models; 200,000+ tokens for modern models like Claude). For tasks requiring more context than fits, retrieval-augmented generation (RAG) provides a workaround: relevant passages are fetched from a document store and inserted into the window for each query.


Tokens to think

Each token in the output receives a fixed compute budget (one forward pass through the Transformer). Complex reasoning cannot be done in a single token — it must be distributed across many output tokens. This is why chain-of-thought prompting works: the model allocates its compute budget across a reasoning trace rather than trying to collapse the answer into a single step.

Consequence: forcing an answer into the first token (e.g., “Answer in one word”) prevents the model from computing a correct answer on anything that requires multi-step reasoning.


See also