How I Use LLMs

Cleaned transcript — original talk by Andrej Karpathy
Video: youtube.com/watch?v=EWvNQjAaOHw
Author: karpathy.ai
This is a cleaned version of the auto-generated YouTube transcript — punctuation added, filler removed, restructured for readability. Not a verbatim transcript. For exact quotes, please refer to the original video.

Introduction

Hi everyone. In this video I want to continue our general-audience series on large language models like ChatGPT. In the previous video — "Deep Dive into LLMs" — we went into the under-the-hood fundamentals of how these models are trained and how to think about their cognition and psychology. In this video I want to go into practical applications. I'll show you lots of examples, walk through the settings that are available, and show you how I use these tools and how you can use them in your own life and work.

The Growing LLM Ecosystem

The web page we'll start with is ChatGPT. As you might know, ChatGPT was developed by OpenAI and deployed in 2022 — the first time people could just talk to a large language model through a text interface. It went viral.

Since then, the ecosystem has grown enormously. In 2025 there are many apps like ChatGPT. ChatGPT by OpenAI is the original incumbent: the most popular and most feature-rich because it's been around the longest. But there are many others.

To keep track of them: Chatbot Arena (lmarena.ai) ranks models by blind human comparisons. The SEAL leaderboard from Scale AI is another resource for evaluating models across a wide variety of tasks.

ChatGPT Interaction Under the Hood

The most basic form of interaction: give the model text, get text back. Ask for a haiku about what it's like to be an LLM, and it responds. Under the hood, your query and its response are chopped into tokens — little text chunks. You can see this at tiktokenizer: paste your text, select GPT-4o, and you'll see your query as a sequence of token IDs.

Because this is a conversation, there's more going on than just the tokens. There's a chat format with special tokens marking turns: <|im_start|>user<|im_sep|> … content … <|im_end|>, and the same for the assistant. What looks to us like chat bubbles is, under the hood, one shared token stream we're building together.
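A minimal sketch of how chat turns could be flattened into that single stream. The special token strings are the ones quoted above. The function name and the sample messages are made up for illustration, and a real tokenizer would map each special token to a single ID rather than keep it as text.

```python
# Special token strings as quoted in the talk; treated here as plain text.
IM_START = "<|im_start|>"
IM_SEP = "<|im_sep|>"
IM_END = "<|im_end|>"


def render_conversation(messages):
    """Flatten (role, content) turns into one shared text stream."""
    return "".join(
        f"{IM_START}{role}{IM_SEP}{content}{IM_END}" for role, content in messages
    )


# Placeholder turns; the assistant text stands in for a real model reply.
conversation = [
    ("user", "Write a haiku about being an LLM."),
    ("assistant", "(the model's haiku goes here)"),
]
stream = render_conversation(conversation)
print(stream)
```

Each new turn just extends the same stream, which is why a long conversation keeps growing the context window.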

When you click New Chat, you wipe the token window — resetting the context to zero. The context window is the model's working memory. Anything inside it is directly accessible. Anything outside it is not.

The entity you're talking to: think of it as a 1 TB zip file on a disk. Pre-training compressed the entire internet into the parameters of this neural network — lossy and probabilistic, like vague recollection rather than verbatim recall. Post-training attached a personality: human labelers at OpenAI wrote example conversations, and the model learned to imitate that assistant persona.

"Hi, I'm ChatGPT. I'm a 1 TB zip file. My knowledge comes from the internet, which I read in its entirety about six months ago — and I only remember it vaguely. My winning personality was programmed by example by human labelers at OpenAI."

Basic LLM Interactions: Examples

A few practical examples of how I actually use these models:

How much caffeine is in one shot of an Americano? ChatGPT says roughly 63 mg. Why is this a good query? Because (1) the information isn't recent, and (2) it's discussed extensively online, so the model has good recall. Verify against primary sources if needed.

Medical information — recently I was sick with a runny nose and asked about medications. ChatGPT walked me through the ingredients in DayQuil and NyQuil. I took out the box and checked the ingredient list. Everything checked out. Common information, not high-stakes, and the model's knowledge is solid on it. But always verify what matters.

Two habits worth developing:

  1. Start a new chat when switching topics. Irrelevant tokens in the context window can distract the model and slightly slow it down. Keep the working memory clean.
  2. Be aware of what model you're using. In the top-left dropdown you can see the current model. If you're not logged in, you may be using a smaller, less capable model without knowing it.

Be Aware of the Model You're Using: Pricing Tiers

ChatGPT has three tiers for individuals: Free, Plus ($20/month), and Pro ($200/month). The free tier gives access to GPT-4o mini — a smaller model that's less creative, has less knowledge, and hallucinates more. The Plus tier gives 80 messages per 3 hours on GPT-4o, the flagship model. The Pro tier gives unlimited GPT-4o and access to the most powerful reasoning models.

The same logic applies across providers. Claude has a free tier and a Professional plan. Gemini, Grok — all of them. Bigger models cost more to run, so companies charge more. Make the tradeoff for yourself. If you're using these professionally, the top tier is probably worth it.

My approach: I treat all these models as my LLM Council. For a big decision — where to go on vacation, a hard technical question — I'll ask multiple models and compare. Each has its strengths.

Thinking Models and When to Use Them

We saw in the previous video that models go through three training stages: pre-training, supervised fine-tuning, and reinforcement learning. RL is where the model gets to practice on large collections of math and code problems and discovers thinking strategies that work. These strategies — trying different approaches, backtracking, revising assumptions — emerge from RL rather than being explicitly programmed.

A thinking model is one that has been additionally tuned with RL. Qualitatively, it will do more thinking before responding, taking longer but achieving higher accuracy, especially on hard problems in math, code, and reasoning. For simple factual or conversational queries, there's no benefit to waiting.

Concrete example: I had a gradient-check bug in some neural network code. GPT-4o gave me a list of things to double-check but didn't find the actual problem. When I switched to o1 Pro (OpenAI's top reasoning model, Pro tier only), it thought for one minute, identified a mismatch in how parameters were packed and unpacked, and solved the issue exactly.

I also tried the same prompt on:

For OpenAI: models starting with "o" are thinking models — o1, o3-mini, o3-mini-high, o1 Pro. For Grok: toggle "Think" before submitting. Other providers have similar toggles.

My practice: try the fast non-thinking model first. If the result seems off, switch to a thinking model.

Tool Use: Internet Search

So far we've been talking to the model as a zip file — inert, closed off, just a neural network emitting tokens. Now we want to give it tools.

The problem: the model's knowledge is frozen at its training cutoff, which might be six months to a year ago. Questions about recent events won't be answered reliably from parametric knowledge.

The solution: web search. The model emits a special token signaling a search query. The application pauses generation, runs the search, scrapes the web pages, dumps their text into the context window, and resumes generation. Now the model has the information in its working memory and can answer.
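The loop described above can be sketched roughly as follows. Everything here is a stand-in: `fake_model`, `fake_search`, and the `<search>` marker are hypothetical, and real providers use their own special tokens and search backends.

```python
SEARCH_PREFIX = "<search>"  # hypothetical marker; real special tokens differ by provider


def fake_model(context):
    """Stand-in for an LLM: asks for a search until results appear in context."""
    if "[search results]" not in context:
        return SEARCH_PREFIX + "latest stable Python release"
    return "Answer based on the retrieved pages."


def fake_search(query):
    """Stand-in for the application's search-and-scrape step."""
    return f"[search results] pages retrieved for: {query}"


def answer_with_search(question, model=fake_model, search=fake_search, max_steps=3):
    """Run the model, pausing to search whenever it emits the search marker."""
    context = question
    for _ in range(max_steps):
        output = model(context)
        if output.startswith(SEARCH_PREFIX):
            # Generation pauses; the app runs the search and appends the page text.
            query = output[len(SEARCH_PREFIX):]
            context += "\n" + search(query)
            continue
        return output, context
    return "No answer within the step budget.", context


answer, final_context = answer_with_search("What is the latest Python release?")
print(answer)
```

The key point is that the search results end up inside the context window, so the final answer is read from working memory instead of recalled from the parameters.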

Different providers handle this differently:

Examples from my own use where I reach for search:

General rule: if the answer could be obtained by doing a Google search and skimming the top links, use the search tool.

Tool Use: Deep Research

Deep research is a combination of internet search and thinking, run for a sustained period — tens of minutes. The model issues many searches, processes lots of papers and pages, accumulates a large context window, and produces something close to a custom research report.

OpenAI introduced this as a Pro-tier feature. Perplexity (whose version is also called "Deep Research") and Grok ("Deep Search") have cloned it.

Concrete example: I wanted to understand Ca-AKG, a supplement in Bryan Johnson's longevity mix. I gave ChatGPT a detailed research prompt and turned on Deep Research. It asked clarifying questions, went off for about 10 minutes, looked at 27 sources, and produced a report covering research in worms, flies, and mice; ongoing human trials; proposed mechanisms of action; and safety concerns with citations I could follow up on.

My take: ChatGPT's Deep Research currently produces the most thorough output. Perplexity and Grok are shorter. In all cases: treat this as a first draft. Hallucinations are still possible. Follow the citations and verify what matters.

More examples:

File Uploads: Adding Documents to Context

Instead of having the model rely on parametric knowledge, you can give it concrete documents — PDFs, text files, web pages — via file upload. The document gets loaded into the context window, putting it directly in working memory.

Reading papers: I uploaded a PDF about Evo 2, a large biological foundation model trained on DNA sequences. Drag, drop, ask for a summary. Then go through the paper asking questions as I read. The model is like a knowledgeable collaborator reading alongside you.

Reading books: I almost never read books alone anymore. For The Wealth of Nations — written in 1776 — I go to Project Gutenberg, find the chapter I'm reading, copy-paste it into Claude or ChatGPT with "Please summarize this chapter," then read the chapter myself while asking questions. LLMs make old texts and texts outside your domain dramatically more accessible. I don't know of a tool that integrates this cleanly, but copy-pasting works remarkably well.

Practical note: when accuracy matters, ask the model to first transcribe what it sees so you can verify. Claude likely converts a PDF to plain text; images are probably discarded, and even where they are kept they may not be understood well.

Tool Use: Python Interpreter

Beyond knowledge retrieval, we can give models the ability to write and execute code.

Simple multiplication: ChatGPT computes 30 × 9 in its head. For a large multiplication that it couldn't possibly do mentally, it opens a Python interpreter, writes a one-liner, executes it, and returns the exact result. OpenAI has trained ChatGPT to recognize when it needs tools.
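For a sense of what such a one-liner looks like, here is a small sketch. The operands are arbitrary and are not the numbers used in the video.

```python
# Arbitrary operands chosen for illustration; they are not from the talk.
a = 123_456_789
b = 987_654_321

# Python integers have arbitrary precision, so the product is exact.
product = a * b
print(product)
```

Because the interpreter computes the product exactly, the answer doesn't depend on the model's token-by-token mental arithmetic.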

State of the ecosystem: not all models have a Python interpreter. Grok 3 appeared not to, and would attempt difficult multiplication in its head — getting remarkably close but wrong. Claude wrote JavaScript and got it right. Gemini 2.0 Pro appeared to calculate directly and was sometimes correct, sometimes wrong on harder problems. Keep track of which models have which tools.

ChatGPT Advanced Data Analysis: Figures and Plots

Advanced Data Analysis turns ChatGPT into a junior data analyst. It can gather data, write code to plot it, fit trend lines, and generate figures inline.

Example workflow: ask for OpenAI's valuation history by year (with search), then plot on a log scale, then fit a trend line and extrapolate to 2030. ChatGPT wrote and ran Python, showed the figure, and reported the extrapolation.

Caveat: the model silently substituted $100M for the N/A valuation in 2015 without telling me. It also verbally reported the wrong 2030 figure while the plot showed something much larger. When pressed, it acknowledged the error. The model is a capable but absent-minded junior analyst. Read the code. Verify the numbers.
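A sketch of the kind of code worth reading in this workflow: a least-squares fit on the logarithm of the values, with missing entries dropped explicitly instead of silently filled in. The series below is synthetic (it doubles every year) and is not real valuation data. The function names are made up for this example.

```python
import math

# Synthetic placeholder series (doubling each year); NOT real valuation data.
years = [2019, 2020, 2021, 2022, 2023, 2024]
values = [None, 1.0, 2.0, 4.0, 8.0, 16.0]

# Drop missing entries explicitly rather than substituting a guess.
known = [(y, v) for y, v in zip(years, values) if v is not None]
xs = [y for y, _ in known]
ys = [v for _, v in known]


def fit_log_linear(xs, ys):
    """Least-squares fit of log(y) = slope * x + intercept."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    slope = sum((x - mean_x) * (lg - mean_log) for x, lg in zip(xs, logs)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_log - slope * mean_x
    return slope, intercept


def extrapolate(slope, intercept, x):
    """Evaluate the fitted exponential trend at x."""
    return math.exp(slope * x + intercept)


slope, intercept = fit_log_linear(xs, ys)
projection_2030 = extrapolate(slope, intercept, 2030)
print(f"growth factor per year: {math.exp(slope):.2f}")
print(f"projected value in 2030: {projection_2030:.1f}")
```

Comparing the printed projection against what the model says in prose is a quick way to catch the kind of mismatch described above.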

Claude Artifacts: Apps and Diagrams

Claude's Artifacts feature renders code directly in your browser.

Flashcard app example: I copied the Adam Smith Wikipedia introduction, asked Claude to generate 20 flashcards, then asked it to "use the Artifacts feature to write a flashcard app." Claude wrote React code, hardcoded the Q&A pairs, and rendered a working flashcard app inline — with flip animations, correct/incorrect tracking, and shuffle. No backend required.

The use case I find most valuable: diagram generation. I give Claude a chapter from a book and ask for a "conceptual diagram." It writes Mermaid diagram code, renders it inline, and produces a tree-structured map of the chapter's argument. Extremely useful if you're a visual thinker. I use this for books, papers, and source code.

Cursor: Writing Code Professionally

I spend most of my professional time writing code, but I don't use ChatGPT for code snippets — the context is too limited. Instead I use Cursor, a desktop app (also: VS Code, Windsurf) that has access to all the files on your filesystem.

Cursor uses Claude 3.7 Sonnet via API under the hood, but handles all the context management automatically — it knows your whole codebase, not just what you paste in.

The feature I use most is Composer (Cmd+I), especially with the agent integration. It operates autonomously across multiple files: edits code, runs commands, installs packages.

Example: I started a new project, asked Composer to set up a React repo and build Tic-Tac-Toe. It wrote all the CSS, TypeScript, and JavaScript. Then I said "when someone wins, add confetti." It installed the react-confetti library, updated the component, added winning-cell CSS highlighting, and downloaded a victory sound file. The whole thing worked.

This is what people call vibe coding — giving control to Composer and narrating what you want. If it gets stuck, you can always fall back to reading and editing the files yourself.

Audio (Speech) Input/Output

Roughly half my queries are spoken, not typed. On mobile it's closer to 80%.

On mobile: the ChatGPT app has a microphone icon — tap it, speak, tap again, and your speech is transcribed to text.

On desktop: ChatGPT's web app doesn't have a transcription microphone. I use Superwhisper (also: Wispr Flow, MacWhisper), a system-wide app that listens on a hotkey (I use F5), transcribes, and pastes. It works across any app. For product names or library names that don't transcribe well, I type instead.

Output: most apps have a "Read Aloud" button. You can also find system-wide text-to-speech apps.

Don't type when you can speak. It's faster, and it works well.

Advanced Voice Mode: True Audio Inside the Model

What I described above is "fake audio" — converting between audio and text using separate models, while the LLM itself still processes text.

True audio (ChatGPT's "Advanced Voice Mode") is different. The model processes audio tokens natively — a spectrogram is chunked into quantized tokens, the model is trained on them alongside text, and it can both understand and generate audio directly. No text in the loop.

The result: the model can change speaking style, voice character, pacing, and emotion in ways that text-to-speech can't. I demonstrated: normal voice → Yoda voice → pirate voice, counting fast from 1 to 20, animal sounds.

Caveat: ChatGPT's Advanced Voice Mode is quite conservative and refuses a lot. Grok's advanced voice (on mobile) has more personality modes — default, romantic, unhinged, conspiracy, and others — and is less restricted. Advanced Voice Mode is included in Plus and Pro tiers.

NotebookLM: Podcast Generation

NotebookLM (notebooklm.google.com) from Google: upload any sources — PDFs, web pages, text files — and they enter the model's context. You can then (1) chat with the content or (2) click "Generate" to get a custom podcast about those sources.

I uploaded the Evo 2 paper and got a ~30-minute podcast. The hosts discuss the content naturally, you can break in with questions in interactive mode, and you can customize the angle with special instructions before regenerating.

I use this when I'm heading out for a walk or a long drive and want something to listen to that no existing podcast covers. I've also published a series on Spotify called "Histories of Mysteries" made from generated podcasts on topics I was curious about.

Image Input: OCR and Visual Queries

Just as audio and text are tokenized, images can be chunked into patches, quantized to a vocabulary of ~100,000 patch tokens, and fed to the same Transformer. The model just models statistical patterns — it doesn't know some tokens are text and others are image patches.
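A toy sketch of the patch-and-quantize idea. The nearest-codebook lookup is an assumption about how the quantization could work (the talk only says patches are quantized to a vocabulary), and the 4x4 image, the three-entry codebook, and the function names are invented for illustration. It also assumes the image dimensions are divisible by the patch size.

```python
def split_into_patches(image, size):
    """Cut a 2D grid (list of rows) into size x size patches, each flattened."""
    patches = []
    for top in range(0, len(image), size):
        for left in range(0, len(image[0]), size):
            patch = [
                image[r][c]
                for r in range(top, top + size)
                for c in range(left, left + size)
            ]
            patches.append(patch)
    return patches


def nearest_code(patch, codebook):
    """Index of the codebook entry with the smallest squared distance."""
    def dist(entry):
        return sum((p - e) ** 2 for p, e in zip(patch, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


# Toy 4x4 grayscale image and a 3-entry codebook; real vocabularies are far larger.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [5, 5, 0, 0],
    [5, 5, 0, 0],
]
codebook = [[0, 0, 0, 0], [5, 5, 5, 5], [9, 9, 9, 9]]

tokens = [nearest_code(p, codebook) for p in split_into_patches(image, 2)]
print(tokens)
```

The output is just a list of integer IDs, which is why the Transformer can treat image patches the same way it treats text tokens.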

Examples from my own use:

Technique: on macOS, Ctrl+Shift+Cmd+4 lets you draw a selection and copy it to the clipboard (without Ctrl, the screenshot is saved to a file instead). Cmd+V pastes it directly into the chat. Very fast for grabbing snippets of your screen.

Image Output: DALL-E, Ideogram, and Others

ChatGPT uses DALL-E 3 for image generation. Given an arbitrary prompt, it produces high-quality stylized images. Under the hood, ChatGPT writes a caption and passes it to DALL-E (a separate model) — the language model itself doesn't generate images directly.

I use it occasionally for content creation — thumbnails and icons for YouTube videos. I also use Ideogram, a competitor. I don't use it daily.

Video Input: Point and Talk on Mobile

Advanced Voice Mode on the mobile app includes a video mode. You tap a camera icon, and the model can see what your camera sees. I demonstrated: identifying books on my shelf, a CO2 monitor and its reading, acoustic foam panels, a map of Middle-earth.

Under the hood it's likely sampling one image per second rather than true video, but it feels continuous as a user. I don't use it in my daily work, but I'd show it to parents or grandparents — pointing at things and asking questions is an extremely natural interface.

Video Output: Sora, Veo 2, and Others

There are many AI video generation tools now, all rapidly improving. I don't use them heavily. Veo 2 (Google) is near state-of-the-art at time of filming. Sora (OpenAI) and others are competitive. All of them are worth being aware of if you work in a creative field.

ChatGPT Memory and Custom Instructions

Memory: ChatGPT can save information from conversation to conversation. It doesn't always trigger automatically — sometimes you have to ask ("please remember this"). Once saved, that text is prepended to future conversations so the model has context about you.
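One way to picture the mechanism is a small store of notes that gets written at the top of every new conversation. This is only a sketch: the `MemoryStore` class, its method names, and the header wording are hypothetical and do not describe how ChatGPT actually implements memory.

```python
class MemoryStore:
    """Hypothetical store of saved notes about the user."""

    def __init__(self):
        self.notes = []

    def remember(self, note):
        """Save a note once; duplicates are ignored."""
        if note not in self.notes:
            self.notes.append(note)

    def forget(self, note):
        """Delete a saved note (raises ValueError if it was never saved)."""
        self.notes.remove(note)

    def start_conversation(self, first_message):
        """Build the opening context: saved notes first, then the new message."""
        header = "\n".join(f"- {n}" for n in self.notes)
        if header:
            return f"Known about the user:\n{header}\n\n{first_message}"
        return first_message


# Illustrative notes only.
memory = MemoryStore()
memory.remember("Learning Korean")
memory.remember("Prefers direct answers")
opening = memory.start_conversation("Recommend a movie.")
print(opening)
```

Since the notes are ordinary text in the context window, viewing, editing, or deleting them changes what the model sees at the start of the next chat.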

Over time, the model accumulates context about your preferences, interests, and background. It starts making more relevant recommendations. I wasn't sure about this feature initially, but I've come around — it noticeably improves personalization. You can view, edit, and delete memories in Settings.

(Memory appears to be unique to ChatGPT at time of filming; other providers don't yet have this.)

Custom Instructions: in Settings → Customize ChatGPT, you can tell it how to talk to you and what to know about you. I've told it to drop the HR-speak, be direct, and lean toward educational explanations. I've also told it that I'm learning Korean and want responses to default to a specific formality level. Any persistent preference belongs here.

Custom GPTs

Custom GPTs are saved prompts with a name, description, and instructions. They save you from re-explaining a task every time.

Korean vocabulary extractor: give it a sentence; it extracts vocabulary in dictionary form, formatted Korean ; English, ready to paste into Anki flashcards. Under the hood: just a few-shot prompt describing the task with four input/output examples.

Korean detailed translator: give it a sentence; it translates it word-by-word and morpheme-by-morpheme, showing exactly how each piece maps to English. Far better than Google Translate for language learners. Includes examples so the output format is consistent.

Korean caption reader: screenshot a subtitle from a Korean TV show, paste it in, and this GPT OCRs it, translates it, and breaks it down. Instructions give the model context (a TV show), specify an OCR step, and then the same translation format.

The general principle: few-shot prompts generally outperform zero-shot. Don't just describe the task; give examples of exactly what input produces what output. Custom GPTs are just a way to save those prompts so you don't retype them.
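A minimal sketch of assembling such a few-shot prompt for the vocabulary extractor. The instruction wording and the example sentences were written for this sketch and are not the ones used in the talk. The `Korean ; English` output format follows the description above.

```python
INSTRUCTIONS = (
    "Extract the vocabulary from the Korean sentence in dictionary form. "
    "Output one entry per line as: Korean ; English"
)

# Illustrative examples written for this sketch, not the ones used in the talk.
EXAMPLES = [
    ("저는 학교에 갑니다", ["학교 ; school", "가다 ; to go"]),
    ("물을 마셨어요", ["물 ; water", "마시다 ; to drink"]),
]


def build_few_shot_prompt(sentence, examples=EXAMPLES, instructions=INSTRUCTIONS):
    """Combine the task description, worked examples, and the new input."""
    parts = [instructions, ""]
    for source, entries in examples:
        parts.append(f"Input: {source}")
        parts.append("Output:")
        parts.extend(entries)
        parts.append("")
    parts.append(f"Input: {sentence}")
    parts.append("Output:")
    return "\n".join(parts)


prompt = build_few_shot_prompt("친구를 만났어요")
print(prompt)
```

Ending the prompt on an open "Output:" line invites the model to continue in the same format as the examples.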

Summary

There is a rapidly growing, changing, and thriving ecosystem of LLM apps. ChatGPT is the first and the incumbent — most feature-rich overall. But competitors are either reaching feature parity or exceeding ChatGPT in specific areas: Perplexity for search, Claude for artifacts and diagrams, Grok for voice personality.

Things to track when working across these apps:

  1. What you're talking to: a zip file with lossy recall of the internet plus a learned assistant persona.
  2. Pricing tier and model: bigger models mean more knowledge, better writing, fewer hallucinations. The difference matters for professional use.
  3. Thinking vs. non-thinking models: try without thinking first; switch to a reasoning model for hard math, code, or reasoning problems.
  4. Tool availability: internet search, Python interpreter, and code execution vary across providers and tiers. Models without tools will hallucinate results for tasks that require them.
  5. Modality: text, audio (fake vs. native), image input/output, video input/output — capabilities vary by app and tier.
  6. Quality-of-life features: file uploads, memory, custom instructions, custom GPTs — look for equivalents across providers.
  7. Web vs. mobile: some features (Advanced Voice, video) are mobile-only; others are desktop-only.

It's a bit of a zoo right now, but also an incredible time to be using these tools. I use them dozens of times a day — for code, research, reading, language learning, and much more. The main lesson: know your tools, know their limits, and always verify what matters.

Thanks for watching.