Chapter 5 · Modalities
Typing is the slow path. The fastest builders feed Claude their voice, their screen, and the screenshot of the thing they were just looking at — usually all in one prompt. This chapter is how to do that without gimmicks.
You learned to iterate text in the previous chapter. Here you expand the input side: voice on every surface that has it, multimodal context (screenshots, image paste, drag-drop) as a first-class signal, and the small prompting shifts that make voice prompts land as well as typed ones.
A typed prompt is a sentence you composed. A voice prompt is a thought you spoke. The two have different shapes, and different jobs.
Voice wins when the thought is still messy: a goal, a hunch, a bug described by its symptoms. Typing wins when the prompt needs exact references: file paths, identifiers, the constraint you want followed to the letter.
Most builders end up alternating, sentence by sentence. Voice the messy thought; type the precise reference.
Claude listens differently on each surface. The shortcut, the latency, the transcript quality — each one feels distinct. You'll pick favorites within a week of trying all four.
The Claude desktop app has a built-in microphone button next to the prompt input. Click it, speak, click again. Transcript appears live in the prompt area. Edit it before sending if you want.

This is the surface builders use most. No setup, no third-party tools, and transcription is already good. The one trick worth knowing: dictate the rough thought, then edit the transcript before sending — voice + light edit beats voice-and-send every time.
Claude Code's terminal CLI has voice dictation built in. Run /voice once to enable it.
After that:
- Hold Space to record, release to insert the transcript. Review, edit if you want, press Enter to send.
- /voice tap: tap Space to start, tap again to stop. Auto-submits when the transcript is at least three words.

Two things make /voice better than a generic dictation tool:
- Coding-aware vocabulary: regex, OAuth, JSON, localhost transcribe correctly. Your current project name and git branch are added as recognition hints automatically.
- A language setting, in /config or your settings file. Defaults to English.

The fine print: audio streams to Anthropic for transcription (not local). Free — doesn't consume tokens or count toward /usage limits. Requires a Claude.ai account (not API key / Bedrock / Vertex / Foundry). Doesn't work over SSH or remote sessions — needs local microphone access. Available since Claude Code v2.1.69; tap mode since v2.1.116.
/voice is the right default for the terminal. Reasons to add a system-wide tool on top: you want one dictation tool that works in every app, not just the terminal, or you're somewhere /voice doesn't work (over SSH, or on an API-key / Bedrock / Vertex setup).
Wispr Flow ↗ is the popular default — quiet, hold-a-hotkey, types the transcript into whichever app is foreground. Free tier handles most builder volume.
Super Whisper ↗ runs Whisper locally. Same hotkey-driven flow. Choose this when privacy matters more than Wispr's UX.
macOS Dictation ↗ is the zero-install baseline (press Fn twice, by default). Use it to confirm your hardware works before paying for anything; not the daily driver.
The Claude iOS and Android apps both have native dictation. The interaction shape is different from desktop: you're usually one-handed, often walking, often capturing a thought rather than executing a task.
Mobile is where voice does its most distinctive work — you couldn't have typed that prompt; you wouldn't have opened a laptop to capture the thought; voice is what makes the capture possible at all.
The web app at claude.ai has a microphone button identical to the desktop app's. Use this when you're on a borrowed machine, on Linux, or in an environment where you don't want to install the app.
Voice expands what you can say to Claude. Multimodal expands what you can show.
A pasted screenshot is a prompt's missing context made visible. Three places it lands harder than any equivalent paragraph:
Drag a Figma export or screenshot of the page into the prompt. Ask: "Wire this up against my existing components." Claude reads the visual, names the components it sees, asks about the ones it can't infer, and writes the JSX. You skipped the description-of-the-mockup paragraph entirely.

Stack trace in your terminal? Browser devtools panel with a network failure? Don't transcribe it. Screenshot it (Cmd+Shift+4), paste it into Claude, ask "what's going on?" The error gets read with surrounding context — the line numbers, the colors, the full call site — that copy-paste loses.
Sketch a UI on paper. Snap it with your phone. Paste into Claude. Get a working component. The sketch doesn't need art-school fidelity; the workflow is real and saves an hour versus a Figma round-trip.
The pattern that compounds most: voice the goal, paste the artifact, type the constraint. Three modalities, one prompt:
(voice) "Add a metrics dashboard to this page." (paste screenshot of current page) (type)
Use the existing /lib/metrics queries; new component goes in app/components/metrics-card.tsx.
Voice carries intent. Paste carries context. Type carries the precise reference. None of the three alone produces what the three together do.
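For a sense of what that combined prompt tends to produce, here's a sketch of the resulting component. It assumes a Next.js-style app directory and a getMetrics() helper exported from /lib/metrics (imported below via an @/lib/metrics alias); both are stand-ins for whatever your real module exports, and the class names are placeholder styling.

```tsx
// app/components/metrics-card.tsx — a sketch, not Claude's literal output.
// Assumption: /lib/metrics exposes a getMetrics() helper returning label/value/delta rows;
// swap in your real query and field names.
import { getMetrics } from "@/lib/metrics";

type Metric = {
  label: string; // e.g. "Weekly active users"
  value: number; // current reading
  delta: number; // change vs. the previous period, as a fraction (0.12 = +12%)
};

export default async function MetricsCard() {
  // Server component: fetch once, render a static grid.
  const metrics: Metric[] = await getMetrics();

  return (
    <section className="grid grid-cols-3 gap-4">
      {metrics.map((m) => (
        <article key={m.label} className="rounded-lg border p-4">
          <h3 className="text-sm text-gray-500">{m.label}</h3>
          <p className="text-2xl font-semibold">{m.value.toLocaleString()}</p>
          <p className={m.delta >= 0 ? "text-green-600" : "text-red-600"}>
            {(m.delta * 100).toFixed(1)}% vs. last period
          </p>
        </article>
      ))}
    </section>
  );
}
```

The typed constraint is what pinned the file path and the data source; the pasted screenshot is what told Claude what the page already looks like.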
Voice changes how a prompt sounds. The substance shouldn't change, but the surface does: speak the goal as you'd say it out loud, leave exact paths and identifiers for a typed edit, and review the transcript before sending.
Different work, different surface:
| Work | Surface | Why |
|---|---|---|
| Daily flow on a real codebase | Terminal + /voice | The agent runs there; built-in dictation is coding-aware |
| Dictating into desktop, browser, Slack — anywhere | Wispr Flow / Super Whisper | One system-wide tool beats per-app mics |
| Reviewing diffs, asking "why" | Desktop app | Live transcript, easy to pivot from voice to type |
| Capturing a thought during a walk | Mobile | The thought wouldn't have survived to a laptop |
| Pairing with someone else watching | Desktop app + cast | Voice + visible transcript reads as natural |
| Debugging UI / Figma / mockup | Desktop app + paste | Multimodal lives there |
Try this once, on a real task, today: voice the goal, paste the artifact, type the constraint, edit the transcript, then send.
What you'll notice: the voiced part lands the goal the way you actually thought it. The pasted part anchors the context. The edit catches the imprecision. Three modalities, one prompt, in less time than typing the equivalent.
The next chapter — Ecosystem — covers the broader stack: skills, MCP servers, plugins. Voice and multimodal are the input layer; that chapter is the surrounding tooling.