WEBLLM(1)
NAME
webllm — chat with a Large Language Model running entirely in your browser
SYNOPSIS
    webllm
    webllm --list
    webllm --list-all
    webllm <number>
    webllm <model-id>
    webllm --unload
    webllm --cache
    webllm --rm <model-id>
    webllm --rm-all
DESCRIPTION
Runs an open-source LLM (Llama, Qwen, Gemma, Phi, SmolLM…) fully on the client, with no server involved in inference. Inference is accelerated by WebGPU, and the model weights are downloaded once and then cached by the browser. Powered by WebLLM (@mlc-ai/web-llm): the engine is a self-hosted module (/vendor/web-llm-<version>.js), so it loads under the site's own CSP without a third-party CDN. The model weights themselves are fetched from Hugging Face.
With no argument (or with --list) it prints a curated list of small, browser-friendly models. To start a chat, pick one by its number (webllm 1) or by its id (webllm Qwen2.5-1.5B-Instruct-q4f16_1-MLC); a unique substring of the id also works. A progress bar tracks the one-time model download. On a browser without WebGPU, an explanatory error is shown instead.
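The selection rule above (list number, full id, or unique substring) can be sketched as a small pure function. This is only an illustration under stated assumptions, not webllm's actual code: the CURATED array, the Resolution type and resolveModel are hypothetical names, the list contents are made up for the example, and case-insensitive matching is an assumption.

```typescript
// Hypothetical curated list; the real one is whatever `webllm --list` prints.
const CURATED: string[] = [
  "Qwen2.5-0.5B-Instruct-q4f16_1-MLC",
  "Qwen2.5-1.5B-Instruct-q4f16_1-MLC",
  "Llama-3.2-1B-Instruct-q4f16_1-MLC",
];

type Resolution =
  | { ok: true; id: string }
  | { ok: false; reason: "not-found" | "ambiguous" | "out-of-range"; candidates: string[] };

function resolveModel(arg: string, models: string[] = CURATED): Resolution {
  // An all-digit argument is treated as a 1-based index into the list.
  if (/^\d+$/.test(arg)) {
    const id = models[Number(arg) - 1];
    if (id !== undefined) return { ok: true, id };
    return { ok: false, reason: "out-of-range", candidates: [] };
  }

  // Assumption: matching ignores case.
  const needle = arg.toLowerCase();

  // An exact id wins even if it is also a substring of a longer id.
  const exact = models.find((m) => m.toLowerCase() === needle);
  if (exact !== undefined) return { ok: true, id: exact };

  // Otherwise the substring must identify exactly one model.
  const hits = models.filter((m) => m.toLowerCase().includes(needle));
  const only = hits[0];
  if (hits.length === 1 && only !== undefined) return { ok: true, id: only };

  return {
    ok: false,
    reason: hits.length === 0 ? "not-found" : "ambiguous",
    candidates: hits,
  };
}
```

With this list, resolveModel("Llama-3.2-1B") resolves to the single Llama entry, while resolveModel("Qwen2.5") is rejected as ambiguous because two ids contain it.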
Once the chat is ready, type a message and press Enter. The reply streams in token by token. The conversation keeps its context until you /reset or /exit. Ctrl+C interrupts a running generation and closes the session. The loaded model stays in GPU memory for the page session, so re-running webllm with the same model resumes instantly; webllm --unload frees it.
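One way to picture "keeps its context until you /reset" is a message list that grows with every turn and is emptied on reset, while the loaded model is left alone. The sketch below assumes role-tagged messages in the OpenAI chat style; the Conversation class, its method names and the optional system prompt are hypothetical, not taken from webllm's source.

```typescript
type Role = "system" | "user" | "assistant";

interface Message {
  role: Role;
  content: string;
}

class Conversation {
  private readonly systemPrompt: string | undefined;
  private messages: Message[] = [];

  constructor(systemPrompt?: string) {
    this.systemPrompt = systemPrompt;
    this.reset();
  }

  // Forget every turn; only the optional system prompt survives.
  reset(): void {
    this.messages =
      this.systemPrompt === undefined
        ? []
        : [{ role: "system", content: this.systemPrompt }];
  }

  addUser(text: string): void {
    this.messages.push({ role: "user", content: text });
  }

  // Called once a streamed reply has finished, with the joined tokens.
  addAssistant(text: string): void {
    this.messages.push({ role: "assistant", content: text });
  }

  // Copy of the history, as it would be sent with the next request.
  snapshot(): Message[] {
    return this.messages.map((m) => ({ ...m }));
  }
}
```

Each new request would send snapshot() in full, which is why earlier turns stay visible to the model until reset() is called.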
Smaller models start faster and use less memory but are less capable; larger ones are smarter but download more and need a stronger GPU.
Downloaded weights persist in the browser's cache. webllm --cache lists the models currently stored, webllm --rm <id> deletes one (id or a unique substring), and webllm --rm-all clears them all (after a confirmation). These cache operations do not need WebGPU, so they work on any browser.
Models ship in two builds, both with 4-bit quantized weights: q4f16 (16-bit float compute, smaller and faster) requires the optional WebGPU shader-f16 feature, while q4f32 (32-bit float compute, larger) runs on any WebGPU device. webllm probes the GPU and automatically picks the build that will run; when shader-f16 is missing, it transparently swaps a requested q4f16 id for its q4f32 twin.
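The probe-and-swap behaviour can be sketched as follows. The feature check uses the standard WebGPU calls (navigator.gpu.requestAdapter() and adapter.features.has("shader-f16")). Everything else is an assumption made for illustration: the names GpuLike, GpuSupport, probeGpu and pickBuild are hypothetical, and so is the premise that the q4f32 twin's id differs from the q4f16 id only in that one token.

```typescript
// Minimal structural type so the sketch does not depend on WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<{ features: { has(name: string): boolean } } | null>;
}

type GpuSupport = "no-webgpu" | "f32-only" | "f16";

// In a browser this would be called as: probeGpu((navigator as any).gpu)
async function probeGpu(gpu: GpuLike | undefined): Promise<GpuSupport> {
  if (gpu === undefined) return "no-webgpu";
  const adapter = await gpu.requestAdapter();
  if (adapter === null) return "no-webgpu";
  return adapter.features.has("shader-f16") ? "f16" : "f32-only";
}

// Assumption: the twin build's id is the same string with q4f16 -> q4f32.
function pickBuild(requestedId: string, hasShaderF16: boolean): string {
  if (hasShaderF16) return requestedId;
  return requestedId.replace("q4f16", "q4f32");
}
```

For example, pickBuild("Qwen2.5-1.5B-Instruct-q4f16_1-MLC", false) yields "Qwen2.5-1.5B-Instruct-q4f32_1-MLC", while an id that is already q4f32 passes through unchanged.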
OPTIONS
    --list, -l        list the recommended (small) models
    --list-all        list every model id WebLLM knows about
    --unload, --stop  free the loaded model from GPU memory
    --cache           list the models stored in the browser cache
    --rm <id>         delete a cached model (id or unique substring)
    --rm-all          delete every cached model (asks to confirm)
CHAT COMMANDS
    /exit /quit /bye  close the chat session
    /reset /clear     forget the conversation context (keep the model)
    /model            show the model currently loaded
    /help             list these chat commands
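The commands above, with their aliases, could be dispatched by a parser along these lines. It is a hedged sketch: ChatAction, ALIASES and parseChatInput are hypothetical names, and reporting an unrecognized slash command as "unknown" (rather than sending it to the model as a message) is an assumption about behaviour this page does not specify.

```typescript
type CommandKind = "exit" | "reset" | "model" | "help";

type ChatAction =
  | { kind: CommandKind }
  | { kind: "unknown"; name: string }
  | { kind: "message"; text: string };

// Every alias listed under CHAT COMMANDS maps to one canonical action.
const ALIASES = new Map<string, CommandKind>([
  ["/exit", "exit"],
  ["/quit", "exit"],
  ["/bye", "exit"],
  ["/reset", "reset"],
  ["/clear", "reset"],
  ["/model", "model"],
  ["/help", "help"],
]);

function parseChatInput(line: string): ChatAction {
  const trimmed = line.trim();

  // Anything not starting with a slash is an ordinary chat message.
  if (!trimmed.startsWith("/")) return { kind: "message", text: trimmed };

  // Only the first word names the command; trailing text is ignored here.
  const name = (trimmed.split(/\s+/)[0] ?? "").toLowerCase();
  const kind = ALIASES.get(name);
  if (kind === undefined) return { kind: "unknown", name };
  return { kind };
}
```

A chat loop would call parseChatInput on each submitted line, then close the session on "exit", clear the history on "reset", and forward "message" text to the model.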
EXAMPLES
    webllm
    webllm 1
    webllm Llama-3.2-1B
    webllm --unload
    webllm --cache
    webllm --rm Qwen2.5-0.5B
    webllm --rm-all