
WEBLLM(1)

NAME

webllm — chat with a Large Language Model running entirely in your browser

SYNOPSIS

webllm
webllm --list
webllm --list-all
webllm <number>
webllm <model-id>
webllm --unload
webllm --cache
webllm --rm <model-id>
webllm --rm-all

DESCRIPTION

Runs an open-source LLM (Llama, Qwen, Gemma, Phi, SmolLM…) fully on the client, with no server: inference is accelerated by WebGPU, and the model weights are downloaded once, then cached by the browser. Powered by WebLLM (@mlc-ai/web-llm): the engine is a self-hosted module (/vendor/web-llm-<version>.js), so it loads under the site's own CSP with no third-party CDN. The model weights are fetched from Hugging Face.

With no argument (or --list) it prints a curated list of small, browser-friendly models. Pick one by its number (webllm 1) or by its id (webllm Qwen2.5-1.5B-Instruct-q4f16_1-MLC; a unique substring also works) to start a chat. A progress bar tracks the one-time model download; in a browser without WebGPU, an explanatory error is shown instead.
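The selection rule above (number, full id, or unique substring) can be sketched as a small pure function. This is only an illustration: the name resolveModel, the result shape, and the case-insensitive matching are assumptions, not the portal's actual code.

```typescript
// Hypothetical resolver for "webllm <number>" / "webllm <model-id>".
type Resolution =
  | { ok: true; id: string }
  | { ok: false; reason: "not-found" | "ambiguous"; candidates: string[] };

function resolveModel(arg: string, ids: readonly string[]): Resolution {
  // 1. A plain integer is treated as a 1-based index into the printed list.
  if (/^\d+$/.test(arg)) {
    const byIndex = ids[Number(arg) - 1];
    return byIndex !== undefined
      ? { ok: true, id: byIndex }
      : { ok: false, reason: "not-found", candidates: [] };
  }
  // 2. An exact id wins, even if it is also a substring of a longer id.
  if (ids.some((id) => id === arg)) return { ok: true, id: arg };
  // 3. Otherwise a substring is accepted only when it matches exactly one id.
  //    Case-insensitive matching is an assumption made for this sketch.
  const needle = arg.toLowerCase();
  const hits = ids.filter((id) => id.toLowerCase().includes(needle));
  const [only] = hits;
  if (hits.length === 1 && only !== undefined) return { ok: true, id: only };
  return {
    ok: false,
    reason: hits.length === 0 ? "not-found" : "ambiguous",
    candidates: hits,
  };
}
```

Returning the candidate list on an ambiguous match lets the caller print the ids that collided instead of a bare error.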

Once the chat is ready, type a message and press Enter. The reply streams in token by token. The conversation keeps its context until you /reset or /exit. Ctrl+C interrupts a running generation and closes the session. The loaded model stays in GPU memory for the page session, so re-running webllm with the same model resumes instantly; webllm --unload frees it.
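As a rough sketch of the streaming and context behaviour described above, the loop below consumes OpenAI-style stream chunks (the shape WebLLM's chat completions API follows) and keeps the conversation in a history array. The names streamReply and chatTurn and the generate hook are hypothetical; in a browser the hook would presumably wrap the engine's streaming chat completion call.

```typescript
// Only the fields used here are declared; real chunks carry more.
type Role = "system" | "user" | "assistant";
interface ChatMessage { role: Role; content: string; }
interface StreamChunk { choices: { delta: { content?: string | null } }[]; }

// Consume a token stream, forwarding each fragment as soon as it arrives.
async function streamReply(
  chunks: AsyncIterable<StreamChunk>,
  onToken: (text: string) => void,
): Promise<string> {
  let reply = "";
  for await (const chunk of chunks) {
    const text = chunk.choices[0]?.delta.content ?? "";
    if (text !== "") {
      reply += text;
      onToken(text);
    }
  }
  return reply;
}

// One chat turn. The history array *is* the context: it grows with each
// turn, and a /reset would simply empty it while the model stays loaded.
async function chatTurn(
  history: ChatMessage[],
  userText: string,
  generate: (messages: ChatMessage[]) => AsyncIterable<StreamChunk>, // hypothetical hook
  onToken: (text: string) => void,
): Promise<string> {
  history.push({ role: "user", content: userText });
  const reply = await streamReply(generate(history), onToken);
  history.push({ role: "assistant", content: reply });
  return reply;
}
```

Passing the generator in as a parameter keeps the loop independent of the engine, so it can be exercised with a fake stream.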

Smaller models start faster and use less memory but are less capable; larger ones are smarter but download more and need a stronger GPU.

Downloaded weights persist in the browser's cache. webllm --cache lists the models currently stored, webllm --rm <id> deletes one (id or a unique substring), and webllm --rm-all clears them all (after a confirmation). These cache operations do not need WebGPU, so they work in any browser.

Models ship in two builds, both with 4-bit quantized weights: q4f16 (16-bit floats, smaller and faster) requires the optional WebGPU shader-f16 feature, while q4f32 (32-bit floats, larger) runs anywhere. webllm probes the GPU and automatically picks the build that will run; when shader-f16 is missing, it transparently swaps a requested q4f16 id for its q4f32 twin.
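The fallback described above might look like the following sketch. The helper name pickBuild and its signature are assumptions. The id rewrite relies on the two builds differing only in the "q4f16" / "q4f32" part of the id, as in the ids shown on this page.

```typescript
// Hypothetical helper: swap a q4f16 id for its q4f32 twin when the GPU
// lacks shader-f16. In a browser the flag could be obtained with:
//   const adapter = await navigator.gpu.requestAdapter();
//   const hasShaderF16 = adapter?.features.has("shader-f16") ?? false;
function pickBuild(
  requestedId: string,
  knownIds: readonly string[],
  hasShaderF16: boolean,
): string {
  // Nothing to do if f16 shaders are supported or the id is not an f16 build.
  if (hasShaderF16 || !requestedId.includes("q4f16")) return requestedId;
  const twin = requestedId.replace("q4f16", "q4f32");
  // Swap only when the twin really exists; otherwise keep the request so
  // the caller can report that no runnable build is available.
  return knownIds.some((id) => id === twin) ? twin : requestedId;
}
```

For example, with shader-f16 missing, a request for Qwen2.5-1.5B-Instruct-q4f16_1-MLC resolves to Qwen2.5-1.5B-Instruct-q4f32_1-MLC, provided that id is in the known list.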

OPTIONS

--list, -l        list the recommended (small) models
--list-all        list every model id WebLLM knows about
--unload, --stop  free the loaded model from GPU memory
--cache           list the models stored in the browser cache
--rm <id>         delete a cached model (id or unique substring)
--rm-all          delete every cached model (asks to confirm)
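To make the option table concrete, here is a minimal sketch of how the arguments might be mapped to actions. The Command type, the parseArgs name, and the error messages are illustrative assumptions. Only the flags themselves come from this page.

```typescript
// Hypothetical argument parser for the options listed above.
type Command =
  | { kind: "list" }
  | { kind: "list-all" }
  | { kind: "unload" }
  | { kind: "cache" }
  | { kind: "rm"; target: string }
  | { kind: "rm-all" }
  | { kind: "chat"; model: string }
  | { kind: "error"; message: string };

function parseArgs(argv: readonly string[]): Command {
  const [first, second] = argv;
  // No argument behaves like --list.
  if (first === undefined) return { kind: "list" };
  switch (first) {
    case "--list":
    case "-l":
      return { kind: "list" };
    case "--list-all":
      return { kind: "list-all" };
    case "--unload":
    case "--stop":
      return { kind: "unload" };
    case "--cache":
      return { kind: "cache" };
    case "--rm-all":
      return { kind: "rm-all" };
    case "--rm":
      return second !== undefined
        ? { kind: "rm", target: second }
        : { kind: "error", message: "--rm needs a model id" };
  }
  // Any other dash-prefixed token is rejected rather than used as a model id.
  if (first.startsWith("-")) {
    return { kind: "error", message: `unknown option: ${first}` };
  }
  // Anything else is a list number or a model id (or substring).
  return { kind: "chat", model: first };
}
```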

CHAT COMMANDS

/exit /quit /bye  close the chat session
/reset /clear     forget the conversation context (keep the model)
/model            show the model currently loaded
/help             list these chat commands
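One way to separate these slash commands from ordinary messages is sketched below. The names (parseChatLine, ChatInput) are hypothetical. Treating an unrecognised slash word as an unknown command rather than as a message, and matching commands case-insensitively, are assumptions.

```typescript
type ChatAction = "exit" | "reset" | "model" | "help";

// Aliases from the table above, mapped to a single action each.
const CHAT_COMMANDS = new Map<string, ChatAction>([
  ["/exit", "exit"], ["/quit", "exit"], ["/bye", "exit"],
  ["/reset", "reset"], ["/clear", "reset"],
  ["/model", "model"],
  ["/help", "help"],
]);

type ChatInput =
  | { kind: "empty" }
  | { kind: "command"; action: ChatAction }
  | { kind: "unknown"; name: string }
  | { kind: "message"; text: string };

function parseChatLine(line: string): ChatInput {
  const text = line.trim();
  if (text === "") return { kind: "empty" };
  // Only a leading slash marks a command; "hello /exit" is a normal message.
  if (!text.startsWith("/")) return { kind: "message", text };
  const name = (text.split(/\s+/)[0] ?? text).toLowerCase();
  const action = CHAT_COMMANDS.get(name);
  return action !== undefined
    ? { kind: "command", action }
    : { kind: "unknown", name };
}
```

Mapping the aliases to one action each means the session loop handles four cases, however many spellings are accepted.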

EXAMPLES

webllm
webllm 1
webllm Llama-3.2-1B
webllm --unload
webllm --cache
webllm --rm Qwen2.5-0.5B
webllm --rm-all

Ludovic Toinel

Fullstack & Innovation Architect at Capgemini. Blogger, hacker, photographer, drone pilot, traveler, and musician.
