Design summary
Full picture before we write the spec — let me know if anything needs changing
What we're building
Real-time animated talking faces for the voice-web interface, powered by Simli.ai. Each voice profile gets its own realistic face, starting with Hermes. Faces animate in sync with either TTS engine (ElevenLabs or Piper).
UI behaviour
- Default (PiP mode) — chat transcript fills the screen, Hermes face floats as a picture-in-picture overlay in the bottom corner
- Multi-agent (Gallery mode) — all agent faces visible side-by-side, speaking face highlighted, others dimmed
- Auto-switching — gallery activates automatically when a second agent starts speaking; returns to PiP when back to one agent
- Pin override — a toggle lets you lock either mode regardless of agent count
- Fallback — if Simli is unreachable, UI silently falls back to current audio-only experience; no broken state
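The auto-switch and pin rules above reduce to a small pure function. A minimal sketch in TypeScript (pickMode and LayoutMode are illustrative names, not part of any SDK):

```typescript
// Layout decision for the face UI: PiP overlay vs. side-by-side gallery.
type LayoutMode = "pip" | "gallery";

function pickMode(activeSpeakers: number, pinned: LayoutMode | null): LayoutMode {
  // Pin override: the user-locked mode wins regardless of agent count.
  if (pinned) return pinned;
  // Gallery activates when a second agent is speaking; otherwise PiP.
  return activeSpeakers >= 2 ? "gallery" : "pip";
}
```

FaceManager would call this on every speaker change and animate the transition only when the result differs from the current mode.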
New components
- PCMDecoder — browser-side, converts MP3 (ElevenLabs) or WAV (Piper) chunks to raw 16-bit PCM at 16kHz for Simli
- SimliClient wrapper — thin class around the Simli SDK; manages one WebRTC session per agent face, handles reconnect
- FaceManager — owns all <video> elements, activates/deactivates faces per speaking agent, drives PiP ↔ gallery transitions
- /api/simli/session — new endpoint on the voice-web server; calls the Simli API server-side so the API key never reaches the browser; returns a session token + face config per agent
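For the PCMDecoder, the browser can decode MP3/WAV chunks with AudioContext.decodeAudioData and resample to 16 kHz via an OfflineAudioContext; the final step, converting the decoded Float32 samples to the raw 16-bit PCM Simli expects, could look like this sketch (floatTo16BitPCM is a hypothetical helper name, not a real API):

```typescript
// Convert Web Audio Float32 samples (range [-1, 1]) to signed 16-bit PCM.
// Assumes the input has already been resampled to 16 kHz mono upstream.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] first so clipped input can't overflow int16.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Asymmetric scale: int16 spans -32768..32767.
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

The Int16Array can then be sent over the Simli WebRTC data path as-is (subject to whatever chunk size the SDK wants).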
What doesn't change
- voice-gateway — no changes at all
- ElevenLabs integration, Piper fallback, engine selector — all unchanged
- WebSocket session protocol — no new WS message types needed
- Existing audio player — kept as fallback; bypassed when Simli is active
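The "bypassed when Simli is active" rule is a single routing decision. A minimal sketch, with illustrative names (routeChunk and the sink labels are not real APIs):

```typescript
// Where a decoded audio chunk goes: Simli plays audio through its own
// WebRTC stream when a session is live; otherwise the existing
// <audio>-element player handles it, so there is never a broken state.
type AudioSink = "simli" | "audio-element";

function routeChunk(simliSessionLive: boolean): AudioSink {
  return simliSessionLive ? "simli" : "audio-element";
}
```

If the Simli session drops mid-utterance, the same check on the next chunk silently reverts playback to the current audio-only path.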
Phase plan
- Phase 1 — Hermes face in PiP mode with real-time lip sync (ElevenLabs + Piper)
- Phase 2 — Per-profile faces, gallery mode, auto-switch + pin
- Phase 3 — Custom face upload per profile (refine personas)
Does this match what you had in mind? Any changes before I write the spec?