Design summary
Full picture before we write the spec — let me know if anything needs changing
What we're building
Real-time animated talking faces for the voice-web interface, powered by Simli.ai. Each voice profile gets its own realistic face, starting with Hermes. Faces animate in sync with either TTS engine (ElevenLabs or Piper).
UI behaviour
- Default (PiP mode) — chat transcript fills the screen, Hermes face floats as a picture-in-picture overlay in the bottom corner
- Multi-agent (Gallery mode) — all agent faces visible side-by-side, speaking face highlighted, others dimmed
- Auto-switching — gallery activates automatically when a second agent starts speaking; returns to PiP when back to one agent
- Pin override — a toggle lets you lock either mode regardless of agent count
- Fallback — if Simli is unreachable, UI silently falls back to current audio-only experience; no broken state
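The auto-switch and pin rules above reduce to a small pure function. A minimal sketch in TypeScript (pickMode and LayoutMode are illustrative names, not part of any SDK):

```typescript
// Layout decision for the face UI: PiP overlay vs. side-by-side gallery.
type LayoutMode = "pip" | "gallery";

function pickMode(activeSpeakers: number, pinned: LayoutMode | null): LayoutMode {
  // Pin override: the user-locked mode wins regardless of agent count.
  if (pinned) return pinned;
  // Gallery activates when a second agent is speaking; otherwise PiP.
  return activeSpeakers >= 2 ? "gallery" : "pip";
}
```

FaceManager would call this on every speaker change and animate the transition only when the result differs from the current mode.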
New components
- PCMDecoder — browser-side, converts MP3 (ElevenLabs) or WAV (Piper) chunks to raw 16-bit PCM at 16kHz for Simli
- SimliClient wrapper — thin class around the Simli SDK; manages one WebRTC session per agent face, handles reconnect
- FaceManager — owns all <video> elements, activates/deactivates faces per speaking agent, drives PiP ↔ gallery transitions
- /api/simli/session — new endpoint on the voice-web server; calls the Simli API server-side so the API key never reaches the browser; returns a session token + face config per agent
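For the PCMDecoder, the browser can decode MP3/WAV chunks with AudioContext.decodeAudioData and resample to 16 kHz via an OfflineAudioContext; the final step, converting the decoded Float32 samples to the raw 16-bit PCM Simli expects, could look like this sketch (floatTo16BitPCM is a hypothetical helper name, not a real API):

```typescript
// Convert Web Audio Float32 samples (range [-1, 1]) to signed 16-bit PCM.
// Assumes the input has already been resampled to 16 kHz mono upstream.
function floatTo16BitPCM(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1] first so clipped input can't overflow int16.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Asymmetric scale: int16 spans -32768..32767.
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

The Int16Array can then be sent over the Simli WebRTC data path as-is (subject to whatever chunk size the SDK wants).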
What doesn't change
- voice-gateway — no changes at all
- ElevenLabs integration, Piper fallback, engine selector — all unchanged
- WebSocket session protocol — no new WS message types needed
- Existing audio player — kept as fallback; bypassed when Simli is active
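The "bypassed when Simli is active" rule is a single routing decision. A minimal sketch, with illustrative names (routeChunk and the sink labels are not real APIs):

```typescript
// Where a decoded audio chunk goes: Simli plays audio through its own
// WebRTC stream when a session is live; otherwise the existing
// <audio>-element player handles it, so there is never a broken state.
type AudioSink = "simli" | "audio-element";

function routeChunk(simliSessionLive: boolean): AudioSink {
  return simliSessionLive ? "simli" : "audio-element";
}
```

If the Simli session drops mid-utterance, the same check on the next chunk silently reverts playback to the current audio-only path.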
Phase plan
- Phase 1 — Hermes face in PiP mode with real-time lip sync (ElevenLabs + Piper)
- Phase 2 — Per-profile faces, gallery mode, auto-switch + pin
- Phase 3 — Custom face upload per profile (refine personas)
Does this match what you had in mind? Any changes before I write the spec?