
Add SmallestAI Hydra S2S framework support #172

Open
weiz9 wants to merge 3 commits into main from pr/wz/smallestai-hydra-s2s-framework

weiz9 commented Jul 1, 2026

Collaborator

Summary

Adds SmallestAI Hydra speech-to-speech as an EVA framework (framework: smallest_hydra, s2s: hydra), alongside the existing S2S integrations (Gemini Live, ElevenLabs). Bridges the Twilio user-simulator WebSocket to Hydra over a raw WebSocket, modeled on the Gemini Live server: three concurrent tasks (forward user audio / process Hydra events / pace output) with sync_buffer_to_position recording. Audio is 16 kHz in / 48 kHz out, recorded at 48 kHz native.
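
The three-task layout described above can be sketched with asyncio as follows. This is a minimal illustration, not the code in smallest_hydra_server.py: the queues stand in for the two WebSockets, and every function name and the `audio` event field are assumptions made for the example.

```python
import asyncio


async def forward_user_audio(inbound: asyncio.Queue, to_hydra: list) -> None:
    """Task 1: relay user audio frames from the simulator toward Hydra."""
    while (frame := await inbound.get()) is not None:
        to_hydra.append(frame)  # stand-in for a WebSocket send


async def process_hydra_events(events: asyncio.Queue, outbound: asyncio.Queue) -> None:
    """Task 2: read Hydra events and queue audio deltas for playback."""
    while (event := await events.get()) is not None:
        if event["type"] == "response.output_audio.delta":
            await outbound.put(event["audio"])  # "audio" field name is assumed
    await outbound.put(None)  # tell the pacer that no more audio is coming


async def pace_output(outbound: asyncio.Queue, played: list, frame_seconds: float) -> None:
    """Task 3: release queued audio at a fixed pace instead of all at once."""
    while (chunk := await outbound.get()) is not None:
        played.append(chunk)  # stand-in for sending audio back to the simulator
        await asyncio.sleep(frame_seconds)


async def run_bridge(user_frames, hydra_events, frame_seconds: float = 0.0):
    """Run the three tasks concurrently over in-memory queues."""
    inbound, events, outbound = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    for frame in user_frames:
        inbound.put_nowait(frame)
    inbound.put_nowait(None)  # None marks end of stream in this sketch
    for event in hydra_events:
        events.put_nowait(event)
    events.put_nowait(None)

    to_hydra, played = [], []
    await asyncio.gather(
        forward_user_audio(inbound, to_hydra),
        process_hydra_events(events, outbound),
        pace_output(outbound, played, frame_seconds),
    )
    return to_hydra, played
```

Keeping the pacer separate from event processing means a burst of audio deltas from Hydra does not block reading further events such as barge-in or response.done.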

Standalone against main — 1 commit, 9 files.

What's included

  • smallest_hydra_server.py: the server. It speaks an OpenAI-Realtime-shaped protocol: session.configure handshake, input_audio_buffer.append in, response.output_audio.delta out, client-side tools (response.function_call_arguments.done → conversation.item.create → response.create), native barge-in, generate_initial_response greeting, and keepalive during idle.
  • Transcripts — the assistant transcript comes from Hydra's native response.output_audio_transcript.delta stream (accumulated per response, finalized on response.done) — accurate, no extra STT call. Hydra emits no user transcript, so each user utterance is batch-transcribed by a configurable STT (s2s_transcription.py: smallest default / openai / deepgram), fail-soft, with a short-utterance guard (≥300 ms) to avoid hallucinations on near-silence. The audit log is timestamp-sorted so async user transcriptions stay ordered.
  • Metrics: model_response latency is anchored on Hydra's speech_stopped event rather than the simulator's user_speech_stop (which races), with a 50 ms floor; token usage comes from response.done.
  • Payload cap: instructions are truncated to keep session.configure within Hydra's ~32 KB ceiling (raised from ~18 KB; Hydra silently drops audio above the limit). The greeting steer is preserved. Tool-heavy domains can exceed the ceiling on tool schemas alone (see limitations).
  • Audio — new 48 kHz helpers (mulaw_8k_to_pcm16_48k, pcm16_48k_to_mulaw_8k) + in-memory WAV builder in audio_bridge.py.
  • Framework registered in worker._get_server_class() and the RunConfig.framework Literal; simulation_version bumped to 2.0.2.
  • .env.example (config example + framework enum) and contract doc §13 updated.
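
As an illustration of the transcript handling described in the list above, here is a minimal sketch of per-response delta accumulation and the short-utterance guard. The event type names and the 300 ms threshold come from this description; the `response_id` and `delta` field names, the class and function names, and the assumption that user audio is 16 kHz PCM16 are hypothetical.

```python
from collections import defaultdict

MIN_UTTERANCE_MS = 300  # guard threshold stated in the description


class AssistantTranscript:
    """Accumulate transcript deltas per response and finalize on response.done."""

    def __init__(self) -> None:
        self._parts = defaultdict(list)
        self.finalized = []

    def handle(self, event: dict) -> None:
        response_id = event.get("response_id", "")  # assumed field name
        if event["type"] == "response.output_audio_transcript.delta":
            self._parts[response_id].append(event.get("delta", ""))  # assumed field name
        elif event["type"] == "response.done":
            text = "".join(self._parts.pop(response_id, []))
            if text:
                self.finalized.append(text)


def should_transcribe(pcm16: bytes, sample_rate: int = 16_000) -> bool:
    """Skip batch STT for utterances shorter than the guard threshold."""
    samples = len(pcm16) // 2  # PCM16 uses 2 bytes per sample
    duration_ms = samples * 1000 // sample_rate
    return duration_ms >= MIN_UTTERANCE_MS
```

Finalizing only on response.done keeps one audit-log entry per assistant turn, and the duration check avoids sending near-silent clips to the STT provider at all.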

Testing

  • pytest tests/unit/assistant/test_smallest_hydra_server.py — 16 pass (tool conversion, 48 kHz audio round-trip, transcriber selection + per-provider response parsing, fail-soft). ruff check/format clean.
  • Validated across 4 live airline runs (--record-ids 1.1.2 --num-trials 1): conversation succeeds end-to-end (greeting, multi-turn, get_reservation + search_rebooking_options + rebook_flight + assign_seat, DB mutation), EVA-X pass 1.0; native assistant transcript accurate (e.g. "Austin AUS to Los Angeles LAX"); model_response latency records clean values; token usage per turn.

Config example

EVA_FRAMEWORK=smallest_hydra
EVA_MODEL__S2S=hydra
EVA_MODEL__S2S_PARAMS='{"model":"hydra","api_key":"<SMALLEST_API_KEY>","voice":"wren","generate_initial_response":true,"transcription":{"provider":"smallest","model":"pulse-pro","language":"en"}}'
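
For readers wiring this up, the sketch below shows one way the JSON in EVA_MODEL__S2S_PARAMS could be parsed and checked. It is not EVA's actual config loader: the function name, the choice of required keys, and the defaulting behavior are assumptions, apart from "smallest" being the default transcription provider as stated above.

```python
import json


def load_s2s_params(env: dict) -> dict:
    """Parse EVA_MODEL__S2S_PARAMS and apply the documented transcription default."""
    params = json.loads(env["EVA_MODEL__S2S_PARAMS"])
    # Treating these two keys as required is an assumption of this sketch.
    missing = [key for key in ("model", "api_key") if not params.get(key)]
    if missing:
        raise ValueError(f"missing S2S params: {', '.join(missing)}")
    # "smallest" is the default transcription provider per the description.
    params.setdefault("transcription", {"provider": "smallest"})
    return params
```

In practice `env` would be `os.environ`; passing a plain dict keeps the sketch easy to test.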

Known limitations (Smallest-side, not this code)

  • ~32 KB payload ceiling (Smallest-side). Airline (~22 KB) fits with full instructions. Tool-heavy domains still don't: ITSM's 59 tool schemas alone are ~45 KB, exceeding the ceiling before any instructions, so those domains stay degraded (instructions floored) until the limit clears their tools.
  • English-only (en) per Hydra's current support.
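
To make the payload-ceiling limitation concrete, here is a minimal sketch of instruction truncation against a byte budget. It is not the PR's implementation: the payload shape, the function name, and the exact constant value are assumptions (the PR uses _MAX_PAYLOAD_BYTES, whose value is not shown here), and the greeting-steer preservation is not modeled.

```python
import json

MAX_PAYLOAD_BYTES = 32 * 1024  # approximates the ~32 KB ceiling; exact value assumed


def fit_instructions(instructions: str, tools: list, max_bytes: int = MAX_PAYLOAD_BYTES):
    """Return (instructions_prefix, fits) for a session.configure-like payload."""

    def size(text: str) -> int:
        # Hypothetical payload shape; the real session.configure has more fields.
        payload = {"type": "session.configure", "instructions": text, "tools": tools}
        return len(json.dumps(payload).encode("utf-8"))

    if size(instructions) <= max_bytes:
        return instructions, True

    # Binary search for the longest instruction prefix that fits the budget.
    lo, hi = 0, len(instructions)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if size(instructions[:mid]) <= max_bytes:
            lo = mid
        else:
            hi = mid - 1

    prefix = instructions[:lo]
    # If even empty instructions do not fit, the tool schemas alone exceed the cap.
    return prefix, size(prefix) <= max_bytes
```

When `fits` comes back False with an empty prefix, no amount of instruction truncation helps, which is the ITSM situation described above.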

weiz9 changed the base branch from pr/wz/deepgram-voice-agent-framework to main July 1, 2026 20:47
Bridge the Twilio user-simulator WebSocket to Smallest's Hydra speech-to-speech
model (framework: smallest_hydra, s2s: hydra), modeled on the Gemini Live server:
three concurrent tasks (forward user audio / process Hydra events / pace output)
with sync_buffer_to_position recording. Audio is 16 kHz in / 48 kHz out; recorded
at 48 kHz native.

- Assistant transcript comes from Hydra's native
  response.output_audio_transcript.delta stream (accumulated per response,
  finalized on response.done). Hydra emits no user transcript, so each user
  utterance is batch-transcribed by a configurable STT (s2s_transcription.py:
  smallest default / openai / deepgram), fail-soft, with a short-utterance
  guard to avoid hallucinations on near-silence.
- model_response latency anchored on Hydra's speech_stopped (the simulator's
  user_speech_stop races), with a 50 ms floor; token usage from response.done.
- session.configure handshake, client-side tools, native barge-in,
  generate_initial_response greeting; ~18 KB payload cap on instructions (Hydra
  drops audio above it).
- New 48 kHz audio helpers + in-memory WAV builder in audio_bridge.py.
- Register framework in worker + config Literal; bump simulation_version to 2.0.6.
- Unit tests for tool conversion, audio round-trip, transcriber selection/parsing.
weiz9 force-pushed the pr/wz/smallestai-hydra-s2s-framework branch from de2077a to 7492a7b July 1, 2026 20:55
weiz9 and others added 2 commits July 1, 2026 20:55
Smallest raised the payload limit from ~18 KB to ~32 KB. Bump _MAX_PAYLOAD_BYTES
so airline-sized prompts (~22 KB) are no longer truncated. Document that tool-heavy
domains (e.g. ITSM: 59 tools ≈ 45 KB) still exceed the ceiling on tool schemas
alone, which instruction truncation cannot resolve.