Add SmallestAI Hydra S2S framework support #172
Open
weiz9 wants to merge 3 commits into
Bridge the Twilio user-simulator WebSocket to Smallest's Hydra speech-to-speech model (framework: smallest_hydra, s2s: hydra), modeled on the Gemini Live server: three concurrent tasks (forward user audio / process Hydra events / pace output) with sync_buffer_to_position recording. Audio is 16 kHz in / 48 kHz out; recorded at 48 kHz native.

- Assistant transcript comes from Hydra's native response.output_audio_transcript.delta stream (accumulated per response, finalized on response.done). Hydra emits no user transcript, so each user utterance is batch-transcribed by a configurable STT (s2s_transcription.py: smallest default / openai / deepgram), fail-soft, with a short-utterance guard to avoid hallucinations on near-silence.
- model_response latency anchored on Hydra's speech_stopped (the simulator's user_speech_stop races), with a 50 ms floor; token usage from response.done.
- session.configure handshake, client-side tools, native barge-in, generate_initial_response greeting; ~18 KB payload cap on instructions (Hydra drops audio above it).
- New 48 kHz audio helpers + in-memory WAV builder in audio_bridge.py.
- Register framework in worker + config Literal; bump simulation_version to 2.0.6.
- Unit tests for tool conversion, audio round-trip, transcriber selection/parsing.
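As a rough illustration of the per-response transcript handling described in this commit message, the sketch below accumulates transcript deltas and emits the full text on response.done. The event type strings come from the description; the `response_id` and `delta` field names and the `TranscriptAccumulator` class are assumptions made for the example, not the PR's actual code.

```python
from __future__ import annotations

from collections import defaultdict


class TranscriptAccumulator:
    """Collect transcript deltas per response and finalize them on response.done."""

    def __init__(self) -> None:
        self._parts: dict[str, list[str]] = defaultdict(list)

    def handle_event(self, event: dict) -> str | None:
        """Return the finished transcript on response.done, otherwise None."""
        etype = event.get("type")
        # Assumption: each event carries a "response_id"; fall back to one shared bucket.
        rid = event.get("response_id", "default")
        if etype == "response.output_audio_transcript.delta":
            # Assumption: the text fragment lives under "delta".
            self._parts[rid].append(event.get("delta", ""))
            return None
        if etype == "response.done":
            # Join whatever was collected; an empty transcript is reported as None.
            text = "".join(self._parts.pop(rid, [])).strip()
            return text or None
        return None
```

Keeping one bucket per response means a barge-in that starts a new response cannot mix its text into the previous one.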
Smallest raised the payload limit from ~18 KB to ~32 KB. Bump _MAX_PAYLOAD_BYTES so airline-sized prompts (~22 KB) are no longer truncated. Document that tool-heavy domains (e.g. ITSM: 59 tools ≈ 45 KB) still exceed the ceiling on tool schemas alone, which instruction truncation cannot resolve.
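The two cases in this commit message (instructions that can be trimmed to fit, and tool schemas that exceed the ceiling on their own) can be sketched as follows. This is a minimal illustration under stated assumptions: the payload shape, the `fit_instructions` helper, and the reading of "~32 KB" as 32 KiB are all hypothetical, and the greeting-steer preservation mentioned in the PR is not modeled.

```python
import json

MAX_PAYLOAD_BYTES = 32 * 1024  # assumption: "~32 KB" taken as 32 KiB


def fit_instructions(instructions: str, tools: list, max_bytes: int = MAX_PAYLOAD_BYTES) -> str:
    """Trim instructions so the serialized payload stays within max_bytes.

    Raises ValueError when the tool schemas alone exceed the ceiling,
    since no amount of instruction truncation can help in that case.
    """

    def size(text: str) -> int:
        # Assumption: a simplified session.configure payload shape.
        payload = {"type": "session.configure", "instructions": text, "tools": tools}
        return len(json.dumps(payload).encode("utf-8"))

    if size("") > max_bytes:
        raise ValueError("tool schemas alone exceed the payload ceiling")
    if size(instructions) <= max_bytes:
        return instructions

    # Binary search for the longest prefix that still fits.
    lo, hi = 0, len(instructions)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if size(instructions[:mid]) <= max_bytes:
            lo = mid
        else:
            hi = mid - 1
    return instructions[:lo]
```

Measuring the serialized payload, rather than the raw instruction string, accounts for JSON escaping and for the space the tool schemas already consume.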
Summary
Adds SmallestAI Hydra speech-to-speech as an EVA framework (framework: smallest_hydra, s2s: hydra), alongside the existing S2S integrations (Gemini Live, ElevenLabs). Bridges the Twilio user-simulator WebSocket to Hydra over a raw WebSocket, modeled on the Gemini Live server: three concurrent tasks (forward user audio / process Hydra events / pace output) with sync_buffer_to_position recording. Audio is 16 kHz in / 48 kHz out, recorded at 48 kHz native.

Standalone against main: 1 commit, 9 files.

What's included

- smallest_hydra_server.py: the server. OpenAI-Realtime-shaped protocol: session.configure handshake, input_audio_buffer.append in, response.output_audio.delta out, client-side tools (response.function_call_arguments.done → conversation.item.create → response.create), native barge-in, generate_initial_response greeting, keepalive during idle.
- Assistant transcript comes from Hydra's native response.output_audio_transcript.delta stream (accumulated per response, finalized on response.done): accurate, no extra STT call. Hydra emits no user transcript, so each user utterance is batch-transcribed by a configurable STT (s2s_transcription.py: smallest default / openai / deepgram), fail-soft, with a short-utterance guard (≥300 ms) to avoid hallucinations on near-silence. The audit log is timestamp-sorted so async user transcriptions stay ordered.
- model_response latency is anchored on Hydra's speech_stopped (the simulator's user_speech_stop races), with a 50 ms floor; token usage comes from response.done.
- Instructions are truncated to keep session.configure within Hydra's ~32 KB ceiling (raised from ~18 KB; Hydra silently drops audio above the limit). The greeting steer is preserved. Tool-heavy domains can exceed the ceiling on tool schemas alone (see limitations).
- New 48 kHz audio helpers (mulaw_8k_to_pcm16_48k, pcm16_48k_to_mulaw_8k) + in-memory WAV builder in audio_bridge.py.
- Registered in worker._get_server_class() and the RunConfig.framework Literal; simulation_version bumped to 2.0.2.
- .env.example (config example + framework enum) and contract doc §13 updated.

Testing

- pytest tests/unit/assistant/test_smallest_hydra_server.py: 16 pass (tool conversion, 48 kHz audio round-trip, transcriber selection + per-provider response parsing, fail-soft). ruff check/format clean.
- Run with --record-ids 1.1.2 --num-trials 1: conversation succeeds end-to-end (greeting, multi-turn, get_reservation + search_rebooking_options + rebook_flight + assign_seat, DB mutation), EVA-X pass 1.0; native assistant transcript accurate (e.g. "Austin AUS to Los Angeles LAX"); model_response latency records clean values; token usage per turn.

Config example

EVA_FRAMEWORK=smallest_hydra
EVA_MODEL__S2S=hydra
EVA_MODEL__S2S_PARAMS='{"model":"hydra","api_key":"<SMALLEST_API_KEY>","voice":"wren","generate_initial_response":true,"transcription":{"provider":"smallest","model":"pulse-pro","language":"en"}}'

Known limitations (Smallest-side, not this code)

- English only (en), per Hydra's current support.
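For reviewers, the short-utterance guard and fail-soft transcription listed under What's included could look roughly like the sketch below. The 300 ms threshold comes from the description; the buffer format (16 kHz mono PCM16), the `transcribe_user_audio` helper, and the `stt` callable are assumptions made for illustration and do not reflect the actual s2s_transcription.py interface.

```python
from typing import Callable, Optional

SAMPLE_RATE_HZ = 16_000   # assumption: user audio is buffered as 16 kHz mono PCM16
BYTES_PER_SAMPLE = 2      # 16-bit samples
MIN_UTTERANCE_MS = 300    # threshold stated in the PR description


def utterance_ms(pcm: bytes) -> int:
    """Duration of a PCM16 mono buffer in whole milliseconds."""
    return len(pcm) * 1000 // (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def transcribe_user_audio(pcm: bytes, stt: Callable[[bytes], str]) -> Optional[str]:
    """Return a transcript, or None when the clip is too short or the STT call fails."""
    if utterance_ms(pcm) < MIN_UTTERANCE_MS:
        # Near-silence clips tend to produce hallucinated text, so skip the call.
        return None
    try:
        text = stt(pcm).strip()
    except Exception:
        # Fail-soft: a transcription error must not break the conversation loop.
        return None
    return text or None
```

Returning None in both the guarded and the failed case lets the caller treat "no user transcript" uniformly when writing the audit log.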