Oct 21, 2025
Background Audio for Speech-to-Speech Voice Agents
Overview
The voice AI industry is on the cusp of a significant architectural shift. Historically, voice AI has relied on chained architectures – transcribing speech to text, processing it through an LLM, and converting the output back to speech. This architecture has several drawbacks, including:
- Inability to capture non-textual audio cues. Vocal signals like "hmm" that indicate thinking are lost during transcription.
- Loss of prosody. Speech-to-text conversion strips away acoustic features like tone, emotion and expression that convey meaning beyond words.
- Latency. Processing through three sequential stages introduces additional delays in response generation.
Speech-to-speech models represent a paradigm shift – a single model processes audio input directly and generates audio output in real-time, preserving the natural characteristics of human speech. Given the nascency of this technology, several capabilities remain undeveloped. One of the most critical is background audio, which serves as a key signal of authenticity when users interact with a voice agent.
Despite the advancements of OpenAI's Realtime API and Agents SDK, neither framework natively supports background audio injection for voice agents. This writeup and GitHub repository detail a custom extension to OpenAI's TwilioRealtimeTransportLayer that enables continuous background audio playback with automatic muting when the agent is speaking, resulting in more natural and realistic voice interactions.
Solution Summary
The solution cleanly extends OpenAI's TwilioRealtimeTransportLayer through a custom TwilioBackgroundAudioTransport class that maintains a separate background audio stream synchronized with agent speech state, preserving base transport functionality (audio routing, interruption handling, etc.) without modifying SDK source code. Key features include:
1. Speech State Detection. The challenge is tracking agent speech across multiple lifecycle stages: when audio generation begins, when it completes, and when playback to the caller finishes. The solution combines three mechanisms:
- `_onAudio` override to detect streaming start
- RealtimeSession listeners (`response.done`, `response.cancelled`) to track generation lifecycle
- Twilio mark events injected post-response that Twilio echoes back upon playback completion, signaling when to resume background audio
2. Buffer Management. Sends Twilio's clear event before each agent response to flush buffered audio, preventing background audio bleed-through and preserving pristine voice quality.
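The buffer-flush step can be sketched as below. This is a minimal illustration, not the repo's code: `sendToTwilio` is a hypothetical stand-in for the transport's WebSocket send, while the message shape follows the Twilio Media Streams protocol, where a `clear` event tells Twilio to discard all buffered, unplayed audio.

```typescript
// Sketch of the buffer flush (assumption: `sendToTwilio` stands in for the
// transport's WebSocket send; the message shape is Twilio Media Streams).
type TwilioClearMessage = { event: 'clear'; streamSid: string };

function buildClearMessage(streamSid: string): TwilioClearMessage {
  // Twilio discards all queued (unplayed) audio on "clear", so buffered
  // background-audio chunks never bleed into the agent's speech.
  return { event: 'clear', streamSid };
}

const sent: string[] = [];
const sendToTwilio = (msg: TwilioClearMessage) => { sent.push(JSON.stringify(msg)); };

// Before forwarding the first chunk of an agent response:
sendToTwilio(buildClearMessage('MZ0000000000000000'));
```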
Execution Flow
The following trace illustrates how background audio synchronizes with agent speech during a typical call.
Pre-Call Initialization
- Constructor loads background audio file from disk via `loadBackgroundAudio()`
- `setupTwilioListeners()` registers WebSocket event handlers for `start` and `mark` events
- `setupSessionListeners()` (called from `index.ts`) registers OpenAI session event handlers
0:00 | Call Start
- Twilio sends `start` event with `streamSid`
- `startBackgroundAudio()` initiates a timer that sends 160-byte audio chunks every 20ms (aligns with Twilio Media Streams API requirement for 8kHz audio)
- Background audio loops continuously using modulo arithmetic on `backgroundPosition`, with drift correction to maintain precise timing
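The chunk math above is the real constraint: 8kHz μ-law audio is 1 byte per sample, so a 20ms frame is 160 bytes. A sketch of the looping and drift-correction logic, assuming the `backgroundPosition` name from the writeup (the repo's implementation may differ in detail):

```typescript
const CHUNK_BYTES = 160; // 8000 samples/s * 0.020 s * 1 byte/sample (mu-law)
const FRAME_MS = 20;

// Read one 160-byte frame, wrapping with modulo so the file loops seamlessly.
function nextChunk(audio: Uint8Array, position: number): { chunk: Uint8Array; position: number } {
  const chunk = new Uint8Array(CHUNK_BYTES);
  for (let i = 0; i < CHUNK_BYTES; i++) {
    chunk[i] = audio[(position + i) % audio.length];
  }
  return { chunk, position: (position + CHUNK_BYTES) % audio.length };
}

// Drift correction: schedule each frame against the stream's start time rather
// than chaining fixed 20 ms timeouts, so timer jitter never accumulates.
function nextDelay(startTime: number, framesSent: number, now: number): number {
  const idealNext = startTime + (framesSent + 1) * FRAME_MS;
  return Math.max(0, idealNext - now);
}
```

Anchoring each frame to the absolute start time matters: a naive `setInterval` drifts by a few milliseconds per tick, which is audible as stutter over a long call.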
0:03 | Agent Speaks
- OpenAI generates first audio chunk, triggering `_onAudio()`
- Sets `isAgentSpeaking = true`
- Calls `stopBackgroundAudio()` to cancel the timer
- Sends Twilio `clear` command to flush buffered background audio chunks
- Forwards agent audio to caller via `super._onAudio()`
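The override pattern behind these steps can be sketched as follows. `BaseTransport` is a hypothetical stub standing in for the SDK's TwilioRealtimeTransportLayer (whose real internals differ); the point is only the speaking-state bookkeeping wrapped around `super._onAudio()`.

```typescript
// Stub base class: the real TwilioRealtimeTransportLayer routes audio to Twilio.
class BaseTransport {
  forwardedChunks = 0;
  protected _onAudio(_chunk: Uint8Array): void {
    this.forwardedChunks++; // stub for forwarding audio to the caller
  }
}

class WithBackgroundAudio extends BaseTransport {
  isAgentSpeaking = false;
  clearsSent = 0;

  private stopBackgroundAudio(): void { /* cancel the 20 ms timer (elided) */ }
  private sendClear(): void { this.clearsSent++; /* send {"event":"clear"} (elided) */ }

  protected override _onAudio(chunk: Uint8Array): void {
    if (!this.isAgentSpeaking) {
      // First chunk of a response: pause background audio, flush the buffer
      this.isAgentSpeaking = true;
      this.stopBackgroundAudio();
      this.sendClear();
    }
    super._onAudio(chunk); // agent audio still reaches the caller unchanged
  }

  receive(chunk: Uint8Array): void { this._onAudio(chunk); } // demo entry point
}
```

Note that `clear` is sent once per response, on the first chunk only; subsequent chunks pass straight through to the base transport.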
0:06 | Agent Speech Generation Complete
- OpenAI emits `response.done` event
- `sendEndOfAudioMark()` injects a unique Twilio mark event into the audio stream
- Stores mark name in `pendingMarkName` and waits for confirmation
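A sketch of the mark injection: the message shape follows the Twilio Media Streams `mark` event, but the counter-based naming scheme here is an assumption, not necessarily the repo's exact code.

```typescript
// Sketch of sendEndOfAudioMark (naming scheme is a hypothetical example).
let markCounter = 0;
let pendingMarkName: string | null = null;

function sendEndOfAudioMark(streamSid: string, send: (raw: string) => void): string {
  const name = `end-of-audio-${++markCounter}`; // unique per response
  // Twilio queues the mark behind the buffered audio and echoes it back
  // only after everything ahead of it has been played to the caller.
  send(JSON.stringify({ event: 'mark', streamSid, mark: { name } }));
  pendingMarkName = name; // held until Twilio echoes the mark back
  return name;
}
```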
0:08 | Agent Speech Playback Complete
- Twilio confirms mark playback via WebSocket callback
- Verifies mark name matches `pendingMarkName`
- Sets `isAgentSpeaking = false` and clears `pendingMarkName`
- Calls `startBackgroundAudio()` to resume from current `backgroundPosition` – audio continues seamlessly without restarting
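The resume-on-mark handler can be sketched as below. The state field names follow the writeup; the handler wiring and the stale-mark guard are assumptions about how one would implement it.

```typescript
interface SpeakingState {
  isAgentSpeaking: boolean;
  pendingMarkName: string | null;
}

function onTwilioMark(state: SpeakingState, echoedName: string, resume: () => void): boolean {
  // Ignore marks that no longer match (e.g. from an already-cancelled response)
  if (state.pendingMarkName !== echoedName) return false;
  state.isAgentSpeaking = false;
  state.pendingMarkName = null;
  resume(); // startBackgroundAudio() continues from the saved backgroundPosition
  return true;
}
```

Checking the echoed name against `pendingMarkName` ensures a late-arriving mark from a superseded response cannot resume background audio mid-speech.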
0:11 | User Interrupts Agent
- OpenAI fires `audio_interrupted` event
- Immediately sets `isAgentSpeaking = false`, clears `pendingMarkName`, and resumes background audio without waiting for mark confirmation
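The interruption path skips the mark round-trip entirely, since the caller is already talking over the agent. A minimal sketch, with the same caveat that the names follow the writeup and the wiring is assumed:

```typescript
interface InterruptState {
  isAgentSpeaking: boolean;
  pendingMarkName: string | null;
}

function onAudioInterrupted(state: InterruptState, resume: () => void): void {
  state.isAgentSpeaking = false;
  state.pendingMarkName = null; // any in-flight mark is now stale
  resume(); // restart the 20 ms background-audio timer right away
}
```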
Getting Started
Audio Requirements
A sample background audio file (sample-background-mulaw-8khz.raw) is included in the repository. To use custom audio, files must be μ-law encoded at 8kHz. Convert existing files using FFmpeg:
```shell
ffmpeg -i input.mp3 -ar 8000 -ac 1 -acodec pcm_mulaw output.raw
```

Quick Start
```typescript
import { TwilioBackgroundAudioTransport } from './TwilioBackgroundAudioTransport';
import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';

// Create agent and transport with background audio
const agent = new RealtimeAgent({ name: 'Assistant', instructions: '...' });
const transport = new TwilioBackgroundAudioTransport({
  twilioWebSocket: connection,
  backgroundAudioPath: './sample-background-mulaw-8khz.raw' // included in repo
});

// Setup and connect
const session = new RealtimeSession(agent, { transport });
transport.setupSessionListeners(session);
await session.connect({ apiKey: process.env.OPENAI_API_KEY });
```