A vacation project that explores the OpenAI Agents SDK while integrating concepts closer to the capability frontier.
My goal was to combine multi-agent collaboration, realtime voice, Twilio's Media Streams API, MCP servers, and computer use into a practical demonstration.
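To make the plumbing concrete, here is roughly how the Twilio leg can be wired up in TypeScript: a Media Streams WebSocket handed to the Agents SDK's Twilio transport, which feeds a realtime session. This is a hedged sketch rather than the project's actual code; in particular, the `@openai/agents-extensions` package and the `twilioWebSocket` option are written from memory of the SDK docs and may differ in the current release.

```typescript
import { WebSocketServer } from 'ws';
import { RealtimeAgent, RealtimeSession } from '@openai/agents/realtime';
// NOTE: package and option names below are from memory and may differ.
import { TwilioRealtimeTransportLayer } from '@openai/agents-extensions';

// A bare agent for illustration; a fuller version with tools and a handoff
// is sketched further down.
const agent = new RealtimeAgent({
  name: 'Incident responder',
  instructions: 'You are joining a production incident call. Keep answers brief.',
});

// Twilio's <Connect><Stream> TwiML verb opens a WebSocket to this endpoint and
// streams the caller's audio as base64-encoded G.711 mu-law frames.
const wss = new WebSocketServer({ port: 8080, path: '/media-stream' });

wss.on('connection', async (twilioWebSocket) => {
  // The transport layer translates between Twilio media frames and the
  // Realtime API, so the session treats the phone call as ordinary audio I/O.
  const transport = new TwilioRealtimeTransportLayer({ twilioWebSocket });
  const session = new RealtimeSession(agent, { transport });
  await session.connect({ apiKey: process.env.OPENAI_API_KEY! });
});
```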
The problem
Picture a critical incident in production.
You gather vendors and system integrators on a Zoom call.
Now add the agents to that call: they take notes, answer routine questions by using MCP for system access, and carry out basic system changes through computer use.
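For a sense of what those agents look like in code, below is a hedged sketch of the agent side: a primary responder with a tool standing in for MCP-backed system access, and a handoff to a specialist that would drive the computer-use changes. The names (`lookupRunbook`, `opsSpecialist`, `incidentAgent`) are mine for illustration, not the project's, and the exact import paths are from memory of the Agents SDK and may differ.

```typescript
import { RealtimeAgent, tool } from '@openai/agents/realtime';
import { z } from 'zod';

// Stand-in for a lookup that the real project routes through an MCP server.
const lookupRunbook = tool({
  name: 'lookup_runbook',
  description: 'Fetch the runbook entry for a given service',
  parameters: z.object({ service: z.string() }),
  execute: async ({ service }) => {
    return `Runbook for ${service}: restart the ingest workers, then confirm the queue drains.`;
  },
});

// A specialist the primary agent can hand the conversation to, e.g. for
// changes that go through computer use rather than a quick spoken answer.
const opsSpecialist = new RealtimeAgent({
  name: 'Ops specialist',
  instructions: 'Carry out the requested system change and report back briefly.',
});

// The agent that joins the incident bridge.
export const incidentAgent = new RealtimeAgent({
  name: 'Incident responder',
  instructions:
    'You are on a production incident call. Take notes, answer routine status ' +
    'questions with your tools, and hand off to the ops specialist for changes.',
  tools: [lookupRunbook],
  handoffs: [opsSpecialist],
});
```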
Reflections
- The Agents SDK is sparsely documented. You will have to read the source code to get a real-world implementation to work.
- No parity between Python and TypeScript. I started with Python, but limitations in realtime voice agents and computer use forced me to switch to TypeScript. Reading the documentation again now, I can see some of those issues have already been partially resolved.
- Voice latency is a hurdle, especially when tools are involved. The demo below is sped up 4x.
- I never managed to make the agent acknowledge the question before diving into tool use, leaving an awkward silence at the start of every interaction.
- Visualizing the conversation is neither trivial nor available out-of-the-box. Transcripts arrive (understandably) late and need extra processing, and it is hard to line the conversation up with the computer-use screenshots, especially when several agents and tools are involved (a timestamping workaround is sketched after this list).
- Few external tracing options exist for TypeScript. I tried AgentOps but was underwhelmed; Python has more mature options.
- Reliability is okay if you catch and ignore the occasional error (also sketched below).
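For the last two points, the workaround I would sketch is to timestamp transcripts as they arrive, so they can be aligned with the computer-use screenshots afterwards, and to log session errors instead of letting them end the call. The event names used here (`transport_event`, `error`, and the underlying Realtime API transcript events) are assumptions from memory of the SDK, not a verified API.

```typescript
import type { RealtimeSession } from '@openai/agents/realtime';

type TimedEntry = { at: string; speaker: 'caller' | 'agent'; text: string };
const timeline: TimedEntry[] = [];

export function monitor(session: RealtimeSession) {
  // Transcripts trail the audio, so stamp each one on arrival; the timestamps
  // are what later let you line the conversation up with screenshots.
  // NOTE: event names are from memory and may differ in the current SDK.
  session.on('transport_event', (event: any) => {
    if (event.type === 'conversation.item.input_audio_transcription.completed') {
      timeline.push({ at: new Date().toISOString(), speaker: 'caller', text: event.transcript });
    }
    if (event.type === 'response.audio_transcript.done') {
      timeline.push({ at: new Date().toISOString(), speaker: 'agent', text: event.transcript });
    }
  });

  // Log and carry on; an occasional transient error otherwise ends the call.
  session.on('error', (err) => {
    console.error('realtime session error (ignored):', err);
  });
}
```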
Demo
A short video of the agent in action. You only see the conversation monitoring; the call itself happened from my phone through Twilio.
The code can be found here and here.