Blog

OpenAI unveils astonishing AI voices: what if your assistant spoke like a knight or a podcaster?

Artificial intelligence has taken a new step in its ability to understand and speak to us. OpenAI has just unveiled three new audio models that revolutionize speech recognition and voice synthesis. This advancement could change how we interact with virtual assistants daily.

Key takeaways:

  • OpenAI rolls out three new speech-to-text and text-to-speech models in its API.
  • Its goal is to help build more powerful, customizable, and intelligent voice AIs.
  • Its engineers aim to build the future of voice assistance, from customer service to transcription of spoken interactions.

Models that listen better than ever

Remember Whisper, OpenAI’s speech recognition system? Despite its strengths, it sometimes struggled with strong accents or noisy environments. That changes today with the arrival of two new models: gpt-4o-transcribe and gpt-4o-mini-transcribe.

These newcomers reduce the word error rate in recognition. Their secret? Intensive training on diverse audio datasets and the use of reinforcement learning. The result is impressive: even in a crowded café, with a strong accent, these models capture your words with unprecedented accuracy.

Comparative tests on the FLEURS benchmark (which evaluates speech recognition in over 100 languages) show that these models outperform not only Whisper but also competing solutions like Gemini-2.0-Flash or Scribe-v1.
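To make this concrete, here is a minimal sketch of what a request to one of these speech-to-text models looks like. The endpoint path and the `model`, `file`, and `response_format` fields follow OpenAI's public API documentation at the time of writing; treat the exact parameter names as assumptions to verify against the current docs before use.

```python
# Sketch: assembling a speech-to-text request for gpt-4o-transcribe.
# The real endpoint (https://api.openai.com/v1/audio/transcriptions)
# expects multipart/form-data with the audio file attached; this helper
# only collects the parameters so they are easy to inspect and test.

def build_transcription_request(audio_path: str,
                                model: str = "gpt-4o-transcribe") -> dict:
    """Collect the fields for a transcription request.

    Swap in "gpt-4o-mini-transcribe" for the lighter, cheaper variant.
    """
    return {
        "model": model,
        "file": audio_path,          # e.g. an .mp3 or .wav recording
        "response_format": "text",   # ask for plain-text transcript
    }

# Example: parameters for transcribing a noisy café recording.
req = build_transcription_request("cafe_interview.wav")
print(req["model"])
```

In a real application you would send these fields with your API key using the official `openai` client or any HTTP library; the point here is simply that switching from Whisper is a one-line change of the `model` field.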

Voices that can adapt to every situation

On the voice synthesis side, OpenAI is making a big splash with its third model: gpt-4o-mini-tts. The big innovation? You can now “instruct” the model on how to express itself. Imagine asking your assistant to:

  • Speak like a medieval knight to tell a story,
  • Adopt a professional tone for a presentation,
  • Use a gentle voice for a bedtime story…

This customization opens fascinating possibilities! A customer service agent could adjust its tone to the situation—reassuring when there’s a problem, enthusiastic when presenting a new product.
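The styles from the list above map directly onto a free-form `instructions` field in the synthesis request. The sketch below shows the shape of that request body; the endpoint (`/v1/audio/speech`) and field names follow OpenAI's documentation at the time of writing, and the voice name "alloy" is just one of the built-in presets, so check the current docs before relying on them.

```python
import json

def build_speech_request(text: str, style_instructions: str) -> dict:
    """Build the JSON body for a text-to-speech request.

    `instructions` is the new knob described above: free-form guidance
    on HOW the model should speak, separate from WHAT it says.
    """
    return {
        "model": "gpt-4o-mini-tts",
        "voice": "alloy",            # one of the built-in voice presets
        "input": text,               # what to say
        "instructions": style_instructions,  # how to say it
    }

# The medieval-knight example from the list above:
knight = build_speech_request(
    "Gather round, and hear the tale of the dragon of the northern hills.",
    "Speak like a medieval knight telling a heroic story.",
)
print(json.dumps(knight, indent=2))
```

The same text with `"Use a gentle, soothing voice for a bedtime story."` as the instruction would yield a completely different delivery, which is exactly the customer-service flexibility described below.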


The culmination of an "agentic" strategy

These models are part of a broader vision. In recent months, OpenAI has multiplied launches focused on autonomy: Operator, Deep Research, Computer-Using Agents… The goal? To create assistants capable of carrying out complex tasks independently.

Adding advanced voice capabilities was the missing piece: "For agents to be truly useful, people need to be able to have deeper, more intuitive interactions beyond text," OpenAI explains in its blog post.

The combination of speech recognition and synthesis models now makes it possible to build full conversational agents. To streamline this process, OpenAI even launched an integration with its Agents SDK.
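The speech-to-text → reasoning → text-to-speech chain can be sketched as a single agent turn. The `call_api` transport function below is hypothetical (supplied by the caller, wrapping urllib or the official SDK), and the model names are the ones this article describes; the sketch shows only how the three endpoints chain together, not the wire protocol.

```python
def voice_agent_turn(audio_in: bytes, call_api) -> dict:
    """One turn of a minimal voice agent: transcribe, think, speak.

    `call_api(endpoint, payload)` is a hypothetical helper that performs
    the HTTP call and returns the response body. This sketch only wires
    the three stages together.
    """
    # 1. Speech recognition: audio in, transcript out.
    transcript = call_api("audio/transcriptions",
                          {"model": "gpt-4o-transcribe", "file": audio_in})
    # 2. Reasoning: feed the transcript to a chat model.
    reply = call_api("chat/completions",
                     {"model": "gpt-4o-mini",
                      "messages": [{"role": "user", "content": transcript}]})
    # 3. Voice synthesis: speak the reply aloud.
    speech = call_api("audio/speech",
                      {"model": "gpt-4o-mini-tts", "voice": "alloy",
                       "input": reply})
    return {"transcript": transcript, "reply": reply, "audio": speech}
```

OpenAI's Agents SDK packages this kind of loop for you; the sketch is only meant to show why the new recognition and synthesis models are the bookends of a full conversational agent.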

Impressive technical innovations

Under the hood, these models benefit from several advances: pretraining on specialized audio datasets, advanced “distillation” techniques to transfer knowledge from large models to lighter versions, and reinforcement learning to improve accuracy.

These models build on the GPT-4o and GPT-4o-mini architectures that power ChatGPT, already recognized for their performance. This solid foundation, combined with training specifically for audio, explains their exceptional capabilities.

And tomorrow?

OpenAI does not intend to stop there. The company is already working on further improvements, notably the ability for developers to use their own customized voices. Video is also among the next frontiers. The objective: to create “agentic multimodal” experiences capable of integrating text, audio and video.

These advances raise questions about how we will interact with AI in the years to come. The text interfaces that dominate today could give way to natural conversations, where AI understands us and responds with appropriate vocal nuances.

OpenAI appears to have pulled ahead in the race for natural interaction. These audio models, available now through the company's API, could reshape our daily relationship with technology. Can you imagine chatting with your assistant like a friend who adjusts its tone to your current needs? This reality has never been closer.

The article "OpenAI unveils astonishing AI voices: what if your assistant spoke like a knight or a podcaster?" was published on the site Abondance.