Microsoft has announced three new foundational AI models—MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2—marking the most concrete evidence yet that the $3 trillion software giant intends to compete directly with OpenAI and Google on model development, not just distribution.
The announcement represents the first major output from Microsoft’s superintelligence team, formed just six months ago by Mustafa Suleyman to pursue what he calls “AI self-sufficiency.” The timing is notable: Microsoft’s stock just closed its worst quarter since the 2008 financial crisis, as investors demand proof that hundreds of billions in AI infrastructure spending will translate into revenue.
MAI-Transcribe-1: Best-in-Class Speech Recognition
MAI-Transcribe-1 achieves the lowest average Word Error Rate (WER) on the FLEURS benchmark across the top 25 languages by Microsoft product usage, at 3.8%. According to Microsoft’s benchmarks, it beats OpenAI’s Whisper-large-v3 on all 25 languages, Google’s Gemini 3.1 Flash on 22 of 25, and both ElevenLabs’ Scribe v2 and OpenAI’s GPT-Transcribe on 15 of 25 each.
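For readers unfamiliar with the metric, Word Error Rate is conventionally the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not Microsoft’s evaluation code; the function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization with no text normalization, whereas published benchmarks typically normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Assumption: words are split on whitespace and compared exactly,
    with no case folding or punctuation stripping.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition, a 3.8% average means roughly four word-level errors per hundred reference words, averaged across the 25 languages.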
“I’m very excited that we’ve now got the first models out, which are the very best in the world for transcription,” Suleyman told VentureBeat. “Not only that, we’re able to deliver the model with half the GPUs of the state-of-the-art competition.”
The model uses a transformer-based text decoder with a bi-directional audio encoder, accepts MP3, WAV, and FLAC files up to 200MB, and delivers batch transcription speeds 2.5 times faster than existing Azure Fast offerings. Microsoft is already deploying MAI-Transcribe-1 internally in Copilot’s Voice mode and Microsoft Teams for conversation transcription.
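A client could pre-check those input limits before submitting audio. The following is a minimal sketch under stated assumptions: the helper `check_audio_upload` is hypothetical and not part of any Microsoft SDK, it checks only the file extension rather than the actual encoding, and it treats “200MB” as 200,000,000 bytes, which the announcement does not specify.

```python
from pathlib import Path

# Formats named in the announcement.
ALLOWED_EXTENSIONS = {".mp3", ".wav", ".flac"}

# Assumption: "200MB" means decimal megabytes; the service may define it differently.
MAX_BYTES = 200 * 1000 * 1000


def check_audio_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file passes both checks."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes, above the {MAX_BYTES}-byte limit")
    return problems


if __name__ == "__main__":
    print(check_audio_upload("standup.flac", 48_000_000))
    print(check_audio_upload("standup.ogg", 250_000_000))
```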
MAI-Voice-1 and MAI-Image-2
MAI-Voice-1 is Microsoft’s text-to-speech model, capable of generating 60 seconds of natural-sounding audio in a single second. The model preserves speaker identity across long-form content and supports custom voice creation from just a few seconds of audio through Microsoft Foundry, priced at $22 per 1 million characters.
MAI-Image-2 debuted as a top-three model family on the Arena.ai leaderboard and delivers at least 2x faster generation times on Foundry and Copilot compared to its predecessor. Microsoft is rolling it out across Bing and PowerPoint, pricing it at $5 per 1 million tokens for text input and $33 per 1 million tokens for image output. WPP, one of the world’s largest advertising holding companies, is already building with MAI-Image-2 at scale.
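The quoted list prices make rough budgeting straightforward. The sketch below uses only the rates reported above ($22 per 1 million characters for MAI-Voice-1, and $5 and $33 per 1 million tokens for MAI-Image-2 text input and image output). The function names are hypothetical, and the token counts are caller-supplied assumptions, since the announcement does not say how many tokens a typical prompt or image consumes.

```python
# List prices as reported in the announcement (USD).
VOICE_USD_PER_MILLION_CHARS = 22.0
IMAGE_TEXT_INPUT_USD_PER_MILLION_TOKENS = 5.0
IMAGE_OUTPUT_USD_PER_MILLION_TOKENS = 33.0


def voice_cost_usd(characters: int) -> float:
    """Estimated MAI-Voice-1 cost for synthesizing the given number of characters."""
    return characters / 1_000_000 * VOICE_USD_PER_MILLION_CHARS


def image_cost_usd(text_input_tokens: int, image_output_tokens: int) -> float:
    """Estimated MAI-Image-2 cost; both token counts are supplied by the caller."""
    input_cost = text_input_tokens / 1_000_000 * IMAGE_TEXT_INPUT_USD_PER_MILLION_TOKENS
    output_cost = image_output_tokens / 1_000_000 * IMAGE_OUTPUT_USD_PER_MILLION_TOKENS
    return input_cost + output_cost


if __name__ == "__main__":
    # Illustrative volumes only, not figures from Microsoft.
    print(f"Voice, 500,000 characters: ${voice_cost_usd(500_000):.2f}")
    print(f"Image, 1M input + 2M output tokens: ${image_cost_usd(1_000_000, 2_000_000):.2f}")
```

These estimates ignore any volume discounts, free tiers, or regional pricing, none of which the announcement describes.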
The End of Microsoft’s OpenAI Exclusivity
To understand why these models matter, it helps to understand the contractual shift that made them possible. Until October 2025, Microsoft was contractually prohibited from independently pursuing artificial general intelligence. The original 2019 deal with OpenAI gave Microsoft a license to OpenAI’s models in exchange for building cloud infrastructure, but when OpenAI sought to expand its compute footprint beyond Microsoft, Microsoft renegotiated.
“Back in September of last year, we renegotiated the contract with OpenAI, and that enabled us to independently pursue our own superintelligence,” Suleyman explained. The revised agreement frees Microsoft to build its own frontier models while retaining license rights to everything OpenAI builds through 2032.
The three new MAI models are available immediately through Microsoft Foundry and a new MAI Playground, signaling that the era of Microsoft relying solely on OpenAI for frontier AI capabilities has definitively ended.