With today’s live stream, OpenAI debuted GPT-4o, or GPT Omni, the latest model in an increasingly crowded AI space. At first glance, GPT-4o is remarkable: an AI model that is faster and less expensive than its predecessor. The most striking feature is its new real-time voice capabilities. Watching the demo gives you uncanny echoes of the movie Her, where the AI character Samantha comes across as entirely relatable and ultimately substitutes for human companionship.
Okay, so let’s say it: OpenAI has shipped Samantha.
If you missed today’s live stream, it’s archived here.
OpenAI’s GPT-4o
While there was endless speculation that OpenAI would release an AI-powered search platform (or even an early version of GPT-5), those announcements were left for later this year. Instead, we got an upgraded version of GPT-4 that is significantly faster and less expensive. Here is OpenAI’s description of the new capabilities of GPT-4o:
GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, and image and generates any combination of text, audio, and image outputs. It can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is similar to human response time in a conversation. It matches GPT-4 Turbo performance on text in English and code, with significant improvement on text in non-English languages, while also being much faster and 50% cheaper in the API. GPT-4o is especially better at vision and audio understanding compared to existing models.
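Since OpenAI says GPT-4o is already available in the API at half the price of GPT-4 Turbo, here is a rough idea of what a multimodal request could look like. This is a minimal sketch, assuming the official openai Python SDK (v1+) with an API key set in your environment; the prompt and image URL are hypothetical placeholders, not part of OpenAI’s announcement.

```python
# Minimal sketch: a text + image request to GPT-4o via OpenAI's
# chat completions API. Assumes the official `openai` Python SDK (v1+)
# with OPENAI_API_KEY set in the environment; the image URL below is a
# hypothetical placeholder.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            # A single message can mix text and image parts.
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```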
Astonishing Voice Capabilities
If we’ve been living in an era where people talk to their friends as they walk down the street, we’re entering a new world where they’ll be talking to AI. By responding in milliseconds, GPT-4o eliminates the delay we’ve accepted as the norm throughout the digital age, where a computer listens to your input, processes it, and then responds after a pause. The new voice capabilities are good enough that you soon won’t be able to tell whether a human or an AI is on the other end of the conversation.
As OpenAI notes,
Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average.
The voice features will revolutionize customer service and support in business and other organizations. As one commenter (@wiffyclyro6477) put it in response to another OpenAI video,
That might sound fine from the perspective of advanced economies, but it could have a devastating impact on the developing world.
The Impact on Education
But don’t get too comfortable. The new voice capabilities also have implications for education, from designing services for students and others in the university community to the role of faculty itself. While some faculty may want to bury their heads in the sand, it’s time for serious reflection on the value-add of a live instructor. We won’t see faculty replaced at the moment, but look down the road to the future. If you can’t see this as a possibility, you’d best clean the lenses on your binoculars.
Of course, we’re still wrestling with integrating basic text-based AI platforms into the learning experience, a challenge that’s played out with mixed results in both K12 and Higher Ed. And now, here we are, leapfrogging ahead to a world of near-instantaneous voice response. As OpenAI’s CEO Sam Altman said on his blog today,
The original ChatGPT showed a hint of what was possible with language interfaces; this new thing feels viscerally different. It is fast, smart, fun, natural, and helpful.
That it’s fast and intelligent will make it essential in learning environments; that it’s fun and natural will make it completely seductive.
Cheaper and Faster AI
The speed with which GPT-4o responded in OpenAI’s demo today is astonishing. But the announcement that it will be available to free users raised questions about whether it’s worth continuing to pay for a subscription. Heavy users will want unlimited access, but others may find the free version sufficient for their needs. That was a surprising development, given that OpenAI – and everyone else – is still grappling with how to monetize AI services.
[UPDATE:] From what we understand, OpenAI is currently saying that a Plus account will get 5x higher usage, or about 80 uses per three hours on GPT-4o. If that’s the case, free users will be limited to around 16 uses every three hours. That lower limit will be a deal-breaker for most power users, especially people who want to integrate AI into their everyday lives, so we suspect most Plus users will keep their subscriptions.
On the other hand, the increase in speed and lower cost will accelerate the adoption of AI globally. Thanks to its low hardware and bandwidth requirements, ChatGPT took off well beyond technically advanced countries. That’s a double-edged sword: it’s amazing to see the technology in everyone’s hands, but it’s concerning, as very few countries have come to grips with regulating the misuse of Generative AI.
An AI Demo With A Few Glitches
Of course, not everything went smoothly in today’s unveiling of GPT-4o, and credit to OpenAI for risking a live demo at this early stage. The audio occasionally cut out even though the team was working with a wired connection to a laptop. And as Yahoo! Finance noted, after the model solved a math problem,
it chimed in with a flirtatious-sounding voice: ‘Wow, that’s quite the outfit you’ve got on.’
Whether that’s a glitch or simply a peek into what our future holds is hard to say.
Future Developments
OpenAI is determined to maintain its preeminent role in AI developments. Unsurprisingly, today’s announcements were scheduled one day before Google’s I/O conference, where we expect to see even more AI capabilities. For now, the faster speed is available to all users; GPT-4o’s new vision and voice capabilities will roll out over the next few weeks.
But we know this is only the beginning of our AI journey. As Sam Altman put it,
. . . we’ll have more stuff to share soon 🙂
Indeed, they will.
Emory Craig is a writer, speaker, and consultant specializing in virtual reality (VR) and generative AI. With a rich background in art, new media, and higher education, he is a sought-after speaker at international conferences. Emory shares unique insights on innovation and collaborates with universities, nonprofits, businesses, and international organizations to develop transformative initiatives in XR, GenAI, and digital ethics. Passionate about harnessing the potential of cutting-edge technologies, he explores the ethical ramifications of blending the real with the virtual, sparking meaningful conversations about the future of human experience in an increasingly interconnected world.