New OpenAI Developments: ChatGPT Can Now "See, Hear, and Speak"

OpenAI unveiled GPT-4V today, a model that combines generative AI capabilities with vision-based processing. The latest OpenAI development could transform human-AI interaction. We are moving from communicating with AI to conversing with it. With a state-of-the-art text-to-speech model, the new ChatGPT Plus release will listen to us and respond in human-like voices.

The new GPT-4 with vision (GPT-4V) framework is not GPT-5, even though users will be tempted to conflate them. GPT-4V embodies a unique conceptual trajectory, serving as the foundational element of the enhanced multimodal version of GPT-4 that OpenAI foreshadowed earlier this academic year. It will pave the way for more immersive pedagogical and research applications – and compound the challenges education is already grappling with.

It’s hard to fathom the implications of OpenAI’s development. With more seamless interactions, ChatGPT could become ubiquitous in our lives. We wonder how it will impact Google, which is deeply challenged by the artificial intelligence developments of the past ten months. Of course, Google could easily mirror what OpenAI has achieved, but it would come at the expense of its massive advertising revenue. It’s getting harder and harder to see a future for search outside of a generative AI framework.

Wikipedia may also be affected. It’s hard to imagine going to Wikipedia when I can query ChatGPT and quickly get more comprehensive results that include Wikipedia and other sources. As we’ve seen, new AI developments always bring a host of ethical implications. If the past ten months in AI have seemed like a whirlwind, they may turn out to be the lull before the storm.

OpenAI Developments

The new ChatGPT update was revealed earlier today and will initially be rolled out to ChatGPT Enterprise and Plus subscribers over the next two weeks. Access for other user groups will follow on a yet-to-be-announced timeline. ChatGPT’s newly integrated multimodal conversational attributes expand the possibilities for interacting with generative AI. Yes, it will be able to “see” (using that term broadly), “hear” through speech recognition, and “talk,” responding to you much as a human being would.

If ChatGPT has felt something like a novelty item up to now, the new OpenAI development makes it feel like a companion, like a best friend ready to answer any question and solve any problem.

ChatGPT can now see, hear, and speak in a major upgrade to the platform.

The voice function will let users ask ChatGPT questions via a microphone, similar to how you would speak to Amazon’s Alexa or Apple’s Siri. Here is OpenAI’s description of the voice function on its blog.

The new voice capability is powered by a new text-to-speech model, capable of generating human-like audio from just text and a few seconds of sample speech . . . We collaborated with professional voice actors to create each of the voices. We also use Whisper, our open-source speech recognition system, to transcribe your spoken words into text.

To use this feature, you’ll have to enable the functionality, and in return, it will also allow ChatGPT to respond with human-like audio in one of five distinct voices. If you want to check out the new ChatGPT voices, there are samples of them on OpenAI’s blog.

How The New Features Will Be Used

Of course, there are a myriad of ways the features of the new upgrade could be used. OpenAI strikes a home and family chord in its blog by suggesting some basic tasks.

When you’re home, snap pictures of your fridge and pantry to figure out what’s for dinner (and ask follow-up questions for a step-by-step recipe). After dinner, help your child with a math problem by taking a photo, circling the problem set, and having it share hints with both of you.

They also suggest using it as a travel guide: snap a picture of a landmark and have a live conversation with AI about it. But you can see thousands of advanced uses as well. An architect or historian could capture images of buildings and engage with ChatGPT to delve into their significance or design origins, perhaps even taking it a step further and asking for refinements to the structure’s design. You could snap a photo in a restaurant, and then another of your dinner, having your own culinary guide and critic with you at all times. The possibilities boggle the mind.

OpenAI Developments for DALL-E 3

Complementing this is the introduction of DALL-E 3, OpenAI’s image generation tool. By merging image creation with natural language processing, researchers and creators can engage in detailed discussions with the model to hone its outputs. Its collaboration with ChatGPT further expands its utility in crafting intricate image prompts and brings together visual and voice-based AI, which started out as separate realms.

Ethical Implications of Conversational AI

While the new OpenAI developments are undeniably groundbreaking, they usher in a series of ethical challenges. We could encapsulate these in the following list as a basic framework, but it’s not exhaustive:

Data Privacy: With AI models capable of processing and interpreting visual data, concerns about user privacy come to the forefront. Having to type in everything you share with ChatGPT somewhat limits what you share (unless you’re just cutting and pasting). As the new OpenAI developments move us toward interacting with AI apps and platforms through conversation, it becomes much easier to share data, especially personal data. Knowing what measures are in place to ensure data security becomes increasingly urgent.
Bias and Representation: We know AI models are trained on vast datasets and can inadvertently inherit and propagate biases present in those datasets. Generative AI improves if platforms monitor the input and output. Ensuring that the AI’s image and conversational recognition tools are free from prejudice is imperative.
Misinformation and Manipulation: There are already deep concerns about how enhanced image generation tools, like DALL-E 3, can be exploited to create deceptive or manipulated images that could foster the spread of misinformation. But conversational AI with realistic speech could take that even further, generating instant responses that could negatively shape human behavior. Imagine a future in which fake social media accounts and automated bots seem antiquated, the rudimentary features of a previous digital era.
Dependence on AI: Conversational AI could radically expand our dependence on ChatGPT as it becomes more intuitive and integrated into our daily lives. Will we see an over-reliance on AI for decision-making? Will it eclipse human judgment and intuition?
Economic Impacts: No one knows how advanced AI tools will affect the professional sector, but they could lead to workforce displacement and unemployment. After seeing the demos today, it’s easy to imagine that conversational AI will also affect jobs that require manual skills.

Of course, the OpenAI developments announced today also bring other issues to the foreground. All of that data that will now be spoken back to you came from somewhere. It was scraped from news media sites, online resources, databases, and even some from here – Digital Bodies. And Wikipedia was another key source of information for the development of GPT. No one was compensated for the scraped data that will now be spontaneously spoken back to you. And that makes you wonder about the future of sites such as Wikipedia, which were always a labor of love but not done to enrich other companies financially.

AI Will Reveal Who We Really Are

As we have said before, the OpenAI developments reveal that we are just getting underway in this new AI landscape. The impact on education, work, and personal lives will be profound and far-reaching. Today’s announcement of an upgraded ChatGPT that can see, hear, and speak is only a glimpse of the doorway into our future. How we embrace and manage it will say everything about who we are as a society and global community.

Emory Craig

Emory Craig is a writer, speaker, and consultant specializing in virtual reality (VR) and artificial intelligence (AI) with a rich background in art, new media, and higher education. A sought-after speaker at international conferences, he shares his unique insights on innovation and collaborates with universities, nonprofits, businesses, and international organizations to develop transformative initiatives in XR, AI, and digital ethics. Passionate about harnessing the potential of cutting-edge technologies, he explores the ethical ramifications of blending the real with the virtual, sparking meaningful conversations about the future of human experience in an increasingly interconnected world.
