OpenAI has upped the ante in the highly competitive <a href="https://www.thenationalnews.com/future/technology/2024/05/09/germany-based-ai-translation-unicorn-deepl-seeks-to-expand-in-middle-east/" target="_blank">generative artificial intelligence</a> world by introducing a new model it hopes will attract more users to its platform and fend off all challengers.

GPT-4o is an updated version of the underlying large language model technology that powers <a href="https://www.thenationalnews.com/arts-culture/books/2024/01/20/japan-chatgpt-author-winner/" target="_blank">ChatGPT</a>. Last week it was rumoured that the announcement would be a search engine to challenge Google, but Reuters reported that OpenAI had delayed it. OpenAI chief executive Sam Altman denied any launches – only to post on X that the company has "been hard at work on some new stuff we think people will love".

The "o" in the name stands for "omni", and the California-based company is touting GPT-4o as something for everyone – fitting, since "omni" means "all" or "everything". Does OpenAI want to be omnipresent in our lives?

Short answer: GPT-4o, according to OpenAI, is its "new flagship model that can reason across audio, vision and text in real time". Shorter answer: it is OpenAI's fastest AI model.

The "omni" name refers to "a step towards much more natural human-computer interaction", OpenAI said in a blog post on Monday. The model is natively multimodal, meaning it can accept any combination of text, audio and images as input, and generate any combination of text, audio and images as output.

OpenAI claims GPT-4o can respond to audio inputs in as little as 232 milliseconds, with an average of 320 milliseconds – similar, according to several studies, to human response time in a conversation.

GPT-4o also requires fewer tokens in many languages. Tokens are the basic units of text an AI model processes, and can include punctuation marks and spaces; token counts vary from one language to another. Among the languages OpenAI highlighted as needing fewer tokens with GPT-4o are Arabic (from 53 to 26), Gujarati (145 to 33), Hindi (90 to 31), Korean (45 to 27) and Chinese (34 to 24) – the kind of figures that can be checked with a few lines of code, as shown below.

For perspective, consider a 1968 study by Robert Miller – <i>Response time in man-computer conversational transactions</i> – which set out three thresholds of computer responsiveness. A response time of 100 milliseconds is perceived as instantaneous, one second or less is fast enough for users to feel they are interacting freely with the information, and a response time of more than 10 seconds loses the user's attention completely.

So how did GPT-4o get so fast? The simplest answer is that OpenAI, well, simplified the process of converting input into output. Before GPT-4o, users talked to ChatGPT through Voice Mode, with average latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4). Voice Mode relied on a pipeline of three separate models: a simple model transcribed audio to text, GPT-3.5 or GPT-4 took in that text and produced a text reply, and a third simple model converted the reply back to audio.

"This process means that the main source of intelligence, GPT-4, loses a lot of information – it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion," OpenAI said.
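The token figures above can be checked in rough form with OpenAI's open-source tiktoken library, which ships both the cl100k_base encoding used by GPT-4 and the new o200k_base encoding used by GPT-4o. The sketch below is illustrative only: the sample sentences are made up for this example, not OpenAI's demo prompts, so the exact counts will differ from the published figures.

<pre><code># A rough check of how GPT-4o's new tokeniser compresses non-English text.
# Requires the open-source tiktoken package, version 0.7 or later
# (pip install tiktoken). The sample sentences are illustrative only.
import tiktoken

gpt4_enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-4 / GPT-4 Turbo
gpt4o_enc = tiktoken.get_encoding("o200k_base")   # new encoding used by GPT-4o

samples = {
    "English": "Hello, my name is GPT-4o. I am a new kind of language model.",
    "Arabic": "مرحباً، اسمي جي بي تي. أنا نوع جديد من نماذج اللغة.",
    "Hindi": "नमस्ते, मेरा नाम जीपीटी है। मैं एक नई तरह का भाषा मॉडल हूँ।",
}

for language, text in samples.items():
    before = len(gpt4_enc.encode(text))    # token count under the old encoding
    after = len(gpt4o_enc.encode(text))    # token count under the new encoding
    print(f"{language}: {before} tokens with GPT-4, {after} tokens with GPT-4o")
</code></pre>

Fewer tokens for the same text means lower cost and more room in the model's context window, which is why the change matters most for languages such as Gujarati and Hindi.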
With GPT-4o, by contrast, OpenAI merged those functions into a single model with end-to-end capabilities across text, vision and audio, significantly reducing both the time consumed and the information lost along the way. "All inputs and outputs are processed by the same neural network," OpenAI said. A neural network is an AI technique that teaches computers to process data in a way modelled loosely on the human brain.

Still, OpenAI said it is "still just scratching the surface" of GPT-4o's capabilities and limitations, given that it is the company's first model to merge all of these modalities. Speaking of limitations, OpenAI acknowledged "several" of them, including inconsistencies in responses featured in a blooper reel. It even demonstrated how adept GPT-4o can be at sarcasm.

In addition, OpenAI said it continues to refine the model's behaviour through post-training, which is critical in addressing safety concerns – a key sticking point in modern-day AI. The company said it has created new safety systems to serve as guardrails for voice outputs, and has tested the model with more than 70 experts in the fields of social psychology, bias, fairness and misinformation to identify any risks that may seep through. "We will continue to mitigate new risks as they’re discovered. We recognise that GPT-4o’s audio modalities present a variety of novel risks," OpenAI said.

How much does it cost? Good news – it's free for all users, with paid users enjoying "up to five times the capacity limits" of their free peers, OpenAI chief technology officer Mira Murati said in the unveiling presentation. Developers who use GPT-4o through OpenAI's API, meanwhile, pay $5 per one million input tokens and $15 per one million output tokens.

Making GPT-4o free to use should serve OpenAI well and complement the company's other paid offerings. In August, <a href="https://www.thenationalnews.com/business/technology/2023/08/29/chatgpts-new-paid-business-tier-all-you-need-to-know/" target="_blank">OpenAI launched its ChatGPT Enterprise monthly plan</a>, the price of which varies depending on user requirements; it is the third tier after the basic free service and the $20-a-month Plus plan. In January, the company <a href="https://www.thenationalnews.com/business/technology/2024/01/11/chatgpt-maker-openai-launches-store-to-help-users-customise-chatbots/" target="_blank">launched its online ChatGPT Store</a>, which gives users access to more than three million custom versions of GPTs developed by <a href="https://www.thenationalnews.com/business/technology/2023/08/29/chatgpts-new-paid-business-tier-all-you-need-to-know/" target="_blank">OpenAI's partners and its community</a>.

OpenAI hopes to attract more users as competition heats up in the generative AI world – and there are plenty of challengers coming for it. Introducing a new, free and faster large language model is an indication of just how full the company's hands are. Google, arguably its biggest rival in the space, has Gemini, which was the first AI model to beat human experts on massive multitask language understanding (MMLU), one of the most widely used benchmarks for testing the knowledge and problem-solving abilities of AI.
Gemini can be accessed on the Google One AI Premium plan for $19.99 a month, which includes 2TB of storage, 10 per cent back on purchases made on the Google Store, and more features across <a href="https://www.thenationalnews.com/future/technology/2024/04/01/gmails-20th-anniversary-technologys-best-april-fools-prank-yet/" target="_blank">Gmail</a>, Google Docs, Google Slides and Google Meet. <a href="https://www.thenationalnews.com/business/technology/2024/02/21/what-is-gemma-google-ai/" target="_blank">In February, it launched Gemma</a>, aimed at assisting developers and researchers in “building AI responsibly” and geared towards more modest tasks such as basic chatbots or summarisation jobs.

Anthropic, meanwhile, <a href="https://www.thenationalnews.com/future/technology/2024/03/05/is-anthropics-new-chatbot-claude-3-a-worthy-rival-to-openais-chatgpt/" target="_blank">in March launched Claude 3</a> – its direct challenge to generative AI leader OpenAI. The company, which is backed by both Google and Amazon, offers three tiers – Haiku, Sonnet and Opus – each with increasing capabilities to suit different user needs. Haiku is priced at $0.25 per million tokens for input and $1.25 for output, Sonnet costs $3 and $15, and Opus is the most expensive at $15 and $75. For comparison, OpenAI’s GPT-4 Turbo comes in at $10 for input and $30 for output, with a smaller context window of 128,000 tokens.

Microsoft, OpenAI's biggest backer, charges $20 a month for its Copilot Pro service, which guarantees faster performance and "everything" the service offers. For those not willing to pay, there is a free Copilot tier with more limited functionality.

And then there is xAI's Grok, from <a href="https://www.thenationalnews.com/business/technology/2024/03/11/elon-musk-says-xais-grok-will-become-open-source-and-calls-openai-a-lie/" target="_blank">OpenAI's friend-turned-enemy, Elon Musk</a>. Grok's current version, Grok-1.5, is available only to subscribers of X's Premium+ tier, which starts at $16 a month, or $168 a year.

Regional players are also taking aim at the leaders. On Monday, Abu Dhabi's <a href="https://www.thenationalnews.com/business/technology/2023/09/06/abu-dhabis-tii-launches-falcon-180b-model-to-boost-generative-ai-development/" target="_blank">Technology Innovation Institute</a> introduced <a href="https://www.thenationalnews.com/future/technology/2024/05/13/abu-dhabis-tii-unveils-falcon-2-series-that-aims-to-take-on-google-and-meta/" target="_blank">the second iteration of its large language model, Falcon 2</a>, to compete with models developed by Meta, Google and OpenAI. Also on Monday, Core42, a unit of Abu Dhabi artificial intelligence and cloud company G42, launched a bilingual Arabic and English chatbot developed in the UAE, <a href="https://www.thenationalnews.com/future/technology/2024/05/13/uae-chatbot-jais-chat/" target="_blank">Jais Chat</a>. It can be downloaded and used free on Apple's iPhones.
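To put those per-million-token prices side by side, here is a small sketch that totals what a hypothetical workload – two million input tokens and half a million output tokens a month – would cost on each of the models priced above. The workload size is an assumption for illustration; only the prices come from the figures quoted in this article.

<pre><code># Rough monthly API cost comparison using the per-million-token prices quoted above.
# The workload (2 million input tokens, 0.5 million output tokens) is a made-up example.
PRICES = {                      # model: (input $, output $) per million tokens
    "GPT-4o": (5.00, 15.00),
    "GPT-4 Turbo": (10.00, 30.00),
    "Claude 3 Haiku": (0.25, 1.25),
    "Claude 3 Sonnet": (3.00, 15.00),
    "Claude 3 Opus": (15.00, 75.00),
}

INPUT_MILLIONS = 2.0            # millions of input tokens in the hypothetical workload
OUTPUT_MILLIONS = 0.5           # millions of output tokens in the hypothetical workload

for model, (input_price, output_price) in PRICES.items():
    total = INPUT_MILLIONS * input_price + OUTPUT_MILLIONS * output_price
    print(f"{model}: ${total:.2f}")
</code></pre>

On those assumptions, GPT-4o would come in at $17.50, against $35 for GPT-4 Turbo and $67.50 for Claude 3 Opus – a reminder that headline subscription prices and per-token API prices measure very different things.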