
What’s New With GPT-4: Features and Limitations

Written by Norbert Fijałek | Apr 4, 2023

GPT-4 reportedly boasts features and functionalities that blow its previous iterations out of the water, further challenging how we work and create. While still a work in progress, this next generation of AI technology not only upends paradigms but also complicates our relationship with technology. Understanding its value, recognizing the proper context for its use, and knowing its limits can help organizations and individuals appreciate this breakthrough.

Being able to converse with an AI tool as if it were a person is nothing short of amazing. The free version of ChatGPT, OpenAI's chatbot based on GPT-3.5, can answer general knowledge questions, hold a conversation, and write all sorts of things, such as essays, blog posts, and poems.

If you think that’s incredible, wait until you hear about GPT-4.

 

GPT-4’s features

As of this writing (March 2023), we’re seeing GPT-4 outclass its predecessors by leaps and bounds. That’s because GPT-4 can pull off the following feats:

GPT-4 now accepts visual inputs. Being large language models (LLMs), previous iterations of GPT only accepted text inputs. However, GPT-4 is a large multimodal model, which means that it can accept images as inputs and respond to prompts regarding those images. To illustrate, users can ask GPT to describe an image in great detail, explain charts and graphs, and read text contained in images.
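To make this concrete, here is a minimal sketch of what an image-plus-text prompt could look like through OpenAI’s Python SDK. Note that when this post was written, image input had only been demonstrated, not released to the public API, so the model name, availability, and exact request format shown here are assumptions.

```python
# Hedged sketch: sending an image alongside a text prompt to a GPT-4 model.
# Assumes the OpenAI Python SDK (openai>=1.0) and an image-capable model;
# the model name and image URL are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",  # assumed image-capable model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this chart and summarize its main trend."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```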

GPT-4 commits fewer inaccuracies. As a chatbot that generates textual responses to a user’s prompts, GPT-4 has been shown to be more factual and less prone to errors in reasoning than GPT-3.5. According to OpenAI’s technical report (posted on Cornell University’s arXiv), GPT-4 exhibited human-level performance on tests designed for humans (such as the SAT) and passed a simulated bar exam with a score around the top 10% of actual test takers. GPT-4 also performs basic math better than its predecessor, despite not being integrated with a calculator.

GPT-4 exhibits greater “steerability.” Rather than locking users into a single default personality, GPT-4 is better at following instructions that prescribe its conversational style, tone of voice, and perspective, typically supplied through the “system” message. Users can thus “steer” or change how GPT-4 behaves on demand by directing it to take a particular tone and style or to assume a persona, such as a marketing expert or a licensed therapist.
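In practice, this steering typically happens through the system message of a chat request. Below is a minimal sketch using OpenAI’s Python SDK; the model name, persona, and prompt wording are illustrative.

```python
# Hedged sketch: steering GPT-4's tone and persona via the system message.
# Assumes the OpenAI Python SDK (openai>=1.0); prompts are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": "You are a seasoned marketing expert. Answer in a concise, upbeat tone.",
        },
        {"role": "user", "content": "Draft a two-sentence pitch for a budgeting app."},
    ],
)

print(response.choices[0].message.content)
```

Swapping in a different system message, say “You are a licensed therapist who responds with empathy,” is generally enough to shift the tone of every subsequent reply.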


GPT-4 has a larger context window. In machine learning parlance, “context” refers to relationships in sequential data. To illustrate, given the statements “I have a sister named Amanda” and “She has red hair,” we know that the pronoun ‘she’ is referring to Amanda, and that if we ask GPT-4 if Amanda has red hair, the model’s response would be “yes.”

With that in mind, a “context window” is the range of tokens (i.e., the units an AI model reads and writes, roughly comparable to words for humans) that the AI can look back on and consider when generating text, images, or other types of responses to a user’s prompts. In visual terms, a context window is like a literal window on the side of a train.

To illustrate, imagine chatting with an AI chatbot online. Holding a chat conversation requires being able to track things discussed earlier in the thread and refer to them when needed. However, only the most recent tokens stay in focus within the window, while older tokens are edged out of view as the conversation chugs along. Having a larger context window therefore means being able to hold more of the conversation in view for much longer than LLMs with smaller context windows can. GPT-4’s context window is eight times larger than GPT-3.5’s, which means that during a chat session, GPT-3.5 would go off-topic or lose track of sequential instructions much sooner than GPT-4 would.

Moreover, the largest version of the GPT-4 model can handle inputs of up to 32,000 tokens (approximately 25,000 words), increasing its potential for processing much larger bodies of text. To illustrate, it can accept two long documents and identify the common theme that runs through them.
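One way to picture the window in code is as a token budget that older messages eventually fall out of. The sketch below uses the tiktoken library to count tokens and keeps only the most recent messages that fit; the budget figures are illustrative, and the count ignores the small per-message overhead the chat format adds.

```python
# Hedged sketch: a sliding context window modeled as a simple token budget.
# Uses the tiktoken library for counting; budget sizes are illustrative.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

def trim_to_window(messages, max_tokens):
    """Keep the most recent messages whose combined token count fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):              # walk from newest to oldest
        n = len(enc.encode(msg["content"]))
        if used + n > max_tokens:
            break                               # older messages fall out of the window
        kept.append(msg)
        used += n
    return list(reversed(kept))                 # restore chronological order

history = [
    {"role": "user", "content": "I have a sister named Amanda."},
    {"role": "assistant", "content": "Nice to hear! Tell me more about her."},
    {"role": "user", "content": "She has red hair. Does Amanda have red hair?"},
]

# With a tight budget, the oldest message is pushed out of the window first.
print(trim_to_window(history, max_tokens=25))
```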

GPT-4 can handle more languages. GPT-4 understands and writes in many languages beyond English, and it handles them better than its predecessor does. According to OpenAI’s research, GPT-4 outperformed three other LLMs, namely GPT-3.5, PaLM, and Chinchilla, on the Massive Multitask Language Understanding (MMLU) test in English. OpenAI also tested GPT-4’s capability in other languages by translating the MMLU test into 26 languages and having GPT-4 answer each translated version. In 24 of those languages, GPT-4’s scores still beat the other LLMs’ scores on the original English test.

GPT-4 can also process documents written in dense, specialized language, such as legal documents, contracts, tax codes, and programming code. In fact, GPT-4 can be used to debug programming code based on the error messages returned by compilers and code checkers.
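As a simple illustration, a debugging prompt can pair the failing snippet with the error message it produced. The sketch below uses OpenAI’s Python SDK; the model name, code, and error are made up for illustration.

```python
# Hedged sketch: asking GPT-4 to explain and fix code from an error message.
# Assumes the OpenAI Python SDK (openai>=1.0); the snippet and error are examples.
from openai import OpenAI

client = OpenAI()

buggy_code = "totals = {}\nprint(totals['sales'] + 1)"
error_message = "KeyError: 'sales'"

prompt = (
    "This Python snippet raises an error. Explain the cause and suggest a fix.\n\n"
    f"Code:\n{buggy_code}\n\nError:\n{error_message}"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```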

GPT-4 reportedly adheres to guardrails more tightly. Generative AI can be used as a tool for spreading disinformation or as an accessory to crime. With GPT-3.5, developers largely moderated usage and patched safety flaws as they were encountered; with GPT-4, they built safety measures into the training process itself.

OpenAI claimed that they’ve “decreased [GPT-4’s] tendency to respond to requests for disallowed content by 82% compared to GPT-3.5, and GPT-4 responds to sensitive requests (e.g., medical advice and self-harm) in accordance with our policies 29% more often.” This means that GPT-4 is less likely to relay instructions on how to make a bomb, for example.

While these are marked improvements, GPT-4 can still slip past its guardrails. For instance, Microsoft’s Bing chatbot, which runs on GPT-4, made headlines for the controversial responses it gave users.

 

GPT-4’s limitations

While this new multimodal model is impressive, it still has a lot of room for improvement. Here are the things businesses must be mindful of when using GPT-4.

Limited knowledge: Just like any pretrained model before it, GPT-4 is limited by its training data. It has no knowledge of events after September 2021, so it may fail to provide accurate output when a task requires more recent information. Users can work around this by supplying the missing information in the prompt so that GPT-4 can incorporate it into its response, and GPT-4’s larger context window leaves more room for such supplied material than GPT-3.5’s does.

AI hallucinations: While GPT-4 is indeed more accurate than its predecessor, it is not 100% free of falsehoods. It can still “hallucinate” (i.e., make up “facts”) and produce flawed logic, though it does so less frequently than GPT-3.5. And because GPT-4 still doesn’t cite its sources, users must verify its output, no matter how truthful that output seems.

Costly to implement: Smaller companies may find the likes of ChatGPT Plus too expensive and too powerful to be cost-effective at the scale they operate at. To illustrate, game developer Latitude used to run its online role-playing game, AI Dungeon, on an older version of GPT. In the game, users submit prompts for the AI to spin into tales of fantasy. As the game’s popularity grew, so did its AI bill, which climbed to almost US$200,000 a month to fulfill the millions of user queries the game received daily. To survive, Latitude had to switch to a less expensive language model and charge players extra for more advanced AI features.

GPT-4’s features and stellar performance in its showcases are headline-grabbing, but enterprises need to ensure that exploring and adopting solutions powered by generative AI (and GPT-4 in particular) is grounded in their own unique circumstances: their technological maturity, financial capacity, and organizational readiness to take on this innovation.

 
