Multimodal Generative AI: What Is It to Enterprises?

Written by Norbert Fijałek | May 6, 2024 9:59:23 AM

Generative AI might have appeared to be a gimmick in its nascent stages. However, it presented significant scientific and practical advancements right from the start. Its initial capacity to generate coherent and contextually appropriate text or imagery had profound implications for content creation, information synthesis, and automated decision-making. As the technology’s reliability improves, it’s playing a more significant role in the enterprise’s analytics toolset. Its upgraded version, multimodal generative AI, is poised to do even more for businesses. Experts from the Lingaro data science and AI practice, Norbert Fijałek, Krystian Jabłoński, and Taylor van Valkenburg, share their insights.

What is multimodal generative AI?

While we’re now more or less familiar with generative AI and what it can do, the term “multimodal generative AI” sounds jargony, so let’s break it down. In the context of communications, “mode” simply means the form in which a language is expressed, be it through text, sound, texture (think Braille), or imagery. To illustrate, internet phone calls can be in voice-only mode or video call mode, where both voice and video are used simultaneously. The first type of internet call uses a single mode, whereas the second type uses multiple modes.

“Modality,” a term rooted in the word “mode” and used in the context of human-computer interaction, is the classification of input/output channels between a human user and a computer. Such channels can be classified according to sensory nature (e.g., auditory vs. visual) or the type of object used to convey meaning (e.g., sound vs. image vs. text). With that in mind, let’s move on to generative AI.

Unimodal vs. multimodal generative AI

Generative AI systems began as unimodal. Their input has one modality, and their output has one modality, though the modality of the output doesn’t have to be the same as the modality of the input. To illustrate, a chatbot can receive text prompts and generate text responses, while an image generator, such as DALL·E 2, can receive text prompts and produce images based on those prompts.

Unimodal AI systems are limited to just one type of input and one type of output. This means that if you wanted output that has two or more modalities, you’d have to integrate two or more unimodal generative AI systems. While technically feasible, training models separately is like trying to learn a language through one sense at a time. For someone who is learning in this manner, a word might sound familiar, but they wouldn’t necessarily know how to spell it.
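To make the integration point concrete, here is a minimal sketch of what stitching two unimodal systems together looks like. The `Text` and `Image` types and both generator functions are hypothetical stubs invented for illustration; they stand in for real models and do not call any actual AI service.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Text:
    content: str


@dataclass
class Image:
    description: str  # placeholder standing in for real pixel data


def text_to_text(prompt: Text) -> Text:
    """Hypothetical stub for a unimodal chatbot: text in, text out."""
    return Text(f"Caption for: {prompt.content}")


def text_to_image(prompt: Text) -> Image:
    """Hypothetical stub for a unimodal image generator: text in, image out."""
    return Image(f"<image depicting '{prompt.content}'>")


def illustrated_answer(prompt: Text) -> Tuple[Text, Image]:
    # Each system is called separately and sees only the prompt.
    # Neither one knows what the other produced, so keeping the
    # caption and the picture consistent is left to the integrator.
    return text_to_text(prompt), text_to_image(prompt)


caption, picture = illustrated_answer(Text("green tennis ball"))
```

The two outputs share nothing except the original prompt, which is the gap that joint training on several modalities is meant to close.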

Similar to how humans learn by using multiple senses, multimodal generative AI systems can be trained on multiple modalities at once.

Figure 1. A diagram illustrating the difference between unimodal and multimodal generative AI

To illustrate, a vision language model (VLM) is a multimodal model that is simultaneously trained on both images and text. Technically speaking, a VLM has three sub-models:

A model for images
A model for text
A model for finding out how images and text relate to one another

Imagine having an embedding, or map of meanings, for images, where images that look alike and have similar meanings are placed close together. Let’s limit the image embedding to equipment used in sports, such as shoes, jerseys, balls, and rackets. Then, imagine another map of meanings, but for sports equipment-related text. Lastly, let’s have a shared map of meanings where images and text with similar meanings are grouped close together. To illustrate, an image of a green tennis ball may have the phrase “green tennis ball” beside it in the shared map of meanings.

Figure 2. An example of how a vision language model is tested

What you get is a VLM that can perform both text-to-image and image-to-text searches. Note that with text-to-image search, the VLM does not have to check image metadata such as image names, tags, and descriptions. It only has to look at the images themselves. This is like someone who goes to a sports equipment shop, scans the products on display, and finds the green tennis balls they were looking for.
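The shared map of meanings can be sketched in a few lines of Python. This is a toy illustration, not how any particular VLM is implemented: the file names, phrases, and three-dimensional vectors below are made up by hand, whereas a real model would produce much larger vectors with learned image and text encoders. Search in either direction then comes down to finding the nearest neighbor by cosine similarity.

```python
import math

# Hypothetical hand-made vectors standing in for a shared embedding space.
IMAGE_EMBEDDINGS = {
    "img_green_tennis_ball.jpg": [0.90, 0.10, 0.00],
    "img_red_running_shoe.jpg": [0.10, 0.90, 0.10],
    "img_blue_jersey.jpg": [0.00, 0.20, 0.90],
}
TEXT_EMBEDDINGS = {
    "green tennis ball": [0.85, 0.15, 0.05],
    "red running shoe": [0.15, 0.85, 0.10],
    "blue jersey": [0.05, 0.10, 0.95],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vector, candidates):
    """Return the key in candidates whose vector is most similar to the query."""
    return max(
        candidates,
        key=lambda name: cosine_similarity(query_vector, candidates[name]),
    )


def text_to_image(phrase):
    # Compares the phrase's vector with image vectors only;
    # no image names, tags, or descriptions are consulted.
    return nearest(TEXT_EMBEDDINGS[phrase], IMAGE_EMBEDDINGS)


def image_to_text(image_name):
    return nearest(IMAGE_EMBEDDINGS[image_name], TEXT_EMBEDDINGS)


best_image = text_to_image("green tennis ball")
best_phrase = image_to_text("img_blue_jersey.jpg")
```

With these toy vectors, the phrase “green tennis ball” lands closest to the tennis ball image, and the jersey image lands closest to the phrase “blue jersey.” The file names serve only as dictionary keys here; the matching itself uses the vectors alone.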

Developments in deep learning, neural network architectures, and natural language understanding have enabled AI to understand the world more closely to the way we humans do. It can now generate multimodal objects like we can, too.

Moreover, since the breadth of what AI can understand and process goes far beyond what an individual mind can handle, AI can generate objects of great size much faster. Think visualized digital twins of a production floor used in what-if scenario analyses. Yearly performance reports with walls of text and fancy graphs that take more than a month to make could now be done in less than a day through multimodal generative AI.

How can enterprises benefit from multimodal generative AI?

The ability to quickly generate digital objects, such as documents, webpages, videos, CGI animations, and digital twins, can save enterprises substantial time and effort. Here are a few example use cases:

Bottle packaging design generator

A multimodal generative AI can give an FMCG company the ability to quickly come up with new packaging designs for its bottled products, including graphical and textual elements. Greater design speed means being able to launch new variants and limited-edition products sooner. It also enables more rapid market testing to see which designs and design elements make a product sell more. Finally, it lets brands pull off the “new look, same great formula” strategy more frequently. That is, they’ll be able to make new stock appear to be fresher versions of past stock more often than before.

Hyperpersonalized ad content generator at scale

Since generative AI can digest mountains of data quickly, it can be fed consumer data — specifically consumer behaviors and preferences — to create highly personalized ad campaigns. These campaigns can feature AI-generated image and text ads as well as videos that are complete with audio and subtitles. With such an AI, a global CPG enterprise would be able to market its portfolio of brands with hypersegmentation.

Authors’ note:

Digital twins

Digital twins are virtual replicas of physical systems, such as a production floor, or virtual abstract representations of abstract systems, such as supply chains. Digital twins can feature AI-generated 3D models that are updated in real time. Managers and analysts can also tweak variables and have the AI produce visual and textual updates on the digital twin.

Generate rich, sophisticated output at a grand scale and at great speed

Getting generative AI models to more closely imitate the way we humans understand the world is no small achievement. This level of understanding makes an AI much easier to use: when we say what we want it to make, its output is much closer to what we expect.

Given sufficient data and imagination, multimodal generative AI tools are able to produce richer, more sophisticated outputs than past AI systems. If they can understand a business at a much grander scale, in finer detail, and at faster speed, then these four hypothetical use cases of multimodal generative AI will soon become reality:

Knowledge database bots: Managers would be able to talk with knowledge database AI bots, ask for an infographic of sales projections per region for the coming month, and get a map with sales projections written over each region.
Autogenerated business documents: Accounting and financial statements, performance and progress reports, and other cyclical business documents would be automatically generated — complete with data visualizations.
Inexpensive instructional videos: Manufacturers would be able to make instructional videos for every SKU that needs them.
Massive analytics reports: AI bots for customer service call centers would be able to understand large volumes of voice recordings and generate analytics reports from them.

When it comes to generating business value from generative AI, enterprises are limited mainly by the data they have and their imagination, and they remain bound by their ethics. Given how AI is proving itself valuable day by day, adopting and growing with AI might just be the best tech investment companies can make.
