Multimodal artificial intelligence is gaining popularity as technologies like mobile phones, vehicles, and wearables use different modalities to create seamless and robust user experiences.
The term “multimodal” refers to the different ways in which humans communicate with systems. Depending on the user's preference or abilities, these could be interaction modalities like touch, speech, vision, gestures, haptics, etc.
These modalities can be expressed or perceived in different ways. For example:
Language Modality: expressed as written text or spoken words.
Vision Modality: perceived as still images or video.
Responses can also cross from one modality to another, for example language-to-vision or vision-to-speech. Multimodal AI is significant because it accommodates diverse preferences and communication abilities, making technologies like smart devices more inclusive and adaptable for everyday tasks, from voice commands and touch interactions to visual recognition.
Conventional supervised or unsupervised learning algorithms are typically applied to a single data type, such as images, text, or speech, which keeps training straightforward. But in reality, data comes in different modalities (e.g., vision combined with sound and text captions in movies), each conveying unique information that enhances overall understanding.
A classic example illustrating the need for multimodal understanding is the McGurk effect. This phenomenon shows that our perception of speech sounds can be influenced by visual cues: hearing the syllable "ba" while watching lips mouth "ga" often causes listeners to perceive "da". It underlines the complex interplay between different senses and modalities.
The primary motivation behind multimodal AI is to create models capable of capturing the nuances and correlations between the different data types, thereby representing information more comprehensively.
Examples of multimodal systems include GPT-4V, Google’s Gemini, and Microsoft’s open-source LLaVA, which combine text and image modalities, demonstrating the power of integrated data processing. Despite the potential, multimodal models face challenges such as accurate representation, alignment, and reasoning across modalities, which are ongoing areas of research and development in the field.
There are three main types of modalities in multimodal machine learning: homogeneous, heterogeneous, and interconnected modalities.
Homogeneous modalities involve a single data type with a similar structure (e.g., text-only or image-only).
Imagine a facial recognition system designed to identify individuals. This system is fed with two image data sources, for example, carefully captured enrollment photos and frames from a live security camera.
In this scenario, although both data sources are essentially images (hence homogeneous in terms of modality), they differ in their origin, quality, and context.
Heterogeneous modalities involve multiple data types (e.g., text, image, and audio). The challenge in these systems is integrating and analyzing diverse data types that may vary significantly in structure and format, such as text-image or speech-video combinations, among many others. Models that handle heterogeneous modalities are more complex because they must learn the relationships between the different data types.
Fig. 1. The dimensions of heterogeneity. Source: Multimodal Machine Learning | CVPR Tutorial
Imagine a virtual classroom environment where AI gauges student engagement and emotional responses during a lesson. This system uses two different modalities: audio (the students' spoken responses) and video (their facial expressions).
The AI system integrates these two data streams to better understand the students' emotional states. For example, a student might verbally express understanding (e.g., saying "I get it"), but their facial expression could show confusion or doubt. By analyzing speech and facial expressions, the AI can more accurately assess the student's true emotional response, leading to insights such as whether the student might need additional help or clarification on the topic.
In this scenario, the multimodal AI system uses heterogeneous modalities (audio and video) to capture a fuller picture of the student's engagement and emotional states, which can be crucial for adaptive learning and personalized education.
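As a concrete illustration of this kind of heterogeneous fusion, here is a minimal late-fusion sketch in Python. It assumes two hypothetical upstream models have already produced emotion probabilities for the speech and for the facial expressions; the label set, weights, and numbers are illustrative only.

```python
import numpy as np

# Shared emotion label set used by both (hypothetical) modality-specific models.
EMOTIONS = ["understanding", "confusion", "frustration"]

def late_fusion(speech_probs: np.ndarray, face_probs: np.ndarray,
                speech_weight: float = 0.4) -> dict:
    """Combine two modality-specific predictions with a weighted average."""
    fused = speech_weight * speech_probs + (1 - speech_weight) * face_probs
    fused /= fused.sum()  # keep it a valid probability distribution
    return {label: round(float(p), 3) for label, p in zip(EMOTIONS, fused)}

# The student says "I get it" (speech leans toward understanding),
# but the facial-expression model leans toward confusion.
speech_probs = np.array([0.70, 0.20, 0.10])
face_probs = np.array([0.25, 0.65, 0.10])

print(late_fusion(speech_probs, face_probs))
# approximately {'understanding': 0.43, 'confusion': 0.47, 'frustration': 0.1}
```

A weighted average is the simplest possible fusion strategy; real systems typically learn the fusion jointly with the encoders rather than fixing the weights by hand.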
Interconnected modalities are inherently correlated (or linked): information from one enhances the understanding of another. This interconnectedness allows for a more comprehensive understanding of the overall context.
Imagine a car navigation system that uses voice commands (audio modality) and visual map displays (visual modality) to assist drivers. The driver can speak a request hands-free while the system keeps its spoken directions synchronized with the route shown on the map.
In this example, the interconnectedness of audio and visual modalities creates a more user-friendly and efficient navigation experience. The voice commands allow for hands-free interaction, enhancing driving safety, and the visual maps provide clear and precise navigational information. The system effectively combines these modalities to enhance overall functionality and the user experience.
Cross-modal interactions cover a broader spectrum of how different modalities can relate to and interact within a multimodal system. Broadly, two kinds of interaction occur: interactions in which the modalities carry largely redundant, overlapping information, and interactions in which each modality contributes unique, complementary information.
These interactions also play out across several dimensions, from how the modalities are connected to how the response changes when they are combined.
Overall, these interactions and dimensions are pivotal in determining how various modalities within a multimodal system collaborate, influencing the richness of representation and the efficacy of the combined output.
Multimodal machine learning integrates diverse data sources and models the relationships between modalities. Even when those sources vary in quality and structure, the resulting systems can make sense of the world and offer coherent, contextual information.
Fig. 2. A high-level view of the training process for multimodal models
A typical multimodal system includes modality-specific encoders (for example, a vision encoder and a language encoder), a fusion module that aligns and combines their representations, and a decoder or task head that produces the final output.
Where possible, it's beneficial to use pre-trained and reusable components for efficiency.
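To make those pieces concrete, here is a minimal PyTorch sketch of the encoder-fusion-head layout. The linear "encoders", dimensions, and class count are placeholders; in practice the encoders would be pre-trained models (e.g., a vision transformer and a text transformer) reused as suggested above.

```python
import torch
import torch.nn as nn

class TinyMultimodalClassifier(nn.Module):
    """Illustrative skeleton: per-modality encoders -> fusion -> task head."""

    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, num_classes=10):
        super().__init__()
        # Stand-ins for pre-trained encoders (e.g., a ViT and a text model).
        self.image_encoder = nn.Linear(img_dim, hidden)
        self.text_encoder = nn.Linear(txt_dim, hidden)
        # Fusion: concatenate the two embeddings and mix them.
        self.fusion = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        # Task head producing the final prediction.
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, image_feats, text_feats):
        img = self.image_encoder(image_feats)
        txt = self.text_encoder(text_feats)
        fused = self.fusion(torch.cat([img, txt], dim=-1))
        return self.head(fused)

model = TinyMultimodalClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 10])
```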
To continually improve the model, you could apply post-training techniques such as Reinforcement Learning from Human Feedback (RLHF) or Retrieval-Augmented Generation (RAG).
In multimodal training, RLHF helps align the model's outputs with human preferences, rewarding responses that are helpful and grounded in the input modalities while penalizing hallucinated or unsafe ones.
LLaVA is an excellent example of a large multimodal model (LMM) that utilizes RLHF.
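One core ingredient of RLHF is a reward model trained on human preference pairs. The sketch below shows only that piece, a pairwise (Bradley-Terry style) preference loss, and assumes the chosen/rejected scores come from some hypothetical reward model; it is not LLaVA's actual training code.

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards: torch.Tensor,
                    rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Push the reward model to score the human-preferred response
    higher than the rejected one (pairwise ranking loss)."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example: rewards produced by a (hypothetical) reward model for two
# candidate multimodal responses to the same image + prompt.
chosen = torch.tensor([1.3, 0.4, 2.1])
rejected = torch.tensor([0.9, 0.7, 0.5])
print(preference_loss(chosen, rejected))  # shrinks as the preferred margin grows
```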
According to the survey paper by Paul Liang et al., six core challenges are important to consider when training multimodal algorithms: representation, alignment, reasoning, generation, transference, and quantification.
Representation: in every multimodal interaction, the goal is to create an output that accurately represents the interacting modalities. Challenges arise when the model does not learn the representations of each modality well enough to adequately reflect cross-modal interactions. Depending on your task, you could think of fusing the modalities into a single joint representation, coordinating separate but related representations, or splitting (fission) the information into modality-specific and shared parts.
For example, in a multimodal system analyzing text and images, the representation challenge involves creating a unified structure that accurately combines linguistic patterns and visual features for comprehensive understanding. Mechanisms such as attention and transfer learning, among others, are commonly used to learn these representations.
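One widely used way to tackle the representation challenge is to project each modality into a shared embedding space and train with a contrastive objective so that matching image-text pairs end up close together (the approach popularized by CLIP). Below is a minimal sketch of such a loss; the embeddings are random placeholders standing in for real encoder outputs.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric contrastive loss over a batch of matched pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # pairwise similarities
    targets = torch.arange(len(logits))             # i-th image matches i-th text
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Placeholder embeddings from (hypothetical) image and text encoders.
print(contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)))
```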
Alignment involves ensuring that information across different modalities harmonizes, promoting accurate associations. This means recognizing connections between modalities and, based on the underlying data structure, integrating them into a coherent combination.
A significant challenge is temporal (time) alignment in dynamic modalities, essential for synchronizing data streams like video and audio.
Consider a system aligning spoken words with corresponding textual transcripts. Accurate alignment is important for correctly associating spoken phrases with their corresponding text.
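As a small illustration of temporal alignment, the sketch below maps word-level timestamps (as a speech recognizer might produce) onto the timestamps of video frames. The Word structure, timings, and frame rate are made up for the example.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds

def align_words_to_frames(words: list[Word], frame_times: list[float]):
    """For each spoken word, collect the indices of the video frames whose
    timestamps fall inside the word's start/end interval."""
    return [(w.text, [i for i, t in enumerate(frame_times) if w.start <= t <= w.end])
            for w in words]

# Hypothetical ASR word timings and video frames sampled every 0.25 s.
words = [Word("hello", 0.0, 0.4), Word("world", 0.5, 0.9)]
frame_times = [0.0, 0.25, 0.5, 0.75, 1.0]
print(align_words_to_frames(words, frame_times))
# [('hello', [0, 1]), ('world', [2, 3])]
```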
Reasoning involves developing robust models that effectively utilize information from multiple modalities to produce an output, taking the problem structure and alignment into account.
The challenge lies in creating models that leverage multiple modalities through multi-step inferential reasoning, especially in scenarios with conflicting or ambiguous inputs from the modalities.
For example, autonomous vehicles integrate information from sensors (visual and LiDAR data) and textual maps. They reason about the environment, aligning visual input with map data to make informed decisions for safe navigation. The vehicle infers, for instance, that visual cues indicating an obstacle correspond to mapped structures, and adjusts its path to manoeuvre around it.
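A toy version of this kind of evidence combination is sketched below: two per-sensor obstacle probabilities are fused into a single posterior under a (strong) conditional-independence assumption. Real perception stacks are far more sophisticated; this only illustrates the step of weighing multiple modalities before acting.

```python
def fuse_obstacle_evidence(p_camera: float, p_lidar: float,
                           prior: float = 0.1) -> float:
    """Fuse two per-sensor obstacle probabilities into one posterior,
    assuming the sensors are conditionally independent given the scene."""
    def odds(p: float) -> float:
        return p / (1.0 - p)

    posterior_odds = odds(p_camera) * odds(p_lidar) / odds(prior)
    return posterior_odds / (1.0 + posterior_odds)

# Camera is fairly confident, LiDAR moderately so; the fused belief is
# stronger than either sensor alone and can be cross-checked against the
# map before the planner chooses a manoeuvre.
print(round(fuse_obstacle_evidence(p_camera=0.8, p_lidar=0.7), 3))  # 0.988
```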
Generation refers to synthesising coherent and contextually relevant output across various modalities, ensuring the meaningful creation of information.
Here, depending on the task, the sub-challenges include summarizing information drawn from several modalities, translating content from one modality into another, and creating entirely new multimodal content.
For a language translation system handling text and images, the generation challenge involves creating translated text that accurately corresponds to the visual context captured in the images.
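A tiny, contrived sketch of why the visual context matters for generation: the correct translation of an ambiguous word can depend on what the image shows. The image label is assumed to come from some upstream vision model, and the lookup table stands in for a real vision-language generator.

```python
# Toy disambiguation: the right translation of an ambiguous English word
# depends on the visual context (labels assumed to come from a vision model).
SENSE_TRANSLATIONS = {
    ("bat", "animal"): "murciélago",      # Spanish: the flying mammal
    ("bat", "sports equipment"): "bate",  # Spanish: the baseball bat
}

def translate_with_visual_context(word: str, image_label: str) -> str:
    """Pick the translation whose sense matches the visual context."""
    return SENSE_TRANSLATIONS.get((word, image_label), word)

# The sentence "He swung the bat" paired with a photo of a baseball game:
print(translate_with_visual_context("bat", "sports equipment"))  # bate
```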
Transference involves transferring knowledge and models across diverse modalities in multimodal machine learning, ensuring adaptability and consistency.
The challenge is to devise mechanisms that facilitate smooth knowledge transfer while maintaining performance consistency when applying models to domains where data distributions vary significantly.
In a speech recognition system, transference challenges may arise when adapting a model trained on one language to accurately recognize and interpret speech in another.
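A common transference recipe is to freeze the pre-trained acoustic encoder and retrain only a new output layer for the target language. The sketch below applies that idea to a toy PyTorch model; the architecture and sizes are placeholders, not any particular speech system.

```python
import torch
import torch.nn as nn

class ToySpeechModel(nn.Module):
    """Stand-in for a pre-trained speech recognizer: encoder + output layer."""
    def __init__(self, feat_dim=80, hidden=256, vocab_size=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.classifier = nn.Linear(hidden, vocab_size)

    def forward(self, x):
        return self.classifier(self.encoder(x))

def adapt_to_new_language(model: ToySpeechModel, new_vocab_size: int):
    # Freeze the encoder so the learned acoustic representations transfer,
    # then swap in a fresh output layer sized for the new language's tokens.
    for p in model.encoder.parameters():
        p.requires_grad = False
    model.classifier = nn.Linear(model.classifier.in_features, new_vocab_size)
    return model

model = adapt_to_new_language(ToySpeechModel(), new_vocab_size=48)
print(model(torch.randn(2, 80)).shape)  # torch.Size([2, 48])
```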
Quantitatively assessing information integration quality, relevance, and effectiveness across multiple modalities is important. The challenge is to define metrics and criteria for objective evaluation, providing a basis for improving the quality of multimodal information processing.
In a sentiment analysis system analyzing text and audio, quantification challenges include developing metrics that accurately measure the alignment between textual sentiment and corresponding emotional cues in the spoken words.
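As a starting point for such metrics, the sketch below scores agreement between per-utterance text-sentiment and audio-valence values using Pearson correlation and a simple sign-agreement rate. The scores are hypothetical; a real evaluation would rely on validated emotion annotations.

```python
import numpy as np

def modality_agreement(text_sentiment: np.ndarray,
                       audio_valence: np.ndarray) -> dict:
    """Two simple alignment metrics between per-utterance scores in [-1, 1]:
    Pearson correlation and the fraction of utterances where both
    modalities agree on the sign (positive vs. negative)."""
    corr = float(np.corrcoef(text_sentiment, audio_valence)[0, 1])
    sign_agreement = float(np.mean(np.sign(text_sentiment) == np.sign(audio_valence)))
    return {"pearson_r": round(corr, 3), "sign_agreement": round(sign_agreement, 3)}

# Hypothetical scores for five utterances from the text and audio models.
text_scores = np.array([0.8, -0.4, 0.1, -0.7, 0.6])
audio_scores = np.array([0.6, -0.1, -0.2, -0.5, 0.7])
print(modality_agreement(text_scores, audio_scores))
```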
Understanding the relationships between modalities is an exciting and significant area of AI research. It will help researchers and industry create better, more inclusive products, enabling more intuitive and complete interactions between humans and machines. The better we understand these modalities and their relationships, the closer systems come to human-like multimodal communication.
The motivations are wide-ranging: supporting people with disabilities by accommodating accessibility needs, improving explainability by surfacing connections hidden within single channels, and further lowering barriers to human-computer cooperation through flexible, adaptable communication.