A modality is a kind of data, such as text, image, audio, video, code, or sensor data.
Multimodal AI works across more than one modality. A multimodal system might caption an image, answer questions about a chart, search photos with text, or combine speech and documents in one assistant.
Text-image pairs
Many multimodal systems learn from paired data.
image: photo of a yellow bus
text: "a yellow bus parked beside a school"
The pair teaches the system that a visual pattern and a language description can refer to the same thing.
Large collections of image-caption pairs can train models to connect visual and textual representations.
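As a minimal sketch of what such a collection might look like in code, the example below stores each image with its caption and splits the pairs into two aligned lists. The `ImageTextPair` class, the `to_aligned_lists` helper, and the file paths are hypothetical names chosen for illustration; real datasets typically store far more metadata.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ImageTextPair:
    """One training example: an image and the caption that describes it."""
    image_path: str  # hypothetical path, used only for illustration
    caption: str


def to_aligned_lists(pairs: List[ImageTextPair]) -> Tuple[List[str], List[str]]:
    """Split pairs into two lists where index i in one matches index i in the other."""
    image_paths = [p.image_path for p in pairs]
    captions = [p.caption for p in pairs]
    return image_paths, captions


pairs = [
    ImageTextPair("images/yellow_bus.jpg", "a yellow bus parked beside a school"),
    ImageTextPair("images/red_bicycle.jpg", "a red bicycle leaning against a wall"),
]

image_paths, captions = to_aligned_lists(pairs)
```

The only supervision here is the alignment itself: position i in `image_paths` and position i in `captions` describe the same thing.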
Contrastive learning
Contrastive learning trains a model to bring matching pairs closer together and push non-matching pairs farther apart.
For text and images:
- the embedding for a photo should be close to the embedding of its real caption
- the same photo's embedding should be farther from the embeddings of unrelated captions
This creates a shared embedding space where text and images can be compared.
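One common way to express this objective is a symmetric cross-entropy over a similarity matrix, often called an InfoNCE-style loss. The sketch below uses NumPy and random vectors as stand-ins for real encoder outputs; the function names, the `temperature` value, and the toy data are assumptions made for illustration, not a specific model's training code.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def matching_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric loss: low when row i of image_emb is closest to row i of text_emb."""
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature  # shape (n, n)
    image_to_text = matching_cross_entropy(logits)
    text_to_image = matching_cross_entropy(logits.T)
    return 0.5 * (image_to_text + text_to_image)


# Toy stand-ins for encoder outputs (not real embeddings).
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
aligned_images = text_emb + 0.05 * rng.normal(size=(4, 8))  # each image near its caption
shuffled_images = aligned_images[[1, 2, 3, 0]]              # every image paired with a wrong caption

loss_aligned = contrastive_loss(aligned_images, text_emb)
loss_shuffled = contrastive_loss(shuffled_images, text_emb)
```

The loss is lower when each image sits near its own caption than when the pairing is scrambled. During training, gradients of this loss would move the encoders toward the low-loss arrangement.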
Shared embedding spaces
A shared embedding space maps inputs from different modalities into the same vector space, so their embeddings can be compared directly.
That makes cross-modal retrieval possible:
- text query -> matching images
- image -> related captions
- audio clip -> related text
- document screenshot -> extracted meaning
The system is not making text and images identical. It is learning representations that can be compared.
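Once both modalities map into the same space, retrieval can be reduced to a nearest-neighbor search. The sketch below ranks images against a text query by cosine similarity. The three-dimensional vectors are hand-written placeholders rather than outputs of a real encoder, and `retrieve` is a hypothetical helper name.

```python
import numpy as np


def cosine_similarity(query, items):
    """Cosine similarity between one query vector and each row of items."""
    query = query / np.linalg.norm(query)
    items = items / np.linalg.norm(items, axis=1, keepdims=True)
    return items @ query


def retrieve(query_vec, item_vecs, item_names, top_k=1):
    """Return the top_k (name, score) pairs, highest similarity first."""
    scores = cosine_similarity(query_vec, item_vecs)
    order = np.argsort(-scores)[:top_k]
    return [(item_names[i], float(scores[i])) for i in order]


# Placeholder image embeddings (made up for illustration).
image_names = ["bus.jpg", "cat.jpg", "beach.jpg"]
image_vecs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
])

# Placeholder embedding for a text query such as "a yellow bus".
text_query_vec = np.array([0.8, 0.2, 0.1])

results = retrieve(text_query_vec, image_vecs, image_names, top_k=2)
```

The same function works in the other direction: pass an image embedding as the query and caption embeddings as the items.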
Image captioning
Image captioning generates text that describes an image.
This usually requires a vision component to represent the image and a language component to produce text.
Captioning is useful, but it can be incomplete or wrong. A caption model may describe obvious objects while missing context, uncertainty, or small details.
Visual question answering
Visual question answering asks a model to answer questions about an image.
image: chart
question: Which month had the highest revenue?
answer: March
This requires more than object labels. The system must connect the question with relevant visual evidence and produce an answer.
Multimodal assistants
Multimodal assistants combine several capabilities:
- reading text
- inspecting images
- following spoken instructions
- using tools
- retrieving documents
- generating responses
These systems are useful because real tasks rarely live in one clean data type. A support request might include text, screenshots, logs, and account data.
Why connecting modalities is hard
Different modalities have different structure.
Text is sequential. Images are spatial. Audio changes over time. Video combines space and time. Documents mix layout, text, and visual hierarchy.
Connecting them requires good representations and careful evaluation. A system may understand the broad topic of an image while failing on exact counts, small text, spatial relationships, or domain-specific meaning.
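One way to make such failures visible is to report accuracy per skill instead of a single overall score. The sketch below assumes each evaluation record is a `(category, predicted, expected)` tuple. The category names and the records are invented purely to show the mechanics and are not results from any real system.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Compute the fraction of exact-match answers within each category."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, predicted, expected in records:
        total[category] += 1
        if predicted.strip().lower() == expected.strip().lower():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Invented records for illustration only: (category, predicted, expected).
records = [
    ("topic", "street scene", "street scene"),
    ("counting", "3", "4"),
    ("counting", "2", "2"),
    ("small_text", "EXIT", "exit"),
    ("spatial", "left", "right"),
    ("spatial", "above", "above"),
]

per_category = accuracy_by_category(records)
overall = sum(p.strip().lower() == e.strip().lower() for _, p, e in records) / len(records)
```

A breakdown like this can show that a reasonable overall score hides weak spots in counting or spatial questions.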
Quick Check
What does contrastive learning try to do with matching text-image pairs?
Why that answer is correct
Contrastive learning helps build shared embedding spaces by pulling matching items closer together and pushing non-matching items farther apart.
What to carry forward
- modalities are kinds of data such as text, image, audio, and video
- multimodal AI connects more than one modality
- paired examples help models learn relationships across modalities
- contrastive learning can create shared embedding spaces
- captioning and visual question answering combine vision and language
- multimodal systems are powerful but still need careful evaluation
The next lesson shows how models become parts of larger AI systems.