learn.colinkim.dev

Multimodal AI

Understand how AI systems connect text, images, audio, and other data through paired examples and shared representations.

A modality is a kind of data, such as text, image, audio, video, code, or sensor data.

Multimodal AI works across more than one modality. A multimodal system might caption an image, answer questions about a chart, search photos with text, or combine speech and documents in one assistant.

Text-image pairs

Many multimodal systems learn from paired data.

image: photo of a yellow bus
text: "a yellow bus parked beside a school"

The pair teaches the system that a visual pattern and a language description can refer to the same thing.

Large collections of image-caption pairs can train models to connect visual and textual representations.
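A paired dataset can be pictured as a list of records, each holding an image reference and its caption. The sketch below is only an illustration: the `ImageTextPair` class, the file names, and the second caption are hypothetical and do not come from any particular library or dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageTextPair:
    """One training example: an image and the caption that describes it."""
    image_path: str  # hypothetical file name, used only for illustration
    caption: str


pairs = [
    ImageTextPair("yellow_bus.jpg", "a yellow bus parked beside a school"),
    ImageTextPair("red_bicycle.jpg", "a red bicycle leaning against a wall"),
]

# Each record keeps the image and its description together, so a training
# loop can always present the model with a matching (image, text) example.
for pair in pairs:
    print(f"{pair.image_path} <-> {pair.caption}")
```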

Contrastive learning

Contrastive learning trains a model so that the embeddings of matching pairs move closer together and the embeddings of non-matching pairs move farther apart.

For text and images:

  • the embedding for a photo should be close to the embedding of its real caption
  • the same photo embedding should be farther from the embeddings of unrelated captions

This creates a shared embedding space where text and images can be compared.
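The lesson does not name a specific loss function. One common formulation is a symmetric softmax cross-entropy over the pairwise similarities in a batch, sketched below in plain Python. The function names, the `temperature` value, and the toy 2-D embeddings are assumptions chosen for illustration; a real system would compute embeddings with trained encoders and use a tensor library.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_row(logits, target_index):
    """Negative log-probability of the target under a softmax over logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]


def contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric contrastive loss; item i in each list forms a matching pair."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # sim[i][j]: temperature-scaled cosine similarity of image i and caption j
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Each image should pick out its own caption among all captions ...
    image_to_text = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # ... and each caption should pick out its own image among all images.
    text_to_image = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


# Toy 2-D embeddings (made-up numbers, not outputs of a real model).
image_embs = [[1.0, 0.1], [0.1, 1.0]]
aligned_text = [[0.9, 0.2], [0.2, 0.9]]     # captions near their own images
misaligned_text = [[0.2, 0.9], [0.9, 0.2]]  # captions near the wrong images

print(contrastive_loss(image_embs, aligned_text))
print(contrastive_loss(image_embs, misaligned_text))
```

The loss is small when each caption sits near its own image and large when captions sit near the wrong images, which is the pressure that pulls matching pairs together during training.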

Shared embedding spaces

A shared embedding space maps inputs from different modalities into the same vector space, so their embeddings can be compared directly, for example with cosine similarity.

That makes cross-modal retrieval possible:

  • text query -> matching images
  • image -> related captions
  • audio clip -> related text
  • document screenshot -> related text passages

The system is not making text and images identical. It is learning representations that can be compared.
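Text-to-image retrieval in a shared space can be sketched as a nearest-neighbor search by cosine similarity. In the example below, the embeddings are made-up 3-D vectors standing in for encoder outputs, and `retrieve`, `image_index`, and the file names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_embedding, index, top_k=2):
    """Return the top_k item names ranked by similarity to the query."""
    scored = [
        (cosine_similarity(query_embedding, emb), name)
        for name, emb in index.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]


# Made-up image embeddings; a real system would get these from an image encoder.
image_index = {
    "yellow_bus.jpg": [0.9, 0.1, 0.0],
    "red_bicycle.jpg": [0.1, 0.9, 0.1],
    "school_building.jpg": [0.6, 0.0, 0.7],
}

# Made-up embedding for the text query "a yellow bus"; a real system would
# get this from a text encoder trained into the same space.
text_query = [1.0, 0.0, 0.1]

print(retrieve(text_query, image_index))
# → ['yellow_bus.jpg', 'school_building.jpg']
```

The same function works in the other direction: an image embedding can serve as the query against an index of caption embeddings, because both kinds of vector live in one space.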

Image captioning

Image captioning generates text that describes an image.

This usually requires a vision component to represent the image and a language component to produce text.

Captioning is useful, but it can be incomplete or wrong. A caption model may describe obvious objects while missing context, uncertainty, or small details.

Visual question answering

Visual question answering asks a model to answer questions about an image.

image: chart
question: Which month had the highest revenue?
answer: March

This requires more than object labels. The system must connect the question with relevant visual evidence and produce an answer.

Multimodal assistants

Multimodal assistants combine several capabilities:

  • reading text
  • inspecting images
  • following spoken instructions
  • using tools
  • retrieving documents
  • generating responses

These systems are useful because real tasks rarely live in one clean data type. A support request might include text, screenshots, logs, and account data.
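One way to picture a mixed request is as a list of parts, each tagged with its modality and routed to a matching handler. This is a simplified sketch and not how any specific assistant is built: `Part`, `HANDLERS`, `plan_steps`, and the handler functions are hypothetical, and the handlers only describe what a real component would do.

```python
from dataclasses import dataclass


@dataclass
class Part:
    modality: str  # e.g. "text", "image", "log"
    content: str   # raw text, or a file reference for non-text data


def describe_text(part):
    return f"read text: {part.content}"


def describe_image(part):
    return f"inspect image: {part.content}"


def describe_log(part):
    return f"scan log: {part.content}"


# Hypothetical routing table: one handler per supported modality.
HANDLERS = {"text": describe_text, "image": describe_image, "log": describe_log}


def plan_steps(parts):
    """Route each part of a request to the handler for its modality."""
    steps = []
    for part in parts:
        handler = HANDLERS.get(part.modality)
        if handler is None:
            steps.append(f"unsupported modality: {part.modality}")
        else:
            steps.append(handler(part))
    return steps


# A support request that mixes a message, a screenshot, and a log file.
request = [
    Part("text", "The export button does nothing"),
    Part("image", "screenshot.png"),
    Part("log", "app.log"),
]

for step in plan_steps(request):
    print(step)
```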

Why connecting modalities is hard

Different modalities have different structure.

Text is sequential. Images are spatial. Audio changes over time. Video combines space and time. Documents mix layout, text, and visual hierarchy.

Connecting them requires good representations and careful evaluation. A system may understand the broad topic of an image while failing on exact counts, small text, spatial relationships, or domain-specific meaning.
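Careful evaluation often means reporting accuracy per failure category instead of one overall number. The sketch below assumes a simple record format with `category`, `expected`, and `predicted` fields; the format, the category names, and every record are invented for illustration and are not measurements from a real model.

```python
from collections import defaultdict


def accuracy_by_category(results):
    """Return {category: fraction correct} for a list of evaluation records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["category"]] += 1
        # Compare answers case-insensitively, ignoring surrounding whitespace.
        if r["predicted"].strip().lower() == r["expected"].strip().lower():
            correct[r["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Invented records, not real model outputs.
results = [
    {"category": "topic", "expected": "bus", "predicted": "Bus"},
    {"category": "topic", "expected": "chart", "predicted": "chart"},
    {"category": "counting", "expected": "7", "predicted": "6"},
    {"category": "counting", "expected": "3", "predicted": "3"},
    {"category": "small text", "expected": "exit 12", "predicted": "exit 17"},
]

for category, accuracy in accuracy_by_category(results).items():
    print(f"{category}: {accuracy:.2f}")
```

Splitting the score this way makes it visible when a system handles broad topics well but fails on counts or small text, which a single averaged accuracy would hide.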

Quick Check

What does contrastive learning try to do with matching text-image pairs?

What to carry forward

  • modalities are kinds of data such as text, image, audio, and video
  • multimodal AI connects more than one modality
  • paired examples help models learn relationships across modalities
  • contrastive learning can create shared embedding spaces
  • captioning and visual question answering combine vision and language
  • multimodal systems are powerful but still need careful evaluation

The next lesson shows how models become parts of larger AI systems.
