learn.colinkim.dev

Computer vision foundations

Learn how AI systems treat pixels as data for classification, detection, segmentation, and image embeddings.

Computer vision is the field of AI focused on understanding images and video.

To a person, an image contains objects, scenes, faces, text, lighting, and motion cues. To a model, an image starts as numbers.

Pixels as data

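A digital image is a grid of pixels. Each pixel stores one or more numeric values: a grayscale pixel stores a single brightness value, while a color pixel usually stores red, green, and blue intensities, commonly as integers from 0 to 255.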

image -> grid of pixels -> numbers -> model

Computer vision models learn patterns in those numbers. Early patterns might correspond to edges and textures. Later patterns might correspond to object parts, full objects, layouts, or scenes.
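
To make the "image -> numbers" step concrete, here is a minimal sketch in plain Python. The 2x2 image, the 8-bit value range, and the helper name to_model_input are illustrative assumptions, not part of any particular library.

```python
# A tiny 2x2 RGB image as nested lists: image[row][col] = (red, green, blue).
# Values are assumed to be 8-bit integers in the range 0..255.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def to_model_input(img):
    """Flatten the pixel grid into one list of floats scaled to [0, 1]."""
    return [channel / 255 for row in img for pixel in row for channel in pixel]


height, width, channels = len(image), len(image[0]), len(image[0][0])
flat = to_model_input(image)

print(height, width, channels)  # 2 2 3
print(len(flat))                # 12
print(flat[:3])                 # [1.0, 0.0, 0.0]
```

Real pipelines usually keep the grid shape instead of flattening it, but the point is the same: the model only ever sees numbers.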

Image classification

Image classification assigns one or more labels to an entire image.

input: photo
output: "dog"

Classification is useful when the question is “what is in this image overall?” It does not necessarily say where the object is.
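
One common way a classifier turns its raw outputs into a label is to convert per-class scores into probabilities and pick the largest. The sketch below assumes a single-label setup with a softmax step. The labels and scores are made up, and multi-label classification would typically threshold each class independently instead.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores, labels):
    """Return the most likely label and its probability."""
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


# Hypothetical labels and raw model scores for one photo.
labels = ["cat", "dog", "car"]
scores = [1.2, 3.4, 0.3]

label, prob = classify(scores, labels)
print(label)  # dog
```

Note that the output is one label for the whole photo, with no information about where the dog appears.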

Object detection

Object detection identifies objects and their locations.

The output may include labels and bounding boxes:

dog: x=45, y=80, width=210, height=160
person: x=260, y=40, width=90, height=260

Detection is useful when location matters, such as counting objects, reading scenes, or assisting robotics.
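
The example output above can be handled as ordinary structured data. This sketch counts objects per label and derives each box's center and area. It assumes that (x, y) is the top-left corner of the box in pixels, which the example does not state, and the helper names are hypothetical.

```python
from collections import Counter

# Each detection is a label plus a box. (x, y) is ASSUMED to be the
# top-left corner in pixels; other conventions exist (for example, box centers).
detections = [
    {"label": "dog", "x": 45, "y": 80, "width": 210, "height": 160},
    {"label": "person", "x": 260, "y": 40, "width": 90, "height": 260},
]


def count_by_label(dets):
    """Count how many boxes were reported for each label."""
    return Counter(d["label"] for d in dets)


def box_center(det):
    """Center point of a box, under the top-left-corner assumption."""
    return (det["x"] + det["width"] / 2, det["y"] + det["height"] / 2)


def box_area(det):
    """Area of a box in square pixels."""
    return det["width"] * det["height"]


counts = count_by_label(detections)
print(counts["dog"])              # 1
print(box_center(detections[0]))  # (150.0, 160.0)
print(box_area(detections[1]))    # 23400
```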

Segmentation

Segmentation labels regions of an image, often at the pixel level.

Instead of drawing a rough box around a dog, a segmentation model can mark which pixels belong to the dog.

Segmentation is useful for medical imaging, photo editing, autonomous systems, and any task where shape boundaries matter.
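
The difference between a mask and a box can be shown with a small binary mask. The 4x5 mask below is a made-up example, and mask_area and mask_to_box are hypothetical helper names.

```python
# A hypothetical 4x5 binary mask: 1 marks pixels assigned to the dog,
# 0 marks background.
mask = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]


def mask_area(m):
    """Number of pixels labeled as the object."""
    return sum(sum(row) for row in m)


def mask_to_box(m):
    """Tight bounding box as (x, y, width, height), or None for an empty mask."""
    rows = [r for r, row in enumerate(m) if any(row)]
    cols = [c for c in range(len(m[0])) if any(row[c] for row in m)]
    if not rows:
        return None
    return (cols[0], rows[0], cols[-1] - cols[0] + 1, rows[-1] - rows[0] + 1)


area = mask_area(mask)
box = mask_to_box(mask)
print(area)             # 8
print(box)              # (1, 0, 4, 3)
print(box[2] * box[3])  # 12
```

Here the tight box covers 12 pixels but only 8 of them belong to the object, which is the extra precision a mask provides.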

Convolutional neural networks

Convolutional neural networks, or CNNs, have long been a dominant architecture for vision and remain widely used. A convolution looks at small local regions of an image and learns filters for patterns such as edges, corners, and textures.

CNNs are useful because images have local structure. Nearby pixels often relate to one another.

You do not need to implement convolutions here. The key idea is that CNNs learn visual features by scanning local patterns across the image.
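
For readers who want to see the scanning idea anyway, here is an optional minimal sketch. It slides a hand-picked 2x2 filter over a tiny grayscale image with no padding and a stride of 1. In a real CNN the filter weights are learned from data, not chosen by hand.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it computes a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Grayscale image: dark on the left (0), bright on the right (9).
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# Hand-picked filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 1],
    [-1, 1],
]

response = convolve2d(image, kernel)
print(response)  # [[0, 18, 0], [0, 18, 0]]
```

The output is large only where the filter sits on the edge and zero over the flat regions, which is the sense in which a filter detects a local pattern.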

Vision transformers

Vision transformers adapt transformer ideas to images. Instead of processing text tokens, they split an image into patches and process those patches with attention.

This lets the model learn relationships between different image regions. Like language transformers, vision transformers can scale well with enough data and compute.
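
The patch-splitting step can be sketched as follows. This covers only the conversion of an image into a sequence of flattened patches. The attention layers that process those patches are omitted, and the function name and 4x4 single-channel image are illustrative assumptions.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid into non-overlapping square patches.

    Patches are returned in row-major order, each flattened into a list,
    so the image becomes a sequence much like a sequence of text tokens.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + i][left + j]
                for i in range(patch_size)
                for j in range(patch_size)
            ]
            patches.append(patch)
    return patches


# A 4x4 single-channel image holding the values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]

patches = split_into_patches(image, patch_size=2)
print(len(patches))  # 4
print(patches[0])    # [0, 1, 4, 5]
print(patches[3])    # [10, 11, 14, 15]
```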

CNNs and vision transformers are not enemies. Modern systems may use either architecture or combine ideas from both.

Image embeddings

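An image embedding is a vector representation of an image. A well-trained model places similar images close together in vector space, so visual similarity can be measured as a distance or angle between vectors.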

Image embeddings support:

  • visual search
  • duplicate detection
  • recommendation
  • clustering photo collections
  • connecting images with text
  • retrieval for multimodal systems
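
Several of these uses reduce to comparing vectors. A minimal sketch of visual search with cosine similarity is shown below. The three-dimensional embeddings and file names are invented for illustration; real embeddings come from a trained model and usually have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, index):
    """Return the name of the stored embedding closest to the query."""
    return max(index, key=lambda name: cosine_similarity(query, index[name]))


# Hypothetical embeddings for a small photo collection.
index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a new query photo.
query = [0.8, 0.2, 0.1]
print(most_similar(query, index))  # beach.jpg
```

This linear scan is fine for a handful of images. Large collections typically rely on approximate nearest-neighbor indexes instead.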

Quick Check

How is segmentation different from object detection?

What to carry forward

  • images are grids of pixel values before they become model inputs
  • classification labels whole images
  • detection identifies objects and locations
  • segmentation labels image regions or pixels
  • CNNs learn local visual patterns
  • vision transformers process image patches with attention
  • image embeddings make visual similarity computable

The next lesson explains how models can generate images instead of only analyzing them.
