Post-training and alignment | Modern AI

Post-training is the work done after large-scale pretraining to shape how a model behaves.

Pretraining gives a model broad pattern knowledge. Post-training tries to make that model more useful, more steerable, and less likely to cause harm in real use.

Alignment

Alignment means making a model’s behavior match human intentions, values, rules, and safety expectations.

That is harder than it sounds. People disagree. Instructions can be ambiguous. Good behavior depends on context. A model may satisfy the literal request while violating the user’s real goal.

Alignment is not a single technique. It is a set of training methods, evaluations, policies, product choices, and monitoring practices.

Instruction tuning

Instruction tuning uses examples that pair user requests with desired responses.

instruction: Explain overfitting to a beginner.
response: Overfitting happens when...

This teaches the model to treat user messages as requests to answer or act, rather than as text to continue in any style.

Instruction tuning is one reason assistant-style models feel different from base models.
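
As a rough illustration, the pair above could be turned into a single training example like this. The template text, the function name `build_example`, and the character-level mask are all assumptions made for this sketch; real systems use model-specific chat templates and work on tokens. The mask reflects one common choice, which is to compute the training loss only on the response part.

```python
# Hypothetical prompt template; real chat templates differ from model to model.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_example(instruction: str, response: str) -> dict:
    """Join one instruction/response pair into a single training text."""
    prompt = PROMPT_TEMPLATE.format(instruction=instruction)
    text = prompt + response
    # Simplification: one mask entry per character instead of per token.
    # 0 = prompt (ignored by the loss), 1 = response (the model learns to produce it).
    loss_mask = [0] * len(prompt) + [1] * len(response)
    return {"text": text, "loss_mask": loss_mask}


example = build_example(
    "Explain overfitting to a beginner.",
    "Overfitting happens when...",
)
print(example["text"])
```

Masking the prompt means the model is trained to write the answer, not to reproduce the request.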

Human feedback

Human feedback often comes from people comparing multiple model outputs and choosing the better one.

The comparison might consider correctness, clarity, helpfulness, safety, tone, refusal behavior, and whether the answer follows instructions.

The goal is to provide a training signal for qualities that are hard to capture with simple labels.
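
One comparison can be stored as a small record. This is a minimal sketch: the class name `PreferencePair`, its fields, and the helper `to_pair` are hypothetical and chosen for illustration, not taken from any particular dataset format.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the output the annotator preferred
    rejected: str  # the output the annotator did not prefer


def to_pair(prompt: str, output_a: str, output_b: str, preferred: str) -> PreferencePair:
    """Convert one annotator judgment ('a' or 'b') into a chosen/rejected pair."""
    if preferred not in ("a", "b"):
        raise ValueError("preferred must be 'a' or 'b'")
    if preferred == "a":
        return PreferencePair(prompt, chosen=output_a, rejected=output_b)
    return PreferencePair(prompt, chosen=output_b, rejected=output_a)


pair = to_pair(
    "Explain overfitting to a beginner.",
    "Overfitting is when a model memorizes its training data.",
    "Overfitting is bad.",
    preferred="a",
)
print(pair.chosen)
```

Note that the record only says which output was better. It does not say why, which is part of what makes this signal both flexible and noisy.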

Reward models and RLHF

In reinforcement learning from human feedback, or RLHF, preference data is often used to train a reward model. A reward model assigns a score to an output, and it is trained so that outputs people preferred score higher than the ones they rejected.

Then the language model is further trained to produce outputs that score well according to that reward model.

Conceptually:

human preferences -> reward model -> train assistant behavior
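
The first arrow, from human preferences to a reward model, is often trained with a pairwise loss. The sketch below shows one common formulation (a Bradley-Terry style loss). It assumes the reward model has already produced a scalar score for each response; the function names are hypothetical, and a real implementation would operate on batches of tensors.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the chosen response scores well above the
    rejected one, and large when the ordering is reversed.
    """
    margin = score_chosen - score_rejected
    return softplus(-margin)


# Scores here are made-up numbers for illustration only.
print(pairwise_reward_loss(2.0, 0.0))  # reward model agrees with the human
print(pairwise_reward_loss(0.0, 2.0))  # reward model disagrees
```

Only the difference between the two scores matters, so the reward model learns a ranking rather than an absolute scale.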

RLHF can improve helpfulness and reduce some unwanted behavior, but it can also create new problems if the reward signal is incomplete or exploitable.
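
A toy example shows what an exploitable reward signal can look like. Everything here is invented for illustration: `toy_reward` is a deliberately flawed stand-in for a reward model that has learned "longer is better", and `pick_best` simply selects the highest-scoring candidate.

```python
def toy_reward(response: str) -> float:
    """Deliberately flawed proxy: counts words and ignores content."""
    return float(len(response.split()))


def pick_best(candidates: list[str], reward_fn) -> str:
    """Return the candidate with the highest reward."""
    return max(candidates, key=reward_fn)


candidates = [
    "Overfitting means the model memorizes training data and fails on new data.",
    "Great question! " * 10 + "It is complicated.",
]

# The padded, unhelpful answer wins because it is longer.
print(pick_best(candidates, toy_reward))
```

A model optimized hard against a signal like this would learn to pad its answers, which raises the score without improving the behavior people actually wanted.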

DPO

Direct Preference Optimization, or DPO, is another way to train from preference pairs. Instead of training a separate reward model and then using reinforcement learning, DPO directly adjusts the model to prefer chosen responses over rejected responses.
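
For one preference pair, the DPO loss can be sketched as follows. In the published formulation, the model being trained is compared against a frozen reference copy of itself, and a parameter `beta` controls how strongly the model is pushed away from that reference. The sketch assumes the summed log-probabilities of each response have already been computed; the function and argument names are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for a single (chosen, rejected) pair.

    Each argument is the log-probability of a whole response under
    either the model being trained (policy) or the frozen reference.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))
    return softplus(-beta * (chosen_logratio - rejected_logratio))


# Made-up log-probabilities for illustration only.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

The loss falls as the model raises the probability of the chosen response, relative to the reference, more than it raises the probability of the rejected one. No reward model appears anywhere in the computation.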

For this course, the important distinction is simple: both RLHF and DPO use preference data to shape behavior, but RLHF learns a reward model first and then optimizes against it, while DPO updates the model from the preference pairs directly.

Safety tuning

Safety tuning tries to reduce harmful behavior. It may teach the model to:

refuse certain dangerous requests
avoid revealing private or sensitive information
express appropriate caution on medical, legal, or financial questions
decline to assist with abuse, fraud, or violence
acknowledge limits instead of inventing answers

Safety tuning is not only about refusals. A safer model should also help with benign requests, redirect risky requests when possible, and avoid unnecessary obstruction.
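
One way to keep both sides of this balance visible is to measure them separately. The sketch below assumes a hypothetical evaluation set where each request has been labeled harmful or benign, and each model response has been labeled as a refusal or not. The function name and metric names are assumptions for illustration.

```python
def refusal_rates(results: list[tuple[bool, bool]]) -> dict:
    """Summarize refusal behavior.

    results: (is_harmful_request, model_refused) for each test prompt.
    """
    benign = [refused for harmful, refused in results if not harmful]
    harmful = [refused for harmful, refused in results if harmful]
    # Share of benign requests the model refused (unnecessary obstruction).
    over_refusal = sum(benign) / len(benign) if benign else 0.0
    # Share of harmful requests the model did not refuse.
    missed = sum(not r for r in harmful) / len(harmful) if harmful else 0.0
    return {"over_refusal_rate": over_refusal, "missed_refusal_rate": missed}


# Made-up labels for illustration only.
results = [
    (False, False), (False, True), (False, False), (False, False),
    (True, True), (True, False),
]
print(refusal_rates(results))
```

A model that refuses everything would score perfectly on the second number and terribly on the first, which is why a single refusal count is not enough.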

Why alignment is not solved

Alignment is difficult because models are flexible, users are diverse, and real situations are messy.

A model can:

misunderstand intent
follow a harmful instruction too literally
refuse a harmless request
produce biased or unfair outputs
sound confident when uncertain
behave differently under small prompt changes
fail in situations unlike its training data

Quick Check

What is the main purpose of post-training?

What to carry forward

post-training shapes behavior after pretraining
instruction tuning teaches models to follow requests
human feedback can train preferences that are hard to label directly
RLHF often uses a reward model
DPO trains directly on preference pairs, without a separate reward model
safety tuning tries to balance helpfulness and harmlessness
alignment is important, practical, and still not solved

The next lesson shifts from language to vision.