AI systems should not be evaluated only by how fluent their output sounds.
Evaluation asks whether a system works for a task, for the intended users, under realistic conditions, with acceptable risks.
Benchmarks
A benchmark is a standardized evaluation task or dataset. Benchmarks make it easier to compare models on specific abilities.
Benchmarks can test math, coding, reading comprehension, image understanding, factual questions, reasoning, safety behavior, or domain-specific tasks.
Benchmarks are useful, but they are incomplete. A high score on one benchmark does not prove broad reliability.
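As a minimal sketch of why one score is not the whole picture, the snippet below scores the same model on two benchmarks using exact-match accuracy. The benchmark names, predictions, and reference answers are made up for illustration, and real benchmarks often use more forgiving scoring rules than exact match.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty benchmark")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical outputs from one model on two invented benchmarks.
results = {
    "arithmetic": (["4", "9", "12", "7"], ["4", "9", "12", "7"]),
    "reading": (["B", "A", "D", "C"], ["B", "C", "D", "A"]),
}

scores = {name: accuracy(preds, refs) for name, (preds, refs) in results.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

In this toy data the model is perfect on arithmetic and gets only half of the reading items right, so reporting the arithmetic score alone would overstate its reliability.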
Human evaluation
Human evaluation asks people to judge model outputs. Reviewers may assess correctness, clarity, helpfulness, tone, completeness, safety, and whether the answer follows instructions.
Human review can catch issues benchmarks miss. It can also be inconsistent, expensive, and sensitive to reviewer instructions.
Good evaluation often combines automated tests, human review, and real-world monitoring.
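One way to put a number on reviewer inconsistency is an agreement statistic. The sketch below computes Cohen's kappa, which compares how often two reviewers agree with how often they would agree by chance. The reviewer labels are invented, and kappa is only one of several agreement measures a team might choose.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two reviewers, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both reviewers gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: based on how often each reviewer uses each label.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical judgments from two reviewers on the same eight model outputs.
reviewer_1 = ["good", "good", "bad", "good", "bad", "bad", "good", "bad"]
reviewer_2 = ["good", "bad", "bad", "good", "bad", "good", "good", "bad"]

kappa = cohens_kappa(reviewer_1, reviewer_2)
print(f"kappa = {kappa:.2f}")
```

The two reviewers agree on six of eight items, but half of that agreement would be expected by chance, so kappa comes out at 0.5. A low value is often a sign that the reviewer instructions need to be clearer.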
Hallucination
A hallucination is an output that is unsupported, fabricated, or false while appearing plausible.
Hallucinations happen because generative models produce likely outputs, not guaranteed facts. A model may fill gaps with patterns that sound right.
Retrieval and citations can reduce hallucinations, but they do not eliminate them.
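A rough way to check an answer against retrieved sources is to flag sentences that share few content words with them. The sketch below is a crude heuristic, not a hallucination detector: the function names, the word-length filter, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re


def content_words(text):
    """Lowercased words longer than three characters, a rough proxy for content."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def unsupported_sentences(answer, sources, min_overlap=0.5):
    """Return answer sentences whose content words mostly do not appear in the sources."""
    source_words = set()
    for source in sources:
        source_words |= content_words(source)

    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = content_words(sentence)
        if not words:
            continue
        overlap = len(words & source_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


# Invented example: the second sentence has no support in the source text.
sources = ["The Eiffel Tower was completed in 1889 and stands in Paris."]
answer = "The Eiffel Tower stands in Paris. It was designed by Leonardo da Vinci."

flagged = unsupported_sentences(answer, sources)
print(flagged)
```

Word overlap cannot tell whether a claim is true. A false sentence that reuses the source's vocabulary would pass this check, which is one reason retrieval reduces hallucinations without eliminating them.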
Bias
Bias is systematic unfairness or distortion in outputs. It can come from training data, labels, deployment context, evaluation choices, or user behavior.
Bias can appear as stereotypes, unequal performance across groups, unfair recommendations, or missing representation.
Bias is not only a social issue. It is also a technical reliability issue: the system behaves differently depending on inputs in ways that may be unjustified or harmful.
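Unequal performance across groups can be checked by scoring each group separately. The sketch below does that for a small invented dataset; the group names and records are hypothetical, and an accuracy gap is only one of many possible fairness measures.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Accuracy per group, given (group, prediction, label) records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, prediction, label in records:
        totals[group] += 1
        correct[group] += int(prediction == label)
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical evaluation records: (group, model prediction, true label).
records = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 1), ("group_b", 0, 1),
]

per_group = accuracy_by_group(records)
overall = sum(pred == label for _, pred, label in records) / len(records)
gap = max(per_group.values()) - min(per_group.values())

print(f"overall: {overall:.2f}")
print(f"per group: {per_group}")
print(f"gap: {gap:.2f}")
```

Overall accuracy here is 0.75, which hides the fact that the model is right every time for one group and only half the time for the other.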
Robustness
Robustness means the system keeps working when inputs vary in ways that should not change the correct output.
A robust model should handle phrasing changes, typos, lighting changes, noisy audio, unusual formatting, and realistic edge cases.
Some models are surprisingly sensitive. A small change in wording or image conditions can change the output.
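Sensitivity to small changes can be measured by rerunning the same input with harmless variations and counting how often the output stays the same. In the sketch below, `toy_classifier` is a deliberately brittle stand-in for a real model, and the variants are hand-written examples rather than a standard perturbation set.

```python
def toy_classifier(text):
    """Hypothetical stand-in for a real model: a brittle, case-sensitive keyword matcher."""
    return "positive" if "great" in text else "negative"


def consistency_rate(model, original, variants):
    """Share of variants for which the model gives the same output as for the original."""
    if not variants:
        raise ValueError("need at least one variant")
    baseline = model(original)
    same = sum(model(variant) == baseline for variant in variants)
    return same / len(variants)


original = "The service was great"
variants = [
    "The service was great!",    # added punctuation
    "the service was GREAT",     # changed casing
    "The service was graet",     # typo
    "The  service  was  great",  # extra spacing
]

rate = consistency_rate(toy_classifier, original, variants)
print(f"consistency: {rate:.2f}")
```

The toy model survives the punctuation and spacing changes but flips its answer on the casing change and the typo, so its consistency rate is 0.5. None of these variants should change the meaning for a human reader.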
Calibration
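Calibration describes whether a system's stated confidence matches how often it is actually correct. For example, among answers given with 80% confidence, about 80% should turn out to be right.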
A well-calibrated system is uncertain when it is likely wrong and confident when it is likely right. Many AI systems can sound confident even when they should be uncertain.
This matters because users often rely on confidence cues.
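A common way to summarize calibration is expected calibration error (ECE). It groups predictions into confidence bins and averages the gap between confidence and accuracy in each bin, weighted by bin size. The sketch below assumes the system reports a numeric confidence between 0 and 1 for each answer. The bin count and the data are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy across confidence bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")

    bins = [[] for _ in range(n_bins)]
    for confidence, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, is_correct))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(c for c, _ in bucket) / len(bucket)
        bin_accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_confidence - bin_accuracy)
    return ece


# Invented data: 1 means the answer was correct, 0 means it was wrong.
overconfident = expected_calibration_error([0.9] * 5, [1, 1, 1, 0, 0])
calibrated = expected_calibration_error([0.5] * 4, [1, 0, 1, 0])

print(f"overconfident system: {overconfident:.2f}")
print(f"calibrated system: {calibrated:.2f}")
```

The first system claims 90% confidence but is right only 60% of the time, so its error is about 0.3. The second claims 50% and is right 50% of the time, so its error is 0.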
Distribution shift
Distribution shift happens when deployment data differs from training or evaluation data.
For example, a medical imaging model trained on one hospital’s equipment may perform worse at another hospital. A language model evaluated on clean benchmark questions may struggle with messy workplace documents.
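One simple signal of distribution shift is a change in the mix of input types. The sketch below compares the share of each category in evaluation data and deployment data using total variation distance, which is 0 when the mixes are identical and 1 when they do not overlap at all. The categories and counts are invented, and real monitoring would usually track many more features than one label.

```python
from collections import Counter


def total_variation_distance(sample_a, sample_b):
    """Half the sum of absolute differences between category frequencies in two samples."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    freq_a = Counter(sample_a)
    freq_b = Counter(sample_b)
    categories = set(freq_a) | set(freq_b)
    # Counter returns 0 for categories that are missing from one sample.
    return 0.5 * sum(
        abs(freq_a[c] / len(sample_a) - freq_b[c] / len(sample_b)) for c in categories
    )


# Hypothetical document types seen during evaluation and after deployment.
evaluation_inputs = ["clean"] * 8 + ["scanned"] * 2
deployment_inputs = ["clean"] * 3 + ["scanned"] * 5 + ["handwritten"] * 2

shift = total_variation_distance(evaluation_inputs, deployment_inputs)
print(f"shift: {shift:.2f}")
```

A distance of about 0.5 here says that half of the deployment traffic would have to change category to match the evaluation mix. It does not say how much accuracy will drop, only that the evaluation results may no longer apply.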
Adversarial examples
An adversarial example is an input designed to make a model fail.
For image models, small changes can sometimes cause wrong classifications. For language models, carefully written prompts can try to bypass instructions or trigger unsafe behavior.
Adversarial testing helps reveal brittle behavior before attackers or accidents do.
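To illustrate how a small, targeted change can flip a prediction, the sketch below attacks a tiny linear classifier. It nudges each feature by a fixed amount in the direction that lowers the score, which is the gradient-sign idea applied to a linear model. The weights and input are invented, and real image or language models are far more complex than this toy.

```python
def score(weights, bias, x):
    """Linear score: positive means class 1, otherwise class 0."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict(weights, bias, x):
    return 1 if score(weights, bias, x) > 0 else 0


def sign(value):
    return (value > 0) - (value < 0)


def adversarial_perturbation(weights, x, label, epsilon):
    """Shift every feature by at most epsilon, pushing the score away from the given label."""
    # For a linear model, the gradient of the score with respect to x is the weight vector.
    direction = -1 if label == 1 else 1
    return [xi + direction * epsilon * sign(w) for w, xi in zip(weights, x)]


# Hypothetical model and input.
weights = [0.5, -0.25, 1.0]
bias = 0.0
x = [0.25, 0.5, 0.125]

original_label = predict(weights, bias, x)
x_adv = adversarial_perturbation(weights, x, original_label, epsilon=0.125)
adversarial_label = predict(weights, bias, x_adv)

print(f"original prediction: {original_label}")
print(f"perturbed input: {x_adv}")
print(f"adversarial prediction: {adversarial_label}")
```

No feature moves by more than 0.125, yet the prediction changes from class 1 to class 0 because every change pushes the score in the same direction.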
Data contamination
Data contamination happens when evaluation examples appear in the training data.
If a model has seen benchmark answers during training, the benchmark may overstate real capability. This is especially important for models trained on large web-scale datasets.
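A common first check for contamination is to look for long word sequences that appear in both the benchmark and the training data. The sketch below flags benchmark items that share any five-word sequence with a training document. The data is invented, the n-gram length is an assumption, and this kind of exact matching misses paraphrased or translated copies.

```python
def word_ngrams(text, n):
    """Set of n-word sequences in the text, lowercased and split on whitespace."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=5):
    """Share of benchmark items sharing at least one n-gram with the training documents."""
    if not benchmark_items:
        raise ValueError("need at least one benchmark item")
    training_ngrams = set()
    for doc in training_docs:
        training_ngrams |= word_ngrams(doc, n)
    flagged = [item for item in benchmark_items if word_ngrams(item, n) & training_ngrams]
    return len(flagged) / len(benchmark_items), flagged


# Hypothetical training documents and benchmark questions.
training_docs = [
    "forum post: what is the capital of australia answer canberra",
    "a recipe for lemon cake with sugar and butter",
]
benchmark_items = [
    "What is the capital of Australia",
    "How many legs does a spider have",
]

rate, flagged_items = contamination_rate(benchmark_items, training_docs)
print(f"contaminated share: {rate:.2f}")
print(flagged_items)
```

Here one of the two questions appears almost word for word in the training data, so a correct answer to it says little about what the model can do on unseen questions.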
Quick Check
What does calibration measure?
Why that answer is correct
Calibration is about matching confidence to actual reliability.
What to carry forward
- evaluation must go beyond fluent output
- benchmarks test specific tasks but do not prove universal quality
- human evaluation captures qualities automated tests may miss
- hallucinations are plausible but unsupported or false outputs
- bias, robustness, calibration, and distribution shift affect reliability
- adversarial examples and data contamination can distort evaluation
The next lesson covers safety, ethics, and responsible AI in concrete terms.