These validations check that column values belong to a defined domain. They differ in whether they operate row by row or on the set of distinct values, and in the direction of the containment check.
| Validation | What it checks | Operates on |
| --- | --- | --- |
| Column values must be in set | Every row has a valid value | Each row individually |
| Column distinct values must be in set | No unexpected category exists | Unique values only |
| Column distinct values must contain set | All expected categories are present | Unique values only |
| Column distinct values must be equal to set | Categories match exactly: no more, no less | Unique values only |
| Column most common value must be in set | The most frequent value is one of the expected ones | |
Column distinct values must be in set
The set of distinct values found in the column must be contained in the given set. The validation fails if any category that is not in the allowed set appears in the data, regardless of how many rows have that value.
Use this when you want to detect category drift — for example, a new payment method appearing in your data that your pipeline does not know how to handle.
Difference from "Column values must be in set"
"Column values must be in set" checks every individual row and reports how many rows fail. This validation checks only which unique values exist in the column; it does not report a row count.
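The difference between the row-level and the distinct-value check can be sketched in plain Python. This is only an illustration of the logic described above, not the tool's implementation; the function names and the sample `payments` data are hypothetical.

```python
def unexpected_distinct_values(values, allowed):
    """Distinct-value check: return the categories that are not allowed.

    An empty result means the validation would pass.
    """
    return set(values) - set(allowed)


def count_failing_rows(values, allowed):
    """Row-level check: count how many individual rows hold a disallowed value."""
    allowed = set(allowed)
    return sum(1 for v in values if v not in allowed)


# Hypothetical sample column: "crypto" is a new, unexpected payment method.
payments = ["card", "card", "cash", "crypto", "crypto", "card"]
allowed = {"card", "cash", "transfer"}

unexpected = unexpected_distinct_values(payments, allowed)
failing_rows = count_failing_rows(payments, allowed)

print(unexpected)    # the single unexpected category
print(failing_rows)  # the number of rows carrying it
```

The distinct-value check answers "which categories are new?", while the row-level check answers "how many rows are affected?".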
Column distinct values must contain set
The given set must be fully contained in the column's distinct values. The validation fails if any value from the expected set is missing from the column.
Use this when you need to guarantee that all expected categories are present in your data — for example, ensuring a report covers all regions or all product lines.
Difference from "Column distinct values must be in set"
This validation checks the opposite direction: the column must contain every value in the given set, but it may also contain additional values. "Column distinct values must be in set" checks the reverse: every distinct value in the column must be a member of the given set.
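A minimal Python sketch of the two containment directions, assuming the column is available as a simple list. The helper names and the `regions` data are hypothetical and only illustrate the set logic.

```python
def missing_expected_values(values, expected):
    """'Must contain set': return expected values that never appear in the column."""
    return set(expected) - set(values)


def unexpected_values(values, allowed):
    """'Must be in set': return column values that are not in the given set."""
    return set(values) - set(allowed)


# Hypothetical sample column: "west" never appears, "central" is extra.
regions = ["north", "south", "north", "east", "central"]
given_set = {"north", "south", "east", "west"}

missing = missing_expected_values(regions, given_set)
extra = unexpected_values(regions, given_set)

# "Must contain set" fails only because of the missing value;
# the extra value "central" is tolerated by that check.
contain_set_passes = not missing
# "Must be in set" fails only because of the extra value.
in_set_passes = not extra
```

With the same column and the same given set, each check reports a different problem, which is why the direction matters when choosing between them.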
Column distinct values must be equal to set
The set of distinct values in the column must exactly match the given set: no extra values and no missing values.
Use this when you need strict control over the full domain of a column — for example, an enum-like field where you know every possible value and want to catch both additions and removals.
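The exact-match check combines both directions. The sketch below is an illustration under the assumption that the column fits in a Python list; `compare_to_set` and the `statuses` data are hypothetical names.

```python
def compare_to_set(values, expected):
    """Return (unexpected, missing) relative to the expected set.

    The exact-match validation passes only when both results are empty.
    """
    distinct = set(values)
    expected = set(expected)
    unexpected = distinct - expected  # additions to the domain
    missing = expected - distinct     # removals from the domain
    return unexpected, missing


# Hypothetical enum-like column.
statuses = ["open", "closed", "open", "archived"]
expected_statuses = {"open", "closed", "pending"}

unexpected, missing = compare_to_set(statuses, expected_statuses)
passes = not unexpected and not missing
```

Returning the two differences separately, rather than a single boolean, makes it clear whether a failure comes from an added value, a removed value, or both.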
Column most common value must be in set
The most frequent value in the column must be one of the values in the given set.
Use this when you want to catch anomalies in distribution. For example, you can ensure that the dominant payment method or product category is always one of the expected ones; an unexpected shift in the dominant value could indicate a data pipeline issue.
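The most-common-value check can be sketched with `collections.Counter`. The text above does not say how ties are handled, so this sketch assumes that when several values share the top frequency, all of them must be in the given set; that rule, the function names, and the sample data are assumptions for illustration.

```python
from collections import Counter


def most_common_values(values):
    """Return every value that shares the highest frequency (empty set if no data)."""
    counts = Counter(values)
    if not counts:
        return set()
    top = max(counts.values())
    return {value for value, count in counts.items() if count == top}


def most_common_value_in_set(values, allowed):
    """Assumed tie rule: pass only if all tied most-common values are allowed."""
    modes = most_common_values(values)
    return bool(modes) and modes <= set(allowed)


# Hypothetical sample columns.
allowed_categories = {"books", "toys"}
normal = ["books", "books", "toys", "games", "books", "toys"]
shifted = ["unknown", "unknown", "unknown", "books", "toys"]

normal_passes = most_common_value_in_set(normal, allowed_categories)
shifted_passes = most_common_value_in_set(shifted, allowed_categories)
```

In the `normal` column the presence of "games" does not matter, because only the dominant value is inspected; in the `shifted` column the check fails because "unknown" has become the most frequent value.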