Data quality is how much you can trust a number when you act on it. A dataset is high quality when it is accurate (reflects reality), complete (no missing records), fresh (recent enough to matter), consistent (the same thing means the same everywhere), valid (values stay within expected ranges), and unique (no duplicates distorting counts). Most teams discover quality problems only after a wrong number has already influenced a decision.
Why data quality problems stay hidden
Dashboards do not notify you when data goes wrong. Queries still run. Numbers still appear. The difference is that those numbers are quietly wrong.
The typical failure pattern: a pipeline stalls at 2 a.m., a table fills with nulls, a business metric doubles overnight — and the dashboard reports nothing unusual, because dashboards wait for someone to look. Your team makes a product decision on Monday based on numbers that broke on Friday. Discovery comes from a stakeholder, not a monitor.
Gartner estimates poor data quality costs organizations an average of $12.9 million per year. The majority of that cost is not cleanup — it is the downstream effect: wrong decisions, wasted engineering time, and dashboards that slowly lose the organization's trust.
The solution is not more manual spot checks. It is treating data quality the way software teams treat application reliability: with automated monitoring against a learned baseline, not a spreadsheet reviewed on Friday afternoons.
The six dimensions of data quality
Six dimensions capture the different ways data can fail you. Understanding each one makes monitoring choices clear.
1. Accuracy
Accuracy measures whether data reflects the real-world event it represents. An order table with a shipping date in 1970 is inaccurate. A revenue row where the amount is negative by mistake is inaccurate. Accuracy is the hardest dimension to monitor automatically, because it requires comparing data against an external source of truth — something most pipelines do not have.
2. Completeness
Completeness measures whether all expected records and fields are present. An events table that loaded 200 rows instead of the usual 10,000 is incomplete. A customer record with no email address is incomplete. Completeness is closely tied to null rate: the percentage of rows where a column is missing a value. A column that is normally 2% null suddenly becoming 40% null is a completeness failure.
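A minimal sketch of a null-rate check, assuming rows arrive as plain Python dictionaries. The function names, the sample `customers` data, and the 5-point default (borrowed from the checklist in this article) are illustrative assumptions, not part of any specific tool.

```python
from typing import Any, Mapping, Sequence


def null_rate(rows: Sequence[Mapping[str, Any]], column: str) -> float:
    """Return the fraction of rows where `column` is absent or None."""
    if not rows:
        return 0.0  # nothing to measure; an empty load is a volume problem
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def null_rate_jumped(current: float, baseline: float,
                     max_jump_points: float = 5.0) -> bool:
    """True when the null rate rose by more than `max_jump_points` percentage points."""
    return (current - baseline) * 100 > max_jump_points


# Hypothetical sample: 2 of 5 customer rows have no email value.
customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "c@example.com"},
    {"id": 4},
    {"id": 5, "email": "e@example.com"},
]
rate = null_rate(customers, "email")
print(rate, null_rate_jumped(rate, baseline=0.02))
```

Comparing against a baseline rather than a fixed ceiling is the design choice here: a column that is always 30% null should not alert, while a column that moves from 2% to 40% should.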
3. Freshness
Data freshness measures the gap between now and the last time a table received data — and whether that gap is normal. A table that updates hourly but has not changed in a day is stale, even if every row is individually correct. Stale data is dangerous precisely because nothing looks broken: the dashboard loads, the numbers are there, they are just old.
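The freshness test can be sketched as a comparison between the current lag and the table's normal update interval. This is a hedged illustration: `is_stale`, its parameters, and the 2× tolerance (taken from the checklist in this article) are assumptions, and how you obtain the last update timestamp depends on your database.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_lag(last_update: datetime, now: Optional[datetime] = None) -> timedelta:
    """Gap between `now` (defaults to the current UTC time) and the last update."""
    now = now or datetime.now(timezone.utc)
    return now - last_update


def is_stale(last_update: datetime, normal_interval: timedelta,
             now: Optional[datetime] = None, tolerance: float = 2.0) -> bool:
    """True when the lag exceeds `tolerance` times the normal update interval."""
    return freshness_lag(last_update, now) > normal_interval * tolerance


# Hypothetical hourly table that has not changed in a day.
now = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
print(is_stale(now - timedelta(days=1), timedelta(hours=1), now=now))
```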
4. Consistency
Consistency measures whether the same fact means the same thing everywhere it appears. A customer labeled "active" in the CRM but "churned" in the billing system is inconsistent. A product ID appearing in three formats across three tables is inconsistent. Consistency failures produce the classic problem: three dashboards, three different revenue numbers, and a meeting spent arguing about which one is right.
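Two simple consistency checks can be sketched as set comparisons between systems. The dictionaries below stand in for a CRM export and a billing export; both the data and the helper names are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List


def mismatch_rate(source_ids: Iterable[Hashable],
                  reference_ids: Iterable[Hashable]) -> float:
    """Fraction of distinct source IDs that do not appear in the reference set."""
    source = set(source_ids)
    if not source:
        return 0.0
    return len(source - set(reference_ids)) / len(source)


def conflicting_status(crm: Dict[str, str], billing: Dict[str, str]) -> List[str]:
    """Customer IDs present in both systems whose status labels differ."""
    shared = crm.keys() & billing.keys()
    return sorted(cid for cid in shared if crm[cid] != billing[cid])


# Hypothetical exports: c2 is "active" in one system and "churned" in the other.
crm = {"c1": "active", "c2": "active", "c3": "churned"}
billing = {"c1": "active", "c2": "churned", "c4": "active"}
print(conflicting_status(crm, billing), mismatch_rate(crm, billing))
```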
5. Validity
Validity measures whether values conform to expected formats and ranges. A zip code field containing freeform text is invalid. A percentage column holding a value of 340 is invalid. Validity failures often come from transformation bugs or API changes where a field type or format shifts without the downstream system being updated.
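A validity check is usually a format rule plus a range rule per column. The sketch below assumes US-style ZIP codes and a percentage bounded by 0 and 100; the column names `zip_code` and `discount_pct` are invented for illustration.

```python
import re
from typing import Any, Dict, List

# Assumption: US ZIP (12345) or ZIP+4 (12345-6789) format.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def validity_errors(row: Dict[str, Any]) -> List[str]:
    """Return the names of the columns in `row` that fail their rule."""
    errors = []
    zip_code = row.get("zip_code")
    if not isinstance(zip_code, str) or not ZIP_PATTERN.fullmatch(zip_code):
        errors.append("zip_code")
    pct = row.get("discount_pct")
    if not isinstance(pct, (int, float)) or not 0 <= pct <= 100:
        errors.append("discount_pct")
    return errors


print(validity_errors({"zip_code": "94107", "discount_pct": 15}))
print(validity_errors({"zip_code": "near the park", "discount_pct": 340}))
```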
6. Uniqueness
Uniqueness measures the absence of duplicate records. A duplicate pipeline load that inserts the same 10,000 rows twice will pass every other dimension check — the data is accurate, complete, fresh, consistent, and valid — but the metric totals are wrong by exactly 2×. Uniqueness failures are the easiest to miss and among the most damaging to reporting accuracy.
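A duplicate check on a primary key can be as small as a counter. The sketch below simulates a double load with made-up order rows; `duplicate_keys` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def duplicate_keys(keys: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Map each key that occurs more than once to its occurrence count."""
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical double load: the same batch is inserted twice.
batch = [{"order_id": i, "amount": 10.0} for i in range(1, 4)]
table = batch + batch

dupes = duplicate_keys(row["order_id"] for row in table)
total = sum(row["amount"] for row in table)
print(dupes, total)
```

Every row in `table` is individually correct, yet the summed amount is twice the true total, which is why this check belongs after each load.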
The data quality monitoring checklist
This table maps each dimension to a specific check, a warning signal, and an appropriate monitoring frequency for most production tables. Adjust thresholds to match your data's typical pattern.
| Dimension | What to check | Warning signal | Frequency |
|---|---|---|---|
| Freshness | Minutes since last row inserted | Gap > 2× normal update interval | Hourly |
| Completeness / Volume | Row count vs. historical baseline | ±30% from expected count for that time window | Hourly |
| Null rate | % null in key columns | Jump > 5 percentage points above normal | Every 2–4 hours |
| Consistency | Same entity ID present across related tables | Mismatch rate > 0.1% | Daily |
| Validity | Values within expected range or format | Out-of-range row count > 0 | Daily |
| Uniqueness | Duplicate rate on primary key | Duplicate count > 0 | After each load |
The hardest part of this checklist is not knowing what to check; it is knowing what "normal" looks like. A row count that seems low on Sunday may be exactly what is expected. A table quiet at 3 a.m. might just be inside a maintenance window. A learned baseline that accounts for weekday and hourly rhythm is what separates useful alerts from noise you learn to ignore.
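One simple way to approximate a learned baseline is to bucket historical row counts by weekday and hour and compare each new count with the median for its slot. This is a sketch under stated assumptions, not how any particular monitoring product works: the median, the ±30% tolerance, and the function names are illustrative choices.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median
from typing import Dict, Iterable, Tuple

Slot = Tuple[int, int]  # (weekday, hour), with Monday == 0


def build_baseline(history: Iterable[Tuple[datetime, int]]) -> Dict[Slot, float]:
    """Median row count per (weekday, hour) slot."""
    buckets = defaultdict(list)
    for ts, count in history:
        buckets[(ts.weekday(), ts.hour)].append(count)
    return {slot: median(counts) for slot, counts in buckets.items()}


def is_anomalous(ts: datetime, count: int, baseline: Dict[Slot, float],
                 tolerance: float = 0.30) -> bool:
    """True when `count` deviates from its slot's baseline by more than `tolerance`."""
    expected = baseline.get((ts.weekday(), ts.hour))
    if not expected:
        return False  # no usable history for this slot yet, so stay quiet
    return abs(count - expected) / expected > tolerance


# Hypothetical history: quiet Sundays, busy Mondays, both sampled at 09:00.
history = [
    (datetime(2024, 1, 7, 9), 300),     # Sunday
    (datetime(2024, 1, 14, 9), 320),    # Sunday
    (datetime(2024, 1, 8, 9), 10000),   # Monday
    (datetime(2024, 1, 15, 9), 10400),  # Monday
]
baseline = build_baseline(history)
print(is_anomalous(datetime(2024, 1, 21, 9), 310, baseline))  # a Sunday
print(is_anomalous(datetime(2024, 1, 22, 9), 310, baseline))  # a Monday
```

The same count of 310 rows is normal in the Sunday slot and a large drop in the Monday slot, which a single fixed threshold cannot express.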
Before and after: what monitoring changes
A product team monitors weekly active users for Monday's review meeting. Their events pipeline runs nightly.
Without monitoring: On Friday at 11 p.m., a schema change upstream causes the user_id column in the events table to populate as null for 68% of rows. The pipeline completes without error. The dashboard shows 12,400 WAU — unchanged from last week, because the non-null segment looks consistent. On Monday, leadership cuts paid acquisition spend by $40,000, citing "flat engagement." At 2 p.m., an engineer notices the null spike while working on an unrelated report. Three days of wrong data had already driven the budget decision.
With monitoring: A Slack alert fires at 11:05 p.m.: "events.user_id null rate jumped from 2% → 68%, 34× the Friday-night baseline. Likely cause: upstream schema change. Diagnosis query attached." The on-call engineer resolves it before the weekend ends. Monday's meeting uses the correct number and the budget decision is made on accurate data.
The difference is not the team's skill. It is whether monitoring existed to surface the problem in minutes rather than after nearly three days.
Common mistakes teams make
- Fixed thresholds instead of baselines. "Alert if row count < 1,000" fires every Sunday during the expected weekend dip. Teams learn to dismiss the alert, and then dismiss it again on the day it signals a real problem. Learned baselines that account for weekday rhythm eliminate this.
- Spot checks on a schedule. Manual checks on Monday morning miss problems that happened Saturday and were already consumed by Sunday's automated reports. Monitoring needs to run continuously.
- Treating data quality as only a data team problem. Most quality failures originate at the application layer (a form field stops saving), the pipeline layer (a join produces nulls), or the infrastructure layer (a schema change goes unannounced). Waiting for a data engineer to notice is waiting too long.
- Monitoring tables but not metrics. A metric like Daily Active Users can look healthy at the table level but be wrong at the metric level — because a null column excluded exactly the users who should be counted. Metric-level monitoring catches what table-level monitoring misses.
- Skipping uniqueness checks after loads. Duplicate loads are silent. The data passes every other check — and uniqueness is usually the last one added.
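The gap between table-level and metric-level monitoring can be illustrated with a toy events list. The data and the helper name are hypothetical; the point is only that the two numbers can diverge.

```python
from typing import Dict, List, Optional, Tuple

Event = Dict[str, Optional[str]]


def row_count_and_active_users(events: List[Event]) -> Tuple[int, int]:
    """Return (table-level row count, metric-level distinct non-null users)."""
    row_count = len(events)
    active_users = len({e["user_id"] for e in events if e["user_id"] is not None})
    return row_count, active_users


# Same number of rows in both loads; in the second, user_id is null for half.
healthy = [{"user_id": u} for u in ["a", "b", "c", "d"]]
broken = [{"user_id": u} for u in ["a", None, None, "d"]]

print(row_count_and_active_users(healthy))
print(row_count_and_active_users(broken))
```

A row-count check sees two identical loads here, while the active-user metric has halved.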
How to start monitoring without a data team
You do not need a data engineer to get started with data quality monitoring. The minimum viable setup:
- Connect read-only. Create a read-only database role. Monitoring is observation — no write permissions needed.
- Pick five critical tables. The tables that feed your core business metrics: revenue, active users, orders, sign-ups. Start there, not everywhere.
- Let a baseline form. Give your monitoring tool 7–14 days of data before trusting anomaly alerts. The baseline needs enough history to understand your weekly rhythm.
- Define one business metric. The number you review every Monday. Define it once, for example "Daily Active Users = distinct user_ids in events where event_date = today", and monitor both the metric value and the underlying table.
- Route alerts to where your team works. Slack, email, or PagerDuty. An alert that requires logging into a separate tool gets ignored.
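The read-only connection and the single metric definition from the steps above can be sketched with Python's built-in `sqlite3` module standing in for a production database. The `events` table and its `user_id` and `event_date` columns follow the metric definition in the list; the temporary file, sample rows, and helper names are assumptions. On a server database the equivalent of `mode=ro` would typically be a role granted only read permissions.

```python
import os
import sqlite3
import tempfile


def open_read_only(path: str) -> sqlite3.Connection:
    """Open a SQLite file read-only via the `mode=ro` URI parameter."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def daily_active_users(conn: sqlite3.Connection, day: str) -> int:
    """Distinct non-null user_ids in `events` for the given ISO date."""
    row = conn.execute(
        "SELECT COUNT(DISTINCT user_id) FROM events WHERE event_date = ?",
        (day,),
    ).fetchone()
    return row[0]


# Demo setup: a throwaway database with a handful of hypothetical events.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
writer = sqlite3.connect(path)
writer.execute("CREATE TABLE events (user_id TEXT, event_date TEXT)")
writer.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("a", "2024-01-08"), ("a", "2024-01-08"), ("b", "2024-01-08"),
     (None, "2024-01-08"), ("c", "2024-01-07")],
)
writer.commit()
writer.close()

monitor = open_read_only(path)
print(daily_active_users(monitor, "2024-01-08"))
```

`COUNT(DISTINCT user_id)` skips null values, so a null spike lowers this metric without raising an error. That is why the list recommends watching the underlying table as well as the metric.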
Frequently asked questions
What is data quality in simple terms?
Data quality is how much you can trust a number when you use it. High-quality data is accurate (it reflects what really happened), complete (no rows or columns are missing unexpectedly), fresh (recent enough to be relevant), consistent (the same fact means the same thing everywhere), valid (values are in expected formats and ranges), and unique (no duplicate records inflating the counts). The hard part is not defining it — it is noticing when it slips.
What are the six dimensions of data quality?
The six most widely used dimensions are accuracy, completeness, freshness (timeliness), consistency, validity, and uniqueness. Each captures a different failure mode: freshness catches stale pipelines, completeness and null rate catch missing records, validity catches format errors, consistency catches divergent definitions, and uniqueness catches duplicate loads. Most real-world data quality incidents involve at least two dimensions simultaneously.
How do you measure data quality?
You measure data quality with specific checks on each dimension: null rate (percentage of blank values in key columns), freshness lag (gap between now and the last table update), row count anomaly (volume relative to historical baseline), duplicate rate on primary keys, and value-range checks on numeric columns. The most useful single signal is detection speed — how quickly a problem is caught after it occurs.
What is the difference between data quality and data observability?
Data quality is the property — how trustworthy the data is at a point in time. Data observability is the practice of continuously monitoring that property so problems surface in minutes rather than days. Data quality is the goal; observability is how you maintain it in production without waiting for a stakeholder to notice something is wrong.
Can you monitor data quality without a dedicated data team?
Yes. The core checks — freshness, row-count anomalies, null rate, schema drift — run automatically against a read-only connection and require no SQL authorship or pipeline ownership. A data team extends what you can monitor, but it is not a prerequisite to start catching the problems that matter most.