
What Is Data Quality? Dimensions, Metrics, and How to Monitor It

June 18, 2026 · Francisco Ferreira

Data quality is how much you can trust a number when you act on it. A dataset is high quality when it is accurate (reflects reality), complete (no missing records), fresh (recent enough to matter), consistent (the same thing means the same everywhere), valid (values stay within expected ranges), and unique (no duplicates distorting counts). Most teams discover quality problems only after a wrong number has already influenced a decision.

Why data quality problems stay hidden

Dashboards do not notify you when data goes wrong. Queries still run. Numbers still appear. The difference is that those numbers are quietly wrong.

The typical failure pattern: a pipeline stalls at 2 a.m., a table fills with nulls, a business metric doubles overnight — and the dashboard reports nothing unusual, because dashboards wait for someone to look. Your team makes a product decision on Monday based on numbers that broke on Friday. Discovery comes from a stakeholder, not a monitor.

Gartner estimates poor data quality costs organizations an average of $12.9 million per year. The majority of that cost is not cleanup — it is the downstream effect: wrong decisions, wasted engineering time, and dashboards that slowly lose the organization's trust.

The solution is not more manual spot checks. It is treating data quality the way software teams treat application reliability: with automated monitoring against a learned baseline, not a spreadsheet reviewed on Friday afternoons.

The six dimensions of data quality

Six dimensions capture the different ways data can fail you. Understanding each one makes monitoring choices clear.

1. Accuracy

Accuracy measures whether data reflects the real-world event it represents. An order table with a shipping date in 1970 is inaccurate. A revenue row where the amount is negative by mistake is inaccurate. Accuracy is the hardest dimension to monitor automatically, because it requires comparing data against an external source of truth — something most pipelines do not have.

2. Completeness

Completeness measures whether all expected records and fields are present. An events table that loaded 200 rows instead of the usual 10,000 is incomplete. A customer record with no email value is incomplete. Completeness is closely tied to null rate: the percentage of rows where a column is missing a value. A column that is normally 2% null suddenly becoming 40% null is a completeness failure.
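As a rough sketch of the null-rate check described above: the function names and the sample column are hypothetical, and the 5-percentage-point threshold is an assumption borrowed from the checklist later in this post, not a universal rule.

```python
from typing import Optional, Sequence


def null_rate(values: Sequence[Optional[object]]) -> float:
    """Return the fraction of entries that are None (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def is_completeness_failure(current: float, baseline: float,
                            max_jump_points: float = 5.0) -> bool:
    """Flag a null rate that rose more than max_jump_points percentage points."""
    return (current - baseline) * 100 > max_jump_points


# Illustrative column: 4 of 10 emails are missing, a 40% null rate.
emails = ["a@example.com", None, "b@example.com", None, None,
          "c@example.com", "d@example.com", None, "e@example.com", "f@example.com"]

rate = null_rate(emails)
print(f"null rate: {rate:.0%}")
print("completeness failure:", is_completeness_failure(rate, baseline=0.02))
```

Comparing against the column's own baseline, rather than a fixed ceiling, is what lets a column that is always 2% null stay quiet while a jump to 40% gets flagged.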

3. Freshness

Data freshness measures the gap between now and the last time a table received data — and whether that gap is normal. A table that updates hourly but has not changed in a day is stale, even if every row is individually correct. Stale data is dangerous precisely because nothing looks broken: the dashboard loads, the numbers are there, they are just old.
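A minimal sketch of a freshness check, assuming you already know a table's last update time and its normal update interval. The `is_stale` helper and the 2× tolerance are illustrative assumptions, not a standard.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_update: datetime, expected_interval: timedelta,
             now: datetime, tolerance: float = 2.0) -> bool:
    """Return True when the gap since last_update exceeds tolerance x the normal interval."""
    return (now - last_update) > expected_interval * tolerance


now = datetime(2026, 6, 18, 12, 0, tzinfo=timezone.utc)
hourly = timedelta(hours=1)

# An hourly table that has not changed in a day is stale.
print(is_stale(now - timedelta(hours=24), hourly, now))
# The same table updated 50 minutes ago is fine.
print(is_stale(now - timedelta(minutes=50), hourly, now))
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the check easy to test.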

4. Consistency

Consistency measures whether the same fact means the same thing everywhere it appears. A customer labeled "active" in the CRM but "churned" in the billing system is inconsistent. A product ID appearing in three formats across three tables is inconsistent. Consistency failures produce the classic problem: three dashboards, three different revenue numbers, and a meeting spent arguing about which one is right.
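One way to put a number on this, sketched under the assumption that both systems can be exported as a customer-ID-to-status mapping. The `mismatch_rate` helper and the sample records are hypothetical.

```python
from typing import Mapping


def mismatch_rate(crm: Mapping[str, str], billing: Mapping[str, str]) -> float:
    """Fraction of customers present in both systems whose status differs."""
    shared = crm.keys() & billing.keys()
    if not shared:
        return 0.0
    differing = sum(1 for customer_id in shared
                    if crm[customer_id] != billing[customer_id])
    return differing / len(shared)


crm_status = {"c1": "active", "c2": "active", "c3": "churned", "c4": "active"}
billing_status = {"c1": "active", "c2": "churned", "c3": "churned", "c4": "active"}

# c2 is "active" in the CRM but "churned" in billing: 1 of 4 shared customers.
print(mismatch_rate(crm_status, billing_status))
```

This sketch only compares customers that exist in both systems. IDs missing from one side are a separate check worth running alongside it.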

5. Validity

Validity measures whether values conform to expected formats and ranges. A zip code field containing freeform text is invalid. A percentage column holding a value of 340 is invalid. Validity failures often come from transformation bugs or API changes where a field type or format shifts without the downstream system being updated.
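A small sketch of the two validity examples above. It assumes US-style ZIP codes and a percentage column that should stay between 0 and 100; both rules, and the column names, are illustrative and would need to match your own data.

```python
import re

# Assumption: US ZIP or ZIP+4 format, e.g. "94107" or "10001-0001".
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def is_valid_zip(value: str) -> bool:
    """True when the whole string matches the ZIP pattern."""
    return ZIP_PATTERN.fullmatch(value) is not None


def is_valid_percentage(value: float) -> bool:
    """True when the value lies in the assumed 0-100 range."""
    return 0 <= value <= 100


rows = [
    {"zip": "94107", "discount_pct": 15},
    {"zip": "call before delivery", "discount_pct": 20},  # freeform text
    {"zip": "10001-0001", "discount_pct": 340},           # out of range
]

invalid_rows = [
    row for row in rows
    if not (is_valid_zip(row["zip"]) and is_valid_percentage(row["discount_pct"]))
]
print(len(invalid_rows), "invalid rows")
```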

6. Uniqueness

Uniqueness measures the absence of duplicate records. A duplicate pipeline load that inserts the same 10,000 rows twice can pass most other checks, because every individual row is still accurate, fresh, consistent, and valid, yet the metric totals are wrong by exactly 2×. Uniqueness failures are easy to miss and among the most damaging to reporting accuracy.
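To make the double-load scenario concrete, here is a hedged sketch of a duplicate-rate check on a primary key. The `duplicate_rate` helper and the `order-N` keys are made up for illustration.

```python
from collections import Counter
from typing import Hashable, Iterable


def duplicate_rate(keys: Iterable[Hashable]) -> float:
    """Fraction of rows that repeat an already-seen primary key."""
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - len(counts)) / total


single_load = [f"order-{i}" for i in range(10_000)]
double_load = single_load + single_load  # the same 10,000 rows inserted twice

print(duplicate_rate(single_load))
print(duplicate_rate(double_load))
```

A clean load has a rate of zero. When every row is loaded twice, half of all rows are repeats.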

The data quality monitoring checklist

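This table maps each dimension that can be checked automatically to a specific check, a warning signal, and an appropriate monitoring frequency for most production tables. Accuracy is left out because, as noted above, it needs an external source of truth. Adjust thresholds to match your data's typical pattern.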

| Dimension | What to check | Warning signal | Frequency |
| --- | --- | --- | --- |
| Freshness | Minutes since last row inserted | Gap > 2× normal update interval | Hourly |
| Completeness / Volume | Row count vs. historical baseline | ±30% from expected count for that time window | Hourly |
| Null rate | % null in key columns | Jump > 5 percentage points above normal | Every 2–4 hours |
| Consistency | Same entity ID present across related tables | Mismatch rate > 0.1% | Daily |
| Validity | Values within expected range or format | Any out-of-range row count > 0 | Daily |
| Uniqueness | Duplicate rate on primary key | Any duplicate > 0 | After each load |

The hardest part of this checklist is not knowing what to check; it is knowing what "normal" looks like. A row count that looks low may be perfectly normal for a Sunday. A table quiet at 3 a.m. might just be running a maintenance window. A learned baseline that accounts for weekday and hourly rhythm is what separates useful alerts from noise you learn to ignore.
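A deliberately simple sketch of what a weekday-and-hour baseline could look like. It uses a plain average per (weekday, hour) bucket as a stand-in for whatever model a real monitoring tool learns. The function names, the sample history, and the 30% tolerance are all assumptions.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean
from typing import Dict, Iterable, Mapping, Tuple

Bucket = Tuple[int, int]  # (weekday, hour), where Monday == 0


def learn_baseline(history: Iterable[Tuple[datetime, int]]) -> Dict[Bucket, float]:
    """Average the observed row counts for each weekday/hour bucket."""
    buckets = defaultdict(list)
    for ts, row_count in history:
        buckets[(ts.weekday(), ts.hour)].append(row_count)
    return {bucket: mean(counts) for bucket, counts in buckets.items()}


def is_anomalous(ts: datetime, row_count: int,
                 baseline: Mapping[Bucket, float], tolerance: float = 0.30) -> bool:
    """Flag counts more than `tolerance` away from the bucket's average."""
    expected = baseline.get((ts.weekday(), ts.hour))
    if not expected:
        return False  # no usable history for this bucket yet, so stay quiet
    return abs(row_count - expected) / expected > tolerance


history = [
    (datetime(2026, 6, 1, 9), 10_000),  # Monday 09:00
    (datetime(2026, 6, 8, 9), 10_400),  # Monday 09:00
    (datetime(2026, 6, 7, 9), 2_000),   # Sunday 09:00
    (datetime(2026, 6, 14, 9), 2_200),  # Sunday 09:00
]
baseline = learn_baseline(history)

# The same row count is normal on a Sunday and alarming on a Monday.
print(is_anomalous(datetime(2026, 6, 21, 9), 2_050, baseline))  # Sunday
print(is_anomalous(datetime(2026, 6, 15, 9), 2_050, baseline))  # Monday
```

Two weeks of history per bucket is far too little for production use. The point is only that "expected" is looked up per weekday and hour rather than set once for the whole table.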

Before and after: what monitoring changes

A product team monitors weekly active users for Monday's review meeting. Their events pipeline runs nightly.

Without monitoring: On Friday at 11 p.m., a schema change upstream causes the user_id column in the events table to populate as null for 68% of rows. The pipeline completes without error. The dashboard shows 12,400 WAU, unchanged from last week, because the non-null segment looks consistent. On Monday, leadership cuts paid acquisition spend by $40,000, citing "flat engagement." At 2 p.m. that day, an engineer notices the null spike while working on an unrelated report. By then, nearly three days of wrong data had already driven the budget decision.

With monitoring: A Slack alert fires at 11:05 p.m.: "events.user_id null rate jumped from 2% → 68%, 34× the Friday-night baseline. Likely cause: upstream schema change. Diagnosis query attached." The on-call engineer resolves it before the weekend ends. Monday's meeting uses the correct number and the budget decision is made on accurate data.

The difference is not the team's skill. It is whether monitoring existed to surface the problem in minutes rather than after nearly three days.

Common mistakes teams make

How to start monitoring without a data team

You do not need a data engineer to get started with data quality monitoring. The minimum viable setup:

  1. Connect read-only. Create a read-only database role. Monitoring is observation — no write permissions needed.
  2. Pick five critical tables. The tables that feed your core business metrics: revenue, active users, orders, sign-ups. Start there, not everywhere.
  3. Let a baseline form. Give your monitoring tool 7–14 days of data before trusting anomaly alerts. The baseline needs enough history to understand your weekly rhythm.
  4. Define one business metric. The number you review every Monday. Define it once — "Daily Active Users = distinct user_ids in events where event_date = today" — and monitor both the metric value and the underlying table.
  5. Route alerts to where your team works. Slack, email, or PagerDuty. An alert that requires logging into a separate tool gets ignored.
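As an illustration of step 4, the metric definition can be written down once as a query and reused. This sketch uses Python's built-in `sqlite3` with an in-memory table purely so it runs anywhere. The `events` schema, the `daily_active_users` helper, and the sample rows are assumptions, and a real setup would connect through the read-only role from step 1.

```python
import sqlite3

# Daily Active Users = distinct user_ids in events for the given date.
DAU_QUERY = """
    SELECT COUNT(DISTINCT user_id)
    FROM events
    WHERE event_date = ?
"""


def daily_active_users(conn: sqlite3.Connection, day: str) -> int:
    """Run the single shared DAU definition for one date (YYYY-MM-DD)."""
    (count,) = conn.execute(DAU_QUERY, (day,)).fetchone()
    return count


# Hypothetical sample data standing in for a production events table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event_date TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    ("u1", "2026-06-18"),
    ("u1", "2026-06-18"),  # same user, second event
    ("u2", "2026-06-18"),
    (None, "2026-06-18"),  # null user_id is not counted
    ("u3", "2026-06-17"),
])

print(daily_active_users(conn, "2026-06-18"))
```

Note that `COUNT(DISTINCT ...)` skips nulls, so a spike in null user_ids can silently shrink or distort this metric without raising any error. That is why the step recommends monitoring the underlying table as well as the metric value.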

See where your data quality stands today — the free 2-minute health check grades your dataset A–F across all six dimensions, with no account required. Or compare the data quality monitoring tools available in 2026 to find the right fit for your stack.

Frequently asked questions

What is data quality in simple terms?

Data quality is how much you can trust a number when you use it. High-quality data is accurate (it reflects what really happened), complete (no rows or columns are missing unexpectedly), fresh (recent enough to be relevant), consistent (the same fact means the same thing everywhere), valid (values are in expected formats and ranges), and unique (no duplicate records inflating the counts). The hard part is not defining it — it is noticing when it slips.

What are the six dimensions of data quality?

The six most widely used dimensions are accuracy, completeness, freshness (timeliness), consistency, validity, and uniqueness. Each captures a different failure mode: freshness catches stale pipelines, completeness and null rate catch missing records, validity catches format errors, consistency catches divergent definitions, and uniqueness catches duplicate loads. Most real-world data quality incidents involve at least two dimensions simultaneously.

How do you measure data quality?

You measure data quality with specific checks on each dimension: null rate (percentage of blank values in key columns), freshness lag (gap between now and the last table update), row count anomaly (volume relative to historical baseline), duplicate rate on primary keys, and value-range checks on numeric columns. The most useful single signal is detection speed — how quickly a problem is caught after it occurs.

What is the difference between data quality and data observability?

Data quality is the property — how trustworthy the data is at a point in time. Data observability is the practice of continuously monitoring that property so problems surface in minutes rather than days. Data quality is the goal; observability is how you maintain it in production without waiting for a stakeholder to notice something is wrong.

Can you monitor data quality without a dedicated data team?

Yes. The core checks — freshness, row-count anomalies, null rate, schema drift — run automatically against a read-only connection and require no SQL authorship or pipeline ownership. A data team extends what you can monitor, but it is not a prerequisite to start catching the problems that matter most.