Data anomaly detection is the automated identification of unexpected changes in your data: a table that stops receiving rows, a key column that goes 40% null overnight, a schema change that breaks a downstream join without raising any error. Every team running a production database experiences data anomalies regularly. Most discover them only after a wrong number has already reached a dashboard and driven a decision. The gap between when something breaks and when someone notices is where the damage happens.
The four types of data anomalies
Not all anomalies look the same. Understanding the four types makes it clear which monitoring checks to prioritize.
| Type | What it looks like | Example | What catches it |
|---|---|---|---|
| Volume anomaly | Row count drops or spikes vs. expected range | Events table loads 400 rows instead of 9,000 | Row-count monitoring vs. learned baseline |
| Distribution anomaly | Column null rate, mean, or cardinality shifts | user_id null rate jumps from 0.4% to 68% | Null rate and distribution monitoring |
| Freshness anomaly | Table not updated within expected window | Orders table quiet for 6 hours during business day | Freshness monitoring with per-table SLAs |
| Schema anomaly | Column added, removed, renamed, or retyped | cancelled_at shifts from timestamp to string | Schema drift monitoring |
Most real-world data incidents involve more than one type simultaneously. A schema change (user_id switching from integer to UUID) immediately cascades into a distribution anomaly (null rate spike) and eventually a volume anomaly (downstream count drops near zero). The order matters for diagnosis: catching the schema event first typically halves the time to resolution.
Why static thresholds miss most anomalies
The instinctive approach to anomaly detection is a fixed rule: alert if row count falls below 1,000, alert if null rate exceeds 5%. That approach produces two failure modes:
- False positives on regular patterns. Sunday row counts are always lower than Tuesday counts. An events table quiet at 3 a.m. is not broken. A threshold set to catch Tuesday problems fires every Sunday and every night, teaching teams to dismiss alerts until a real incident gets dismissed too.
- False negatives on relative failures. A table that normally loads 50,000 rows loading 35,000 is a 30% drop worth investigating. If the threshold is "below 1,000," the alert never fires.
A learned baseline solves both problems. The system observes your data for 7–14 days, learns the typical volume for Monday at 8 a.m. versus Sunday at 3 a.m., and alerts when the observed value deviates from the pattern for that specific time window. No manual threshold to calibrate. No Sunday false positives.
The key parameter is the segment: baselines segmented by day of week and hour of day catch the difference between a normal weekend dip and an actual outage. Without segmentation, the baseline is an average that fits no specific moment well.
The 3-signal data anomaly detection framework
Three checks cover the failure modes that cause the most downstream damage. Add them in this order. Each builds on the last.
Together, these three signals catch the incidents responsible for the majority of wrong-number events in production: broken pipelines (volume), upstream type changes (schema and distribution), and silent null cascades (distribution). For teams monitoring business metrics like DAU or MRR, add a fourth check: metric value versus same-weekday baseline, since a metric can be wrong even when all three table-level checks pass. For a deeper look at the relationship between table health and metric health, see the guide to monitoring business metrics in production.
Data anomaly detection checklist
This checklist maps each anomaly type to a specific check, threshold recommendation, and monitoring frequency. Start with freshness and volume on your five most critical tables. Add null-rate checks on the columns your core metrics depend on. Add schema drift on all tables that downstream systems join against.
| Anomaly type | What to monitor | Threshold | Frequency |
|---|---|---|---|
| Freshness | Gap since last row inserted | >2× normal update interval | Hourly |
| Volume | Row count vs. same-time-window baseline | ±30% from expected, or >3σ from distribution | Hourly |
| Null rate | % null on key columns (IDs, amounts, dates) | >5 pp above 7-day baseline | Every 2–4 hours |
| Distribution | Mean, cardinality, percentiles on numeric columns | Mean shift >25% from 14-day baseline | Daily |
| Schema drift | Column type, name, presence vs. last-known state | Any change | After each load |
| Metric value | DAU, MRR, churn vs. same-weekday baseline | >20% deviation from same-weekday average | Every 2 hours |
The thresholds here are starting points, not rules. A table that typically loads 50,000 rows with ±3,000 variance can use a tighter threshold than a table whose counts naturally vary 40% week to week. After 7–14 days of learned baseline data, tighten thresholds based on what you observe.
Before and after: what anomaly detection changes
A fintech startup runs a nightly pipeline that loads transaction events into their warehouse. The core revenue dashboard reads from this table.
Without anomaly detection: On Wednesday night, an upstream API change moves the amount field format from float to a string with a currency symbol ("$124.50" instead of 124.50). The pipeline loads successfully — the column exists, rows arrive, no error fires. The transformation that aggregates revenue multiplies a string by zero, producing MRR at $0. Wednesday's dashboard shows no revenue. The on-call engineer assumes a business slowdown. Thursday's leadership call discusses canceling a paid campaign because "revenue fell off a cliff." At 11 a.m., a data analyst notices the format change. Three hours of leadership attention and one cancelled campaign were driven by a format string.
With anomaly detection: At 2:34 a.m. Thursday, an alert fires: "transactions.amount distribution anomaly — mean dropped from $182 to $0. Schema check shows column type changed from FLOAT to VARCHAR at 2:01 a.m. Diagnosis query attached." The on-call engineer resolves the transformation within 40 minutes. The dashboard shows correct revenue before the 9 a.m. standup.
The detection fired on two signals simultaneously: schema drift (type change) and distribution anomaly (mean collapse). Together they made the diagnosis immediate, not investigative.
How to set up data anomaly detection
- Connect read-only. Create a database role with SELECT-only permissions. Anomaly detection is observation — it never modifies your data. A read-only connection also prevents accidental modification of production tables.
- Select your critical tables. Start with five: the tables that feed your most-reviewed metrics. For most teams this is events, payments, users, orders, and signups. Expand later.
- Let baselines form before trusting volume alerts. Give the system 7–14 days of observation before relying on statistical anomaly thresholds. During this period, schema drift and freshness alerts are immediate; volume and distribution baselines are still learning.
- Set schema drift alerts immediately. Schema changes are structural events that trigger immediately on the first occurrence. These alerts are reliable from day one and often the highest-value check for teams shipping code frequently.
- Add one business metric alert. Define the metric SQL and watch the computed output, not just the tables. A table can pass every check while the metric it drives returns a wrong number. Tabkeel writes the metric SQL automatically when you describe the metric in plain language.
See where your data quality stands before setting anything up. The free 2-minute data quality check grades your dataset A–F across the core anomaly dimensions, no account required. Or connect your first table to Tabkeel's Free plan and let the baseline start learning tonight — 10 tables and 2 business metrics, no credit card.
What data anomaly detection doesn't replace
- Hard constraints and data tests. Anomaly detection flags when null rate spikes from 0.5% to 22%. A data test guarantees no nulls pass at all. Both belong in a production data system — tests for the rules you know, anomaly detection for the failures you don't anticipate. For the full spectrum of checks, see what is data quality.
- Deterministic range validation. A statistical anomaly detector won't catch a
percentagecolumn holding 340. That requires an explicit range check. Anomaly detection covers statistical deviations, not rule violations. - Root cause investigation. Anomaly detection surfaces the problem and often the most likely cause. The actual fix still lives in your pipeline, source system, or transformation logic. The attached diagnosis query cuts investigation time; it doesn't eliminate it.
For a comparison of tools that combine anomaly detection with schema drift and business metric monitoring, see the data observability tools overview. For the broader framework these checks sit within, see the guide to the five pillars of data observability.
Frequently asked questions
What is a data anomaly?
A data anomaly is any pattern in your data that deviates from its expected statistical behavior. This includes volume drops (fewer rows than normal), distribution shifts (null rate spike, mean collapse), schema changes (column added, renamed, or retyped), and freshness gaps (table not updated within its expected window). Anomalies are defined relative to each table's own history, not against a universal threshold.
What is data anomaly detection?
Data anomaly detection is the automated identification of unexpected changes in your data before they produce wrong numbers on a dashboard. It uses statistical baselines learned from historical data to distinguish real anomalies (a pipeline broke) from expected variation (Sunday is always quieter than Tuesday).
How does anomaly detection differ from data testing?
Data testing checks hard constraints: not null, unique, value in range. An anomaly detector watches statistical patterns: is this volume drop unusual given the time of day and day of week? Tests catch rule violations. Anomaly detection catches unexpected deviations from learned behavior. Both belong in a production data system. See also: data quality for the full list of dimensions each approach covers.
How long does baseline formation take?
Most statistical anomaly detectors need 7–14 days of data to form reliable volume and distribution baselines. Schema drift and freshness alerts are accurate from day one — any schema change or freshness gap triggers immediately. Time to first meaningful statistical alert is typically 24–72 hours after connection.
Can I detect data anomalies without a data engineer?
Yes. Modern tools connect via a read-only credential, learn baselines automatically, and alert without requiring SQL expertise. A data engineer adds value in defining custom metric SQL and interpreting anomalies in the context of your pipeline architecture. The core setup and daily alerts require no engineering background.