Monitoring business metrics means continuously tracking the numbers that drive decisions — daily active users, monthly recurring revenue, churn rate — and getting alerted the moment they fall outside normal range or when the data feeding them breaks. Most teams discover metric problems only after a stakeholder asks why the numbers look off. The gap between when something breaks and when someone notices is where wrong decisions get made.
Why business metric alerts fail most teams
Most monitoring setups watch infrastructure: server uptime, query latency, error rates. Those matter. But a pipeline can finish without error and still deliver wrong data to every metric that depends on it.
The typical failure: a schema change upstream causes a user_id column to populate as null in 68% of rows. The pipeline completes. No error fires. The dashboard shows DAU down 30% — and the team spends Monday standup debating whether it is a real engagement drop or a tracking bug. The number was wrong the whole time.
Two problems make this worse:
- Fixed thresholds ignore rhythm. Setting an alert for "DAU below 10,000" fires every Saturday during the expected weekend dip. Teams learn to dismiss it — until a real drop triggers the same alert and gets ignored too.
- Table-level monitoring misses metric-level failures. A table can look healthy — correct row count, recent update, normal null rate — while a specific column silently generates wrong values for your query. You need to watch the metric output directly, not just the table it comes from.
A learned baseline that accounts for day-of-week and hour-of-day rhythm solves the threshold problem. Watching the metric query output directly solves the table problem.
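A minimal sketch of a same-weekday baseline, assuming daily DAU values are available as a mapping from date to count. The function names, the four-week lookback, and the 30% tolerance are illustrative assumptions, not settings taken from any particular tool.

```python
from datetime import date, timedelta
from statistics import mean

def weekday_baseline(history, target_day, weeks=4):
    """Mean of the metric on the same weekday over the prior `weeks` weeks."""
    samples = [
        history[target_day - timedelta(weeks=k)]
        for k in range(1, weeks + 1)
        if target_day - timedelta(weeks=k) in history
    ]
    return mean(samples) if samples else None

def is_anomalous(history, target_day, value, tolerance=0.30):
    """True when `value` deviates from the same-weekday baseline by more than `tolerance`."""
    baseline = weekday_baseline(history, target_day)
    if not baseline:
        return False  # not enough history to judge
    return abs(value - baseline) / baseline > tolerance

# Hypothetical history: weekdays near 10,000, weekends near 6,000.
start = date(2024, 1, 1)  # a Monday
history = {}
for i in range(28):
    day = start + timedelta(days=i)
    history[day] = 6000 if day.weekday() >= 5 else 10000

saturday = date(2024, 2, 3)
tuesday = date(2024, 1, 30)

# The same reading of 6,100 is normal on a Saturday and alarming on a Tuesday.
print(is_anomalous(history, saturday, 6100))
print(is_anomalous(history, tuesday, 6100))
```

A fixed "below 10,000" rule would have fired on both days; comparing against the same weekday flags only the Tuesday.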
The three business metrics worth monitoring in production
Daily Active Users (DAU)
Daily Active Users (DAU) is the count of distinct users who completed at least one meaningful action in your product on a given day. A 20% drop on a Tuesday with no deploy is worth investigating. A 20% drop on Sunday is probably normal. The same number carries entirely different weight depending on which day it lands.
DAU is vulnerable to null cascades. If the user_id column in your events table goes null — from a schema change, a broken API payload, or an upstream type mismatch — your DAU query returns almost nothing while the pipeline reports a successful load. The table looks fine. The metric is wrong.
What to monitor for DAU:
- Null rate on events.user_id: the most common upstream failure mode
- Row count in the events table vs. same weekday baseline: a zero load produces zero DAU
- DAU value vs. the same day of week over the prior four weeks, not yesterday
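The null-rate check in this list can be sketched as a single query. This example runs against an in-memory SQLite database so it is self-contained; the two-column events schema, the sample rows, and the helper name are assumptions for illustration, and the 0.4% baseline and 5-point margin are placeholders you would replace with your own learned values.

```python
import sqlite3

# Percentage of rows with a null user_id for one day of events.
NULL_RATE_SQL = """
SELECT 100.0 * SUM(CASE WHEN user_id IS NULL THEN 1 ELSE 0 END) / COUNT(*)
FROM events
WHERE event_date = ?
"""

def null_rate_pct(conn, event_date):
    """Return the null percentage, or None when no rows exist for that date."""
    return conn.execute(NULL_RATE_SQL, (event_date,)).fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_date TEXT)")
rows = [(i, "2024-01-30") for i in range(1, 9)] + [(None, "2024-01-30")] * 2
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

rate = null_rate_pct(conn, "2024-01-30")  # 2 of the 10 rows have a null user_id
baseline_pct = 0.4                        # assumed learned baseline
breach = rate is not None and rate - baseline_pct > 5.0
print(rate, breach)
```

A `None` result means no rows were loaded for that date, which is a separate signal (the zero-load case) and is worth alerting on in its own right.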
Monthly Recurring Revenue (MRR)
Monthly Recurring Revenue (MRR) is the normalized monthly revenue from active subscriptions, excluding one-time charges. It compounds: a small error in one month's calculation inflates every downstream trend.
MRR is particularly vulnerable to duplicate billing loads. A payment event table that loads twice in one night produces exactly 2× MRR — accurate at the row level, wrong at the metric level. Every other check passes. Only a uniqueness check on the payment primary key catches it.
What to monitor for MRR:
- Uniqueness check on payment_id: duplicate loads are invisible without this
- Freshness on the payments table: stale billing data produces stale revenue numbers
- MRR value vs. 30-day rolling average, with wider tolerance around known billing cycles
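A small sketch of the uniqueness check, again using in-memory SQLite so it runs as written. The payments schema, the sample amounts, and the helper names are hypothetical; the point is that a double load leaves every row individually valid while the summed revenue doubles.

```python
import sqlite3

# Any payment_id that appears more than once indicates a duplicate load.
DUPLICATE_SQL = """
SELECT payment_id, COUNT(*) AS copies
FROM payments
GROUP BY payment_id
HAVING COUNT(*) > 1
ORDER BY payment_id
"""

def duplicate_payment_ids(conn):
    return conn.execute(DUPLICATE_SQL).fetchall()

def summed_revenue(conn):
    return conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payment_id TEXT, amount REAL)")
nightly_load = [("p1", 39.0), ("p2", 39.0), ("p3", 99.0)]
conn.executemany("INSERT INTO payments VALUES (?, ?)", nightly_load)
conn.executemany("INSERT INTO payments VALUES (?, ?)", nightly_load)  # accidental second load

print(summed_revenue(conn))         # twice the true total of 177.0
print(duplicate_payment_ids(conn))  # every payment_id shows two copies
```

Running the duplicate query after each load, and alerting on any returned row, matches the "any duplicate > 0" rule.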
Churn Rate
Churn rate is the percentage of customers or revenue lost in a given period, typically measured monthly. It is a lagging indicator — a customer who cancels in August decided in June. But the data failures that distort churn are immediate.
A subscription cancellation table with a schema change — cancelled_at shifting from a timestamp to a date string — produces zero parsed cancellations for a week. The UI shows "0% churn month to date." That is a data failure, not a retention achievement.
What to monitor for churn:
- Schema drift alerts on date and timestamp columns in billing tables
- Row count in cancellation tables — a weekday zero is a red flag
- Churn rate value vs. 90-day rolling baseline, since churn has seasonal patterns
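One way to sketch schema drift detection is to snapshot column types and diff the snapshots between runs. This example uses SQLite's `PRAGMA table_info`, which is SQLite-specific; most warehouses expose similar metadata through `information_schema.columns`. The table layout and helper names are assumptions for illustration.

```python
import sqlite3

def column_types(conn, table):
    """Snapshot {column_name: declared_type} for one table.

    `table` is interpolated into the statement, so pass only trusted names.
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return {row[1]: row[2].upper() for row in rows}

def schema_drift(before, after):
    """Return {column: (old_type, new_type)} for added, dropped, or retyped columns."""
    changes = {}
    for col in before.keys() | after.keys():
        old, new = before.get(col), after.get(col)
        if old != new:
            changes[col] = (old, new)
    return changes

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscription_changes (customer_id INTEGER, cancelled_at TIMESTAMP)"
)
baseline = column_types(conn, "subscription_changes")

# Simulate the upstream change: cancelled_at becomes a plain string column.
conn.execute("DROP TABLE subscription_changes")
conn.execute(
    "CREATE TABLE subscription_changes (customer_id INTEGER, cancelled_at TEXT)"
)
drift = schema_drift(baseline, column_types(conn, "subscription_changes"))
print(drift)
```

Any non-empty result would trigger the alert, before a week of unparsed cancellations accumulates.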
The business metric health check: two checks per metric
This table maps the most common failure mode for each metric to the right monitoring check. Use it to design your first alert set — it covers the failures that cause the most decision damage in the least time.
| Metric | Most common failure | What to check | Alert threshold |
|---|---|---|---|
| DAU | Null spike in user_id | Null rate on events.user_id | >5 pp above day-of-week baseline |
| DAU | Zero event load | Row count in events table | >40% below same-weekday baseline |
| MRR | Duplicate billing load | Uniqueness on payment_id | Any duplicate > 0 |
| MRR | Stale billing data | Freshness lag on payments table | >25 hours since last insert |
| Churn | Schema change on date column | Schema drift on cancelled_at | Any column type change |
| Churn | Zero cancellation rows | Row count in subscription_changes | Zero rows on a non-holiday weekday |
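Two of the thresholds in this table, the 25-hour freshness lag and the 40% row-count drop, reduce to simple comparisons. The sketch below assumes you already have the last insert time, the current row count, and a same-weekday baseline; the function names and sample values are illustrative.

```python
from datetime import datetime, timedelta

def is_stale(last_insert, now, max_lag=timedelta(hours=25)):
    """True when the newest row is older than the allowed freshness lag."""
    return now - last_insert > max_lag

def row_count_dropped(count, weekday_baseline, max_drop=0.40):
    """True when the row count is more than `max_drop` below the baseline."""
    if weekday_baseline <= 0:
        return False  # no usable baseline yet
    return (weekday_baseline - count) / weekday_baseline > max_drop

now = datetime(2024, 1, 31, 2, 0)
print(is_stale(datetime(2024, 1, 30, 0, 0), now))  # 26 hours old
print(is_stale(datetime(2024, 1, 30, 2, 0), now))  # 24 hours old
print(row_count_dropped(5000, 9400))               # roughly 47% below baseline
print(row_count_dropped(9000, 9400))               # roughly 4% below baseline
```

A 25-hour limit rather than 24 leaves an hour of slack for a nightly job that finishes slightly later than the previous run.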
Before and after: what a metric alert actually looks like
A SaaS team monitors DAU for their Monday review meeting. Their events pipeline runs nightly.
Without monitoring: On Thursday night, a deploy changes the user_id field in the event payload from an integer to a string. The pipeline ingests successfully — the column exists, no parsing error fires. But the existing DAU query casts user_id to integer, which silently produces null for every new event. Friday's DAU reads 1,200 instead of the usual 8,400. The team notices Monday morning. Three days of product decisions — a feature rollout, a paid campaign — were made on a dashboard showing 85% less engagement than reality.
With monitoring: At 2:18 a.m. Friday, a Slack alert fires: "events.user_id null rate jumped from 0.4% to 89%, more than 200× the Thursday-night baseline. Most likely cause: upstream type change in event payload. Diagnosis query attached." The on-call engineer patches the pipeline by 4 a.m. Friday's DAU reads correctly. Monday's review uses accurate data.
A diagnosis query is a SQL query attached to an alert that confirms or rules out the most likely cause of the anomaly. It cuts mean time to resolution from hours to minutes. An alert that says "DAU dropped" starts a fire drill. An alert with a diagnosis query ends it.
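As a rough sketch of what "attached" can mean in practice, an alert can carry its diagnosis SQL as a field and render it into the message body. The `MetricAlert` class, its field names, and the diagnosis query itself are hypothetical; the query assumes the events table has `event_type` and `event_date` columns.

```python
from dataclasses import dataclass

@dataclass
class MetricAlert:
    check: str
    baseline_pct: float
    observed_pct: float
    likely_cause: str
    diagnosis_sql: str

    def message(self) -> str:
        """Render the alert text with the diagnosis query at the end."""
        return (
            f"{self.check} jumped from {self.baseline_pct:.1f}% "
            f"to {self.observed_pct:.1f}%.\n"
            f"Most likely cause: {self.likely_cause}\n"
            f"Diagnosis query:\n{self.diagnosis_sql}"
        )

# Breaks the null rows down by event type to show where the nulls come from.
DIAGNOSIS_SQL = """SELECT event_type, COUNT(*) AS null_rows
FROM events
WHERE user_id IS NULL AND event_date = CURRENT_DATE
GROUP BY event_type
ORDER BY null_rows DESC"""

alert = MetricAlert(
    check="events.user_id null rate",
    baseline_pct=0.4,
    observed_pct=89.0,
    likely_cause="upstream type change in event payload",
    diagnosis_sql=DIAGNOSIS_SQL,
)
print(alert.message())
```

The on-call engineer can paste the query straight into a SQL client instead of reconstructing the investigation from scratch.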
How to set up business metric monitoring without a data engineer
- Connect read-only. Create a database role with SELECT-only permissions. Monitoring is observation — it never needs to write. A read-only connection also prevents any risk of the monitoring tool modifying production data.
- Define your metric SQL once. For DAU: SELECT COUNT(DISTINCT user_id) FROM events WHERE event_date = CURRENT_DATE. For MRR: the same query your billing dashboard already runs. The monitoring tool watches the output of this specific query, not just the underlying table. Tools like Tabkeel generate this SQL automatically from a plain-language description; you review and confirm, but don't write from scratch.
- Let the baseline form before trusting alerts. Allow 7–14 days of history before enabling anomaly thresholds. The baseline needs enough data to understand your weekly rhythm; otherwise every Monday spike after a quiet Sunday looks like an anomaly.
- Set schema drift alerts on critical columns. A type change on a date or ID column is one of the most destructive upstream events for business metrics. Alert on any column type change in the tables your core metrics depend on.
- Route alerts to where your team already works. Slack for async teams, PagerDuty for on-call rotation, email for executives. An alert that requires logging into a separate tool will be ignored at 2 a.m.
Connect a read-only database in two minutes and watch your first metric tonight — Tabkeel's Free plan monitors 10 tables and 2 business metrics, and the AI writes the metric SQL so you don't have to. Free, then $39/mo when you need more.
Metric-level monitoring vs. table-level monitoring
Business metric monitoring is not the same as table monitoring. Table monitoring checks that data arrived, is fresh, and has expected volume. That is necessary — but not sufficient.
A table can pass every check while a metric query returns wrong results. Consider: the events table has 9,400 rows (normal for that time window), updated 45 minutes ago (fresh), with 1.2% nulls overall (within normal range). But the DAU query filters WHERE event_type = 'session_start', and that specific event type had its user_id go 100% null overnight. The table check passes. The metric is wrong.
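The scenario above can be reproduced in a few lines. The row counts here are scaled down and the 2% table-level null limit is an assumed threshold; the shape of the failure is the same: the table-wide null rate stays low while the one event type the DAU query depends on has lost every user_id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event_type TEXT)")

# 988 healthy page_view rows, plus 12 session_start rows whose user_id is null.
rows = [(i % 50 + 1, "page_view") for i in range(988)]
rows += [(None, "session_start")] * 12
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)

# Table-level check: overall null rate across all event types.
table_null_pct = conn.execute(
    "SELECT 100.0 * SUM(user_id IS NULL) / COUNT(*) FROM events"
).fetchone()[0]

# Metric-level check: the value the dashboard actually shows.
dau = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM events WHERE event_type = 'session_start'"
).fetchone()[0]

table_check_passes = table_null_pct < 2.0  # assumed table-level limit
print(table_null_pct, table_check_passes, dau)
```

The table check passes at roughly 1.2% nulls while the metric query returns zero, which is the gap metric-level monitoring is meant to close.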
Metric-level monitoring watches the computed output — the number that appears in your dashboard — and alerts when it moves outside the expected range for that specific time of day and weekday. Table-level monitoring tells you data arrived. Metric-level monitoring tells you whether that data produces the right number.
For a full comparison of tools that support both levels, see the data observability tools overview — several draw the metric-vs-table distinction differently, and the right choice depends on your stack.
Common mistakes in business metric monitoring
- Treating dashboard review as monitoring. Manual review is not monitoring. By the time someone looks at the dashboard and notices something is wrong, the data has already driven decisions.
- Fixed thresholds on cyclical metrics. DAU below 8,000 fires every Saturday. The team disables it. Then a real Tuesday crash fires the same alert — now ignored. Learned baselines that account for weekday patterns eliminate this.
- Skipping uniqueness checks on billing tables. Duplicate loads pass every other check silently. Add a uniqueness check on payment primary keys after every load event.
- Starting with too many metrics. Begin with the three that appear in Monday's review meeting. Five monitors you act on are worth more than thirty you dismiss.
- Alerts without a diagnosis path. Every alert should attach a diagnosis query or link to a runbook. An alert that says "something is wrong" without pointing toward why creates panic, not resolution.
Frequently asked questions
What is business metric monitoring?
Business metric monitoring is the automated tracking of KPIs — DAU, MRR, churn rate — with alerts when values leave expected range or when the data producing them breaks. It sits between your pipelines and your dashboards, catching problems before they reach a decision-maker.
How is metric monitoring different from data observability?
Data observability covers the health of your entire data system: tables, pipelines, schemas, freshness. Business metric monitoring is a focused subset: it watches the output of the specific queries your KPIs depend on. You need both — observability to catch upstream failures, metric monitoring to catch the downstream impact on the numbers you act on.
Can you monitor business metrics without writing SQL?
Yes. Tools designed for this purpose generate the metric SQL automatically when you describe what you want to measure — "daily active users from the events table," for example. You review and approve the query before it runs, but you do not write it from scratch. The AI authors the query; you own the metric definition.
How quickly should a business metric alert fire?
For DAU and revenue metrics: within two hours of the anomaly occurring. For churn: daily checks are usually sufficient since churn is a lagging indicator. The goal is to catch a problem before it drives more than one day of data-backed decisions — and well before a stakeholder surfaces it in a meeting.
What is the difference between a metric alert and a data quality alert?
A data quality alert fires on the source table: null rate spiked, row count dropped, schema changed. A metric alert fires on the computed output: DAU dropped 32% below the Tuesday baseline. Both can fire for the same incident from different angles. The metric alert shows a KPI is affected; the data quality alert shows why. Together they cut mean time to resolution from hours to minutes.