How We Built a Three-Layer GA4 Audit Engine: Admin API, Playwright, and AI Anomaly Detection
Most GA4 audit tools check your property configuration and stop there. We built something different: a pipeline that combines static API analysis, live headless browser execution against your real site, and statistical anomaly detection across your historical data. This article explains why each layer is necessary and how they work together.
The Problem with Single-Layer Audits
When we started building GA4 Audits, the existing tools fell into two camps. The first group checked the GA4 Admin API: data retention settings, linked products, custom dimension count, that kind of thing. Useful as a baseline, but entirely disconnected from what actually happens when a real user visits your site.
The second group crawled your website and checked whether a GA4 measurement ID appeared on the page. Also useful as a starting point, but equally shallow: a measurement ID present in the source does not tell you whether the tag fires under consent restrictions, whether the DataLayer push arrives before or after the GA4 snippet initialises, or whether the consent mode signals are ordered correctly.
Neither approach could surface the class of failures that actually cause commercial harm: consent-driven data loss, attribution model misconfiguration causing Smart Bidding to optimise on the wrong signal, or statistical anomalies hiding inside otherwise clean-looking traffic trends.
The architecture we needed was not a single tool but a pipeline with three independent analytical layers, each interrogating a different aspect of the same property, with results aggregated into a single weighted score.
Layer 1: GA4 Admin API and Data API Static Analysis
The first layer queries the GA4 Admin API and the GA4 Data API to build a structural picture of the property. This layer runs 42 checks across property configuration, data stream setup, key event health, attribution model alignment, and account linkage.
A few checks that illustrate why this layer matters in practice:
DDA data sufficiency. Google's Data-Driven Attribution model requires roughly 400 conversions in a 28-day window before it activates. Below that threshold, GA4 silently falls back to last-click. We query the Admin API to confirm DDA is the configured attribution model, then query the Data API to count key event volume over the trailing 28 days. If the property has DDA selected but insufficient conversion volume, every Smart Bidding campaign is optimising on last-click attribution while the interface claims data-driven. This is one of the most commercially significant misconfigurations we find, and no configuration-only audit can detect it; a sketch of this check follows the examples below.
Key event rate sanity. We compute the rate by dividing key event count by session count from the Data API. A rate above 100% means the event fires more than once per session, which almost always means the trigger is attached to page views or session starts rather than actual conversions. Smart Bidding receiving this signal will optimise toward users who generate any session rather than users who purchase.
Inactive key events. GA4 properties accumulate key events over time: deprecated checkout funnels, legacy campaign goals, and experimental events that were marked as key events during a test and never cleaned up. We query key event volume over 30 days for each configured key event. Events with zero conversions in 30 days that are still toggled as key events confuse the bid algorithm and reduce the effective signal-to-noise ratio for Smart Bidding optimisation.
Reporting Identity risk. The Reporting Identity setting determines how GA4 stitches sessions together across devices. The Device-based identity model, which uses only device identifiers rather than user IDs or Google Signals, produces the most fragmented view and is the worst option for paid media attribution. We detect this via the Admin API and flag it as high severity for any property running paid campaigns.
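To make the DDA sufficiency check concrete, here is a minimal sketch using the official google-analytics-data Python client. The keyEvents metric name, the threshold constant, and the shape of the configured_model string (read separately from the Admin API's attribution settings) are assumptions to verify against your client library version.

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import DateRange, Metric, RunReportRequest

DDA_MIN_KEY_EVENTS_28D = 400  # approximate activation threshold discussed above


def check_dda_sufficiency(property_id: str, configured_model: str) -> dict:
    """configured_model is the attribution model string read separately from the
    Admin API's attribution settings."""
    client = BetaAnalyticsDataClient()
    response = client.run_report(RunReportRequest(
        property=f"properties/{property_id}",
        date_ranges=[DateRange(start_date="28daysAgo", end_date="yesterday")],
        metrics=[Metric(name="keyEvents")],
    ))
    key_events_28d = int(response.rows[0].metric_values[0].value) if response.rows else 0

    # Matching on the substring avoids committing to one enum spelling of the
    # data-driven model across API versions.
    dda_selected = "DATA_DRIVEN" in configured_model.upper()
    insufficient = dda_selected and key_events_28d < DDA_MIN_KEY_EVENTS_28D
    return {
        "check": "dda_data_sufficiency",
        "passed": not insufficient,
        "evidence": {"key_events_28d": key_events_28d, "dda_selected": dda_selected},
    }
```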
The static analysis layer produces structured output that feeds directly into the scoring engine and serves as the baseline context for the anomaly detection layer.
Layer 2: Playwright Headless Browser Execution
The second layer is the most technically complex. We spin up a Chromium instance via Playwright and crawl the live site, intercepting network requests, monitoring the DataLayer, and evaluating consent mode signal timing.
This layer runs 61 checks. It is the layer that no configuration audit can replicate, because configuration tells you what is supposed to happen. Playwright tells you what actually happens.
Consent Mode V2 Signal Timing
Consent Mode V2 compliance requires that the consent signals arrive before the GA4 tag initialises. The sequence matters: if GA4 fires before the CMP has resolved the user's consent state, the tag either runs unconstrained or defaults to denied depending on the default consent configuration.
We capture this by intercepting all network requests to Google's measurement endpoints and the DataLayer push events in parallel. We record two timestamps: the first gtag/dataLayer push carrying consent parameters, and the first network request to google-analytics.com or googletagmanager.com. If the network request precedes the consent push, the implementation is non-compliant regardless of what the CMP vendor's documentation says.
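A condensed Playwright sketch of the timing capture, assuming the Python sync API. The consent-push detection heuristic and the epoch-millisecond timestamping are simplifications of the production logic, and GTM may later replace the wrapped push function, so this only reliably catches the early consent default and CMP pushes.

```python
import time

from playwright.sync_api import sync_playwright

GOOGLE_TAG_HOSTS = ("google-analytics.com", "googletagmanager.com")


def measure_consent_timing(url: str) -> dict:
    """Record the first consent push and the first Google tag request as epoch ms."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        timing = {"first_consent_push_ms": None, "first_google_request_ms": None}

        # Wrap dataLayer.push before any page script runs. gtag('consent', ...)
        # arrives as an arguments object whose first entry is 'consent'; CMP
        # pushes are plain objects carrying consent keys such as ad_storage.
        page.add_init_script("""
            window.__consentPushTs = null;
            window.dataLayer = window.dataLayer || [];
            const originalPush = window.dataLayer.push.bind(window.dataLayer);
            window.dataLayer.push = function (...args) {
                const isConsent = args.some(a => a && (a[0] === 'consent' ||
                    (typeof a === 'object' && 'ad_storage' in a)));
                if (isConsent && window.__consentPushTs === null) {
                    window.__consentPushTs = Date.now();
                }
                return originalPush(...args);
            };
        """)

        def on_request(request):
            if timing["first_google_request_ms"] is None and any(
                host in request.url for host in GOOGLE_TAG_HOSTS
            ):
                timing["first_google_request_ms"] = time.time() * 1000

        page.on("request", on_request)
        page.goto(url, wait_until="networkidle")
        timing["first_consent_push_ms"] = page.evaluate("window.__consentPushTs")
        browser.close()

    compliant = (
        timing["first_consent_push_ms"] is not None
        and timing["first_google_request_ms"] is not None
        and timing["first_consent_push_ms"] < timing["first_google_request_ms"]
    )
    return {"consent_before_tag": compliant, **timing}
```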
We also detect the presence of consent mode default configuration in the tag itself. A common failure pattern is a site with a CMP that pushes ad_storage: denied on user deny, but no default configuration in the GA4 tag. In this case, the period between page load and CMP resolution, which can be 300 to 800 milliseconds on a slow connection, runs with unconstrained tracking. We flag this separately from a full consent mode failure because it represents a grey-area compliance risk that has different remediation steps.
DataLayer Pollution Detection
DataLayer pollution is underdiagnosed because it rarely breaks tracking outright. It degrades it. We define pollution as any of the following: more than 50 keys across the dataLayer, pushed objects nested more than three levels deep, or more than 20 discrete push events on a single page load.
Each of these conditions slows GTM's evaluation loop, increases the risk of tag sequencing errors, and degrades the accuracy of event parameter collection. We capture the full DataLayer state at page load completion and analyse it programmatically, flagging pollution thresholds with the specific count that triggered the check.
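A sketch of the pollution snapshot, assuming an already-loaded Playwright page object; the recursion depth cap and the threshold constants mirror the definition above and are configurable in practice.

```python
MAX_DATALAYER_KEYS = 50
MAX_NESTING_DEPTH = 3
MAX_PUSH_EVENTS = 20


def analyse_datalayer(page) -> dict:
    """Snapshot the dataLayer on a loaded Playwright page and flag pollution."""
    snapshot = page.evaluate("""() => {
        const dl = window.dataLayer || [];
        const keys = new Set();
        let maxDepth = 0;
        // Depth-capped walk: dataLayer entries can reference DOM nodes, so we
        // stop descending at depth 10 to avoid cycles.
        const depthOf = (obj, depth) => {
            if (depth > 10 || obj === null || typeof obj !== 'object') return depth;
            const children = Object.values(obj).map(v => depthOf(v, depth + 1));
            return Math.max(depth, ...children);
        };
        for (const entry of dl) {
            if (entry && typeof entry === 'object') {
                Object.keys(entry).forEach(k => keys.add(k));
                maxDepth = Math.max(maxDepth, depthOf(entry, 1));
            }
        }
        return { pushCount: dl.length, keyCount: keys.size, maxDepth };
    }""")
    return {
        "polluted": (
            snapshot["keyCount"] > MAX_DATALAYER_KEYS
            or snapshot["maxDepth"] > MAX_NESTING_DEPTH
            or snapshot["pushCount"] > MAX_PUSH_EVENTS
        ),
        "evidence": snapshot,
    }
```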
Tag Performance Budget
We intercept all third-party script requests during page load and compute the total JavaScript payload attributable to analytics and marketing tags. A site running GA4, GTM, a CMP, Facebook Pixel, LinkedIn Insight Tag, HubSpot tracking, and Hotjar can easily accumulate 400 to 600 kilobytes of third-party JavaScript before a single line of product code runs.
We report total tag payload size, total network requests attributable to the tracking stack, and flag properties where the analytics payload alone exceeds a configurable threshold. This data connects directly to Core Web Vitals: third-party tag weight is one of the primary controllable contributors to Largest Contentful Paint degradation.
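A sketch of the payload measurement, assuming a Playwright page and an illustrative, far-from-complete domain list. Note that response.body() returns the decoded body, so this approximates transferred weight; production code would read transfer sizes instead.

```python
TRACKING_DOMAINS = (
    "googletagmanager.com", "google-analytics.com", "connect.facebook.net",
    "snap.licdn.com", "js.hs-scripts.com", "static.hotjar.com",
)  # illustrative subset of the full catalogue


def measure_tag_payload(page, url: str) -> dict:
    """Crawl the URL and sum the bytes of tracking-related responses."""
    tracked = []

    def on_response(response):
        if any(domain in response.url for domain in TRACKING_DOMAINS):
            tracked.append(response)

    page.on("response", on_response)
    page.goto(url, wait_until="networkidle")

    payload = {"requests": len(tracked), "bytes": 0}
    for response in tracked:
        try:
            payload["bytes"] += len(response.body())
        except Exception:
            pass  # bodies are unavailable for redirects and some cached responses
    return payload
```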
Redundant Tag Detection
A common mistake during GTM migrations is leaving a hardcoded GA4 tag in the site's HTML while also deploying GA4 through GTM. Both tags fire. Every page view is counted twice. We detect this by looking for GA4 measurement IDs loaded both via a direct script tag and via the GTM container network response. When we find the same measurement ID delivered through both channels, we flag it as critical, because duplicate data contamination in GA4 cannot be retroactively corrected.
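A simplified sketch of the duplicate-ID comparison. The gtm_container_bodies input is assumed to be collected from intercepted gtm.js responses during the crawl, and matching IDs in page.content() is a simplification, since GTM itself injects script tags into the DOM after load; the production check also inspects the initial HTML response.

```python
import re

MEASUREMENT_ID = re.compile(r"G-[A-Z0-9]{6,12}")


def find_duplicate_ga4_tags(page, gtm_container_bodies: list[str]) -> dict:
    """Flag measurement IDs delivered both hardcoded in the page and via GTM."""
    hardcoded_ids = set(MEASUREMENT_ID.findall(page.content()))
    gtm_ids = set()
    for body in gtm_container_bodies:
        gtm_ids.update(MEASUREMENT_ID.findall(body))

    duplicates = sorted(hardcoded_ids & gtm_ids)
    return {"passed": not duplicates, "duplicate_measurement_ids": duplicates}
```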
Layer 3: AI-Driven Statistical Anomaly Detection
The third layer operates on historical data retrieved from the GA4 Data API and applies statistical methods to surface patterns that are invisible to rule-based checks.
A rule-based check can tell you that direct traffic is above 20% of sessions. It cannot tell you whether that percentage is abnormal for this specific property given its channel mix, seasonality, and historical baseline. Statistical anomaly detection can.
Z-Score Traffic Anomaly Detection
We replaced a naive threshold check (flag if sessions swing by more than a fixed percentage week-over-week) with a Z-score approach operating on a rolling 12-week baseline.
For each metric (sessions, users, key events, revenue), we compute the mean and standard deviation across the trailing 12 weeks. We then compute the Z-score for the most recent week: (current value - mean) / standard deviation. A Z-score above 2.5 or below -2.5 is flagged as anomalous.
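The computation itself is small; a sketch in Python, assuming twelve trailing weekly values per metric.

```python
from statistics import mean, stdev

Z_THRESHOLD = 2.5


def weekly_zscore(baseline_weeks: list[float], current_week: float) -> dict:
    """Z-score of the latest week against the trailing 12-week baseline for one
    metric (sessions, users, key events, or revenue)."""
    mu = mean(baseline_weeks)
    sigma = stdev(baseline_weeks)
    if sigma == 0:
        return {"z": 0.0, "anomalous": False}  # perfectly flat baseline
    z = (current_week - mu) / sigma
    return {"z": round(z, 2), "anomalous": abs(z) > Z_THRESHOLD}
```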
The advantage over a fixed threshold is that it automatically adapts to the property's scale and volatility. A property that normally fluctuates 40% week-over-week requires a much larger movement to constitute an anomaly than a stable, high-volume e-commerce site. Fixed percentage thresholds produce too many false positives on volatile properties and miss real failures on stable ones.
We also apply day-of-week normalisation before computing Z-scores on daily data. Without this, a site with strong weekday bias will show apparent anomalies every weekend even when traffic is perfectly normal for that day pattern.
Coefficient of Variation Traffic Stability Scoring
Z-scores detect point anomalies (one bad week). Coefficient of variation (CV) measures whether a property's traffic is structurally unstable over a longer window.
CV is the ratio of standard deviation to mean. A property with a CV above 0.40 (40%) across 90 days of sessions is exhibiting volatility that is likely partly explained by tracking inconsistency rather than purely business variation. We report CV alongside its interpretation: what share of the observed volatility is likely attributable to tracking fragility versus genuine audience behaviour.
A high CV on revenue combined with a low CV on sessions is a particularly informative pattern: it suggests the revenue tracking pipeline (purchase events, transaction IDs, item arrays) is unreliable while session tracking is stable, which typically points to a checkout-specific implementation failure.
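A sketch of both calculations. The 0.40 revenue threshold is the one discussed above; the 0.20 session cut-off in the pattern check is an assumed illustrative value.

```python
from statistics import mean, stdev


def coefficient_of_variation(values: list[float]) -> float:
    """CV over a window, e.g. 90 days of daily sessions or revenue."""
    mu = mean(values)
    return stdev(values) / mu if mu > 0 else 0.0


def checkout_tracking_suspect(sessions: list[float], revenue: list[float]) -> bool:
    """High revenue CV with low session CV: the pattern that usually points to a
    checkout-specific implementation failure."""
    return (coefficient_of_variation(revenue) > 0.40
            and coefficient_of_variation(sessions) < 0.20)
```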
7-Day Trend Slope Detection
Point anomalies and volatility scoring both miss gradual deterioration. A property losing 2% of its key event volume per week will not trigger a Z-score alert in any single week, but over 15 weeks it has lost roughly 30% of its conversion signal.
We fit a linear regression to the trailing 30 days of daily key event counts and extract the slope as a percentage change per week. A slope below -15% per week is flagged as a deterioration signal. A slope below -5% per week is recorded as a warning. We include the R-squared value in the output so the audit result communicates how confident the trend detection is: a slope of -20% with an R-squared of 0.85 is a strong signal; the same slope with an R-squared of 0.20 is likely noise.
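A sketch of the slope and R-squared computation with numpy, assuming roughly 30 daily key event counts as input.

```python
import numpy as np


def keyevent_trend(daily_counts: list[float]) -> dict:
    """Linear trend over ~30 days of daily key event counts, expressed as a
    percentage change per week, with R-squared as a confidence measure."""
    y = np.asarray(daily_counts, dtype=float)
    x = np.arange(len(y))

    slope, intercept = np.polyfit(x, y, 1)
    fitted = slope * x + intercept
    ss_res = float(np.sum((y - fitted) ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r_squared = 1 - ss_res / ss_tot if ss_tot > 0 else 0.0

    # Convert the per-day slope into a weekly percentage of mean daily volume.
    slope_pct_per_week = (slope * 7) / y.mean() * 100 if y.mean() > 0 else 0.0

    if slope_pct_per_week < -15:
        status = "deteriorating"
    elif slope_pct_per_week < -5:
        status = "warning"
    else:
        status = "stable"
    return {
        "slope_pct_per_week": round(float(slope_pct_per_week), 1),
        "r_squared": round(r_squared, 2),
        "status": status,
    }
```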
AI Traffic Attribution Gap Detection
A newer problem that has emerged as conversational AI tools have grown: traffic from ChatGPT, Perplexity, Claude, and Copilot is arriving at sites but not being attributed correctly in GA4.
Some of this traffic arrives with a referrer from the AI platform's domain and is correctly attributed to referral in GA4. But a significant share arrives with no referrer (because the AI tool opens links in a new context that strips referrer headers) and is attributed to direct traffic in GA4. This is not a tracking failure but a platform limitation, and it means direct traffic figures for many sites are now systematically overstated.
We detect this by querying the referrer dimension for known AI platform domains (chat.openai.com, perplexity.ai, claude.ai, copilot.microsoft.com, and others) and computing the share of sessions arriving from these sources. We then cross-reference this with the direct traffic percentage. A site where AI referral traffic accounts for more than 3% of sessions should be interpreting its direct traffic figure with caution; we flag this with the actual AI referral count as evidence.
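A sketch of the share computation, operating on a source-to-sessions mapping assumed to come from a Data API report; the domain list is the one above and deliberately non-exhaustive.

```python
AI_REFERRER_DOMAINS = (
    "chat.openai.com", "perplexity.ai", "claude.ai", "copilot.microsoft.com",
)  # non-exhaustive; the production list is maintained separately
AI_SHARE_THRESHOLD = 0.03


def ai_referral_share(sessions_by_source: dict[str, int]) -> dict:
    """sessions_by_source maps the session source / referrer dimension value to a
    session count, as returned by a Data API report."""
    total = sum(sessions_by_source.values())
    ai_sessions = sum(
        count for source, count in sessions_by_source.items()
        if any(domain in source for domain in AI_REFERRER_DOMAINS)
    )
    share = ai_sessions / total if total else 0.0
    return {
        "ai_sessions": ai_sessions,
        "ai_share": round(share, 4),
        "direct_figure_suspect": share > AI_SHARE_THRESHOLD,
    }
```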
The Async Execution Pipeline
Running three layers of analysis sequentially would make the audit impractically slow. A full audit including Playwright crawling of a complex site can involve 30 to 60 seconds of browser execution time alone. We needed an architecture that parallelised safely.
The pipeline runs on Cloud Run using Cloud Tasks for the orchestration layer. When an audit is triggered, the API creates an audit record in the database with status pending, then dispatches a Cloud Tasks HTTP target to the worker service. The worker picks up the task, creates a per-audit Redis key for idempotency (using Upstash), and begins execution.
Within the worker, the three analysis layers run in a structured concurrency pattern. The Admin API and Data API calls for Layer 1 and Layer 3 can run simultaneously; Playwright execution for Layer 2 runs in parallel with these API calls. The Playwright instance is initialised once per audit run (not per check) and reused across all crawler checks, which eliminates the browser startup overhead from each individual check.
The Redis idempotency key prevents duplicate execution if Cloud Tasks retries the HTTP request due to a transient failure. If the worker receives a task for an audit ID that already has an active Redis lock, it returns 200 (to acknowledge receipt and prevent further retries) without rerunning the audit.
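A sketch of the worker's idempotency and concurrency pattern, assuming redis-py's asyncio client; the coroutine names standing in for the three layers (run_admin_api_checks and so on) and the lock TTL are hypothetical placeholders.

```python
import asyncio

import redis.asyncio as redis

LOCK_TTL_SECONDS = 15 * 60  # assumed: long enough to cover one full audit run


async def run_admin_api_checks(audit_id: str) -> list[dict]:   # placeholder for Layer 1
    return []

async def run_playwright_checks(audit_id: str) -> list[dict]:  # placeholder for Layer 2
    return []

async def run_anomaly_checks(audit_id: str) -> list[dict]:     # placeholder for Layer 3
    return []

async def persist_results(audit_id: str, results: list[dict]) -> None:  # placeholder
    ...


async def handle_audit_task(audit_id: str, redis_url: str) -> int:
    """Worker entry point: returns the HTTP status to hand back to Cloud Tasks."""
    r = redis.from_url(redis_url)

    # SET NX is the idempotency guard: only the first delivery for this audit ID
    # acquires the lock; Cloud Tasks retries see the existing key and are
    # acknowledged with 200 so they stop retrying.
    acquired = await r.set(f"audit:{audit_id}:lock", "1", nx=True, ex=LOCK_TTL_SECONDS)
    if not acquired:
        return 200

    # Layers 1 and 3 are API-bound, Layer 2 is browser-bound; they read
    # independent sources, so all three run concurrently.
    layer1, layer2, layer3 = await asyncio.gather(
        run_admin_api_checks(audit_id),
        run_playwright_checks(audit_id),
        run_anomaly_checks(audit_id),
    )
    await persist_results(audit_id, layer1 + layer2 + layer3)
    # The lock is left to expire via its TTL rather than deleted, so a late
    # duplicate delivery after completion is still absorbed.
    return 200
```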
Check results are written to the database incrementally as each check completes rather than in a single batch at the end. This means the audit dashboard can display partial results in real time as the audit progresses, even for complex properties where the full run takes several minutes.
The Scoring Engine
249 checks producing binary pass/fail results would be unusable. A property failing 60 out of 249 checks tells an analyst nothing about priority. The scoring engine translates check results into a single actionable score and a ranked remediation list.
Each check is assigned a severity level (critical, high, medium, low, informational) and a category weight reflecting its impact on the five functional areas: Property Configuration, Tag and Consent, UTM and Campaign Integrity, Data Quality, and E-commerce Integrity.
Critical failures (consent mode not implemented, DDA configured without sufficient conversion volume, duplicate transaction IDs detected) receive a multiplier that can suppress the overall score significantly even if the majority of checks pass. This reflects the reality that one critical consent failure is more commercially damaging than 30 low-severity informational flags.
The scoring output includes three tiers: the overall property health score (0 to 100), a per-module score for each of the five audit categories, and a ranked next-steps list that orders remediation actions by their estimated score impact if resolved. An analyst receiving an audit report sees not just what is wrong but what to fix first to achieve the greatest improvement.
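A sketch of the severity-weighted score with critical suppression; the weights and the per-critical multiplier are illustrative values, not the production ones.

```python
SEVERITY_WEIGHTS = {"critical": 10, "high": 5, "medium": 2, "low": 1, "informational": 0}
CRITICAL_PENALTY = 0.85  # illustrative multiplier applied once per critical failure


def property_health_score(results: list[dict]) -> float:
    """results items look like {"passed": bool, "severity": str}; returns 0-100."""
    weights = [SEVERITY_WEIGHTS[r["severity"]] for r in results]
    earned = sum(w for r, w in zip(results, weights) if r["passed"])
    possible = sum(weights) or 1
    score = 100 * earned / possible

    # Each critical failure suppresses the score multiplicatively, so one consent
    # or duplicate-transaction failure outweighs many low-severity passes.
    critical_failures = sum(
        1 for r in results if not r["passed"] and r["severity"] == "critical"
    )
    return round(score * (CRITICAL_PENALTY ** critical_failures), 1)
```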
What Single-Layer Tools Consistently Miss
After running audits across a wide range of GA4 properties, we have observed some consistent failure patterns that single-layer tools reliably miss:
Consent mode configuration that passes on desktop but fails on mobile. A CMP that loads its consent UI synchronously on desktop may load it asynchronously on mobile due to a different code path. The Playwright layer detects this because we can set device emulation and viewport dimensions per crawl request.
Attribution model settings that conflict with the traffic profile. A property with 80% direct and branded search traffic configured with a 7-day lookback window is systematically over-crediting last-touch because most of its conversions happen within a single session. The Layer 1 and Layer 3 combination surfaces this by checking the configured attribution model against the actual channel distribution.
Gradual key event decay that starts after a site migration. A new CMS deployment changes page URLs slightly. The GTM trigger fires on the old URL pattern. Key events start declining at 3% per week. By week 10 the property has lost 30% of its conversion signal. No threshold-based check catches this; the trend slope detection does.
Payment gateway self-referral in the channel report. Stripe, PayPal, and Klarna redirect users through their domains during checkout. Without referral exclusions configured in GA4, these domains appear as referral sources in the channel report and inflate the referral channel's conversion count while deflating the actual acquisition channel that originally brought the user. We detect the presence of these domains in the referral report and cross-reference them against the referral exclusion list in the Admin API.
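A sketch of that cross-reference, assuming the referral sources and the exclusion list have already been fetched from the Data API and Admin API respectively; the gateway domain list is a subset.

```python
PAYMENT_GATEWAY_DOMAINS = ("stripe.com", "paypal.com", "klarna.com")  # illustrative subset


def self_referral_check(referral_sources: list[str], exclusion_list: list[str]) -> dict:
    """Cross-reference gateway domains seen in the referral report against the
    property's referral exclusion list read from the Admin API."""
    seen_gateways = {
        source for source in referral_sources
        if any(domain in source for domain in PAYMENT_GATEWAY_DOMAINS)
    }
    unexcluded = sorted(
        source for source in seen_gateways
        if not any(excluded in source for excluded in exclusion_list)
    )
    return {"passed": not unexcluded, "unexcluded_gateway_referrers": unexcluded}
```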
What We Learned Building It
A few things that were not obvious at the start:
The GA4 Admin API quota is easier to exhaust than the documentation suggests. The documented quotas suggest generous headroom for property-level reads. In practice, running 42 Admin API checks across a property with multiple data streams and linked products exhausts quota faster than the per-request count suggests, because some Admin API endpoints count child resources as separate quota units. We built a request coalescing layer that batches Admin API calls and caches responses within a single audit run to avoid hitting quota limits mid-audit.
Playwright's network interception is sensitive to page load timing. Checking consent signal timing requires intercepting events in a specific order. If the page load is very fast (under 200ms to first meaningful paint), some events arrive within a single JavaScript microtask queue flush and are difficult to distinguish temporally. We added synthetic delay injection in the Playwright context to create measurable gaps between page load phases, which makes the timing analysis more reliable on high-performance sites.
Statistical anomaly detection produces false positives on very small properties. A Z-score approach requires enough historical data to produce a meaningful baseline. Properties with fewer than 200 sessions per week do not have sufficient statistical power for the Z-score and CV checks to be reliable. We suppress these checks for low-traffic properties and replace them with simpler threshold checks, with a note in the audit output explaining why.
The most important check is usually not the one with the worst score. Early in development, we sorted the audit output purely by check severity. Users consistently focused on fixing the top item in the list even when a different issue lower down was causing far more commercial harm. We reworked the scoring and presentation to lead with estimated business impact (revenue at risk, conversion signal quality) rather than raw severity, which produced better outcomes in practice.
The Output: From Raw Checks to Analyst-Ready Evidence
Raw check results are not useful to the people who receive them. A client stakeholder does not need to know that their Z-score for session count is -2.7. They need to know that their traffic pattern shows a statistically significant decline that is not explained by seasonality, what the likely causes are, and what to check first.
The report engine transforms check results into three output formats: a PDF report designed for client handoff, a PowerPoint deck built for presentation, and an Excel workbook with the full check detail for analysts who want to interrogate the raw data.
Each finding in the PDF and PPTX output includes: the specific data point that triggered the check (not a generic description), the severity and business impact classification, and a concrete next step with an estimated remediation effort. The goal is that a client receiving the PDF can act on it without needing a follow-up call to understand what it means.
The scoring summary page groups findings by business impact category (Smart Bidding signal quality, consent compliance, revenue accuracy, attribution integrity) rather than by technical module, because business stakeholders allocate remediation resources based on business impact, not on which API was queried to surface the issue.
Where We Are Going
The three-layer architecture is the foundation, but there are several areas we are actively developing.
Regression-aware scheduled audits will compare each scheduled audit run against the previous run and surface new failures that appeared between runs. The goal is to catch regressions within 24 hours of a deployment rather than during the next manual review cycle.
Cross-property analysis will identify patterns across multiple properties in an agency or enterprise account. If 8 out of 10 properties show the same consent mode failure pattern, that is almost certainly a shared GTM container or CMP template issue, and the fix is at the template level, not per-property.
BigQuery integration will enable audits on exported raw event data, which gives access to session-level and user-level patterns that the aggregated Data API cannot expose. Duplicate session detection, event parameter distribution analysis, and bot traffic fingerprinting are all significantly more powerful on raw export data.
Run a full three-layer audit on your property
249 proprietary checks. API analysis, live Playwright execution, and AI anomaly detection. Free to start, results in under 10 minutes.
Start Free Audit