How does GA4 handle bot traffic?
GA4 has two defensive layers. Layer 1 — automatic bot detection uses the IAB/ABC International Spiders & Bots List to identify known bot user-agents. Always on, can't be disabled. Catches Googlebot, Bingbot, AdsBot, archive crawlers, and roughly 1,500 other recognised bots. Layer 2 — IP filtering is the manual layer where you exclude specific IP ranges (your office, datacentres known for testing tools, etc.).
Together they catch most bot traffic. The 2026 gap: AI bot crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) aren't on the standard bot list and need either explicit IP filtering or robots.txt + server-level blocking. Properties with significant API or scraping traffic also need a custom user-agent filter at the GTM level.
Layer 1 — automatic bot detection
GA4 automatically excludes known bots based on the IAB/ABC International Spiders & Bots List. The mechanic:
- Each event's user-agent is checked against the list at GA4's collection layer
- Matching user-agents have their events filtered before they enter your reports
- The exclusion is irreversible from your side — you can't see what was filtered out
- The list updates regularly as new bots are added
What gets caught:
- Search engine crawlers (Googlebot, Bingbot, Yandex, Baidu)
- SEO tool crawlers (AhrefsBot, SemrushBot, Moz's rogerbot)
- Archive crawlers (Wayback Machine, Common Crawl)
- Monitoring crawlers (Pingdom, UptimeRobot)
- Approximately 1,500 other recognised bot user-agents
What doesn't get caught:
- Bots with custom or spoofed user-agents
- AI bot crawlers added recently (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) — coverage of these on the IAB/ABC list is inconsistent in 2026
- Scrapers running headless browsers with real Chrome user-agents
- Sophisticated commercial bot networks
- Server-side scripts that don't identify as bots
The automatic layer is the cheap, broad defence. It catches obvious bots without configuration. It doesn't catch sophisticated or new bots.
Layer 2 — IP filtering (manual)
Configure in Admin → Data Streams → your stream → Configure tag settings → Show all → Define internal traffic, then activate the matching rule under Admin → Data Filters (a filter left in Testing state excludes nothing). Data Filters also cover broader patterns such as developer traffic.
Common IP-based filters:
- Office IPs (catches employee browsing) — covered in *Internal Traffic Filters*
- Datacentre ranges (catches automated testing, monitoring tools, some scrapers)
- Known bot infrastructure (specific datacentre IPs of bot networks you've identified)
- Test environments (your staging/dev IPs sending real GA4 hits accidentally)
This layer requires active maintenance. You discover problematic IPs through analysis (sessions per user > 5, abnormally high event counts from one IP), add them to the filter, monitor for new patterns.
The 2026 AI bot problem
AI training and citation crawlers became significant traffic sources in 2024-2026. Most declare themselves in their user-agent strings, but the IAB/ABC list's coverage of them is inconsistent, so GA4's automatic layer can't be relied on to catch them:
- GPTBot (OpenAI training)
- OAI-SearchBot (OpenAI ChatGPT Search citations)
- ClaudeBot (Anthropic training)
- PerplexityBot (Perplexity citations)
- Google-Extended (Gemini training)
- CCBot (Common Crawl, used by many AI platforms)
- YouBot (You.com)
- Bytespider (TikTok/ByteDance)
Some of these honour robots.txt; some don't. Some appear in the IAB/ABC list; some don't or have inconsistent coverage. The result: AI bot traffic can pollute GA4 reports if not actively defended against.
The defensive options:
Option A — robots.txt declaration
For AI bots that honour robots.txt:
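A minimal sketch, assuming you block the training crawlers from the list above and leave citation crawlers alone (the tokens shown are the vendors' documented user-agents, but check each vendor's current documentation and adjust to your own policy):

```
# Illustrative robots.txt entries: block training crawlers, leave citation crawlers unblocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```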
This blocks those bots from crawling. They never reach your site, so they never trigger GA4 events. It's the simplest defence.
The trade-off: blocking citation crawlers (PerplexityBot, OAI-SearchBot) means you can't be cited in those AI engines. Most properties want to allow citation crawlers and block training crawlers — see the related post on AI bot crawlers.
Option B — server-level user-agent blocking
For bots that ignore robots.txt:
- CDN rules (Cloudflare, Akamai, AWS WAF) blocking specific user-agents
- Server config (nginx, Apache) returning 403 for matching user-agents
This works regardless of robots.txt compliance. More technical to set up; more reliable.
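As an illustration, a minimal nginx sketch; the user-agent list is an assumption to adapt to your own policy, and the map block belongs in the http context:

```nginx
# In the http context: flag requests whose user-agent matches a bot pattern (illustrative list)
map $http_user_agent $blocked_ai_bot {
    default        0;
    ~*gptbot       1;
    ~*claudebot    1;
    ~*bytespider   1;
    ~*ccbot        1;
}

server {
    # ... existing server configuration ...

    # Return 403 for flagged user-agents; keep this aligned with your robots.txt policy
    if ($blocked_ai_bot) {
        return 403;
    }
}
```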
Option C — GTM-level event blocking
Block at the analytics level rather than the request level:
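A sketch of a GTM Custom JavaScript Variable that returns true for known bot user-agents; the pattern list is illustrative, not exhaustive:

```javascript
// GTM Custom JavaScript Variable: returns true when the browser's user-agent
// matches a bot pattern. Extend the regex with the bots you actually see.
function() {
  var ua = navigator.userAgent || '';
  return /GPTBot|ClaudeBot|PerplexityBot|Bytespider|CCBot|HeadlessChrome/i.test(ua);
}
```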
Use this Variable as a Tag-firing condition. The bot still hits your site (server still serves content), but GA4 doesn't record the event.
This option preserves server logs (useful for debugging) while excluding bots from analytics. Less efficient than blocking at the server level but doesn't risk blocking legitimate users.
The four-layer defence pattern
For properties with significant bot exposure (B2B SaaS, popular blogs, content sites with scraper interest), combine all four layers:
Layer 1 — Automatic bot detection (always on, no config)
Layer 2 — IP filtering (manual list of known bad IPs and your own offices)
Layer 3 — robots.txt declarations (for compliant bots)
Layer 4 — Server-level user-agent blocking (for non-compliant bots and AI crawlers)
This catches most bot traffic. Sophisticated headless-Chrome scrapers still slip through; for those, behavioural analysis (sessions per user, engagement time anomalies) is the only practical defence — and that's diagnostic, not preventative.
Detecting bot traffic that slipped through
Even with all four layers, some bots arrive. The diagnostic patterns:
Pattern 1 — Sessions per user anomaly
A real user has 1-3 sessions per day. A bot can have 50-500.
Users with 10+ sessions in a day are almost always bots. Investigate, add their IPs to filters.
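With the BigQuery export enabled, a minimal sketch of this check; the table path is hypothetical, so swap in your own project, dataset, and date:

```sql
-- Sessions per user on one day; 10+ sessions in a single day is a strong bot signal
SELECT
  user_pseudo_id,
  COUNT(DISTINCT (SELECT value.int_value
                  FROM UNNEST(event_params)
                  WHERE key = 'ga_session_id')) AS sessions
FROM `your-project.analytics_123456.events_20260101`  -- hypothetical table path
GROUP BY user_pseudo_id
HAVING sessions >= 10
ORDER BY sessions DESC;
```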
Pattern 2 — Geographic concentration anomaly
A B2B SaaS suddenly seeing 30% of traffic from an unexpected country. Often bot/scraper activity from a specific datacentre range.
GA4 → Reports → Demographics → Country, sorted by sessions. Look for outliers in the top 10 that don't match your business reality.
Pattern 3 — Engagement time near zero
Bots typically don't generate engagement_time_msec values, or generate near-zero values. A spike in low-engagement sessions correlates with bot traffic.
Filter Explorations to engagement_time_msec < 500 (half a second) and segment by source. The patterns are usually obvious.
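The same check against the BigQuery export, as a sketch; the table path is hypothetical, and note that traffic_source here is the user-scoped first-touch source rather than the session source:

```sql
-- Events with missing or near-zero engagement_time_msec, grouped by first-touch source
SELECT
  traffic_source.source AS source,
  COUNTIF(COALESCE((SELECT value.int_value
                    FROM UNNEST(event_params)
                    WHERE key = 'engagement_time_msec'), 0) < 500) AS low_engagement_events,
  COUNT(*) AS total_events
FROM `your-project.analytics_123456.events_20260101`  -- hypothetical table path
GROUP BY source
ORDER BY low_engagement_events DESC;
```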
FAQ: IP Filtering vs Bot Detection: GA4's Two Defensive Layers
How closely should my numbers from different sources agree before I worry?
What should I validate first when the numbers disagree?
When is a discrepancy a tracking bug instead of a reporting difference?
Related guides for IP Filtering vs Bot Detection: GA4's Two Defensive Layers
BigQuery Cost Optimisation for GA4 Exports: 9 SQL Patterns (2026)
The biggest cost wins come from nine SQL patterns: (1) partition pruning via _TABLE_SUFFIX BETWEEN (10–50x cost difference vs derived filters), (2) clustering on source/medium/event_name (30–60% reduction on top of partitioning), (3) explicit column selection (never SELECT *)…
How to Stitch GA4 BigQuery Sessions Manually (2026)
GA4 doesn't store sessions as records in BigQuery exports — only individual events with session identifiers. To reconstruct sessions: join on user_pseudo_id + (SELECT value.int_value FROM UNNEST(event_params) WHERE key='ga_session_id') as the unique session key…
Run a GA4 audit before unfiltered bot traffic spreads into reporting decisions
Use GA4 Audits to surface implementation gaps, broken signals, and the next fixes to prioritize before the issue becomes harder to trust or explain.