How does GA4 handle bot traffic?
GA4 has two defensive layers. Layer 1 — automatic bot detection uses the IAB/ABC International Spiders & Bots List to identify known bot user-agents. Always on, can't be disabled. Catches Googlebot, Bingbot, AdsBot, archive crawlers, and roughly 1,500 other recognised bots. Layer 2 — IP filtering is the manual layer where you exclude specific IP ranges (your office, datacentres known for testing tools, etc.).
Together they catch most bot traffic. The 2026 gap: AI bot crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) aren't on the standard bot list and need either explicit IP filtering or robots.txt + server-level blocking. Properties with significant API or scraping traffic also need a custom user-agent filter at the GTM level.
Layer 1 — automatic bot detection
GA4 automatically excludes known bots based on the IAB/ABC International Spiders & Bots List. The mechanic:
- Each event's user-agent is checked against the list at GA4's collection layer
- Matching user-agents have their events filtered before they enter your reports
- The exclusion is irreversible from your side — you can't see what was filtered out
- The list updates regularly as new bots are added
What gets caught:
- Search engine crawlers (Googlebot, Bingbot, Yandex, Baidu)
- SEO tool crawlers (AhrefsBot, SemrushBot, Moz's rogerbot)
- Archive crawlers (Wayback Machine, Common Crawl)
- Monitoring crawlers (Pingdom, UptimeRobot)
- Approximately 1,500 other recognised bot user-agents
What doesn't get caught:
- Bots with custom or spoofed user-agents
- AI bot crawlers added recently (GPTBot, ClaudeBot, PerplexityBot, Google-Extended) — coverage of these on the IAB/ABC list is inconsistent in 2026
- Scrapers running headless browsers with real Chrome user-agents
- Sophisticated commercial bot networks
- Server-side scripts that don't identify as bots
The automatic layer is the cheap, broad defence. It catches obvious bots without configuration. It doesn't catch sophisticated or new bots.
Layer 2 — IP filtering (manual)
Configure in Admin → Data Streams → your stream → Configure tag settings → Show all → Define internal traffic, then activate the matching rule under Admin → Data Filters (a filter left in Testing state excludes nothing). Data Filters also cover broader patterns such as developer traffic.
Common IP-based filters:
- Office IPs (catches employee browsing) — covered in *Internal Traffic Filters*
- Datacentre ranges (catches automated testing, monitoring tools, some scrapers)
- Known bot infrastructure (specific datacentre IPs of bot networks you've identified)
- Test environments (your staging/dev IPs sending real GA4 hits accidentally)
This layer requires active maintenance. You discover problematic IPs through analysis (sessions per user > 5, abnormally high event counts from one IP), add them to the filter, monitor for new patterns.
The 2026 AI bot problem
AI training and citation crawlers became significant traffic sources in 2024-2026. Most declare themselves in their user-agent strings, but the IAB/ABC list's coverage of them is inconsistent, so GA4's automatic layer can't be relied on to catch them:
- GPTBot (OpenAI training)
- OAI-SearchBot (OpenAI ChatGPT Search citations)
- ClaudeBot (Anthropic training)
- PerplexityBot (Perplexity citations)
- Google-Extended (Gemini training)
- CCBot (Common Crawl, used by many AI platforms)
- YouBot (You.com)
- Bytespider (TikTok/ByteDance)
Some of these honour robots.txt; some don't. Some appear in the IAB/ABC list; some don't or have inconsistent coverage. The result: AI bot traffic can pollute GA4 reports if not actively defended against.
The defensive options:
Option A — robots.txt declaration
For AI bots that honour robots.txt:
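A minimal sketch, assuming you block the training crawlers from the list above and leave citation crawlers alone (the tokens shown are the vendors' documented user-agents, but check each vendor's current documentation and adjust to your own policy):

```
# Illustrative robots.txt entries: block training crawlers, leave citation crawlers unblocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```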
This blocks those bots from crawling. They never reach your site, so they never trigger GA4 events. It's the simplest defence.
The trade-off: blocking citation crawlers (PerplexityBot, OAI-SearchBot) means you can't be cited in those AI engines. Most properties want to allow citation crawlers and block training crawlers — see the related post on AI bot crawlers.
Option B — server-level user-agent blocking
For bots that ignore robots.txt:
- CDN rules (Cloudflare, Akamai, AWS WAF) blocking specific user-agents
- Server config (nginx, Apache) returning 403 for matching user-agents
This works regardless of robots.txt compliance. More technical to set up; more reliable.
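As an illustration, a minimal nginx sketch; the user-agent list is an assumption to adapt to your own policy, and the map block belongs in the http context:

```nginx
# In the http context: flag requests whose user-agent matches a bot pattern (illustrative list)
map $http_user_agent $blocked_ai_bot {
    default        0;
    ~*gptbot       1;
    ~*claudebot    1;
    ~*bytespider   1;
    ~*ccbot        1;
}

server {
    # ... existing server configuration ...

    # Return 403 for flagged user-agents; keep this aligned with your robots.txt policy
    if ($blocked_ai_bot) {
        return 403;
    }
}
```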
Option C — GTM-level event blocking
Block at the analytics level rather than the request level:
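A sketch of a GTM Custom JavaScript Variable that returns true for known bot user-agents; the pattern list is illustrative, not exhaustive:

```javascript
// GTM Custom JavaScript Variable: returns true when the browser's user-agent
// matches a bot pattern. Extend the regex with the bots you actually see.
function() {
  var ua = navigator.userAgent || '';
  return /GPTBot|ClaudeBot|PerplexityBot|Bytespider|CCBot|HeadlessChrome/i.test(ua);
}
```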
Use this Variable as a Tag-firing condition. The bot still hits your site (server still serves content), but GA4 doesn't record the event.
This option preserves server logs (useful for debugging) while excluding bots from analytics. Less efficient than blocking at the server level but doesn't risk blocking legitimate users.
The four-layer defence pattern
For properties with significant bot exposure (B2B SaaS, popular blogs, content sites with scraper interest), combine all four layers:
Layer 1 — Automatic bot detection (always on, no config)
Layer 2 — IP filtering (manual list of known bad IPs and your own offices)
Layer 3 — robots.txt declarations (for compliant bots)
Layer 4 — Server-level user-agent blocking (for non-compliant bots and AI crawlers)
This catches most bot traffic. Sophisticated headless-Chrome scrapers still slip through; for those, behavioural analysis (sessions per user, engagement time anomalies) is the only practical defence — and that's diagnostic, not preventative.
Detecting bot traffic that slipped through
Even with all four layers, some bots arrive. The diagnostic patterns:
Pattern 1 — Sessions per user anomaly
A real user has 1-3 sessions per day. A bot can have 50-500.
Users with 10+ sessions in a day are almost always bots. Investigate, add their IPs to filters.
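With the BigQuery export enabled, a minimal sketch of this check; the table path is hypothetical, so swap in your own project, dataset, and date:

```sql
-- Sessions per user on one day; 10+ sessions in a single day is a strong bot signal
SELECT
  user_pseudo_id,
  COUNT(DISTINCT (SELECT value.int_value
                  FROM UNNEST(event_params)
                  WHERE key = 'ga_session_id')) AS sessions
FROM `your-project.analytics_123456.events_20260101`  -- hypothetical table path
GROUP BY user_pseudo_id
HAVING sessions >= 10
ORDER BY sessions DESC;
```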
Pattern 2 — Geographic concentration anomaly
A B2B SaaS suddenly seeing 30% of traffic from an unexpected country. Often bot/scraper activity from a specific datacentre range.
GA4 → Reports → Demographics → Country, sorted by sessions. Look for outliers in the top 10 that don't match your business reality.
Pattern 3 — Engagement time near zero
Bots typically don't generate engagement_time_msec values, or generate near-zero values. A spike in low-engagement sessions correlates with bot traffic.
Filter Explorations to engagement_time_msec < 500 (half a second) and segment by source. The patterns are usually obvious.
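The same check against the BigQuery export, as a sketch; the table path is hypothetical, and note that traffic_source here is the user-scoped first-touch source rather than the session source:

```sql
-- Events with missing or near-zero engagement_time_msec, grouped by first-touch source
SELECT
  traffic_source.source AS source,
  COUNTIF(COALESCE((SELECT value.int_value
                    FROM UNNEST(event_params)
                    WHERE key = 'engagement_time_msec'), 0) < 500) AS low_engagement_events,
  COUNT(*) AS total_events
FROM `your-project.analytics_123456.events_20260101`  -- hypothetical table path
GROUP BY source
ORDER BY low_engagement_events DESC;
```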
FAQ: IP Filtering vs Bot Detection: GA4's Two Defensive Layers
How closely should my numbers from different sources agree before I worry?
What should I validate first when the numbers disagree?
When is a discrepancy a tracking bug instead of a reporting difference?
Related guides for IP Filtering vs Bot Detection: GA4's Two Defensive Layers
BigQuery Cost Optimisation for GA4 Exports: 9 SQL Patterns (2026)
The biggest cost wins come from nine SQL patterns: (1) partition pruning via _TABLE_SUFFIX BETWEEN (10–50x cost difference vs derived filters), (2) clustering on source/medium/event_name (30–60% reduction on top of partitioning), (3) explicit column selection (never SELECT *)…
How to Stitch GA4 BigQuery Sessions Manually (2026)
GA4 doesn't store sessions as records in BigQuery exports — only individual events with session identifiers. To reconstruct sessions: join on user_pseudo_id + (SELECT value.int_value FROM UNNEST(event_params) WHERE key='ga_session_id') as the unique session key…
Run a GA4 audit before unfiltered bot traffic spreads into reporting decisions
Use GA4 Audits to surface implementation gaps, broken signals, and the next fixes to prioritize before the issue becomes harder to trust or explain.