How do I reconstruct GA4 sessions in BigQuery?
GA4 doesn't store sessions as records in its BigQuery export — only individual events carrying session identifiers. To reconstruct sessions, group on the composite key of `user_pseudo_id` plus `(SELECT value.int_value FROM UNNEST(event_params) WHERE key='ga_session_id')`, then aggregate: MIN(event_timestamp) for session start, MAX(event_timestamp) for session end, COUNTIF(event_name='page_view') for pageviews, and a SUM over the extracted `engagement_time_msec` parameter for engagement time.
The session start traffic source comes from traffic_source.source/medium/name at the user's first event in that session.
The full SQL pattern is below.
Why GA4 doesn't store sessions
In Universal Analytics, sessions were a first-class entity — every session had its own row with start/end times and aggregate metrics. GA4 fundamentally changed this: everything is an event. Sessions exist only as a tag (ga_session_id) attached to events.
The implication: if you want session-level metrics in BigQuery, you build them yourself. The standard reports in the GA4 UI do this aggregation behind the scenes; BigQuery exports give you raw events and expect you to do the work.
The benefit: full control. You can define sessions however your business actually thinks about them — not just GA4's default 30-minute timeout.
The session reconstruction pattern
Here's the canonical SQL pattern for reconstructing sessions from raw GA4 BigQuery events:
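A sketch of that pattern — the dataset name `project.analytics_XXXXXX`, the date range, and `purchase` as the conversion event are placeholder assumptions to adjust for your export:

```sql
-- Sketch: one row per session from raw GA4 export events.
WITH events AS (
  SELECT
    user_pseudo_id,
    (SELECT value.int_value FROM UNNEST(event_params)
     WHERE key = 'ga_session_id') AS ga_session_id,
    event_name,
    event_timestamp,
    (SELECT value.int_value FROM UNNEST(event_params)
     WHERE key = 'engagement_time_msec') AS engagement_time_msec,
    traffic_source.source AS source,
    traffic_source.medium AS medium,
    traffic_source.name AS campaign
  FROM `project.analytics_XXXXXX.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260131'
)
SELECT
  user_pseudo_id,
  ga_session_id,
  MIN(event_timestamp) AS session_start,
  MAX(event_timestamp) AS session_end,
  -- event_timestamp is in microseconds
  (MAX(event_timestamp) - MIN(event_timestamp)) / 1e6 AS duration_seconds,
  COUNTIF(event_name = 'page_view') AS pageviews,
  SUM(engagement_time_msec) / 1000 AS engagement_seconds,
  ANY_VALUE(source) AS source,
  ANY_VALUE(medium) AS medium,
  ANY_VALUE(campaign) AS campaign,
  CASE
    WHEN SUM(engagement_time_msec) >= 10000
      OR COUNTIF(event_name IN ('page_view', 'screen_view')) >= 2
      OR COUNTIF(event_name = 'purchase') >= 1  -- swap in your conversion events
    THEN 1 ELSE 0
  END AS is_engaged_session
FROM events
WHERE ga_session_id IS NOT NULL  -- drop events with no session id
GROUP BY user_pseudo_id, ga_session_id
```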
Run this with the appropriate date range in _TABLE_SUFFIX and you have one row per session with everything you need.
Why user_pseudo_id + ga_session_id
Both fields are needed because:
- user_pseudo_id alone doesn't isolate sessions — a single user has many sessions
- ga_session_id alone isn't unique across users — different users can have the same session_id (the integer is just an incrementing counter per user, not globally unique)
The combination is the unique session key. Some implementations also include ga_session_number (the sequential number of the session for that user) for additional context, but it's not needed for uniqueness.
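Where a single string key is convenient for downstream joins, the two fields can be concatenated — a sketch (dataset name is a placeholder):

```sql
-- Composite session key: user_pseudo_id plus the per-user session counter.
-- CAST is needed because ga_session_id is an integer parameter.
SELECT DISTINCT
  CONCAT(
    user_pseudo_id, '-',
    CAST((SELECT value.int_value FROM UNNEST(event_params)
          WHERE key = 'ga_session_id') AS STRING)
  ) AS session_key
FROM `project.analytics_XXXXXX.events_*`
WHERE _TABLE_SUFFIX = '20260115'
```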
For cross-device or logged-in user analysis, also include user_id if your implementation sets it:
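A sketch of the extended identity, assuming user_id is set at login (dataset name is a placeholder):

```sql
-- Prefer user_id when present so the same person on two devices collapses
-- to one identity; fall back to the device-scoped user_pseudo_id otherwise.
SELECT
  COALESCE(user_id, user_pseudo_id) AS unified_user_id,
  user_pseudo_id,
  (SELECT value.int_value FROM UNNEST(event_params)
   WHERE key = 'ga_session_id') AS ga_session_id
FROM `project.analytics_XXXXXX.events_*`
WHERE _TABLE_SUFFIX = '20260115'
```

Keep user_pseudo_id + ga_session_id as the session key even here — the session counter is per device, so unified_user_id alone can collide across devices.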
This unifies the user across devices when login is captured.
Engagement time — the GA4 UI match
The engagement time calculation needs care to match GA4's UI numbers:
engagement_time_msec is sent as an event parameter on most events. Sum across all events in the session, divide by 1000 for seconds. This matches GA4's "Average engagement time per session" calculation in standard reports.
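The per-session calculation looks like this (a sketch; dataset name and date are placeholders):

```sql
-- Sum the engagement_time_msec parameter per session, in seconds.
WITH events AS (
  SELECT
    user_pseudo_id,
    (SELECT value.int_value FROM UNNEST(event_params)
     WHERE key = 'ga_session_id') AS ga_session_id,
    (SELECT value.int_value FROM UNNEST(event_params)
     WHERE key = 'engagement_time_msec') AS engagement_time_msec
  FROM `project.analytics_XXXXXX.events_*`
  WHERE _TABLE_SUFFIX = '20260115'
)
SELECT
  user_pseudo_id,
  ga_session_id,
  SUM(engagement_time_msec) / 1000 AS engagement_seconds
FROM events
WHERE ga_session_id IS NOT NULL
GROUP BY user_pseudo_id, ga_session_id
```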
A common gotcha: some implementations don't capture engagement_time_msec on every event. Verify by checking event_params for missing values — if more than 5% of events lack engagement_time_msec, your engagement metrics will under-report vs the GA4 UI.
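One way to run that check — the share of events carrying the parameter (dataset name and date are placeholders):

```sql
-- Coverage check: events with engagement_time_msec vs total.
WITH flagged AS (
  SELECT
    (SELECT value.int_value FROM UNNEST(event_params)
     WHERE key = 'engagement_time_msec') IS NOT NULL AS has_et
  FROM `project.analytics_XXXXXX.events_*`
  WHERE _TABLE_SUFFIX = '20260115'
)
SELECT
  COUNT(*) AS total_events,
  COUNTIF(has_et) AS events_with_et,
  ROUND(100 * COUNTIF(has_et) / COUNT(*), 1) AS pct_with_et
FROM flagged
```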
The 'engaged session' definition
GA4 marks a session as "engaged" if any of these are true:
- Engagement time >= 10 seconds, OR
- 2+ pageviews/screen_views, OR
- 1+ conversion event
The CASE statement in the SQL above replicates this exactly. It's how you reproduce GA4's "Engaged sessions" metric in your custom reporting.
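Pulled out on its own, the flag can be sketched as a fragment over per-session aggregates — the column and table names here (engagement_seconds, pageviews_and_screenviews, conversions, session_aggregates) are hypothetical and stand in for whatever your reconstruction query produces:

```sql
-- Engaged-session flag over pre-computed per-session aggregates.
SELECT
  user_pseudo_id,
  ga_session_id,
  CASE
    WHEN engagement_seconds >= 10
      OR pageviews_and_screenviews >= 2
      OR conversions >= 1
    THEN 1 ELSE 0
  END AS is_engaged_session
FROM session_aggregates
```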
Session start traffic source
Traffic source data lives on multiple events but the session-start source is what you typically want. Two approaches:
Approach 1: Use traffic_source on session_start event
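A sketch of this approach — pull the traffic_source fields from each session's session_start event (dataset name and date are placeholders):

```sql
-- One row per session, sourced from the session_start event only.
SELECT
  user_pseudo_id,
  (SELECT value.int_value FROM UNNEST(event_params)
   WHERE key = 'ga_session_id') AS ga_session_id,
  traffic_source.source AS source,
  traffic_source.medium AS medium,
  traffic_source.name AS campaign
FROM `project.analytics_XXXXXX.events_*`
WHERE event_name = 'session_start'
  AND _TABLE_SUFFIX = '20260115'
```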
Approach 2: Use ANY_VALUE() in aggregation
The pattern above uses ANY_VALUE(source), which works because the traffic_source record holds the user's first-touch acquisition source — it is identical on every event for that user, and therefore trivially consistent within a session. ANY_VALUE() is also cheaper than an explicit join back to the session_start event. Note that because traffic_source is user-scoped first-touch, neither approach gives true session-scoped attribution; for that, use the session_traffic_source_last_click record added to the export in 2024.
For most reporting, ANY_VALUE() is sufficient and more efficient.
Custom session definitions
GA4's default session timeout is 30 minutes of inactivity. To analyse with a different timeout (e.g., 15 minutes for short-cycle e-commerce, 4 hours for long-form content), you build the sessionisation in SQL:
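A sketch of the sessionisation, assuming a 15-minute timeout (dataset name and date are placeholders):

```sql
-- Custom sessionisation with a 15-minute (900 s) inactivity timeout.
-- event_timestamp is in microseconds, hence the 1,000,000 multiplier.
WITH gaps AS (
  SELECT
    user_pseudo_id,
    event_timestamp,
    event_timestamp - LAG(event_timestamp) OVER (
      PARTITION BY user_pseudo_id ORDER BY event_timestamp
    ) AS micros_since_prev
  FROM `project.analytics_XXXXXX.events_*`
  WHERE _TABLE_SUFFIX = '20260115'
)
SELECT
  user_pseudo_id,
  event_timestamp,
  -- Cumulative count of session breaks = custom session number per user.
  SUM(IF(micros_since_prev IS NULL OR micros_since_prev > 900 * 1000000, 1, 0))
    OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp)
    AS custom_session_number
FROM gaps
```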
This sessionises with a 15-minute timeout instead of the default 30. Adjust the 900 threshold (in seconds) for any custom timeout.
Common mistakes
1. Forgetting NULL session_ids. Some events (especially when the GA4 SDK initialises) don't have ga_session_id. Filter IS NOT NULL early or you'll get phantom sessions with NULL keys.
2. Treating ga_session_id as globally unique. It's not. Always join with user_pseudo_id.
3. Using traffic_source on every event. Some events don't have traffic_source populated. Use ANY_VALUE() to handle gaps gracefully.
4. Calculating session duration as MAX-MIN. This works for sessions with multiple events but gives 0 for single-event sessions. Either accept that or set a default minimum (e.g., 1 second for single-event sessions).
5. Not partitioning the output. Session reconstruction tables can grow large. Partition by event_date (or session_start_date) and cluster by source/medium for efficient downstream queries.
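A sketch of the materialisation, assuming the reconstruction query has been saved as a view with a microsecond session_start column — all table and view names here are placeholders:

```sql
-- Partition on the session's start date; cluster on the attribution
-- columns downstream queries filter by most often.
CREATE TABLE `project.reporting.ga4_sessions`
PARTITION BY session_start_date
CLUSTER BY source, medium AS
SELECT
  s.*,
  DATE(TIMESTAMP_MICROS(s.session_start)) AS session_start_date
FROM `project.reporting.sessions_view` AS s
```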