Skip to content
BEE
Backend Engineering Essentials

[BEE-19057] JSONB and Semi-Structured Data in PostgreSQL

INFO

PostgreSQL's jsonb type stores JSON as a parsed binary structure — enabling GIN index-based containment queries and key-existence checks without a separate document database, while keeping ACID guarantees and the ability to join against normalized tables.

Context

Applications frequently encounter data that does not fit a fixed schema: user-defined metadata, configuration objects that vary per tenant, third-party webhook payloads, and sparse attribute sets where most rows have empty values for most columns. The traditional SQL response — add a column per attribute — breaks down when attributes are user-defined or when there are hundreds of possible fields.

PostgreSQL introduced the json type in 9.2 (2012) as a text column with syntax validation. The jsonb type arrived in 9.4 (2014) as a binary-encoded alternative: on write, the JSON is parsed, whitespace stripped, duplicate keys deduplicated (last value wins), and keys sorted. On read, no re-parsing is required. The binary representation is larger than equivalent text but dramatically faster for indexing and operator evaluation.

The practical consequence: jsonb can replace a separate document database for many use cases — user profiles with optional fields, product catalogs with per-category attributes, audit log payloads — while retaining the ability to JOIN against relational tables, enforce foreign keys on extracted fields, and participate in ACID transactions.

The tradeoff is real: JSONB columns are not free. Every write rewrites the entire JSON document, even for a single-key update. Deeply nested documents resist indexing. Fields that are queried frequently in WHERE clauses, used in ORDER BY, or joined on benefit from promotion to proper columns.

Design Thinking

JSONB vs Normalized Columns

The decision is not JSONB vs SQL — it is which fields belong in each:

CharacteristicUse JSONBUse a column
Attribute presenceSparse (most rows empty)Dense (most rows have a value)
Schema ownershipUser-defined, unknown at deploy timeDeveloper-defined, stable
Query patternContainment / key existenceEquality, range, aggregate, JOIN
Indexing needGIN on whole document or subsetB-tree on column
Update patternReplace whole objectUpdate single field

A common pattern: use JSONB for the variable portion (metadata, extra_attributes, raw_payload) and normalized columns for the queryable business fields. A webhook events table stores payload jsonb for the raw body but promotes event_type, resource_id, and occurred_at to real columns for queries.

GIN vs Functional Index

Two index strategies cover the JSONB use cases:

GIN index (CREATE INDEX ON t USING GIN (col)) — covers @> containment, ? key existence, ?|, ?&, and JSONPath operators. Use when queries are ad-hoc containment searches across many possible keys. Two operator classes:

  • jsonb_ops (default): supports all operators including @> on nested values
  • jsonb_path_ops: smaller index, faster @> queries, does not support ? or ?|

Functional index (CREATE INDEX ON t ((col->>'key'))) — a regular B-tree on a single extracted text value. Use when a specific key is queried frequently with equality or range conditions. Much faster than GIN for single-key point queries and enables ordering.

Generated column (PG 12+): col_extracted TEXT GENERATED ALWAYS AS (col->>'key') STORED — the database maintains the extracted value automatically. Add a B-tree index on the generated column for the best single-key query performance with no application changes.

Partial Updates

JSONB columns do not support in-place field updates at the storage level. Any UPDATE replaces the entire stored binary. For small documents this is acceptable. For large documents updated frequently, consider:

  • jsonb_set(target, path[], new_value) — returns a new document with one path changed; combine with UPDATE SET col = jsonb_set(col, ...) to update a single key without fetching the document first.
  • || concatenation operator — merges two objects shallowly; useful for applying a patch object.
  • Promotion: if a field is updated frequently in isolation, it should be a real column.

Best Practices

MUST create a GIN index before querying with @> or ? at any non-trivial table size. Without a GIN index, every @> query is a sequential scan deserializing every JSONB document in the table. Add the index at table creation, not retrospectively.

MUST use jsonb_path_ops when queries use only @> and the index is large. jsonb_path_ops produces a smaller, faster index for containment queries by hashing key paths rather than indexing individual keys. Use jsonb_ops only if ? (key existence) queries are also needed.

MUST promote frequently-queried fields to real columns or generated columns. A field that appears in WHERE, ORDER BY, or JOIN conditions on most queries is a candidate for promotion. A generated column is the lowest-friction path: it auto-updates and can be indexed without changing query patterns.

SHOULD use functional indexes for single-key equality and range queries. CREATE INDEX ON events ((payload->>'status')) is faster than GIN for WHERE payload->>'status' = 'failed' because it is a B-tree point lookup rather than a bitmap scan.

SHOULD avoid deep nesting. Queries on deeply nested paths (col #>> '{a,b,c,d}') are harder to index, slower to evaluate, and harder to maintain. Flatten structures when possible. If nesting is required, index the path expression that queries use.

SHOULD use jsonb_set() for targeted field updates to avoid a round-trip. Instead of SELECT → modify in application → UPDATE, use UPDATE t SET col = jsonb_set(col, '{key}', '"new_value"') WHERE id = $1 to update in one statement.

MAY store raw third-party payloads as JSONB and promote fields incrementally. Storing the raw payload enables re-processing without re-fetching. Promote fields to columns when query patterns crystallize — this is cheaper than schema migrations later if done with generated columns.

Visual

Example

Schema with JSONB and index strategies:

sql
-- Webhook events: normalized fields for queries, JSONB for raw payload
CREATE TABLE webhook_events (
    id          BIGSERIAL PRIMARY KEY,
    event_type  TEXT        NOT NULL,          -- promoted: frequent WHERE clause
    resource_id TEXT        NOT NULL,          -- promoted: frequent JOIN target
    occurred_at TIMESTAMPTZ NOT NULL,          -- promoted: range queries, ORDER BY
    payload     JSONB       NOT NULL,          -- stores full raw body
    received_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- GIN index: ad-hoc containment queries across any payload key
CREATE INDEX ON webhook_events USING GIN (payload jsonb_path_ops);

-- Functional index: single-key equality on a frequently-filtered field
CREATE INDEX ON webhook_events ((payload->>'tenant_id'));

-- Generated column + index: zero application change, best B-tree performance
ALTER TABLE webhook_events
    ADD COLUMN tenant_id TEXT
    GENERATED ALWAYS AS (payload->>'tenant_id') STORED;

CREATE INDEX ON webhook_events (tenant_id);

Query patterns and their index usage:

sql
-- Containment query: uses GIN index (jsonb_path_ops)
-- "find all events where payload contains status = failed"
SELECT id, event_type, occurred_at
FROM webhook_events
WHERE payload @> '{"status": "failed"}';

-- Key-existence check: uses GIN index (jsonb_ops only — not jsonb_path_ops)
SELECT id FROM webhook_events WHERE payload ? 'retry_count';

-- Single-key equality: uses functional index on (payload->>'tenant_id')
SELECT id, event_type
FROM webhook_events
WHERE payload->>'tenant_id' = 'acme-corp'
ORDER BY occurred_at DESC
LIMIT 50;

-- Path extraction: extract nested value as text
SELECT payload #>> '{metadata,source,ip}' AS source_ip
FROM webhook_events
WHERE event_type = 'payment.failed';

Partial update with jsonb_set():

sql
-- Update a single key without fetching the document
-- jsonb_set(target, path_array, new_value, create_missing)
UPDATE webhook_events
SET payload = jsonb_set(payload, '{retry_count}', '3', true)
WHERE id = 42;

-- Merge a patch object (shallow merge with || operator)
UPDATE webhook_events
SET payload = payload || '{"processed": true, "processed_at": "2026-04-14T00:00:00Z"}'::jsonb
WHERE id = 42;

Functional index on a nested path (PG 12+):

sql
-- Index a path two levels deep
CREATE INDEX ON webhook_events ((payload #>> '{metadata,region}'));

-- Query uses the index
SELECT count(*) FROM webhook_events
WHERE payload #>> '{metadata,region}' = 'us-east-1';

Implementation Notes

Update cost: Any UPDATE to a JSONB column rewrites the full binary document. For documents larger than a few kilobytes updated at high frequency, this creates significant write amplification. Profile with pg_stat_user_tables (look at n_dead_tup and autovacuum frequency). If write amplification is the bottleneck, promote hot fields to columns.

EXPLAIN output: Use EXPLAIN (ANALYZE, BUFFERS) to verify index usage. A GIN scan appears as Bitmap Index Scan on <index>. If you see Seq Scan with Filter: (payload @> ...), the GIN index is missing or not being selected (check operator class matches the query operator).

jsonb_ops vs jsonb_path_ops: The default jsonb_ops class indexes every key and value independently, supporting ?, ?|, ?&, and @>. The jsonb_path_ops class indexes paths to values as hashes — no ? support, but smaller index and faster @> lookups. Choose jsonb_path_ops when your queries are exclusively containment; jsonb_ops when you also need key-existence checks.

Deduplication: On insert, PostgreSQL keeps only the last occurrence of duplicate keys. INSERT ... '{"a":1,"a":2}'::jsonb stores {"a": 2}. This matters when deserializing from sources that generate duplicate keys (some XML-to-JSON converters).

Null semantics: SQL NULL and JSON null are distinct. payload IS NULL checks for SQL null. payload->>'key' IS NULL is true for both a missing key and a key with JSON null value. Use payload ? 'key' to distinguish.

Supabase / PostgREST: PostgREST exposes JSONB columns via the REST API with containment filter syntax (?col=cs.{"key":"value"}), enabling client-side containment queries without custom SQL. This makes JSONB a practical API design tool for user-defined attributes.

  • BEE-6001 -- SQL vs NoSQL Tradeoffs: JSONB closes much of the document-store gap without leaving the relational model; understand when a dedicated document database is still the right choice
  • BEE-6002 -- Indexing Deep Dive: GIN is one of PostgreSQL's specialized index types; understand how it differs from B-tree and when each is appropriate
  • BEE-18002 -- Tenant Isolation Strategies: JSONB metadata columns are commonly used for tenant-specific attributes; generated columns and functional indexes make tenant_id queryable without schema changes

References