[DEE-204] Subqueries vs JOINs

INFO

Prefer JOINs for most cases. Use subqueries when they express intent more clearly or when the query planner demonstrates they perform better. Always verify with EXPLAIN ANALYZE -- do not assume one is universally faster than the other.

Context

SQL provides two primary ways to combine data from multiple tables: JOINs and subqueries. A JOIN combines rows from two tables side by side based on a condition. A subquery nests one query inside another, typically in the WHERE, FROM, or SELECT clause.

The conventional wisdom "always rewrite subqueries as JOINs" is an oversimplification. Modern query optimizers (PostgreSQL 12+, MySQL 8.0+) frequently transform subqueries into joins internally, making the performance difference negligible in many cases. However, there are specific scenarios where the choice matters significantly:

Correlated subqueries re-execute the inner query for each row of the outer query, which can be costly if the optimizer cannot flatten them.
EXISTS vs IN have different NULL handling semantics that affect correctness, not just performance.
Scalar subqueries in the SELECT clause are sometimes the clearest way to express a computation, but can cause N+1-like behavior if not optimized.

The right approach is to write the query that most clearly expresses the intent, then check the execution plan. If the plans are identical, prefer the more readable form.

Principle

Developers SHOULD prefer JOINs for combining data from multiple tables, as they are generally more readable and give the optimizer more flexibility.
Developers SHOULD use EXISTS instead of IN when the subquery column can contain NULLs, to avoid incorrect results.
Developers MUST check the execution plan when choosing between a subquery and a JOIN for performance-critical queries.
Developers SHOULD NOT blindly rewrite every subquery as a JOIN -- some subqueries are clearer and perform identically after optimization.

Example

Same query: subquery vs JOIN

Subquery (IN):

sql

-- Find customers who have placed at least one order
SELECT customer_id, name
FROM customers
WHERE customer_id IN (
    SELECT customer_id FROM orders
);

JOIN (equivalent):

sql

-- Same result using JOIN
SELECT DISTINCT c.customer_id, c.name
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id;

EXISTS (equivalent, often preferred):

sql

-- Same result using EXISTS -- no DISTINCT needed
SELECT c.customer_id, c.name
FROM customers c
WHERE EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
);

On modern PostgreSQL, all three typically produce the same execution plan (a semi-join). Verify with EXPLAIN ANALYZE.

EXISTS vs IN: NULL handling difference

This is a correctness issue, not a performance issue.

sql

-- Setup: orders table has some rows with customer_id = NULL
INSERT INTO orders (order_id, customer_id, total)
VALUES (999, NULL, 0.00);

-- Find customers who have NOT placed an order

-- WRONG: NOT IN with NULLs in subquery returns NO rows
SELECT customer_id, name
FROM customers
WHERE customer_id NOT IN (
    SELECT customer_id FROM orders  -- includes NULL!
);
-- Returns: EMPTY (0 rows) -- incorrect!

-- CORRECT: NOT EXISTS handles NULLs properly
SELECT c.customer_id, c.name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o
    WHERE o.customer_id = c.customer_id
);
-- Returns: customers with no orders -- correct

Why NOT IN fails with NULLs: NOT IN (1, 2, NULL) evaluates as x != 1 AND x != 2 AND x != NULL. Since x != NULL is always UNKNOWN, the entire expression is UNKNOWN, and the row is excluded. This is correct SQL behavior, but it is almost never the intended result.

Rule of thumb: Always use NOT EXISTS instead of NOT IN when the subquery column is nullable.

Correlated vs uncorrelated subqueries

sql

-- Uncorrelated subquery: executes once, result is reused
-- The optimizer can execute this as a hash semi-join
SELECT name, salary
FROM employees
WHERE dept_id IN (
    SELECT dept_id FROM departments WHERE budget > 1000000
);

-- Correlated subquery: references outer query, may re-execute per row
-- Find employees who earn more than their department average
SELECT e.name, e.salary, e.dept_id
FROM employees e
WHERE e.salary > (
    SELECT AVG(e2.salary)
    FROM employees e2
    WHERE e2.dept_id = e.dept_id  -- correlated: references outer e.dept_id
);

The correlated subquery is harder to express as a JOIN. A JOIN equivalent would require a derived table or CTE:

sql

-- JOIN equivalent of the correlated subquery above
SELECT e.name, e.salary, e.dept_id
FROM employees e
JOIN (
    SELECT dept_id, AVG(salary) AS avg_salary
    FROM employees
    GROUP BY dept_id
) dept_avg ON dept_avg.dept_id = e.dept_id
WHERE e.salary > dept_avg.avg_salary;

Both forms are valid. Check EXPLAIN ANALYZE to see which the optimizer handles better for your data distribution.

EXPLAIN comparison

sql

-- PostgreSQL: subquery with IN
EXPLAIN ANALYZE
SELECT customer_id, name FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders);

Hash Semi Join  (cost=25.00..75.00 rows=500 width=36)
                (actual time=0.15..0.80 rows=480 loops=1)
  Hash Cond: (customers.customer_id = orders.customer_id)
  ->  Seq Scan on customers  (...)
  ->  Hash  (...)
        ->  Seq Scan on orders  (...)

sql

-- PostgreSQL: equivalent JOIN
EXPLAIN ANALYZE
SELECT DISTINCT c.customer_id, c.name
FROM customers c JOIN orders o ON o.customer_id = c.customer_id;

HashAggregate  (cost=80.00..85.00 rows=500 width=36)
               (actual time=0.90..1.10 rows=480 loops=1)
  Group Key: c.customer_id, c.name
  ->  Hash Join  (cost=25.00..75.00 rows=2000 width=36)
                  (actual time=0.15..0.70 rows=2000 loops=1)
        ->  Seq Scan on customers c  (...)
        ->  Hash  (...)
              ->  Seq Scan on orders o  (...)

The IN subquery was optimized into a Semi Join (stops after first match per customer), while the JOIN version must produce all matching rows then deduplicate with HashAggregate. For this query, the subquery form is actually more efficient.

When subqueries are clearly better

Scenario	Why Subquery Wins
Existence check	`EXISTS` performs a semi-join -- stops scanning after the first match. A JOIN must process all matches then deduplicate.
Negation (anti-join)	`NOT EXISTS` is safer than `NOT IN` due to NULL handling and often optimized identically to `LEFT JOIN ... IS NULL`.
Scalar computation	A subquery in SELECT that computes a single value (e.g., a count or average) can be clearer than a lateral join or CTE.
Filtering by aggregate	`WHERE dept_id IN (SELECT dept_id ... GROUP BY ... HAVING ...)` is more readable than a JOIN to a derived table.

When JOINs are clearly better

Scenario	Why JOIN Wins
Selecting columns from both tables	JOINs naturally produce columns from both sides; subqueries in WHERE cannot add columns to the result.
Multiple relationships	Chaining JOINs across 3+ tables is more readable than nested subqueries.
Optimizer flexibility	JOINs give the optimizer more freedom to choose join order, algorithms, and parallelism.
Aggregation across joined data	GROUP BY on joined data is straightforward; equivalent subqueries require awkward restructuring.

Common Mistakes

Always rewriting subqueries as JOINs for performance. Modern optimizers frequently transform subqueries into joins internally. Blindly rewriting can make code less readable without any performance gain. Always check EXPLAIN ANALYZE before and after rewriting.
Using NOT IN with nullable columns. If the subquery returns any NULL value, NOT IN returns zero rows. This is technically correct SQL behavior but almost never the intended result. Use NOT EXISTS or add WHERE column IS NOT NULL to the subquery.
Ignoring correlated subquery performance. A correlated subquery that the optimizer cannot flatten will execute once per outer row. On large outer result sets, this becomes an N+1-like problem. Check the execution plan for "SubPlan" nodes in PostgreSQL or "DEPENDENT SUBQUERY" in MySQL EXPLAIN -- both indicate per-row execution.
Using a subquery in SELECT without checking loop cost. A scalar subquery in the SELECT clause (e.g., (SELECT COUNT(*) FROM orders WHERE orders.customer_id = c.customer_id)) is essentially a correlated subquery. If the optimizer cannot transform it into a join, it executes once per output row. Use a lateral join or CTE for complex per-row computations.
Forgetting DISTINCT with JOIN-based existence checks. When replacing IN (subquery) with a JOIN, the JOIN may produce duplicate rows if the subquery side has multiple matches. Forgetting DISTINCT changes the result set cardinality.

DEE-200 Query and Performance Overview
DEE-201 Reading Execution Plans -- how to verify subquery vs JOIN performance
DEE-203 JOIN Strategies -- understanding JOIN types and algorithms
DEE-205 Query Optimization Patterns -- broader query writing techniques

References

PostgreSQL Documentation: Subquery Expressions -- EXISTS, IN, ANY, ALL syntax and semantics
MySQL Documentation: Optimizing Subqueries with EXISTS Strategy -- how MySQL optimizes IN to EXISTS internally
LearnSQL: Subquery vs JOIN -- practical comparison with examples
MSSQLTips: SQL EXISTS vs IN vs JOIN Performance Comparison -- benchmarked comparison
Redgate: Consider Using NOT EXISTS Instead of NOT IN -- NULL handling deep dive
Use The Index, Luke: Subqueries -- index-aware query writing for joins and subqueries

[DEE-204] Subqueries vs JOINs ​

Context ​

Principle ​

Example ​

Same query: subquery vs JOIN ​

EXISTS vs IN: NULL handling difference ​

Correlated vs uncorrelated subqueries ​

EXPLAIN comparison ​

When subqueries are clearly better ​

When JOINs are clearly better ​

Common Mistakes ​

Related DEEs ​

References ​