[DEE-204] Subqueries vs JOINs
INFO
Prefer JOINs for most cases. Use subqueries when they express intent more clearly or when the query planner demonstrates they perform better. Always verify with EXPLAIN ANALYZE -- do not assume one is universally faster than the other.
Context
SQL provides two primary ways to combine data from multiple tables: JOINs and subqueries. A JOIN combines rows from two tables side by side based on a condition. A subquery nests one query inside another, typically in the WHERE, FROM, or SELECT clause.
The conventional wisdom "always rewrite subqueries as JOINs" is an oversimplification. Modern query optimizers (PostgreSQL 12+, MySQL 8.0+) frequently transform subqueries into joins internally, making the performance difference negligible in many cases. However, there are specific scenarios where the choice matters significantly:
- Correlated subqueries re-execute the inner query for each row of the outer query, which can be costly if the optimizer cannot flatten them.
- EXISTS vs IN have different NULL handling semantics that affect correctness, not just performance.
- Scalar subqueries in the SELECT clause are sometimes the clearest way to express a computation, but can cause N+1-like behavior if not optimized.
The right approach is to write the query that most clearly expresses the intent, then check the execution plan. If the plans are identical, prefer the more readable form.
Principle
- Developers SHOULD prefer JOINs for combining data from multiple tables, as they are generally more readable and give the optimizer more flexibility.
- Developers SHOULD use EXISTS instead of IN when the subquery column can contain NULLs, to avoid incorrect results.
- Developers MUST check the execution plan when choosing between a subquery and a JOIN for performance-critical queries.
- Developers SHOULD NOT blindly rewrite every subquery as a JOIN -- some subqueries are clearer and perform identically after optimization.
Example
Same query: subquery vs JOIN
Subquery (IN):
-- Find customers who have placed at least one order
SELECT customer_id, name
FROM customers
WHERE customer_id IN (
SELECT customer_id FROM orders
);JOIN (equivalent):
-- Same result using JOIN
SELECT DISTINCT c.customer_id, c.name
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id;EXISTS (equivalent, often preferred):
-- Same result using EXISTS -- no DISTINCT needed
SELECT c.customer_id, c.name
FROM customers c
WHERE EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
);On modern PostgreSQL, all three typically produce the same execution plan (a semi-join). Verify with EXPLAIN ANALYZE.
EXISTS vs IN: NULL handling difference
This is a correctness issue, not a performance issue.
-- Setup: orders table has some rows with customer_id = NULL
INSERT INTO orders (order_id, customer_id, total)
VALUES (999, NULL, 0.00);
-- Find customers who have NOT placed an order
-- WRONG: NOT IN with NULLs in subquery returns NO rows
SELECT customer_id, name
FROM customers
WHERE customer_id NOT IN (
SELECT customer_id FROM orders -- includes NULL!
);
-- Returns: EMPTY (0 rows) -- incorrect!
-- CORRECT: NOT EXISTS handles NULLs properly
SELECT c.customer_id, c.name
FROM customers c
WHERE NOT EXISTS (
SELECT 1 FROM orders o
WHERE o.customer_id = c.customer_id
);
-- Returns: customers with no orders -- correctWhy NOT IN fails with NULLs: NOT IN (1, 2, NULL) evaluates as x != 1 AND x != 2 AND x != NULL. Since x != NULL is always UNKNOWN, the entire expression is UNKNOWN, and the row is excluded. This is correct SQL behavior, but it is almost never the intended result.
Rule of thumb: Always use NOT EXISTS instead of NOT IN when the subquery column is nullable.
Correlated vs uncorrelated subqueries
-- Uncorrelated subquery: executes once, result is reused
-- The optimizer can execute this as a hash semi-join
SELECT name, salary
FROM employees
WHERE dept_id IN (
SELECT dept_id FROM departments WHERE budget > 1000000
);
-- Correlated subquery: references outer query, may re-execute per row
-- Find employees who earn more than their department average
SELECT e.name, e.salary, e.dept_id
FROM employees e
WHERE e.salary > (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.dept_id = e.dept_id -- correlated: references outer e.dept_id
);The correlated subquery is harder to express as a JOIN. A JOIN equivalent would require a derived table or CTE:
-- JOIN equivalent of the correlated subquery above
SELECT e.name, e.salary, e.dept_id
FROM employees e
JOIN (
SELECT dept_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept_id
) dept_avg ON dept_avg.dept_id = e.dept_id
WHERE e.salary > dept_avg.avg_salary;Both forms are valid. Check EXPLAIN ANALYZE to see which the optimizer handles better for your data distribution.
EXPLAIN comparison
-- PostgreSQL: subquery with IN
EXPLAIN ANALYZE
SELECT customer_id, name FROM customers
WHERE customer_id IN (SELECT customer_id FROM orders);Hash Semi Join (cost=25.00..75.00 rows=500 width=36)
(actual time=0.15..0.80 rows=480 loops=1)
Hash Cond: (customers.customer_id = orders.customer_id)
-> Seq Scan on customers (...)
-> Hash (...)
-> Seq Scan on orders (...)-- PostgreSQL: equivalent JOIN
EXPLAIN ANALYZE
SELECT DISTINCT c.customer_id, c.name
FROM customers c JOIN orders o ON o.customer_id = c.customer_id;HashAggregate (cost=80.00..85.00 rows=500 width=36)
(actual time=0.90..1.10 rows=480 loops=1)
Group Key: c.customer_id, c.name
-> Hash Join (cost=25.00..75.00 rows=2000 width=36)
(actual time=0.15..0.70 rows=2000 loops=1)
-> Seq Scan on customers c (...)
-> Hash (...)
-> Seq Scan on orders o (...)The IN subquery was optimized into a Semi Join (stops after first match per customer), while the JOIN version must produce all matching rows then deduplicate with HashAggregate. For this query, the subquery form is actually more efficient.
When subqueries are clearly better
| Scenario | Why Subquery Wins |
|---|---|
| Existence check | EXISTS performs a semi-join -- stops scanning after the first match. A JOIN must process all matches then deduplicate. |
| Negation (anti-join) | NOT EXISTS is safer than NOT IN due to NULL handling and often optimized identically to LEFT JOIN ... IS NULL. |
| Scalar computation | A subquery in SELECT that computes a single value (e.g., a count or average) can be clearer than a lateral join or CTE. |
| Filtering by aggregate | WHERE dept_id IN (SELECT dept_id ... GROUP BY ... HAVING ...) is more readable than a JOIN to a derived table. |
When JOINs are clearly better
| Scenario | Why JOIN Wins |
|---|---|
| Selecting columns from both tables | JOINs naturally produce columns from both sides; subqueries in WHERE cannot add columns to the result. |
| Multiple relationships | Chaining JOINs across 3+ tables is more readable than nested subqueries. |
| Optimizer flexibility | JOINs give the optimizer more freedom to choose join order, algorithms, and parallelism. |
| Aggregation across joined data | GROUP BY on joined data is straightforward; equivalent subqueries require awkward restructuring. |
Common Mistakes
Always rewriting subqueries as JOINs for performance. Modern optimizers frequently transform subqueries into joins internally. Blindly rewriting can make code less readable without any performance gain. Always check
EXPLAIN ANALYZEbefore and after rewriting.Using NOT IN with nullable columns. If the subquery returns any NULL value,
NOT INreturns zero rows. This is technically correct SQL behavior but almost never the intended result. UseNOT EXISTSor addWHERE column IS NOT NULLto the subquery.Ignoring correlated subquery performance. A correlated subquery that the optimizer cannot flatten will execute once per outer row. On large outer result sets, this becomes an N+1-like problem. Check the execution plan for "SubPlan" nodes in PostgreSQL or "DEPENDENT SUBQUERY" in MySQL EXPLAIN -- both indicate per-row execution.
Using a subquery in SELECT without checking loop cost. A scalar subquery in the SELECT clause (e.g.,
(SELECT COUNT(*) FROM orders WHERE orders.customer_id = c.customer_id)) is essentially a correlated subquery. If the optimizer cannot transform it into a join, it executes once per output row. Use a lateral join or CTE for complex per-row computations.Forgetting DISTINCT with JOIN-based existence checks. When replacing
IN (subquery)with a JOIN, the JOIN may produce duplicate rows if the subquery side has multiple matches. Forgetting DISTINCT changes the result set cardinality.
Related DEEs
- DEE-200 Query and Performance Overview
- DEE-201 Reading Execution Plans -- how to verify subquery vs JOIN performance
- DEE-203 JOIN Strategies -- understanding JOIN types and algorithms
- DEE-205 Query Optimization Patterns -- broader query writing techniques
References
- PostgreSQL Documentation: Subquery Expressions -- EXISTS, IN, ANY, ALL syntax and semantics
- MySQL Documentation: Optimizing Subqueries with EXISTS Strategy -- how MySQL optimizes IN to EXISTS internally
- LearnSQL: Subquery vs JOIN -- practical comparison with examples
- MSSQLTips: SQL EXISTS vs IN vs JOIN Performance Comparison -- benchmarked comparison
- Redgate: Consider Using NOT EXISTS Instead of NOT IN -- NULL handling deep dive
- Use The Index, Luke: Subqueries -- index-aware query writing for joins and subqueries