Skip to content
DEE
Database Engineering Essentials

[DEE-203] JOIN Strategies

INFO

Choose the JOIN type based on data requirements, not convenience. Each JOIN type has distinct semantics -- using the wrong one silently produces incorrect results or unnecessary performance costs.

Context

JOINs are the mechanism through which relational databases combine data from multiple tables. They are not a performance feature or a convenience -- they are how the relational model works. Without JOINs, you either denormalize everything into one table (violating normalization principles) or fetch data in multiple queries and assemble it in application code (the N+1 pattern).

SQL defines several JOIN types, each with different semantics for how matched and unmatched rows are handled. Choosing the wrong JOIN type is a logic error, not just a performance concern. A LEFT JOIN where an INNER JOIN was intended returns rows with NULL columns that downstream code may not expect. A CROSS JOIN where a filtered JOIN was intended produces a Cartesian product that can be orders of magnitude larger than intended.

Beyond the logical JOIN type, the database engine must choose a physical join algorithm -- Nested Loop, Hash Join, or Merge Join -- to actually execute the operation. Understanding both the logical type and the physical algorithm helps developers write efficient queries and interpret execution plans.

Principle

  • Developers MUST choose the JOIN type that matches the data requirement, not the one they are most familiar with.
  • Developers SHOULD use INNER JOIN as the default unless rows without matches are explicitly needed.
  • Developers MUST ensure join columns are indexed, especially on the "many" side of a one-to-many relationship.
  • Developers SHOULD NOT use CROSS JOIN unless a Cartesian product is genuinely needed (e.g., generating combinations for a calendar or matrix).

Visual

JOIN Type Reference

JOIN TypeMatched RowsUnmatched LeftUnmatched RightUse When
INNER JOINIncludedExcludedExcludedYou only want rows that have a match in both tables
LEFT JOINIncludedIncluded (NULLs for right)ExcludedYou need all rows from the left table regardless of match
RIGHT JOINIncludedExcludedIncluded (NULLs for left)You need all rows from the right table (often rewritten as LEFT JOIN)
FULL OUTER JOINIncludedIncluded (NULLs for right)Included (NULLs for left)You need all rows from both tables (e.g., reconciliation reports)
CROSS JOINN/A -- all combinationsN/AN/AYou genuinely need every combination of rows (Cartesian product)

Example

Sample data

sql
CREATE TABLE departments (
    dept_id   INT PRIMARY KEY,
    dept_name TEXT NOT NULL
);

INSERT INTO departments VALUES
    (1, 'Engineering'),
    (2, 'Marketing'),
    (3, 'Finance');

CREATE TABLE employees (
    emp_id    INT PRIMARY KEY,
    name      TEXT NOT NULL,
    dept_id   INT REFERENCES departments(dept_id)
);

INSERT INTO employees VALUES
    (101, 'Alice',  1),
    (102, 'Bob',    1),
    (103, 'Carol',  2),
    (104, 'Dave',   NULL);  -- Dave has no department

INNER JOIN -- only matched rows

sql
SELECT e.name, d.dept_name
FROM employees e
INNER JOIN departments d ON d.dept_id = e.dept_id;
namedept_name
AliceEngineering
BobEngineering
CarolMarketing

Dave is excluded (no matching department). Finance is excluded (no matching employee).

LEFT JOIN -- all employees, even without a department

sql
SELECT e.name, d.dept_name
FROM employees e
LEFT JOIN departments d ON d.dept_id = e.dept_id;
namedept_name
AliceEngineering
BobEngineering
CarolMarketing
DaveNULL

Dave appears with NULL for dept_name. Finance still excluded (no employee references it).

FULL OUTER JOIN -- all rows from both tables

sql
SELECT e.name, d.dept_name
FROM employees e
FULL OUTER JOIN departments d ON d.dept_id = e.dept_id;
namedept_name
AliceEngineering
BobEngineering
CarolMarketing
DaveNULL
NULLFinance

Both Dave (no department) and Finance (no employees) appear.

CROSS JOIN -- Cartesian product

sql
-- Generate a schedule grid: every employee x every department
SELECT e.name, d.dept_name
FROM employees e
CROSS JOIN departments d;
-- Returns 4 x 3 = 12 rows

Anti-join pattern with LEFT JOIN

sql
-- Find departments with no employees
SELECT d.dept_name
FROM departments d
LEFT JOIN employees e ON e.dept_id = d.dept_id
WHERE e.emp_id IS NULL;
dept_name
Finance

This is more efficient than NOT IN with a subquery, especially when the subquery can contain NULLs.

Physical Join Algorithms

The database engine chooses how to physically execute a JOIN independently of the logical JOIN type. Understanding these helps when reading execution plans (see DEE-201).

AlgorithmHow It WorksBest ForCost
Nested LoopFor each row in the outer table, scan the inner table (or use an index)Small tables, or when an index exists on the inner join columnO(N * M) without index, O(N * log M) with index
Hash JoinBuild a hash table from the smaller table, then probe it with each row from the larger tableLarge tables without useful indexes, equality conditions onlyO(N + M) but requires memory for hash table
Merge JoinSort both inputs on the join key (or use pre-sorted index), then mergeBoth inputs already sorted, or large tables with indexes on join keyO(N log N + M log M) for sort, O(N + M) for merge

PostgreSQL supports all three. MySQL added Hash Join support in version 8.0.18; prior versions rely on Nested Loop with index lookups for most joins.

Common Mistakes

  1. Accidental CROSS JOIN. Omitting the ON clause or using comma-separated tables without a WHERE condition produces a Cartesian product. With two 10,000-row tables, this generates 100 million rows. Always specify an explicit join condition.

    sql
    -- WRONG: accidental cross join (comma syntax, missing condition)
    SELECT * FROM orders, customers;
    
    -- CORRECT: explicit join
    SELECT * FROM orders o JOIN customers c ON c.customer_id = o.customer_id;
  2. LEFT JOIN when INNER JOIN suffices. Using LEFT JOIN "just in case" when the business logic guarantees a match (e.g., a NOT NULL foreign key) adds unnecessary overhead. The optimizer may not be able to eliminate the outer-join logic, leading to suboptimal plans. Use INNER JOIN when every row is guaranteed to have a match.

  3. Not indexing join columns. Without an index on the join column (especially the foreign key side), the database falls back to sequential scans or hash joins even when a nested loop with an index would be much faster. Always ensure foreign key columns have an index.

  4. Joining on mismatched types. Joining a VARCHAR column to an INT column forces an implicit type cast on every row, preventing index use. Ensure join columns have the same data type.

  5. Using RIGHT JOIN instead of rewriting as LEFT JOIN. RIGHT JOIN is semantically identical to LEFT JOIN with the table order swapped. Most teams standardize on LEFT JOIN for readability. RIGHT JOIN adds cognitive overhead without any benefit.

  6. Ignoring NULL semantics in outer joins. After a LEFT JOIN, columns from the right table may be NULL for unmatched rows. Filtering on those columns in the WHERE clause (e.g., WHERE right_table.column = 'value') implicitly converts the LEFT JOIN to an INNER JOIN, because NULL never equals anything. Place such filters in the ON clause instead if you want to preserve the outer-join behavior.

  • DEE-200 Query and Performance Overview
  • DEE-201 Reading Execution Plans -- see how the database executes your JOINs
  • DEE-202 The N+1 Query Problem -- JOINs as the primary solution
  • DEE-204 Subqueries vs JOINs -- when to use each

References