BEE
Backend Engineering Essentials

[BEE-16005] Container Fundamentals

Context

Containers are everywhere in modern backend engineering -- from local development to production Kubernetes clusters. Yet many engineers treat them as a black box: "it's like a lightweight VM." This mental model leads to predictable mistakes: running as root, using :latest tags, stuffing state into the container filesystem, and skipping resource limits.

Understanding what containers actually are -- isolated Linux processes, not mini-VMs -- changes how you build, secure, and operate them.


Principle

A container is an isolated process on the host kernel -- not a virtual machine. Build images small, run as non-root, enforce resource limits, and never store state in the container filesystem.

What a Container Actually Is

A container is a regular Linux process (or process tree) that the kernel has placed into a restricted view of the system. Two kernel primitives do the work:

| Primitive | Job |
|---|---|
| Namespaces | What the process can **see** (filesystem, network, PIDs, users) |
| cgroups | What the process can **use** (CPU, memory, I/O) |

No hypervisor. No guest kernel. The container shares the host kernel directly.

Namespaces -- Isolation

Linux provides eight namespace types (the time namespace is the newest); containers use these seven:

| Namespace | Isolates |
|---|---|
| PID | Process IDs -- container processes start from PID 1 |
| NET | Network interfaces, routes, firewall rules, ports |
| MNT | Filesystem mount points (the container's /) |
| UTS | Hostname and domain name |
| IPC | Inter-process communication (semaphores, message queues) |
| USER | UID/GID mappings (enables rootless containers) |
| CGROUP | cgroup hierarchy visibility |
When Docker starts a container, it calls clone() or unshare() with the appropriate namespace flags. The process genuinely cannot see processes, network interfaces, or filesystem paths outside its namespaces.
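You can see these namespace memberships directly in `/proc` on any Linux host -- no container runtime needed. A minimal sketch:

```bash
# Each process lists its namespace memberships under /proc/<pid>/ns.
# The bracketed number is the namespace's inode -- two processes with
# the same inode share that namespace.
ls -l /proc/self/ns

# Print just the PID namespace handle, e.g. pid:[4026531836]
readlink /proc/self/ns/pid
```

Comparing `readlink /proc/<pid>/ns/net` across two PIDs tells you whether they share a network namespace -- this is exactly what "a container is just a process" means in practice.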

cgroups -- Resource Limits

Control groups (cgroups) are a kernel feature for organizing processes into hierarchies and applying resource policies:

  • CPU: limit to a percentage of a core, or set CPU shares for scheduling priority
  • Memory: hard limit in bytes; the kernel OOM-kills the process if exceeded
  • Block I/O: throttle read/write bandwidth per device
  • Network: classify and prioritize traffic (via tc)

Without cgroup limits, a single runaway container can consume all CPU and memory on the host, starving every other workload.
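You can check cgroup membership from any shell; Docker's resource flags translate into writes to these cgroup controllers. A sketch (the `docker run` values and image name are illustrative placeholders):

```bash
# Which cgroup does this process belong to?
# cgroup v2 prints a single line ("0::/..."); v1 prints one line per controller.
cat /proc/self/cgroup

# Illustrative Docker flags that become cgroup limits on the container's process:
#   docker run --cpus=0.5 --memory=256m myapp:1.2.3
```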

Container vs. VM

| | Container | VM |
|---|---|---|
| Kernel | Shared (host) | Separate per VM |
| Boot time | Milliseconds | Seconds to minutes |
| Image size | Megabytes | Gigabytes |
| Isolation level | Process-level | Hardware-level |
| Overhead | Minimal | Hypervisor + guest OS |

VMs remain the right choice when you need strong hardware-level isolation (multi-tenant hosting, different OS families). Containers are the right choice for running many instances of the same application on shared infrastructure.

OCI: The Standard

The Open Container Initiative (OCI) defines the portable standards that make containers interoperable across runtimes (Docker, Podman, containerd, CRI-O):

  • Image Spec -- defines the image manifest, filesystem layer format (tar archives), and image configuration JSON
  • Runtime Spec -- defines what a conformant runtime must do when given an unpacked image bundle (create namespaces, apply cgroups, exec the process)
  • Distribution Spec -- the HTTP API for pushing and pulling images to/from registries

Because Docker, Kubernetes, and cloud registries all implement OCI, an image built with docker build runs identically on any OCI-compliant runtime.
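Concretely, the Image Spec's manifest is a small JSON document that points at the config and layer blobs by digest. A trimmed, illustrative sketch (digests and sizes are placeholders, not real values):

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:<config-digest>",
    "size": 1469
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:<layer-digest>",
      "size": 3370706
    }
  ]
}
```

Everything is content-addressed by digest, which is what makes cross-runtime layer sharing and digest pinning possible.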

Image Layers and Copy-on-Write

A container image is a stack of read-only layers. Each RUN, COPY, and ADD instruction in a Dockerfile creates one layer. At runtime, Docker adds a thin writable layer on top -- the container layer.

```
[ writable container layer ]   ← changes live here, gone on container removal
[ COPY . /app             ]   ← read-only
[ RUN npm ci              ]   ← read-only
[ FROM node:20-alpine     ]   ← read-only base
```

Copy-on-write (CoW): when a container modifies a file from a read-only layer, the storage driver copies the file up to the writable layer first. The original layer is untouched and shared across all containers using the same image.

Layer caching: Docker hashes each instruction plus its input files. If nothing changed, it reuses the cached layer and skips the step. Cache invalidation is sequential -- changing the instruction at step N rebuilds that layer and every layer after it. This has a direct consequence for how you order Dockerfile instructions.

Dockerfile: Bad vs. Good

Bad Dockerfile

```dockerfile
# BAD: large base image, root user, no multi-stage, bad layer order
FROM node:20

WORKDIR /app

COPY . .
RUN npm install

EXPOSE 3000
CMD ["node", "src/server.js"]
```

Problems:

  • node:20 is ~1 GB; ships curl, git, compilers, and other attack surface
  • Runs as root (UID 0) -- if the app is exploited, the attacker has root in the container
  • Copies source before installing dependencies -- any source change invalidates the npm install cache
  • Includes node_modules, .git, test files in the image

Resulting image size: ~1.1 GB

Good Dockerfile

```dockerfile
# GOOD: multi-stage, minimal base, non-root, cache-optimized layers
# ---- build stage ----
FROM node:20-alpine AS builder

WORKDIR /app

# Copy dependency manifests first -- cached unless deps change
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

# Copy source after deps are installed
COPY src/ ./src/

# ---- runtime stage ----
FROM node:20-alpine AS runtime

# Create a non-root user
RUN addgroup -S appgroup && adduser -S appuser -G appgroup

WORKDIR /app

# Copy only the artifacts needed at runtime
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/src ./src
COPY package.json ./

USER appuser

EXPOSE 3000
CMD ["node", "src/server.js"]
```

.dockerignore (prevents build context bloat and cache busting):

```
node_modules
.git
*.test.js
.env
coverage/
```

Resulting image size: ~180 MB -- an ~84% reduction. The runtime stage contains no compiler, no build tools, and no root user.

Container Networking Basics

Docker creates a virtual network bridge (docker0) by default. Each container gets a virtual Ethernet pair -- one end in the container's NET namespace, one end connected to the bridge.

Common network modes:

| Mode | Use case |
|---|---|
| bridge (default) | Containers on the same host share a virtual bridge; DNS-based discovery by container name works on user-defined bridges, not the default one |
| host | Container shares host network stack -- no isolation, maximum performance |
| none | No network access |
| overlay | Multi-host networking in Swarm / Kubernetes |

In Kubernetes, each Pod gets its own network namespace. The CNI plugin (Flannel, Calico, Cilium) handles IP assignment and inter-node routing.
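You can inspect what a NET namespace exposes from inside it, and (hypothetically, if Docker is available) try name-based discovery on a user-defined bridge -- network and container names below are placeholders:

```bash
# Interfaces visible in the current NET namespace.
# Inside a typical container this shows only lo and eth0.
cut -d: -f1 /proc/net/dev | tail -n +3

# Hypothetical user-defined bridge demo (requires Docker):
#   docker network create appnet
#   docker run -d --name api --network appnet nginx:alpine
#   docker run --rm --network appnet alpine wget -qO- http://api
```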

Resource Limits in Orchestration

Always set resource requests and limits. In Kubernetes:

```yaml
resources:
  requests:
    cpu: "250m"      # 0.25 cores guaranteed at scheduling time
    memory: "256Mi"
  limits:
    cpu: "500m"      # hard cap -- throttled if exceeded
    memory: "512Mi"  # hard cap -- OOM-killed if exceeded
```

Without limits, a single pod can starve its node. Without requests, the scheduler cannot bin-pack pods correctly and nodes become over-committed.

Image Security Scanning

Container images accumulate CVEs in base image packages. Integrate scanning into CI:

  • Trivy (trivy image myapp:1.2.3) -- fast, free, scans OS packages and language deps
  • Docker Scout -- integrated into Docker Hub and Docker Desktop
  • Grype -- alternative to Trivy, good GitHub Actions integration

Scanning rules:

  1. Fail the pipeline on CRITICAL severity CVEs
  2. Rebuild and re-push images when base image updates are available (use Renovate or Dependabot for base image pinning)
  3. Pin base images to a digest, not a tag: FROM node:20-alpine@sha256:abc123...

Common Mistakes

1. Running as root in the container

```dockerfile
# missing USER instruction -- process runs as UID 0
CMD ["node", "server.js"]
```

If an attacker exploits your app, they get root in the container. With a misconfigured volume mount or privileged mode, root in the container can become root on the host. Always add USER appuser before CMD.

2. Using :latest in production

```dockerfile
FROM node:latest   # resolves to different images over time
```

:latest is not a version. It changes when the maintainer pushes a new image. Two builds from the same Dockerfile can produce different images. Pin to an exact version and digest.

3. Fat base images

ubuntu:22.04 ships a package manager, shell utilities, and hundreds of packages that your app never calls -- all potential CVE surface. Prefer alpine variants (busybox shell, minimal packages) or distroless images (no shell at all).

4. No resource limits

A container without memory limits that hits a memory leak will consume all available host memory, triggering the kernel OOM killer, which may kill unrelated processes -- including the container runtime itself.

5. Storing state in the container filesystem

```bash
# inside a container
echo "important-data" > /app/data/results.json
# container is removed and recreated -- /app/data/results.json is gone
```

The writable container layer is ephemeral. It disappears when the container is removed or replaced. Persist state in volumes (docker run -v /host/path:/app/data) or external storage (databases, object storage).
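In Compose terms, the same idea as a sketch (service, image, and volume names are placeholders):

```yaml
# A named volume survives container removal and replacement
services:
  app:
    image: myapp:1.2.3
    volumes:
      - appdata:/app/data   # mounted over the ephemeral container layer
volumes:
  appdata:
```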

Summary

| Concept | Key point |
|---|---|
| Containers vs. VMs | Containers share the host kernel; VMs have a separate guest kernel |
| Namespaces | Isolate what the process sees: PID, NET, MNT, UTS, IPC, USER |
| cgroups | Limit what the process uses: CPU, memory, I/O |
| Image layers | Read-only stacked tarballs; writable layer added at runtime |
| CoW | Files from lower layers are copied up only when modified |
| Multi-stage builds | Separate build environment from runtime; dramatically smaller images |
| OCI | Portable image + runtime standard; interoperable across Docker, containerd, CRI-O |
| Non-root user | Mandatory security hygiene; create a dedicated user in the Dockerfile |
| Resource limits | Always set requests and limits in orchestration |
| No state in container | Use volumes or external storage for persistent data |
