I Built a Container Runtime from Scratch in C++
What comes to your mind the moment you hear "containers"? It could be any of these, or all of them:
- isolated
- changes made inside container not visible in the host
- solves the "works on my machine" problem
- easy to scale
- kubernetes
But have you ever thought about how all this works under the hood? Ever wondered how the same image runs on Linux, Windows, and Mac?
Instead of treating Docker as just a black box, I'm going to explain how it all works: how, without using virtual machines, containers achieve isolation while running on the same host as your other processes.

Containerization is built on three key features provided by the Linux kernel:
- Namespaces
- Cgroups
- Chroot
Let's break down these concepts in simpler terms.
Namespaces
A namespace is a feature in the Linux kernel that isolates system resources so that a process thinks it has its own separate environment.
Namespaces allow you to specify what resources a specific application or process is allowed to share or inherit from the host, and which ones should be isolated within the group (a set of processes that share the same namespace).
There are several kinds:
- PID Namespace → isolates process IDs so processes see their own separate process tree
- Network Namespace → isolates network stack (interfaces, IPs, routing tables)
- Mount Namespace → isolates the filesystem view (what / looks like)
- UTS Namespace → isolates system identifiers like hostname
- IPC Namespace → isolates inter-process communication (shared memory, message queues)
- User Namespace → isolates user and group IDs (root inside ≠ root outside)
- Control Group (cgroup) → Isolates the view of the cgroup hierarchy, hiding the full host cgroup path from the process.
- Time → The newest addition (introduced in kernel 5.6), it allows processes to have different system times by isolating the boot and monotonic clocks.
Chroot
Chroot, short for "Change Root", allows you to change your root directory to a custom location. This creates an isolated environment within the filesystem, which is essential for containers.
Cgroups
Cgroups, or control groups, allow the kernel to restrict access to system resources for a program. This ensures that containers do not exceed their allocated resources and maintains system stability.
We're going to build a container that takes a command as input, runs it, and cleans up all the resources it used when it exits. Additionally, we're going to use a layered filesystem and set up networking for our containers the same way Docker does.
1. Create a process with namespace isolation
In the first step of creating a container, we create a process with the required isolation. We can pick and choose what isolation we want; there is no need to use all the namespaces.
To do that we use the clone() system call.
clone() is a Linux system call used to create a new process… but with fine-grained control over what gets shared between parent and child.

```cpp
pid_t child_process_id = clone(child_func, stackTop,
                               SIGCHLD | CLONE_NEWUTS | CLONE_NEWPID |
                               CLONE_NEWNS | CLONE_NEWNET,
                               &args);
```

- child_func is the function that first gets executed when the child process is created
- the flags we pass here determine the isolation (each flag denotes a namespace)
- args is the argument we pass to the child function
Notice that here the developer, not the kernel, is responsible for allocating and freeing the memory for the child's stack. So we allocate a region using malloc and pass a pointer to the top of that memory.
By "top of the stack" I mean the highest memory address, where the stack begins.
On most architectures the stack grows downwards. Why? That's a hardware design decision: the CPU stack pointer (rsp on x86) decreases when pushing data, while the heap generally grows upwards. Growing toward each other maximizes use of the address space.
```
High Address
[ Stack ]      → grows DOWN
(free space)
[ Heap ]       → grows UP
Low Address
```

2. Setting resource limits
Alright, once we have created a process with the isolation we want, the next step is to set resource limits so our containers don't consume unbounded resources and choke the machine.
I've used cgroups v2 in this project because it's the better, recommended version. You can read about the differences here.
```cpp
namespace fs = std::filesystem;

// create the cgroup path
fs::path cg_path{"/sys/fs/cgroup/Aegis"};
// create those dirs in the filesystem
fs::create_directories(cg_path);

std::ofstream ofs;

// in v2 controllers are not enabled by default,
// so enabling them is required
ofs.open("/sys/fs/cgroup/cgroup.subtree_control");
ofs << "+pids +memory";
ofs.close();

// limit the number of processes
ofs.open(cg_path / "pids.max");
ofs << "3";
ofs.close();

// memory byte limit: 200 MB
ofs.open(cg_path / "memory.max");
ofs << "209715200";
ofs.close();

// add the process to the cgroup
ofs.open(cg_path / "cgroup.procs");
ofs << std::to_string(pid);
ofs.close();
```

In cgroups v2 we have to enable controllers manually.
Each controller manages one type of resource.
The rest of the code is pretty much self-explanatory: we set the maximum number of processes to 3. If the group tries to go above that, the kernel doesn't kill anything; it simply makes any further fork()/clone() in the group fail.
When the memory limit is exceeded, the kernel triggers the OOM (out of memory) killer, which selects a victim process and sends it SIGKILL.
3. The overlayFS filesystem
What's the difference between a normal filesystem and an overlay filesystem? Why do we need it?
In order for your application to run in a container, we need a root filesystem so that your application has all the required system files to run.
But imagine you run 10 containers. That's 10 copies of the same root filesystem, one inside every container.
So you arrive at a solution: reuse these files. But imagine container1 changes the contents of these shared files; now it affects all the other containers. That's not good. So we need a way to reuse content to reduce storage, while at the same time making sure changes made in one container aren't visible to other containers.
That's where the layered filesystem comes into the picture. OverlayFS in Linux is an implementation of the layered filesystem.
How does a layered filesystem work?
There are three main pieces that combine to make the layered filesystem possible:
- lowerdir
- upperdir
- merged
lowerdir contains all the shared files (the rootfs in our case); contents present in the lowerdir never get modified.
This layer is read-only. Most importantly, this layer can be stacked (this is how docker does filesystem layering you might have read about this)
upperdir is the writable layer: whatever changes you make inside the container get written to this layer. The Linux kernel does a brilliant thing here called copy-on-write (CoW).
In simple terms, when you change some content present in the lowerdir, it gets copied to the upperdir and modified there. That way you're not messing with the shared lowerdir contents.
Now it should make sense that you pull a Node.js image once, and then build multiple custom images on top of it.
Each of your custom images just adds its own layers (like your app code or dependencies), but they all reuse the same base Node.js layers underneath.
So even if you run multiple containers from different images, they’re still sharing the common base layers efficiently instead of duplicating everything.
When you mutate a file or perform a write, that file gets copied to the upperdir first; only then does the content get changed.
This is how you run multiple containers from the same image.
The merged directory is what we as users actually see. All the lowerdir, upperdir, and copy-on-write magic is handled internally by the Linux kernel through OverlayFS.
It combines all these layers and presents a single unified filesystem view to us.
So from our perspective, everything looks like a normal filesystem.
code:
```cpp
// overlayFS base dirs
utils::cmd("mkdir -p /tmp/aegis/lower /tmp/aegis/upper "
           "/tmp/aegis/work /tmp/aegis/merged");

// base filesystem layout in the lowerdir
utils::cmd("mkdir -p /tmp/aegis/lower/bin /tmp/aegis/lower/proc "
           "/tmp/aegis/lower/lib /tmp/aegis/lower/lib64 /tmp/aegis/lower/usr/bin");

// a list of essential binaries for the container
const std::vector<std::string> bins = {"/bin/bash", "/bin/ps", "/bin/hostname",
                                       "/bin/ls", "/bin/ip", "/bin/ping"};
for (const auto& bin : bins) {
    // copy the binary into the container (lower dir)
    utils::cmd("cp " + bin + " /tmp/aegis/lower/bin");
    // find and copy all shared library dependencies for the binary
    std::string ldd_cmd = "ldd " + bin +
        " | grep -oE '/[^ ]+' | xargs -I '{}' cp --parents '{}' /tmp/aegis/lower/";
    utils::cmd(ldd_cmd);
}

// the dynamic linker, responsible for loading and
// linking shared libraries at runtime
utils::cmd("cp /lib64/ld-linux-x86-64.so.2 /tmp/aegis/lower/lib64/");

// python executable
utils::cmd("cp /usr/bin/python3 /tmp/aegis/lower/usr/bin/");

// dns config
utils::cmd("mkdir -p /tmp/aegis/lower/etc");
utils::cmd("cp /etc/resolv.conf /tmp/aegis/lower/etc/");

// merged becomes the combined filesystem view
std::string overlay_mount = "mount -t overlay overlay "
                            "-o lowerdir=/tmp/aegis/lower,"
                            "upperdir=/tmp/aegis/upper,"
                            "workdir=/tmp/aegis/work "
                            "/tmp/aegis/merged";
utils::cmd(overlay_mount);
```

Here I copy just enough files into the lowerdir to run some basic commands inside the container.
The workdir is scratch space for the kernel to perform copy-on-write and other operations. Think of it as a buffer where the kernel temporarily stages changes before writing them to the upperdir.
workdir has to be empty otherwise the mount will fail.
4. Mount propagation
By default Linux mounts can be shared.
That means:
If a mount event happens in one place, it can propagate (spread) to other mount namespaces.
So even though containers use mount namespaces, mounts can still “talk” to each other if propagation is shared.
Imagine your container mounts something at /mnt/data, but that mount shows up on the host system too. That's mount propagation leaking across boundaries.
We want our containers to be isolated.
So We control mount propagation modes.
There are mainly 4 types:
1. MS_SHARED
- Mount events propagate both ways
- Container ↔ Host
2. MS_PRIVATE (what we're using)
- No propagation at all
- Fully isolated
3. MS_SLAVE
- One-way propagation
- Host → Container (but not reverse)
4. MS_UNBINDABLE
- Cannot be bind-mounted
```cpp
// make this mount tree private to the outer environment;
// MS_REC applies the flag recursively (MS_ = mount flags)
if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1) {
    perror("mount private root failed");
    exit(1);
}
```

This call modifies the mount propagation behavior of the root /. MS_REC recursively applies the MS_PRIVATE flag to the entire filesystem tree; without it, / would be private but /proc and /sys might not be.
This makes the filesystem tree completely private to the outside world: mount events inside it don't appear on the host or in any other namespace.
This is exactly what docker does under the hood.
5. pivot_root vs chroot
I mentioned chroot is used to change the root directory of a process, so the container believes a specific directory as the root of its filesystem.
But in production systems chroot has some risks.
chroot only changes the filesystem view for the process; it doesn't change the underlying mount tree, so a privileged process can break back out to the real root.
In production systems we cannot allow a process running in a container environment to access the host.
Imagine a process running inside a container getting access to the host's SSH keys. (That's a huge problem.)
So we use pivot_root.
let's better understand pivot_root with code
```cpp
// pivot_root has a requirement: since it changes mount points, the new
// root must itself be a mount point. So bind-mount the directory onto
// itself so the kernel treats it as a mount point instead of just a
// normal directory
mount(config_.rootfs.c_str(), config_.rootfs.c_str(), NULL, MS_BIND, NULL);

// temporary dir where pivot_root will put the old host root
mkdir((config_.rootfs + "/old_root").c_str(), 0755);

// pivot root
syscall(SYS_pivot_root, config_.rootfs.c_str(),
        (config_.rootfs + "/old_root").c_str());

// unmount the old host root for safety reasons
umount2("/old_root", MNT_DETACH);
rmdir("/old_root");
```

pivot_root fixes the chroot problem by swapping the root filesystem at the kernel level: the new root becomes /, and the old host root gets moved to /old_root.
Now we can manually unmount old_root from this namespace; this way there's no way for a process running inside the container to access the host.
6. iptables and networking setup for containers using a bridge network
A container has its own network namespace, so it thinks of itself as a standalone host. But in reality it doesn't have a NIC, and it doesn't know how to talk to the public internet.
I'll use an analogy here: think of your container as your mobile phone and the host system as your home router. Your mobile phone doesn't know how to talk to the internet unless you connect it to your home Wi-Fi.
pls don't say ill turn on mobile data 😭🙏
The mobile phone and your router are connected on a local network; both have private IPs to talk to each other, like 10.0.0.1 and 10.0.0.2.
But you cannot send a request to the public internet from a private IP.
- well, you can send it, but you won't get any response back
So before sending the request out to the internet, your router rewrites the packet with its own public IP and then sends it out. On the way back it intercepts the response packet, and it knows it has to forward it to your mobile phone.
The exact same concept applies here, except the container is the mobile phone and your host system is the router.
request flow:
request flow:

container → host → internet → host → container

To connect our host and containers we use something called virtual ethernet (veth) pairs.
Let's understand this better with code:
```cpp
// ======= PARENT SIDE =======
cmd("ip link add " + veth_host + " type veth peer name " + veth_container); // create a cable
cmd("ip link set " + veth_container + " netns " + std::to_string(pid));     // push one end into the container
cmd("ip link add name " + bridge + " type bridge");                         // create a bridge
cmd("ip link set " + veth_host + " master " + bridge);                      // attach the host veth to the bridge

// power on both interfaces
cmd("ip link set " + veth_host + " up");
cmd("ip link set " + bridge + " up");

// enable IP forwarding, so the host acts like a router
// between the container and the internet
cmd("sudo sysctl -w net.ipv4.ip_forward=1");

// assign an IP address to the bridge: it's going to act as the gateway
// that routes requests back and forth between container and host,
// so give it a private IP
cmd("ip addr add 10.0.0.1/24 dev " + bridge);

// allow traffic from our bridge to the main network interface
cmd("iptables -A FORWARD -i " + bridge + " -j ACCEPT");
// allow established traffic to come back from the main interface to our bridge
cmd("iptables -A FORWARD -o " + bridge +
    " -m state --state RELATED,ESTABLISHED -j ACCEPT");

// for every network request coming from the container, rewrite the
// container IP with the host IP so the response actually reaches the host;
// the internet doesn't know how to send a response to a private IP
cmd("sudo iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o wlan0 -j MASQUERADE");
```

```cpp
// ======= CHILD SIDE =======
utils::cmd("ip link set lo up");                 // bring up the loopback interface
utils::cmd("ip link set veth1 up");
utils::cmd("ip addr add 10.0.0.2/24 dev veth1");
utils::cmd("ip route add default via 10.0.0.1");
```

I said containers are connected to the host. But technically these veth pairs are connected to something called a bridge.
A bridge acts like the network switches we see in offices. Suppose you run 10 containers: all 10 veth pairs connect to this bridge.
We create a virtual ethernet pair, think about this like a physical ethernet cable
- one end is plugged inside your host's bridge
- and other one is plugged inside container
Next we turn on these devices (the bridge and the veths) and assign private IPs to both our bridge and the container so they can talk to each other on the local network.
By default Linux doesn't act like a router forwarding packets, so we turn that on with this command:

```shell
sudo sysctl -w net.ipv4.ip_forward=1
```

This acts like a kernel-level switch: it tells the kernel to allow forwarding packets between interfaces (here wlan0 and the veth).
Remember I said the router rewrites the IP to its public IP before sending the packet out? That is achieved with this command:

```shell
sudo iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o wlan0 -j MASQUERADE
```

iptables is a Linux command that lets us modify network firewall rules.
The -j MASQUERADE target tells your host to replace the container's IP with the host's IP.
Before:

```
SRC: 10.0.0.2 (container)   DST: google.com
```

After MASQUERADE:

```
SRC: 192.168.1.5 (host)     DST: google.com
```

7. Race condition between parent and child while setting up network
We now know one end of the veth pair lives in the parent and the other end in the child.
After connecting them, both ends need to be turned on to actually be used.
But if the child process tries to use the veth pair before the parent has actually created it, it fails with an error.
So we want the child to pause execution until the parent completes its network setup. For that we use pipe(), the same concept as channels in Go.
```cpp
// ===== PARENT SIDE =====
// the child needs to wait until the parent sets up the network;
// use a pipe to block the child and notify it once setup is complete
int fd[2];
pipe(fd);

// ...after the network setup steps...
write(fd[1], "x", 1); // send a byte through the pipe to let the child know setup is done
```

```cpp
// ===== CHILD SIDE =====
char buf;
read(args->read_fd, &buf, 1); // BLOCK until a byte is read
```

This way the child blocks at read() until the parent writes to the pipe; only then does execution continue.
Containers are so damn cool once you stop thinking of them as just a black box and start thinking about how this stuff actually works.
I hope I was able to explain things clearly.
I tried to tone it down as much as I could. Check out the source code; I bet it will be worth your time.
Happy Learning :)