
Part XVI: Containerization and IPC

This part defines how ISLE supports Linux-compatible containerization (namespaces, cgroups), POSIX Inter-Process Communication (IPC), and the Linux credential model, mapping these constructs to ISLE's native capability and scheduling primitives.

Type Definitions Used in This Part

/// Unique identifier for a schedulable task within the kernel.
/// Globally unique, never reused (monotonically increasing from boot).
/// Used for PID translation in PID namespaces.
pub type TaskId = u64;

/// Handle to a physical page frame. Wraps the frame number.
/// Used by pipe buffers for zero-copy page gifting via vmsplice().
pub struct PhysPage {
    /// Physical frame number (PFN).
    pub pfn: u64,
}

/// Wait queue head for blocking operations.
/// Used by pipe buffers to block readers/writers.
/// Defined in Section 10.3 (isle-core/src/sync/wait.rs).
pub struct WaitQueueHead { /* see Section 10.3 */ }

/// Namespace type enumeration for hierarchy tracking.
/// ISLE implements all 8 Linux namespace types (see §12a.6).
#[repr(u32)]
pub enum NamespaceType {
    Pid = 0,    // CLONE_NEWPID
    Net = 1,    // CLONE_NEWNET
    Mnt = 2,    // CLONE_NEWNS
    Uts = 3,    // CLONE_NEWUTS
    Ipc = 4,    // CLONE_NEWIPC
    User = 5,   // CLONE_NEWUSER
    Cgroup = 6, // CLONE_NEWCGROUP
    Time = 7,   // CLONE_NEWTIME (Linux 5.6+)
}

Note on Capability<T> syntax: This document uses Capability<NetStack> and Capability<VfsNode> as type hints indicating what resource a capability references. The underlying Capability struct (Section 11.1) is non-generic; the target type is determined by the object_id field. This notation is for documentation clarity only.

63. Namespace Architecture

Linux namespaces isolate global system resources. In ISLE, namespaces are not primitive kernel objects; rather, they are synthesized from ISLE's native Capability Domains (Section 11) and Virtual Filesystem (VFS) mounts.

63.1 Capability Domain Mapping

When a process creates a new namespace via clone(CLONE_NEW*) or unshare(), ISLE allocates a new Capability Domain or modifies the existing one:

  • CLONE_NEWPID (PID Namespace): Creates a new PID translation table in the process's Capability Domain. The isle-compat layer translates local PIDs (e.g., PID 1) to global ISLE task IDs.
  • CLONE_NEWNET (Network Namespace): Creates an isolated network stack instance:
      ◦ The new namespace has no network interfaces except lo (loopback, 127.0.0.1 on 127.0.0.0/8).
      ◦ There is no connectivity to the host or external network unless explicitly configured.
      ◦ Network interfaces (physical NICs, VETH pairs, bridges, VLANs) are owned by exactly one namespace and cannot be accessed from other namespaces.
      ◦ Each namespace has its own routing table, iptables/nftables rules, and socket port space.
      ◦ Per-namespace network state is defined below; the isle-net subsystem (Sections 34-36) implements the network stack that operates within these namespace boundaries:
/// Per-namespace network state.
pub struct NetNamespace {
    /// Namespace ID (unique across the system).
    pub ns_id: u64,

    /// Network interfaces owned by this namespace.
    /// Interface names are scoped to the namespace (eth0 in NS1 != eth0 in NS2).
    /// Key is a fixed-size interface name (IFNAMSIZ = 16 bytes, matching Linux).
    pub interfaces: RwLock<BTreeMap<InterfaceName, Arc<NetInterface>>>,

    /// Loopback interface (always present, cannot be deleted).
    pub loopback: Arc<NetInterface>,

    /// Routing table (per-namespace, not shared).
    pub routes: RwLock<RouteTable>,

    /// Firewall rules (iptables/nftables equivalent).
    /// Rules are scoped to this namespace only.
    pub firewall: RwLock<FirewallRules>,

    /// Port allocation bitmap (per-namespace).
    /// Allows the same port number to be bound in different namespaces.
    pub port_allocator: Mutex<PortAllocator>,

    /// Capability to this network stack (for delegation).
    /// Processes in this namespace implicitly hold this capability.
    pub stack_cap: Capability<NetStack>,
}

/// Fixed-size interface name (matching Linux IFNAMSIZ = 16).
/// Prevents unbounded heap allocation and OOM attacks via long names.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct InterfaceName([u8; 16]);
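As a hedged illustration of the bounded-name property, an interface name could be built through a fallible constructor like the sketch below. The `new` helper and its error strings are illustrative, not ISLE API; the struct is restated (with a `Debug` derive added) so the block is self-contained.

```rust
/// Fixed-size interface name (IFNAMSIZ = 16, i.e., up to 15 name bytes
/// plus a NUL terminator). Rejecting oversized names at construction time
/// is what prevents unbounded heap allocation.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash, Debug)]
pub struct InterfaceName([u8; 16]);

impl InterfaceName {
    /// Hypothetical constructor: validates instead of truncating.
    pub fn new(name: &str) -> Result<Self, &'static str> {
        let bytes = name.as_bytes();
        if bytes.is_empty() || bytes.len() > 15 {
            return Err("interface name must be 1..=15 bytes");
        }
        if bytes.contains(&0) {
            return Err("interface name must not contain NUL");
        }
        let mut buf = [0u8; 16]; // remaining bytes stay NUL (terminator + padding)
        buf[..bytes.len()].copy_from_slice(bytes);
        Ok(InterfaceName(buf))
    }
}
```

Because the name is `Copy` and fixed-size, it can serve directly as a `BTreeMap` key without per-lookup allocation.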

VETH pairs for inter-namespace connectivity: A VETH (virtual ethernet) pair connects two namespaces. Creation:

    ip link add veth0 type veth peer name veth1
    ip link set veth1 netns <target-namespace>

In ISLE, this creates two virtual interfaces that are cross-linked:
  • veth0 in the caller's namespace
  • veth1 in the target namespace
  • Packets sent to one end appear on the other (like a virtual patch cable)
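The cross-linked "virtual patch cable" behavior can be sketched with std channels standing in for the interface packet queues. `VethEnd` and `veth_pair` are illustrative names for this sketch, not ISLE API.

```rust
use std::sync::mpsc::{channel, Receiver, Sender, TryRecvError};

/// One end of a veth pair: frames sent here pop out of the peer end.
pub struct VethEnd {
    tx_to_peer: Sender<Vec<u8>>,
    rx_from_peer: Receiver<Vec<u8>>,
}

impl VethEnd {
    pub fn send(&self, frame: Vec<u8>) {
        // A dropped peer is ignored in this sketch (real code would count it).
        let _ = self.tx_to_peer.send(frame);
    }

    pub fn recv(&self) -> Option<Vec<u8>> {
        match self.rx_from_peer.try_recv() {
            Ok(f) => Some(f),
            Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => None,
        }
    }
}

/// Create the cross-linked pair: end A's TX feeds end B's RX and vice versa.
pub fn veth_pair() -> (VethEnd, VethEnd) {
    let (a_tx, b_rx) = channel();
    let (b_tx, a_rx) = channel();
    (
        VethEnd { tx_to_peer: a_tx, rx_from_peer: a_rx },
        VethEnd { tx_to_peer: b_tx, rx_from_peer: b_rx },
    )
}
```

Moving one end into another namespace would then just move its `VethEnd` handle into that namespace's interface table; the cross-link itself is unchanged.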

Container networking flow:

  1. Container runtime creates a new network namespace for the container
  2. Creates a VETH pair: one end in the host namespace (e.g., veth0), one in the container (e.g., eth0)
  3. Host end is attached to a bridge (e.g., docker0, cni0) for external connectivity
  4. Container end is assigned an IP from the bridge's subnet
  5. NAT/masquerading rules on the host allow container → external traffic
  6. Port forwarding rules map host ports → container ports

  • CLONE_NEWNS (Mount Namespace): Creates a private copy of the VFS mount tree for the process. Changes to this tree do not affect the parent domain unless explicitly marked shared.
  • CLONE_NEWUTS (UTS Namespace): Creates an isolated hostname/domainname state. Stored as a reference-counted UtsNamespace struct in the process's NamespaceSet (see §63.2).
  • CLONE_NEWIPC (IPC Namespace): Isolates System V IPC objects and POSIX message queues.
  • CLONE_NEWUSER (User Namespace): Creates a new UID/GID mapping table within the Capability Domain.
  • CLONE_NEWCGROUP (Cgroup Namespace): Makes the process's current cgroup the root of its cgroup view (see §63.2 and §64.1).
  • CLONE_NEWTIME (Time Namespace): Creates isolated offsets for CLOCK_MONOTONIC and CLOCK_BOOTTIME. The container sees its own "boot time" starting from zero, independent of the host's actual boot time. The TimeNamespace struct with offset fields is defined in §63.2 below.

63.2 Namespace Implementation

Namespaces are implemented entirely within the isle-compat layer. The core microkernel (isle-core) is unaware of namespaces; it only understands Capability Domains and object access rights.

pub struct NamespaceSet {
    /// PID translation table (Local PID -> Global Task ID)
    pub pid_map: RwLock<BTreeMap<u32, TaskId>>,

    /// Pending PID namespace for future children (set by setns(CLONE_NEWPID)).
    /// When set, fork()/clone() creates children in this namespace rather than
    /// the current process's PID namespace. The process's own PID is unchanged.
    pub pending_pid_ns: Option<Arc<PidNamespace>>,

    /// Mount tree root for this namespace
    pub vfs_root: Capability<VfsNode>,

    /// UID/GID translation mappings
    pub id_map: RwLock<IdMapping>,

    /// Network stack instance capability
    pub net_stack: Capability<NetStack>,

    /// UTS namespace state (hostname, domainname).
    pub uts_state: Arc<UtsNamespace>,

    /// IPC namespace (SysV semaphores, message queues, shared memory).
    pub ipc_state: Arc<IpcNamespace>,

    /// Cgroup namespace (cgroup root view).
    pub cgroup_state: Arc<CgroupNamespace>,

    /// Time namespace offsets (CLOCK_MONOTONIC, CLOCK_BOOTTIME).
    pub time_state: Arc<TimeNamespace>,

    /// Pending time namespace for future children (set by setns(CLONE_NEWTIME)).
    /// When set, fork()/clone() creates children with the target time offsets.
    /// Follows Linux 5.8+ semantics where CLONE_NEWTIME affects children only.
    pub pending_time_ns: Option<Arc<TimeNamespace>>,

    /// Tracks whether a non-user namespace has been joined in the current
    /// setns() sequence. Used to enforce that CLONE_NEWUSER must be joined
    /// before other namespace types. Reset after user NS switch.
    pub joined_non_user_ns_in_sequence: bool,
}
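The PID translation table can be sketched as follows. `PidMap`, `register`, and `resolve` are hypothetical helper names for illustration; the `RwLock<BTreeMap<u32, TaskId>>` layout matches the `pid_map` field above.

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

/// Global, never-reused task identifier (see the type definitions above).
pub type TaskId = u64;

/// Sketch of the NamespaceSet::pid_map translation table.
pub struct PidMap {
    map: RwLock<BTreeMap<u32, TaskId>>,
}

impl PidMap {
    pub fn new() -> Self {
        Self { map: RwLock::new(BTreeMap::new()) }
    }

    /// Assign the next free local PID to a task. Numbering starts at 1,
    /// so the first task registered in a fresh PID namespace becomes its init.
    pub fn register(&self, task: TaskId) -> u32 {
        let mut map = self.map.write().unwrap();
        let local = map.keys().next_back().map_or(1, |last| last + 1);
        map.insert(local, task);
        local
    }

    /// Translate a namespace-local PID to the global TaskId.
    pub fn resolve(&self, local_pid: u32) -> Option<TaskId> {
        self.map.read().unwrap().get(&local_pid).copied()
    }
}
```

A syscall like kill(pid, sig) issued inside the namespace would go through `resolve` before touching the global task table; an unmapped PID yields ESRCH.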

/// UTS namespace state.
pub struct UtsNamespace {
    /// Hostname (max 64 bytes, NUL-terminated).
    pub hostname: Mutex<[u8; 65]>,
    /// NIS domain name (max 64 bytes, NUL-terminated).
    pub domainname: Mutex<[u8; 65]>,
}
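A minimal sketch of sethostname() against this buffer, restated with only the hostname field so the block is self-contained. The 64-byte limit and the ENAMETOOLONG value mirror Linux; the `sethostname` method is an illustrative helper, not ISLE API.

```rust
use std::sync::Mutex;

pub const ENAMETOOLONG: i32 = 36;

/// UTS namespace state, reduced to the hostname for this sketch.
pub struct UtsNamespace {
    /// Hostname (max 64 bytes, NUL-terminated; the 65th byte guarantees
    /// the terminator even for a maximum-length name).
    pub hostname: Mutex<[u8; 65]>,
}

impl UtsNamespace {
    pub fn sethostname(&self, name: &[u8]) -> Result<(), i32> {
        if name.len() > 64 {
            return Err(ENAMETOOLONG);
        }
        let mut buf = self.hostname.lock().unwrap();
        *buf = [0u8; 65]; // clear the old name; trailing bytes stay NUL
        buf[..name.len()].copy_from_slice(name);
        Ok(())
    }
}
```

Container runtimes would call this immediately after CLONE_NEWUTS to replace the inherited hostname with the container ID.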

/// IPC namespace state (SysV IPC objects).
pub struct IpcNamespace {
    /// SysV shared memory segments (shmid -> segment).
    pub shm_segments: RwLock<BTreeMap<i32, Arc<ShmSegment>>>,
    /// SysV semaphore sets (semid -> semaphore set).
    pub sem_sets: RwLock<BTreeMap<i32, Arc<SemSet>>>,
    /// SysV message queues (msqid -> queue).
    pub msg_queues: RwLock<BTreeMap<i32, Arc<MsgQueue>>>,
    /// POSIX message queues (name -> queue).
    pub posix_mqueues: RwLock<BTreeMap<String, Arc<PosixMqueue>>>,
}

/// Cgroup namespace state.
pub struct CgroupNamespace {
    /// Root cgroup directory visible to processes in this namespace.
    /// Processes see this as "/" in /sys/fs/cgroup.
    pub cgroup_root: Arc<Cgroup>,
}

/// Time namespace state (Linux 5.6+).
pub struct TimeNamespace {
    /// Offset added to CLOCK_MONOTONIC for processes in this namespace.
    /// Allows containers to see a "boot time" starting from 0.
    pub monotonic_offset_ns: AtomicI64,
    /// Offset added to CLOCK_BOOTTIME for processes in this namespace.
    pub boottime_offset_ns: AtomicI64,
}
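How the offset is applied can be sketched as below. `for_new_container` and `clock_monotonic` are illustrative helpers; the host clocksource read is passed in as a parameter to keep the sketch self-contained.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Time namespace state, reduced to the monotonic offset for this sketch.
pub struct TimeNamespace {
    pub monotonic_offset_ns: AtomicI64,
}

impl TimeNamespace {
    /// A fresh container typically sees monotonic time starting near zero:
    /// offset = -(host monotonic time at namespace creation).
    pub fn for_new_container(host_now_ns: i64) -> Self {
        Self { monotonic_offset_ns: AtomicI64::new(-host_now_ns) }
    }

    /// clock_gettime(CLOCK_MONOTONIC) for a process in this namespace:
    /// the raw host reading plus the per-namespace offset.
    pub fn clock_monotonic(&self, host_monotonic_ns: i64) -> i64 {
        host_monotonic_ns + self.monotonic_offset_ns.load(Ordering::Relaxed)
    }
}
```

CLOCK_BOOTTIME would follow the same pattern with `boottime_offset_ns`.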

63.2.1 Container Root Filesystem: pivot_root(2)

Container runtimes (runc, containerd, crun) require a mechanism to change the root filesystem after setting up the mount namespace. ISLE implements the standard pivot_root(2) syscall:

/// pivot_root(new_root: &CStr, put_old: &CStr) -> Result<()>
///
/// Atomically swaps the root mount with another mount point. Required for
/// OCI-compliant container creation.
///
/// # Prerequisites (checked by syscall)
/// - new_root must be a mount point
/// - put_old must be at or under new_root
/// - Caller must be in a mount namespace (CLONE_NEWNS or unshare(CLONE_NEWNS))
/// - Caller must have CAP_SYS_ADMIN in its user namespace
///
/// # Operation
/// 1. Attach new_root to the root of the mount namespace
/// 2. Move the old root to put_old
/// 3. The process's root directory is now new_root
/// 4. Subsequent umount(put_old) removes the old root from the namespace
///
/// # Container Runtime Usage
/// ```
/// // Standard OCI container creation sequence:
/// unshare(CLONE_NEWNS);                    // New mount namespace
/// mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL);  // Make all private
/// mount("/var/lib/container/rootfs", "/var/lib/container/rootfs",
///       NULL, MS_BIND | MS_REC, NULL);     // Bind-mount rootfs onto itself
/// pivot_root("/var/lib/container/rootfs", "/var/lib/container/rootfs/.oldroot");
/// chdir("/");                              // Ensure we're in new root
/// umount2("/.oldroot", MNT_DETACH);        // Detach old root
/// // Process now sees container rootfs as /
/// ```
///
/// # Difference from chroot(2)
/// pivot_root is fundamentally different from chroot:
/// - chroot only affects the process's view of the root directory
/// - pivot_root actually moves the mount point, affecting all processes in the namespace
/// - chroot can be escaped via mount namespace tricks; pivot_root cannot
/// - Container runtimes MUST use pivot_root for secure isolation
///
/// # Error codes
/// - EBUSY: new_root is not a mount point, or put_old is not under new_root
/// - EINVAL: new_root and put_old are the same
/// - ENOENT: path component does not exist
/// - ENOTDIR: path component is not a directory
/// - EPERM: Caller lacks CAP_SYS_ADMIN, or not in mount namespace
/// - ENOSYS: Not implemented (will not occur in ISLE)
SYSCALL_DEFINE2(pivot_root, const char __user *, new_root, const char __user *, put_old)

Interaction with other namespaces:

  • pivot_root operates on the caller's mount namespace
  • The root change is visible to all processes sharing that mount namespace
  • Combined with a PID namespace: the container's init (PID 1) sees only the new root
  • Combined with a user namespace: unprivileged processes can pivot_root within their own user namespace if they have CAP_SYS_ADMIN there

Implementation notes: The VFS layer (Section 27a) handles the mount tree manipulation. The mount_lock referenced below is a per-namespace Mutex<()> that serializes mount tree modifications (mount, unmount, pivot_root). It sits at lock hierarchy level MOUNT_LOCK (below VFS_LOCK, above DENTRY_LOCK).

  1. Lookup new_root and verify it's a mount point
  2. Lookup put_old and verify it's under new_root
  3. Lock the mount tree for modification (holds mount_lock)
  4. Detach the current root from the namespace's mount list
  5. Attach new_root as the new namespace root
  6. Reattach the old root at put_old position
  7. Publish the new root via RCU: rcu_assign_pointer(namespace->root, new_root)
  8. Unlock the mount tree

Atomicity with respect to path lookups: Steps 4–6 are performed while holding mount_lock, and the old root pointer remains valid in the RCU-published slot until step 7 overwrites it. Path lookups (open(), stat(), readlink(), etc.) take an RCU read-side reference to the namespace root at the start of lookup via rcu_dereference(namespace->root). This ensures:

  • In-flight path lookups that started before pivot_root complete with the old root (consistent view)
  • New path lookups that start after step 7 see the new root
  • No path lookup can see a partially updated state (no torn reads, no null pointers)
  • Between steps 4–6, the data structures are modified under mount_lock, but lookups still see the old root via RCU

The RCU grace period after step 7 ensures that by the time umount(put_old) completes, no in-flight lookups hold references to the old root.
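Since ISLE's RCU primitives are out of scope here, the same publish/snapshot discipline can be modeled with std Arc handles: readers clone the root pointer once at lookup start, the writer swaps it under the mount lock, and the Arc refcount plays the grace-period role of keeping the old root alive for in-flight lookups. `MountNamespace`, `root_snapshot`, and `publish_root` are illustrative names for this sketch only.

```rust
use std::sync::{Arc, Mutex};

/// Arc-snapshot analogue of the RCU root publication.
/// Arc<String> stands in for Arc<Mount>; the Mutex stands in for mount_lock.
pub struct MountNamespace {
    root: Mutex<Arc<String>>,
}

impl MountNamespace {
    pub fn new(root: &str) -> Self {
        Self { root: Mutex::new(Arc::new(root.to_string())) }
    }

    /// Analogue of rcu_dereference(namespace->root): one consistent snapshot
    /// taken at lookup start; the rest of the lookup runs lock-free on it.
    pub fn root_snapshot(&self) -> Arc<String> {
        Arc::clone(&self.root.lock().unwrap())
    }

    /// Analogue of rcu_assign_pointer in step 7: new lookups see the new
    /// root, while existing snapshots keep the old one alive.
    pub fn publish_root(&self, new_root: &str) {
        *self.root.lock().unwrap() = Arc::new(new_root.to_string());
    }
}
```

When the last snapshot of the old root is dropped, its Arc frees it, which is the moment umount(put_old) can safely complete.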

63.2.2 Joining Namespaces: setns(2) and nsenter

Container operations like docker exec require joining an existing namespace. ISLE implements setns(2) for this purpose:

/// setns(fd: RawFd, nstype: c_int) -> Result<()>
///
/// Reassociates the calling thread with the namespace referenced by fd.
///
/// # Parameters
/// - fd: File descriptor referring to a namespace (obtained from /proc/[pid]/ns/[type])
/// - nstype: Namespace type (CLONE_NEW* constant) or 0 to auto-detect from fd
///
/// # Prerequisites
/// - Caller must have CAP_SYS_ADMIN in the target namespace's owning user namespace
/// - For PID namespaces: No restriction (affects future children only, per Linux 3.8+)
/// - For user namespaces: Caller must not be in a chroot environment
/// - The namespace must still exist (owning process hasn't exited)
///
/// # Container Runtime Usage (docker exec)
/// ```
/// // Join a running container's namespaces:
/// int fd = open("/proc/[container_pid]/ns/mnt", O_RDONLY | O_CLOEXEC);
/// setns(fd, CLONE_NEWNS);  // Join mount namespace
/// close(fd);
///
/// fd = open("/proc/[container_pid]/ns/net", O_RDONLY | O_CLOEXEC);
/// setns(fd, CLONE_NEWNET); // Join network namespace
/// close(fd);
///
/// // PID namespace must be joined via clone(), not setns()
/// // (kernel limitation: can't change PID namespace of running process)
/// // exec() into container: now running in container's namespaces
/// execve("/bin/sh", ["/bin/sh"], envp);
/// ```
///
/// # User Namespace Ordering
/// When joining multiple namespaces, joining the user namespace changes the
/// caller's effective capabilities, so it must happen before the joins that
/// those capabilities authorize. ISLE therefore requires that, within a
/// setns() sequence, the user namespace (if any) be joined first: once a
/// non-user namespace has been joined, a subsequent setns(CLONE_NEWUSER)
/// fails with EINVAL until the sequence resets (see the
/// joined_non_user_ns_in_sequence flag in §63.2). This is stricter than
/// Linux's single-fd setns(), which leaves the ordering to the caller.
///
/// # Namespace file descriptors
/// Each namespace type is exposed via /proc/[pid]/ns/:
/// ```
/// /proc/[pid]/ns/cgroup         → Cgroup namespace
/// /proc/[pid]/ns/ipc            → IPC namespace
/// /proc/[pid]/ns/mnt            → Mount namespace
/// /proc/[pid]/ns/net            → Network namespace
/// /proc/[pid]/ns/pid            → PID namespace (current)
/// /proc/[pid]/ns/pid_for_children → PID namespace for future children (after setns)
/// /proc/[pid]/ns/time           → Time namespace (Linux 5.6+)
/// /proc/[pid]/ns/time_for_children → Time namespace for future children (Linux 5.8+)
/// /proc/[pid]/ns/user           → User namespace
/// /proc/[pid]/ns/uts            → UTS namespace
/// ```
///
/// The `*_for_children` symlinks reveal the pending namespace set by
/// `setns(CLONE_NEWPID)` or `setns(CLONE_NEWTIME)`. They differ from the
/// main symlinks when a process has called `setns()` but not yet forked.
/// Container introspection tools (`lsns`, `nsenter --target`) use these.
///
/// These are magic links: reading them returns the namespace type, and
/// opening them gives a file descriptor that can be passed to setns().
///
/// # Error codes
/// - EBADF: Invalid fd
/// - EINVAL: fd does not refer to a namespace, nstype doesn't match fd type,
///           or (for PID namespace) caller has other threads in its thread group
/// - EPERM: Caller lacks CAP_SYS_ADMIN in target namespace's user namespace
/// - ENOENT: Namespace has been destroyed
SYSCALL_DEFINE2(setns, int, fd, int, nstype)

PID namespace special case: A process cannot change its own PID namespace via setns() — the process's PID in its original namespace remains unchanged. However, setns(fd, CLONE_NEWPID) is valid since Linux 3.8: it sets the PID namespace for future children created by fork()/clone(). The caller's own PID view is unchanged, but newly created children will be in the target PID namespace.

This is why docker exec uses nsenter with --fork flag: it joins other namespaces via setns(), sets the target PID namespace for children, then forks a child that inherits all joined namespaces and has the correct PID view.

Implementation:

fn sys_setns(fd: RawFd, nstype: c_int) -> Result<()> {
    let file = current_task().fd_table.get(fd)?;
    let ns_inode = file.inode.downcast_ref::<NsInode>()
        .ok_or(Errno::EINVAL)?;

    // Verify nstype matches (if specified)
    if nstype != 0 && ns_inode.nstype != nstype {
        return Err(Errno::EINVAL);
    }

    // Check CAP_SYS_ADMIN in the target namespace's owning user namespace.
    // This runs before any state mutation so a failed call leaves the
    // sequence-tracking flag untouched.
    let target_user_ns = ns_inode.namespace.user_ns.upgrade().ok_or(Errno::ENOENT)?;
    if !has_cap_sys_admin_in(current_task(), &target_user_ns) {
        return Err(Errno::EPERM);
    }

    // Enforce user namespace ordering: within a setns() sequence,
    // CLONE_NEWUSER must be joined before any other namespace type
    // (see joined_non_user_ns_in_sequence in NamespaceSet).
    if ns_inode.nstype == CLONE_NEWUSER
        && current_task().ns_state.joined_non_user_ns_in_sequence
    {
        return Err(Errno::EINVAL);
    }

    // Join the namespace (type-specific switch)
    match ns_inode.nstype {
        CLONE_NEWNS => current_task().ns_state.switch_mount(&ns_inode.namespace),
        CLONE_NEWNET => current_task().ns_state.switch_net(&ns_inode.namespace),
        CLONE_NEWUTS => current_task().ns_state.switch_uts(&ns_inode.namespace),
        CLONE_NEWIPC => current_task().ns_state.switch_ipc(&ns_inode.namespace),
        CLONE_NEWUSER => {
            // Chroot'd processes cannot join user namespaces (could escape chroot)
            if current_task().is_chrooted() {
                return Err(Errno::EPERM);
            }
            current_task().ns_state.switch_user(&ns_inode.namespace)?;
            // Reset the flag after the user NS switch: capabilities now come
            // from the new namespace, so a fresh sequence begins.
            current_task().ns_state.joined_non_user_ns_in_sequence = false;
        }
        CLONE_NEWCGROUP => current_task().ns_state.switch_cgroup(&ns_inode.namespace),
        CLONE_NEWTIME => {
            // Time namespace affects future children, not the caller (Linux 5.8+ semantics).
            // Set pending_time_ns so fork()/clone() children use the target time offsets.
            // A type mismatch is a caller error, never a kernel panic.
            current_task().ns_state.pending_time_ns =
                Some(ns_inode.namespace.as_time_ns().ok_or(Errno::EINVAL)?);
        }
        CLONE_NEWPID => {
            // PID namespace affects future children, not the caller.
            // Set pending_pid_ns so fork()/clone() creates children in the target NS.
            current_task().ns_state.pending_pid_ns =
                Some(ns_inode.namespace.as_pid_ns().ok_or(Errno::EINVAL)?);
        }
        _ => return Err(Errno::EINVAL),
    }

    // Record that a non-user namespace has been joined in this sequence.
    if ns_inode.nstype != CLONE_NEWUSER {
        current_task().ns_state.joined_non_user_ns_in_sequence = true;
    }

    Ok(())
}
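How fork()/clone() consumes the pending PID namespace set by setns(CLONE_NEWPID) can be sketched as below; `PidNamespace` is reduced to an ID and `choose_child_pid_ns` is an illustrative helper, not ISLE API.

```rust
use std::sync::Arc;

/// PID namespace, reduced to its identity for this sketch.
pub struct PidNamespace {
    pub ns_id: u64,
}

/// At fork()/clone() time, a pending namespace set by setns(CLONE_NEWPID)
/// wins over the parent's current PID namespace; the parent's own PID view
/// never changes.
pub fn choose_child_pid_ns(
    current: &Arc<PidNamespace>,
    pending: Option<&Arc<PidNamespace>>,
) -> Arc<PidNamespace> {
    match pending {
        // setns(CLONE_NEWPID) was called: children go to the target namespace.
        Some(target) => Arc::clone(target),
        // Otherwise children inherit the parent's PID namespace.
        None => Arc::clone(current),
    }
}
```

This is the mechanism behind the /proc/[pid]/ns/pid_for_children symlink: it reveals the `pending` value, while /proc/[pid]/ns/pid reveals `current`.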

63.3 Namespace Hierarchy and Inheritance

Namespaces form a hierarchical tree with parent-child relationships. When a process creates a new namespace via clone() or unshare(), the new namespace is a child of the caller's namespace:

Root Namespace (init)
  ├── PID NS 1 (container A)          ← child of root PID NS
  │   └── PID NS 1.1 (nested container) ← child of PID NS 1
  ├── PID NS 2 (container B)          ← child of root PID NS
  └── User NS 1 (unprivileged container)
      └── User NS 1.1 (child of User NS 1)

Parent-child link semantics:

/// Per-namespace-type hierarchy tracking.
pub struct NamespaceHierarchy {
    /// Pointer to parent namespace (None for root).
    /// The parent reference is weak (Weak<Namespace>) to prevent reference cycles.
    /// When the parent is dropped (all processes exited), child namespaces
    /// become orphans but remain functional until their own processes exit.
    pub parent: Option<Weak<Namespace>>,

    /// Children of this namespace (weak references).
    /// Weak references allow children to be destroyed independently of the parent.
    /// This matches Linux behavior: a parent namespace can be destroyed while
    /// children still exist (children become orphans but remain functional).
    /// The Vec is cleaned up lazily when iterating (dead Weak refs are removed).
    pub children: Mutex<Vec<Weak<Namespace>>>,

    /// Namespace type (PID, NET, MNT, UTS, IPC, USER, CGROUP, TIME).
    pub ns_type: NamespaceType,

    /// Inode number for this namespace (for /proc/PID/ns/*).
    /// Generated from a global counter, unique across all namespace types.
    pub ns_id: u64,
}
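The lazy cleanup of dead Weak references mentioned above can be sketched as follows; `Node` and `live_children` are illustrative stand-ins for the namespace type and its iteration helper.

```rust
use std::sync::{Arc, Mutex, Weak};

/// Stand-in for a namespace in the hierarchy.
pub struct Node {
    pub ns_id: u64,
}

/// Iterate the children list, dropping entries whose namespace has already
/// been destroyed (their Weak no longer upgrades) and returning the live ones.
pub fn live_children(children: &Mutex<Vec<Weak<Node>>>) -> Vec<Arc<Node>> {
    let mut list = children.lock().unwrap();
    // retain() performs the lazy cleanup; upgrade() yields the survivors.
    list.retain(|w| w.strong_count() > 0);
    list.iter().filter_map(|w| w.upgrade()).collect()
}
```

Because the list holds only Weak references, destroying a child namespace never has to take the parent's lock; the stale entry is simply skipped and reaped on the next traversal.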

Inheritance rules:

| Namespace Type | Child Inherits | Modification Scope |
|---|---|---|
| CLONE_NEWPID | No (child starts fresh with PID 1) | Child's PID 1 = child init process |
| CLONE_NEWNET | No (child gets isolated network stack) | Child has no interfaces except loopback |
| CLONE_NEWNS | Yes (copy-on-write mount tree) | Child's mounts are private unless marked shared |
| CLONE_NEWUTS | Yes (copies parent's hostname/domainname) | Container runtimes typically overwrite via sethostname() |
| CLONE_NEWIPC | No (child gets empty IPC namespace) | Child has isolated SysV/POSIX IPC |
| CLONE_NEWUSER | No (child starts with empty UID/GID mappings) | Parent must write /proc/PID/uid_map and gid_map to grant subordinate ranges |
| CLONE_NEWCGROUP | No (child gets own cgroup root) | Child's cgroup is a child of the caller's cgroup |
| CLONE_NEWTIME | No (child gets zero offsets) | Child's time offsets are independent |

Note on CLONE_NEWUTS: The child namespace initially inherits the parent's hostname and domainname (copy, not reference). Container runtimes (runc, containerd) typically overwrite this immediately with the container ID via sethostname().

Note on CLONE_NEWUSER: A newly created user namespace starts with empty UID/GID mappings — all UIDs/GIDs resolve to nobody/nogroup (65534) until mappings are written to /proc/PID/uid_map and /proc/PID/gid_map by a privileged process in the parent namespace. This is a critical security property: children do not automatically inherit the parent's full UID range. Instead, the parent explicitly grants a subordinate range (typically from /etc/subuid and /etc/subgid).

Namespace reference counting: Each namespace is reference-counted via Arc<Namespace>. A namespace is destroyed only when all of the following hold:

  1. All processes in the namespace have exited (process count → 0)
  2. All file descriptors referring to /proc/PID/ns/* are closed
  3. All bind mounts of the namespace file have been unmounted

Note: Namespace destruction is independent of parent/child relationships. A child namespace can outlive its parent (it becomes an orphan but remains functional), and a parent can be destroyed while children exist.

63.4 User Namespace UID/GID Mapping Security

User namespaces allow unprivileged users to have "root" (UID 0) within a namespace while mapping to an unprivileged UID outside. This is the foundation of rootless containers.

Security model:

/// UID/GID mapping for a user namespace.
/// Maps UIDs inside the namespace to UIDs outside (in the parent namespace).
pub struct IdMapping {
    /// UID mappings: (inner_uid_start, outer_uid_start, count)
    /// Example: (0, 100000, 65536) maps inner UID 0-65535 → outer UID 100000-165535
    pub uid_map: Vec<IdMapEntry>,

    /// GID mappings: (inner_gid_start, outer_gid_start, count)
    pub gid_map: Vec<IdMapEntry>,

    /// Whether the mapping is "one-to-one" (identity mapping).
    /// This is only allowed for privileged processes (CAP_SETUID/CAP_SETGID).
    pub is_identity: bool,

    /// /proc/PID/uid_map write restrictions:
    /// - Writer must be in the parent user namespace OR have CAP_SETUID in parent
    /// - Writer must have CAP_SYS_ADMIN in the target namespace (or be the target)
    /// - Mapped UIDs must be valid in the parent namespace
    /// - Total mapped range cannot exceed /proc/sys/kernel/uid_max (or configured limit)
}

pub struct IdMapEntry {
    pub inner_start: u32,
    pub outer_start: u32,
    pub count: u32,
}
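The inner-to-outer translation over these ranges can be sketched as follows; `map_uid` is an illustrative helper. An unmapped UID resolves to None, which callers would report as the overflow UID 65534 (nobody), per the CLONE_NEWUSER note in §63.3.

```rust
/// One UID mapping range: inner [inner_start, inner_start + count)
/// maps to outer [outer_start, outer_start + count).
pub struct IdMapEntry {
    pub inner_start: u32,
    pub outer_start: u32,
    pub count: u32,
}

/// Translate a namespace-inner UID to the parent-namespace (outer) UID.
pub fn map_uid(entries: &[IdMapEntry], inner: u32) -> Option<u32> {
    entries.iter().find_map(|e| {
        // Offset of `inner` within the range, if it falls inside it.
        inner
            .checked_sub(e.inner_start)
            .filter(|off| *off < e.count)
            .map(|off| e.outer_start + off)
    })
}
```

With the example mapping (0, 100000, 65536), inner root (UID 0) is outer UID 100000: "root in the container" is an ordinary unprivileged UID on the host.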

Capability interactions with user namespaces:

  • A process with UID 0 inside a user namespace has full capabilities within that namespace
  • Capabilities are NOT granted against resources owned by ancestor namespaces
  • Example: A process with "root" in User NS 1 cannot mount() a filesystem from the host
  • The cap_effective mask is computed at syscall entry time based on:
  • The process's current UID within its user namespace
  • The target object's owning user namespace
  • The intersection of the process's capability bounding set with capabilities valid for the target

Determining the owning user namespace for kernel objects:

| Object Type | Owning User Namespace | Mechanism |
|---|---|---|
| File (VFS inode) | User namespace of the mount | Each mount has mnt_user_ns set at mount time; files inherit from their mount |
| Socket | User namespace of the creating process | Stored in sock->sk_user_ns at socket creation |
| IPC object (shm, sem, msg) | User namespace of the creating IPC namespace | IPC namespace → user namespace mapping recorded at IPC NS creation |
| Capability token | User namespace of the issuing process | Stored in the capability header |
| Process (for signals) | User namespace of the process | Stored in task_struct->user_ns |
| Device node | User namespace of the initial mount | Device nodes are always in the initial namespace |

cap_effective computation algorithm:

The effective capability set for a process operating on an object is the intersection of:

  1. The process's current effective capabilities (cap_effective)
  2. The capabilities valid for the target object's namespace

This ensures that a process which has dropped capabilities via capset() does not regain them when accessing child namespace objects.

compute_effective_caps(process, object):
  1. proc_ns = process.user_namespace
  2. obj_ns = object.owning_user_namespace
  3. proc_caps = process.cap_effective  // NOT cap_bounding — use current effective set

  4. // Check if process's NS is an ancestor of object's NS (or same NS)
  5. if is_same_or_ancestor(proc_ns, obj_ns):
  6.     // Process is in a parent (or same) namespace — capabilities apply
  7.     // Return intersection of process's effective caps and caps valid for target
  8.     return intersection(proc_caps, capabilities_valid_for(obj_ns))

  9. // Check if process's NS is a descendant of object's NS
 10. if is_ancestor(obj_ns, proc_ns):
 11.     // Process is in a child namespace — no capabilities against parent objects
 12.     return EMPTY_CAP_SET

 13. // Unrelated namespaces (neither ancestor nor descendant)
 14. // This happens with sibling containers
 15. return EMPTY_CAP_SET

is_same_or_ancestor(potential_ancestor, potential_descendant):
  // Walk up the hierarchy from potential_descendant toward root.
  // Return true if potential_ancestor is encountered (including if they're the same).
  cursor = potential_descendant
  while cursor != None:
      if cursor == potential_ancestor:
          return true
      cursor = cursor.parent
  return false

capabilities_valid_for(namespace):
  // Returns the set of capabilities meaningful for objects in this namespace.
  // Most capabilities are always valid. Some (e.g., CAP_SYS_ADMIN for certain
  // operations) may be restricted based on the namespace hierarchy.
  // For typical operations, this returns ALL_CAPS.
  return ALL_CAPS

Key invariant: The intersection() at line 8 ensures that if a process drops CAP_NET_ADMIN via capset(), it cannot exercise CAP_NET_ADMIN against any object, including objects in child namespaces. This upholds the guarantee in §66.2: "a dropped privilege can never be regained."
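The pseudocode above can be condensed into a runnable sketch over a parent-indexed namespace tree. Capability sets are modeled as u64 bitmasks and `parent[i]` gives the parent of namespace i (usize::MAX for the root); all names here are illustrative, not ISLE API.

```rust
pub const ALL_CAPS: u64 = u64::MAX;
pub const EMPTY_CAP_SET: u64 = 0;
pub const ROOT: usize = usize::MAX; // parent sentinel for the root namespace

/// Walk from `desc` toward the root; true if `anc` is encountered
/// (including when they are the same namespace).
pub fn is_same_or_ancestor(parent: &[usize], anc: usize, mut desc: usize) -> bool {
    loop {
        if desc == anc {
            return true;
        }
        if parent[desc] == ROOT {
            return false;
        }
        desc = parent[desc];
    }
}

/// compute_effective_caps: intersection of the process's CURRENT effective
/// set (not its bounding set) with the caps valid for the object's namespace.
pub fn compute_effective_caps(
    parent: &[usize],
    proc_ns: usize,
    proc_caps: u64,
    obj_ns: usize,
) -> u64 {
    if is_same_or_ancestor(parent, proc_ns, obj_ns) {
        // Same or ancestor namespace: caps apply, intersected with
        // capabilities_valid_for(obj_ns) (ALL_CAPS for typical operations).
        proc_caps & ALL_CAPS
    } else {
        // Descendant or unrelated (sibling) namespace: no capabilities.
        EMPTY_CAP_SET
    }
}
```

Because the function starts from `proc_caps` rather than the bounding set, a capability dropped via capset() stays dropped against every object, including those in child namespaces.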

File capability interpretation:

File capabilities (set via setcap) are interpreted relative to the file's owning user namespace:

  1. When execve() loads a binary with file capabilities, the kernel checks whether the process's user namespace is the same as, or a descendant of, the file's owning user namespace.
  2. If the file is in an ancestor namespace, its capability bits are ignored (prevents privilege escalation via a filesystem mounted from the host into a container).
  3. If the file is in the same or a descendant namespace, the file's capabilities are added to the process's permitted/effective sets, subject to the usual cap_bounding restrictions.

Setuid/setgid binary behavior in nested namespaces:

| Binary Location | Setuid Behavior | Rationale |
|---|---|---|
| Initial namespace (host) | UID changes in the initial namespace | Traditional Unix behavior |
| Child namespace | UID changes within the child namespace only | Cannot escalate to parent-namespace UIDs |
| Mounted from host into container | Setuid bit ignored | Prevents host → container privilege escalation |

Privilege escalation prevention:

  1. A process in a child user namespace cannot modify the parent's UID mappings
  2. setuid() inside a user namespace only affects the inner UID, not the outer UID
  3. File capability bits (setcap) are interpreted relative to the file's owning user namespace
  4. Signals from a less-privileged namespace to a more-privileged namespace are blocked unless explicitly allowed

63.5 Security Policy Integration

Container isolation requires multiple defense layers beyond namespaces and capabilities. ISLE integrates with security policy mechanisms at specific points in the container lifecycle:

seccomp-bpf (Syscall Filtering): OCI-compliant container runtimes (Docker, containerd, CRI-O) require seccomp-bpf to restrict the syscall surface available to containerized processes. ISLE's seccomp implementation is part of the eBPF subsystem described in §20.4 (05-linux-compat.md), which covers eBPF program types including seccomp-bpf for per-process syscall filtering. The typical container creation sequence is:

1. clone(CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | ...)
2. unshare(CLONE_NEWUSER) — if rootless
3. pivot_root() — change filesystem root
4. seccomp(SECCOMP_SET_MODE_FILTER, ...) — install syscall filter
5. drop_capabilities() — reduce capability set
6. execve() — exec container entrypoint

The seccomp filter must be installed before execve() so that the filter applies to the container's entrypoint and all its descendants. Docker's default seccomp profile blocks ~44 dangerous syscalls (e.g., kexec_load, reboot, mount). Kubernetes Pod Security Standards mandate seccomp profiles for restricted workloads.

LSM Integration: ISLE supports pluggable Linux Security Modules (AppArmor, SELinux profiles). Container runtimes can specify an LSM profile via OCI annotations, which ISLE applies at execve() time. The integrity measurement framework (§24, 06-security.md) provides the foundation for policy enforcement. Note: The full LSM framework specification is a future milestone; currently, AppArmor and SELinux profile application follows Linux semantics.

See also:
  • §20.4 (05-linux-compat.md): eBPF subsystem including seccomp-bpf
  • §24 (06-security.md): Runtime Integrity Measurement (IMA)
  • §66: Credential model and capability dropping

64. Control Groups (Cgroups v2)

Linux cgroups v2 provide hierarchical resource allocation and limiting. ISLE implements the unified cgroup v2 interface, mapping controller semantics to ISLE's native scheduler, memory manager, and I/O subsystems.

64.1 Cgroup Filesystem and Hierarchy

Cgroups are exposed via a pseudo-filesystem mounted at /sys/fs/cgroup:

/sys/fs/cgroup/
├── cgroup.controllers      # Available controllers (cpuset cpu io memory hugetlb pids rdma misc)
├── cgroup.subtree_control  # Controllers enabled for children
├── cgroup.procs            # PIDs in this cgroup
├── system.slice/           # Systemd system services
├── user.slice/             # User sessions
└── docker/                 # Container cgroups
    └── <container-id>/
        ├── cpu.max
        ├── cpu.weight
        ├── cpu.guarantee   # ISLE extension (§15)
        ├── memory.max
        ├── memory.current
        ├── io.max
        ├── pids.max
        └── cpuset.cpus

Hierarchy delegation: A cgroup can delegate control to a subtree by enabling controllers in cgroup.subtree_control. Only controllers enabled in the parent's subtree_control are available in child cgroups. This matches Linux semantics for unprivileged container runtimes.

Cgroup namespace integration: CLONE_NEWCGROUP creates a new cgroup namespace where the process's current cgroup becomes the root of its view. Processes see /sys/fs/cgroup/ starting from their namespace's cgroup root, enabling rootless container runtimes to manage their own cgroup hierarchy.

64.2 CPU Controller Integration

The cgroup cpu controller maps to the ISLE scheduler:

  • cpu.weight: Mapped directly to the EEVDF task weight (range 1-10000, default 100). A higher weight grants a proportionally larger share of CPU time relative to sibling cgroups with lower weights.

  • cpu.max: Sets the bandwidth ceiling using CFS-style throttling. Format: "<quota> <period>" (both in microseconds). Example: "400000 1000000" limits the cgroup to 40% CPU (400ms per 1000ms period). This is a maximum limit, not a guarantee. When throttled, the cgroup's tasks are removed from the run queue until the next period begins. This matches standard Linux cgroup v2 semantics.

  • cpu.guarantee: (ISLE extension, see §15) Sets the bandwidth floor using Constant Bandwidth Server (CBS). Format: "<budget> <period>". Guarantees minimum CPU time regardless of other load. This is distinct from cpu.max: a cgroup can have both a guarantee (floor) and a limit (ceiling).

Relationship between cpu.max and cpu.guarantee:

| Setting | Effect | Use Case |
|---------|--------|----------|
| cpu.max only | Limits maximum, no minimum | Prevent runaway containers |
| cpu.guarantee only | Guarantees minimum, no maximum | RT workloads that need bounded latency |
| Both | Guarantees minimum AND limits maximum | Mixed workloads with SLA |

When a cgroup is throttled (by either mechanism), the scheduler removes its tasks from the EEVDF tree until the next period or until budget is replenished.
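The cpu.max arithmetic above can be sketched as follows. This is an illustrative userspace parser, not the ISLE implementation; the "max" token (unlimited) and the 100000 µs default period follow Linux cgroup v2 conventions:

```rust
/// Parse a cgroup v2 `cpu.max` value "<quota> <period>" (microseconds).
/// A quota of "max" means unlimited and is modeled as None.
fn parse_cpu_max(s: &str) -> Option<(Option<u64>, u64)> {
    let mut it = s.split_whitespace();
    let quota = match it.next()? {
        "max" => None,
        q => Some(q.parse().ok()?),
    };
    // Linux defaults the period to 100000 µs when omitted.
    let period: u64 = it.next().unwrap_or("100000").parse().ok()?;
    Some((quota, period))
}

/// Fraction of one CPU the cgroup may consume per period (None = unlimited).
fn ceiling_fraction(quota: Option<u64>, period: u64) -> Option<f64> {
    quota.map(|q| q as f64 / period as f64)
}

fn main() {
    let (q, p) = parse_cpu_max("400000 1000000").unwrap();
    assert_eq!(ceiling_fraction(q, p), Some(0.4)); // the 40% example above
    let (q, p) = parse_cpu_max("max 100000").unwrap();
    assert_eq!(ceiling_fraction(q, p), None); // no ceiling
    println!("cpu.max parsing ok");
}
```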

64.3 Memory Controller Integration

The memory controller tracks physical page allocations per cgroup:

  • memory.current: The sum of all pages charged to this cgroup (in bytes).
  • memory.max: Hard limit (bytes). When exceeded, the per-cgroup OOM killer is invoked.
  • memory.high: Soft limit (bytes). When exceeded, the cgroup is throttled and its pages are prioritized for reclaim, but no OOM occurs.
  • memory.low: Memory protection (bytes). Pages below this threshold are protected from reclaim unless the system is under severe pressure.
  • memory.swap.max: Limits swap usage for this cgroup (Section 13.1).

Per-cgroup OOM killer: When memory.current exceeds memory.max, the OOM killer selects a victim within the cgroup subtree only; processes outside this cgroup are not affected. This is independent of global OOM (§12): per-cgroup OOM can trigger even when global memory is not exhausted. Victim selection criteria:

1. Select the task with the largest RSS within the cgroup subtree
2. Respect per-process oom_score_adj values: processes with OOM_SCORE_ADJ=-1000 are exempt from per-cgroup OOM (matching Linux semantics)
3. The final score is RSS + oom_score_adj_factor, with -1000 meaning "exempt"

This differs from the parenthetical shorthand in §12 — ISLE's per-cgroup OOM does respect oom_score_adj for compatibility with container workloads that use it to protect critical processes.
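The victim-selection criteria can be sketched as follows. This is a simplified model, not the ISLE implementation: the score here is a plain RSS-plus-adjustment sum, since the exact oom_score_adj_factor weighting is not specified in this section, and the Task fields are illustrative:

```rust
/// Hypothetical task record for OOM victim selection (fields illustrative).
#[derive(Debug)]
struct Task {
    pid: u64,
    rss_bytes: u64,
    oom_score_adj: i32, // -1000 ..= 1000, Linux semantics
}

/// Pick the victim within a cgroup subtree: largest score wins,
/// tasks with oom_score_adj == -1000 are exempt.
fn select_oom_victim(tasks: &[Task]) -> Option<u64> {
    tasks
        .iter()
        .filter(|t| t.oom_score_adj > -1000) // -1000 means "never kill"
        .max_by_key(|t| t.rss_bytes as i64 + t.oom_score_adj as i64)
        .map(|t| t.pid)
}

fn main() {
    let tasks = vec![
        Task { pid: 1, rss_bytes: 500 << 20, oom_score_adj: -1000 }, // protected
        Task { pid: 2, rss_bytes: 300 << 20, oom_score_adj: 0 },
        Task { pid: 3, rss_bytes: 100 << 20, oom_score_adj: 0 },
    ];
    // pid 1 has the largest RSS but is exempt; pid 2 is chosen.
    assert_eq!(select_oom_victim(&tasks), Some(2));
    println!("victim: {:?}", select_oom_victim(&tasks));
}
```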

Memory accounting has low overhead (~1 atomic increment per page charge); the charge operation piggybacks on existing page table allocation routines in isle-core.

64.4 I/O Controller Integration

The io controller limits block I/O bandwidth and IOPS per cgroup:

  • io.max: Per-device limits. Format: "<major>:<minor> rbps=<bytes> wbps=<bytes> riops=<ops> wiops=<ops>". Example: "8:0 rbps=10485760 wbps=5242880" limits reads to 10 MB/s and writes to 5 MB/s on device 8:0.
  • io.weight: Proportional weight for best-effort I/O scheduling (1-10000, default 100).

The block I/O subsystem (§29) integrates with cgroup accounting: each bio (block I/O request) is tagged with its originating cgroup, and the I/O scheduler enforces per-cgroup limits.
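The io.max line format can be sketched with an illustrative parser (not the ISLE parser); "max" for a field means unlimited, following Linux cgroup v2 conventions:

```rust
/// Per-device I/O limits parsed from one `io.max` line.
#[derive(Debug, Default, PartialEq)]
struct IoMax {
    major: u32,
    minor: u32,
    rbps: Option<u64>,  // read bytes/sec, None = unlimited
    wbps: Option<u64>,  // write bytes/sec
    riops: Option<u64>, // read IOPS
    wiops: Option<u64>, // write IOPS
}

/// Parse e.g. "8:0 rbps=10485760 wbps=5242880".
fn parse_io_max(line: &str) -> Option<IoMax> {
    let mut parts = line.split_whitespace();
    let (major, minor) = parts.next()?.split_once(':')?;
    let mut out = IoMax {
        major: major.parse().ok()?,
        minor: minor.parse().ok()?,
        ..Default::default()
    };
    for kv in parts {
        let (key, val) = kv.split_once('=')?;
        let v = if val == "max" { None } else { Some(val.parse().ok()?) };
        match key {
            "rbps" => out.rbps = v,
            "wbps" => out.wbps = v,
            "riops" => out.riops = v,
            "wiops" => out.wiops = v,
            _ => return None, // unknown key
        }
    }
    Some(out)
}

fn main() {
    let lim = parse_io_max("8:0 rbps=10485760 wbps=5242880").unwrap();
    assert_eq!(lim.rbps, Some(10 * 1024 * 1024)); // 10 MB/s reads
    assert_eq!(lim.wbps, Some(5 * 1024 * 1024));  // 5 MB/s writes
    println!("{:?}", lim);
}
```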

64.5 PIDs Controller (Fork Bomb Prevention)

  • pids.max: Maximum number of tasks (threads + processes) in the cgroup subtree. Prevents fork bombs from exhausting system-wide PID space. fork()/clone() returns EAGAIN when the limit is reached.
  • pids.current: Current number of tasks in the cgroup.

This is critical for container isolation: a misbehaving container cannot exhaust the host's PID space.

64.6 Cpuset Controller (CPU and NUMA Pinning)

  • cpuset.cpus: CPUs allowed for tasks in this cgroup. Format: "0-3,8-11" (CPU list).
  • cpuset.mems: NUMA nodes allowed for memory allocation. Format: "0,2" (node list).
  • cpuset.cpus.partition: Partition mode (root, member, isolated). Isolated partitions have exclusive CPU access.

The scheduler respects cpuset constraints when selecting a CPU for a task. NUMA-aware allocation (§12) respects the cpuset.mems mask.
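The CPU-list format used by cpuset.cpus and cpuset.mems ("0-3,8-11") can be expanded as follows; an illustrative helper, not the ISLE implementation:

```rust
/// Expand a cpuset list string like "0-3,8-11" into CPU/node ids.
/// Returns None on malformed input (e.g., an inverted range).
fn parse_cpu_list(s: &str) -> Option<Vec<u32>> {
    let mut ids = Vec::new();
    for part in s.split(',') {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let (lo, hi): (u32, u32) =
                    (lo.trim().parse().ok()?, hi.trim().parse().ok()?);
                if lo > hi {
                    return None;
                }
                ids.extend(lo..=hi);
            }
            None => ids.push(part.trim().parse().ok()?),
        }
    }
    Some(ids)
}

fn main() {
    // The cpuset.cpus example from above.
    assert_eq!(parse_cpu_list("0-3,8-11"), Some(vec![0, 1, 2, 3, 8, 9, 10, 11]));
    // The cpuset.mems example from above.
    assert_eq!(parse_cpu_list("0,2"), Some(vec![0, 2]));
    println!("cpu list parsing ok");
}
```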

64.7 Freezer (Cgroup Pause/Resume)

  • cgroup.freeze: Write 1 to freeze all tasks in the cgroup subtree; write 0 to thaw.
  • cgroup.events: Contains frozen 0/1 indicating current frozen state.

Frozen tasks are removed from the run queue and cannot be scheduled. Used by docker pause and checkpoint/restore.

64.8 Additional Controllers

| Controller | Key Interface | Description |
|------------|---------------|-------------|
| hugetlb | hugetlb.<size>.max | Limits huge page allocations per cgroup |
| rdma | rdma.max | Limits RDMA/InfiniBand resources |
| misc | misc.max | Limits miscellaneous resources (e.g., SGX EPC) |

65. POSIX Inter-Process Communication (IPC)

ISLE supports standard POSIX IPC mechanisms, optimized using ISLE's native zero-copy primitives where possible.

65.1 AF_UNIX Sockets

Local domain sockets (AF_UNIX) are heavily used in containerized environments (e.g., Docker, Kubernetes).

Zero-Copy Process-to-Process Rings: For SOCK_STREAM sockets, ISLE maps the connection to a pair of single-producer/single-consumer (SPSC) ring buffers shared directly between the two processes. These are distinct from the kernel-domain KABI ring buffers (Section 8), which are fixed-size command/completion rings for Tier 0/Tier 1 communication. The AF_UNIX ring buffer is:

/// Process-to-process SPSC ring for AF_UNIX SOCK_STREAM zero-copy.
/// Mapped into both processes' address spaces at connection time.
pub struct UserSpscRing {
    /// Ring buffer memory, shared between sender and receiver.
    /// Mapped read-write in sender, read-only in receiver.
    pub buffer: *mut u8,

    /// Total buffer size in bytes (power of 2 for efficient masking).
    pub capacity: usize,

    /// Write position (updated by sender, read by receiver).
    /// Stored in a separate cache line to avoid false sharing.
    /// Note: CacheAligned uses 128-byte alignment (max of x86/ARM/PPC cache lines)
    /// to prevent false sharing on all 6 target architectures.
    pub write_pos: CacheAligned<AtomicU64>,

    /// Read position (updated by receiver, read by sender).
    pub read_pos: CacheAligned<AtomicU64>,

    /// Futex word for blocking when ring is full (sender waits) or empty (receiver waits).
    pub futex_word: AtomicU32,
}
  • The isle-compat layer intercepts send() and recv() calls and translates them into ring buffer enqueues/dequeues.
  • Data is copied twice: once from sender's buffer into the shared ring, once from the ring into receiver's buffer. This matches the copy count of the traditional kernel-buffer path, but both copies happen in userspace, eliminating the per-transfer syscall and kernel buffer management.
  • The kernel is only invoked via futex when a ring is full/empty and the process must block.
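The ring protocol can be sketched as a userspace SPSC byte ring. This is a simplified model, not the ISLE implementation: it stores bytes in atomics rather than a shared mapping and spins instead of blocking on a futex, but the Acquire/Release index discipline matches the description above:

```rust
use std::sync::atomic::{AtomicU8, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Minimal SPSC ring: one producer advances write_pos, one consumer
/// advances read_pos; positions grow monotonically and are masked.
struct SpscRing {
    buf: Vec<AtomicU8>,
    write_pos: AtomicUsize, // only advanced by the sender
    read_pos: AtomicUsize,  // only advanced by the receiver
}

impl SpscRing {
    fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two()); // power of 2 for efficient masking
        SpscRing {
            buf: (0..capacity).map(|_| AtomicU8::new(0)).collect(),
            write_pos: AtomicUsize::new(0),
            read_pos: AtomicUsize::new(0),
        }
    }

    /// Enqueue one byte; false if the ring is full.
    fn push(&self, byte: u8) -> bool {
        let w = self.write_pos.load(Ordering::Relaxed);
        let r = self.read_pos.load(Ordering::Acquire);
        if w - r == self.buf.len() {
            return false; // full
        }
        self.buf[w & (self.buf.len() - 1)].store(byte, Ordering::Relaxed);
        self.write_pos.store(w + 1, Ordering::Release); // publish the byte
        true
    }

    /// Dequeue one byte; None if the ring is empty.
    fn pop(&self) -> Option<u8> {
        let r = self.read_pos.load(Ordering::Relaxed);
        let w = self.write_pos.load(Ordering::Acquire); // see sender's stores
        if r == w {
            return None; // empty
        }
        let byte = self.buf[r & (self.buf.len() - 1)].load(Ordering::Relaxed);
        self.read_pos.store(r + 1, Ordering::Release); // free the slot
        Some(byte)
    }
}

fn main() {
    let ring = Arc::new(SpscRing::new(64));
    let tx = Arc::clone(&ring);
    let producer = thread::spawn(move || {
        for b in 0u8..=255 {
            while !tx.push(b) {} // spin when full (the kernel path uses futex)
        }
    });
    let mut received = Vec::new();
    while received.len() < 256 {
        if let Some(b) = ring.pop() {
            received.push(b);
        }
    }
    producer.join().unwrap();
    assert_eq!(received, (0u8..=255).collect::<Vec<_>>());
    println!("transferred {} bytes in order", received.len());
}
```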

SOCK_SEQPACKET message boundaries: SOCK_SEQPACKET requires preserving message boundaries — recv() must return exactly one message per call. The ring buffer protocol includes a 4-byte length header before each message:

/// Message format in SOCK_SEQPACKET ring:
/// | msg_len: u32 | data: [u8; msg_len] | msg_len: u32 | data: ... |
///
/// The receiver reads msg_len, then reads exactly that many bytes.
/// Short reads (buffer smaller than msg_len) discard the remainder of the message.

For SOCK_DGRAM AF_UNIX sockets, a similar framed protocol is used, but the ring is unidirectional (no connection, just a receive queue per socket).
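The 4-byte length-header framing can be sketched as follows; little-endian encoding is an assumption for illustration (the section does not specify byte order):

```rust
/// Append one framed message: `len: u32` header followed by the payload.
fn frame_message(stream: &mut Vec<u8>, msg: &[u8]) {
    stream.extend_from_slice(&(msg.len() as u32).to_le_bytes());
    stream.extend_from_slice(msg);
}

/// Read exactly one message starting at `offset`, returning the payload
/// and the offset of the next frame. A caller buffer shorter than the
/// message would receive a truncated copy; the remainder is discarded,
/// preserving per-message boundaries.
fn read_message(stream: &[u8], offset: usize) -> Option<(Vec<u8>, usize)> {
    let hdr = stream.get(offset..offset + 4)?;
    let len = u32::from_le_bytes(hdr.try_into().ok()?) as usize;
    let data = stream.get(offset + 4..offset + 4 + len)?;
    Some((data.to_vec(), offset + 4 + len))
}

fn main() {
    let mut ring = Vec::new();
    frame_message(&mut ring, b"hello");
    frame_message(&mut ring, b"world!");
    let (m1, next) = read_message(&ring, 0).unwrap();
    let (m2, _) = read_message(&ring, next).unwrap();
    assert_eq!(m1, b"hello");
    assert_eq!(m2, b"world!");
    println!("two messages, boundaries preserved");
}
```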

65.2 Pipes and FIFOs

Standard pipes are implemented as bounded in-memory buffers managed by the VFS. - For high-throughput scenarios, applications can use vmsplice() to zero-copy data from a pipe into a memory-mapped region. - Internally, a pipe is a specialized VfsNode that maintains a wait queue for readers and writers.

Pipe data structure:

/// Maximum pipe capacity in pages. 1 MB / 4 KB = 256 pages.
/// Pipes larger than 1 MB require explicit fcntl(F_SETPIPE_SZ) and
/// use a separate allocation path (see LargePipeBuffer below).
pub const PIPE_MAX_FAST_PAGES: usize = 256;

/// A pipe buffer, shared between reader(s) and writer(s).
/// Allocated when pipe(2) or pipe2(2) is called.
///
/// The default buffer size is 65536 bytes (64 KB), matching Linux's default
/// since kernel 2.6.11. The size is configurable via fcntl(F_SETPIPE_SZ)
/// up to /proc/sys/fs/pipe-max-size (default 1 MB, max 1 GB).
///
/// **Zero-copy optimization**: When a pipe page is "gifted" via vmsplice()
/// with SPLICE_F_GIFT, the page is transferred to the pipe without copying.
/// The gifted page is unmapped from the sender's address space and becomes
/// owned by the pipe until read. This enables zero-copy data pipelines.
///
/// **Allocation model**: Small pipes (≤1 MB) use the inline page array below,
/// allocated from the kernel slab allocator in a single contiguous block.
/// Large pipes (>1 MB) store the page array separately and access it via
/// `pages_large`; the inline `pages_small` is unused. This hybrid approach
/// avoids heap allocation (Vec) while supporting the full Linux pipe size range.
pub struct PipeBuffer {
    /// Ring buffer of page-sized buffers.
    /// For pipes ≤1 MB (most cases): inline array, slab-allocated as part of PipeBuffer.
    /// For pipes >1 MB: unused, pages accessed via `pages_large`.
    pub pages_small: [PipePage; PIPE_MAX_FAST_PAGES],

    /// For large pipes (>1 MB): pointer to separately-allocated page array.
    /// None for small pipes. The allocation is from the general kernel heap
    /// (not slab) because large pipe buffers are rare and size varies.
    /// Accessed only while holding `ring_lock`.
    pub pages_large: Option<NonNull<PipePage>>,

    /// Number of pages currently allocated (determines which array to use).
    /// If ≤256, use `pages_small`. If >256, use `pages_large`.
    pub num_pages: AtomicU32,

    /// Index of the first page with data (read cursor).
    pub read_idx: AtomicU32,

    /// Index of the first empty page (write cursor).
    pub write_idx: AtomicU32,

    /// Byte offset within pages[read_idx] for partial reads.
    pub read_offset: AtomicU32,

    /// Byte offset within pages[write_idx] for partial writes.
    pub write_offset: AtomicU32,

    /// Total bytes currently in the pipe (atomic for lock-free size check).
    pub len: AtomicU32,

    /// Maximum capacity in bytes (set via fcntl(F_SETPIPE_SZ)).
    pub capacity: AtomicU32,

    /// Number of readers (for detecting write-side SIGPIPE).
    /// When this drops to 0, write() returns EPIPE.
    pub reader_count: AtomicU32,

    /// Number of writers (for detecting read-side EOF).
    /// When this drops to 0 and the pipe is empty, read() returns 0.
    pub writer_count: AtomicU32,

    /// Wait queue for blocked readers (pipe empty).
    pub read_wait: WaitQueueHead,

    /// Wait queue for blocked writers (pipe full).
    pub write_wait: WaitQueueHead,

    /// Lock for modifying the page ring (growing/shrinking) and multi-writer path.
    /// The lock-free single-writer path does not hold this lock.
    pub ring_lock: Mutex<()>,

    /// Seqlock for detecting concurrent fcntl(F_SETPIPE_SZ) during lock-free writes.
    /// Odd values indicate resize in progress; even values indicate stable.
    /// Writers read before and after; if changed, retry.
    pub resize_seq: AtomicU32,

    /// Count of active single-writer fast-path operations.
    /// fcntl(F_SETPIPE_SZ) waits for this to reach 0 before resizing.
    pub active_writer: AtomicU32,
}

/// A single page in the pipe buffer.
pub struct PipePage {
    /// Physical page containing the data.
    /// Allocated from the page allocator or gifted via vmsplice.
    pub page: PhysPage,

    /// Number of valid bytes in this page (0 = empty, PAGE_SIZE = full).
    /// For gifted pages, this is the full page; for standard writes,
    /// partial pages are possible.
    pub len: AtomicUsize,

    /// True if this page was gifted via vmsplice(SPLICE_F_GIFT).
    /// Gifted pages are unmapped from the sender and transferred to
    /// the reader; standard pages are copied.
    pub is_gifted: AtomicBool,
}

Pipe write algorithm (lock-free fast path):

Note: This algorithm assumes single-writer semantics for the lock-free fast path. POSIX pipes technically allow multiple concurrent writers, but such usage requires atomic writes smaller than PIPE_BUF (4096 bytes) to guarantee data integrity. For ISLE's high-performance path, the lock-free algorithm below requires exactly one concurrent writer — multi-writer scenarios fall back to a mutex-protected slow path (not shown).

Resize safety: The lock-free write path uses a resize_seq: AtomicU32 seqlock to detect concurrent fcntl(F_SETPIPE_SZ) operations. Before starting the write loop, the writer reads the seqlock; after completing each page, it re-checks. If the seqlock changed, the writer retries from the beginning with the new num_pages value. fcntl(F_SETPIPE_SZ) acquires ring_lock, waits for in-flight single-writers via an active_writer count, increments resize_seq, performs the resize, and increments resize_seq again. This ensures the lock-free path never observes an inconsistent buffer size.
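The resize_seq protocol is a seqlock, which can be sketched in isolation. A simplified single-threaded demonstration of the counter discipline, not the ISLE implementation (the real resizer also holds ring_lock and drains active_writer first):

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Seqlock-guarded value standing in for num_pages/capacity: odd seq
/// means a resize is in progress, even means stable.
struct SeqGuarded {
    seq: AtomicU32,
    value: AtomicU64,
}

impl SeqGuarded {
    /// Resizer side: increment to odd, mutate, increment back to even.
    fn write(&self, v: u64) {
        self.seq.fetch_add(1, Ordering::Release); // odd: resize in progress
        self.value.store(v, Ordering::Relaxed);
        self.seq.fetch_add(1, Ordering::Release); // even: stable again
    }

    /// Lock-free reader side: retry if the counter was odd or changed.
    fn read(&self) -> u64 {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 % 2 == 1 {
                continue; // resize in progress
            }
            let v = self.value.load(Ordering::Relaxed);
            if self.seq.load(Ordering::Acquire) == s1 {
                return v; // no resize observed during the read
            }
        }
    }
}

fn main() {
    let g = SeqGuarded { seq: AtomicU32::new(0), value: AtomicU64::new(256) };
    assert_eq!(g.read(), 256);
    g.write(512); // simulate fcntl(F_SETPIPE_SZ) growing the pipe
    assert_eq!(g.read(), 512);
    assert_eq!(g.seq.load(Ordering::Relaxed), 2); // two increments per resize
    println!("seqlock sketch ok");
}
```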

write(pipe, data, len):
  0. seq_start = resize_seq.load(Acquire)  // Capture resize generation
  1. If reader_count.load(Acquire) == 0: return EPIPE (SIGPIPE to caller)
     // TOCTOU note: reader may close between this check and write. This is
     // acceptable per POSIX — data written to a pipe with no readers is simply
     // discarded, and the next write() will observe reader_count == 0 and
     // return EPIPE. The pipe remains consistent; no data corruption occurs.
  2. // Try to claim fast path via compare-and-swap
     if !active_writer.compare_exchange(0, 1, Acquire, Relaxed).is_ok():
         // Another writer active — take slow path with ring_lock
         return write_slow_path(pipe, data, len)
  3. remaining = len; written = 0
  4. current_num_pages = num_pages.load(Acquire)  // Capture once at start
  5. while remaining > 0:
       a. If len.load() >= capacity.load():
          // Pipe full — block on write_wait
          active_writer.store(0, Release)  // Release during wait
          wait_event_interruptible(write_wait, len.load() < capacity)
           if interrupted: return written  // partial count; EINTR if written == 0
          // Re-acquire fast path and re-check
          if !active_writer.compare_exchange(0, 1, Acquire, Relaxed).is_ok():
              // Lost to another writer during wait — take slow path
              return written + write_slow_path(pipe, &data[written..], remaining)
          if reader_count.load(Acquire) == 0:
              active_writer.store(0, Release)
              return EPIPE
          // Check for resize during wait
          if resize_seq.load(Acquire) != seq_start:
              active_writer.store(0, Release)
               goto 0  // Retry: re-capture seq and re-claim; written/remaining are preserved (step 3 is skipped on retry)

       b. write_idx_val = write_idx.load(Relaxed)
       c. write_off = write_offset.load(Relaxed)
       d. available = min(PAGE_SIZE - write_off, remaining)
       e. copy data[written:written+available] to pages[write_idx_val][write_off:]
       e'. pages[write_idx_val].len.store(write_off + available, Release)  // Update per-page len
       f. write_offset.store(write_off + available, Release)
       g. len.fetch_add(available, Release)  // Publishing barrier for data in step e
       h. If write_offset == PAGE_SIZE:
          // Page full, advance to next — but first check for concurrent resize
          if resize_seq.load(Acquire) != seq_start:
              active_writer.store(0, Release)
               goto 0  // Retry: re-capture seq and re-claim; written/remaining are preserved (step 3 is skipped on retry)
          write_idx.store((write_idx_val + 1) % current_num_pages, Release)
          write_offset.store(0, Release)
       i. written += available; remaining -= available
  6. active_writer.store(0, Release)  // Release fast-path lock
  7. wake_up(read_wait)  // Notify any blocked readers
  8. return written

Memory ordering rationale for write path: The Release on len.fetch_add() (step g) is the publishing barrier that synchronizes with the reader's Acquire load of global len. This ensures all prior stores (the data memcpy in step e, the per-page len update in step e') are visible to the reader before it observes the new len value. The reader must use the global len Acquire→per-page len Acquire chain.

fcntl(F_SETPIPE_SZ) implementation:

fcntl_setpipe_sz(pipe, new_size):
  1. ring_lock.lock()
  2. // Wait for active single-writers to complete using futex
     while active_writer.load(Acquire) > 0:
         // Use futex wait instead of busy-spin to avoid priority inversion
         futex_wait(&active_writer, expected=1, timeout=1ms)
  3. resize_seq.fetch_add(1, Release)  // Start resize
  4. // Perform resize: allocate/free pages, update num_pages
  5. resize_seq.fetch_add(1, Release)  // End resize
  6. ring_lock.unlock()

Multi-writer support: When multiple threads write to the same pipe concurrently, the lock-free path cannot be used. The kernel detects multi-writer scenarios using a compare-and-swap pattern: a writer performs active_writer.compare_exchange(0, 1, Acquire, Relaxed). If successful (previous value was 0), it proceeds on the fast path. If it fails (another writer is active), it acquires ring_lock and takes the slow path. This ensures exactly one writer can be on the fast path at a time, preserving POSIX atomic write guarantees for writes ≤ PIPE_BUF.
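The compare-and-swap claim in step 2 can be sketched directly; a minimal illustration of the pattern, not the full slow path:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// A writer owns the lock-free fast path only if it flips
/// active_writer from 0 to 1; a concurrent writer fails the CAS
/// and would instead acquire ring_lock (slow path, not shown).
fn try_claim_fast_path(active_writer: &AtomicU32) -> bool {
    active_writer
        .compare_exchange(0, 1, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
}

fn release_fast_path(active_writer: &AtomicU32) {
    active_writer.store(0, Ordering::Release);
}

fn main() {
    let active_writer = AtomicU32::new(0);
    assert!(try_claim_fast_path(&active_writer));  // first writer wins
    assert!(!try_claim_fast_path(&active_writer)); // second writer -> slow path
    release_fast_path(&active_writer);
    assert!(try_claim_fast_path(&active_writer));  // claimable again
    println!("fast-path claim ok");
}
```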

Pipe read algorithm (lock-free, requires single reader or mutex for multi-reader):

read(pipe, buffer, len):
  0. seq_start = resize_seq.load(Acquire)  // Capture resize generation
  1. If len.load(Acquire) == 0:
       // Pipe empty — check for EOF or block
       if writer_count.load(Acquire) == 0:
           return 0  // EOF — all writers closed
       wait_event_interruptible(read_wait, len.load(Acquire) > 0 || writer_count.load(Acquire) == 0)
        if interrupted: return EINTR  // not 0, which would falsely signal EOF
       if len.load(Acquire) == 0 && writer_count.load(Acquire) == 0:
           return 0  // EOF after wakeup
       // Check for resize during wait
       if resize_seq.load(Acquire) != seq_start:
           goto 0  // Retry with new parameters

  2. bytes_read = 0
  3. current_num_pages = num_pages.load(Acquire)
  4. while bytes_read < len && len.load(Acquire) > 0:
       a. read_idx_val = read_idx.load(Acquire)  // Acquire to see writer's stores
       b. read_off = read_offset.load(Acquire)
       c. // Determine bytes available in current page
          page_len = pages[read_idx_val].len.load(Acquire)
          available = min(page_len - read_off, len - bytes_read)
       d. // Copy data from page to user buffer (Acquire ensures data is visible)
          copy pages[read_idx_val][read_off:read_off+available] to buffer[bytes_read:]
       e. read_offset.store(read_off + available, Release)
       f. len.fetch_sub(available, Release)
       g. If read_offset.load(Relaxed) >= page_len:
          // Page consumed, advance to next
          // Check for concurrent resize before using current_num_pages
          if resize_seq.load(Acquire) != seq_start:
              // Resize occurred — re-capture num_pages and validate read_idx
              seq_start = resize_seq.load(Acquire)
              current_num_pages = num_pages.load(Acquire)
              // read_idx may have been invalidated; re-load
              read_idx_val = read_idx.load(Acquire)
          read_idx.store((read_idx_val + 1) % current_num_pages, Release)
          read_offset.store(0, Release)
       h. bytes_read += available

  5. wake_up(write_wait)  // Notify any blocked writers
  6. return bytes_read

Memory ordering rationale: The reader uses Acquire loads on len, read_idx, read_offset, and pages[].len to synchronize with the writer's Release stores. This ensures the reader observes all data written before the writer updated these indices. On weakly-ordered architectures (AArch64, RISC-V, ARMv7, PPC), this ordering is critical to prevent the reader from seeing stale data.
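The publish/consume pairing can be demonstrated in miniature. A sketch of the pattern only, not the pipe code: the writer fills "page" cells, then publishes with a Release increment of len; a reader whose Acquire load sees the new len is guaranteed to see the data (payload cells are atomics here to keep the demo in safe Rust):

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Writer fills four cells, then publishes via Release on `len`;
/// the reader Acquire-loads `len` and must then observe the data.
fn publish_and_read() -> Vec<u32> {
    let data: Arc<Vec<AtomicU32>> =
        Arc::new((0..4).map(|_| AtomicU32::new(0)).collect());
    let len = Arc::new(AtomicU32::new(0));

    let (d, l) = (Arc::clone(&data), Arc::clone(&len));
    let writer = thread::spawn(move || {
        for (i, cell) in d.iter().enumerate() {
            cell.store(100 + i as u32, Ordering::Relaxed); // step e: fill data
        }
        l.fetch_add(4, Ordering::Release); // step g: publishing barrier
    });

    while len.load(Ordering::Acquire) == 0 {} // reader's Acquire load of len
    let seen = data.iter().map(|c| c.load(Ordering::Relaxed)).collect();
    writer.join().unwrap();
    seen
}

fn main() {
    let seen = publish_and_read();
    assert_eq!(seen, vec![100, 101, 102, 103]);
    println!("reader observed published data: {:?}", seen);
}
```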

FIFOs (named pipes): A FIFO is a VFS node (VfsNode) that, when opened, creates a reference to an existing PipeBuffer or creates a new one. Multiple readers and writers can open a FIFO; the reader_count and writer_count fields track opens/closes. Writers use the multi-writer slow path when concurrent writes are detected. When the last reader and last writer close, the buffer is freed.

65.3 Shared Memory (POSIX and SysV)

  • POSIX shm_open(): Implemented as a memory-mapped file (mmap) backed by a hidden tmpfs instance.
  • SysV shmget(): Maps to the same underlying physical memory allocation mechanism, but managed via the CLONE_NEWIPC namespace tables.

Both mechanisms result in direct page table entries (PTEs) mapping the same physical frames into multiple Capability Domains.

66. Credential Model and Capabilities

The Linux credential model (UIDs, GIDs, supplementary groups) is a legacy access control mechanism. ISLE maps this model to its native Capability System (Section 11).

66.1 Credential Translation

When a Linux binary executes, isle-compat constructs a Capability Set based on the file's ownership and the process's current credentials:

  1. Authentication: The process presents its UID/GID.
  2. Translation: isle-compat looks up the UID/GID in the NamespaceSet::id_map.
  3. Capability Grant: If the UID is 0 (root) in the initial namespace, the process is granted a wide set of administrative capabilities (e.g., CAP_SYS_ADMIN equivalent).
  4. File Access: When accessing a file, isle-compat checks the file's VFS mode bits against the process's UID/GID. If access is allowed, a native ISLE Capability<VfsNode> is generated and returned as a file descriptor.
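The translation in step 2 can be sketched with the mapping arithmetic used by Linux-style uid_map entries. The concrete layout of NamespaceSet::id_map is not specified in this section, so the `(inner_start, outer_start, count)` entry shape here is an assumption modeled on Linux:

```rust
/// Hypothetical id_map entry, modeled on a Linux uid_map line:
/// inner UIDs [inner_start, inner_start+count) map to
/// outer UIDs [outer_start, outer_start+count).
struct IdMapEntry {
    inner_start: u32,
    outer_start: u32,
    count: u32,
}

/// Translate a UID as seen inside the user namespace to the outer UID.
/// Returns None for unmapped UIDs (which appear as the overflow UID on Linux).
fn inner_to_outer(map: &[IdMapEntry], inner_uid: u32) -> Option<u32> {
    map.iter().find_map(|e| {
        inner_uid
            .checked_sub(e.inner_start)
            .filter(|off| *off < e.count)
            .map(|off| e.outer_start + off)
    })
}

fn main() {
    // Typical rootless-container mapping: inner root (0) is outer uid 100000.
    let map = [IdMapEntry { inner_start: 0, outer_start: 100000, count: 65536 }];
    assert_eq!(inner_to_outer(&map, 0), Some(100000)); // "root" inside only
    assert_eq!(inner_to_outer(&map, 1000), Some(101000));
    assert_eq!(inner_to_outer(&map, 70000), None); // outside the mapped range
    println!("uid translation ok");
}
```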

66.2 Bounding Sets and Dropping Privileges

Linux capabilities (e.g., CAP_NET_BIND_SERVICE) map 1:1 to specific ISLE capability flags. When a process calls capset() to drop privileges, isle-compat permanently revokes the corresponding ISLE capabilities from the process's Capability Domain. Because ISLE capabilities are unforgeable, a dropped privilege can never be regained, ensuring strict containment.