# Chapter 8: Process and Task Management
*Task/Process structs, fork/exec/exit, signals, process groups, sessions, real-time guarantees*
Tasks are the fundamental schedulable unit. Processes and threads are both Task objects
distinguished by shared/private address space. The lifecycle (fork, exec, exit) is
Linux-compatible at the syscall boundary. Signals, process groups, sessions, and resource
limits follow POSIX semantics with UmkaOS-internal improvements to state tracking.
## 8.1 Process and Task Management
This section defines how UmkaOS represents runnable entities, creates and destroys processes, loads programs, and manages the virtual address space operations that user space relies on. The scheduler that decides when tasks run is in Section 7.1; this section covers what tasks are and how they are born, transformed, and reaped.
### 8.1.1 Task Model
UmkaOS uses the task as its schedulable unit. Tasks are grouped into processes that share
an address space, capability table, and file descriptor table. This mirrors the Linux
thread-group model: a "process" is a collection of tasks created with `CLONE_VM |
CLONE_FILES | CLONE_SIGHAND`, and a "thread" is simply another task within the same
process.
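The thread-versus-process distinction reduces to a flag test at `clone()` time. A minimal standalone sketch (the flag values follow the Linux clone(2) ABI; `creates_thread` is a hypothetical helper for illustration, not part of the UmkaOS API):

```rust
// Illustrative only: flag values mirror the Linux clone(2) ABI.
const CLONE_VM: u64 = 0x0000_0100;      // share address space
const CLONE_FILES: u64 = 0x0000_0400;   // share fd table
const CLONE_SIGHAND: u64 = 0x0000_0800; // share signal dispositions

/// A clone creates a sibling task in the same process only when all three
/// sharing flags are present; otherwise it creates a new process.
fn creates_thread(clone_flags: u64) -> bool {
    const THREAD_FLAGS: u64 = CLONE_VM | CLONE_FILES | CLONE_SIGHAND;
    clone_flags & THREAD_FLAGS == THREAD_FLAGS
}

fn main() {
    // pthread_create-style clone: shares everything -> same process.
    assert!(creates_thread(CLONE_VM | CLONE_FILES | CLONE_SIGHAND));
    // fork-style clone: shares nothing -> new process.
    assert!(!creates_thread(0));
    // Sharing only the address space is not enough to be a thread.
    assert!(!creates_thread(CLONE_VM));
    println!("ok");
}
```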
Task descriptor:
bitflags! {
/// Task scheduling state flags. Compound states are formed by ORing base flags.
/// The scheduler checks these to decide when sleeping tasks can be woken.
///
/// **Linux ABI compatibility**: The `TASK_*` values (RUNNING through TRACED,
/// WAKEKILL, FROZEN, NEW) match Linux's `__state` constants. ZOMBIE and DEAD
/// use UmkaOS-specific values (0x10/0x20) because UmkaOS merges Linux's
/// separate `__state` and `exit_state` fields into a single `TaskState`
/// bitflags — the natural lifecycle ordering (ZOMBIE precedes DEAD) differs
/// from Linux's `EXIT_ZOMBIE`/`EXIT_DEAD` values. The character mapping for
/// `/proc/[pid]/status` "State:" field is handled by `task_state_to_char()`
/// (see below), decoupling the internal bitflag values from the userspace ABI.
pub struct TaskState: u32 {
/// Task is on a run queue and eligible to be scheduled (or currently running).
/// Not a bit — the zero value. A task with no other flag set is RUNNING.
const RUNNING = 0x0000_0000;
/// Sleeping; woken by signals, explicit wake_up(), or timer expiry.
const INTERRUPTIBLE = 0x0000_0001;
/// Sleeping in uninterruptible wait; only woken by explicit wake_up().
/// Used for I/O waits that must not be interrupted by signals.
const UNINTERRUPTIBLE = 0x0000_0002;
/// Stopped by SIGSTOP or group stop. Resumed by SIGCONT.
const STOPPED = 0x0000_0004;
/// Being ptraced; stopped at a ptrace event.
const TRACED = 0x0000_0008;
/// Exit in progress; task_struct still exists for wait4() reaping.
/// UmkaOS ordering: ZOMBIE precedes DEAD in the merged TaskState field.
/// Linux uses reversed values because EXIT_ZOMBIE/EXIT_DEAD are in a separate
/// exit_state field. The procfs ABI uses characters ('Z'/'X'), not raw bits.
const ZOMBIE = 0x0000_0010;
/// All resources freed; task_struct about to be released.
/// UmkaOS ordering: DEAD follows ZOMBIE — natural lifecycle progression
/// in the merged TaskState field. Linux's EXIT_DEAD is in a separate
/// exit_state field with a different value; procfs exposes 'X', not bits.
/// Note: Linux also has TASK_DEAD = 0x00000080 in __state (a different
/// field). UmkaOS does not use TASK_DEAD separately — DEAD covers both.
const DEAD = 0x0000_0020;
/// Modifier: woken by fatal signals (SIGKILL) even while UNINTERRUPTIBLE.
/// Combine with UNINTERRUPTIBLE: KILLABLE = UNINTERRUPTIBLE | WAKEKILL.
const WAKEKILL = 0x0000_0100;
/// Convenient alias: UNINTERRUPTIBLE that can be killed.
const KILLABLE = Self::UNINTERRUPTIBLE.bits() | Self::WAKEKILL.bits();
/// Modifier: task should receive a wake-up IPI on the next tick
/// (used by WEA fibers to avoid dedicated wakeup IPIs on short waits).
const WAKEUP_DEFERRED = 0x0000_0200;
/// Task is being migrated between CPUs; temporarily off any run queue.
const MIGRATING = 0x0000_0400;
/// Frozen by cgroup freezer ([Section 17.2](17-containers.md#control-groups--freezer)).
/// Task is dequeued from run queue and cannot be scheduled until thawed.
/// Entry into TASK_FROZEN is an implicit RCU quiescent state.
/// Fatal signals (SIGKILL) can still terminate a frozen task.
/// Linux: `TASK_FROZEN = 0x00008000`.
const FROZEN = 0x0000_8000;
/// Task has been allocated and registered in the PID table but has not
/// yet been made runnable by `wake_up_new_task()`. Prevents signal
/// delivery to a task that hasn't been fully initialized (stack,
/// credentials, FD table, and scheduler state are still being set up).
/// Transitions to RUNNING on `wake_up_new_task()`.
/// Linux: `TASK_NEW = 0x00000800`.
const NEW = 0x0000_0800;
}
}
/// Map TaskState to the single-character code for `/proc/[pid]/status` "State:"
/// and `/proc/[pid]/stat` field 3. This function decouples the internal bitflag
/// values from the userspace-visible ABI: tools like `ps`, `top`, `htop` parse
/// the character, not the numeric value.
///
/// Linux equivalent: `task_state_to_char()` in `fs/proc/array.c`.
///
/// | TaskState | Char | Meaning |
/// |-----------|------|---------|
/// | RUNNING | 'R' | Running or runnable |
/// | INTERRUPTIBLE | 'S' | Sleeping in interruptible wait |
/// | UNINTERRUPTIBLE | 'D' | Sleeping in uninterruptible wait |
/// | STOPPED | 'T' | Stopped by signal |
/// | TRACED | 't' | Tracing stop |
/// | ZOMBIE | 'Z' | Zombie (exited, not yet reaped) |
/// | DEAD | 'X' | Dead (should not be seen in /proc) |
/// | FROZEN | 'D' | Reported as uninterruptible (Linux compat: frozen tasks show as 'D' in procfs) |
/// | KILLABLE | 'D' | Reported as uninterruptible (same as D) |
/// | NEW | — | Never visible in /proc (task not yet in PID table; see note) |
///
/// **Linux ABI compatibility (SF-094 fix)**: Linux `fs/proc/array.c`
/// `task_state_array` uses: R, S, D, T, t, X, Z, P, I. There is NO
/// 'F' or 'N' character in Linux `/proc/[pid]/status`. Tools like `ps`,
/// `htop`, `systemd-cgroup` have hardcoded state tables and display
/// anomalies on unknown characters. FROZEN maps to 'D' matching Linux
/// (which special-cases frozen tasks to report as uninterruptible). NEW tasks
/// are never enumerable by procfs (they are between PID registration
/// and `wake_up_new_task()`; the PID is not yet in the PID table).
fn task_state_to_char(state: TaskState) -> char {
// Priority order: check exit states first (ZOMBIE/DEAD), then stop
// states, then sleep states, then running. First match wins.
// All return values are Linux-compatible characters.
if state.contains(TaskState::ZOMBIE) { return 'Z'; }
if state.contains(TaskState::DEAD) { return 'X'; }
if state.contains(TaskState::STOPPED) { return 'T'; }
if state.contains(TaskState::TRACED) { return 't'; }
// FROZEN reports as 'D' (uninterruptible) — Linux compat.
// Linux special-cases TASK_FROZEN to report as uninterruptible, so frozen
// tasks appear as 'D' in /proc. Using 'F' would break ps/htop/systemd.
if state.contains(TaskState::FROZEN) { return 'D'; }
if state.contains(TaskState::UNINTERRUPTIBLE) { return 'D'; }
if state.contains(TaskState::INTERRUPTIBLE) { return 'S'; }
// NEW tasks are between PID alloc and wake_up_new_task(). They cannot
// be enumerated by procfs because they are not yet in the PID table.
// Debug assertion: if we reach this point with NEW, it's a kernel bug
// (procfs should never see this state).
debug_assert!(!state.contains(TaskState::NEW),
"NEW task should not be visible in /proc");
'R' // RUNNING is the zero value — no flags set
}
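The first-match-wins ordering can be exercised in isolation. A minimal standalone sketch, using plain `u32` constants copied from the `TaskState` definition above so it runs without the `bitflags` crate (`state_char` is a stand-in for `task_state_to_char`, not the kernel function itself):

```rust
// Sketch only: constants mirror the TaskState bit values defined above.
const INTERRUPTIBLE: u32 = 0x0001;
const UNINTERRUPTIBLE: u32 = 0x0002;
const STOPPED: u32 = 0x0004;
const TRACED: u32 = 0x0008;
const ZOMBIE: u32 = 0x0010;
const DEAD: u32 = 0x0020;
const WAKEKILL: u32 = 0x0100;
const FROZEN: u32 = 0x8000;

/// Same priority order as task_state_to_char(): exit states first, then
/// stop states, then sleep states, then the RUNNING zero value.
fn state_char(state: u32) -> char {
    if state & ZOMBIE != 0 { return 'Z'; }
    if state & DEAD != 0 { return 'X'; }
    if state & STOPPED != 0 { return 'T'; }
    if state & TRACED != 0 { return 't'; }
    if state & FROZEN != 0 { return 'D'; } // frozen reports as 'D'
    if state & UNINTERRUPTIBLE != 0 { return 'D'; }
    if state & INTERRUPTIBLE != 0 { return 'S'; }
    'R'
}

fn main() {
    assert_eq!(state_char(0), 'R'); // RUNNING: no flags set
    assert_eq!(state_char(INTERRUPTIBLE), 'S');
    // KILLABLE = UNINTERRUPTIBLE | WAKEKILL still reads as 'D'.
    assert_eq!(state_char(UNINTERRUPTIBLE | WAKEKILL), 'D');
    assert_eq!(state_char(FROZEN), 'D');
    assert_eq!(state_char(ZOMBIE), 'Z');
    println!("ok");
}
```

Note that the WAKEKILL modifier deliberately has no character of its own: userspace sees a killable sleep as an ordinary 'D' state.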
**TaskState ↔ OnRqState valid combinations:**
The scheduler treats `OnRqState` as the authoritative input when `pick_eevdf()` selects
the next task; `TaskState` governs the lifecycle (which transitions are legal).
| TaskState | Valid OnRqState(s) | Notes |
|---|---|---|
| RUNNING | Eligible, Ineligible | Task is on-CPU or in a run queue tree |
| INTERRUPTIBLE | Off, Deferred, CbsThrottled | Sleeping; signal wakes to Eligible. CbsThrottled if cgroup budget exhausted |
| UNINTERRUPTIBLE | Off, Deferred, CbsThrottled | Sleeping; only explicit wake moves to Eligible. CbsThrottled if cgroup budget exhausted |
| KILLABLE | Off, Deferred, CbsThrottled | Like UNINTERRUPTIBLE but SIGKILL wakes. CbsThrottled if cgroup budget exhausted |
| STOPPED | Off | Not schedulable; SIGCONT transitions to RUNNING/Eligible |
| TRACED | Off | Ptrace-stopped; not schedulable until resumed |
| NEW | Off | Between PID registration and wake_up_new_task(); not schedulable |
| ZOMBIE | Off | Exit in progress; never scheduled again |
| DEAD | Off | Resources freed; about to be reaped |
| FROZEN | Off | Cgroup-frozen; thaw transitions to previous state |
| MIGRATING | Off | Temporarily between run queues during cross-CPU migration |
`OnRqState::Deferred` and `OnRqState::CbsThrottled` are distinct states:
- **`Deferred`**: EEVDF deferred-dequeue optimization. The task went to sleep with
negative lag (`sched_delayed == true`) and remains physically present on the EEVDF
tree (eligible or ineligible). `pick_eevdf()` skips it, but it contributes weight
to `avg_vruntime`. When the task wakes, `place_entity()` repositions it and it
transitions to `Eligible` without re-enqueue overhead. This is purely an EEVDF
fairness mechanism -- no timer or budget is involved.
- **`CbsThrottled`**: CBS bandwidth exhaustion. The task's cgroup CPU budget is depleted.
The task is NOT on any EEVDF tree -- it has been fully dequeued with `vruntime` and
`lag` preserved. It waits for the CBS period replenishment timer. SIGKILL immediately
re-enqueues (with bandwidth debt); other signals are queued as `TIF_SIGPENDING` and
processed on replenishment.
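The table rows can be collapsed into a small validity predicate. A hypothetical sketch (the `OnRqState` variants are taken from the table's column headings; the real `is_valid_state_pair()` referenced by the debug assertion in the `Task` struct may differ in detail):

```rust
// Sketch only: OnRqState variants and the pairing rules come from the
// TaskState <-> OnRqState table above; names are illustrative.
#[derive(Clone, Copy, PartialEq, Debug)]
enum OnRqState { Eligible, Ineligible, Off, Deferred, CbsThrottled }

// TaskState encoded as plain bits, matching the definitions above.
const RUNNING: u32 = 0x0000;
const INTERRUPTIBLE: u32 = 0x0001;
const UNINTERRUPTIBLE: u32 = 0x0002;
const WAKEKILL: u32 = 0x0100;

fn is_valid_state_pair(state: u32, on_rq: OnRqState) -> bool {
    use OnRqState::*;
    match state {
        // RUNNING: on-CPU or on a run queue tree, never dequeued.
        RUNNING => matches!(on_rq, Eligible | Ineligible),
        // Sleeping states (including KILLABLE): dequeued, deferred-dequeue,
        // or CBS-throttled.
        s if s & (INTERRUPTIBLE | UNINTERRUPTIBLE) != 0 => {
            matches!(on_rq, Off | Deferred | CbsThrottled)
        }
        // Every other lifecycle state (STOPPED, TRACED, NEW, ZOMBIE,
        // DEAD, FROZEN, MIGRATING) is only ever Off.
        _ => on_rq == Off,
    }
}

fn main() {
    assert!(is_valid_state_pair(RUNNING, OnRqState::Eligible));
    assert!(is_valid_state_pair(UNINTERRUPTIBLE | WAKEKILL, OnRqState::Deferred));
    assert!(!is_valid_state_pair(RUNNING, OnRqState::Off));
    println!("ok");
}
```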
/// Maximum task comm name length including null terminator. Linux-compatible value.
pub const TASK_COMM_LEN: usize = 16;
pub struct Task {
/// Kernel-unique task identifier.
pub tid: TaskId,
/// Owning process (shared with sibling tasks).
pub process: Arc<Process>,
/// Task name (thread/process name). Set to the executable basename on exec()
/// (truncated to 15 bytes + null terminator). Updated by prctl(PR_SET_NAME)
/// and pthread_setname_np(). Exposed via /proc/[pid]/comm and
/// /proc/[pid]/status (Name: field).
///
/// **External ABI**: Must be [u8; TASK_COMM_LEN] for Linux /proc compatibility.
/// The null terminator at index 15 is always maintained.
pub comm: [u8; TASK_COMM_LEN],
/// Scheduling state machine. Stored as `AtomicU32` for concurrent
/// access from multiple CPUs (signal delivery, scheduler, do_exit).
/// Accessors: `state.load(Acquire)` returns `TaskState`;
/// `state.store(new.bits(), Release)` transitions state. The
/// `Acquire`/`Release` pair ensures state transitions are visible
/// to other CPUs along with preceding memory writes (e.g., setting
/// `exit_code` before transitioning to ZOMBIE).
pub state: AtomicU32, // Running, Interruptible, Uninterruptible, Stopped, Zombie
/// Per-task process flags (PF_EXITING, PF_WQ_WORKER, PF_KTHREAD, PF_SIGNALED,
/// PF_MEMALLOC, PF_FORKNOEXEC, PF_OOM_VICTIM, etc.). Orthogonal to `state`
/// (scheduling state). AtomicU32 for concurrent access from signal delivery,
/// scheduler, and do_exit. Set via `fetch_or(flag, Release)`, clear via
/// `fetch_and(!flag, Release)`.
///
/// Linux equivalent: `task_struct->flags` (`unsigned int`, accessed via
/// `PF_*` macros in `include/linux/sched.h`). UmkaOS matches the semantics
/// but uses atomic operations for thread safety (Linux relies on plain
/// accesses plus careful ordering conventions, which do not map cleanly
/// onto the Rust memory model).
///
/// See `PF_*` constants below the Task struct for bit definitions.
pub flags: AtomicU32,
/// Per-task thread flags for signal/scheduling notification. Contains
/// TIF_SIGPENDING (signal needs delivery), TIF_NEED_RESCHED (redundant
/// with CpuLocalBlock.need_resched but needed for cross-CPU signal wakeup),
/// TIF_NOTIFY_RESUME (notify callback on return to userspace).
///
/// Checked on every return-to-userspace path (syscall exit, interrupt exit,
/// exception exit). The kernel's `do_signal()` entry point is invoked when
/// `TIF_SIGPENDING` is set. `TIF_NEED_RESCHED` in thread_flags is the per-task
/// counterpart to `CpuLocalBlock.need_resched` (per-CPU); the per-task flag is
/// needed because `send_signal()` may target a task on a remote CPU and must
/// set a flag visible to that task's return-to-userspace path.
///
/// AtomicU32 for concurrent access: signal delivery on CPU A may set
/// `TIF_SIGPENDING` on a task currently running on CPU B. Uses
/// `fetch_or(flag, Release)` to set, `fetch_and(!flag, Acquire)` to clear
/// and read.
///
/// See `TIF_*` constants below the Task struct for bit definitions.
pub thread_flags: AtomicU32,
/// Completion event for vfork(). Stack-allocated in the parent's
/// kernel stack frame. The child stores a pointer here; the parent
/// blocks on it. Signaled when the child calls exec() or _exit().
/// Null when no vfork is pending (non-vfork children, or after signaling).
///
/// # Interior mutability rationale
/// Written in `do_fork()` (step 19: set to parent's stack-allocated event)
/// and cleared in `exec_mmap()` / `do_exit()` (after signaling completion).
/// Both write sites access the field through `&Task` (shared reference via
/// `Arc<Task>`). `AtomicPtr` provides interior mutability without requiring
/// `&mut Task`. The null pointer serves as the `None` sentinel.
///
/// # Safety
/// This is a raw pointer to the parent's kernel stack frame. It is valid
/// ONLY while `CompletionEvent.guard` reads `true` (Acquire). The child
/// MUST check `guard.load(Acquire)` before dereferencing `event` — the
/// parent may have been killed (SIGKILL) and set guard=false before
/// unwinding its stack frame. After signaling, the child stores null.
///
/// Read: `let ptr = task.vfork_done.load(Acquire);`
/// Write: `task.vfork_done.store(ptr, Release);`
/// Take-and-clear: `let ptr = task.vfork_done.swap(null_mut(), AcqRel);`
///
/// See the vfork() documentation and `CompletionEvent` struct for the
/// full protocol including the edge case where the parent is killed.
pub vfork_done: AtomicPtr<CompletionEvent>,
/// Job control flags for POSIX job control (JOBCTL_STOP_PENDING,
/// JOBCTL_STOP_CONSUME, JOBCTL_TRAP_STOP, JOBCTL_TRAP_NOTIFY, etc.).
/// Used by group stop, ptrace, and SIGCONT/SIGSTOP handling.
///
/// AtomicU32 for concurrent access from signal delivery paths.
/// Set via `fetch_or(flag, Release)`, cleared via `fetch_and(!flag, Release)`.
///
/// See `JOBCTL_*` constants below the Task struct for bit definitions.
pub jobctl: AtomicU32,
/// Syscall restart block. Holds arguments for restarting an interrupted
/// syscall after signal delivery (nanosleep, futex_wait, poll, etc.).
/// Set by the syscall implementation before returning -ERESTARTSYS.
/// The signal delivery path inspects this to determine how to restart
/// the syscall after the signal handler returns.
///
/// Initialized to `RestartBlock::None` at task creation. Only the current
/// task writes to its own restart_block (no concurrent access).
pub restart_block: RestartBlock,
/// OOM adjustment score (-1000 to 1000). Read by `oom_badness()`
/// ([Section 4.5](04-memory.md#oom-killer)). Set via `/proc/PID/oom_score_adj`. Defaults to 0.
/// -1000 = OOM-immune (never killed). +1000 = always killed first.
/// AtomicI16 for concurrent reads from the OOM killer while the task
/// or a privileged process writes via procfs. Range enforced by the
/// procfs write handler.
pub oom_score_adj: AtomicI16,
// Debug assertion: on every state transition, verify the new (TaskState, OnRqState)
// pair is in the valid combinations table above. In debug builds:
// debug_assert!(is_valid_state_pair(new_state, new_on_rq),
// "invalid (TaskState={:?}, OnRqState={:?})", new_state, new_on_rq);
// This catches scheduler bugs early without runtime cost in release builds.
/// Scheduler bookkeeping (vruntime, deadline params, etc.).
pub sched_entity: SchedEntity,
/// Which CPUs this task may run on.
pub cpu_affinity: CpuSet,
/// Blocked and pending signal masks.
pub signal_mask: SignalSet,
/// Architecture-specific saved CPU context for context switching.
/// Defined per-architecture in `umka-core/src/arch/*/context.rs`.
///
/// | Architecture | Callee-saved registers | Additional state | Approx. size |
/// |---|---|---|---|
/// | x86-64 | RBP, RSP, RBX, R12-R15, RIP (return address) | SSE/AVX state pointer, FPU save area | ~96 bytes |
/// | AArch64 | X19-X28, X29 (FP), X30 (LR), SP | TPIDR_EL0 (TLS) | ~88 bytes |
/// | ARMv7 | r4-r11, lr, sp | TPIDR_URW (TLS) | ~40 bytes |
/// | RISC-V 64 | s0-s11, ra, sp | tp (thread pointer) | ~104 bytes |
/// | PPC32 | r14-r31, lr, sp (r1) | CR (condition register, callee-saved fields cr2-cr4) | ~80 bytes |
/// | PPC64LE | r14-r31, lr, sp (r1), TOC (r2) | CR (cr2-cr4), VRSAVE | ~96 bytes |
/// | s390x | r6-r15 (GPRs), f8-f15 (FPRs, callee-saved) | PSW (Program Status Word), access registers a0-a15 | ~256 bytes |
/// | LoongArch64 | s0-s8 ($r23-$r31), fp ($r22), ra ($r1), sp ($r3) | CSR.PRMD (Previous Mode), CSR.ERA (Exception Return Address) | ~96 bytes |
///
/// The architecture module's `context_switch(prev: &mut Task, next: &Task)`
/// saves the current registers into `prev.context` and restores from `next.context`.
///
/// Context switch instruction sequences:
/// - x86-64: `push` callee-saved GPRs to kernel stack, swap RSP, `pop` from new stack.
/// - AArch64: `stp`/`ldp` callee-saved pairs to/from `SavedContext` struct, swap SP.
/// - ARMv7: `stmia`/`ldmia` register list to/from `SavedContext`, swap SP.
/// - RISC-V 64: `sd`/`ld` callee-saved GPRs to/from `SavedContext`, swap SP.
/// - PPC32: `stmw`/`lmw` r14-r31 to/from `SavedContext`, swap r1.
/// - PPC64LE: `std`/`ld` r14-r31, r2 (TOC), LR to/from `SavedContext`, swap r1.
/// - s390x: `stmg r6,r15,off(prev)` / `lmg r6,r15,off(next)` saves/restores GPRs.
/// FPRs f8-f15 saved via `std`/`ld`. PSW restored via `lpswe` for return-to-user.
/// - LoongArch64: `st.d`/`ld.d` callee-saved GPRs to/from `SavedContext`, swap SP.
/// Return to user via `ertn` (loads PC from CSR.ERA, mode from CSR.PRMD).
///
/// `ArchContext` is an opaque type alias: `pub type ArchContext = arch::current::context::SavedContext;`
pub context: ArchContext,
/// Per-task capability restriction handle. Acts as a per-thread restriction
/// mask: can only **narrow**, never widen, the process-wide CapSpace
/// ([Section 9.1](09-security.md#capability-based-foundation)). A thread that restricts its own
/// capabilities cannot re-grant them to itself or other threads in the same
/// process. Used by sandboxed thread models (e.g., renderer threads that drop
/// filesystem capabilities after startup).
pub capabilities: CapHandle,
/// Embedded futex waiter node (Section 19.2.1). A task can block on at
/// most one futex at a time, so a single embedded node is sufficient.
/// Linked into a FutexHashTable bucket when the task calls futex_wait;
/// unlinked on wake or timeout. Uses intrusive linking to avoid heap
/// allocation under the bucket spinlock.
///
/// **Task exit safety**: When a task exits (including SIGKILL) while
/// linked into a futex bucket, the exit path (`do_exit`) must unlink
/// this node before freeing the Task struct. The cleanup sequence is:
/// 1. Acquire the futex bucket spinlock for this waiter's bucket.
/// 2. If `futex_waiter.is_linked()`, remove the node from the bucket
/// list (O(n) scan for intrusive singly-linked list, see
/// [Section 19.4](19-sysapi.md#futex-and-userspace-synchronization) for the rationale).
/// 3. Release the bucket spinlock.
/// This runs before address space teardown (step 3 of `do_exit`) and
/// before the Task struct is freed. The bucket spinlock serializes
/// against concurrent `futex_wake()` operations — a wakeup that races
/// with task exit will either see the node (and wake it, which is
/// harmless for a dying task) or see it already removed.
pub futex_waiter: FutexWaiter,
/// Pointer to the VFS ring this task is currently blocking on (set in
/// `reserve_slot()`, cleared in `complete_slot()`). Used by crash recovery
/// to identify tasks stuck in VFS operations on a crashed driver's ring.
/// Null when no VFS ring operation is in progress.
///
/// **Lifecycle invariant**: Must be null at every task lifecycle boundary:
/// - **Fork** (step 6d): initialized to `AtomicPtr::new(null_mut())`.
/// - **Exec** (step 5, `setup_new_exec`): cleared via `store(null_mut(), Release)`.
/// - **Exit** (`do_exit` step 3b-post): cleared via `store(null_mut(), Release)`.
///
/// Without this invariant, slab-recycled Task structs may carry stale pointers.
/// `sigkill_stuck_producers()` iterates all tasks and checks this field via range
/// comparison against the crashed ring set. A stale pointer from a recycled slab
/// slot can match, causing false SIGKILL to innocent tasks — including container
/// PID 1, which kills the entire container.
///
/// Set via `store(ring_ptr, Release)` in `reserve_slot()` after acquiring
/// the slot. Cleared via `store(null, Release)` in `complete_slot()` after
/// the response is consumed. Crash recovery reads with `load(Relaxed)` --
/// the downstream Acquire on ring_set.state provides the ordering fence.
pub current_vfs_ring: AtomicPtr<VfsRingPair>,
/// Scheduler upcall re-entrancy guard (Section 8.1.7.2).
///
/// Set to `true` by the kernel immediately before transferring control to
/// the userspace scheduler upcall handler, and cleared when the handler
/// calls `SYS_scheduler_upcall_resume()` or `SYS_scheduler_upcall_block()`.
///
/// While `in_upcall` is `true`:
/// - A new blocking condition does **not** trigger a nested upcall. Instead,
/// the kernel falls back to standard 1:1 blocking for the duration of the
/// handler's own blocking operations.
/// - Blocking syscalls entered from within the upcall handler return normally
/// when the I/O or wait completes; no re-entrancy into the upcall stack.
/// - If a scheduling event occurs that would ordinarily issue an upcall,
/// `upcall_pending` is set instead, and the upcall is delivered as soon
/// as the current handler exits.
///
/// `in_upcall` uses `Acquire`/`Release` ordering. The kernel stores
/// with `Release` when entering/exiting upcall context (ensuring all
/// preceding UpcallFrame writes are visible before the flag becomes
/// true). Signal delivery loads with `Acquire` (ensuring it observes
/// all UpcallFrame state written before the flag was set). If
/// `in_upcall` is only ever read on the task's current CPU (guaranteed
/// by the signal delivery model — signals are delivered on the target
/// task's CPU during return-to-userspace), `Relaxed` would also suffice
/// — but `Acquire`/`Release` is correct by default and costs nothing
/// on x86-64 (all loads are Acquire, all stores are Release).
pub in_upcall: AtomicBool,
/// Deferred upcall flag (Section 8.1.7.2).
///
/// Set to `true` by the kernel when a scheduling event occurs while
/// `in_upcall` is already `true` (i.e., the upcall handler is currently
/// executing). The pending upcall cannot be delivered immediately because
/// nesting would overwrite the `UpcallFrame` at the top of the upcall stack,
/// corrupting the original fiber's saved register state.
///
/// When the in-flight upcall handler calls `SYS_scheduler_upcall_resume()`
/// or `SYS_scheduler_upcall_block()`:
/// 1. `in_upcall` is cleared.
/// 2. If `upcall_pending` is `true`:
/// a. Clear `upcall_pending`.
/// b. Re-examine the scheduling state and, if a blocking event is still
/// pending, deliver the deferred upcall immediately before returning
/// to user space. This is equivalent to a normal upcall delivery.
///
/// At most one pending upcall is ever deferred: if multiple blocking
/// conditions arise while the handler is executing, they merge into a single
/// pending flag. The handler will re-examine the scheduler state on the next
/// delivery and observe all accumulated events.
pub upcall_pending: AtomicBool,
/// Per-thread nonce for UpcallFrame integrity verification (Section 8.1.7.3).
///
/// Generated from the hardware RNG (`RDRAND` on x86-64, `RNDR` on AArch64,
/// `seed` CSR on RISC-V with Zkr, platform entropy source on PPC) at thread
/// creation time (in `do_fork` / `do_clone`). Stored exclusively in kernel
/// memory — never exposed to userspace through any ABI, register, or memory
/// mapping.
///
/// When the kernel builds an `UpcallFrame` on the upcall stack, it writes
/// `magic_cookie = UPCALL_FRAME_MAGIC ^ upcall_frame_nonce` into the frame.
/// On `SYS_scheduler_upcall_resume`, the kernel verifies that
/// `frame.magic_cookie ^ upcall_frame_nonce == UPCALL_FRAME_MAGIC`. A frame
/// not written by the kernel's own upcall dispatch code will fail this check,
/// preventing a malicious fiber scheduler from forging frames with arbitrary
/// register state.
pub upcall_frame_nonce: u64,
/// Base address of the upcall stack registered via
/// `SYS_register_scheduler_upcall()`. Zero if no upcall handler is registered.
/// Used by `SYS_scheduler_upcall_resume()` validation (step 1) to verify that
/// the `UpcallFrame` pointer lies within the upcall stack, not the kernel stack
/// or the main user thread stack.
pub upcall_stack_base: usize,
/// Size in bytes of the upcall stack. Zero if no upcall handler is registered.
/// Minimum 8 KiB (enforced by `SYS_register_scheduler_upcall()`).
pub upcall_stack_size: usize,
/// File descriptor table reference — canonical owner of the fd table.
///
/// Shared via `Arc` for threads created with `CLONE_FILES`; cloned on `fork()`.
///
/// Threads within the same process that were created with `CLONE_FILES`
/// share a single `FdTable` instance via this `Arc`. Threads created
/// without `CLONE_FILES` (e.g., after `fork()` without `CLONE_FILES`,
/// or via `unshare(CLONE_FILES)`) own their own independent clone of
/// the table.
///
/// All fd operations (open, close, dup, read, write) go through this
/// reference. The `FdTable` itself is lock-protected; the `Arc` wrapper
/// allows the reference to be cloned cheaply at `fork()` / `clone()`.
///
/// On `exec()`, the table is kept but all close-on-exec (`O_CLOEXEC`)
/// file descriptors are closed atomically before the new binary gains
/// control.
///
/// # Interior mutability rationale
/// `ArcSwap` provides atomic replacement through `&Task` (shared reference).
/// `do_exit()` step 5 drops the fd table via `task.files.swap(empty_arc)`;
/// `unshare(CLONE_FILES)` replaces the table atomically. Read path:
/// `task.files.load()` (~3-5 cycles, no refcount increment).
/// See [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths--arcswap).
pub files: ArcSwap<FdTable>,
/// Per-task filesystem context: root directory, current working directory,
/// and umask. Shared across threads created with `CLONE_FS`; cloned on
/// `fork()` without `CLONE_FS` (same sharing model as `files`/`FdTable`).
///
/// Accessed by path resolution ([Section 14.1](14-vfs.md#virtual-filesystem-layer)) to determine
/// the starting point for relative paths (`fs.pwd`) and the chroot boundary
/// (`fs.root`). Modified by `chdir(2)`, `fchdir(2)`, `chroot(2)`, and
/// `umask(2)`.
///
/// **`AT_FDCWD` resolution**: When a `*at()` syscall receives `AT_FDCWD`
/// (-100) as `dirfd`, the VFS uses `task.fs.load().read().pwd` as the base
/// directory for relative path resolution.
///
/// **`CLONE_FS` semantics**: Threads created with `CLONE_FS` share the
/// same `Arc<RwLock<FsStruct>>`. A `chdir()` by one thread affects all
/// threads sharing the same `FsStruct`. Threads created without `CLONE_FS`
/// (including `fork()` children by default) receive an independent copy —
/// `Arc::new(RwLock::new(parent_fs.read().clone()))`.
///
/// # Interior mutability rationale
/// `ArcSwap` provides atomic replacement through `&Task` (shared reference).
/// `do_exit()` Step 6 releases dentry references by swapping in an empty
/// sentinel `FsStruct`. `unshare(CLONE_FS)` replaces the shared struct
/// with an independent copy. Without `ArcSwap`, mutation through `&Task`
/// is a compile error.
///
/// Read: `task.fs.load()` returns `ArcSwapGuard<RwLock<FsStruct>>`.
/// Write: `task.fs.store(new_fs)` under appropriate serialization.
/// Release: `task.fs.store(EMPTY_FS_SENTINEL)` in do_exit() Step 6.
/// See [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths--arcswap).
pub fs: ArcSwap<RwLock<FsStruct>>,
/// Namespace proxy — references to all namespace instances for this task.
///
/// Per-task (not per-process) because Linux has `task_struct->nsproxy` and
/// `setns(2)` / `unshare(2)` operate on individual threads, not the whole
/// process. Threads created with `CLONE_THREAD` (but without any
/// `CLONE_NEW*` flags) share the parent's `NamespaceSet` via `Arc::clone`.
/// `unshare()` replaces the calling thread's nsproxy without affecting
/// sibling threads — matching Linux semantics exactly.
///
/// Uses `ArcSwap` for interior mutability: `setns(2)` and `unshare(2)`
/// must swap the nsproxy through `&self` (the `current_task()` reference
/// is always shared). Read path: `task.nsproxy.load()`. Write path
/// (under `task.task_lock()`): `task.nsproxy.store(new_nsproxy)`.
///
/// Accessed as `task.nsproxy.load().{pid_ns, net_ns, mount_ns, ...}`.
/// See [Section 17.1](17-containers.md#namespace-architecture) for `NamespaceSet` definition.
pub nsproxy: ArcSwap<NamespaceSet>,
/// Cgroup membership for this task. Updated atomically during cgroup migration
/// (Section 17.2.7). Read by CPU bandwidth enforcement, memory OOM killer, and
/// cgroup freezer. Uses `ArcSwap` for lock-free reads on the scheduler hot path.
pub cgroup: ArcSwap<Cgroup>,
// Resource limits are stored in `self.process.rlimits` (the `Process` struct).
// All threads in a thread group share one `Process` and therefore one `RlimitSet`
// automatically — no separate `Arc<RlimitSet>` layer needed. Enforcement code
// reads `task.process.rlimits` directly. Writes (`setrlimit`, `prlimit64`) acquire
// `task.process.rlimit_lock`. Fork copies the entire `Process::rlimits` into the
// new child's `Process` struct — no shared Arc means no COW complexity.
/// Intrusive sibling node linking this task's process into its parent's child list.
///
/// Each process (specifically, each thread-group leader) is linked into
/// its parent process's `children: XArray<Pid, Arc<Process>>` via this node.
/// The node is inserted at `fork()` (parent acquires `Process::lock`,
/// inserts child's PID → Arc<Process> mapping) and removed on `do_exit()`
/// after reparenting orphans to the init process.
///
/// Using an intrusive node embedded in `Task` avoids per-child heap
/// allocation in the `fork()` fast path. The node is keyed by `Pid`
/// (not `*mut Task` or `Arc<Process>`) to avoid raw-pointer aliasing
/// and reference cycles across the parent-child boundary; the XArray
/// provides O(1) lookup by PID, O(1) insertion, and O(1) removal.
///
/// Only the thread-group leader's `sibling_node` is linked; non-leader
/// threads within the same process do NOT appear in the parent's child
/// list. The node is in an unlinked state (`IntrusiveNode::is_linked()`
/// returns `false`) for all non-leader tasks.
pub sibling_node: IntrusiveNode<Pid>,
// ===== Fields contributed by other subsystems =====
// Canonical Task struct = union of the fields above + the fields below.
/// Mutexes currently held by this task, ordered by lock level. Used by
/// `compute_effective_priority()` to find the max waiter priority across
/// all held mutexes for PI boosting ([Section 8.4](#real-time-guarantees)).
///
/// SmallVec-style tiered structure: inline array of 8 slots (one cache line,
/// covers 99.99% of tasks) with overflow Vec for rare deep nesting.
/// Protected by `task.pi_lock` (all access is under this lock).
///
/// **Why not a hard limit**: UmkaOS has Tier 1 isolation rings, live evolution
/// quiescence locks, CBS bandwidth locks, capability validation, and ML policy
/// parameter locks. Over 50 years, lock depth will grow as subsystems are added.
/// A hard `MAX_PI_CHAIN_DEPTH=10` is a timebomb — works for years, breaks
/// silently when a new subsystem adds lock level #11. The overflow Vec trades
/// one rare heap allocation for unbounded correctness.
///
/// **FMA integration**: when overflow is first allocated for a task, emit
/// `observe_kernel!(HeldMutexOverflow, task.tid, inline_count, overflow_len)`
/// so the ML framework can track which cgroups/workloads trigger deep nesting.
pub held_mutexes: HeldMutexes,
/// Base static priority of the task (nice-mapped for CFS, rt_priority for RT).
/// Range: CFS [-20, 19] mapped to [100, 139]; RT [0, 99].
/// Set by `sched_setscheduler(2)`, `nice(2)`, `setpriority(2)`.
/// The effective dynamic priority may differ due to PI boosting.
pub base_priority: i32,
/// Effective dynamic priority, accounting for priority inheritance (PI)
/// boosting from RT mutexes. Equals `base_priority` when no PI boost is
/// active. When a high-priority waiter is blocked on a mutex held by this
/// task, `effective_priority` is temporarily raised to the waiter's
/// priority. See [Section 8.4](#real-time-guarantees) for the full PI protocol.
pub effective_priority: i32,
// The signal queue lock (`siglock`) lives in `SignalHandlers`, NOT on the
// Task struct. All threads sharing `CLONE_SIGHAND` share one lock instance.
// Access via `task.siglock()` convenience accessor which delegates to
// `task.process.sighand.queue_lock`. See `SignalHandlers::queue_lock`.
//
// The disposition table lock (`sighand_lock`) is `task.process.sighand.lock`.
// Access via `task.sighand_lock()` convenience accessor.
/// Pending signal set for this specific task (thread-directed signals).
/// Process-wide signals are in `Process::shared_pending`. Signal delivery
/// checks both: `self.pending | self.process.shared_pending`.
pub pending: PendingSignals,
/// Timestamp (nanoseconds, monotonic) of the most recent context switch
/// that placed this task on-CPU. Used by CPU time accounting to compute
/// wall-clock execution time: `elapsed = now - last_switched_in`.
pub last_switched_in: u64,
/// Per-task resource usage accumulator. Contains CPU time counters
/// (utime_ns, stime_ns), context switch counts (nvcsw, nivcsw), and
/// other getrusage() fields. Initialized to zero at fork (step 6d).
/// Updated by the scheduler on context switch (utime/stime) and by
/// the page fault handler (majflt/minflt — see also the min_faults/maj_faults
/// AtomicU64 counters below, which are the fast-path counters; rusage fields are
/// the definitive getrusage() source).
///
/// Access as `task.rusage.utime_ns` / `task.rusage.stime_ns`.
/// Exposed via `clock_gettime(CLOCK_THREAD_CPUTIME_ID)` and `/proc/[pid]/stat`
/// fields 14-15. See [Section 8.7](#resource-limits-and-accounting) for the full
/// RusageAccum definition and accounting rules.
pub rusage: RusageAccum,
/// Set-child-tid address for `CLONE_CHILD_SETTID`. The kernel writes the
/// child's TID to this userspace address after fork/clone. NULL if
/// `CLONE_CHILD_SETTID` was not set.
///
/// # Interior mutability rationale
/// Written in `do_fork()` (set from clone args) and cleared in `exec_mmap()`
/// (step 3: old address is in destroyed address space). Both write sites
/// access through `&Task`. `AtomicPtr` provides interior mutability.
/// Read: `task.set_child_tid.load(Acquire)`.
/// Write: `task.set_child_tid.store(ptr, Release)`.
/// Clear: `task.set_child_tid.store(null_mut(), Release)`.
pub set_child_tid: AtomicPtr<u32>,
/// Clear-child-tid address for `CLONE_CHILD_CLEARTID`. On task exit, the
/// kernel writes 0 to this userspace address and performs `futex_wake()`
/// on it (used by pthread_join). NULL if `CLONE_CHILD_CLEARTID` was not set.
///
/// # Interior mutability rationale
/// Same as `set_child_tid` — written in fork, cleared in exec and exit,
/// all through `&Task`. `AtomicPtr` provides interior mutability.
/// Read: `task.clear_child_tid.load(Acquire)`.
/// Write: `task.clear_child_tid.store(ptr, Release)`.
/// Clear: `task.clear_child_tid.store(null_mut(), Release)`.
pub clear_child_tid: AtomicPtr<u32>,
/// Per-task LSM security blob pointer. Allocated from the LSM blob slab cache
/// at fork/clone time. Contains per-LSM security state (e.g., SELinux security
/// context label, AppArmor profile). See [Section 9.8](09-security.md#linux-security-module-framework).
pub security: NonNull<LsmBlob>,
/// Audit context for this task. Non-null when audit is enabled.
/// Contains the audit session ID (loginuid), audit context state, and
/// accumulated audit records for the current syscall.
pub audit_context: Option<Box<AuditContext>>,
/// I/O priority class and level for the block I/O scheduler.
/// Encoding matches Linux: bits [15:13] = class (0=none, 1=RT, 2=BE, 3=IDLE),
/// bits [12:0] = level within class (0=highest, 7=lowest for BE/RT).
/// Set by `ioprio_set(2)`, inherited at fork.
pub ioprio: u16,
/// Thread-local storage pointer (TLS). Architecture-specific:
/// x86-64: FS base (set via `arch_prctl(ARCH_SET_FS)` or `WRFSBASE`).
/// AArch64: TPIDR_EL0. RISC-V: tp register. Saved/restored by context switch.
/// Stored here for access by `/proc/[pid]/status` and ptrace.
pub tls_base: u64,
/// Robust futex list head pointer. Userspace address of the robust futex
/// list registered via `set_robust_list(2)`. On task exit, the kernel walks
/// this list and wakes any waiters on held futexes, setting the
/// `FUTEX_OWNER_DIED` bit. NULL if no robust list registered.
/// See [Section 19.4](19-sysapi.md#futex-and-userspace-synchronization--robust-futexes).
pub robust_list: *mut RobustListHead,
/// Restartable sequences (rseq) per-task state. Set via the `rseq(2)` syscall.
/// The `rseq_area` pointer is a userspace address to a `struct rseq` registered
/// by the C library (glibc 2.35+, musl). The kernel uses it to:
/// 1. Write the current CPU ID to `rseq->cpu_id` on context switch.
/// 2. Detect if a restartable sequence critical section was interrupted and
/// restart it by setting IP to `rseq->rseq_cs->abort_ip`.
///
/// Reset to NULL on exec (the old address is in the destroyed address space)
/// and in do_exit() step 3b-post (before mm teardown).
///
/// # Interior mutability rationale
/// Written in `rseq(2)` syscall handler (set userspace pointer), cleared in
/// `exec_mmap()` and `do_exit()` step 3b-post. All access through `&Task`.
/// `AtomicPtr` provides interior mutability. The context switch hot path
/// reads with `load(Relaxed)` (null check before writing cpu_id).
/// See [Section 19.4](19-sysapi.md#futex-and-userspace-synchronization--restartable-sequences-rseq).
pub rseq_area: AtomicPtr<u8>, // userspace pointer to struct rseq
/// Length of the rseq area (bytes). Typically `sizeof(struct rseq)` = 32.
///
/// # Interior mutability rationale
/// Paired with `rseq_area` — always written together. `AtomicU32` for
/// mutation through `&Task`. The context switch path reads this with
/// `load(Relaxed)` to validate rseq area bounds.
pub rseq_len: AtomicU32,
/// Signature value for rseq abort handler validation. The 4 bytes preceding
/// the abort_ip must match this signature, preventing ROP to arbitrary rseq
/// abort handlers.
pub rseq_sig: u32,
/// Per-task perf event list. Intrusive list of `PerfEvent` objects attached
/// to this task (via `perf_event_open` with `pid` targeting this task).
/// Used to context-switch performance counters on schedule-in/schedule-out.
/// Cleaned up by `perf_event_exit_task()` in do_exit() Step 3d
/// ([Section 20.8](20-observability.md#performance-monitoring-unit--perf-event-exit-cleanup)).
pub perf_events: IntrusiveList<PerfEvent>,
/// io_uring task context. None if this task has never called io_uring_enter().
/// Lazily allocated on first io_uring_enter(). Tasks that never use io_uring
/// pay zero overhead (Option is a single pointer, None = null). Cleaned up by
/// `io_uring_files_cancel()` in do_exit() Step 3e
/// ([Section 19.3](19-sysapi.md#io-uring-subsystem--iouring-exit-cleanup)).
pub io_uring_tctx: Option<Box<IoUringTaskCtx>>,
/// Ptrace state. Non-null when this task is being ptraced. Contains the
/// tracer's PID, ptrace options (PTRACE_O_*), and the pending ptrace
/// stop/signal state.
///
/// Wrapped in `SpinLock` for interior mutability: `ptrace_detach_locked()`
/// and `ptrace_init_child()` set/clear this field through `&Task` or
/// `Arc<Task>` (shared references). The `process_tree_write_lock` provides
/// logical synchronization; the SpinLock satisfies Rust's borrow checker.
pub ptrace: SpinLock<Option<Box<PtraceState>>>,
/// Exit code. Written by `do_exit()`: bits [15:8] = exit status from
/// `exit(2)`, bits [7:0] = signal number if killed. Read by `wait4(2)`.
/// Zero while the task is alive.
pub exit_code: AtomicI32,
// ===== Thread linkage =====
/// Intrusive list link for the thread group list (all threads sharing the
/// same `Process`). Linked at `clone(CLONE_THREAD)` / `do_fork()` into
/// `Process::thread_group`, unlinked at `do_exit()` before the Task is
/// freed. The thread group leader is always the first element. Used by
/// `/proc/[tgid]/task/` enumeration and by `kill(-tgid)` signal delivery
/// (which must iterate all threads in the group).
pub group_node: IntrusiveListLink,
/// Intrusive list link for the process group (session/pgrp) list. Links
/// the thread-group leader into the process group's task list (only leaders
/// are linked; non-leader threads have this in unlinked state). Used by
/// `kill(-pgrp)` to deliver signals to all processes in a process group
/// and by `getpgid(2)` / `setpgid(2)` for group membership management.
pub pid_group_node: IntrusiveListLink,
// ===== Kernel stack =====
/// Base address of this task's kernel stack (highest address; stacks grow
/// downward on all supported architectures). Allocated by `do_fork()` from
/// the kernel stack slab cache. The stack pointer is initialized to
/// `stack_base` at fork time; the guard page is mapped at
/// `stack_base - stack_size` to detect overflow.
pub stack_base: usize,
/// Size of the kernel stack in bytes. Default: 16,384 (16 KiB, 4 pages on
/// 4 KiB page architectures). Kernel threads performing deep call chains
/// (e.g., filesystem recursion in overlayfs) may request larger stacks via
/// `kthread_create_with_stack_size()`. Note: the `VmStk` field in
/// `/proc/[pid]/status` reports the userspace stack VMA, not this
/// kernel stack size.
pub stack_size: usize,
// ===== Fault accounting =====
/// Minor page fault count (page found in page cache, no disk I/O required).
/// Incremented atomically by the page fault handler on each minor fault.
/// Exposed via `/proc/[pid]/stat` field 10 (`minflt`).
pub min_faults: AtomicU64,
/// Major page fault count (page not in cache, disk I/O required to service).
/// Incremented atomically by the page fault handler on each major fault.
/// Exposed via `/proc/[pid]/stat` field 12 (`majflt`).
pub maj_faults: AtomicU64,
// ===== Exit signaling =====
/// Signal number to send to the parent process when this task exits.
/// Typically `SIGCHLD` (17) for `fork()` children. Set by the `exit_signal`
/// argument to `clone(2)` / `clone3(2)`. A value of 0 means no signal is
/// sent (used by threads created with `CLONE_THREAD`, which notify the
/// thread group leader internally rather than signaling the parent).
pub exit_signal: i32,
// ===== Scheduling hints (supplement cpu_affinity) =====
/// CPU index where the task last ran. Used by `select_task_rq()` for
/// cache-warmth heuristic: preferring the last-run CPU avoids TLB/L1/L2
/// cold-start penalties (~10-50 us on modern hardware). Updated by the
/// context switch path (`context_switch()`) when the task is scheduled
/// out. Initialized to the fork CPU in `wake_up_new_task()`.
/// See [Section 7.1](07-scheduling.md#scheduler--select_task_rq-cpu-selection-for-task-placement).
pub last_cpu: u32,
/// Set to `true` during CBS cgroup migration to prevent the replenishment
/// timer callback from enqueuing this task on the old CPU's runqueue while
/// the migration is in progress. Cleared once the task is fully installed
/// on the destination CPU's CBS server
/// ([Section 7.6](07-scheduling.md#cpu-bandwidth-guarantees--cbs-cgroup-migration)).
pub migration_pending: AtomicBool,
/// Popcount of `cpu_affinity` — the number of CPUs this task is allowed to
/// run on. Cached here to avoid repeated popcount computation in the
/// scheduler's load balancer (called on every tick). Updated atomically
/// whenever `cpu_affinity` is modified by `sched_setaffinity(2)` or
/// `cpuset` cgroup migration.
pub nr_cpus_allowed: u32,
// ===== Resource tracking (for /proc/[pid]/stat) =====
/// Monotonic boot-relative time (nanoseconds) when this task was created.
/// Set once at `do_fork()` from `ktime_get_boot_ns()`. Never modified.
/// Exposed via `/proc/[pid]/stat` field 22 (`starttime`, in clock ticks)
/// and via `clock_gettime(CLOCK_BOOTTIME)` comparisons.
pub start_time_ns: u64,
// Context switch counters (nvcsw, nivcsw) are stored in `rusage: RusageAccum`
// (see [Section 8.7](#resource-limits-and-accounting)). No standalone duplicates here.
// Access as `task.rusage.nvcsw` / `task.rusage.nivcsw`.
// Exposed via `/proc/[pid]/status` (`voluntary_ctxt_switches`, `nonvoluntary_ctxt_switches`).
// ===== Credentials (RCU-updated, per-task) =====
/// Per-task effective credentials, updated via RCU for lock-free reads on
/// the syscall fast path. While `Process::cred` holds the process-wide
/// credential baseline, this per-task `RcuCell` allows individual threads
/// to temporarily assume different credentials (e.g., via `setresuid(2)`
/// with `CLONE_THREAD` — Linux allows per-thread UIDs, and userspace
/// libraries like glibc broadcast credential changes to all threads, but
/// the kernel must support the per-thread case for ABI compatibility).
///
/// **Read path** (hot, every capability check): `task.cred.read()` returns
/// an `RcuRef<Arc<TaskCredential>>` — no locks, no atomic increments. The RCU read
/// side is implicit in the preempt-disabled syscall prologue.
///
/// **Write path** (cold, credential changes): allocate a new `Arc<TaskCredential>`
/// with updated fields, then `task.cred.update(new_cred)`. The old `TaskCredential`
/// is freed after an RCU grace period. The `Process::cred_lock` serializes
/// credential updates across threads in the same process.
pub cred: RcuCell<Arc<TaskCredential>>,
/// Cgroup migration state for this task. Tracks whether the task is
/// currently being migrated between cgroups (written by the cgroup
/// migration path under `cgroup_threadgroup_rwsem`, read by the scheduler
/// and accounting paths to defer charge/uncharge until migration completes).
///
/// Values: 0 = None (steady state), 1 = Migrating (migration in progress,
/// charge accounting deferred), 2 = Complete (migration finished, pending
/// charge reconciliation). Transitions: 0→1 on `cgroup_migrate_prepare`,
/// 1→2 on `cgroup_migrate_finish`, 2→0 on charge reconciliation completion.
///
/// AtomicU8 with Relaxed loads on the scheduler hot path (the scheduler
/// only needs a hint — if migration is in progress, it skips cgroup
/// bandwidth enforcement for this tick). The migration path uses
/// Release stores to ensure accounting writes are visible before the
/// state transition.
pub cgroup_migration_state: AtomicU8,
// --- Fields contributed by signal handling (see [Section 8.5](#signal-handling)) ---
/// Base address of the alternate signal stack (set by sigaltstack(2)).
/// Zero if no alternate stack is registered. Cleared on exec (step 5f)
/// because the old address space is destroyed.
/// See [Section 8.5](#signal-handling--sigaltstack).
pub sas_ss_sp: usize,
/// Size in bytes of the alternate signal stack. Zero if not registered.
pub sas_ss_size: usize,
/// Flags for the alternate signal stack (SS_AUTODISARM, SS_DISABLE).
/// Only modified by the owning thread via sigaltstack(2).
pub sas_ss_flags: u32,
// --- Fields contributed by credential model (see [Section 9.9](09-security.md#credential-model-and-capabilities)) ---
/// Credential generation counter. Incremented by `commit_creds()` on
/// every credential change (setuid, capset, setgroups, etc.). Captured
/// by `CapValidationToken` at validation time; KABI dispatch compares
/// the cached generation against the current value to detect stale tokens.
/// See [Section 9.9](09-security.md#credential-model-and-capabilities--credential-generation-counter).
pub cred_generation: AtomicU64,
// --- Fields contributed by RT mutex / PI protocol (see [Section 8.4](#real-time-guarantees)) ---
/// The RtMutex waiter struct for this task, if the task is currently
/// blocked waiting to acquire an RtMutex. `None` when not waiting.
/// Used by `pi_propagate()` to walk the PI chain.
pub rt_waiter: Option<RtMutexWaiter>,
/// Pointer to the RtMutex this task is blocked on, if any.
/// Used by `pi_propagate()` to traverse from waiter → owner → next mutex.
pub blocked_on_rt_mutex: Option<NonNull<RtMutex>>,
// --- Fields contributed by resource limits (see [Section 8.7](#resource-limits-and-accounting)) ---
/// Last scheduler tick number at which SIGXCPU was sent for RLIMIT_CPU.
/// Prevents sending SIGXCPU more than once per second.
pub last_sigxcpu_tick: u64,
}
// The canonical Task struct merges all fields above. Subsystem-specific
// documentation provides full semantics for each field at its defining section.
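// The wait-status encoding documented on `exit_code` above (bits [15:8] =
// exit status, bits [7:0] = termination signal) can be sketched as a pair of
// helpers. These names are illustrative, not kernel APIs; libc exposes the
// same decoding as WEXITSTATUS / WTERMSIG.
pub fn encode_wait_status(exit_status: i32, term_signal: i32) -> i32 {
    ((exit_status & 0xff) << 8) | (term_signal & 0x7f)
}
pub fn wexitstatus(status: i32) -> i32 { (status >> 8) & 0xff }
pub fn wtermsig(status: i32) -> i32 { status & 0x7f }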
// ===== Process flags (PF_*) — bits in Task.flags =====
//
// These are orthogonal to TaskState (scheduling state). A task can be
// RUNNING + PF_EXITING, INTERRUPTIBLE + PF_MEMALLOC, etc. Linux defines
// these in `include/linux/sched.h` as `#define PF_*`.
/// Task is in `do_exit()`. Signal delivery is inhibited; futex waiters
/// skip this task; scheduler accounting is finalized.
pub const PF_EXITING: u32 = 0x0000_0004;
/// Task is a workqueue worker thread.
pub const PF_WQ_WORKER: u32 = 0x0000_0020;
/// Task is a kernel thread (created via `kthread_create`).
pub const PF_KTHREAD: u32 = 0x0020_0000;
/// Task was killed by a signal (set in `do_exit()` when exit code has
/// signal bits set).
pub const PF_SIGNALED: u32 = 0x0000_0400;
/// Task is allowed to allocate from emergency reserves (set by the
/// page allocator for PF_MEMALLOC-context tasks like kswapd).
pub const PF_MEMALLOC: u32 = 0x0000_0800;
/// Task has called `fork()` but not yet `exec()`. Cleared on exec.
/// Used by process accounting and `/proc/[pid]/stat` flags field.
pub const PF_FORKNOEXEC: u32 = 0x0000_0040;
/// Task has been selected as OOM victim. Prevents double-kill and
/// grants unlimited memory allocation to allow the task to exit cleanly.
pub const PF_OOM_VICTIM: u32 = 0x0001_0000;
/// Task should not be selected for OOM kill (kernel threads, init).
pub const PF_KTHREAD_BOUND: u32 = 0x0040_0000;
// ===== Thread info flags (TIF_*) — bits in Task.thread_flags =====
//
// These flags are checked on every return-to-userspace transition.
// They are set by subsystems that need the task to do work before
// returning to user mode.
/// A signal is pending for this task. Checked on every return-to-userspace;
/// triggers `do_signal()` which dequeues and delivers the pending signal.
pub const TIF_SIGPENDING: u32 = 1 << 0;
/// A reschedule is needed for this task. Per-task counterpart to
/// `CpuLocalBlock.need_resched`. Set by cross-CPU wakeup IPIs targeting
/// this task. UmkaOS primarily uses the per-CPU `need_resched` flag, but
/// `TIF_NEED_RESCHED` is needed for the signal delivery slow path where
/// `send_signal()` must mark a remote task.
pub const TIF_NEED_RESCHED: u32 = 1 << 1;
/// A notification callback should be invoked on return to userspace.
/// Used by the audit subsystem, task_work infrastructure, and
/// `seccomp(SECCOMP_RET_USER_NOTIF)` to defer work until the syscall
/// exit slow path.
pub const TIF_NOTIFY_RESUME: u32 = 1 << 2;
/// Single-step trap pending (ptrace). Architecture-specific handling:
/// x86-64 sets TF in RFLAGS; AArch64 uses MDSCR_EL1.SS; RISC-V uses
/// dcsr.step. The flag is checked on return-to-userspace to re-arm
/// single-step after the ptrace stop.
pub const TIF_SINGLESTEP: u32 = 1 << 3;
/// Lazy rescheduling requested. Unlike TIF_NEED_RESCHED (which forces preemption
/// at the next kernel preemption point), TIF_NEED_RESCHED_LAZY only triggers
/// rescheduling at voluntary preemption points: return-to-user, cond_resched(),
/// and explicit schedule() calls. Used by ReschedUrgency::Lazy to reduce
/// unnecessary preemptions for latency-insensitive workloads.
/// See [Section 7.1](07-scheduling.md#scheduler--rescheduling-urgency-model).
pub const TIF_NEED_RESCHED_LAZY: u32 = 1 << 4;
/// OOM victim memory reserve access. Grants the dying task access to memory
/// reserves (below min watermark) to allow it to complete its exit path and
/// free memory. Set by the OOM killer; cleared when the victim's mm_struct
/// is torn down in exit_mm(). See [Section 4.2](04-memory.md#physical-memory-allocator--oom-killer).
pub const TIF_MEMDIE: u32 = 1 << 5;
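// Illustrative sketch (not kernel code) of the return-to-userspace check the
// TIF_* flags above drive. The WORK_* constants and the `handle` callback are
// local stand-ins mirroring TIF_SIGPENDING / TIF_NEED_RESCHED /
// TIF_NOTIFY_RESUME; the real entry code reads Task.thread_flags atomically.
pub const WORK_SIGPENDING: u32 = 1 << 0;
pub const WORK_NEED_RESCHED: u32 = 1 << 1;
pub const WORK_NOTIFY_RESUME: u32 = 1 << 2;

/// Loop until no work bits remain. Handling one item (e.g. do_signal()) may
/// set new bits — a delivered signal can request a reschedule — so the mask
/// is re-evaluated on every iteration.
pub fn exit_to_user_work(mut flags: u32, mut handle: impl FnMut(u32) -> u32) -> u32 {
    const WORK_MASK: u32 = WORK_SIGPENDING | WORK_NEED_RESCHED | WORK_NOTIFY_RESUME;
    while flags & WORK_MASK != 0 {
        flags = handle(flags);
    }
    flags
}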
// ===== Job control flags (JOBCTL_*) — bits in Task.jobctl =====
//
// Used by the POSIX job control mechanism (group stop, ptrace,
// SIGCONT/SIGSTOP). Matching Linux's `JOBCTL_*` bit definitions
// from `include/linux/sched/jobctl.h`.
/// Group stop has been initiated and this task should stop.
pub const JOBCTL_STOP_PENDING: u32 = 1 << 17;
/// Task should consume the group stop signal (one task per group
/// delivers the notification to the parent).
pub const JOBCTL_STOP_CONSUME: u32 = 1 << 18;
/// Ptrace trap-stop requested. Task will stop at the next ptrace
/// checkpoint.
pub const JOBCTL_TRAP_STOP: u32 = 1 << 19;
/// Ptrace trap-notify. Like TRAP_STOP but generates a PTRACE_EVENT
/// notification.
pub const JOBCTL_TRAP_NOTIFY: u32 = 1 << 20;
/// Task is in the process of trapping (between initiating the stop
/// and actually being dequeued). Used to prevent races between
/// concurrent SIGCONT and ptrace detach.
pub const JOBCTL_TRAPPING: u32 = 1 << 21;
/// Task is in a PTRACE_LISTEN state (listening for events without
/// actually stopping).
pub const JOBCTL_LISTENING: u32 = 1 << 22;
// ===== Syscall restart block =====
// RestartBlock: see [Section 8.5](#signal-handling--syscall-restart-mechanism) for the
// canonical definition (in signal-handling.md, alongside the sys_restart_syscall()
// consumer that pattern-matches on it).
// Variants: None, Nanosleep, FutexWait, PollTimeout, PosixTimer.
//
// Linux equivalent: `struct restart_block` in `include/linux/restart_block.h`.
// `MountDentry` — a (mount, dentry) pair identifying a location in the VFS mount tree.
// See [Section 14.1](14-vfs.md#virtual-filesystem-layer--mountdentry--vfs-location-pair) for the canonical
// definition (fields: `mnt: Arc<Mount>`, `dentry: Arc<Dentry>`).
// Used for `FsStruct.root` and `FsStruct.pwd`, as the result of path resolution,
// and as the base for `*at()` syscall directory arguments.
/// Per-task filesystem context: root directory, current working directory,
/// and file creation mask. Shared across threads in the same process that
/// were created with `CLONE_FS`. `fork()` without `CLONE_FS` creates an
/// independent deep copy.
///
/// **Initialization**: The init process (PID 1) initializes `root` and `pwd`
/// to the rootfs mount's root dentry. `umask` defaults to `0o022`.
///
/// **Concurrency**: Protected by the `RwLock` in `Task.fs`. Read lock is
/// taken by path resolution (very frequent — every `open`, `stat`, `access`,
/// `exec`, etc.). Write lock is taken by `chdir`, `fchdir`, `chroot`, and
/// `umask` (infrequent). The asymmetric read-heavy access pattern makes
/// `RwLock` the correct choice over `Mutex`.
///
/// **`exec()` behavior**: `exec()` does NOT replace `FsStruct`. The new
/// program inherits the calling thread's root, cwd, and umask unchanged.
/// This is POSIX-mandated behavior (exec preserves the process environment).
///
/// **Thread safety of `CLONE_FS`**: When threads share an `FsStruct` via
/// `CLONE_FS`, a `chdir()` by one thread is immediately visible to all
/// other threads in the group. `unshare(CLONE_FS)` breaks the sharing —
/// the calling thread gets an independent copy, and subsequent `chdir()`
/// calls affect only that thread.
pub struct FsStruct {
/// Chroot boundary. Path resolution never ascends above this point.
/// Set by `chroot(2)` (requires `CAP_SYS_CHROOT`). Defaults to the
/// real root of the mount namespace.
///
/// **Security invariant**: Once `chroot()` narrows the root, the process
/// cannot escape except via `chroot()` again with a path that resolves
/// within the current root (which requires `CAP_SYS_CHROOT`). The VFS
/// path resolution loop checks `current == root` to detect the boundary.
///
/// **Mount namespace invariant**: `root` must always refer to a mount
/// within the task's `nsproxy.mount_ns` (or a child/descendant mount
/// thereof). `chroot()` and `pivot_root()` enforce this — both operate
/// on paths resolved within the current mount namespace. A `setns(mnt_fd)`
/// that changes the mount namespace also resets `root` and `pwd` to the
/// new namespace's root mount.
pub root: MountDentry,
/// Current working directory. Used as the base for relative path
/// resolution. Set by `chdir(2)` and `fchdir(2)`.
///
/// **Deleted directory**: If the cwd directory is removed (`rmdir`),
/// the `MountDentry` reference keeps the dentry alive (positive dentry
/// with `i_nlink == 0`). Subsequent path resolution from the deleted
/// cwd returns `ENOENT` for any relative path — matching Linux behavior.
pub pwd: MountDentry,
/// File creation mask. Applied to the `mode` argument of `open(O_CREAT)`,
/// `mkdir`, `mknod`, and `mkfifo`: effective_mode = mode & ~umask.
/// Set by `umask(2)`, which returns the previous value.
///
/// Default value: `0o022` (owner gets full permissions; group and other
/// lose write permission). This is the traditional Unix default and
/// matches glibc/musl behavior when no explicit `umask()` is called.
///
/// Stored as `u32` (not `u16`) for alignment and to match the `umask(2)`
/// return type. Only the lower 9 bits (0o777) are meaningful.
pub umask: u32,
}
impl Clone for FsStruct {
fn clone(&self) -> Self {
FsStruct {
root: self.root.clone(),
pwd: self.pwd.clone(),
umask: self.umask,
}
}
}
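// Minimal sketch of the umask rule stated on the `umask` field above:
// effective_mode = mode & !umask. `apply_umask` is an illustrative free
// function, not a kernel API; only the low 9 permission bits of the mask
// participate, per the field doc.
pub fn apply_umask(mode: u32, umask: u32) -> u32 {
    mode & !(umask & 0o777)
}
// e.g. open(O_CREAT) with mode 0o666 under the default 0o022 creates 0o644.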
/// SmallVec-style container for mutexes currently held by a task.
///
/// Inline array of 8 `NonNull<RtMutex>` pointers (64 bytes — one cache line)
/// covers the common case (99.99% of tasks hold ≤8 mutexes simultaneously).
/// Overflow Vec allocated on first >8 (once per task lifetime, extremely rare).
///
/// All access is under `task.pi_lock` — no additional synchronization needed.
/// Overflow moves with task on migration (owned by task struct).
///
/// **Complexity**:
/// - `push()`: O(1) amortized. One predicted-not-taken branch for overflow check.
/// - `remove()`: O(n) scan of inline + overflow. n ≤ 8 in common case.
/// - `max_waiter_priority()`: O(n) scan. Used by `compute_effective_priority()`.
///
/// **Memory**: 80 bytes (inline[8] = 64B at offset 0, inline_count = 1B at offset 64,
/// 7B implicit alignment padding, overflow = 8B at offset 72). Same size as
/// `ArrayVec<NonNull<RtMutex>, 10>` (80 bytes) but with no hard limit on nesting.
/// The inline array spans the first cache line; overflow metadata occupies 16 bytes
/// in a second cache line.
pub struct HeldMutexes {
/// Inline storage for the first 8 held mutexes.
/// Slots `[0..inline_count)` are valid; slots `[inline_count..8)` are None.
inline: [Option<NonNull<RtMutex>>; 8],
/// Number of valid entries in the inline array (0..=8).
inline_count: u8,
/// Heap-allocated overflow for tasks holding >8 mutexes simultaneously.
/// `None` until the task first holds more than 8 mutexes simultaneously
/// (the vast majority of tasks never overflow).
/// Allocated with `Vec::with_capacity(4)` on first overflow (32 bytes).
/// Once allocated, retained for the task's lifetime (not shrunk back).
/// Box wrapping is deliberate: keeps the HeldMutexes struct at 80 bytes
/// (inline storage dominates cache use). The Vec header (24 bytes) is
/// accessed only on the rare overflow path (>8 held mutexes).
overflow: Option<Box<Vec<NonNull<RtMutex>>>>,
}
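// Illustrative sketch of the push/remove paths for the tiered container above,
// with `usize` keys standing in for `NonNull<RtMutex>` (assumption: only
// pointer identity matters here). Removal backfills an inline hole from the
// overflow Vec first, keeping the hot inline array dense; the overflow Box is
// retained once allocated, per the field doc.
pub struct HeldMutexesSketch {
    inline: [Option<usize>; 8],
    inline_count: u8,
    overflow: Option<Box<Vec<usize>>>,
}

impl HeldMutexesSketch {
    pub fn new() -> Self {
        HeldMutexesSketch { inline: [None; 8], inline_count: 0, overflow: None }
    }

    /// O(1): fill the next inline slot, or spill to the lazily allocated Vec.
    pub fn push(&mut self, m: usize) {
        if (self.inline_count as usize) < 8 {
            self.inline[self.inline_count as usize] = Some(m);
            self.inline_count += 1;
        } else {
            self.overflow
                .get_or_insert_with(|| Box::new(Vec::with_capacity(4)))
                .push(m); // first spill: the one rare heap allocation
        }
    }

    /// O(n) scan. Preserves the invariant that slots [0..inline_count) are valid.
    pub fn remove(&mut self, m: usize) -> bool {
        for i in 0..self.inline_count as usize {
            if self.inline[i] == Some(m) {
                if let Some(v) = self.overflow.as_mut().filter(|v| !v.is_empty()) {
                    self.inline[i] = v.pop(); // overflow entry moves inline
                } else {
                    self.inline_count -= 1;
                    let last = self.inline[self.inline_count as usize].take();
                    if i < self.inline_count as usize {
                        self.inline[i] = last; // backfill the hole
                    }
                }
                return true;
            }
        }
        if let Some(v) = self.overflow.as_mut() {
            if let Some(pos) = v.iter().position(|&x| x == m) {
                v.swap_remove(pos);
                return true;
            }
        }
        false
    }

    pub fn len(&self) -> usize {
        self.inline_count as usize + self.overflow.as_ref().map_or(0, |v| v.len())
    }
}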
**Global PID → Task lookup**: The kernel maintains a global `PidTable: Idr<Arc<Task>>` (integer-keyed radix tree, Section 3.13). `kill(pid, sig)` resolves through `pid_table.lookup(pid)` → `Arc<Task>` → signal delivery (Section 8.5). The Idr is RCU-protected for lockless reads (signal delivery is a hot path); write operations (fork, exit) acquire the PID table's write lock.

Signal delivery resolves the target via the sender's PID namespace: `PidNamespace.pid_map` translates the namespace-local PID to a global `TaskId`, then `PID_TABLE.get(task_id)` retrieves the `Arc<Task>` (Section 17.1). Cross-namespace signals (e.g., the host killing a container process) require `CAP_KILL` in the target's user namespace.
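The resolution path above can be sketched in miniature. Assumptions in this sketch: a `RwLock<HashMap>` stands in for the RCU-protected `Idr`, and `Task` and signal queuing are reduced to stubs — it illustrates the lookup flow, not the kernel types.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};

struct Task {
    tid: u32,
    pending: Mutex<Vec<i32>>, // stub for PendingSignals
}

struct PidTable {
    map: RwLock<HashMap<u32, Arc<Task>>>, // stands in for the RCU-protected Idr
}

impl PidTable {
    fn new() -> Self {
        PidTable { map: RwLock::new(HashMap::new()) }
    }
    // fork()/exit() take the write lock; lookup is the lock-light read path.
    fn insert(&self, task: Arc<Task>) {
        self.map.write().unwrap().insert(task.tid, task);
    }
    fn lookup(&self, pid: u32) -> Option<Arc<Task>> {
        self.map.read().unwrap().get(&pid).cloned()
    }
}

const ESRCH: i32 = 3;

// kill(pid, sig): resolve PID → Arc<Task>, then queue the signal (Section 8.5).
fn kill(table: &PidTable, pid: u32, sig: i32) -> Result<(), i32> {
    let task = table.lookup(pid).ok_or(ESRCH)?;
    task.pending.lock().unwrap().push(sig);
    Ok(())
}
```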
// `RusageAccum` — per-task resource usage accumulator. Contains all fields
// needed by getrusage(2) and /proc/[pid]/stat. Initialized to zero at fork
// (step 6d). Updated by the scheduler (utime_ns, stime_ns on context switch),
// the page fault handler (minflt, majflt), and the I/O paths (inblock, oublock).
//
// Canonical definition: [Section 8.7](#resource-limits-and-accounting--rusageaccum--per-task-accounting).
// All fields are `AtomicU64` for lock-free cross-CPU reads (e.g., `getrusage(2)`
// from another thread while the scheduler updates `utime_ns` on context switch).
// Peak RSS is in kilobytes (`peak_rss_kb`) to match Linux's `getrusage(2)`,
// which reports `ru_maxrss` in kilobytes.
/// All tasks (threads) sharing the same TGID, i.e., the same `Process`.
///
/// Embedded inside `Process`. A task is a member of exactly one `ThreadGroup`.
/// The task list is protected by the containing `Process::lock` (a `SpinLock`
/// guarding `thread_group` and `children`).
///
/// Design note: Linux embeds its per-group state in `signal_struct`. UmkaOS keeps it
/// in `ThreadGroup` to make the ownership boundary explicit: `Process` owns the
/// group metadata, not individual tasks.
pub struct ThreadGroup {
/// Number of live (non-zombie) threads in the group.
/// Decremented at thread exit (before zombie state), incremented at `clone()`.
pub count: AtomicU32,
/// Exit code set by the first `exit_group()` call or a fatal signal delivered
/// to any thread in the group. Encoded as a Linux-compatible wait status word
/// (WEXITSTATUS / WTERMSIG format) so that `waitid()` can return it directly.
/// `None` while the group is alive; `Some(code)` once exit has been initiated.
/// The first thread to call `exit_group()` or receive a fatal signal transitions
/// this from `None` to `Some(code)` with a test-and-set under the `SpinLock`
/// (check `is_none()`, then store); later exits leave the code unchanged.
/// Using `Option<i32>` avoids sentinel conflicts. Signed `i32` matches Linux's
/// `signal_struct.group_exit_code` (C `int`) and `do_exit(exit_code: i32)`.
pub exit_code: SpinLock<Option<i32>>,
/// Counter for the group stop protocol ([Section 8.5](#signal-handling--group-stop-protocol-posix-job-control)).
/// Set by `do_group_stop()` to the number of threads that need to stop.
/// Each thread decrements this in `do_signal_stop()`; when it reaches 0,
/// the last thread sends SIGCHLD(CLD_STOPPED) to the parent.
/// Using a counter avoids the TOCTOU race of iterating all threads to
/// check if they are all stopped (a concurrent SIGCONT could wake threads
/// between the iteration and the check).
pub group_stop_count: AtomicU32,
/// All `Task` structs in this group, linked via `Task::group_node`.
/// Intrusive list — each `Task` embeds a `group_node: IntrusiveListLink`
/// for O(1) insert/remove. The thread-group leader is always the first
/// element. Protected by `Process::lock` for mutations (thread
/// creation in `do_fork` step 17, thread removal in `release_task()`).
pub tasks: IntrusiveList<Task>,
}
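The `group_stop_count` countdown can be illustrated with a userspace sketch; `std` atomics stand in for the kernel's, and `thread_stopping` is a hypothetical helper name, not the kernel API:

```rust
use std::sync::atomic::{AtomicU32, Ordering::AcqRel};

/// Sketch of last-thread detection in the group stop protocol: each thread
/// entering the stopped state decrements the counter once; the decrement
/// that observes 1 (i.e., transitions 1 -> 0) belongs to the last stopping
/// thread, which alone notifies the parent with CLD_STOPPED.
fn thread_stopping(group_stop_count: &AtomicU32) -> bool {
    // fetch_sub returns the PREVIOUS value, so "== 1" means we just hit zero.
    group_stop_count.fetch_sub(1, AcqRel) == 1
}

fn main() {
    let count = AtomicU32::new(3); // do_group_stop() armed 3 threads
    let last_flags: Vec<bool> = (0..3).map(|_| thread_stopping(&count)).collect();
    // Only the third (last) decrement reports "I am the last one".
    assert_eq!(last_flags, vec![false, false, true]);
    println!("last stopper detected exactly once");
}
```

Because the decision is made by the atomic decrement itself, no thread ever needs to iterate the group to count stopped siblings, which is what eliminates the SIGCONT TOCTOU race described above.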
pub struct Process {
/// Kernel-unique process identifier.
pub pid: ProcessId,
/// Memory descriptor — page tables, VMA tree, mmap_lock, RSS counters.
/// Shared across threads via `Arc`; COW-forked children get a new `MmStruct`.
/// Kernel threads have `mm` pointing to an empty sentinel `MmStruct`.
/// See [Section 4.8](04-memory.md#virtual-memory-manager) for `MmStruct`.
///
/// # Interior mutability rationale
/// `ArcSwap<MmStruct>` provides atomic replacement through `&Process` (shared
/// reference via `Arc<Process>`). Replaced atomically in `exec_mmap()` (step 4b:
/// swap in new mm, drop old mm) and read on hot paths (page fault handler,
/// /proc reads, OOM badness calculation).
///
/// **OOM race resolution**: Prior to ArcSwap, `oom_badness()` could read
/// `process.mm.as_ref()` while `exec_mmap()` concurrently replaced the mm,
/// yielding a use-after-free. `ArcSwap::load()` returns a guard that extends
/// the `Arc<MmStruct>` lifetime for the duration of the read, eliminating
/// the race.
///
/// Read: `process.mm.load()` returns `ArcSwapGuard<MmStruct>` (~3-5 cycles).
/// Write: `process.mm.store(new_mm)` in exec_mmap (single-thread at PNR).
/// Kernel threads: `ArcSwap::new(EMPTY_MM_SENTINEL)` — sentinel has no VMAs,
/// zero RSS, null PGD. Code that checks for kernel thread uses
/// `process.mm.load().is_kernel_mm()` instead of `mm.is_none()`.
/// See [Section 3.1](03-concurrency.md#rust-ownership-for-lock-free-paths--arcswap).
pub mm: ArcSwap<MmStruct>,
/// Process-wide capability table (Section 9.1.1).
pub cap_table: CapSpace,
// NOTE: The fd table is NOT owned here. The canonical owner is `Task.files: Arc<FdTable>`.
// CLONE_FILES allows threads to share or have independent fd tables, and fork() creates
// a new Arc<FdTable> clone. Storing it in Process would conflict with that model.
/// Parent process (0 = None, for init). Uses `AtomicU64` for interior
/// mutability: reparenting during `do_exit()` writes through `&Process`
/// (shared reference via `Arc<Process>`). 0 serves as the `None` sentinel
/// (PID 0 is never a valid process ID).
///
/// Read: `let pid = process.parent.load(Acquire); if pid != 0 { ... }`
/// Write: `process.parent.store(new_parent_pid, Release);`
pub parent: AtomicU64,
/// Children of this process. XArray keyed by child PID provides O(1)
/// insertion at fork(), O(1) removal at child exit, and O(1) lookup by PID.
/// Per collection policy: integer-keyed mapping → XArray (not IntrusiveList,
/// not BTreeMap). Values are `Arc<Process>` for the child's thread-group.
/// The reverse edge (`parent`) is the scalar `AtomicU64` PID above (0 = none).
pub children: XArray<Pid, Arc<Process>>,
/// All tasks sharing this address space.
pub thread_group: ThreadGroup,
/// Process-wide baseline credentials (uid, gid, supplementary groups,
/// capability sets). Defined in [Section 9.9](09-security.md#credential-model-and-capabilities).
/// Per-task credential overrides are in `Task::cred` (RcuCell).
pub cred: TaskCredential,
/// Shared signal handler table for this thread group.
///
/// All threads created with `CLONE_SIGHAND` share the same `SignalHandlers`
/// instance. `sigaction(2)` modifies the shared table under `SignalHandlers::lock`;
/// signal delivery checks dispositions lock-free via the one-byte
/// `handler_cache` and reads full multi-field `SigAction` entries under
/// `queue_lock` (see the `SignalHandlers` locking discipline below).
///
/// On `exec()` the handler table is replaced: a fresh `SignalHandlers` is
/// allocated with all dispositions reset to `SigHandler::Default` (except
/// `SIG_IGN` entries, which are preserved per POSIX).
///
/// On `fork()` the child receives a *copy* of the parent's table (COW semantics:
/// a new `SignalHandlers` with the same entries). Threads created with
/// `CLONE_SIGHAND` share the parent's existing `Arc<SignalHandlers>` directly.
pub sighand: Arc<SignalHandlers>,
/// Process-wide pending signal set (shared across all threads).
/// Thread-directed signals go to `Task.pending`; process-directed signals
/// (e.g., `kill(pid, sig)`) go here. Signal delivery checks both:
/// `task.pending | task.process.shared_pending`.
pub shared_pending: PendingSignals,
/// Controlling terminal for this process's session.
///
/// `None` for processes that have no controlling terminal — typically daemons
/// that called `setsid()` and have not yet opened a terminal, or processes
/// started without a terminal (e.g., launched by a service manager).
///
/// Set by the TTY layer when a session leader opens a terminal without
/// `O_NOCTTY`, or explicitly via `TIOCSCTTY`. Cleared when the terminal
/// hangs up or when `TIOCNOTTY` is called by the session leader.
///
/// Protected by the session lock (`Session::lock`). The `Tty` type is fully
/// defined in Chapter 21 (User I/O); this field holds an `Arc` reference so
/// that the terminal device persists as long as any process retains it as its
/// controlling terminal, even after the last file-descriptor reference is closed.
///
/// This field caches `process.session().controlling_terminal` for fast access
/// from the TTY layer. It MUST be cleared for ALL processes in the session
/// when the terminal is disassociated (via `TIOCNOTTY`, `setsid()`, or
/// hangup). The clearing is performed under `Session::lock` by iterating
/// all process groups and their member processes.
pub tty: Option<Arc<Tty>>,
/// Registered exit cleanup actions. Executed by the kernel cleanup
/// thread during do_exit() Step 11b, after namespace release and before
/// zombie transition. Protected by SpinLock for concurrent registration
/// (multiple threads can register cleanups simultaneously).
/// Bounded by UMKA_MAX_EXIT_CLEANUPS (64).
/// See [Section 8.1](#process-and-task-management--task-exit-and-resource-cleanup).
pub exit_cleanups: SpinLock<ArrayVec<ExitCleanupEntry, 64>>,
/// Per-process SysV semaphore undo list. Accumulated adjustments from
/// `semop(SEM_UNDO)` calls. Reversed on process exit by `exit_sem()`
/// ([Section 8.2](#process-lifecycle-teardown)). Shared across threads via
/// `CLONE_SYSVSEM`. Each `SemUndoRef` is a reference to a `SemUndo`
/// entry that is ALSO linked in the corresponding `SemSet.undo_list`
/// (dual linkage for O(1) traversal from both the process side and
/// the semaphore-set side). Protected by SpinLock because `semop()`
/// may be called concurrently from multiple threads in the same process.
/// Bounded by SEMMNI (max semaphore sets per IPC namespace, default 32768).
/// A process can have at most one SemUndo entry per semaphore set it has
/// performed SEM_UNDO operations on. Pre-allocation strategy: Vec is
/// created with capacity 0 (no heap allocation for processes that never
/// use SysV semaphores); Vec::push in semop() pre-reserves outside the
/// lock critical section, then inserts under the lock.
pub sysvsem_undo: SpinLock<Vec<SemUndoRef>>,
/// WaitQueue for wait4()/waitid() callers. Woken when a child changes
/// state: exits (SIGCHLD/CLD_EXITED), is killed (CLD_KILLED/CLD_DUMPED),
/// stops (CLD_STOPPED), or continues (CLD_CONTINUED).
/// See [Section 8.2](#process-lifecycle-teardown--step-12c-send-sigchld-to-parent).
pub wait_chldexit: WaitQueue,
// --- Fields contributed by process groups/sessions ---
// See [Section 8.6](#process-groups-and-sessions).
/// Process group ID. Set on fork (inherited from parent), changed
/// by setpgid(2). Used by kill(-pgid) for group signal delivery.
pub pgid: Pid,
/// Session ID. Set on fork (inherited from parent), changed by
/// setsid(2). Used for controlling terminal assignment and job control.
pub sid: Pid,
/// Per-process resource limits. Shared by all threads in the thread group
/// (threads share the same `Process`, so they share `rlimits` automatically).
/// Protected by `rlimit_lock` for writes (`setrlimit`, `prlimit64`); read
/// lock-free by enforcement code on hot paths.
pub rlimits: RlimitSet,
/// Lock serializing writes to `rlimits`. Not needed for reads (enforcement
/// code reads individual limits atomically).
pub rlimit_lock: SpinLock<()>,
/// Amount of memory locked by this process (via `mlock(2)`, `mlockall(2)`).
/// Tracked in bytes. Checked against `RLIMIT_MEMLOCK` on each `mlock()` call.
/// Updated atomically by the VMM on lock/unlock operations.
pub locked_pages: AtomicU64,
/// Accumulated resource usage from all reaped children. Updated when
/// `wait4()` / `waitid()` reaps a child: the child's `rusage` fields
/// are added to this accumulator. Exposed via `getrusage(RUSAGE_CHILDREN)`.
/// Protected by `wait_chldexit` signaling ordering (only one waiter
/// reaps a given child).
pub children_rusage: RusageAccum,
/// Per-process lock protecting `thread_group.tasks` mutations (thread
/// creation/exit), `children` XArray mutations (fork/exit/reparent),
/// and multi-task atomic operations (ptrace reparent, process group
/// changes). Lock level: PROCESS_LOCK (pending assignment in
/// [Section 3.4](03-concurrency.md#cumulative-performance-budget--lock-ordering)).
///
/// Nested acquisition of two `Process::lock` instances is permitted ONLY
/// when acquired in ascending PID order (prevents ABBA deadlock during
/// reparenting). See [Section 8.2](#process-lifecycle-teardown) Step 12-pre for
/// the reparent lock protocol.
pub lock: SpinLock<()>,
}
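The ascending-PID rule for nested `Process::lock` acquisition can be sketched in userspace; `std::sync::Mutex` stands in for the kernel `SpinLock`, and `lock_pair` is an illustrative name, not the kernel API:

```rust
use std::sync::{Mutex, MutexGuard};

/// Minimal stand-in for Process: just the PID and its per-process lock.
struct Process {
    pid: u64,
    lock: Mutex<()>,
}

/// Acquire two Process locks in ascending-PID order. Because every path
/// that needs both locks takes the lower PID first, two concurrent
/// reparenting paths can never deadlock ABBA-style.
fn lock_pair<'a>(
    a: &'a Process,
    b: &'a Process,
) -> (MutexGuard<'a, ()>, MutexGuard<'a, ()>) {
    assert_ne!(a.pid, b.pid, "a process is never locked against itself");
    if a.pid < b.pid {
        let ga = a.lock.lock().unwrap();
        let gb = b.lock.lock().unwrap();
        (ga, gb)
    } else {
        let gb = b.lock.lock().unwrap();
        let ga = a.lock.lock().unwrap();
        (ga, gb)
    }
}

fn main() {
    let p1 = Process { pid: 100, lock: Mutex::new(()) };
    let p2 = Process { pid: 200, lock: Mutex::new(()) };
    // Both call orders acquire in the same underlying order (100, then 200).
    { let _g = lock_pair(&p1, &p2); }
    { let _g = lock_pair(&p2, &p1); }
    println!("both orders acquired without deadlock");
}
```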
/// Entry in the per-process exit cleanup list.
///
/// Each entry pairs a kernel-assigned ID (for cancellation via
/// `ExitCleanupHandle::cancel()`) with the action to execute.
pub struct ExitCleanupEntry {
/// Unique identifier for this cleanup registration. Monotonically
/// increasing per-process counter assigned at registration time.
pub id: CleanupId,
/// The action to execute when the process exits.
pub action: ExitCleanupAction,
}
/// Per-process signal handler table, shared across the thread group.
///
/// POSIX requires that `sigaction(2)` affects the whole process: all threads
/// observe the updated disposition. This is achieved by sharing a single
/// `SignalHandlers` instance (via `Arc`) among all threads created with
/// `CLONE_SIGHAND` — the same sharing flag used for the fd table in the
/// Linux/POSIX thread model.
///
/// # Locking discipline
///
/// - **Disposition type check** (`send_signal()` ignore filter): lock-free via
/// `handler_cache[signo - 1].load(Relaxed)`. One `AtomicU8` byte is naturally
/// atomic — no torn reads possible. This is the hot-path disposition check.
/// - **Full SigAction reads** (signal frame building in `handle_signal()`):
/// performed under `queue_lock` (SIGLOCK, level 40). The `SigAction` struct
/// is multi-field (`handler`, `sa_mask`, `sa_flags`); reading it without
/// synchronization could observe a torn value (e.g., new `handler` but old
/// `sa_mask`). This matches Linux's design trade-off: `sighand->siglock`
/// serializes both signal delivery reads and `sigaction()` writes. UmkaOS
/// splits the two roles (SIGHAND_LOCK for writes, SIGLOCK for delivery reads)
/// for better concurrency, but both provide the synchronization needed.
/// - **Writes** (`sigaction(2)`): acquire `lock` (SIGHAND_LOCK, level 30),
/// update the `SigAction` entry AND the corresponding `handler_cache` byte,
/// release. Writers are rare (handler installation at program startup);
/// the spinlock is never contended on the fast path.
pub struct SignalHandlers {
/// Signal dispositions, 0-indexed: signal N is at `action[N - 1]`.
///
/// Valid signals: 1 through `SIGMAX` (64 inclusive).
/// Access pattern: `unsafe { &mut *action.get() }[(signo - 1)]` with
/// SIGHAND_LOCK held for writes, or `unsafe { &*action.get() }[(signo - 1)]`
/// under `queue_lock` for reads during signal frame building.
///
/// Invariants enforced by the kernel:
/// - `action[SIGKILL - 1].handler` is always `SigHandler::Default`.
/// - `action[SIGSTOP - 1].handler` is always `SigHandler::Default`.
/// `sigaction(2)` rejects attempts to change either (returns `EINVAL`).
///
/// # Interior mutability
/// `UnsafeCell` wraps the action array because `SignalHandlers` is behind
/// `Arc` (shared reference via `Process.sighand`). Writes require
/// `lock` (SIGHAND_LOCK, level 30). Reads during signal delivery use
/// `handler_cache` for the disposition check (lock-free), and access
/// full `SigAction` entries under `queue_lock` when building the signal
/// frame. SAFETY invariant: SIGHAND_LOCK must be held for all writes;
/// queue_lock must be held for multi-field reads.
pub action: UnsafeCell<[SigAction; SIGMAX]>,
/// Spinlock protecting writes to `action` (SIGHAND_LOCK, level 30). Not
/// held during reads (signal delivery). Contention is extremely rare: only
/// `sigaction(2)` writes, which typically occurs only during process startup.
/// `sigaction()` updates `action[signo-1]` AND `handler_cache[signo-1]`
/// atomically under this lock.
pub lock: SpinLock<()>,
/// Per-sighand signal queue lock (SIGLOCK, level 40). Protects
/// `PendingSignals` mutations (enqueue, dequeue, flush), `signal_mask`,
/// and group stop state. Shared by all threads with `CLONE_SIGHAND`.
///
/// The `task.siglock()` convenience accessor delegates to
/// `task.process.sighand.queue_lock`.
///
/// IRQ-safe: acquired with interrupts disabled because signals may be
/// sent from interrupt context (e.g., SIGSEGV from page fault handler,
/// SIGALRM from timer IRQ).
///
/// Locking order: SIGLOCK(40) nests below SIGHAND_LOCK(30) and above
/// RQ_LOCK(50) in the global lock ordering table
/// ([Section 3.4](03-concurrency.md#cumulative-performance-budget--lock-ordering)).
pub queue_lock: SpinLock<()>,
/// Lock-free disposition cache for `send_signal()` ignore check.
///
/// Each byte caches the disposition type for the corresponding signal:
/// 0 = `SigHandler::Default`, 1 = `SigHandler::Ignore`, 2 = `SigHandler::User`.
/// Indexed by `signo - 1` (signal 1 = index 0).
///
/// Updated atomically by `sigaction()` AFTER updating the full `SigAction`
/// entry under `lock`. `send_signal()` reads with `Relaxed` ordering — one
/// byte is naturally atomic on all architectures, so no torn reads are possible.
/// This eliminates the need to acquire `lock` or `queue_lock` for the common
/// ignore-check in `send_signal()`.
///
/// Cost: 64 bytes. Updated on `sigaction()` (cold path). Read on
/// `send_signal()` (warm path). The alternative — acquiring `lock` per
/// signal send — is far more expensive.
pub handler_cache: [AtomicU8; SIGMAX],
}
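The write/read protocol can be shown with a reduced userspace model: a sketch in which a `std::sync::Mutex` stands in for SIGHAND_LOCK and the disposition collapses to one byte (the field and constant names here are illustrative):

```rust
use std::sync::atomic::{AtomicU8, Ordering::Relaxed};
use std::sync::Mutex;

const SIGMAX: usize = 64;
const DISP_DEFAULT: u8 = 0;
const DISP_IGNORE: u8 = 1;

#[derive(Clone, Copy)]
struct SigAction { disposition: u8, mask: u64, flags: u64 }

struct SignalHandlers {
    /// Full multi-field entries; the Mutex stands in for SIGHAND_LOCK.
    table: Mutex<[SigAction; SIGMAX]>,
    /// One lock-free disposition byte per signal, indexed by signo - 1.
    handler_cache: [AtomicU8; SIGMAX],
}

impl SignalHandlers {
    fn new() -> Self {
        SignalHandlers {
            table: Mutex::new([SigAction { disposition: DISP_DEFAULT, mask: 0, flags: 0 }; SIGMAX]),
            handler_cache: std::array::from_fn(|_| AtomicU8::new(DISP_DEFAULT)),
        }
    }

    /// sigaction(2) write path: update the full entry under the lock, THEN
    /// publish the disposition byte, so the cache never runs ahead of the
    /// entry it summarizes.
    fn sigaction(&self, signo: usize, act: SigAction) {
        let mut table = self.table.lock().unwrap();
        table[signo - 1] = act;
        self.handler_cache[signo - 1].store(act.disposition, Relaxed);
    }

    /// send_signal() fast path: the ignore check takes no lock at all.
    fn is_ignored(&self, signo: usize) -> bool {
        self.handler_cache[signo - 1].load(Relaxed) == DISP_IGNORE
    }
}

fn main() {
    let h = SignalHandlers::new();
    assert!(!h.is_ignored(17));
    h.sigaction(17, SigAction { disposition: DISP_IGNORE, mask: 0, flags: 0 });
    assert!(h.is_ignored(17)); // dropped signal detected without any lock
    println!("ignore fast path ok");
}
```

The design point: a one-byte summary can be read with `Relaxed` because it is consulted only as a filter; any path that needs the full `SigAction` (frame building) still takes the queue lock.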
/// Maximum signal number supported. The application-visible RT signal range is 34–64; SIGMAX = 64.
/// Matches Linux SIGRTMAX for 64-bit architectures (glibc uses signals 32–33
/// internally for NPTL; application-visible SIGRTMAX = 64).
pub const SIGMAX: usize = 64;
/// Architecture-specific saved context. Aliased from `arch::current::context::SavedContext`.
/// See `umka-core/src/arch/*/context.rs` for per-architecture definitions.
///
/// Minimum fields required by all architectures:
/// - Callee-saved general-purpose registers (per ABI)
/// - Stack pointer
/// - Return address / program counter
/// - Thread-local pointer (if used for CpuLocal)
/// - `orig_syscall_nr: u64` — the original syscall number, saved at syscall entry
/// for restart-after-signal. Per-architecture source register:
///
/// | Architecture | Source register | Entry path |
/// |---|---|---|
/// | x86-64 | RAX | `SYSCALL` entry saves RAX before dispatch overwrites it |
/// | AArch64 | X8 | `SVC #0` entry saves X8 |
/// | ARMv7 | R7 | `SWI` entry saves R7 |
/// | RISC-V 64 | A7 | `ecall` entry saves A7 |
/// | PPC32 | R0 | `sc` entry saves R0 |
/// | PPC64LE | R0 | `sc`/`scv` entry saves R0 |
/// | s390x | R1 (SVC code) | `SVC` entry saves the SVC instruction code |
/// | LoongArch64 | A7 | `SYSCALL` entry saves A7 |
///
/// The generic `do_signal()` restart path reads this field via
/// `arch::current::get_orig_syscall_nr(&task.context)` — a per-arch accessor
/// that returns the `orig_syscall_nr` field from the architecture's
/// `SavedContext` struct.
pub type ArchContext = arch::current::context::SavedContext;
// Forward declaration: `Tty` is fully defined in Chapter 21 (User I/O).
//
// A `Tty` represents an open terminal device — either a real hardware serial
// terminal or one side of a pseudo-terminal (PTY) pair. The controlling
// terminal is attached to a session via `TIOCSCTTY` or by the session leader
// opening the first terminal device after `setsid()` without `O_NOCTTY`.
//
// The field `Process::tty` holds `Option<Arc<Tty>>` so that the terminal
// device's reference count stays elevated for as long as any process holds it
// as a controlling terminal, independently of how many file descriptors point
// to the same device.
pub struct Tty; // forward declaration — see Chapter 21
// ---------------------------------------------------------------------------
// Supporting type definitions referenced by Task and Process
// ---------------------------------------------------------------------------
/// Scheduler entity embedded in `Task`. Carries all EEVDF scheduling state
/// for a task: virtual runtime, virtual deadline, eligibility, lag, and
/// deferred-dequeue status.
///
/// Full definition is in [Section 7.1](07-scheduling.md#scheduler),
/// as `EevdfTask`. The type alias `SchedEntity = EevdfTask` is used here for
/// clarity: in the process/task context, "scheduling entity" is the familiar
/// term. The scheduler chapter uses "EevdfTask" to emphasize the algorithm.
///
/// Embedded directly in `Task` (not heap-allocated) so that the scheduler
/// hot path can access it without pointer indirection.
pub type SchedEntity = EevdfTask; // defined in Section 7.1
/// File descriptor table — maps non-negative integer file descriptor numbers
/// to open file descriptions.
///
/// One `FdTable` may be shared among multiple tasks (threads) that were
/// created with `CLONE_FILES`. Each task holds an `Arc<FdTable>` reference;
/// the table's internal `SpinLock` serialises concurrent fd operations.
///
/// # Locking discipline
/// - All reads and writes to the fd array require holding `inner.lock`.
/// - `max_fds` is an `AtomicU64` updated under the lock; it may be read
/// without the lock for a conservative upper bound on valid fd indices.
/// Callers that need an exact bound must hold the lock.
///
/// # Lifecycle
/// - Created empty at process start (or after `unshare(CLONE_FILES)`).
/// - On `fork()` without `CLONE_FILES`: copied (COW — the copy is a fresh
/// `FdTable` with the same fd → `Arc<OpenFile>` entries, each `Arc`
/// cloned to bump the open-file reference count).
/// - On `fork()` with `CLONE_FILES` (thread creation): the `Arc` is cloned
/// (zero copy), both tasks share the same `FdTable`.
/// - On `exec()`: `O_CLOEXEC` fds are closed atomically under the lock
/// before the new binary's entry point runs. Non-cloexec fds remain open.
pub struct FdTable {
/// Lock-protected inner state.
pub inner: SpinLock<FdTableInner>,
/// Current highest allocated fd index plus one. Read with `Relaxed`
/// for a fast upper bound. The authoritative count of open fds is
/// the number of `Some` entries in `inner.fds`.
pub max_fds: AtomicU64,
/// Total number of currently open file descriptors. Updated under
/// `inner.lock`. Used by RLIMIT_NOFILE enforcement (Section 8.5.4).
/// u64 to match RLIMIT_NOFILE (u64) and satisfy counter longevity policy.
pub count: AtomicU64,
}
/// Lock-protected contents of `FdTable`.
pub struct FdTableInner {
/// Sparse fd-to-file mapping. `fds.load(n) = Some(f)` means fd `n` is open
/// and refers to open file description `f`. Empty slots mean the descriptor
/// is closed (or not yet allocated).
///
/// XArray provides O(1) lookup by integer key (fd numbers are non-negative
/// integers), RCU-compatible reads for `fstat()`/`read()`/`write()` fast
/// paths, and sparse storage (no wasted memory for gaps between open fds).
/// Per collection policy: integer-keyed mapping -> XArray.
/// Bounded by `RLIMIT_NOFILE.hard` for the owning process.
pub fds: XArray<RawFd, Arc<OpenFile>>,
/// Bitmap of close-on-exec descriptors. Bit `n` is set if fd `n` should
/// be closed at `exec()`. Maintained in sync with `fds`: when a slot is
/// cleared (closed), the corresponding cloexec bit is also cleared.
///
/// Separate from the `OpenFile` to avoid a per-file atomic on the
/// exec-close fast path (a single range clear on the bitmap is faster
/// than iterating each OpenFile's flags).
pub cloexec: BitVec,
}
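The exec-time cloexec sweep can be sketched with std collections; `BTreeMap` stands in for the XArray, `Vec<bool>` for the BitVec, and `close_cloexec` is an illustrative name:

```rust
use std::collections::BTreeMap;

/// Reduced model of FdTableInner.
struct FdTableInner {
    fds: BTreeMap<u32, String>, // fd -> open file description (a name here)
    cloexec: Vec<bool>,         // bit n set => close fd n at exec()
}

/// exec() step: close every O_CLOEXEC fd in one pass (done under the table
/// lock in the real kernel), keeping the bitmap in sync with the fd map.
fn close_cloexec(t: &mut FdTableInner) {
    let doomed: Vec<u32> = t
        .fds
        .keys()
        .copied()
        .filter(|&fd| t.cloexec.get(fd as usize).copied().unwrap_or(false))
        .collect();
    for fd in doomed {
        t.fds.remove(&fd);
        t.cloexec[fd as usize] = false; // invariant: closed slot => bit clear
    }
}

fn main() {
    let mut t = FdTableInner {
        fds: BTreeMap::from([
            (0, "tty".to_string()), (1, "tty".to_string()),
            (2, "tty".to_string()), (3, "secret.db".to_string()),
        ]),
        cloexec: vec![false, false, false, true],
    };
    close_cloexec(&mut t);
    assert_eq!(t.fds.keys().copied().collect::<Vec<_>>(), vec![0, 1, 2]);
    println!("cloexec fds closed before the new image runs");
}
```

The bitmap scan touches only the cloexec structure; no per-`OpenFile` flag is read, which is the rationale given above for keeping the bits out of `OpenFile`.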
/// An open file description (not a file descriptor — one description may be
/// referenced by multiple fds and multiple processes after `dup(2)` or `fork()`).
///
/// Forward declaration — fully defined in [Section 14.1](14-vfs.md#virtual-filesystem-layer).
pub struct OpenFile; // forward declaration — see Chapter 14
Each task holds its own CapHandle that can further restrict the process-wide
CapSpace but never widen it. This allows individual threads to voluntarily drop
privileges -- for example, a worker thread that processes untrusted input can shed
network capabilities before entering its main loop.
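The narrow-only property can be modeled as a bitmask whose per-task API exposes only removal. A minimal sketch: the names here (`CAP_NET`, `drop_cap`, `TaskCaps`) are illustrative, not the Chapter 9 capability API:

```rust
/// Illustrative capability bits (not the real UmkaOS capability encoding).
const CAP_NET: u64 = 1 << 0;
const CAP_FS_READ: u64 = 1 << 1;

/// Per-task view over the process-wide capability set. The only mutator
/// clears bits, so a thread can shed privilege but never regain it.
struct TaskCaps { effective: u64 }

impl TaskCaps {
    fn drop_cap(&mut self, cap: u64) { self.effective &= !cap; }
    fn has(&self, cap: u64) -> bool { self.effective & cap != 0 }
}

fn main() {
    // A worker thread sheds network capability before touching untrusted input.
    let mut worker = TaskCaps { effective: CAP_NET | CAP_FS_READ };
    worker.drop_cap(CAP_NET);
    assert!(!worker.has(CAP_NET));
    assert!(worker.has(CAP_FS_READ)); // unrelated capability untouched
    println!("worker dropped CAP_NET");
}
```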
8.1.2 Process Creation¶
Linux problem: fork() copies the entire process state -- page tables, file
descriptor table, signal handlers, credentials -- then the child almost always
immediately calls exec(), discarding everything that was just copied. The clone()
syscall provides fine-grained control via a combinatorial flag space (CLONE_VM,
CLONE_FILES, CLONE_FS, CLONE_SIGHAND, CLONE_NEWPID, ...) that is powerful but
difficult to use correctly.
UmkaOS native model: Capability-based spawn(). A new process is created with an
explicit set of capabilities, an address space, and an entry point. Nothing is inherited
implicitly -- the parent must grant each resource (memory regions, file descriptors,
capabilities) that the child should receive. This makes the child's authority set visible
and auditable at creation time.
/// Maximum capabilities that can be granted during a single `create_process()`.
/// 64 covers all realistic delegation scenarios; processes needing more can
/// delegate additional capabilities after creation.
const MAX_CAPS_IN_SPAWN: usize = 64;
/// Maximum fd remappings during a single `create_process()`.
/// 256 covers all realistic fd inheritance; matches Linux's SCM_MAX_FD.
const MAX_FDS_IN_SPAWN: usize = 256;
pub struct SpawnArgs {
/// ELF binary or entry point address.
pub entry: EntrySpec,
/// Capabilities to grant to the child (subset of caller's CapSpace).
/// Bounded by `MAX_CAPS_IN_SPAWN` to prevent unbounded kernel allocation
/// from userspace input.
pub granted_caps: ArrayVec<CapHandle, MAX_CAPS_IN_SPAWN>,
/// File descriptors to pass (remapped into child's fd table).
/// Bounded by `MAX_FDS_IN_SPAWN`.
pub fds: ArrayVec<(Fd, Fd), MAX_FDS_IN_SPAWN>,
/// Initial address space configuration.
pub address_space: AddressSpaceSpec,
/// CPU affinity for the initial task.
pub cpu_affinity: CpuSet,
/// Namespace set (inherit parent's or create new).
pub namespaces: NamespaceSpec,
}
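Of the `SpawnArgs` fields, the `fds` remap pairs carry the most subtle semantics: only explicitly listed descriptors appear in the child, at the positions the parent chose. A userspace sketch of that behavior (`remap_fds` is an illustrative helper, with `BTreeMap` standing in for the fd table):

```rust
use std::collections::BTreeMap;

/// Sketch of spawn-time fd passing: each (parent_fd, child_fd) pair copies
/// the parent's open file description into the child's table at child_fd.
/// Nothing not listed is inherited — the opposite of fork()'s copy-all.
fn remap_fds(
    parent: &BTreeMap<u32, String>,
    pairs: &[(u32, u32)],
) -> Result<BTreeMap<u32, String>, String> {
    let mut child = BTreeMap::new();
    for &(pfd, cfd) in pairs {
        let file = parent
            .get(&pfd)
            .ok_or_else(|| format!("EBADF: parent fd {} not open", pfd))?;
        child.insert(cfd, file.clone());
    }
    Ok(child)
}

fn main() {
    let parent = BTreeMap::from([
        (0, "tty".to_string()), (1, "tty".to_string()),
        (2, "tty".to_string()), (7, "log.pipe".to_string()),
    ]);
    // Pass stdio through unchanged; move the log pipe to fd 3 in the child.
    let child = remap_fds(&parent, &[(0, 0), (1, 1), (2, 2), (7, 3)]).unwrap();
    assert_eq!(child.keys().copied().collect::<Vec<_>>(), vec![0, 1, 2, 3]);
    assert_eq!(child[&3], "log.pipe");
    println!("child inherited exactly the 4 listed fds");
}
```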
Linux compatibility: fork() and clone() are implemented in the SysAPI layer
(Section 19.1) by translating to the underlying task/process primitives:
- `fork()` = `clone(SIGCHLD)` = COW address space copy + fd table copy +
  signal handler copy. Page table entries are marked read-only and
  reference-counted; the actual page copy is deferred to the write-fault
  handler (Section 4.8).
- `clone(CLONE_VM | CLONE_FILES | ...)` = create a new task within the same
  process (i.e., a thread). No address space copy, no fd table copy.
- `vfork()` = parent blocks until the child calls `exec()` or `_exit()`. The
  child temporarily shares the parent's address space (no COW overhead).
  Implementation details (matching Linux `kernel/fork.c` `copy_process()` +
  `wait_for_vfork_done()`):
  1. Completion primitive: `CompletionEvent` (a one-shot semaphore backed by
     a `WaitQueue` + `AtomicBool`; see Section 3.6).
  2. Storage: stack-allocated in the parent's kernel stack frame within
     `do_fork()`. No heap allocation — the `CompletionEvent` lives on the
     stack and is valid for the duration of the parent's `do_fork()` call.
     The child stores a raw pointer (`*const CompletionEvent`) in
     `child.vfork_done`; this pointer is only valid while the parent is
     blocked in `wait_for_completion_killable()`.
  3. Child task field: `pub vfork_done: AtomicPtr<CompletionEvent>` in
     `Task`. Set by `do_fork()` before the child is scheduled; cleared by the
     child on `exec()` or `_exit()` after signaling completion. Null = no
     vfork pending.
  4. Wait state: the parent waits in `TaskState::Killable` — only `SIGKILL`
     can interrupt the wait (not `SIGINT`, not `SIGTERM`). This matches
     Linux's `TASK_KILLABLE` state for vfork parents.
  5. Completion trigger: when the child calls `exec()` (in `do_execve()`) or
     `_exit()` (in `do_exit()`), it signals the completion. The guard MUST
     always be checked — see edge case 7 below:

     ```rust
     // AtomicPtr::swap atomically loads and clears (interior mutability
     // through &Task). Null = no vfork pending / already signaled.
     let done = task.vfork_done.swap(core::ptr::null_mut(), AcqRel);
     if !done.is_null() {
         // Check guard before dereferencing — parent may have been killed.
         if unsafe { (*done).guard.load(Acquire) } {
             unsafe { (*done).event.complete(); }
         }
     }
     ```
  6. Edge case — child calls `fork()`: if the child calls `fork()` before
     `exec()`, the grandchild gets a COW copy of the shared address space.
     The original parent remains blocked until the child (not the grandchild)
     calls `exec()` or `_exit()`. The grandchild's `vfork_done` is null (not
     inherited).
  7. Edge case — parent killed: if the parent receives `SIGKILL` while
     waiting, `wait_for_completion_killable()` returns early. The parent
     proceeds to `do_exit()`. The child's `vfork_done` pointer becomes
     dangling. Resolution uses an `AtomicBool` guard allocated alongside the
     `CompletionEvent` on the parent's kernel stack:

     ```rust
     // The CompletionEvent struct (see definition below) contains both the
     // guard: AtomicBool and the event: Completion. In the parent's do_fork()
     // stack frame (step 19), the CompletionEvent is allocated with guard=true.
     //
     // Before the parent unwinds (step 22 on SIGKILL, or in do_exit's mm_release):
     vfork_completion.guard.store(false, Release);
     // fence: Release ensures the guard write is visible before stack deallocation.

     // Child's completion trigger (same pattern as item 5 above):
     let done = task.vfork_done.swap(core::ptr::null_mut(), AcqRel);
     if !done.is_null() {
         if unsafe { (*done).guard.load(Acquire) } {
             unsafe { (*done).event.complete(); }
         }
     }
     ```

     A `PF_EXITING` check would be racy (TOCTOU between check and
     dereference). The `AtomicBool` guard is safe because: (a) the guard is
     checked with `Acquire`, which pairs with the parent's `Release` store,
     and (b) if the guard reads `true`, the parent has not yet unwound its
     stack frame (the `Release` store + fence ensures the guard write
     completes before any stack teardown begins).
- `clone3()` = the modern extensible version. Supported with the same
  `struct clone_args` layout as Linux 5.3+.
/// Clone flags bitmask. Matches Linux's `unsigned long` clone flags.
/// Individual flag constants (CLONE_VM, CLONE_FS, CLONE_FILES, etc.) are
/// defined as `pub const CLONE_VM: CloneFlags = 0x0000_0100;` etc.
pub type CloneFlags = u64;
/// CLONE_VFORK (0x0000_4000): The parent blocks until the child calls
/// exec() or _exit(). The child temporarily shares the parent's address
/// space with no COW overhead. Used by vfork() and posix_spawn().
pub const CLONE_VFORK: CloneFlags = 0x0000_4000;
/// CLONE_PIDFD (0x0000_1000): Create a pidfd for the child and write it
/// to clone_args.pidfd. Available since clone3(). Used by systemd
/// (PidfdWatch) and container runtimes. Mutually exclusive with
/// CLONE_PARENT_SETTID in legacy clone() (they share the same field),
/// but NOT conflicting in clone3() (separate fields).
pub const CLONE_PIDFD: CloneFlags = 0x0000_1000;
/// CLONE_PARENT (0x0000_8000): The child's parent is set to the caller's
/// parent (grandparent) rather than the caller. Used by init replacement
/// processes and some container runtimes.
pub const CLONE_PARENT: CloneFlags = 0x0000_8000;
CompletionEvent — vfork synchronization primitive:
/// Completion event with validity guard for vfork() parent-child
/// synchronization. Stack-allocated in the parent's kernel stack frame
/// within do_fork(). The child stores a raw pointer to this struct in
/// Task.vfork_done (an AtomicPtr<CompletionEvent>; null = no vfork pending).
///
/// The `guard` AtomicBool handles the edge case where the parent is
/// killed (SIGKILL) while blocked on the vfork wait. If the parent is
/// killed, its do_exit() path stores `false` into `guard` (with Release
/// ordering) before unwinding the stack frame. The child checks `guard`
/// (with Acquire ordering) before dereferencing `event` — if `false`,
/// the pointer is dangling and the child silently drops the vfork_done
/// reference without signaling.
///
/// Memory layout: both fields live in the parent's kernel stack frame.
/// The struct is valid from allocation in do_fork() until the parent
/// returns from wait_for_completion_killable() or its do_exit() path
/// stores guard=false and unwinds the stack.
pub struct CompletionEvent {
/// Validity guard. `true` = the parent is still blocked and the
/// `event` field is safe to dereference. `false` = the parent was
/// killed and is unwinding — do NOT touch `event`.
///
/// Written by the parent (Release) before stack unwind.
/// Read by the child (Acquire) before signaling completion.
pub guard: AtomicBool,
/// One-shot completion semaphore. The parent blocks on this via
/// `wait_for_completion_killable()`. The child signals via
/// `event.complete()` to wake the parent.
/// See [Section 3.6](03-concurrency.md#lock-free-data-structures--completion-one-shot-or-multi-shot-signaling-primitive).
pub event: Completion,
}
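The full handshake can be modeled in userspace with std primitives. This is a sketch under stated assumptions: `Mutex` + `Condvar` stand in for the kernel `Completion`, and parent and child "sides" run sequentially in one thread:

```rust
use std::ptr;
use std::sync::atomic::{AtomicBool, AtomicPtr, Ordering::{AcqRel, Acquire}};
use std::sync::{Condvar, Mutex};

/// Userspace stand-in for the kernel Completion (Section 3.6).
struct Completion { done: Mutex<bool>, cv: Condvar }

impl Completion {
    fn new() -> Self { Completion { done: Mutex::new(false), cv: Condvar::new() } }
    fn complete(&self) {
        *self.done.lock().unwrap() = true;
        self.cv.notify_all();
    }
    fn wait(&self) {
        let mut d = self.done.lock().unwrap();
        while !*d { d = self.cv.wait(d).unwrap(); }
    }
}

struct CompletionEvent { guard: AtomicBool, event: Completion }

/// Child side: swap the pointer out, then signal only if the guard says the
/// parent's stack frame is still live.
fn child_signal(vfork_done: &AtomicPtr<CompletionEvent>) {
    let done = vfork_done.swap(ptr::null_mut(), AcqRel);
    if !done.is_null() {
        let done = unsafe { &*done };
        if done.guard.load(Acquire) {
            done.event.complete();
        }
    }
}

fn main() {
    // "Parent stack frame": guard starts true, event unsignaled.
    let ev = CompletionEvent { guard: AtomicBool::new(true), event: Completion::new() };
    let vfork_done =
        AtomicPtr::new(&ev as *const CompletionEvent as *mut CompletionEvent);

    child_signal(&vfork_done); // child execs: signals the parent
    ev.event.wait();           // parent wakes immediately
    child_signal(&vfork_done); // second call sees null: no double-signal
    println!("vfork handshake complete");
}
```

Note that the `swap` makes the signal idempotent (the pointer is consumed), while the guard makes it safe against a parent that died first; the two mechanisms cover independent races.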
8.1.2.1 fork()/clone() Pre-Checks and Cgroup Inheritance¶
Before a new task is created, do_fork()/do_clone() performs a series of limit
checks and resource reservations. The checks are ordered so that each step can be
independently rolled back if a later step fails. The canonical check ordering:
do_fork(parent, flags) → Result<Arc<Task>>:
Phase A — Pre-checks and resource reservation (all reversible):
1. Validate clone flags: reject incompatible combinations.
- CLONE_THREAD without CLONE_VM → EINVAL (threads must share address space).
- CLONE_THREAD without CLONE_SIGHAND → EINVAL (threads must share signal handlers).
- CLONE_SIGHAND without CLONE_VM → EINVAL (signal sharing requires shared mm).
- CLONE_NEWUSER combined with CLONE_THREAD → EINVAL.
- CLONE_FS combined with CLONE_NEWNS → EINVAL.
2. LSM security permission check: `security_task_create(parent, clone_flags)`.
Invoke the registered LSM hook (SELinux/AppArmor/SMACK) for permission only.
The LSM may deny fork based on the parent's security context, the requested
clone flags, or domain transition policy. If denied: return EACCES.
**This is a permission check only -- no blob allocation.** The per-task
security blob is allocated later in step 7b (`security_task_alloc`) once
the child Task struct exists and credentials are copied. Linux merged
both roles into a single `security_task_alloc()` hook (removing
`security_task_create()` in 2017), but UmkaOS separates them for
clarity: step 2 checks, step 7b allocates.
3. cgroup_can_fork() — cgroup controller pre-checks.
a. pids_can_fork(): atomically increment pids.current in the task's cgroup
(walking from the task's cgroup to the root, incrementing each ancestor).
If any ancestor's pids.current > pids.max after increment:
- Decrement all already-incremented ancestors.
- Increment pids.events_max for the failing cgroup.
- Return EAGAIN.
b. Other controllers' can_fork hooks (if any; currently only pids has one).
4. RLIMIT_NPROC check: count tasks owned by this UID (via `UserEntry.task_count`).
Atomically increment `user_entry.task_count.fetch_add(1, AcqRel)`.
   If count >= rlim[RLIMIT_NPROC].rlim_cur AND the task does NOT have CAP_SYS_RESOURCE:
   - Undo this step's increment: user_entry.task_count.fetch_sub(1, AcqRel).
   - Rollback step 3: call cgroup_cancel_fork() which decrements pids.current
     in all ancestors that were incremented.
   - Return EAGAIN.
5. PID allocation (`alloc_pid`): allocate a PID in each namespace from the
task's leaf PID namespace up to the init namespace (bottom-up walk).
The child is visible with a different PID number at each level.
Allocation proceeds bottom-up:
level 0 = the child's leaf PID namespace (deepest)
level N = init PID namespace (ancestor, root)
For each level, from deepest to shallowest:
pid_nr[level] = ns.pid_map.alloc()?;
ns.reverse_map.insert(child_task_id, pid_nr[level]);
// pid_map is an Idr<TaskId> — allocates the lowest free PID in [1, pid_max).
// reverse_map is an RcuIdr<u32> — enables O(1) lookup from TaskId to local pid_t.
**PID 1 in new namespaces (CLONE_NEWPID)**: When the child's leaf PID
namespace was freshly created (its IDR is empty), the first `pid_map.alloc()`
returns PID 1. The child becomes the namespace's init process (child reaper).
Store the task reference: `leaf_ns.child_reaper = Some(Arc::clone(&child_task));`
PID 1 has special signal handling: default-disposition signals are silently
dropped, and SIGKILL/SIGSTOP from within the namespace are blocked. See
[Section 8.5](#signal-handling) for the full PID 1 signal protection specification.
If allocation fails at any intermediate level (PID exhaustion at an ancestor):
- Free PIDs already allocated in levels 0..failing_level-1 (reverse order).
- Rollback step 4: decrement user_entry.task_count.
- Rollback step 3: call cgroup_cancel_fork().
- Return EAGAIN.
The allocated PIDs are stored in the per-namespace `pid_map` and `reverse_map`
structures, NOT in the Task struct. The function `task_pid_nr_ns(task, ns)` walks
namespace ancestry at query time via `ns.reverse_map.lookup(task_id)`.
CLONE_INTO_CGROUP early resolution: if `flags.contains(CLONE_INTO_CGROUP)`,
resolve the target cgroup from `clone_args.cgroup` fd NOW (before step 3)
and use it as the target cgroup for `pids_can_fork()` in step 3. This
ensures pids.current is incremented on the correct cgroup hierarchy.
See the extended clone3() flags section for the full specification.
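The bottom-up walk with reverse-order rollback can be sketched with a `BTreeSet` standing in for the per-namespace `Idr` (lowest-free allocation in `[1, pid_max)`). The fresh-namespace case also shows why the first child of a new PID namespace gets PID 1:

```rust
use std::collections::BTreeSet;

/// Sketch of a per-namespace lowest-free-PID allocator.
struct PidNs { allocated: BTreeSet<u32>, pid_max: u32 }

impl PidNs {
    fn alloc(&mut self) -> Option<u32> {
        let n = (1..self.pid_max).find(|n| !self.allocated.contains(n))?;
        self.allocated.insert(n);
        Some(n)
    }
    fn free(&mut self, n: u32) { self.allocated.remove(&n); }
}

/// Level 0 is the child's leaf namespace; the last level is init.
/// On exhaustion at any level, free completed levels in reverse order.
fn alloc_pid(levels: &mut [PidNs]) -> Result<Vec<u32>, i32> {
    const EAGAIN: i32 = 11;
    let mut taken: Vec<u32> = Vec::new();
    for i in 0..levels.len() {
        match levels[i].alloc() {
            Some(n) => taken.push(n),
            None => {
                // PID exhaustion at an ancestor: unwind levels 0..i.
                for j in (0..i).rev() { levels[j].free(taken[j]); }
                return Err(EAGAIN);
            }
        }
    }
    Ok(taken)
}
```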
Phase B — Resource allocation and state copying:
6. Allocate child Task struct from the task slab cache (GFP_KERNEL).
The slab returns a raw pointer; this is immediately wrapped in `Arc`:
```rust
let task_ptr = TASK_SLAB.alloc(GfpFlags::KERNEL)?;
// SAFETY: task_ptr is a valid, exclusive pointer to uninitialized memory
// from the slab. The Arc metadata (strong count, weak count) lives in an
// allocation header managed by the slab, laid out as `Arc::from_raw`
// expects, so the Arc is constructed directly over the slab object: no
// copy out of the slab, no second heap allocation. The refcount is set
// to 1 as part of this initialization; the Task fields themselves are
// written by steps 6a-6d below.
let child: Arc<Task> = unsafe { Arc::from_raw(task_ptr.cast::<Task>()) };
```
The `Arc<Task>` is the ownership type used throughout the kernel:
`do_fork()` returns `Result<Arc<Task>>`, the scheduler holds
`Arc<Task>` in runqueue entries, and `release_task()` drops the
final Arc reference to trigger slab deallocation.
If allocation fails (ENOMEM):
- Free the allocated PID(s).
- Rollback steps 3-4.
- Return ENOMEM.
6a. Allocate Process struct: if CLONE_THREAD is NOT set (new process, not thread),
allocate `Process` from slab (`slab_alloc::<Process>(GFP_KERNEL)`).
Initialize ALL Process fields:
- `pid`: from PID allocated in step 5.
- `mm`: set in step 6b (fork) or step 11 (CLONE_VM share).
- `cap_table`: initialized in step 9a (cap_space_fork).
- `parent`: set in step 17 (CLONE_PARENT aware).
- `children`: `XArray::new()` — new process has no children.
- `thread_group`: ThreadGroup { pid_ns from parent or new PID namespace,
tgid = allocated PID, leader = child Task pointer }.
- `cred`: `TaskCredential::clone(&parent.process.cred)` — process-wide
baseline credentials copied from parent. Distinct from Task.cred
(per-task effective credentials set in step 7).
- `sighand`: set in step 14.
- `shared_pending`: `PendingSignals::empty()` — POSIX: child starts
with no process-wide pending signals.
- `tty`: `parent.process.tty.as_ref().map(Arc::clone)` — inherit the
parent's controlling terminal (Linux behavior).
- `exit_cleanups`: `SpinLock::new(ArrayVec::new())` — empty.
- `sysvsem_undo`: `SpinLock::new(Vec::new())` — no SysV semaphore state.
- `wait_chldexit`: `WaitQueue::new()` — no waiters yet.
- `pgid`: `parent.process.pgid` — inherit parent's process group.
- `sid`: `parent.process.sid` — inherit parent's session.
- `rlimits`: `parent.process.rlimits.clone()` — copy parent's resource limits.
- `rlimit_lock`: `SpinLock::new(())`.
- `locked_pages`: `AtomicU64::new(0)` — no locked memory in child.
- `children_rusage`: `RusageAccum::zero()` — no reaped children yet.
- `lock`: `SpinLock::new(())`.
6b. Allocate MmStruct: if CLONE_VM is NOT set (fork, not thread),
allocate `MmStruct` from slab (`slab_alloc::<MmStruct>(GFP_KERNEL)`).
Initialize `mm.users = AtomicU32::new(1)`.
Page tables are populated later in step 11.
6c. (Moved to step 7b — LSM security blob allocation runs AFTER copy_creds
and sched_fork, matching Linux `copy_process()` ordering. See step 7b.)
6d. Initialize child Task scalar fields. Slab memory is NOT guaranteed to be
zeroed (the slab may reuse a previously freed Task), so every field must be
explicitly initialized. Fields not listed here are set by subsequent steps
(7-15) or are inherited from the parent during copy phases.
// Task name: copy from parent. On exec(), this is overwritten with the
// new executable's basename. For fork() without exec(), the child inherits
// the parent's comm (matching Linux behavior — /proc/[pid]/comm shows the
// parent's name until the child calls exec() or prctl(PR_SET_NAME)).
child.comm = parent.comm;
// Scheduling and state
child.state = AtomicU32::new(TaskState::NEW.bits()); // NOT RUNNING
child.flags = AtomicU32::new(PF_FORKNOEXEC);
child.thread_flags = AtomicU32::new(0);
child.jobctl = AtomicU32::new(0);
// Exit and signaling
child.exit_code = AtomicI32::new(0);
child.exit_signal = if flags.contains(CLONE_THREAD) { 0 }
else { clone_args.exit_signal as i32 };
// vfork completion (set to actual pointer later if this IS a vfork)
child.vfork_done = AtomicPtr::new(core::ptr::null_mut());
// Pending signals: empty (POSIX — child starts with no pending signals)
child.pending = PendingSignals::empty();
child.signal_mask = parent.signal_mask;
// CLONE_CHILD_SETTID / CLONE_PARENT_SETTID / CLONE_CHILD_CLEARTID
child.set_child_tid = AtomicPtr::new(if flags.contains(CLONE_CHILD_SETTID) {
clone_args.child_tid as *mut u32
} else {
core::ptr::null_mut()
});
child.clear_child_tid = AtomicPtr::new(if flags.contains(CLONE_CHILD_CLEARTID) {
clone_args.child_tid as *mut u32
} else {
core::ptr::null_mut()
});
// TLS base: from clone_args if CLONE_SETTLS, otherwise inherit parent
child.tls_base = if flags.contains(CLONE_SETTLS) {
clone_args.tls
} else {
parent.tls_base
};
// Resource accounting
child.start_time_ns = ktime_get_boot_ns();
child.min_faults = AtomicU64::new(0);
child.maj_faults = AtomicU64::new(0);
child.last_switched_in = 0;
// Upcall state
child.in_upcall = AtomicBool::new(false);
child.upcall_pending = AtomicBool::new(false);
child.upcall_stack_base = 0;
child.upcall_stack_size = 0;
child.upcall_frame_nonce = arch::current::rng::get_random_u64();
// Signal alternate stack (NOT inherited — old addresses in parent's
// address space. POSIX: child starts with no alt stack registered).
child.sas_ss_sp = 0;
child.sas_ss_size = 0;
child.sas_ss_flags = 0;
// Credential generation (starts fresh — no stale CapValidationTokens)
child.cred_generation = AtomicU64::new(0);
// SIGXCPU accounting
child.last_sigxcpu_tick = 0;
// Priority and PI state
child.base_priority = parent.base_priority;
child.effective_priority = parent.base_priority; // no PI boost at fork
child.held_mutexes = HeldMutexes::empty();
child.rt_waiter = None;
child.blocked_on_rt_mutex = None;
// CPU affinity
child.cpu_affinity = parent.cpu_affinity.clone();
// Capabilities (per-task handle, distinct from Process.cap_table)
child.capabilities = CapHandle::default(); // set properly in step 9a
// Futex waiter: initialized unlinked. A task can block on at most one
// futex at a time; the waiter node is linked into a FutexHashTable bucket
// only when the task calls futex_wait(). Children never inherit a parent's
// futex wait state (POSIX: child starts as if it called fork(), not as if
// it inherited the parent's blocked syscall).
child.futex_waiter = FutexWaiter::new();
// Perf events, ptrace, io_uring, audit — all empty/None at fork
child.perf_events = IntrusiveList::new();
child.io_uring_tctx = None;
child.ptrace = None;
child.audit_context = None;
// Intrusive list links — group_node is linked into Process::thread_group
// in step 17; pid_group_node and sibling_node are initialized unlinked.
child.group_node = IntrusiveListLink::new();
child.pid_group_node = IntrusiveListLink::new();
child.sibling_node = IntrusiveNode::new(child.tid);
// Resource usage accounting (rusage)
child.rusage = RusageAccum::zero();
// I/O and misc
child.ioprio = parent.ioprio; // inherit I/O priority
child.robust_list = core::ptr::null_mut(); // NOT inherited
child.rseq_area = AtomicPtr::new(core::ptr::null_mut()); // NOT inherited
child.rseq_len = AtomicU32::new(0);
child.rseq_sig = 0;
// VFS ring pointer: NOT inherited. A child task is never mid-VFS-operation
// at birth. Without this, slab recycling can leave a stale pointer that
// sigkill_stuck_producers() matches against a crashed ring set, sending
// false SIGKILL to the innocent child. See field doc comment for the full
// lifecycle invariant.
child.current_vfs_ring = AtomicPtr::new(core::ptr::null_mut());
child.nr_cpus_allowed = parent.nr_cpus_allowed;
child.migration_pending = AtomicBool::new(false);
child.cgroup_migration_state = AtomicU8::new(0);
child.oom_score_adj = AtomicI16::new(parent.oom_score_adj.load(Relaxed));
child.restart_block = RestartBlock::None;
// CLONE_PARENT_SETTID: write child TID to parent-visible address NOW.
// NOTE: This is an externally-visible side effect (put_user to parent
// address space), placed inside the "initialize scalar fields" step for
// proximity to the other clone_args-driven field initializations. The
// write must happen before the child is made runnable (step 21) so that
// the parent can observe the TID after fork returns.
if flags.contains(CLONE_PARENT_SETTID) {
put_user::<u32>(child.tid.as_u32(), clone_args.parent_tid as *mut u32)?;
}
7. Copy/share credentials:
let parent_cred = parent.cred.read();
child.cred = RcuCell::new(Arc::clone(&*parent_cred));
No new allocation — both parent and child reference the same immutable
credential set. Credentials are only replaced (not mutated) on exec or
explicit setuid/setgid.
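The share-then-replace discipline can be sketched with `Mutex<Arc<Cred>>` standing in for the kernel's `RcuCell` (the type and method names here are illustrative): fork bumps a refcount, and setuid publishes a brand-new `Cred` rather than mutating the shared one, so the child is never affected by the parent's later credential changes.

```rust
use std::sync::{Arc, Mutex};

#[derive(Clone)]
struct Cred { euid: u32 }

struct TaskCred(Mutex<Arc<Cred>>);

impl TaskCred {
    /// Step 7: refcount bump only; parent and child reference the same Cred.
    fn fork_from(parent: &TaskCred) -> TaskCred {
        TaskCred(Mutex::new(Arc::clone(&parent.0.lock().unwrap())))
    }
    /// Replace, never mutate: readers holding the old Arc are unaffected.
    fn setuid(&self, euid: u32) {
        let mut slot = self.0.lock().unwrap();
        let mut new_cred = (**slot).clone();
        new_cred.euid = euid;
        *slot = Arc::new(new_cred);
    }
    fn euid(&self) -> u32 { self.0.lock().unwrap().euid }
}
```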
7a. Initialize ALL scheduler parameters (fields inside `sched_entity: SchedEntity`).
Slab memory may contain garbage from a recycled Task; every EevdfTask field
must be explicitly set. Failure to initialize `weight` causes division-by-zero
in `calc_delta_fair()`; garbage `custom_slice` or `rel_deadline` causes wrong
placement in `place_entity()`.
// Policy and class — copied from parent.
child.sched_entity.sched_policy = parent.sched_entity.sched_policy;
child.sched_entity.nice = parent.sched_entity.nice;
child.sched_entity.sched_class = parent.sched_entity.sched_class;
// Weight — MUST be derived from nice value. Critical: if weight == 0,
// calc_delta_fair divides by zero → kernel panic.
child.sched_entity.weight = nice_to_weight(child.sched_entity.nice);
child.sched_entity.inv_weight = nice_to_inv_weight(child.sched_entity.nice);
// vruntime/deadline — zeroed; initialized by wake_up_new_task() ->
// place_entity() which positions the child relative to CFS_RQ's
// min_vruntime. See [Section 7.1](07-scheduling.md#scheduler--wakeupnewtask-forked-task-activation).
child.sched_entity.vruntime = 0;
child.sched_entity.deadline = 0;
child.sched_entity.min_vruntime = 0;
child.sched_entity.vlag = 0;
child.sched_entity.slice_ns = sysctl_sched_base_slice;
child.sched_entity.custom_slice = false;
child.sched_entity.rel_deadline = false;
// Deferred dequeue state — not deferred at fork.
child.sched_entity.sched_delayed = false;
// On-RQ state: not on any run queue yet (NEW state).
child.sched_entity.on_rq = OnRqState::Off;
// PELT load tracking — fresh zero state.
child.sched_entity.load_avg = 0;
child.sched_entity.runnable_avg = 0;
child.sched_entity.util_avg = 0;
child.sched_entity.load_sum = 0;
child.sched_entity.runnable_sum = 0;
child.sched_entity.util_sum = 0;
child.sched_entity.period_contrib = 0;
child.sched_entity.last_update_time = 0;
// Exec runtime tracking — zero at fork.
child.sched_entity.sum_exec_runtime = 0;
child.sched_entity.prev_sum_exec_runtime = 0;
// For RT tasks (SCHED_FIFO/SCHED_RR): copy rt_priority from the parent's
// RtTask sub-entity. Subject to RLIMIT_RTPRIO check.
if parent.sched_entity.sched_class == SchedClass::Rt {
child.sched_entity.rt.priority = parent.sched_entity.rt.priority;
}
// SCHED_DEADLINE tasks always fork as SCHED_NORMAL (Linux behavior —
// DL parameters are not inheritable).
if parent.sched_entity.sched_class == SchedClass::Deadline {
child.sched_entity.sched_policy = UserSchedPolicy::Normal;
child.sched_entity.sched_class = SchedClass::Fair;
child.sched_entity.weight = nice_to_weight(0); // nice 0 default
child.sched_entity.inv_weight = nice_to_inv_weight(0);
}
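A sketch of `nice_to_weight`/`nice_to_inv_weight`, using the weight table Linux ships as `sched_prio_to_weight`: each nice step changes CPU share by about 25%, with nice 0 mapped to 1024 (`NICE_0_LOAD`). The precomputed inverse lets `calc_delta_fair()` replace a division with a multiply and shift, which is exactly why a zero weight is fatal.

```rust
/// Weight table indexed by nice + 20 (nice range -20..=19).
const NICE_TO_WEIGHT: [u64; 40] = [
    88761, 71755, 56483, 46273, 36291, // nice -20..-16
    29154, 23254, 18705, 14949, 11916, // nice -15..-11
     9548,  7620,  6100,  4904,  3906, // nice -10..-6
     3121,  2501,  1991,  1586,  1277, // nice  -5..-1
     1024,   820,   655,   526,   423, // nice   0..4
      335,   272,   215,   172,   137, // nice   5..9
      110,    87,    70,    56,    45, // nice  10..14
       36,    29,    23,    18,    15, // nice  15..19
];

fn nice_to_weight(nice: i8) -> u64 {
    NICE_TO_WEIGHT[(nice as i32 + 20) as usize]
}

/// Precomputed 2^32 / weight. Never zero, because every table entry is
/// nonzero; deriving both values from nice is what rules out the
/// divide-by-zero the step above warns about.
fn nice_to_inv_weight(nice: i8) -> u64 {
    (1u64 << 32) / nice_to_weight(nice)
}
```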
7b. LSM security blob allocation (SF-091 fix: moved from step 6c to AFTER
copy_creds and sched_fork, matching Linux `copy_process()` ordering where
`security_task_alloc()` runs after `copy_creds()` and `sched_fork()`).
Call `security_task_alloc(child, clone_flags)`.
Each registered LSM provider allocates its per-task security context blob
(e.g., SELinux task_security_struct, AppArmor task_ctx). If any provider
returns ENOMEM: free Process/MmStruct, rollback steps 3-7a, return ENOMEM.
On success, assign the allocated blob to the child:
```rust
child.security = security_task_alloc(&child, clone_flags)?;
```
This is the ONLY call to `security_task_alloc` during fork. Step 2's LSM
hook is a permission CHECK only (`security_task_create`); it does NOT
allocate a blob. The blob allocation happens here, after the child's
credentials are copied (step 7) and scheduler state is initialized (step 7a).
This ordering ensures that any LSM hook that inspects `child.cred` sees
valid credential data rather than uninitialized slab garbage.
If step 2 and step 7b share a single hook entry point
in the LSM hook table, the implementation must distinguish the two
call sites (check-only vs allocate) via a parameter or separate hooks.
8. Copy/share filesystem context (task.fs is Option<Arc<RwLock<FsStruct>>>):
- CLONE_FS set: child.fs = Some(Arc::clone(parent.fs.as_ref().unwrap())).
Parent and child share cwd, root, and umask. Changing cwd in one
affects the other (thread-like sharing).
- CLONE_FS not set: child.fs = Some(Arc::new(RwLock::new(
parent.fs.as_ref().unwrap().read().clone()))).
Child gets an independent copy of cwd, root, and umask.
9. Copy/share file descriptor table:
- CLONE_FILES set: child.files = Arc::clone(&parent.files).
Parent and child share the same FdTable. Opening/closing fds in one
affects the other (thread-like sharing).
- CLONE_FILES not set: child.files = Arc::new(parent.files.dup_all()).
Child gets an independent copy of all open file descriptors.
Each File entry's refcount is incremented. The cloexec bitmap is
preserved in the copy.
`FdTable::dup_all()` allocates a new `FdTable` and copies all entries:
```rust
impl FdTable {
/// Duplicate all open file descriptors into a new independent table.
///
/// Acquires `self.inner` lock for read access during the copy.
/// Each `Arc<OpenFile>` in the source table gets an `Arc::clone()`
/// (refcount increment). The cloexec bitmap is copied verbatim.
///
/// # Allocation
/// The new FdTable's XArray is pre-sized to hold all entries from
/// the source. This is a warm-path allocation (bounded by the
/// number of open fds, typically < 1024).
pub fn dup_all(&self) -> FdTable {
let guard = self.inner.lock();
let mut new_table = FdTable::new();
let new_inner = new_table.inner.get_mut();
// Copy all fd entries: iterate source XArray, clone each Arc<OpenFile>.
// SF-096 fix: field is `fds` (per FdTableInner struct), not `entries`.
for (fd, file_ref) in guard.fds.iter() {
new_inner.fds.store(fd, Arc::clone(file_ref));
}
// Copy cloexec bitmap.
new_inner.cloexec = guard.cloexec.clone();
new_table
}
}
```
9a. Copy capability table:
- CLONE_THREAD set (thread creation): child shares parent.process.cap_table
directly (threads operate on the same Process, same CapSpace).
- CLONE_THREAD not set (fork): child.process.cap_table = cap_space_fork(&parent.process.cap_table)?
Deep copy of the handle table (independent slots), shared CapEntry
objects (Arc refcount increment per live entry). Revoking a capability
in the parent also revokes it in the child (shared CapEntry with
atomic REVOKED_FLAG). The child can independently close handles or
receive new capabilities without affecting the parent's handle table.
See [Section 9.1](09-security.md#capability-based-foundation--fork-inheritance-semantics) for
the full specification.
If cap_space_fork fails (ENOMEM):
- Free the allocated fd table (step 9).
- Free the allocated Task struct (step 6).
- Rollback steps 3-5.
- Return ENOMEM.
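The independent-slots/shared-entries semantics can be sketched in a few lines, with a single `AtomicBool` standing in for the full `CapEntry` (rights, generation counter, object reference, REVOKED_FLAG): the slot vectors are distinct, the entries behind them are the same objects.

```rust
use std::sync::Arc;
use std::sync::atomic::AtomicBool;

/// Sketch only: the real CapEntry carries rights and a revocation flag
/// checked on every capability use.
struct CapEntry { revoked: AtomicBool }

struct CapTable { slots: Vec<Option<Arc<CapEntry>>> }

/// Deep-copy the handle table, sharing the CapEntry objects:
/// one Arc refcount bump per live slot.
fn cap_space_fork(parent: &CapTable) -> CapTable {
    CapTable {
        slots: parent.slots.iter()
            .map(|s| s.as_ref().map(Arc::clone))
            .collect(),
    }
}
```

Revoking through one table is visible through the other (shared entry), while closing a handle in the child only clears the child's slot.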
10. Copy/share namespace set (per-task, like Linux `task_struct->nsproxy`):
Before processing explicit CLONE_NEW* flags, check for pending namespace
transitions (set by prior `setns()` or `unshare()` calls). The access path
requires loading from the ArcSwap nsproxy:
```rust
let parent_ns = parent.nsproxy.load(); // -> Guard<Arc<NamespaceSet>>
let pending_pid = parent_ns.pending_pid_ns.lock().clone();
let pending_time = parent_ns.pending_time_ns.lock().clone();
```
If non-None, apply these namespace transitions to the child. The
pending fields are NOT cleared on the parent — `.clone()` preserves
the pending value so ALL future children enter the same namespace
(matching Linux's `pid_ns_for_children` / `time_ns_for_children`
semantics). The pending value is only replaced by a subsequent
`setns(CLONE_NEWPID)` or `unshare(CLONE_NEWPID)` call.
This enables `nsenter` and `docker exec` workflows where `setns()`
+ `fork()` moves the child into the target namespace.
Process each CLONE_NEW* flag independently. CLONE_NEWUSER must be processed
first: it establishes the user namespace context (full capability set in new
namespace) required for creating other namespace types.
- CLONE_NEWUSER: create a new user namespace. Security preconditions:
- The calling task must NOT already be sharing filesystem-related
attributes with another task via CLONE_FS (checked in step 1).
- `CLONE_NEWUSER | CLONE_THREAD` is rejected (step 1) — threads
sharing an address space cannot have different user namespaces.
- The `/proc/sys/user/max_user_namespaces` limit is enforced (a per-user
cap on the number of user namespaces; distinct from the nesting depth
limit below).
- The parent's `user_ns.level` must be < 32 (nesting depth limit).
Credential transformation via `prepare_creds`/`commit_creds`
([Section 9.9](09-security.md#credential-model-and-capabilities--credential-lifecycle-copy-on-write-via-rcu)):
(1) `new_user_ns = UserNamespace::create(parent.cred.user_ns, parent.cred.euid)`
— the new namespace records its creator UID and parent namespace.
(2) `new_cred = prepare_creds(parent.cred)` — clone parent credentials
(RCU copy-on-write; see [Section 9.9](09-security.md#credential-model-and-capabilities)).
(3) Set `new_cred.user_ns = new_user_ns` (child operates in the new namespace).
(4) Set `new_cred.euid = 0`, `new_cred.egid = 0` — the child is UID 0
within its new user namespace (the creator is mapped to root).
(5) Grant full capability set in the new namespace:
`new_cred.cap_effective = CAP_FULL_SET`,
`new_cred.cap_permitted = CAP_FULL_SET`,
`new_cred.cap_inheritable = 0` (inheritable is empty — capabilities
do not leak across exec unless explicitly granted via file caps or
ambient caps within the new namespace).
`new_cred.cap_bset = CAP_FULL_SET` (bounding set is full in new ns).
`new_cred.cap_ambient = 0` (ambient caps cleared on namespace transition).
(6) `commit_creds(child, new_cred)` — atomically publish the new
credentials via RCU pointer swap. The `commit_creds` invariants
([Section 9.9](09-security.md#credential-model-and-capabilities--credential-lifecycle-copy-on-write-via-rcu))
are satisfied because all capability sets are self-consistent.
The child is root-equivalent within its user namespace but has **zero**
capabilities in the parent namespace. The parent's credentials are
unchanged — only the child receives the transformed credential.
**UID/GID mapping**: The new user namespace starts with empty UID/GID maps.
The child (or a process with `CAP_SETUID` in the parent namespace) must
write `/proc/[child_pid]/uid_map` and `gid_map` before the child can
interact with filesystem objects or create further namespaces. Until the
maps are written, operations requiring UID translation (file access, signal
delivery to processes in other namespaces) fail with `EOVERFLOW`.
See [Section 17.1](17-containers.md#namespace-architecture) for the full UID/GID mapping specification.
The remaining namespaces are created in the canonical order defined in
[Section 17.1](17-containers.md#namespace-architecture): MNT, UTS, IPC, PID, CGROUP, NET, TIME.
This order is required for correct rollback (reverse-order teardown).
- CLONE_NEWNS: copy the parent's mount table into a new mount namespace.
- CLONE_NEWUTS: copy hostname/domainname into a new UTS namespace.
- CLONE_NEWIPC: create a new IPC namespace (empty SysV IPC and POSIX mqueue).
- CLONE_NEWPID: create a new PID namespace (child becomes init of it).
- CLONE_NEWCGROUP: create a new cgroup namespace (child sees its cgroup as root).
- CLONE_NEWNET: create a new empty network namespace. **Dual-field update**:
both `nsproxy.net_stack` (Capability<NetStack>) and `nsproxy.net_ns`
(Arc<NetNamespace>) must be updated in lockstep to maintain the
NamespaceSet invariant (`net_ns == net_stack.cap_resolve()`). The new
NetNamespace is created first, then `net_stack` is set to a capability
wrapping it, and `net_ns` is set to `Arc::clone` of the same namespace.
See [Section 17.1](17-containers.md#namespace-architecture--namespace-implementation) for the invariant.
- CLONE_NEWTIME: create a new time namespace (CLOCK_MONOTONIC/BOOTTIME offsets).
- IMA namespace: IMA namespaces are **not** created via a dedicated `CLONE_NEWIMA`
flag. Instead, IMA namespace creation is tied to user namespace creation
(`CLONE_NEWUSER`): when a new user namespace is created, a new `ImaNamespace`
is automatically allocated alongside it with an empty local measurement log,
fresh virtual PCR bank (all zeros), and the global policy inherited read-only
from the init IMA namespace ([Section 9.5](09-security.md#runtime-integrity-measurement)). If
`CLONE_NEWUSER` is not set, the child inherits the parent's `ima_ns` via
`Arc::clone`. This follows the design of Berger's IMA namespace patch series
(IBM, v14, 2022 — not yet merged into mainline Linux). UmkaOS diverges from the
patches by using per-namespace virtual PCRs instead of vTPM instances, and by not
propagating measurements to parent namespaces (container privacy).
For all namespaces without a CLONE_NEW* flag: inherit from parent
(Arc::clone of the parent's namespace reference). All namespace references
are bundled in `NamespaceSet`; the common case (fork() with no CLONE_NEW*)
is a single Arc::clone of the parent task's `nsproxy`.
The child task's `nsproxy` field is set to this (possibly new) `NamespaceSet`
(wrapped in `ArcSwap` — see Task struct definition above).
11. Set up address space:
- CLONE_VM set AND CLONE_THREAD set (thread creation): child shares
parent.process (same Process object, same mm). Increment
`parent.process.mm.load().users.fetch_add(1, AcqRel)` — the
new thread is an additional user of this address space. No page table copying.
- CLONE_VM set WITHOUT CLONE_THREAD (vfork): child has its OWN Process
(allocated in step 6a) but shares the parent's mm. Set
`child.process.mm.store(Arc::clone(&*parent.process.mm.load()))`.
Increment `mm.users.fetch_add(1, AcqRel)`. No page table copying.
The child will replace this mm in exec_mmap().
- CLONE_VM not set (fork):
// Acquire mmap_lock for read on the parent's mm. Without this,
// concurrent mmap()/munmap() from other threads could race with
// the VMA tree walk in copy_page_tables(). Linux takes
// mmap_read_lock(oldmm) in dup_mm().
// ArcSwap::load() returns a guard extending the Arc<MmStruct> lifetime.
let parent_mm = parent.process.mm.load();
let _mmap_guard = parent_mm.mmap_lock.read();
child.process.mm.store(copy_page_tables(&*parent_mm)?);
The new MmStruct was allocated in step 6b with `mm.users = 1`.
See [Section 4.8](04-memory.md#virtual-memory-manager--cow-page-table-duplication-at-fork) for the
full page table walk specification. All writable pages are marked COW.
12. Allocate kernel stack for child.
Stack size is architecture-dependent (typically 16 KiB on x86-64, 16 KiB on
AArch64). Allocated from the kernel stack slab cache. The stack is zero-filled
to prevent information leaks from recycled slab memory.
13. Synthesize child ArchContext (ret_from_fork trampoline).
The child's ArchContext is constructed so that when the scheduler first
dispatches the child, it enters at `ret_from_fork` with:
- Return register = 0 (child sees fork() returning 0).
- Stack pointer = top of child's kernel stack.
- Instruction pointer = ret_from_fork trampoline address.
- All other general-purpose registers copied from the parent's saved context.
See ret_from_fork specification below for the per-architecture trampoline.
14. Copy/share signal handlers:
- CLONE_SIGHAND set: child.sighand = Arc::clone(&parent.sighand).
Parent and child share the signal disposition table.
- CLONE_SIGHAND not set: child.sighand = Arc::new(parent.sighand.clone()).
Child gets an independent copy of all signal dispositions.
Pending signals are NOT inherited by the child (empty pending set).
15. Resolve target cgroup for the child:
```rust
let target_cgroup = if flags.contains(CLONE_INTO_CGROUP) {
// CLONE_INTO_CGROUP early resolution was done before step 3.
// Use the pre-resolved target cgroup.
resolved_target_cgroup.clone()
} else {
Arc::clone(&*parent.cgroup.load())
};
child.cgroup = ArcSwap::new(target_cgroup.clone());
```
The child starts in the resolved cgroup. For most forks this is the
parent's cgroup; for `CLONE_INTO_CGROUP` it is the pre-resolved target.
The `cgroup_post_fork()` in step 18 receives `target_cgroup` as a
parameter and does NOT unconditionally overwrite `child.cgroup` — it
only attaches the child to cgroup controller task lists using the
already-set target.
Phase C — Linkage and activation:
16. Insert child PID into the global PID table (PID_TABLE: Idr).
The PID was reserved in step 5; this step makes it resolvable via
`find_task_by_pid()`.
17. Link child into parent's children list:
// CLONE_PARENT: child's parent is the caller's parent (grandparent),
// not the caller itself. Linux: `p->real_parent = current->real_parent`.
let effective_parent = if flags.contains(CLONE_PARENT) {
// The child becomes a sibling of the caller, not a child.
let pp = parent.process.parent.load(Acquire);
if pp != 0 { ProcessId::from_u64(pp) } else { parent.process.pid }
} else {
parent.process.pid
};
child.process.parent.store(effective_parent.as_u64(), Release);
// Acquire the effective parent's process lock and link the child.
let parent_proc = find_process_by_pid(effective_parent);
let _guard = parent_proc.lock.lock();
parent_proc.children.insert(child.process.pid.as_u64(), Arc::clone(&child.process));
// Insert child task into its own ThreadGroup.tasks list.
child.process.thread_group.tasks.push_back(&child.group_node);
18. cgroup_post_fork(child, target_cgroup): attach the child to its cgroup.
The `target_cgroup` parameter is the cgroup resolved in step 15 (parent's
cgroup by default, or the `CLONE_INTO_CGROUP` target). This function does
NOT overwrite `child.cgroup` — that was already set in step 15.
- For each cgroup subsystem state (CSS) in `target_cgroup`:
atomically add the child task to the cgroup's task list
(`css.tasks.store(child.tid, Arc::clone(&child))`).
- This must complete BEFORE the child becomes runnable (before
wake_up_new_task), ensuring the child is visible to cgroup accounting
from its first scheduled timeslice.
19. CLONE_VFORK setup (before making child runnable):
If flags.contains(CLONE_VFORK):
// Allocate CompletionEvent on the parent's kernel stack frame.
// This is a stack variable, NOT heap-allocated. It is valid for the
// duration of this do_fork() call — the parent will block below.
let vfork_completion = CompletionEvent {
guard: AtomicBool::new(true),
event: Completion::new(),
};
// Give the child a pointer to our stack-allocated completion event.
// AtomicPtr::store provides interior mutability through &Task.
child.vfork_done.store(&vfork_completion as *const _ as *mut _, Release);
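The guard-plus-completion handshake of steps 19 and 22 can be sketched in userspace, with an `Arc` standing in for the raw pointer the child holds into the parent's stack frame and a `Condvar` standing in for the kernel `Completion` (all names here are illustrative):

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::sync::atomic::{AtomicBool, Ordering};

struct CompletionEvent {
    guard: AtomicBool,   // true while the parent's frame is still valid
    done: Mutex<bool>,
    cv: Condvar,
}

impl CompletionEvent {
    fn new() -> Self {
        CompletionEvent { guard: AtomicBool::new(true), done: Mutex::new(false), cv: Condvar::new() }
    }
    /// Child side, called on exec()/_exit(). If the parent was SIGKILLed
    /// and invalidated the guard, silently skip signaling (the frame may
    /// already be unwound in the kernel version).
    fn complete_vfork(&self) {
        if !self.guard.load(Ordering::Acquire) { return; }
        *self.done.lock().unwrap() = true;
        self.cv.notify_one();
    }
    /// Parent side: block until the child signals completion.
    fn wait(&self) {
        let mut done = self.done.lock().unwrap();
        while !*done { done = self.cv.wait(done).unwrap(); }
    }
}
```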
20. CLONE_PIDFD handling:
If flags.contains(CLONE_PIDFD):
// Create a pidfd (anonymous inode backed by PidfdInode) for the child.
let pidfd = pidfd_create(child.tid, 0)?; // O_CLOEXEC by default
// Write the pidfd number to the parent's address space.
put_user::<i32>(pidfd, clone_args.pidfd as *mut i32)?;
21. **`wake_up_new_task(child)`** — make the child runnable.
Delegates to the scheduler's canonical `wake_up_new_task()` function.
See [Section 7.1](07-scheduling.md#scheduler--wakeupnewtask-forked-task-activation) for the
full definition including CPU selection, vruntime initialization,
`activate_task()` with `EnqueueFlags::ENQUEUE_INITIAL`, and
preemption check.
```rust
// Single call — do NOT inline-expand the scheduler internals here.
// The scheduler spec defines the full sequence: select_task_rq(),
// init_new_task_vruntime(), activate_task(), check_preempt_curr().
wake_up_new_task(child);
```
The scheduler may preempt the parent if the child has higher priority.
22. CLONE_VFORK parent blocking:
If flags.contains(CLONE_VFORK):
// Block the parent until the child calls exec() or _exit().
// The parent enters TASK_KILLABLE — only SIGKILL can interrupt.
match wait_for_completion_killable(&vfork_completion.event) {
    Ok(()) => {
        // Child signaled completion (exec or _exit). Resume normally.
    }
    Err(EINTR) => {
        // Parent received SIGKILL while waiting.
        // Mark the guard as invalid so the child does not dereference
        // the CompletionEvent after our stack frame is unwound.
        vfork_completion.guard.store(false, Release);
        // Proceed to do_exit() — the signal delivery path handles this.
        // The child will see guard=false and silently skip the signal.
    }
}
23. Return Ok(child).
The parent returns from do_fork() with the child's PID (via the return
register). The child's first execution begins at ret_from_fork (step 13).
Lock acquisition order: PID_TABLE lock → parent.Process::lock. The child's Process::lock is never contended during fork because no other thread has a reference to the child yet.
Error handling in Phase B: If any allocation in steps 6-14 fails, all
previously allocated resources are freed in reverse order. PID(s) are released,
cgroup counters are rolled back (via cgroup_cancel_fork()), and
user_entry.task_count is decremented. The partial child Task struct (if
allocated) is dropped, which cascades Arc::drop on any already-cloned
resources (credentials, fd table, etc.).
Rollback invariant: If the fork fails at step N, steps 1 through N-1 are rolled back in reverse order. No resource counters are left incremented for a task that was never created. This prevents counter drift over 50-year uptime.
Step-specific rollback table:
| Failure at step | Resources to undo (reverse order) |
|---|---|
| 3 (cgroup_can_fork) | Decrement pids.current in all already-incremented ancestors |
| 4 (RLIMIT_NPROC) | cgroup_cancel_fork() (step 3) |
| 5 (PID alloc) | Free PIDs at completed levels, decrement user_entry.task_count (step 4), cgroup_cancel_fork() (step 3) |
| 6 (Task alloc) | Free PID(s) (step 5), undo steps 3-4 |
| 7 (cred copy) | Free Process/MmStruct (step 6a/6b), free PID(s), undo steps 3-5. Note: step 7 is Arc::clone (infallible), so failure here is not possible in practice. |
| 7a (sched init) | Infallible — initializes fields on already-allocated slab memory. No rollback needed. |
| 7b (LSM blob alloc) | Free sched entity state (step 7a — trivial), drop cred Arc (step 7), free Process/MmStruct (step 6a/6b), free PID(s), undo steps 3-5 |
| 8 (fs copy) | If CLONE_FS not set and Arc::new(RwLock::new(...)) fails (ENOMEM): drop Task (step 6 — cascades cred Arc), free PID(s), undo steps 3-5 |
| 9a (cap_space_fork) | Free fd table (step 9), free fs (step 8 if not shared), drop Task (step 6 — cascades Arc::drop on cred, fs), free PID(s), undo steps 3-5 |
| 10 (namespace copy) | Drop cap_space (step 9a), free fd table, drop Task, free PID(s), undo steps 3-5 |
| 11 (mm copy/COW) | Destroy new MmStruct + page tables, undo steps 6-10, free PID(s), undo steps 3-5 |
| 12 (kernel stack) | Free MmStruct (step 11), undo steps 6-10, free PID(s), undo steps 3-5 |
| 13 (ArchContext) | Infallible — register state synthesis on the already-allocated kernel stack (step 12). No allocation, no failure mode. No rollback needed. |
| 14 (signal handlers) | Free kernel stack (step 12), undo steps 6-11, free PID(s), undo steps 3-5 |
| 15 (cgroup) | cgroup_cancel_fork() — decrement cgroup task counts, undo steps 6-14, free PID(s), undo steps 3-5 |
| 16 (PID_TABLE insert) | Remove child PID from PID_TABLE, undo step 15, undo steps 6-14, free PID(s), undo steps 3-5 |
| 17 (children link) | Remove child from parent's children XArray, undo step 16, undo steps 6-15, free PID(s), undo steps 3-5 |
| 20 (CLONE_PIDFD) | Close pidfd, undo steps 6-14 (in reverse order), free PID(s), undo steps 3-5 |
Each undo function is idempotent — safe to call even if the corresponding
resource was not fully initialized. The Task struct's Drop impl cascades
Arc::drop on all Arc<>-wrapped fields (credentials, fs, files, sighand)
that were already cloned, so explicit teardown of those is not needed when
dropping the partially-initialized Task.
Cgroup inheritance semantics: The child inherits the parent's cgroup
membership in all controllers (cpu, memory, io, pids, cpuset, perf_event, etc.).
The child starts in the same cgroup as the parent; it can only be migrated to a
different cgroup after creation by writing to cgroup.procs. This matches Linux
cgroup v2 semantics exactly.
Extended clone3() Flags¶
The following clone3()-only flags are supported (not available via legacy clone()):
CLONE_CLEAR_SIGHAND (Linux 5.5+, clone3 flag bit 0x1_0000_0000 (bit 32)):
Clears all non-SIG_DFL signal handlers in the child. After fork, the child's
SignalHandlers table has every entry reset to SIG_DFL. This avoids the overhead
of a separate sigaction(SIG_DFL) loop in the child for use cases like container
init processes that must not inherit the parent's handlers. Implemented in step 14
of do_fork(): if CLONE_CLEAR_SIGHAND is set and CLONE_SIGHAND is not set (they
are mutually exclusive — CLONE_SIGHAND | CLONE_CLEAR_SIGHAND returns EINVAL),
the child's independent SignalHandlers copy is reset:
for sig in 1..=SIGRTMAX:
if child.sighand.handlers[sig].sa_handler != SIG_IGN:
child.sighand.handlers[sig] = SigAction::default() // SIG_DFL
Note: SIG_IGN dispositions are preserved (POSIX requires inherited SIG_IGN to
persist across fork and exec).
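The handler-reset rule is small enough to model directly: every non-SIG_IGN entry reverts to SIG_DFL, and SIG_IGN survives. A sketch with an illustrative Disposition enum standing in for the real SigAction table:

```rust
/// Illustrative stand-in for a signal disposition entry.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Disposition {
    Default, // SIG_DFL
    Ignore,  // SIG_IGN
    Handler, // user-installed sa_handler
}

/// CLONE_CLEAR_SIGHAND semantics: every non-SIG_IGN entry reverts to
/// SIG_DFL; inherited SIG_IGN dispositions are preserved (POSIX).
fn clear_sighand(table: &mut [Disposition]) {
    for d in table.iter_mut() {
        if *d != Disposition::Ignore {
            *d = Disposition::Default;
        }
    }
}
```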
CLONE_INTO_CGROUP (Linux 5.7+, flag bit 0x2_0000_0000 (bit 33), clone3 field cgroup):
Places the child directly into the specified cgroup instead of inheriting the
parent's cgroup membership. The clone_args.cgroup field is a file descriptor to
the target cgroup directory (opened via open("/sys/fs/cgroup/target", O_RDONLY)).
Early resolution (before step 3): The target cgroup MUST be resolved BEFORE
cgroup_can_fork() (step 3), because pids_can_fork() increments pids.current
walking up the cgroup hierarchy. If CLONE_INTO_CGROUP specifies a different cgroup
than the parent's, incrementing the wrong hierarchy gives incorrect pid counts. This
matches Linux, where cgroup_can_fork() receives the resolved target cgroup.
// Before Phase A step 3 — early CLONE_INTO_CGROUP resolution:
let target_cgrp = if flags.contains(CLONE_INTO_CGROUP) {
let cgrp = cgroup_from_fd(clone_args.cgroup)?; // EBADF if invalid fd
// Permission check: CGRP_WRITE on the target cgroup
cgroup_can_fork_into(cgrp)?; // EACCES if denied
Some(cgrp)
} else {
None
};
// Pass target_cgrp to pids_can_fork() in step 3 (use it instead of parent's cgroup).
// In step 15, use the resolved target:
if let Some(target) = target_cgrp {
// cgroup_from_fd() returns Arc<Cgroup>; use directly (no Arc::new wrapping).
child.cgroup = ArcSwap::from(target);
} else {
child.cgroup = ArcSwap::from(Arc::clone(&*parent.cgroup.load()));
}
Security: requires CGRP_WRITE permission on the target cgroup (checked via
the cgroup file descriptor's open-time access control). This avoids the
fork() + write(cgroup.procs) race where the child runs in the wrong cgroup
for a brief window. The early resolution also ensures pids.current accounting
is applied to the correct cgroup hierarchy from the start.
PR_SET_VMA — Anonymous VMA Naming¶
prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, addr, len, name) sets a human-readable
name on anonymous VMAs for display in /proc/PID/maps and /proc/PID/smaps.
This is used by Android's libc and memory profiling tools (e.g., heapprofd) to
label anonymous memory regions.
/// Maximum length of an anonymous VMA name (excluding null terminator).
/// Matches Linux kernel's `ANON_VMA_NAME_MAX_LEN` (80 bytes).
pub const ANON_VMA_NAME_MAX_LEN: usize = 80;
/// prctl(PR_SET_VMA) subcommands.
pub const PR_SET_VMA_ANON_NAME: u64 = 0;
Implementation:
prctl_set_vma(PR_SET_VMA_ANON_NAME, addr, len, name_ptr):
1. Validate [addr, addr+len) is page-aligned and falls within the task's VMA tree.
2. Copy name string from userspace (at most ANON_VMA_NAME_MAX_LEN + 1 bytes).
If name_ptr is NULL: clear the name on affected VMAs.
3. Validate: name must contain only printable ASCII (0x20-0x7e), no spaces.
If invalid: return EINVAL.
4. For each VMA overlapping [addr, addr+len):
- If VMA is not anonymous (file-backed, shared, special): skip (not an error).
- Split VMA at addr and addr+len boundaries if needed (same as mprotect splitting).
- Set vma.anon_name = Some(Arc<str>) (shared across split VMAs).
5. The name appears in /proc/PID/maps as: "[anon:<name>]" in the pathname column.
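A sketch of the step-3 name validation, assuming the check runs over raw bytes: names up to ANON_VMA_NAME_MAX_LEN bytes of printable ASCII, with 0x20 (space) excluded per the rule above:

```rust
/// Maximum length of an anonymous VMA name (excluding null terminator).
pub const ANON_VMA_NAME_MAX_LEN: usize = 80;

/// Validate a PR_SET_VMA_ANON_NAME string per step 3: at most 80 bytes,
/// printable ASCII only. The printable range is 0x20-0x7e, but spaces
/// (0x20) are rejected, so the accepted range is 0x21-0x7e.
fn anon_vma_name_valid(name: &[u8]) -> bool {
    name.len() <= ANON_VMA_NAME_MAX_LEN
        && name.iter().all(|&b| (0x21..=0x7e).contains(&b))
}
```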
8.1.2.2 ret_from_fork: Child Task Entry Trampoline¶
When the scheduler first dispatches a newly-forked child task, the child does not
return through a normal syscall return path (since it never made a syscall). Instead,
the child's ArchContext is synthesized by do_fork() (step 13 above) so that
context_switch() transfers control to the ret_from_fork trampoline.
ArchContext synthesis at fork time:
The parent's do_fork() constructs the child's ArchContext with the following
register state:
Child ArchContext:
instruction pointer = address of ret_from_fork (architecture-specific trampoline)
stack pointer = top of child's kernel stack (allocated in step 12)
return register = 0 (rax on x86-64, x0 on AArch64, r0 on ARMv7, a0 on RISC-V, r3 on PPC32/PPC64LE, r2 on s390x, $a0 ($r4) on LoongArch64)
kthread_fn pointer = NULL for userspace forks, function pointer for kthread_create
all other GPRs = copied from parent's saved ArchContext at syscall entry
FPU/SIMD state = copied from parent (lazy restore: marked dirty so first
FPU use triggers a restore from the saved state)
ret_from_fork execution sequence:
ret_from_fork():
1. Call schedule_tail():
- Complete scheduler bookkeeping for the new task (update prev task's state,
release the runqueue lock that context_switch() left held, enable preemption).
- This is the same function called by context_switch() for normal task switches;
for newly-forked tasks it is called explicitly because the child never went
through the normal switch_to() epilogue.
2. Check kthread_fn:
- If kthread_fn is non-NULL (kernel thread):
a. Call kthread_fn(kthread_arg).
b. On return: call do_exit(0). The kernel thread terminates.
c. (Never reaches step 3.)
- If kthread_fn is NULL (userspace fork): continue to step 3.
3. Restore userspace registers:
- Load the saved user-mode register state from the child's ArchContext
(which was copied from the parent's syscall-entry save area).
- The return register contains 0 (child's fork() return value).
4. Return to userspace:
- Execute the architecture-specific return-to-user instruction.
- User code resumes at the instruction after the fork()/clone() syscall,
with the return value 0 indicating "I am the child."
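The step-2 branch reduces to a null check on kthread_fn; a minimal model of the dispatch, with Option standing in for the nullable function pointer and illustrative type names:

```rust
/// Illustrative model of the ret_from_fork step-2 decision: a non-null
/// kthread_fn means "run the kernel-thread body, then exit"; a null one
/// means "fall through to the userspace return path with retval 0".
enum ForkPath {
    /// Kernel thread ran to completion; payload is its exit code
    /// (which would feed do_exit() in the real trampoline).
    KernelThread(i32),
    /// Userspace fork: return to user mode with this return-register value.
    UserReturn { retval: u64 },
}

fn dispatch(kthread_fn: Option<fn(usize) -> i32>, arg: usize) -> ForkPath {
    match kthread_fn {
        // Kernel thread: call the body; its return value feeds do_exit().
        Some(f) => ForkPath::KernelThread(f(arg)),
        // Userspace fork: return register already holds 0 ("I am the child").
        None => ForkPath::UserReturn { retval: 0 },
    }
}
```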
Per-architecture trampoline details:
| Architecture | Return-to-user instruction | Trampoline location | Notes |
|---|---|---|---|
| x86-64 | sysretq (fast path) or iretq (if signal pending, ptrace, or nonzero IOPL) | ret_from_fork in entry.asm | sysretq requires RCX=RIP, R11=RFLAGS; iretq used as fallback |
| AArch64 | eret | ret_from_fork in entry.S | Restores SPSR_EL1 and ELR_EL1 from saved context |
| ARMv7 | movs pc, lr (or rfe sp! from exception frame) | ret_from_fork in entry.S | Restores CPSR from SPSR via movs |
| RISC-V 64 | sret | ret_from_fork in entry.S | Sets sstatus.SPP=0 (return to U-mode), sepc = user PC |
| PPC32 | rfi | ret_from_fork in entry.S | Restores MSR and SRR0/SRR1 |
| PPC64LE | rfid | ret_from_fork in entry.S | Restores MSR and SRR0/SRR1; uses rfid (hypervisor-aware) |
| s390x | lpswe (Load PSW Extended) | ret_from_fork in entry.S | Loads a new PSW from the lowcore area (address 0x1C0), which atomically sets the instruction address and the problem-state bit (PSW bit 15 = 1 for user mode). The saved PSW is constructed at fork time with the child's user-mode instruction pointer and condition code. After kernel_thread_fn returns (for kthreads), the trampoline loads the user PSW via lpswe to enter user mode. LPSWE is atomic with respect to interrupts — no window exists where the CPU is in kernel mode with a user-mode instruction pointer. |
| LoongArch64 | ertn (Exception Return) | ret_from_fork in entry.S | Restores CSR.ERA (Exception Return Address) with the child's user-mode PC, and CSR.PRMD (Previous Mode) with PLV=3 (user mode) and interrupt-enable bits from the saved context. The ertn instruction atomically transitions to user mode by loading PC ← CSR.ERA and CSR.CRMD ← CSR.PRMD. After kernel_thread_fn returns (for kthreads), the trampoline writes the user PC to CSR.ERA, sets CSR.PRMD.PLV=3, and executes ertn. |
s390x ret_from_fork detailed sequence:
s390x ret_from_fork:
1. Restore callee-saved registers from kernel stack (r6-r15, floating-point
control register FPCR, and vector registers v8-v15 if VX facility is active).
The stack frame layout follows the s390x ELF ABI: r6-r13 at fixed offsets
from the stack frame backchain pointer, r14 (return address) and r15 (stack
pointer) at the frame header.
2. If kernel thread: call kernel_thread_fn(arg). The function pointer is stored
in r14 (link register, set by do_fork) and the argument in r2 (first argument
register per s390x calling convention). r15 remains the stack pointer — it is
NOT overloaded for the argument. On return from kernel_thread_fn: call
do_exit(r2) where r2 contains the thread's return value. (Never reaches step 3.)
3. Clear return value: set r2 = 0 (child gets pid=0 from fork/clone).
4. Construct user-mode PSW on the kernel stack:
- PSW bits [0:7] = system mask (I/O, external, machine-check interrupts enabled)
- PSW bit 12 = 1 (Extended Addressing, 64-bit mode)
- PSW bit 15 = 1 (Problem State = user mode)
- PSW bits [64:127] = user-mode instruction address (from saved context; the 16-byte PSW loaded by LPSWE carries the address in its second doubleword)
5. LPSWE (Load PSW Extended): atomically loads the constructed PSW from the
stack save area. This simultaneously sets the instruction address, enables
problem state (user mode), and restores the condition code and program mask.
No instruction boundary exists between kernel mode and user mode — the
LPSWE is the last kernel-mode instruction executed.
6. On user return: execution resumes at the user-mode instruction address
stored in the PSW. The SVC (supervisor call) link area on the user stack
contains the return address from the original clone/fork syscall. General
registers r0-r5 are restored from the saved pt_regs on the kernel stack
before LPSWE; r2=0 is the child's fork return value.
LoongArch64 ret_from_fork detailed sequence:
LoongArch64 ret_from_fork:
1. Restore callee-saved registers from kernel stack: s0-s8 ($r23-$r31),
fp ($r22, frame pointer), ra ($r1, return address). Floating-point
callee-saved registers f24-f31 are restored if the FPU was active
(determined by CSR.EUEN.FPE bit in the saved context). The stack frame
layout follows the LoongArch ELF psABI: callee-saved GPRs at ascending
offsets from $sp, with ra stored at the base of the frame.
2. If kernel thread: call fn(arg) where fn is stored in s1 ($r24) and arg
in s0 ($r23), both set by do_fork during ArchContext synthesis. On return:
move the return value to a0 ($r4), call do_exit(a0). (Never reaches step 3.)
3. Clear return value: set a0 ($r4) = 0 (child gets pid=0 from fork/clone).
4. Restore CSR state for user return:
- Write the child's user-mode PC to CSR.ERA (Exception Return Address,
CSR index 0x6). This is the instruction after the clone/fork syscall
in userspace.
- Write the saved previous-mode state to CSR.PRMD (Previous Mode, CSR
index 0x1):
* PRMD.PLV = 3 (Privilege Level 3 = user mode)
* PRMD.PIE = 1 (Previous Interrupt Enable — restores global interrupt
enable on return)
* PRMD.PWE = saved value (Previous Watch Enable)
- Restore general registers a0-a7, t0-t8 from the pt_regs save area on
the kernel stack. a0 ($r4) = 0 overwrites the saved parent return value.
5. ERTN (Exception Return): atomically transitions to user mode by performing:
- PC <- CSR.ERA (branches to user-mode return address)
- CSR.CRMD <- CSR.PRMD (restores PLV=3, IE=PIE, WE=PWE)
No instruction window exists between kernel and user mode — ERTN is atomic
with respect to the privilege level transition. Interrupts are re-enabled
by the CRMD.IE <- PRMD.PIE restoration.
6. On user return: execution resumes at the user-mode instruction address
(CSR.ERA). PLV=3 (user mode) is active, IE is restored. a0=0 is the
child's fork return value visible to userspace.
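The PRMD image written in step 4 packs PLV, PIE, and PWE into the low bits (PLV in bits [1:0], PIE at bit 2, PWE at bit 3, per the LoongArch privileged architecture); a sketch of the bit assembly:

```rust
/// Build the CSR.PRMD image written before ERTN: privilege level in
/// bits [1:0], PIE (Previous Interrupt Enable) at bit 2, PWE (Previous
/// Watch Enable) at bit 3. Field positions follow the LoongArch
/// privileged-architecture layout; this is an illustrative helper,
/// not the UmkaOS implementation.
fn make_prmd(plv: u64, pie: bool, pwe: bool) -> u64 {
    (plv & 0x3) | ((pie as u64) << 2) | ((pwe as u64) << 3)
}
```

For the user-mode return path described above, the kernel writes make_prmd(3, true, saved_pwe), so that ERTN restores PLV=3 with interrupts enabled.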
CLONE_CHILD_SETTID on first schedule: After schedule_tail(), if
child.set_child_tid is non-null, the kernel writes the child's TID to that
userspace address: put_user::<u32>(child.tid.as_u32(), child.set_child_tid).
This runs in the child's execution context (the child's address space is active).
For fork() (separate address spaces), the write MUST happen in the child's
context because the target address is in the child's address space, not the
parent's. For CLONE_VM (threads), the address space is shared, so either
context would work. Linux performs the write in the child context for both
cases — UmkaOS does the same for consistency with Linux and to handle both
fork() and clone(CLONE_VM) uniformly.
Signal delivery on first return: After schedule_tail() and before returning
to userspace, ret_from_fork checks for pending signals (just like the normal
syscall exit path). If a signal is pending (e.g., the parent sent a signal to
the child between fork and first schedule), the signal delivery path is taken
instead of direct userspace return.
8.1.3 Program Execution (exec)¶
execve() replaces the current task's address space with a new program image. The
previous mappings and signal handler dispositions are discarded (handlers reset to
SIG_DFL; SIG_IGN dispositions are preserved). Pending signals are retained and
delivered to the new program image after exec completes (POSIX requirement: IEEE Std
1003.1-2017, "exec" family). The file descriptor table is preserved (minus CLOEXEC
descriptors). The detailed ELF
parsing, segment loading, auxiliary vector, stack layout, and ASLR algorithms are
specified in Section 8.3.
execveat() (syscall 322 on x86-64): Extends execve() with dirfd and flags.
When flags includes AT_EMPTY_PATH and pathname is empty, the kernel executes the
file referred to by dirfd directly (the fd must have been opened with O_PATH or
with execute permission). This is the preferred method for fexecve(3) on Linux 3.19+.
When AT_EMPTY_PATH is not set, pathname is resolved relative to dirfd (same as
openat()). AT_SYMLINK_NOFOLLOW in flags prevents following a final symlink
component.
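The dirfd/pathname/AT_EMPTY_PATH combinations reduce to a three-way resolution choice; a hedged sketch of that decision logic (type and function names are illustrative; the flag value matches Linux's AT_EMPTY_PATH):

```rust
/// Which file execveat() ends up executing, before any VFS lookup.
#[derive(Debug, PartialEq)]
enum ExecTarget<'a> {
    /// AT_EMPTY_PATH + empty pathname: execute the file behind dirfd itself.
    FdDirect { dirfd: i32 },
    /// Absolute pathname: dirfd is ignored.
    Absolute(&'a str),
    /// Relative pathname: resolved relative to dirfd (AT_FDCWD = cwd).
    RelativeTo { dirfd: i32, path: &'a str },
}

const AT_EMPTY_PATH: u32 = 0x1000; // matches the Linux flag value

fn resolve_exec_target(dirfd: i32, pathname: &str, flags: u32) -> Result<ExecTarget<'_>, i32> {
    const ENOENT: i32 = 2;
    if pathname.is_empty() {
        // An empty pathname is only legal with AT_EMPTY_PATH.
        if flags & AT_EMPTY_PATH != 0 {
            Ok(ExecTarget::FdDirect { dirfd })
        } else {
            Err(-ENOENT)
        }
    } else if pathname.starts_with('/') {
        Ok(ExecTarget::Absolute(pathname))
    } else {
        Ok(ExecTarget::RelativeTo { dirfd, path: pathname })
    }
}
```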
exec() sequence (do_execveat_common):
1a. Allocate BinPrm and new MmStruct. Before any file operations,
allocate the BinPrm struct and its embedded bprm.mm = Some(Arc::new(MmStruct::new())).
If mm allocation fails, return -ENOMEM with no side effects. The new mm is needed
before the PNR so that segment validation can check addresses against the new mm's
TASK_SIZE. In Linux, bprm_mm_init() is called inside alloc_bprm().
1b. Open executable: Resolve path (honoring AT_FDCWD / dirfd / AT_EMPTY_PATH),
open file, check MAY_EXEC permission against the file's inode. For AT_EMPTY_PATH:
verify the fd refers to a regular file with execute permission.
1c. Prepare credentials. Clone the current task's credentials into bprm.cred.
Apply the capability transformation from execve_transform_caps() based on file
capabilities (security.capability xattr), setuid/setgid bits, bounding set, and
ambient capabilities (see Section 9.9).
If credential preparation fails, return error before the PNR.
2. LSM hook chain: lsm_call_task_security(Exec, bprm) — iterates all registered LSM modules in priority order (Section 9.8). This is a single call that invokes all active LSMs:
   - IMA (priority 11): security_bprm_check measures the binary hash against the IMA policy and, if appraisal is enabled, verifies the signed hash. If the binary has been tampered with or is not in the IMA allow-list, exec fails with -EACCES before any process state is modified.
   - SELinux/AppArmor/SMACK (priority 21): Domain transition and policy checks on the binary.
   - Landlock (priority 30): Sandboxing restrictions.
   If any module returns Err(SecurityDenial), exec fails with -EACCES.
3. Load binary format: Try registered binfmt handlers in priority order:
   - ELF: Parse the ELF header, validate e_machine matches the running architecture, extract PT_LOAD segments and the PT_INTERP path.
   - Script (#!): Extract the interpreter path and arguments, recursively invoke do_execveat_common on the interpreter with the script as argv[1].
   - Miscellaneous (binfmt_misc): Match magic bytes or file extension against registered handlers.
   If no handler accepts the binary, return -ENOEXEC.
4. --- POINT OF NO RETURN (flush_old_exec / begin_new_exec) ---
   After this point, the old process image is destroyed. Failure beyond this point means the task must be killed with SIGKILL — there is no old image to return to. Specific steps at the point of no return:
   a. De-thread (de_thread()): If the calling process is multi-threaded, this MUST happen BEFORE exec_mmap. Sibling threads are still running and using the old address space; if exec_mmap tears down the old mm first, sibling threads fault on every memory access (page tables freed under them). In Linux fs/exec.c, de_thread() runs before exec_mmap() in begin_new_exec() for exactly this reason.
      - Send SIGKILL to all other threads in the thread group (including the thread group leader if the exec-ing thread is not the leader).
      - Wait for all sibling threads to exit (they observe SIGKILL at their next kernel exit point and enter do_exit()). Each exiting sibling decrements old_mm.users via exit_mm().
      - PID swap (if the exec-ing thread is NOT the thread group leader): The exec-ing thread must become the new thread group leader with the leader's PID. After the original leader exits:
        - Remove the original leader's PID from PID_TABLE.
        - Remove the exec-ing thread's old PID from PID_TABLE.
        - Set task.tid = leader_pid (take the leader's PID).
        - Insert the exec-ing thread into PID_TABLE under the leader's PID.
        - Update process.thread_group.leader to point to the exec-ing thread.
        - Update /proc entries (the exec-ing thread is now visible as the TGID).
        - Any active pidfd that referenced the original leader now references the exec-ing thread (the Arc<Process> remains the same — pidfds hold a Process reference, and the Process outlives the PID swap).
      - The exec-ing thread is now the sole thread and thread group leader. At this point, all sibling threads have completed do_exit(), which decrements thread_group.count (see Section 8.2) and removes them from thread_group.tasks (via release_task()). The count is now 1; thread_group.tasks contains only the exec-ing thread (now the leader).
   b. Flush old address space (exec_mmap(bprm.mm)): This is the canonical definition of exec_mmap(). The elf-loader.md version is removed; see [Section 8.3](#elf-loader) for a cross-reference.
```rust
/// Canonical exec_mmap — called at the point of no return.
/// See [Section 8.1](#process-and-task-management--program-execution-exec) step 4b.
fn exec_mmap(task: &Task, bprm: &mut BinPrm) {
    // 1. Signal vfork completion BEFORE swapping the mm.
    //    Matching Linux's exec_mm_release() called at the top of
    //    exec_mmap(). The vfork_done pointer may reference the
    //    parent's kernel stack — must complete before mm swap.
    //    AtomicPtr::swap(null, AcqRel) atomically loads and clears.
    let done = task.vfork_done.swap(core::ptr::null_mut(), AcqRel);
    if !done.is_null() {
        // SAFETY: done was set in do_fork() step 19 to point to the
        // parent's stack-allocated CompletionEvent. The guard check
        // verifies the parent is still alive before dereferencing.
        if unsafe { (*done).guard.load(Acquire) } {
            unsafe { (*done).event.complete(); }
        }
    }
    // 2. Futex cleanup for old mm: walk the robust futex list, write 0
    //    to clear_child_tid and wake the futex waiter (the put_user +
    //    wake is part of futex_exec_release, matching Linux's
    //    mm_release()). This MUST happen BEFORE clearing the tid
    //    pointers (step 3), because the futex wake on clear_child_tid
    //    unblocks pthread_join.
    futex_exec_release(task);
    // 3. Clear set_child_tid and clear_child_tid — these are userspace
    //    addresses in the old mm. After the mm swap, they become
    //    dangling. Cleared AFTER futex_exec_release so the wake on
    //    clear_child_tid in step 2 can deliver the futex notification.
    //    AtomicPtr::store provides interior mutability through &Task.
    task.set_child_tid.store(core::ptr::null_mut(), Release);
    task.clear_child_tid.store(core::ptr::null_mut(), Release);
    // 4. Clear current_vfs_ring — old VFS ring associations are invalid
    //    after the mm swap (the new binary has no active VFS operations).
    task.current_vfs_ring.store(core::ptr::null_mut(), Release);
    // 5. Swap mm: take the new mm from bprm, replace the current mm.
    //    Invariant: bprm.mm is always Some after step 1a (allocate
    //    BinPrm); no path between step 1a and the PNR consumes it.
    //    After .take(), BinPrm::Drop will not double-free the new mm.
    let new_mm = bprm.mm.take().unwrap();
    // ArcSwap::swap atomically replaces and returns the old Arc<MmStruct>.
    let old_mm = task.process.mm.swap(new_mm);
    // 6. Switch page tables to the new mm.
    //    Load the new mm via ArcSwap for the switch_mm call.
    let new_mm_ref = task.process.mm.load();
    arch::current::mm::switch_mm(&old_mm, &*new_mm_ref);
    // 7. Notify MmuNotifier subscribers on the old mm (KVM, IOMMU).
    mmu_notifier_release(&old_mm);
    // 8. Release the old mm. Decrements mm.users; if 0, exit_mmap()
    //    tears down VMAs, frees page tables, and releases physical
    //    pages.
    //
    //    mmap_lock SAFETY: We do NOT acquire mmap_lock on the old mm
    //    before the swap. This is safe because de_thread (step 4a) has
    //    waited for all sibling threads to exit. Combined with
    //    PF_EXITING blocking new ptrace attach and /proc/[pid]/maps
    //    returning ESRCH for exiting tasks, no external observer can
    //    be walking the old mm's VMA tree at this point. Linux takes
    //    the same approach: exec_mmap() in fs/exec.c does not acquire
    //    mmap_lock on the old mm.
    mmput(old_mm);
}
```

After exec_mmap returns, mark the point of no return in BinPrm by setting bprm.point_of_no_return = true. This flag is checked by BinPrm::Drop — if set, Drop does NOT free bprm.mm (already consumed by exec_mmap). Without the flag, BinPrm::Drop would use mm.is_some() as the PNR indicator, which is correct, but point_of_no_return makes the state explicit for debugging and error-path validation (assertions can verify the two indicators agree). After this step, returning to userspace with the old image is impossible.
c. Unshare and clear signal handlers (unshare_sighand + handler reset):
The Process::sighand field (Arc<SignalHandlers>) may still be shared
with dying sibling threads (de_thread killed them but they may not have
dropped their Arc reference yet). Mutating the shared table in-place would
race with those dying threads.
```rust
// If sighand refcount > 1 (shared), allocate a fresh SignalHandlers,
// copy entries, replace the Arc. This matches Linux's unshare_sighand()
// in begin_new_exec().
if Arc::strong_count(&task.process.sighand) > 1 {
// SAFETY: We hold no lock on the old sighand. The action array is
// read-only here (copying). UnsafeCell access is safe because we're
// the only thread accessing this process post-de_thread().
let old_action = unsafe { &*task.process.sighand.action.get() };
let mut new_sighand = SignalHandlers {
action: UnsafeCell::new(old_action.clone()),
lock: SpinLock::new(()),
queue_lock: SpinLock::new(()),
handler_cache: core::array::from_fn(|i|
AtomicU8::new(task.process.sighand.handler_cache[i].load(Relaxed))),
};
// NOTE: At this point, task.process.sighand is behind Arc<Process>
// (shared reference). In implementation, Process.sighand should use
// ArcSwap<SignalHandlers> or the replacement should be done through
// the Process::lock. Since we are post-de_thread (sole thread), the
// Arc is exclusively owned and the replacement is safe.
task.process.sighand = Arc::new(new_sighand);
}
// NOTE: The Arc::strong_count check above is a benign race — no new
// references can be created because de_thread has killed all siblings,
// and clone/fork of this process is impossible (single-threaded post
// de_thread). An unnecessary copy is harmless (extra allocation).
// Now the task is the sole owner. Reset all non-SIG_IGN handlers.
let _guard = task.process.sighand.lock.lock();
// SAFETY: SIGHAND_LOCK held. UnsafeCell access serialized by lock.
let action = unsafe { &mut *task.process.sighand.action.get() };
// Loop 0..SIGMAX: the action array is 0-indexed (signal N stored at
// action[N-1]). SIGMAX is the number of signals (64), so this iterates
// all 64 entries. The handler_cache array uses the same 0-based indexing.
// See [Section 8.5](#signal-handling--signal-data-structures) for SignalHandlers layout.
for sig in 0..SIGMAX {
if action[sig].handler != SigHandler::Ignore {
action[sig] = SigAction::default(); // SIG_DFL
task.process.sighand.handler_cache[sig].store(0, Relaxed); // Default
}
}
```
d. Close CLOEXEC file descriptors: Under FdTable.inner (SpinLock),
iterate the cloexec bitmap, remove each flagged fd from the XArray, and
collect the Arc<OpenFile> references into a temporary list. Release the
lock. Then, outside the lock, call file_close() on each collected
OpenFile. The close may block (NFS, FUSE, CIFS flush/release operations);
holding the SpinLock across blocking close would deadlock. This two-phase
approach matches Linux's do_close_on_exec() which nulls fd slots under
files->file_lock but defers filp_close() to after lock release.
Invariant: Steps 4a–4d and step 5 execute within a single kernel function
call chain (begin_new_exec / do_execveat_common). No return-to-userspace
occurs between these steps. Signal delivery only happens on return-to-userspace,
so no signal can be delivered between handler reset (step 4c) and credential
commit (step 5a). An implementing agent MUST NOT insert a scheduling point
between steps 4 and 5.
Single-threaded fast-path: de_thread() (step 4a) should check
thread_group.count.load(Acquire) == 1 and return immediately when the
process is already single-threaded, avoiding unnecessary iteration over
an empty thread group and a wait on a waitqueue that nobody signals.
5. setup_new_exec: Apply the new execution context. This step executes AFTER the point of no return (step 4) and BEFORE segment mapping (step 6). At this point, ELF headers have been validated and the old mm has been replaced, but the new binary's segments have not yet been mapped into the new address space. The credential commit is irrevocable — a failure after this point cannot roll back to the old credentials.
a. Commit credentials: install bprm.cred as the task's active
credential set via commit_creds() (RCU pointer swap —
Section 9.9).
If the binary has file capabilities (security.capability xattr) or
setuid/setgid bits, the new credential set reflects the computed
capability transformation (see "Capability inheritance across exec"
below). If no credential change is needed, bprm.cred is a clone of
the pre-exec credentials and the swap is a no-op in terms of privilege.
b. Clear UmkaOS capability space (CapSpace): the process's CapSpace
is reset to the post-exec capability set computed by the exec-time
capability grant table. All previously held KABI capabilities are
revoked. Capabilities marked CAP_INHERITABLE in the grant table are
re-granted as fresh CapEntry objects. See
Section 9.1.
c. Update task.comm to the executable basename (truncated to 15 bytes + NUL).
d. Reset affected resource limits (RLIMIT_STACK to default if setuid).
e. Set the process dumpable flag (non-dumpable if credentials changed).
f. Clear sigaltstack: task.sas_ss_sp = 0; task.sas_ss_size = 0;
The old alternate signal stack region is in the old (now destroyed) address
space. Without clearing, signal delivery with SA_ONSTACK would use a
garbage address in the new binary's address space.
g. Clear scheduler upcall state (UmkaOS-specific, no Linux equivalent):
task.upcall_stack_base = 0;
task.upcall_stack_size = 0;
task.in_upcall.store(false, Release);
task.upcall_pending.store(false, Release);
// Re-randomize upcall_frame_nonce on exec for defense-in-depth.
// The nonce prevents forged UpcallFrames. After exec, the new binary's
// threat model differs from the old binary's. Re-randomizing is cheap
// (one rdrand/equivalent) and eliminates a theoretical side-channel.
task.upcall_frame_nonce = arch::current::rng::get_random_u64();
h. Reset rseq registration:
task.rseq_area.store(core::ptr::null_mut(), Release); // AtomicPtr: interior mutability through &Task
task.rseq_len.store(0, Release); // AtomicU32
task.rseq_sig = 0; // Plain u32 — only written by the owning thread via rseq(2)/exec.
Linux's begin_new_exec() performs the same reset. The new binary will
re-register rseq via the rseq(2) syscall during C library initialization.
h2. Clear VFS ring pointer: The pre-exec task may have been mid-VFS-operation
when exec was invoked (the syscall itself is a VFS operation). After exec,
current_vfs_ring must be null: the old VFS ring context is invalidated by
the address space replacement (step 4b). Without clearing, crash recovery's
sigkill_stuck_producers() can match the stale pointer against a crashed
ring set, sending false SIGKILL to the post-exec task.
See Task.current_vfs_ring field doc for the full lifecycle invariant.
i. Ptrace exec event: If the task is being ptraced:
if task_is_ptraced(task) {
ptrace_event(PTRACE_EVENT_EXEC, old_leader_tid);
// If PTRACE_O_TRACEEXEC is set: the tracee stops here.
// If legacy PTRACE_ATTACH without PTRACE_O_TRACEEXEC: send SIGTRAP.
}
begin_new_exec() calls ptrace_event(PTRACE_EVENT_EXEC, ...) after
the credential commit. See Section 20.4.
j. Update mm->exe_file: Set the new mm's executable file reference (the opened
binary's file object from bprm) so that /proc/[pid]/exe resolves to the new
binary. Without this, /proc/[pid]/exe would be broken (the new mm has no exe_file).
k. Clear robust_list: task.robust_list = core::ptr::null_mut();
The old robust futex list pointer is in the old address space.
l. LSM domain transition hook: lsm_task_exec_transition(old_cred, new_cred).
This is a DISTINCT hook from commit_creds() — commit_creds() calls
lsm_call_cred_security(Commit, ...) (credential publishing), while this
step calls lsm_task_exec_transition() (domain transition, e.g., SELinux
type transition from httpd_t to httpd_script_t). The two hooks are
separate entries in the LSM hook table
(Section 9.8).
m. perf_event cleanup: perf_event_exec(task) — iterate task.perf_events
and for each event: if attr.enable_on_exec is set, enable the event
(reset counters, start counting in the new binary); otherwise, disable it
(the old address space's instruction addresses are no longer valid).
See Section 20.8 (the exec cleanup
reuses the same perf_event_exit_task() mechanism with exec-specific behavior:
enable events with enable_on_exec, disable others).
n. io_uring cleanup: If the task has an io_uring context (io_uring_tctx),
cancel all pending io_uring requests associated with the old file table.
This is a non-blocking cancellation: in-flight requests are marked cancelled
and their completions are discarded. Ring fds without CLOEXEC survive exec
(the ring remains mapped in the new address space); ring fds with CLOEXEC
are closed in step 4d.
See Section 19.3 (the exec cleanup reuses
the same io_uring_task_cancel() mechanism as do_exit).
o. seccomp filter handling: If seccomp-bpf is active, the filter state is
inherited across exec (Linux semantics — POSIX does not define seccomp).
If SECCOMP_FILTER_FLAG_TSYNC was used, the filter applies to the new
binary. This is NOT reset — the new binary runs under the same seccomp
jail. (The exec flow does not need to clear seccomp; noting this explicitly
to prevent implementer confusion.)
See Section 10.3.
p. mm.users lifecycle note: After de_thread completes (step 4a), old_mm.users == 1
(only the exec-ing thread remains). exec_mmap (step 4b) decrements from 1 to 0
and synchronously calls exit_mmap() to tear down the old address space. This is
cleaner than the alternative (where the last sibling thread's exit triggers teardown
asynchronously).
6. Load segments (see Section 8.3 for the complete segment mapping
algorithm, map_elf_segment(), auxiliary vector construction, and ASLR):
Map ELF PT_LOAD segments as VMAs with the specified permissions
(demand-paged via the page cache). Note: segment loading operates on
process.mm (the now-current mm installed by exec_mmap in step 4b),
NOT on bprm.mm (which is None after .take() in exec_mmap).
If PT_INTERP is present, map the dynamic linker
(ld-linux-x86-64.so.2, ld-musl-x86_64.so.1, etc.) as well.
Allocate a new user stack. Push auxv (auxiliary vector), envp,
and argv onto the stack in the standard layout expected by the C runtime.
Apply ASLR: randomize mmap base, stack base, and heap base.
7. Start execution (start_thread()): Initialize user-mode register state for the
new binary. This is architecture-specific and MUST be done before returning to
userspace to prevent information leaks from the old binary or kernel state.
start_thread(regs, entry_point, new_sp):
a. Zero the pt_regs frame on the kernel stack — the syscall entry path
saved user registers into a pt_regs frame on the kernel stack (or in the
per-arch ArchContext). start_thread() zeroes this entire frame, then
overwrites IP/SP/flags with the new binary's values. This prevents leaking
kernel register contents and the old binary's state (security-critical: old
binary may have contained cryptographic key material in registers).
The Task.context: ArchContext (callee-saved kernel registers) does NOT
need resetting — it contains only kernel-mode register state that is never
exposed to userspace.
b. Set IP to the ELF interpreter's entry point (or the binary's own entry
point if statically linked).
c. Set SP to the new stack top (computed by the stack construction step).
d. Set user-mode flags:
- x86-64: EFLAGS = X86_EFLAGS_IF (interrupts enabled, all other flags 0).
CS = __USER_CS (0x33), SS = __USER_DS (0x2b).
- AArch64: PSTATE = PSR_MODE_EL0t (user mode, interrupts enabled).
- ARMv7: CPSR = USR_MODE | PSR_ENDIAN | PSR_I_BIT cleared.
- RISC-V: sstatus.SPP = 0 (user mode).
- PPC32/PPC64LE: MSR set for user mode with EE enabled.
- s390x: PSW with problem state, interrupts enabled.
- LoongArch64: PRMD.PLV = 3 (user mode), PRMD.PIE = 1.
e. Reinitialize FPU/SIMD state to hardware init values. All FP/SIMD
registers are zeroed, FP control registers reset to defaults:
- x86-64: XRSTOR with init state (MXCSR = 0x1F80, all XMM/YMM/ZMM zero).
Reset PKRU to default (0x55555554 — deny all keys except key 0).
- AArch64: zero all NEON/SVE registers, FPCR/FPSR = 0.
- RISC-V: zero all FP registers, FCSR = 0.
- Other architectures: analogous FPU reset.
Without this, the old binary's FPU state (potentially containing keys) leaks.
f. Clear debug registers (unless preserved by ptrace).
g. Return to user space via the syscall return path.
Capability grants on exec replace the traditional setuid/setgid mechanism
(Section 9.2). Instead of running the new program as a different UID, the kernel
consults a per-binary capability grant table and adds the specified capabilities to the
task's CapHandle. The process never gains uid 0 -- it gains precisely the capabilities
the binary needs (e.g., CAP_NET_BIND_SERVICE for a web server on port 80).
Capability inheritance across exec: The Linux-compatible capability inheritance
model (Section 9.9) applies at
step 5 (setup_new_exec). The new effective/permitted/inheritable/ambient capability
sets are computed from the binary's file capability xattr (security.capability),
the task's current bounding set, and the ambient capability set, following the
formula: P'(effective) = F(effective) ? P'(permitted) : P'(ambient) where
P'(permitted) = (P(inheritable) & F(inheritable)) | (F(permitted) & P(bounding)) | P(ambient).
The bounding set limits all capability transitions. See
Section 9.9 for the complete
derivation and the interaction with user namespaces.
Security cleanup on exec:
- File descriptors with CLOEXEC are closed.
- Signal dispositions are reset to SIG_DFL (except SIG_IGN, which is preserved).
- Pending signals are retained (POSIX requirement; delivered after exec completes).
- The process dumpable flag is re-evaluated (non-dumpable if capabilities were gained).
- Address space layout randomization (ASLR) re-randomizes all base addresses.
exec error handling (pre- and post-PNR):
If exec fails before the point of no return (steps 1-3):
- No process state has been modified. do_execveat_common returns the error
and the calling task continues executing the old binary.
- If bprm.mm was allocated (new mm for the exec'd binary):
BinPrm::Drop calls mmput(bprm.mm) to free page tables and VMA
allocations. This is automatic — no explicit cleanup needed on the error
path. See Section 8.3 for the Drop
implementation.
If exec fails after the point of no return (step 4+):
- The old address space has been destroyed. The task cannot return to the old
binary. The kernel sends SIGKILL to the task (matching Linux's
force_fatal_sig(SIGKILL) in search_binary_handler post-PNR failure path).
- Specific failure modes: segment mapping failure in step 6 (ENOMEM, ELF
validation error), stack setup failure, or interpreter loading failure.
- The task enters do_exit() with a fatal signal — normal exit/zombie path.
8.1.4 Task Exit and Resource Cleanup¶
A task exits via exit() (single task) or exit_group() (all tasks in the process).
The latter is what the C library's exit() actually calls.
Cleanup order for a single-task exit:
- Cancel pending asynchronous I/O (io_uring SQEs, AIO requests).
- If this is the last task in the process, proceed to process cleanup (below).
  Otherwise, release per-task resources (stack, ArchContext) and remove the
  task from the thread group.
Process cleanup (when the last task exits):
- Close all file descriptors in the fd table.
- Release all capabilities in the CapSpace.
- Tear down the address space: unmap all VMAs, release page table pages,
  decrement page reference counts.
- Deliver SIGCHLD to the parent process.
- Reparent children: any surviving child processes are reparented to the
  nearest subreaper (a process that set PR_SET_CHILD_SUBREAPER) or to
  init (pid 1).
- Transition to the zombie state. The task remains in the task table with
  its exit status until the parent calls wait()/waitpid()/waitid().
Zombie reaping: The zombie consumes only a small task-table slot (no address space,
no fd table, no capabilities). The parent retrieves the exit status and resource usage
via wait4() or waitid(), which frees the slot. If the parent exits without reaping,
the reparented-to ancestor (init or subreaper) is responsible.
Session and process group lifecycle: When a session leader exits, SIGHUP is
delivered to the foreground process group of the controlling terminal. The terminal is
disassociated from the session. This matches POSIX semantics required by sshd,
tmux, and shell job control.
8.1.4.1 pidfd — Process File Descriptors¶
Process file descriptors (pidfd) provide a race-free handle to a process, avoiding
the PID-reuse races inherent in integer PIDs. UmkaOS implements the full Linux pidfd
API (Linux 5.3+):
pidfd_open(pid, flags) -> fd (syscall 434 on x86-64): Opens a pidfd for the
specified process. flags: PIDFD_NONBLOCK (0x800, = O_NONBLOCK) sets non-blocking
mode on the resulting fd (Linux 5.10+). PIDFD_THREAD (0x1, Linux 6.9+) opens a pidfd
for a specific thread rather than a process. Returns -ESRCH if the PID does not exist.
The returned fd is pollable (EPOLLIN when the process exits), usable with
waitid(P_PIDFD, pidfd, ...), and passes across fork() / exec().
pidfd_send_signal(pidfd, sig, info, flags) (syscall 424 on x86-64): Sends a
signal to the process identified by pidfd. Race-free: if the process has exited and
the PID has been recycled, the pidfd still refers to the original (now-zombie) process
and the signal is correctly dropped (not delivered to the recycled PID's occupant).
pidfd_getfd(pidfd, targetfd, flags) -> fd (syscall 438 on x86-64): Duplicates
file descriptor targetfd from the process identified by pidfd into the calling
process's fd table. Returns a new fd in the caller referring to the same OpenFile.
Requires PTRACE_MODE_ATTACH_REALCREDS permission on the target process (same
permission check as PTRACE_ATTACH). flags must be 0.
Error codes:
- -ESRCH: target process does not exist or has fully exited (zombie reaped).
- -EBADF: pidfd is not a valid pidfd, or targetfd is not a valid fd in the
target process.
- -EPERM: caller lacks PTRACE_MODE_ATTACH_REALCREDS permission.
- -EMFILE: caller's fd table is full.
Implementation note: The pidfd is backed by a reference to the target's
struct Process (via Arc<Process>). The reference keeps the process reachable
even after it becomes a zombie, preventing PID-reuse races. The fd is released when
the pidfd is closed.
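The race-free property follows directly from the Arc<Process> backing: the pidfd pins the original Process object, so even after the numeric PID is recycled, a signal sent through the pidfd targets the original (now-exited) process. A minimal self-contained model of that behaviour, using hypothetical ToyProcess/ToyPidFd names rather than the kernel's real types:

```rust
use std::sync::{Arc, Mutex};

/// Toy model: the pidfd holds a strong reference to the original process
/// object, not a numeric PID, so PID recycling cannot redirect the signal.
struct ToyProcess {
    alive: Mutex<bool>,
}

struct ToyPidFd {
    process: Arc<ToyProcess>, // keeps the (zombie) process reachable
}

impl ToyPidFd {
    /// Returns true if the signal was delivered, false if it was dropped
    /// because the original process has already exited.
    fn send_signal(&self) -> bool {
        *self.process.alive.lock().unwrap()
    }
}
```

An integer-PID `kill()` at the same point would have to look up the PID afresh and could hit an unrelated process occupying the recycled slot.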
8.1.4.1.1 UmkaOS Process Exit Cleanup Tokens¶
Problem with atexit() and signal handlers for cleanup: Resource cleanup on
process death depends on atexit() (runs handlers synchronously during exit,
can block or be skipped on SIGKILL), signal handlers for SIGTERM (async, the
process may crash before the handler runs), or close() callbacks on file
descriptors (limited expressiveness). None of these mechanisms fire on SIGKILL
or OOM kill, and a handler that blocks stalls the entire exit.
UmkaOS exit cleanup tokens: A kernel-managed cleanup mechanism tied to process
lifetime. When the process exits for any reason — normal exit(), SIGKILL,
unhandled fault, or OOM kill — the kernel executes registered cleanup actions
after the process's address space is torn down, using kernel-internal state
that is independent of the process's stack and heap.
/// A handle to a registered process exit cleanup action.
///
/// Dropping the handle (or calling `cancel()`) unregisters the action so
/// it will not run when the process exits.
///
/// Obtained from `umka_register_exit_cleanup()`.
pub struct ExitCleanupHandle {
/// Opaque identifier assigned at registration time.
id: CleanupId,
/// Weak reference so the handle does not extend process lifetime.
process: Weak<Process>,
}
impl ExitCleanupHandle {
/// Cancel this cleanup action.
///
/// After this call the action will not execute on process exit.
/// Safe to call from any thread that owns the handle.
pub fn cancel(self) {
// Consuming self triggers Drop, which removes the entry from the
// process's cleanup list under the process lock.
}
}
impl Drop for ExitCleanupHandle {
fn drop(&mut self) {
// Attempt to upgrade Weak → Arc. If upgrade fails, the process
// has already been fully dropped — the cleanup list was drained
// during do_exit(), so no action is needed and the handle is
// simply discarded. No resource leak occurs because:
// 1. The cleanup action was already executed (if the process
// exited normally and do_exit drained the list), or
// 2. The process was killed and do_exit executed all remaining
// cleanup actions before the last Arc<Process> was dropped.
if let Some(process) = self.process.upgrade() {
let mut cleanups = process.exit_cleanups.lock();
cleanups.remove(self.id);
}
// If upgrade fails: no-op (process already gone, cleanup already
// executed or irrelevant).
}
}
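The cancel-by-drop semantics can be exercised in isolation. The sketch below models only the registration list and the Weak-upgrade no-op path; ToyProc, ToyHandle, and register are hypothetical stand-ins for Process, ExitCleanupHandle, and umka_register_exit_cleanup.

```rust
use std::sync::{Arc, Mutex, Weak};

/// Toy process holding only its cleanup-action list (action ids).
struct ToyProc {
    cleanups: Mutex<Vec<u64>>,
}

/// Toy handle: dropping it unregisters the action; if the process is
/// already gone, the Weak upgrade fails and the drop is a no-op.
struct ToyHandle {
    id: u64,
    process: Weak<ToyProc>,
}

impl Drop for ToyHandle {
    fn drop(&mut self) {
        if let Some(p) = self.process.upgrade() {
            p.cleanups.lock().unwrap().retain(|&x| x != self.id);
        }
        // Upgrade failure: process exited, list already drained — no-op.
    }
}

fn register(p: &Arc<ToyProc>, id: u64) -> ToyHandle {
    p.cleanups.lock().unwrap().push(id);
    ToyHandle { id, process: Arc::downgrade(p) }
}
```

Note the Weak reference in the handle: a strong Arc would keep the zombie's Process object alive for as long as any registrant held a handle.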
/// Generation-tagged process identifier for deferred operations.
/// Also known conceptually as a "generational PID" — distinct from raw
/// `pid_t` (i32) used in `Process.parent` and other fields that hold
/// live Arc references (where PID recycling is prevented by refcounting,
/// not generation tagging).
///
/// Combines a numeric PID with a monotonically increasing generation counter
/// to prevent PID recycling races. When a cleanup action (or any deferred
/// operation) needs to target a process that may exit before the action runs,
/// a `ProcessId` is captured at registration time. At execution time, the
/// kernel looks up the PID in the process table and compares the generation:
/// if they match, the target is the same process; if not, the PID was recycled
/// and the operation is safely dropped.
///
/// The generation counter is incremented each time a PID slot is allocated
/// from the PID allocator (see [Section 8.1](#process-and-task-management--task-model)).
/// With u64, at 10 billion PID allocations per second, the counter survives
/// ~58 years before wrapping — well within the 50-year uptime target.
pub struct ProcessId {
/// The numeric PID (matches Linux `pid_t` = `i32`).
pub pid: i32,
/// Generation counter for this PID slot. Monotonically increasing.
/// Compared at lookup time to detect recycling.
pub generation: u64,
}
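The generation-compare lookup described above can be sketched with a plain map from PID to the generation of its current occupant. ProcTable, alloc, and still_same_process are hypothetical names for illustration, not the kernel's PID allocator API.

```rust
use std::collections::HashMap;

/// Simplified process table illustrating generation-compare lookup.
struct ProcTable {
    next_gen: u64,
    slots: HashMap<i32, u64>, // pid -> generation of the current occupant
}

impl ProcTable {
    fn new() -> Self {
        Self { next_gen: 0, slots: HashMap::new() }
    }

    /// Allocate (or recycle) a PID slot, bumping the generation counter.
    fn alloc(&mut self, pid: i32) -> (i32, u64) {
        self.next_gen += 1;
        self.slots.insert(pid, self.next_gen);
        (pid, self.next_gen)
    }

    /// A deferred operation presents the (pid, generation) it captured at
    /// registration time; a mismatch means the PID was recycled and the
    /// operation must be silently dropped.
    fn still_same_process(&self, pid: i32, generation: u64) -> bool {
        self.slots.get(&pid) == Some(&generation)
    }
}
```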
/// The kernel action to execute when the owning process exits.
pub enum ExitCleanupAction {
/// Revoke a capability held by the kernel on behalf of this process.
/// Equivalent to calling `cap_revoke()` after the process has gone.
/// Used by resource managers to release kernel resources without
/// relying on heartbeat polling.
RevokeCap(Cap),
/// Unlink a filesystem path (e.g., a pidfile or UNIX socket file).
/// Executes as `unlinkat(dirfd, path, 0)` in the kernel cleanup thread.
/// `dirfd` is `None` for absolute paths (treated as AT_FDCWD relative
/// to the root mount at exit time, not to the process's cwd).
///
/// **Allocation**: PathBuf is heap-allocated at `umka_register_exit_cleanup()`
/// time (warm path — syscall context, not hot path). The path is copied from
/// userspace once and stored until process exit. Acceptable: registration is
    /// infrequent and bounded by `UMKA_MAX_EXIT_CLEANUPS` (64) per process.
UnlinkPath {
path: PathBuf,
dirfd: Option<DirFd>,
/// Mount namespace captured at `umka_register_exit_cleanup()` time.
/// The cleanup thread enters this namespace before executing `unlinkat`.
/// Without this, the path would be resolved in the init namespace,
/// which would fail for processes in container (non-init) mount namespaces.
ns: Arc<MountNamespace>,
},
/// Send a signal to another process.
/// Executes as `kill(target, signo)` from the kernel cleanup thread.
/// The cleanup thread uses the captured `cred` for the `kill()`
/// permission check, NOT its own (kernel) credentials — otherwise
/// any process could escalate to unrestricted signal delivery.
///
/// **PID recycling protection**: `target` is a `ProcessId` containing both
/// the numeric PID and a generation counter captured at registration time.
/// At cleanup execution, the kernel looks up the PID in the process table
/// and compares the generation counter. If the generation does not match
/// (the original process exited and the PID was recycled), the signal is
/// silently dropped — not delivered to the recycled PID's occupant. This
/// prevents the exact race condition that `pidfd` was designed to solve,
/// without requiring the registering process to hold a pidfd.
SendSignal {
target: ProcessId,
signo: Signal,
/// Captured at `umka_register_exit_cleanup()` time from the
/// registering process's credentials.
cred: Arc<TaskCredential>,
},
/// Increment an eventfd counter to notify watchers that this process
/// has exited. Executes as `write(fd, &value.to_ne_bytes(), 8)`.
NotifyEventFd {
fd: OwnedFd,
value: u64,
},
}
/// Maximum number of cleanup actions that may be registered per process.
/// `umka_register_exit_cleanup()` returns `EMFILE` when this limit is reached.
pub const UMKA_MAX_EXIT_CLEANUPS: usize = 64;
/// Register a cleanup action to run when the current process exits.
///
/// The action runs in a dedicated kernel cleanup thread after the process's
/// address space and file descriptors are closed, and before the zombie is
/// made visible to `waitpid()`. The handle must be kept alive for the action
/// to remain registered; dropping the handle cancels the action.
///
/// Returns `EMFILE` if `UMKA_MAX_EXIT_CLEANUPS` actions are already registered.
/// Returns `EINVAL` if the action references an invalid fd, cap, or path.
///
/// Native UmkaOS syscall: `umka_register_exit_cleanup(2)` — syscall number -0x0A00
/// (negative, dispatched via bidirectional table; see Section 19.1.4.1).
pub fn umka_register_exit_cleanup(
action: ExitCleanupAction,
) -> Result<ExitCleanupHandle>;
Execution ordering within do_exit() (summary; full 14-step specification in
Section 8.2):
- Process calls exit() or is killed — do_exit() begins. PF_EXITING is set.
- Per-task cancellation: pending timers and async I/O are cancelled, signal
  delivery is inhibited (no new signals can be delivered to this task).
- If exit_group: SIGKILL all sibling threads; the last thread performs
  process teardown.
- Address space torn down: TLB flushed, core dump written (if
  signal-triggered), all VMAs unmapped, page tables freed.
- File descriptors closed, filesystem references released, signal handlers freed.
- Cgroup detach, IPC resource release, audit/accounting records written.
- Namespaces released (if last task in namespace).
- Cleanup phase: the kernel cleanup thread (one per CPU socket, pre-started
  at boot) dequeues and executes all ExitCleanupActions registered for this
  process, in registration order. Each action runs with a 1-second timeout:
  if an action blocks beyond the timeout, the kernel logs a warning at WARN
  level identifying the action type and process, then skips to the next
  action. Cleanup actions must not block; this timeout is a safety net, not
  a design budget.
- Process transitions to zombie; SIGCHLD delivered to parent; children reparented.
- Schedule away — task never runs again.
Why run cleanup after MM teardown?: Cleanup actions are intentionally
deferred until after the process's own resources are gone, for two reasons.
First, RevokeCap and UnlinkPath are safe to execute unconditionally because
the process can no longer race with them — it has no address space or fd table.
Second, SendSignal after MM teardown sends a notification to an observer
rather than interacting with a still-running process, which is a well-defined
operation with no ordering ambiguity.
Comparison with existing cleanup mechanisms:
| Property | atexit() / C++ destructors | Signal handlers (SIGTERM) | UmkaOS exit cleanup tokens |
|---|---|---|---|
| Runs on SIGKILL | No | No | Yes |
| Runs on OOM kill | No | No | Yes |
| Runs on unhandled fault | No | No | Yes |
| Can block exit | Yes | Yes | No (1 s timeout, kernel-managed) |
| Can crash and skip cleanup | Yes | Yes | No (runs in kernel thread) |
| Requires process address space | Yes | Yes | No (runs after MM torn down) |
| Integrates with capabilities | No | No | Yes (RevokeCap action) |
| Max registered actions | Unlimited (heap) | 1 per signal | 64 per process |
Linux compatibility: atexit(), on_exit(), and C++ destructors are fully
supported via the userspace runtime — they are unchanged. Exit cleanup tokens
are an UmkaOS extension exposed through the umka_register_exit_cleanup(2)
syscall (number -0x0A00, in the UmkaOS native negative-number syscall range).
Existing Linux binaries do not use this mechanism and are not affected by it.
Use cases:
- Container runtimes: unlink socket files and pidfiles when a container process dies, without needing to poll for liveness.
- Resource managers: release kernel resources (close capabilities, free reserved bandwidth) when a client process exits, replacing the heartbeat polling pattern common in distributed systems.
- Language runtimes (Go, JVM, .NET): guarantee cleanup of native resources even when the runtime is killed before its own shutdown hooks can run.
- Service monitors: write to an eventfd to notify a watchdog when a worker
  process exits, with lower latency than polling /proc or a pidfd.
8.1.5 Address Space Operations¶
User space manipulates its address space through these syscalls, all of which go through
capability checks on the calling process's CapSpace:
- mmap()/munmap(): Create and destroy virtual memory regions. Anonymous
  mappings allocate from the physical allocator on demand (Section 4.2).
  File-backed mappings go through the page cache (Section 4.4). MAP_SHARED
  mappings are backed by a shared page cache entry; MAP_PRIVATE mappings
  COW on write.
- mprotect(): Change page permissions on an existing mapping. On x86-64,
  this integrates with hardware domain isolation (Section 11.2) -- Tier 1
  driver memory regions can be made accessible only when the driver's
  protection key is active.
- brk()/sbrk(): Legacy heap expansion interface. Supported for compatibility
  with applications that do not use mmap-based allocators. Implemented as a
  resizable anonymous VMA at the process's break address.
8.1.5.1 mprotect(addr, len, prot) — Change Memory Region Permissions¶
Changes the access permissions of the virtual address range [addr, addr+len).
The full specification is in Section 4.8. Key requirements:
- addr must be page-aligned. len is rounded up to the next page boundary.
- If the range spans multiple VMAs, each is updated independently; the range
  must not include any unmapped gap (returns -ENOMEM).
- VMAs may be split at addr and addr+len if the range does not cover an
  entire VMA (same splitting logic as a partial munmap).
- prot is PROT_NONE | PROT_READ | PROT_WRITE | PROT_EXEC (combinable).
- Write permission on a file-backed mapping requires that the mapping was
  created with MAP_SHARED and the file was opened for writing; otherwise
  -EACCES.
- On architectures with W^X enforcement (all by default),
  PROT_WRITE | PROT_EXEC is rejected with -EACCES unless the process holds
  CAP_SYS_RAWIO or the VMA was explicitly created with both permissions.
- A TLB flush is issued for all affected pages after the PTE updates. On
  SMP, this is a cross-CPU TLB shootdown via IPI.
- Error codes: -EINVAL (unaligned addr), -ENOMEM (unmapped gap or VMA
  allocation failure during split), -EACCES (permission denied).
Capability-mediated memory sharing provides a secure alternative to POSIX shared
memory. Instead of a global namespace (/dev/shm/name), memory regions are shared by
explicitly granting a capability to the target process:
/// Grant access to a memory region to another process.
/// Returns a transferable capability handle.
pub fn mem_grant(
target: ProcessId,
region: VmaId,
perms: Permissions,
) -> Result<CapHandle>;
/// Map a previously granted region into the current address space.
/// The grant capability is consumed (single-use) or retained
/// depending on the grant's delegation policy.
pub fn mem_map(
grant: CapHandle,
hint_addr: Option<usize>,
) -> Result<*mut u8>;
This model has several advantages over POSIX shm_open:
- No global namespace: Shared regions are not visible to unrelated processes.
- Fine-grained permissions: The granter specifies read, write, or execute -- not just "owner/group/other" file modes.
- Revocable: The granter can revoke the capability (Section 9.1 generation counter), and the next access by the target faults.
- Auditable: Every grant and map operation flows through capability checks and can be logged.
POSIX shared memory (shm_open / mmap MAP_SHARED) is implemented on top of this
mechanism via the SysAPI layer, with a tmpfs-backed /dev/shm namespace that
translates names to capability grants.
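The "revocable" property rests on the generation-counter check referenced above (Section 9.1): revocation bumps the region's generation, so any capability issued earlier fails validation on its next use. A minimal self-contained model of that check, with GrantTable and its methods as hypothetical names:

```rust
use std::collections::HashMap;

/// Toy model of revocable memory grants: each grant records the region's
/// generation at issue time; revocation bumps the generation, so a later
/// map attempt with a stale grant fails.
struct GrantTable {
    generations: HashMap<u32, u64>, // region id -> current generation
}

#[derive(Clone, Copy)]
struct Grant {
    region: u32,
    generation: u64,
}

impl GrantTable {
    fn grant(&mut self, region: u32) -> Grant {
        let g = *self.generations.entry(region).or_insert(0);
        Grant { region, generation: g }
    }

    fn revoke(&mut self, region: u32) {
        *self.generations.entry(region).or_insert(0) += 1;
    }

    /// Stand-in for mem_map's validation step: the grant is honoured only
    /// while its generation matches the region's current generation.
    fn map(&self, grant: Grant) -> Result<(), &'static str> {
        match self.generations.get(&grant.region) {
            Some(&g) if g == grant.generation => Ok(()),
            _ => Err("grant revoked"),
        }
    }
}
```

In the real kernel the mismatch is detected on the next access via a fault, rather than at map time, but the generation comparison is the same.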
8.1.6 Namespaces¶
Each process belongs to a set of namespaces that isolate its view of system resources.
UmkaOS implements all 8 Linux namespace types plus the UmkaOS-specific ima namespace (see Section 17.1 in 17-containers.md for full details):
| Namespace | Isolates | Key syscall flags |
|---|---|---|
| pid | Process ID space | CLONE_NEWPID |
| net | Network stack, interfaces, routes | CLONE_NEWNET |
| mnt | Mount table | CLONE_NEWNS |
| user | UID/GID mappings | CLONE_NEWUSER |
| uts | Hostname, domainname | CLONE_NEWUTS |
| ipc | SysV IPC, POSIX message queues | CLONE_NEWIPC |
| cgroup | Cgroup root directory | CLONE_NEWCGROUP |
| time | CLOCK_MONOTONIC / CLOCK_BOOTTIME offsets | CLONE_NEWTIME |
| ima | Integrity measurement log and policy (UmkaOS-specific) | (implicit with CLONE_NEWUSER) |
Capability gating: Creating a new namespace requires the appropriate capability --
the UmkaOS equivalent of CAP_SYS_ADMIN (for most namespaces) or an unprivileged
CLONE_NEWUSER followed by capabilities within the new user namespace. This matches
Linux semantics so that rootless containers (Podman, Docker rootless mode) work
unmodified.
Default inheritance: Without any CLONE_NEW* flags, the child task inherits the
parent task's nsproxy: ArcSwap<NamespaceSet> via ArcSwap::from_pointee() with the
same underlying Arc (shared reference, not copy). This is the common case for fork()
and clone() without namespace flags.
When unshare() or setns() is called, only the calling task's nsproxy is replaced
(via task.nsproxy.store(new_arc)) — sibling threads are unaffected, matching Linux's
per-task task_struct->nsproxy semantics.
Process creation integration: clone3() and the native spawn() both accept a
NamespaceSpec that specifies whether each namespace is inherited from the parent or
freshly created. Namespaces are reference-counted; they are destroyed when the last
process in the namespace exits and no external references (bind mounts, open
/proc/[pid]/ns/* file descriptors) remain.
See also: Section 17.1 (17-containers.md) provides the full namespace
implementation details and container runtime compatibility requirements.
8.1.7 User-Mode Scheduling (Fibers and M:N Threading)¶
UmkaOS provides an opt-in scheduler upcall mechanism that enables userspace libraries to implement M:N threading — multiplexing many lightweight fibers (cooperative coroutines) onto fewer OS threads, with correct behaviour when a fiber blocks in a syscall. This is the only kernel-level primitive needed for fibers; the fiber context switch itself (saving/restoring registers, swapping stacks) is purely a userspace library operation and requires no syscall.
This feature is native-UmkaOS-only. It does not exist on Linux. Applications that use it are not portable to Linux without a shim. Existing Linux-compatible applications that do not opt in are completely unaffected.
8.1.7.1 Motivation¶
A fiber (cooperative coroutine, user-mode thread) is a save/restore of the integer and FPU register state plus a stack pointer swap. No kernel involvement is needed for the switch itself. The hard problem is a fiber calling a blocking syscall: without kernel cooperation, the entire OS thread blocks, starving all other fibers running on it.
Three approaches exist:
1. Async-only I/O (restrict fibers to io_uring/epoll): Works, but
requires all callees to be async-aware. Incompatible with legacy synchronous
code.
2. One OS thread per fiber (1:1): Works, but eliminates the efficiency
advantage of fibers and limits parallelism to the thread count.
3. Scheduler upcalls (this design): The kernel calls a registered userspace
function before blocking, allowing the fiber scheduler to park the current
fiber and immediately run another. The OS thread never actually blocks while
runnable fibers exist.
This is the scheduler activations model (Anderson et al., SOSP 1992),
implemented in Solaris LWPs, early NetBSD, and macOS pthreads internals. It
is the correct kernel primitive for M:N scheduling.
8.1.7.2 Scheduler Upcall Registration¶
A thread registers an upcall handler via a new UmkaOS syscall:
/// Register a scheduler upcall handler for the calling thread.
///
/// When the calling thread is about to enter a blocking state (blocking
/// syscall, futex wait, page fault that requires I/O), the kernel saves
/// the thread's full register state into `upcall_stack_top - sizeof(UpcallFrame)`
/// and transfers control to `handler`.
///
/// The handler runs on `upcall_stack` (a separate dedicated stack of
/// `upcall_stack_size` bytes) to avoid corrupting the fiber's stack.
///
/// # Arguments
/// - `handler`: Upcall entry point (see UpcallFrame below).
/// - `upcall_stack`: Userspace-allocated stack for upcall execution.
/// - `upcall_stack_size`: Size of that stack in bytes (minimum 8 KiB).
///
/// # Returns
/// `Ok(())` on success. `EINVAL` if `upcall_stack` is not mapped writable
/// or `upcall_stack_size` is below the minimum.
SYS_register_scheduler_upcall(
handler: extern "C" fn(*mut UpcallFrame),
upcall_stack: *mut u8,
upcall_stack_size: usize,
) -> Result<()>;
/// Deregister the upcall handler. Thread reverts to standard 1:1 blocking.
SYS_deregister_scheduler_upcall() -> Result<()>;
Re-entrancy protection: The Task struct carries two atomic flags for upcall
re-entrancy: in_upcall (set while the handler is executing) and upcall_pending
(deferred trigger for events that arrive while in_upcall is true). These fields are
defined in the Task struct in Section 8.1.1.
The full protocol for upcall delivery is:
Step 1 — Before delivering an upcall to userspace:
Check task.in_upcall.load(Relaxed):
- false → proceed to step 2.
- true → the upcall handler is already executing on the upcall stack.
Nesting would overwrite UpcallFrame, destroying the original
fiber's register state. Instead:
a. Set task.upcall_pending = true.
b. For blocking-syscall events: block the OS thread directly
(standard 1:1 blocking), as if no upcall handler were registered.
The syscall completes normally; the handler will be re-entered
for the next event once upcall_pending is processed.
c. For page-fault I/O events: block synchronously until the page
is populated; do not invoke the handler.
Return from step 1 — do NOT proceed to step 2.
Step 2 — Set task.in_upcall = true (Relaxed store; the architecture guarantees
single-copy atomicity for aligned byte writes).
Step 3 — Build the UpcallFrame on the upcall stack and transfer control to the
registered handler.
Step 4 — When the handler calls SYS_scheduler_upcall_resume() or
SYS_scheduler_upcall_block():
a. Set task.in_upcall = false.
b. Check task.upcall_pending:
- false → return to user space normally (resume the selected fiber or
block the OS thread waiting for completions).
- true → clear upcall_pending, re-examine the current scheduling state,
and if a blocking event is still pending, deliver a fresh upcall
immediately (loop back to step 2). This coalesces all deferred
events into a single handler invocation.
Blocking inside the upcall handler:
The upcall handler must not make blocking syscalls that would stall the OS
thread indefinitely — doing so would starve all fibers on that thread. If the
handler invokes a blocking syscall (e.g., futex_wait, read on a blocking fd),
the kernel permits it (task.in_upcall remains true, so no nested upcall is
issued), and the OS thread blocks until the syscall completes. task.upcall_pending
is set if any scheduling event arrives during that block. The handler should
use non-blocking or io_uring-based I/O on its own internal data structures
(run queue, completion ring) to avoid this. EUCLWAIT is NOT returned; the
kernel does not prohibit blocking syscalls from within the handler — it simply
defers the next upcall via upcall_pending rather than nesting.
This is analogous to how POSIX signal handlers mask the same signal during delivery to prevent re-entrant corruption: the handler is protected from re-entry, while incoming events are deferred rather than dropped.
/// Saved register state of the fiber that triggered the upcall.
/// Passed by pointer to the upcall handler; the fiber is resumed by
/// restoring these registers (see SYS_scheduler_upcall_resume).
#[repr(C)]
pub struct UpcallFrame {
/// Saved general-purpose registers (architecture-specific layout).
pub regs: ArchRegs,
/// Why the fiber is blocking.
pub reason: BlockReason,
/// Opaque kernel handle — pass back to SYS_scheduler_upcall_resume
/// or SYS_scheduler_upcall_block.
pub fiber_token: u64,
/// Integrity cookie: `UPCALL_FRAME_MAGIC ^ task.upcall_frame_nonce`.
/// Written by the kernel when building the frame; verified by
/// `SYS_scheduler_upcall_resume` before restoring registers.
/// See Section 8.1.7.3 UpcallFrame Validation, step 5.
pub magic_cookie: u64,
}
#[repr(u32)]
pub enum BlockReason {
/// Entering a blocking syscall (e.g., read, write, futex_wait).
BlockingSyscall = 1,
/// Page fault requiring disk I/O (demand paging).
PageFaultIo = 2,
/// Waiting for a kernel lock (unlikely; most kernel waits are brief).
KernelLock = 3,
}
/// Per-architecture size assertions for UpcallFrame.
/// UpcallFrame contains ArchRegs (architecture-dependent size) + BlockReason (4B)
/// + fiber_token (8B) + magic_cookie (8B) + possible padding.
/// These assertions ensure the frame size is stable across compiler versions
/// and matches the userspace fiber library's expectations.
///
/// ArchRegs sizes (GP regs + FP/SIMD state — architecture-specific):
/// - x86-64: 168B GP + 512B FXSAVE = 680B → UpcallFrame = 680 + 4 + 4(pad) + 8 + 8 = 704
/// - AArch64: 264B GP + 512B FP/NEON = 776B → UpcallFrame = 776 + 4 + 4(pad) + 8 + 8 = 800
/// - ARMv7: 64B GP + 256B VFPv3 = 320B → UpcallFrame = 320 + 4 + 4(pad) + 8 + 8 = 344
/// - RISC-V 64: 256B GP + 256B FP = 512B → UpcallFrame = 512 + 4 + 4(pad) + 8 + 8 = 536
/// - PPC32: 136B GP + 256B FP = 392B → UpcallFrame = 392 + 4 + 4(pad) + 8 + 8 = 416
/// - PPC64LE: 288B GP + 256B FP = 544B → UpcallFrame = 544 + 4 + 4(pad) + 8 + 8 = 568
/// - s390x: 128B GP + 128B FP = 256B → UpcallFrame = 256 + 4 + 4(pad) + 8 + 8 = 280
/// - LoongArch64: 256B GP + 256B FP = 512B → UpcallFrame = 512 + 4 + 4(pad) + 8 + 8 = 536
///
/// Note: exact ArchRegs sizes are defined in the per-architecture arch modules.
/// The const_asserts below must be updated whenever ArchRegs changes.
/// Implementation agents: verify these values against the actual ArchRegs size
/// for each target before compiling.
#[cfg(target_arch = "x86_64")]
const_assert!(size_of::<UpcallFrame>() == 704);
#[cfg(target_arch = "aarch64")]
const_assert!(size_of::<UpcallFrame>() == 800);
// ARMv7, RISC-V, PPC, s390x, LoongArch64: verify at implementation time
// when ArchRegs is finalized for each architecture.
8.1.7.3 Upcall Handler Flow¶
Fiber A calls read(fd, buf, len) → would block
↓
Kernel saves Fiber A registers into UpcallFrame on upcall stack
↓
Kernel transfers control to handler(frame) on upcall stack
↓
Handler (fiber scheduler):
- Parks Fiber A: stores frame->fiber_token, records Fiber A as "blocked on read"
- Submits read to io_uring for non-blocking completion
(Note: the wakeup path uses io_uring completion rings, not eventfd.
The io_uring CQE provides both the completion signal and the result
data in a single shared-memory read, avoiding the extra syscall
overhead of eventfd notification.)
- Picks Fiber B from the run queue
- Calls SYS_scheduler_upcall_resume(fiber_b_frame) to restore Fiber B
↓
Fiber B runs on the OS thread
↓
io_uring completion arrives → event loop wakes Fiber A
- Handler receives io_uring completion
- Reconstructs Fiber A's UpcallFrame with the result
- Calls SYS_scheduler_upcall_resume(fiber_a_frame) to restore Fiber A
↓
Fiber A resumes with read() returning the result
Two new syscalls control fiber resumption:
/// Restore a fiber that was parked by an upcall.
/// Restores the registers from `frame` and returns to the fiber's PC.
/// The `result` value is placed in the return register (rax / x0 / a0).
/// On success this call never returns to the caller — control goes to the
/// fiber. If UpcallFrame validation fails (see UpcallFrame Validation below),
/// an error is returned to the caller instead.
SYS_scheduler_upcall_resume(frame: *const UpcallFrame, result: i64) -> !;
/// Tell the kernel it is safe to block the OS thread now.
/// Used when all fibers are waiting and there is nothing to run.
/// The thread blocks until any previously registered io_uring completion,
/// futex wake, or signal arrives. When an event arrives, the handler is
/// invoked again with the newly unblocked fiber; this call itself never
/// returns through the normal syscall path.
SYS_scheduler_upcall_block() -> !;
UpcallFrame Validation — SYS_scheduler_upcall_resume performs the following
checks on the frame pointer before restoring any register state. All checks must
pass; failure returns an error to the caller (which is possible because the call has
not yet transferred control to the fiber).

1. Frame pointer bounds check. The frame pointer must lie within the calling
   thread's registered upcall stack bounds, checked against `task.upcall_stack_base`
   and `task.upcall_stack_size` (set during `SYS_register_scheduler_upcall()`
   registration). The entire `UpcallFrame` (from `frame` to
   `frame + size_of::<UpcallFrame>()`) must fit within the range
   `[upcall_stack_base, upcall_stack_base + upcall_stack_size)`. If out of
   bounds: return `EFAULT`. Note: these are the upcall stack bounds (user-space
   memory allocated by the fiber library and registered via
   `SYS_register_scheduler_upcall`), NOT the kernel stack bounds
   (`task.stack_base`/`task.stack_size`) and NOT the main user-space thread
   stack (`mm.start_stack`/`RLIMIT_STACK`). The upcall stack is a third,
   separate stack dedicated to the fiber scheduler's upcall handler.

2. Saved instruction pointer check. The saved program counter in the frame
   (`frame.regs.rip` on x86-64, `frame.regs.pc` on AArch64/ARMv7,
   `frame.regs.sepc` on RISC-V, `frame.regs.srr0` on PPC) must point to user
   space: its value must be less than `USER_ADDR_LIMIT`. A saved PC pointing
   into kernel address space would allow the fiber scheduler to redirect
   execution into the kernel. If the PC is at or above `USER_ADDR_LIMIT`:
   return `EPERM`.

3. Saved stack pointer check. The saved stack pointer in the frame
   (`frame.regs.rsp` on x86-64, `frame.regs.sp` on AArch64/ARMv7/RISC-V/PPC)
   must point to user space: its value must be less than `USER_ADDR_LIMIT`.
   This is the fiber's original stack pointer (NOT the upcall stack from
   check 1). The fiber may use any valid user-space stack (main stack, thread
   stack, or a fiber-library-allocated stack), so only the user/kernel boundary
   is checked. If at or above `USER_ADDR_LIMIT`: return `EFAULT`.

4. Segment/privilege-level register check (architecture-specific):
   - x86-64: `frame.regs.cs` must equal `USER_CS` (`0x33` — GDT index 6, RPL=3
     code segment) and `frame.regs.ss` must equal `USER_DS` (`0x2B` — GDT
     index 5, RPL=3 data segment). If either is wrong: return `EINVAL`.
     (Linux: `__USER_CS=0x33`, `__USER_DS=0x2B` in
     `arch/x86/include/asm/segment.h`.)
   - AArch64: `frame.regs.pstate & PSTATE_EL_MASK` must indicate EL0. If the
     saved PSTATE specifies EL1 or higher: return `EINVAL`.
   - ARMv7: `frame.regs.cpsr & MODE_MASK` must indicate USR mode (`0x10`). If
     it specifies any privileged mode: return `EINVAL`.
   - RISC-V: `frame.regs.sstatus & SPP_MASK` must indicate U-mode (SPP=0). If
     SPP indicates S-mode: return `EINVAL`.
   - PPC32/PPC64LE: `frame.regs.msr & MSR_PR` must be set (problem state /
     user mode). If PR is clear (supervisor mode): return `EINVAL`.

5. Magic cookie integrity check. The `frame.magic_cookie` field must satisfy
   `frame.magic_cookie ^ task.upcall_frame_nonce == UPCALL_FRAME_MAGIC`, where
   `UPCALL_FRAME_MAGIC` is a compile-time constant (e.g., `0x55504341_4C4C464D`
   — "UPCALLFM" in ASCII) and `task.upcall_frame_nonce` is the per-thread nonce
   stored in the kernel's `Task` struct (see field documentation above). The
   nonce is generated from the hardware RNG at thread creation time and is never
   accessible to userspace. A frame not written by the kernel's own upcall
   dispatch code will fail this check because the attacker cannot know the nonce
   value. If the cookie does not match: return `EINVAL`.
When the kernel builds an UpcallFrame (Step 3 of Section 8.1.7.2), it
writes magic_cookie = UPCALL_FRAME_MAGIC ^ task.upcall_frame_nonce into
the frame before transferring control to the upcall handler.
6. FPU/SIMD state validation (architecture-specific): the frame's saved
   FPU/SIMD state must pass the following per-architecture checks. Without
   these checks, a malicious fiber scheduler could craft an `UpcallFrame` with
   invalid FPU state that causes undefined behavior or information leakage when
   restored.
| Architecture | Validation |
|---|---|
| x86-64 | xstate_bv & ~XCR0 must be 0 (no feature bits set that the CPU does not support). mxcsr & MXCSR_RESERVED_MASK must be 0 (reserved bits in MXCSR must be clear; MXCSR_RESERVED_MASK is read from CPUID.(EAX=01h):EDX at boot). If XSAVE area is present in the frame, its size must match xstate_size from CPUID.(EAX=0Dh, ECX=0). |
| AArch64 | fpsr reserved bits (bits 31:28 and 26:8) must be 0. fpcr reserved bits must be 0 (only bits 26:24, 23:22, 21:19, 18:16, 15, 14:12, 11:8 are defined per ARMv8-A). |
| ARMv7 | fpscr reserved bits (bits 31:28 cleared except for condition flags; bits 7, 6, 5 reserved) must match architecture constraints. |
| RISC-V | fcsr & ~0xFF must be 0 (only bits 7:0 are defined: frm[7:5] and fflags[4:0]). |
| PPC32/PPC64LE | fpscr reserved bits must be 0. Specifically, bits 0 (FX summary) through bit 31 are defined; any bits designated as reserved by the Power ISA for the target PPC version must be clear. |
| s390x | fpc (floating-point control register): reserved bits (bits 31:16 except DXC, and bits 7:3) must be 0. |
| LoongArch64 | fcsr0 reserved bits (bits 31:30, 28:25) must be 0. Only RM (bits 9:8), Enables (bits 4:0), and Flags (bits 20:16) are defined. |
If any FPU/SIMD field fails validation: return EINVAL.
All six checks are performed in order; the first failure terminates validation and returns the corresponding error code. Only after all checks pass does the kernel restore the register state from the frame and transfer control to the fiber.
8.1.7.4 Interaction with io_uring¶
For BlockingSyscall upcalls, the handler typically converts the blocking
operation to a non-blocking io_uring submission (IORING_OP_READ,
IORING_OP_WRITE, IORING_OP_FUTEX_WAIT, etc.) and calls
SYS_scheduler_upcall_block() when the run queue is empty. The io_uring
completion ring provides the wakeup. This combination fully replaces the
blocking syscall with an async equivalent, transparent to the fiber.
For PageFaultIo upcalls, the handler typically has no alternative — a page
must be faulted in from disk. The handler parks the faulting fiber and runs
others, then calls SYS_scheduler_upcall_block() until a wakeup arrives.
Page fault completion wakeup: When the kernel completes the I/O for a demand
page fault, it posts a synthetic completion event to the thread's registered
io_uring completion queue (if present) or writes an 8-byte counter increment
to the thread's registered eventfd (if configured via
SYS_register_scheduler_upcall). The completion event carries the
fiber_token of the faulting fiber, allowing the handler to identify which
parked fiber is now runnable. If neither io_uring nor eventfd is registered, the
kernel wakes the thread from SYS_scheduler_upcall_block() directly and issues
a new upcall with reason = PageFaultIo and the original fiber_token, allowing
the handler to resume the fiber.
8.1.7.5 Fiber Library Design¶
UmkaOS ships a userspace fiber library (umka-fiber) in the standard library.
The library provides:
// umka-fiber (userspace library, not kernel code)
pub struct Fiber { /* stack, register save area, FLS slot table */ }
pub struct FiberScheduler { /* per-OS-thread run queue, upcall stack */ }
impl FiberScheduler {
/// Initialize the scheduler on the current OS thread.
/// Allocates an upcall stack and calls SYS_register_scheduler_upcall.
pub fn init() -> Self;
/// Create a fiber with the given entry point and stack size.
pub fn spawn(&self, f: impl FnOnce() + 'static, stack_size: usize) -> FiberId;
/// Cooperatively yield to the next runnable fiber.
/// If no other fiber is runnable, returns immediately.
pub fn yield_now(&self);
/// Run the scheduler loop. Returns when all fibers have completed.
pub fn run(&mut self);
}
Fiber Local Storage (FLS): Each Fiber has a private FLS table (array of
*mut () slots, analogous to Windows FlsAlloc/FlsGetValue). The library
swaps the FLS table pointer on every SwitchToFiber — no kernel involvement.
Thread-local storage (#[thread_local]) continues to work as normal and is
shared across all fibers on the same OS thread (matching Windows TLS semantics;
FLS is distinct).
8.1.7.6 WEA Integration¶
Windows Fiber support in WEA (Section 19.6) maps directly onto this mechanism:
- `ConvertThreadToFiber()` → `FiberScheduler::init()` on the calling thread.
- `CreateFiber(size, fn, param)` → `FiberScheduler::spawn(...)`.
- `SwitchToFiber(fiber)` → cooperative yield to a specific fiber; pure userspace register swap, no syscall.
- `FlsAlloc`/`FlsGetValue`/`FlsSetValue` → read/write into the current fiber's FLS slot table; implemented in ntdll by WINE, no WEA syscall needed.
- Blocking syscall inside a fiber → scheduler upcall converts to io_uring; the OS thread runs other fibers while waiting.
The TEB NtTib.FiberData field is updated by WINE on every SwitchToFiber
call (userspace write to the TEB in user address space). The kernel's role is
only to provide the fast NtCurrentTeb() path via the per-thread GS base
mapping (Section 19.6) and the scheduler upcall mechanism above.
8.1.8 Signal Delivery Wakeup Protocol¶
When a signal is sent to a task (via kill(), tkill(), tgkill(), or
kernel-internal signal delivery such as SIGCHLD on child exit or SIGSEGV on
fault), the signal must be both queued and acted upon. Queuing alone is
insufficient — a sleeping task will never check its pending signals unless it
is woken. This section specifies the wakeup protocol that bridges signal
queuing and signal processing.
/// Send a signal to a specific task and ensure it will be processed.
///
/// Called by `kill()`, `tkill()`, `tgkill()` (via the syscall layer), and by
/// kernel-internal signal sources (fault handlers, timer expiry, child exit).
///
/// # Protocol
///
/// 1. **Queue**: Add the signal to the task's pending signal set. For
/// real-time signals (SIGRTMIN..SIGRTMAX), a `SigInfo` entry is also
/// enqueued on the task's `sigqueue` (one entry per signal instance,
/// not deduplicated). For standard signals (1..31), the signal is
/// represented as a single bit — re-sending a standard signal that is
/// already pending is a no-op.
///
/// 2. **Set TIF_SIGPENDING**: A per-task thread-info flag checked on every
/// return-to-userspace path (syscall return, interrupt return, exception
/// return). When set, the kernel invokes `do_signal()` before resuming
/// userspace execution, which dequeues and delivers pending signals.
///
/// 3. **Wake the task**: Call `try_to_wake_up(task, wake_mask)` to transition
/// the task from a sleeping state to RUNNING if the wake condition matches:
///
/// | Task state | Signal | Wakeup action |
/// |------------|--------|---------------|
/// | `RUNNING` | any | No-op. Task is already on a runqueue; it will check `TIF_SIGPENDING` on next return to userspace. |
/// | `INTERRUPTIBLE` | any | Set state to `RUNNING`, enqueue on runqueue with `OnRqState::Queued`. |
/// | `UNINTERRUPTIBLE` | non-fatal | No-op. Task is in an uninterruptible wait and will check `TIF_SIGPENDING` when the wait completes. |
/// | `KILLABLE` | `SIGKILL` | Set state to `RUNNING`, enqueue on runqueue. `KILLABLE` is `UNINTERRUPTIBLE | WAKEKILL` — only fatal signals break through. |
/// | `KILLABLE` | non-fatal | No-op. Same as `UNINTERRUPTIBLE` for non-fatal signals. |
/// | `STOPPED` | `SIGCONT` | Resume the task: set state to `RUNNING`, enqueue. Also clears `STOPPED` for all tasks in the thread group (POSIX). |
/// | `STOPPED` | `SIGKILL` | Unstop and kill: set state to `RUNNING`, enqueue. |
/// | `STOPPED` | other | Remain stopped. Signal is queued and will be delivered when the task is resumed. |
///
/// # CBS Throttling Interaction
///
/// The EEVDF scheduler's Capacity-Based Scheduling (CBS) may throttle a task
/// that has exhausted its bandwidth allocation. Throttled tasks have
/// `OnRqState::CbsThrottled` and are not eligible for scheduling until budget
/// replenishment. Unlike `OnRqState::Deferred` (which is an EEVDF-internal
/// optimization where the task stays on the tree), `CbsThrottled` tasks are
/// fully dequeued from EEVDF trees. Signal delivery interacts with CBS as
/// follows:
///
/// - **SIGKILL**: Always wakes immediately, bypassing CBS throttle. The task
/// is enqueued with `OnRqState::Queued` regardless of remaining budget.
/// Rationale: SIGKILL must be unblockable and unkillable processes are a
/// system reliability hazard. The one-time budget violation is bounded
/// (the task executes only its exit path).
///
/// - **SIGSTOP**: Always wakes immediately, bypassing CBS throttle (same as
/// SIGKILL). POSIX requires SIGSTOP to be unblockable and uncatchable —
/// the task must transition to `STOPPED` state promptly regardless of
/// scheduler budget. The budget violation is bounded (the task transitions
/// to `STOPPED` and ceases execution immediately; no user code runs).
/// Without this exception, a CBS-throttled task would appear to ignore
/// SIGSTOP until its next budget replenishment — violating POSIX and
/// breaking `kill -STOP` / job control semantics.
///
/// - **All other signals**: Set `TIF_SIGPENDING` only. If the task is
/// CBS-throttled (`OnRqState::CbsThrottled`), it remains throttled until the
/// next budget replenishment tick. On the next `schedule_in()`, the task
/// checks `TIF_SIGPENDING` before returning to userspace and processes
/// the signal at that point. This preserves CBS bandwidth guarantees for
/// non-fatal signals — a flood of signals cannot starve other tasks.
///
/// # Implementation
///
/// ```
/// fn send_signal_to_task(task: &Task, sig: u8, info: SigInfo) {
/// // send_signal_to_task is the LOW-LEVEL primitive — it skips
/// // permission checks, ignore checks, and SIGCONT/stop cancellation.
/// // The ignore check is in the top-level send_signal() function
/// // ([Section 8.5](#signal-handling--sending-a-signal) step 4), which uses the
/// // lock-free handler_cache for the disposition check.
/// //
/// // Acquire the per-sighand queue lock (SIGLOCK, level 40) to
/// // serialize signal queuing against concurrent delivery and
/// // sigprocmask changes. The lock is on SignalHandlers, not Task.
/// let _guard = task.process.sighand.queue_lock.lock();
///
/// if sig.is_realtime() {
/// // RT signal: slab-allocate a SigQueueEntry, append to rt_tail FIFO.
/// task.pending.enqueue_rt(sig, info.clone());
/// } else {
/// // Standard signal: store in standard[signo] (at-most-one-pending).
/// task.pending.enqueue_standard(sig, info.clone());
/// }
///
/// // Step 2: Set TIF_SIGPENDING.
/// task.thread_flags.fetch_or(TIF_SIGPENDING, Release);
///
/// // Step 3: Wake the task.
/// if sig == SIGKILL {
/// // SIGKILL always wakes, even UNINTERRUPTIBLE and KILLABLE tasks.
/// // Bypass CBS throttle.
/// try_to_wake_up(task, TASK_KILLABLE | TASK_INTERRUPTIBLE | TASK_STOPPED,
/// WakeFlags::BYPASS_CBS);
/// } else if sig == SIGSTOP {
/// // SIGSTOP always wakes immediately, bypassing CBS throttle.
/// // POSIX requires SIGSTOP to be unblockable and uncatchable —
/// // the task must transition to STOPPED promptly regardless of
/// // scheduler budget. The budget violation is bounded (the task
/// // transitions to STOPPED and ceases execution immediately).
/// try_to_wake_up(task, TASK_INTERRUPTIBLE | TASK_STOPPED,
/// WakeFlags::BYPASS_CBS);
/// } else if sig == SIGCONT {
/// // SIGCONT resumes stopped tasks (POSIX requirement).
/// try_to_wake_up(task, TASK_INTERRUPTIBLE | TASK_STOPPED, WakeFlags::empty());
/// } else {
/// // Normal signal: wake only INTERRUPTIBLE sleepers.
/// try_to_wake_up(task, TASK_INTERRUPTIBLE, WakeFlags::empty());
/// }
/// }
/// ```
///
/// # Thread-Group Signal Delivery (kill(pid, sig))
///
/// When a signal targets a process (not a specific thread), `kill(pid, sig)`
/// selects one thread in the thread group to receive it, preferring:
/// 1. A thread with `sig` not blocked in its `signal_mask` and in
/// `INTERRUPTIBLE` state (can be woken immediately).
/// 2. Any thread with `sig` not blocked (even if `RUNNING` or
/// `UNINTERRUPTIBLE`).
/// 3. If all threads block `sig`, the signal is queued on the process-wide
/// shared pending set and will be delivered to the first thread that
/// unblocks it (via `sigprocmask`).
pub fn send_signal_to_task(task: &Task, sig: u8, info: SigInfo);
TIF_SIGPENDING check points: The kernel checks this flag at every transition
from kernel mode to user mode:
- Syscall return path: after every syscall, before `sysret`/`eret`/`sret`.
- Interrupt return path: after handling a hardware interrupt that interrupted userspace execution.
- Exception return path: after handling a page fault or other exception from userspace.
If TIF_SIGPENDING is set, the kernel calls do_signal() which dequeues the
highest-priority pending signal (SIGKILL first, then by signal number) and either
invokes the registered signal handler (by setting up a signal frame on the user
stack and redirecting execution to the handler) or executes the default action
(terminate, stop, ignore, core dump).
Signal frame layout by architecture:
The kernel constructs a signal frame on the user stack containing the saved
register state, siginfo_t, and ucontext_t. The signal handler executes in
userspace and returns to the kernel via rt_sigreturn(), which restores the
saved state from the frame. Frame sizes include the siginfo_t (128 bytes on
all architectures) and the architecture-specific ucontext_t with extended
register state. The "base" size includes GPRs + minimal FP state; the "max"
size includes all optional SIMD extensions. Sizes are derived from Linux
sizeof(struct rt_sigframe) on each architecture.
| Architecture | Signal frame contents | Approx. frame size | Stack alignment | Notes |
|---|---|---|---|---|
| x86-64 | siginfo_t (128 bytes), ucontext_t (GPRs rax-r15, rip, rflags, cs/ss, rsp, FPU/SSE state via fpstate), rt_sigreturn trampoline address | ~944 bytes (base without fpstate) + FXSAVE: 512 bytes + XSAVE: up to 2688 bytes with AVX-512 | 16-byte (ABI) | 128-byte red zone below RSP is avoided by the kernel (frame placed at RSP - 128 - frame_size). Linux struct rt_sigframe: info (128) + uc (sigset(8) + stack(24) + mcontext_t (gregs[23]*8=184 + fpstate_ptr(8)) + pad) = ~568 bytes + fpstate appended |
| AArch64 | siginfo_t (128 bytes), ucontext_t (GPRs x0-x30, sp, pc, pstate), FP/SIMD state (FPSIMD: 512 bytes, SVE: variable up to VL*34 bytes), rt_sigreturn trampoline | ~1200 bytes (base) + SVE state | 16-byte (ABI) | No red zone. Frame includes __reserved area for extensible auxiliary records (SVE, PAC, MTE tags) |
| ARMv7 | siginfo_t (128 bytes), sigcontext (GPRs r0-r15, CPSR, fault_address), VFP state (user_vfp struct: 32 double registers + FPSCR), rt_sigreturn trampoline | ~580 bytes | 8-byte (AAPCS) | No red zone. retcode[2] embedded in frame for sigreturn trampoline (mov r7, __NR_rt_sigreturn; svc 0) |
| RISC-V 64 | siginfo_t (128 bytes), ucontext_t (GPRs x0-x31, pc), FP state (f0-f31, fcsr: 264 bytes), rt_sigreturn trampoline | ~680 bytes (base) + vector state if RVV | 16-byte (ABI) | No red zone. Frame placed at SP - frame_size, aligned to 16 bytes |
| PPC32 | siginfo_t (128 bytes), sigcontext (GPRs r0-r31, nip, msr, link, ctr, xer, ccr), FP state (f0-f31, fpscr: 264 bytes), rt_sigreturn trampoline | ~740 bytes | 16-byte (ABI) | No red zone. Signal trampoline stored in a VDSO page; the kernel writes the trampoline address into the link register |
| PPC64LE | siginfo_t (128 bytes), ucontext_t (GPRs r0-r31, nip, msr, link, ctr, xer, ccr, softe), FP state (f0-f31, fpscr), VMX state (vr0-vr31, vscr, vrsave), VSX state | ~1520 bytes | 16-byte (ABI) | No red zone. Uses ELFv2 ABI. TOC pointer (r2) restored from the signal frame on rt_sigreturn |
| s390x | siginfo_t (128 bytes), ucontext_t containing sigcontext (GPRs r0-r15, access registers a0-a15, FPRs f0-f15, PSW), vector registers v0-v31 (if VX facility present: 512 bytes) | ~512 bytes (base) + 512 bytes VX state | 8-byte (z/Arch ABI) | No red zone. PSW saved with user-mode bit set. The ri_cb (Runtime Instrumentation Control Block) is optionally saved if RI is active. Frame placed below the current stack pointer |
| LoongArch64 | siginfo_t (128 bytes), ucontext_t containing sigcontext (GPRs r0-r31, PC, flags), FP state (FPRs f0-f31, FCSR, FCC condition codes: 264 bytes), LSX state (128-bit v0-v31: 512 bytes, if LSX present), LASX state (256-bit x0-x31: 1024 bytes, if LASX present) | ~680 bytes (base) + SIMD state | 16-byte (ABI) | No red zone. sc_flags field indicates which optional state blocks are present. Frame includes rt_sigreturn trampoline code for the VDSO |
Cross-references:
- Task state flags and wakeup semantics: Section 8.1 (this chapter)
- CBS and EEVDF scheduler: Section 7.1
- OOM killer signal delivery: Section 4.2 (OOM Killer subsection)
- Signal handler table (SignalHandlers): Section 8.1 (Process::sighand)
- Cleanup tokens (run after SIGKILL exit): Section 8.1
8.2 Process Lifecycle Teardown¶
This section specifies the complete do_exit() teardown sequence and the ELF core dump
generator that runs within it. Together these define the precise ordering of resource
release, accounting, notification, and state capture that occurs when a process or thread
terminates. The sequence is invoked by exit(2), exit_group(2), fatal signals, and
the OOM killer.
Cross-references:
- Task and Process structs: Section 8.1
- Exit cleanup tokens: Section 8.1
- Signal disposition and default actions: Section 8.5
- Resource limits (RLIMIT_CORE): Section 8.7
- Virtual memory manager: Section 4.8
- Cgroup controllers: Section 17.2
- Namespace architecture: Section 17.1
- Process groups and orphan detection: Section 8.6
8.2.1 do_exit() — Full Teardown Sequence¶
Preconditions for do_exit():
- Runs in task context (the dying task's own kernel stack).
- Never called from interrupt or softirq context.
- The OOM killer does NOT call do_exit directly. It sends SIGKILL, which causes the task to call do_exit on its next return-to-userspace check.
- For exit_group(): the first thread calls do_exit, which sends SIGKILL to all sibling threads. Each sibling calls do_exit independently.
- The task MAY be in any scheduling state (RUNNING, INTERRUPTIBLE, etc.) when do_exit is entered. Step 1 sets PF_EXITING atomically.
do_exit(exit_code: i32) is the single exit path for every dying task. Whether a process
calls exit(2), a thread calls pthread_exit(3), the kernel delivers a fatal signal, or
the OOM killer selects a victim, all paths converge here. The exit_code parameter is a
pre-encoded Linux wait status word. Callers are responsible for encoding: exit(2) and
exit_group(2) encode as (status & 0xff) << 8; fatal signal paths encode as
signum | (core_dumped ? 0x80 : 0). do_exit() stores the value verbatim into
task.exit_code without further transformation. The steps below execute in strict
order; each step's preconditions depend on the previous steps having completed.
8.2.1.1.1 Step 1: Set PF_EXITING¶
Setting PF_EXITING has three effects:
- Signal delivery is naturally inhibited. The task will never return to
  userspace (it proceeds through teardown and eventually schedules away), so it
  never reaches the return-to-user signal delivery checkpoint. `PF_EXITING` is
  used by `complete_signal()` in `wants_signal()` to skip this thread when
  choosing which thread receives a process-directed signal — this is a thread
  selection skip, NOT a signal drop. The signal is still queued to the
  process-wide pending set and can be delivered to other non-exiting threads.
  `send_signal()` does NOT check `PF_EXITING` — Linux does not have such a
  check and neither does UmkaOS. See Section 8.5 for the precise
  `send_signal()` algorithm.
- Futex waiters are marked as dying. `futex_wake()` skips tasks with
  `PF_EXITING` when selecting wake targets, preventing a use-after-free if the
  waker runs after the waiter's address space is gone.
- Scheduler accounting is finalized. The scheduler recognizes `PF_EXITING` and
  stops charging CPU time to this task after the current tick.
8.2.1.1.2 Step 1b: Ptrace Exit¶
If this task is ptraced, notify the tracer via PTRACE_EVENT_EXIT and enter
traced-zombie state (the tracer can inspect the dying task one last time before
it fully exits). If this task is a tracer, detach from all tracees and restore
their original parents. See Section 20.4.
Scenario 1 — this task is a tracee: If PTRACE_O_TRACEEXIT is set in
PtraceState.options, ptrace_event(PTRACE_EVENT_EXIT, exit_code) is called. The
task enters TASK_TRACED and blocks until the tracer resumes it (via PTRACE_CONT
or PTRACE_DETACH). This gives the tracer one final opportunity to inspect registers,
memory, and the exit code before teardown proceeds. If PTRACE_O_TRACEEXIT is not
set, the ptrace_event() call is a no-op and execution continues immediately.
Scenario 2 — this task is a tracer: For every task in ptraced_children:
1. If PT_EXITKILL is set on the tracee, send SIGKILL.
2. Restore tracee.process.parent to tracee.ptrace.saved_parent.
3. If the tracee is in TASK_TRACED, wake it (it resumes execution or enters
group stop if one is active).
4. If the tracee is a zombie, reparent it to saved_parent and notify the
real parent via SIGCHLD so it can be reaped.
5. Free the tracee's PtraceState (set tracee.ptrace = None).
Ordering rationale: Ptrace exit must happen early — before timer cancellation,
mm teardown, or fd close — because PTRACE_EVENT_EXIT blocks the dying task in
TASK_TRACED while the tracer inspects it. The task's resources (address space,
file descriptors, registers) must still be intact during this inspection. Linux
calls ptrace_event(PTRACE_EVENT_EXIT) before exit_signals() and calls
exit_ptrace() (tracer cleanup) inside exit_notify() near the end, but UmkaOS
consolidates both into a single early step for clarity: the tracee notification
blocks (so nothing below runs until the tracer releases it), and the tracer
cleanup only modifies other tasks' parent pointers (no dependency on the dying
task's later teardown steps).
8.2.1.1.3 Step 2: Timer Cancellation Gate¶
Timer cancellation is NOT performed here. It is deferred to the last-thread gate
(between Step 3 and Step 4, see "Last-thread gate" below). Individual thread exits
do not cancel process-wide timers, matching Linux behavior (exit_itimers() is
called only when group_dead is true in Linux's do_exit()). A multi-threaded
program where thread A calls pthread_exit() while threads B-N continue running
must NOT have its timers cancelled by thread A's exit. Proceed to Step 3.
8.2.1.1.4 Step 3: Thread Group Exit (exit_group)¶
If the exit was initiated by exit_group(2) (the normal exit(3) path in glibc, which
calls exit_group rather than exit), or by a fatal signal whose default action applies
to the entire process:
// Determine whether this is a group exit (exit_group(2) or fatal signal).
// The group exit code is set by the first thread to call exit_group() or
// to receive a fatal signal.
//
// Linux reference: kernel/exit.c do_group_exit() —
// if (sig->flags & SIGNAL_GROUP_EXIT)
// exit_code = sig->group_exit_code; // join existing group exit
// else {
// sig->group_exit_code = exit_code; // first thread: SET code
// sig->flags = SIGNAL_GROUP_EXIT;
// zap_other_threads(current); // first thread: KILL siblings
// }
let mut exit_code_guard = task.process.thread_group.exit_code.lock();
let already_exiting = exit_code_guard.is_some();
if already_exiting {
// Another thread already initiated a group exit. Use its exit code.
// No need to zap siblings — the first thread already did that.
drop(exit_code_guard);
} else {
// FIRST thread to initiate group exit. Set the code and zap siblings.
*exit_code_guard = Some(exit_code);
drop(exit_code_guard);
for sibling in task.process.thread_group.iter() {
if sibling.tid != task.tid && !sibling.flags.contains(PF_EXITING) {
// Use a direct wake mechanism that sets SIGKILL in the pending
// mask and wakes the thread. This bypasses send_signal() entirely,
// avoiding the PF_EXITING check (which blocks signals via the
// normal delivery path). Matches Linux's zap_other_threads().
signal_wake_up(sibling, /* fatal */ true);
}
}
}
Each sibling thread is woken via signal_wake_up(sibling, true), which sets SIGKILL in
the pending mask and wakes the thread directly, bypassing send_signal() entirely —
matching Linux's zap_other_threads() implementation. The sibling threads each enter
do_exit() independently and execute the same teardown sequence. Every exiting thread
performs per-thread cleanup (Steps 1, 1b, 3, 3a–3g). The last thread to exit
additionally performs the process-level cleanup in Steps 4–13; all other threads skip
Steps 4–13 and proceed directly to Step 14 after decrementing the thread group's live
count.
signal_wake_up(sibling, true) behavior: The fatal = true parameter wakes tasks
in ANY sleep state including TASK_UNINTERRUPTIBLE (not just TASK_INTERRUPTIBLE).
This matches Linux's signal_wake_up_state(t, TASK_WAKEKILL) which sets
TIF_SIGPENDING and wakes the task regardless of its sleep state. Without this,
a thread blocked in an uninterruptible I/O operation (e.g., disk write, NFS RPC)
would never be woken by exit_group() and the process would hang forever. A
synchronize_group_exit() step is not separately specified — the atomic
thread_group.count decrement (below) serializes all thread exits, and any
multi-thread core dump serialization is implicit: non-last threads set TASK_DEAD
and schedule away before reaching Step 4, so only the last thread can initiate
a core dump.
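The wake semantics can be modeled in a few lines. This is a userspace sketch; `ModelTask` and the state constants are simplified stand-ins for the `TaskState` bitflags, and the pending mask is a plain u64 signal bitmap:

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering::*};

const SIGKILL: u32 = 9;
const TASK_RUNNING: u32 = 0;
const TASK_INTERRUPTIBLE: u32 = 1;
const TASK_UNINTERRUPTIBLE: u32 = 2;

pub struct ModelTask {
    pub state: AtomicU32,
    pub pending: AtomicU64, // bit N-1 set => signal N pending
}

/// Simplified model of signal_wake_up(): set SIGKILL pending directly
/// (bypassing the send_signal() path and its PF_EXITING check), then
/// wake the task. With fatal = true, even TASK_UNINTERRUPTIBLE sleepers
/// wake — the property that prevents the exit_group() hang described above.
pub fn signal_wake_up(t: &ModelTask, fatal: bool) {
    t.pending.fetch_or(1 << (SIGKILL - 1), SeqCst);
    let s = t.state.load(SeqCst);
    let wakeable = s == TASK_INTERRUPTIBLE || (fatal && s == TASK_UNINTERRUPTIBLE);
    if wakeable {
        t.state.store(TASK_RUNNING, SeqCst);
    }
}
```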
The "last thread" check is an atomic decrement:
Execution order: Steps 3a–3g (documented BELOW in this file for readability) execute BEFORE the thread-group gate below. In the actual code, the per-thread cleanup functions are called first; the thread_group.count decrement then determines whether the process-level Steps 4–13 also execute. The document places the gate first because it determines the control-flow branch, and then documents each per-thread cleanup step in detail.
// INVARIANT: Steps 3a–3g (per-thread cleanup, documented below) have
// already executed by this point. Every exiting thread — including
// non-last threads — performs robust futex cleanup, clear_child_tid,
// vfork completion, perf events, io_uring, key retention, and RT mutex
// cleanup BEFORE reaching this gate. The thread-group count decrement
// below ONLY gates process-level Steps 4–13.
let remaining = task.process.thread_group.count.fetch_sub(1, AcqRel);
if remaining > 1 {
// Not the last thread. Per-thread cleanup (Steps 3a–3g) has already
// been performed above. Mark this thread as DEAD and schedule away.
// Non-last threads are always auto-reaped (they have no parent to
// notify — only the thread group leader participates in parent
// notification). Without the DEAD state transition,
// finish_task_switch() would not free the kernel stack, leaking
// both the stack and the Task struct on every thread exit.
// Matches Linux: non-leader threads take the autoreap path in
// exit_notify() where tsk->exit_state = EXIT_DEAD.
//
// Note: exit_code for non-last threads is never read by wait4()
// (only the thread group leader's exit_code is reported to the
// parent). The store is retained for debugging/diagnostic purposes.
task.exit_code.store(exit_code, Release);
task.state.store(TaskState::DEAD.bits(), Release);
release_task(&task);
// Skip process-level Steps 4–13, go to Step 14 (schedule away).
goto_schedule_away();
}
// Last thread — cancel process-wide timers (gated by group_dead).
// This is here (not in Step 2) because individual thread exits via
// pthread_exit() must NOT cancel process-wide timers. Matches Linux:
// exit_itimers() is called only when group_dead is true.
cancel_posix_timers(&task.process);
cancel_itimer_real(&task.process);
cancel_itimer_virtual(&task.process);
cancel_itimer_prof(&task.process);
// Continue with process-level teardown.
8.2.1.1.5 Step 3a: Robust Futex Cleanup¶
Walk the dying task's robust futex list to release held robust mutexes. This MUST happen before address space teardown (Step 4b) because the robust list head and entries reside in user memory. Steps 3a–3g are per-thread cleanup that EVERY exiting thread performs, regardless of whether it is the last thread in the group. The thread-group gate (Step 3) determines only whether Steps 4–13 (process-level teardown) execute.
Note on task.futex_waiter (kernel-internal futex hash bucket node): The Task struct's futex_waiter field (Section 8.1) documents an explicit unlink protocol for tasks linked into a futex hash bucket. However, do_exit() does NOT contain a separate step to unlink futex_waiter because a task blocked in futex_wait() is always woken by SIGKILL before reaching do_exit(). The woken task calls futex_unqueue() in its own execution context (within the __futex_wait() return path) before proceeding to do_exit(). By the time do_exit() runs, futex_waiter.is_linked() is always false. This matches Linux's behavior: exit_mm_release() calls futex_exit_release(), which handles the robust futex list walk and PI state cleanup, but the kernel-internal wait queue node is self-unqueued by the woken task. The futex_waiter cleanup protocol in the Task struct documentation describes the safety invariant (what must be true), not a separate do_exit() step (the invariant is maintained by futex_unqueue() on the signal receipt path).
/// Walk the task's robust_list (set via set_robust_list(2)) and for each
/// held futex word, atomically set FUTEX_OWNER_DIED (bit 30) and wake
/// one waiter. This prevents deadlock when a task dies while holding a
/// robust mutex — waiters observe FUTEX_OWNER_DIED and can recover.
fn exit_robust_list(task: &Task) {
let head_ptr = task.robust_list;
if head_ptr.is_null() {
return;
}
// Read the robust_list_head from user memory. May fault if the page
// is swapped out — use copy_from_user semantics (fault-in on demand).
let Ok(head) = copy_from_user::<RobustListHead>(head_ptr) else {
return; // Corrupted pointer — skip gracefully
};
let mut entry = head.list.next;
let mut count = 0u32;
const MAX_ROBUST_ENTRIES: u32 = 2048; // Prevent infinite loop on corrupted list
while !entry.is_null() && entry as usize != head_ptr as usize && count < MAX_ROBUST_ENTRIES {
let Ok(futex_entry) = copy_from_user::<RobustList>(entry) else {
break; // Corrupted entry — stop walking
};
// Compute the futex word address from the entry + futex_offset
let futex_addr = (entry as usize).wrapping_add(head.futex_offset as usize);
// Atomically set FUTEX_OWNER_DIED and clear the TID
handle_futex_death(futex_addr, task.tid);
entry = futex_entry.next;
count += 1;
}
// Also handle the list_op_pending entry if one was in progress
if !head.list_op_pending.is_null() {
let pending_addr = (head.list_op_pending as usize)
.wrapping_add(head.futex_offset as usize);
handle_futex_death(pending_addr, task.tid);
}
}
handle_futex_death() performs an atomic compare-and-swap on the futex word:
if the current owner TID matches the dying task, it sets FUTEX_OWNER_DIED (bit 30)
and calls futex_wake(addr, 1) to wake one waiter. If the CAS fails (another thread
already took ownership), no action is needed.
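The CAS can be modeled on an AtomicU32. This sketch uses the Linux futex word layout — `FUTEX_TID_MASK` 0x3fffffff, `FUTEX_OWNER_DIED` bit 30, `FUTEX_WAITERS` bit 31; whether UmkaOS preserves the waiters bit in exactly this way is an assumption:

```rust
use std::sync::atomic::{AtomicU32, Ordering::*};

const FUTEX_TID_MASK: u32 = 0x3fff_ffff;
const FUTEX_OWNER_DIED: u32 = 0x4000_0000; // bit 30
const FUTEX_WAITERS: u32 = 0x8000_0000;    // bit 31

/// Model of handle_futex_death(): if the dying task still owns the futex
/// word, CAS in FUTEX_OWNER_DIED (clearing the TID, preserving WAITERS).
/// Returns true if the CAS landed, i.e. the caller should futex_wake(addr, 1).
pub fn handle_futex_death(word: &AtomicU32, dying_tid: u32) -> bool {
    let mut val = word.load(SeqCst);
    loop {
        if val & FUTEX_TID_MASK != dying_tid {
            return false; // another thread already took ownership — nothing to do
        }
        // Clear the TID, preserve the waiters bit, mark the owner dead.
        let new = (val & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
        match word.compare_exchange(val, new, SeqCst, SeqCst) {
            Ok(_) => return true,  // caller wakes one waiter
            Err(cur) => val = cur, // raced with a concurrent CAS; re-check
        }
    }
}
```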
8.2.1.1.6 Step 3b: Clear Child TID¶
If task.clear_child_tid is set (non-null), write 0 to the userspace address and call
futex_wake(clear_child_tid, 1) to wake one waiter. This implements the
CLONE_CHILD_CLEARTID contract used by glibc's pthread_join().
/// Clear the child TID location and wake one waiter.
/// This is the kernel side of the CLONE_CHILD_CLEARTID contract:
/// glibc sets clear_child_tid via clone(2) so that pthread_join()
/// can futex_wait() on the TID address and be woken when the thread exits.
fn exit_clear_child_tid(task: &Task) {
let tid_ptr = task.clear_child_tid;
if tid_ptr.is_null() {
return;
}
// Write 0 to the userspace TID location. This signals that the
// thread has exited. The write uses put_user semantics (may fault
// if the page is swapped — fault-in on demand).
let _ = put_user::<i32>(0, tid_ptr);
// Wake one waiter blocked in futex_wait() on this address.
// This is the wakeup that unblocks pthread_join().
futex_wake(tid_ptr as *const AtomicU32, 1);
}
This MUST happen before address space teardown (Step 4) because clear_child_tid points
to user memory. The ordering relative to Step 3a (robust futex cleanup) does not matter —
both operate on independent user memory locations.
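The two halves of the contract can be modeled with a shared atomic. This is a userspace sketch — the `futex_wait()`/`futex_wake()` pair is replaced by a polling loop purely for illustration:

```rust
use std::sync::atomic::{AtomicI32, Ordering::*};
use std::thread;

/// Kernel side of CLONE_CHILD_CLEARTID (modeled): write 0 to the TID
/// word to signal "thread has exited". In the kernel this store is
/// followed by futex_wake(tid_ptr, 1); here the store alone suffices.
pub fn exit_clear_child_tid(tid_word: &AtomicI32) {
    tid_word.store(0, SeqCst);
}

/// Userspace side of pthread_join (modeled): wait until the TID word
/// becomes 0. glibc blocks in futex_wait() instead of spinning.
pub fn join_model(tid_word: &AtomicI32) {
    while tid_word.load(SeqCst) != 0 {
        thread::yield_now();
    }
}
```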
8.2.1.1.7 Step 3b-post: Clear rseq Area¶
// Clear rseq state BEFORE address space teardown. After Step 4 drops the mm,
// the context switch code would attempt to write rseq->cpu_id to a user-space
// address in a now-freed address space (use-after-free). Clearing the pointer
// prevents this. The rseq_area is a userspace pointer that becomes dangling
// after mm teardown.
// Interior mutability: rseq_area is AtomicPtr<u8>, rseq_len is AtomicU32 —
// allowing mutation through &Task without &mut.
task.rseq_area.store(core::ptr::null_mut(), Release);
task.rseq_len.store(0, Release);
This MUST run before mm teardown (Step 4) and before any preemption point where
the context switch code might write to rseq->cpu_id. If the task is preempted
during Steps 5–14 after Step 4 drops the mm, the context switch code would
dereference the stale rseq_area pointer — a use-after-free. Clearing it here
makes the write a safe no-op (null check in the context switch path).
8.2.1.1.8 Step 3b-post-2: Clear VFS Ring Pointer¶
// Clear current_vfs_ring BEFORE the Task struct is recycled by the slab.
// A task killed mid-VFS-operation (e.g., SIGKILL during a blocking
// read/write on a Tier 1 VFS ring) will have current_vfs_ring pointing
// to a valid ring. complete_slot() never runs because the task is dying.
// Without clearing, slab recycling reuses this Task struct for a new
// fork, and the stale pointer persists. If an unrelated VFS driver crash
// triggers sigkill_stuck_producers(), the stale pointer may fall within
// the crashed ring's [ring_base, ring_end) range, causing false SIGKILL
// to the innocent new task.
//
// Release ordering pairs with the Relaxed load in sigkill_stuck_producers()
// — the downstream Acquire on ring_set.state provides the full fence.
// See Task.current_vfs_ring field doc for the lifecycle invariant.
task.current_vfs_ring.store(core::ptr::null_mut(), Release);
This MUST run before the task reaches the zombie state (Step 14). Ordering relative to rseq clearing (Step 3b-post) and vfork completion (Step 3c) does not matter — they operate on independent fields. Placed here for proximity to the rseq clearing because the rationale is identical: both clear per-task pointers that become stale after the task's operational context is invalidated.
8.2.1.1.9 Step 3c: Signal vfork Completion¶
If this task is a vfork() child that is exiting without having called exec(),
the blocked parent must be woken. This MUST run before mm teardown (Step 4) because
the CompletionEvent lives on the parent's kernel stack, and after the parent is
woken it may unwind its stack frame.
/// Signal vfork completion to the blocked parent (if this is a vfork child).
/// Matches Linux's `mm_release()` called from `exit_mm()`.
fn complete_vfork_done(task: &Task) {
// AtomicPtr::swap(null, AcqRel) atomically loads and clears the pointer.
// Interior mutability: task.vfork_done is AtomicPtr<CompletionEvent>,
// allowing mutation through &Task without &mut.
let done = task.vfork_done.swap(core::ptr::null_mut(), AcqRel);
if !done.is_null() {
// SAFETY: The CompletionEvent lives on the parent's kernel stack.
// Check the AtomicBool guard before dereferencing — the parent may
// have been killed (SIGKILL) while blocked. If the parent was killed,
// its do_exit() path stores guard=false (Release) before unwinding
// the stack frame. The Acquire load here pairs with that Release,
// ensuring we observe the guard write before any stack teardown.
//
// The guard check is ALWAYS required — even in the do_exit path.
// The assertion that "the parent is still blocked" is not provable
// without checking the guard: the parent may have been SIGKILLed
// concurrently. The guard is the only safe synchronization mechanism.
if unsafe { (*done).guard.load(Acquire) } {
unsafe { (*done).event.complete(); }
}
// If guard is false: parent was killed, CompletionEvent is dangling.
// Silently drop — the parent is already exiting.
}
}
Ordering rationale: Linux calls mm_release() → complete_vfork_done() inside
exit_mm(), before the mm is dropped. The vfork parent is woken and can resume
its do_fork() return path. If we signaled after mm teardown, the parent might
resume and access the (now freed) shared address space.
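The guard protocol can be modeled directly with atomics. This is a userspace sketch; `CompletionModel` stands in for the kernel `CompletionEvent`, and the test-visible `completed` flag stands in for `event.complete()`:

```rust
use std::sync::atomic::{AtomicBool, AtomicPtr, Ordering::*};

/// Model of the vfork CompletionEvent living on the parent's stack.
pub struct CompletionModel {
    pub guard: AtomicBool,     // false once the parent begins unwinding
    pub completed: AtomicBool, // stands in for event.complete()
}

/// Child-side half of complete_vfork_done(): swap out the pointer, then
/// complete only if the parent's guard is still set. Returns true if the
/// completion was actually signalled.
pub fn complete_vfork_done_model(slot: &AtomicPtr<CompletionModel>) -> bool {
    let done = slot.swap(std::ptr::null_mut(), AcqRel);
    if done.is_null() {
        return false; // not a vfork child (or already completed)
    }
    // SAFETY (model only): the pointee outlives this call in the test
    // harness; in the kernel this dereference is sound only because the
    // guard protocol synchronizes with the parent's stack unwind.
    let done = unsafe { &*done };
    if done.guard.load(Acquire) {
        done.completed.store(true, Release);
        true
    } else {
        false // parent was killed; event is (logically) dangling — drop silently
    }
}
```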
8.2.1.1.10 Step 3d: Perf Event Cleanup¶
Stop all perf events owned by this task (thread), release hardware PMU counters, and
mark orphaned events as dead. Events whose fds are held by external processes (via
pidfd_getfd or SCM_RIGHTS) transition to PerfEventState::Dead and remain
readable (returning the final accumulated count) but no longer collect samples.
See Section 20.8.
This is per-thread cleanup — every exiting thread runs it, because perf_events is
a per-Task field (each thread has its own perf event list). The perf event fd itself
resides in the shared fd table and is closed later in Step 5 (last-thread only), which
triggers perf_event_release() and frees the event if no external references remain.
Ordering constraint: Must run BEFORE close_files() (Step 5) because perf events
may hold references to the task's mm (for userspace sampling buffers). Must also run
BEFORE mm teardown (Step 4b) so that the sampling buffer orphaning can safely mark the
ring buffer without racing against mm teardown. In Linux, perf_event_exit_task() runs
before exit_mm() in do_exit().
8.2.1.1.11 Step 3e: io_uring Cancellation¶
Cancel all in-flight io_uring operations submitted by this thread and release fixed
buffer/file registrations. After this step, all io_uring instances this thread has
interacted with have no in-flight operations from this thread. The io_uring fd itself
is closed later in Step 5 (last-thread only), which triggers final ring teardown via
IoRingCtx::drop(). See Section 19.3.
This is per-thread cleanup — every exiting thread runs it, because each thread tracks
its own io_uring submissions independently (via Task.io_uring_tctx). In a
multi-threaded process, different threads may submit to the same io_uring ring. Each
thread cancels only the operations it submitted.
Ordering constraint: Must run BEFORE close_files() (Step 5) to prevent deadlock.
io_uring operations may hold Arc<File> references to files in the shared fd table. If
close_files() runs first, the close blocks waiting for io_uring to release its
reference — deadlock. In Linux, io_uring_task_cancel() runs before exit_files() in
do_exit().
Linux ordering divergence (intentional): In Linux, io_uring_files_cancel() runs
BEFORE exit_signals() (which sets PF_EXITING). In UmkaOS, Step 1 sets PF_EXITING
first, then Step 3e runs io_uring cancellation. This is a deliberate improvement:
PF_EXITING prevents new io_uring submissions from racing with the cancellation pass.
Linux achieves the same effect via a separate PF_IO_WORKER check. The UmkaOS ordering
is strictly safer — no new userspace-initiated io_uring operations can be submitted
between Step 1 and Step 3e because PF_EXITING causes io_uring_enter() to return
-ESRCH immediately.
8.2.1.1.12 Step 3f: Key Retention Cleanup¶
Release the task's key references: thread keyring, session keyring link, and process keyring link. Per-thread keyrings are freed immediately. Session and process keyrings have their reference counts decremented; they are freed when the last reference drops (may outlive this task if shared with other threads or inherited by children). See Section 10.2.
This is per-thread cleanup — every exiting thread runs it, because each thread
has its own thread keyring. Session and process keyrings are shared (via
CLONE_THREAD) and persist until the last thread drops its reference.
Ordering constraint: Must run BEFORE close_files() (Step 5) because
key operations may reference open file descriptors (e.g., keys backed by
file-based storage). Must also run BEFORE namespace release (Step 11)
because keyrings may reference the task's user namespace for permission
checks during cleanup. In Linux, exit_keys() runs in do_exit() before
exit_files().
8.2.1.1.13 Step 3g: Kernel RtMutex/PI Cleanup¶
Walk task.held_mutexes and unlock each held kernel RtMutex. For each mutex:
- Acquire mutex.lock.
- Store null into mutex.owner (AtomicPtr::store(core::ptr::null_mut(), Release)).
- If mutex.waiters is non-empty:
  a. Dequeue the highest-priority waiter.
  b. Set mutex.owner to the waiter's task pointer.
  c. Wake the waiter via try_to_wake_up().
  d. Propagate the PI chain: the new owner may need its priority adjusted.
- Release mutex.lock.
- Clear task.blocked_on_rt_mutex = None.
- Clear task.rt_waiter = None.
This must run BEFORE MM teardown (Step 4) because PI chain traversal
may reference other tasks' memory. It must run AFTER Step 1 (PF_EXITING)
so that new lock attempts by the dying task are prevented.
Without this step, if a task is killed (SIGKILL) while holding an RtMutex,
RtMutex.owner becomes a dangling AtomicPtr<Task> after zombie reaping.
Any thread trying to lock the same mutex would dereference the dangling
pointer in pi_propagate() — a use-after-free bug.
See Section 8.4 for the PI protocol.
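The per-mutex handoff can be modeled with a priority heap. This is a sketch; `RtMutexModel` collapses the mutex.lock/owner/waiters fields into plain data and omits the PI-chain propagation step:

```rust
use std::collections::BinaryHeap;

/// Model of one held RtMutex at exit time: an owner tid and a max-heap
/// of (priority, tid) waiters (higher priority value = more urgent).
pub struct RtMutexModel {
    pub owner: Option<u32>,
    pub waiters: BinaryHeap<(u8, u32)>,
}

/// Step 3g for one mutex (modeled): clear the dying owner and hand the
/// lock to the highest-priority waiter. Returns the tid the caller
/// should wake via try_to_wake_up(), if any.
pub fn exit_unlock(m: &mut RtMutexModel, dying_tid: u32) -> Option<u32> {
    assert_eq!(m.owner, Some(dying_tid)); // only held mutexes are walked
    m.owner = None;                       // the store-null step
    if let Some((_prio, next)) = m.waiters.pop() {
        m.owner = Some(next);             // hand off ownership
        return Some(next);                // caller wakes this task
    }
    None
}
```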
Scheduler upcall state: The per-task upcall fields (in_upcall, upcall_pending,
upcall_stack_base, upcall_stack_size, upcall_frame_nonce) are NOT explicitly
cleared during do_exit(). This is safe because PF_EXITING (set in Step 1) prevents
new upcall delivery — the upcall dispatch path checks PF_EXITING and skips delivery.
The fields are reclaimed when the Task struct is freed by release_task(). Note
that exec() DOES clear upcall state (step 5g in
Section 8.1) because the old binary's
upcall handler is no longer mapped. In do_exit(), the address space is torn down
(Step 4) before any code could observe the stale upcall fields, so explicit clearing
is unnecessary.
Per-task capability and LSM cleanup note:
The per-task CapHandle (capability restriction mask) is a u64 index into the
process-wide CapSpace. It requires NO explicit cleanup during do_exit() — the
CapHandle value is a lightweight integer index, not an owned resource. The
process-wide CapSpace is cleaned up when the last thread exits (Step 11,
process-level cleanup). Individual CapEntry objects in the CapSpace have
their own refcounts and are freed via normal CapEntry::drop() when all references
(from this process and any inherited children) are released.
See Section 9.1.
The per-task LSM security blob (allocated in do_fork() step 6c via
security_task_alloc()) is freed implicitly when the Task struct is
deallocated by release_task() → slab free. The LSM blob is embedded within
the Task slab allocation (contiguous memory, single slab slot) — there is no
separate heap allocation for the blob. Each LSM module's task_free hook is
called by release_task() before the slab slot is returned to the cache,
allowing the LSM to perform any module-specific cleanup (e.g., SELinux
credential cache invalidation, AppArmor context unref).
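The hook dispatch can be sketched as follows. This is a hypothetical shape — the `LsmModule` trait and `security_task_free()` registry are illustrative assumptions, not the UmkaOS API:

```rust
/// Hypothetical LSM module trait — the hook name mirrors the task_free
/// hook described above; the registry shape is an assumption.
pub trait LsmModule {
    fn task_free(&self, task_blob: &mut [u8]);
}

/// Sketch of release_task()'s hook dispatch: call every registered
/// module's task_free on the embedded blob BEFORE the slab slot is
/// returned to the cache.
pub fn security_task_free(modules: &[&dyn LsmModule], task_blob: &mut [u8]) {
    for m in modules {
        m.task_free(task_blob);
    }
    // After this returns, the Task slab slot (blob included) may be reused.
}
```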
8.2.1.1.14 Step 4: Sync MM — Address Space Teardown¶
Precondition: mm.users.fetch_sub(1, AcqRel) == 1 — this is the last user of the
mm_struct. The mm.users counter (equivalent to Linux mm_users) tracks all entities
using this address space: threads in the thread group, vfork() parents sharing the mm,
and kernel threads using kthread_use_mm(). This is distinct from
thread_group.count (Step 3), which gates process-level cleanup. In the common case
(single-threaded process, no vfork, no use_mm), both conditions are true simultaneously,
but they are logically distinct: a vforked child shares the parent's mm without being in
the parent's thread group, and kthread_use_mm() increments mm_users without
incrementing thread_group.count.
The Arc<MmStruct> refcount serves as the equivalent of Linux's mm_count — it gates
deallocation of the mm_struct itself, which may outlive the last user (e.g., referenced
by /proc/[pid]/mem or by lazy TLB CPUs).
8.2.1.1.14.1 Step 4a: Core Dump (if signal-triggered)¶
If the exit was caused by a signal whose default action is CORE (see Section 8.2 below), the core dump is written before any address space teardown. The address space is still fully intact (VMAs, page tables, and physical pages are all present), which is required because the dump must read the process's memory contents.
/// Returns true if the exit code encodes a signal whose default action is CORE.
/// Extracts the signal number from bits [6:0] and checks against the core-dump set.
fn exit_is_core_signal(exit_code: i32) -> bool {
let sig = (exit_code & 0x7f) as u8;
matches!(sig, SIGQUIT | SIGILL | SIGABRT | SIGFPE | SIGSEGV
| SIGBUS | SIGTRAP | SIGSYS | SIGXCPU | SIGXFSZ)
}
if exit_is_core_signal(exit_code) {
do_coredump(&task, &mm, exit_code);
}
The core dump runs with the mm still live but with PF_EXITING set, so no user code
can execute concurrently. Other threads have already been killed (Step 3) and are either
dead or in their own do_exit() past Step 1.
Ordering rationale: Linux do_exit() calls synchronize_group_exit() (which
coordinates coredump) before exit_mm() (which tears down the address space). The
coredump must complete before any TLB teardown or VMA unmapping because the dump reads
the process's entire virtual memory contents. Reversing this order causes the dump to
read freed memory (use-after-free).
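The low-7-bits encoding checked by exit_is_core_signal() follows the traditional Unix wait-status convention. A sketch of the corresponding userspace decoding helpers, assuming UmkaOS keeps the Linux/glibc layout (signal in bits [6:0], core flag in bit 7, exit status in bits [15:8]):

```rust
/// Wait-status helpers following the traditional Unix/Linux encoding
/// (an assumption for UmkaOS; mirrors the glibc W* macros):
///   bits [6:0]  terminating signal (0 => normal exit, 0x7f => stopped)
///   bit  7      core-dumped flag
///   bits [15:8] exit status for normal exits
pub fn wifsignaled(status: i32) -> bool {
    // The signed shift rejects both 0 (normal exit) and 0x7f (stopped),
    // exactly as glibc's WIFSIGNALED does.
    (((status & 0x7f) + 1) as i8 >> 1) > 0
}
pub fn wtermsig(status: i32) -> i32 { status & 0x7f }
pub fn wcoredump(status: i32) -> bool { status & 0x80 != 0 }
pub fn wexitstatus(status: i32) -> i32 { (status >> 8) & 0xff }
```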
8.2.1.1.14.2 Step 4b: Batched TLB Teardown and VMA Unmap¶
// Acquire mmap_lock for write. Without this, /proc/[pid]/maps readers
// (which acquire mmap_lock.read()) could race against VMA tree teardown.
// In Linux, exit_mmap() acquires mmap_write_lock(mm) before unmap_vmas().
let _mmap_guard = mm.mmap_lock.write();
// MMF_UNSTABLE check: if the OOM reaper has set MMF_UNSTABLE on this mm
// ([Section 4.2](04-memory.md#physical-memory-allocator--oom-reaper)), the reaper is already
// walking the page tables via CAS-based PTE unmapping. Skip unmap_vmas()
// to avoid redundant work (page table walks, TLB shootdown IPIs, RSS
// counter updates on already-freed pages). The reaper's CAS makes double-
// unmap safe at the PTE level, but skipping saves significant CPU time.
if !mm.flags.load(Acquire).contains(MmFlags::MMF_UNSTABLE) {
let mut tlb = TlbGatherMmu::new_fullmm(mm);
unmap_vmas(&mut tlb, mm.vma_tree.iter(), 0, usize::MAX);
free_pgtables(&mut tlb, mm.vma_tree.iter(), 0, usize::MAX);
tlb.finish();
}
// If MMF_UNSTABLE is set, the reaper handles page table teardown.
// We still need to proceed with mm_struct reference release (Step 4c)
// and TIF_MEMDIE cleanup below.
Linux-style batched TLB teardown replaces a naive flush_tlb_all() (which flushes
ALL TLB entries on ALL CPUs). The TlbGatherMmu with fullmm = true gathers pages
freed during VMA unmapping and page-table release, then performs a single efficient
TLB invalidation at the end. The fullmm flag tells architecture-specific code that
the entire address space is being torn down, enabling optimizations:
- x86-64: Skips per-page INVLPG; a single CR3 switch on each CPU where this
mm_struct is loaded implicitly flushes all TLB entries for the ASID.
- AArch64: Issues TLBI ASIDE1IS (invalidate all entries for ASID) instead of
per-page TLBI VAE1IS.
- RISC-V: Issues SFENCE.VMA with rs1=0 (flush all for ASID) instead of
per-page flushes.
- Other architectures: similar ASID-global flush.
On multi-CPU systems, tlb.finish() sends TLB shootdown IPIs only to CPUs that
have this mm_struct loaded (tracked by mm.cpu_bitmap), not to all CPUs.
The flush must complete before any pages are freed to prevent use-after-free
through stale TLB entries.
unmap_vmas() unmaps all VMAs: anonymous pages are freed back to the physical
allocator (Section 4.2), file-backed pages have their reference
counts decremented (and are reclaimed by the page cache if the count reaches zero),
and shared mappings are detached from their backing objects. free_pgtables()
releases all page table pages. No separate munmap_all() call is needed — the
unmap_vmas() + free_pgtables() combination performs the complete teardown.
8.2.1.1.14.3 Step 4c: Release mm_struct Reference¶
// Clear TIF_MEMDIE if set (OOM victim completing exit — the mm is now
// torn down, so the reserve bypass is no longer needed). In Linux,
// exit_mm() calls tsk_set_memdie(false) at this point.
task.thread_flags.fetch_and(!TIF_MEMDIE, Release);
// Free the top-level page directory (PGD) physical page.
// free_pgtables() in Step 4b freed intermediate pages (PUD/PMD/PTE) but
// not the PGD itself. MmStruct stores the PGD as `pgd: PhysAddr`.
// SAFETY: Lazy TLB holders (other CPUs where this mm was previously loaded)
// hold Arc<MmStruct> (via mm_count / Arc refcount) but never load mm.pgd
// into a page table base register after mm.users reaches 0. They only use
// the Arc<MmStruct> to check ASID validity and issue TLB invalidation for
// the stale ASID. The freed PGD physical page is never dereferenced by
// lazy TLB holders.
free_page(mm.pgd);
drop(mm); // drops Arc<MmStruct>, freeing the struct if refcount reaches 0
The PGD page is freed separately from free_pgtables() (which frees PUD/PMD/PTE
levels). Then the MmStruct itself is dropped via Arc::drop, releasing all
remaining metadata (VMA tree, address space lock, RSS counters). The Arc may not
reach zero if lazy TLB holders or /proc/[pid]/mem still reference it — in that
case, the struct persists until the last reference is dropped. After this point,
the task has no address space — any attempt to access user memory would fault.
8.2.1.1.15 Step 4d: SysV Semaphore Undo Operations¶
// The per-process SysV semaphore undo list is stored on Process (not Task).
// Each SemUndo entry references a semaphore set and contains accumulated
// adjustments. On exit, each adjustment is reversed to prevent semaphore
// value leakage.
//
// Access path: task.process.sysvsem_undo (a SpinLock<Vec<SemUndoRef>>).
// Each SemUndoRef points to a SemUndo entry that is ALSO linked in the
// corresponding SemSet.undo_list (dual linkage for O(1) traversal from
// both the process side and the semaphore-set side).
//
// **ORDERING CRITICAL (SF-089 fix)**: exit_sem MUST run BEFORE exit_files
// (Step 5). If exit_files runs first, closing an NFS/FUSE file descriptor
// may trigger a filesystem flush callback that interacts with IPC subsystems.
// If exit_sem has not yet run, the undo list is still populated and the
// callback could race with the not-yet-executed undo operations, causing
// deadlock. Linux ordering (kernel/exit.c do_exit()): exit_sem() runs
// BEFORE exit_files().
exit_sem(&task.process);
exit_sem() walks the per-process undo list (task.process.sysvsem_undo):
fn exit_sem(process: &Process) {
// Resolve the IPC namespace this process belongs to — the semaphore
// registry (sem_ids) lives there, not on the Process itself.
let ipc_ns = &process.ipc_ns;
let mut undo_list = process.sysvsem_undo.lock();
for undo_ref in undo_list.drain(..) {
let sem_set = match ipc_ns.sem_ids.lookup(undo_ref.sem_id) {
Some(set) => set,
None => continue, // set was destroyed, nothing to undo
};
let _set_lock = sem_set.lock.lock();
for &(sem_idx, adj) in &undo_ref.adjustments {
// Apply reverse adjustment: semval += adj (adj is already the
// NEGATIVE of the original operation, stored at semop time).
let val = sem_set.sems[sem_idx as usize].load(Relaxed);
let new_val = (val as i32 + adj as i32).max(0) as u16;
sem_set.sems[sem_idx as usize].store(new_val, Relaxed);
}
// Remove from the SemSet's undo_list (dual unlink).
sem_set.undo_list.lock().retain(|u| u.process_id != process.pid);
// Wake any processes blocked on semop() for this set.
sem_set.waiters.wake_all();
}
}
Note on SemUndo structure: Each SemUndo in SemSet.undo_list
(Section 17.1) MUST include a process_id: ProcessId field to
identify which process owns the entry. Without this, exit_sem() cannot
identify entries belonging to the dying process when scanning from the
semaphore-set side. The per-process sysvsem_undo list provides the reverse
index: O(1) traversal of all undo entries for this process without scanning
all semaphore sets in the IPC namespace.
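The adjustment arithmetic can be modeled in isolation. A sketch of the semadj contract: semop() accumulates the negative of each SEM_UNDO operation, and exit applies it back with the same .max(0) clamp used in the exit_sem() walk above:

```rust
/// Called at semop() time for each SEM_UNDO operation: the accumulated
/// adjustment is the NEGATIVE of the applied op, so applying it later
/// reverses the operation.
pub fn semop_undo(semadj: &mut i16, sem_op: i16) {
    *semadj -= sem_op;
}

/// Called at exit_sem() time: apply the accumulated adjustment to the
/// semaphore value, clamping at zero (semaphore values are unsigned).
pub fn exit_apply(semval: u16, adj: i16) -> u16 {
    (semval as i32 + adj as i32).max(0) as u16
}
```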
8.2.1.1.16 Step 4e: POSIX Message Queue Notification Cleanup¶
If the process registered for mq_notify() on any POSIX message queue, the notification
registration is removed. The per-UID byte counter (UserEntry.mq_bytes) is not affected
here — it is decremented only when the queue is destroyed or a message is consumed.
Ordering: mq_notify_cleanup runs immediately after exit_sem, both before
exit_files (Step 5), matching Linux's ordering in do_exit().
8.2.1.1.17 Step 5: Close All File Descriptors¶
// task.files is ArcSwap<FdTable>. When CLONE_FILES is used, multiple tasks
// share the same Arc<FdTable>. We MUST NOT call close_all() unconditionally
// — that would close files for all sharing tasks, even those still running.
//
// Swap in an empty sentinel FdTable and drop the old Arc reference. If this
// is the last reference (no other task shares via CLONE_FILES), Arc::drop
// triggers FdTable::drop which closes all fds. Otherwise, only the Arc
// refcount is decremented.
// This matches Linux's exit_files(): decrement files->count, only call
// close_files() + put_files_struct() when the count drops to zero.
// Interior mutability: ArcSwap::swap takes &self, no &mut needed.
let old_files = task.files.swap(Arc::new(FdTable::empty()));
drop(old_files);
When this task holds the last reference to the FdTable (i.e., no other task shares it
via CLONE_FILES), FdTable::drop() iterates from fd 0 to the highest allocated fd.
Each open file is closed: the file's reference count is decremented, and if it reaches
zero the file's release() method is called (which may flush buffers, release locks, or
tear down network connections depending on the file type). When other tasks still share
the FdTable, only the Arc refcount is decremented — no files are closed.
Ordering constraint: fd close must happen AFTER mm operations (Step 4) because
some filesystems flush dirty pages during close(), which requires the inode and
page cache to be intact. Step 4b unmaps VMAs and drops the mm's page cache
references, but the inode and its page cache remain alive as long as the
Arc<File> in the fd table holds a reference. By running fd close after mm
teardown, we ensure that: (1) no user-accessible mappings exist when close()
triggers writeback, and (2) the inode reference held by the file keeps page cache
entries alive until close() completes.
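The swap-then-drop pattern can be modeled with a plain Arc. This is a userspace sketch; `FdTableModel` stands in for `FdTable`, and the fd-count static stands in for real close() side effects:

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicUsize, Ordering::*};

static CLOSED_FDS: AtomicUsize = AtomicUsize::new(0);

/// Model FdTable: "closes" its fds only when the LAST Arc reference
/// drops, mirroring the CLONE_FILES sharing semantics of Step 5.
pub struct FdTableModel { pub nfds: usize }

impl Drop for FdTableModel {
    fn drop(&mut self) {
        // Runs once, when the final sharer releases the table —
        // the analogue of FdTable::drop closing every open fd.
        CLOSED_FDS.fetch_add(self.nfds, SeqCst);
    }
}

/// Step 5 (modeled): replace the task's table with an empty sentinel and
/// drop the old reference; files close only if no sharer remains.
pub fn exit_files_model(slot: &mut Arc<FdTableModel>) {
    let old = std::mem::replace(slot, Arc::new(FdTableModel { nfds: 0 }));
    drop(old); // closes fds iff this was the last Arc reference
}
```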
8.2.1.1.18 Step 6: Release Filesystem References¶
// task.fs is ArcSwap<RwLock<FsStruct>> — the field is on Task, NOT Process.
// Swap in an empty sentinel to release the FsStruct reference NOW, during
// do_exit(). Without this, a zombie task would pin dentries (root + cwd) via
// the Arc, preventing filesystem unmount until the parent reaps the zombie.
// In Linux, exit_fs() explicitly sets task->fs = NULL.
// Interior mutability: ArcSwap::store takes &self, no &mut needed.
task.fs.store(Arc::new(RwLock::new(FsStruct::empty())));
The fs_struct holds the task's root directory, current working directory, and umask.
Each directory reference is a counted reference to a dentry
(Section 14.1); dropping them decrements the dentry refcount. If the
fs_struct is shared with other tasks (via clone(CLONE_FS)), only the reference count
is decremented — the struct is not freed until the last user drops it.
8.2.1.1.19 Step 7: Signal Handlers (NOTE — no explicit action needed)¶
Signal handlers (Process::sighand: Arc<SignalHandlers>) are shared by all
threads via CLONE_SIGHAND. The Arc refcount manages lifetime automatically:
when the last thread in the process exits and the Process struct is dropped,
the Arc<SignalHandlers> is dropped, which frees the handler table if no
other Process shares it (CLONE_SIGHAND without CLONE_THREAD).
No explicit .take() or strong_count check is needed — Arc::drop handles it.
The handler table persists as long as any Process holds a reference.
The sighand_struct contains the per-signal SigAction array shared by all threads in the process. When the last sharer exits, the final Arc::drop frees the table; a table shared via clone(CLONE_SIGHAND) without CLONE_THREAD persists until the last sharing process exits.
8.2.1.1.20 Step 7b: CBS Bandwidth Cleanup¶
If the task belongs to a cgroup with CBS bandwidth guarantees (Section 7.6), the scheduler cleans up its per-CPU CBS state:
- If the task was CBS-throttled (OnRqState::CbsThrottled), decrement the cgroup's per-CPU CbsCpuServer.nr_throttled_tasks counter.
- Return any remaining per-CPU local slice budget to the cgroup's global CpuBandwidthThrottle.runtime_remaining pool via fetch_add of the local remainder. This prevents budget leaks on task exit.
- If nr_throttled_tasks reaches 0 on this CPU, cancel the per-CPU CBS replenishment timer (hrtimer_cancel(&server.replenish_timer)). Without this, the timer would fire spuriously for a dead task.
- Remove the task from the CBS server's per-CPU task list.
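The budget-return arithmetic can be sketched in userspace Rust with plain atomics standing in for the scheduler structures. The `cbs_exit_cleanup` helper and its signature are illustrative (the field names follow the text above; everything else is a demo assumption), and the timer cancel is reduced to a boolean the caller would act on.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

// Demo stand-ins for the structures named in Step 7b; all other fields omitted.
struct CpuBandwidthThrottle { runtime_remaining: AtomicU64 } // global pool, ns
struct CbsCpuServer { nr_throttled_tasks: AtomicUsize }      // per-CPU

/// Returns true when the caller should cancel the per-CPU replenishment timer.
fn cbs_exit_cleanup(
    server: &CbsCpuServer,
    global: &CpuBandwidthThrottle,
    was_throttled: bool,
    local_slice_remaining_ns: u64,
) -> bool {
    if was_throttled {
        // The exiting task no longer counts as throttled on this CPU.
        server.nr_throttled_tasks.fetch_sub(1, Ordering::Release);
    }
    // Return the unused local slice so task exit never leaks budget.
    global.runtime_remaining.fetch_add(local_slice_remaining_ns, Ordering::Relaxed);
    // The timer is only needed while throttled tasks remain on this CPU.
    server.nr_throttled_tasks.load(Ordering::Acquire) == 0
}

fn main() {
    let global = CpuBandwidthThrottle { runtime_remaining: AtomicU64::new(1_000_000) };
    let server = CbsCpuServer { nr_throttled_tasks: AtomicUsize::new(1) };
    let cancel_timer = cbs_exit_cleanup(&server, &global, true, 250_000);
    assert!(cancel_timer); // last throttled task on this CPU
    assert_eq!(global.runtime_remaining.load(Ordering::Relaxed), 1_250_000);
}
```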
Ordering constraint: CBS cleanup must happen BEFORE cgroup detach (Step 8)
because CBS state references the cgroup's CpuBandwidthThrottle struct. After
cgroup detach, the cgroup reference may be the last one, freeing the throttle
struct while CBS cleanup still accesses it.
8.2.1.1.21 Step 8: Detach from Cgroups¶
8.2.1.1.21.1 Step 8a: Charge Exit to Controllers¶
// task.cgroup is ArcSwap<Cgroup> (see Task struct definition).
// The cgroup_exit() function (Section 17.2) performs the 5-step cleanup.
cgroup_exit(&task);
cgroup_exit() (Section 17.2) notifies each cgroup controller that this task
is exiting: the CPU controller adjusts its weight sum, the memory controller accounts
for any remaining charged pages, the I/O controller flushes pending I/O accounting,
and the PID controller decrements its task count.
8.2.1.1.21.2 Step 8b: Cgroup Reference Swap (handled by cgroup_exit)¶
The cgroup reference swap (task.cgroup.store(root_cgroup)) is already performed
by cgroup_exit() step 5 in Step 8a. No additional action is needed here.
The old Arc<Cgroup> is dropped by cgroup_exit(), decrementing the cgroup's
task refcount. If this is the last task in a cgroup and the cgroup has been
deleted by userspace (rmdir), the cgroup's resources are freed asynchronously
via RCU callback.
Ordering constraint: cgroup detach must happen AFTER fd close (Step 5) because I/O accounting depends on cgroup membership — closing a file may trigger a final writeback whose I/O bytes must be charged to the correct cgroup. IPC resource cleanup (exit_sem, mq_notify_cleanup) runs at Steps 4c-4d, BEFORE fd close, matching Linux's ordering.
8.2.1.1.22 Step 9: Release IPC Resources¶
IPC resource cleanup (SysV semaphore undo and POSIX mq notification) was moved to
Steps 4c–4d to run BEFORE exit_files() (Step 5), matching Linux's ordering in
kernel/exit.c. See Steps 4c–4d above for the full specification.
8.2.1.1.23 Step 10: Audit and Accounting¶
8.2.1.1.23.1 Step 10a: Audit Record¶
If the audit subsystem is active, a final AUDIT_EXIT record is written containing the
task's PID, UID, exit code, and accumulated CPU/memory statistics. This record is
delivered to the audit daemon via the netlink audit socket.
8.2.1.1.23.2 Step 10b: Process Accounting¶
If BSD process accounting is enabled (acct(2)), a fixed-format accounting record is
appended to the accounting file. The record includes the command name, CPU times, memory
usage, and exit status in the struct acct_v3 format (64 bytes, Linux-compatible).
8.2.1.1.23.3 Step 10c: Taskstats Notification¶
A taskstats netlink message (Section 17.2) is sent to any listening userspace
process (typically a statistics daemon or container runtime). The message includes
per-task and per-thread-group delay accounting data: CPU delay, block I/O delay, swapin
delay, and memory reclaim delay. The TASKSTATS_CMD_GET family is used with
CGROUPSTATS_CMD_NEW for per-cgroup aggregation.
8.2.1.1.24 Step 11: Release Namespaces¶
// Task.nsproxy is ArcSwap<NamespaceSet>; swap in a tombstone (empty NamespaceSet)
// to release the Arc, which decrements namespace refcounts. The task is exiting
// and will never access nsproxy again.
// Use the pre-allocated static sentinel to avoid heap allocation on every task exit.
// EMPTY_NSPROXY is initialized once at boot: Arc::new(NamespaceSet::empty()).
let old_nsproxy = task.nsproxy.swap(Arc::clone(&EMPTY_NSPROXY));
drop(old_nsproxy);
The task's namespace proxy (Section 17.1) holds references to the PID namespace, mount namespace, network namespace, UTS namespace, IPC namespace, cgroup namespace, user namespace, and time namespace. Each reference is decremented. If this is the last task in a namespace, the namespace's cleanup function runs:
- PID namespace: when PID 1 of a namespace exits, zap_pid_ns_processes() sends SIGKILL to every other process in the namespace and waits for them to exit. The namespace init is the process that triggers this cleanup; it is not itself a SIGKILL target. After all processes have exited, remaining zombie PIDs are reaped and the PID namespace is marked for destruction when its last reference drops. See Section 17.1.
- Network namespace: all network devices, sockets, and routing tables within the namespace are destroyed.
- Mount namespace: all mounts are unmounted (lazy unmount for busy mounts).
- IPC namespace: all SysV IPC objects (shared memory, semaphores, message queues) are destroyed.
8.2.1.1.25 Step 11b: Execute Exit Cleanup Actions¶
The kernel cleanup thread (one per CPU socket, pre-started at boot) dequeues and
executes all ExitCleanupActions registered for this process via
umka_register_exit_cleanup() (Section 8.1).
fn execute_exit_cleanups(task: &Task) {
let process = &task.process;
// Take the entire cleanup list under the lock, then release the lock
// before executing actions (actions may block — e.g., UnlinkPath does I/O).
let cleanups: ArrayVec<ExitCleanupEntry, 64> = {
let mut list = process.exit_cleanups.lock();
core::mem::take(&mut *list)
};
for entry in cleanups.iter() {
// Each action has a 1-second timeout. If an action blocks beyond
// the timeout, log a warning and skip to the next action.
let result = with_timeout(Duration::from_secs(1), || {
match &entry.action {
ExitCleanupAction::RevokeCap(cap) => {
let _ = cap_revoke(cap);
}
ExitCleanupAction::UnlinkPath { path, dirfd, ns } => {
// Enter the captured mount namespace for path resolution.
with_mount_namespace(ns, || {
let _ = vfs_unlinkat(dirfd, path, 0);
});
}
ExitCleanupAction::SendSignal { target, signo, cred } => {
// Check generation to prevent PID recycling race.
if let Some(target_task) = pid_lookup_with_generation(target) {
let _ = send_signal_as(cred, &target_task, *signo);
}
}
ExitCleanupAction::NotifyEventFd { fd, value } => {
let _ = eventfd_write(fd, *value);
}
}
});
if result.is_err() {
log_warn!("exit_cleanup timeout: process {}, action {:?}",
process.pid, entry.action);
}
}
}
Ordering constraint: exit cleanup runs AFTER namespace release (Step 11) because
cleanup actions may reference mount namespaces captured at registration time — the
task's own namespace proxy must be released first to avoid double-reference confusion.
Cleanup runs BEFORE zombie transition (Step 12) so that cleanup side-effects (e.g.,
RevokeCap, UnlinkPath) are visible to the parent before waitpid() returns.
8.2.1.1.26 Step 12: Reparent Children, Then Notify Parent¶
Ordering: Reparenting children MUST happen BEFORE the zombie transition
(Step 12b). In Linux, exit_notify() calls forget_original_parent() (reparent)
before setting tsk->exit_state = EXIT_ZOMBIE. If the dying task becomes a zombie
first and the parent reaps it (fast wait4()) before reparenting completes, the
children temporarily have no valid parent — violating the invariant that every live
process has a valid parent. Reparenting first ensures all children point to the new
parent before the dying task becomes eligible for reaping.
8.2.1.1.26.1 Step 12-pre: Reparent Children¶
See the reparent_children() specification below (after Step 12d).
8.2.1.1.26.2 Step 12a: Set Exit Code¶
The exit code encodes the termination reason in Linux-compatible format:
- Normal exit: (status & 0xff) << 8 (bits [15:8] = exit status, bits [7:0] = 0)
- Signal kill: signum & 0x7f (bits [6:0] = signal, bit 7 = core dumped flag)
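The encoding and its userspace decode can be sketched as follows. The helper names are illustrative; the bit layout is the Linux wait-status format described above (the real glibc macros additionally special-case the value 0x7f for stopped children, which this sketch omits).

```rust
/// Encode a normal exit: status in bits [15:8], bits [7:0] zero.
fn encode_normal_exit(status: u32) -> u32 { (status & 0xff) << 8 }

/// Encode death by signal: signal in bits [6:0], bit 7 = core-dumped flag.
fn encode_signal_kill(signum: u32, core_dumped: bool) -> u32 {
    (signum & 0x7f) | if core_dumped { 0x80 } else { 0 }
}

// What WIFEXITED/WEXITSTATUS/WTERMSIG/WCOREDUMP extract from the same bits.
fn wifexited(code: u32) -> bool { code & 0x7f == 0 }
fn wexitstatus(code: u32) -> u32 { (code >> 8) & 0xff }
fn wtermsig(code: u32) -> u32 { code & 0x7f }
fn wcoredump(code: u32) -> bool { code & 0x80 != 0 }

fn main() {
    let c = encode_normal_exit(42);
    assert!(wifexited(c));
    assert_eq!(wexitstatus(c), 42);

    let k = encode_signal_kill(11, true); // SIGSEGV with core dump
    assert!(!wifexited(k));
    assert_eq!(wtermsig(k), 11);
    assert!(wcoredump(k));
}
```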
8.2.1.1.26.3 Step 12b: Transition to Zombie¶
The task enters ZOMBIE state, indicating it has completed execution but has not yet been
reaped by its parent. The ZOMBIE task is NOT immediately removed from the run queue by
this store — the actual dequeue happens in schedule() (Step 14), which observes the
ZOMBIE/DEAD state and removes the task from the run queue as part of the context switch.
The task will never be scheduled to run again after Step 14. The Task struct persists
in memory so that the parent can read the exit code, resource usage, and other status
information via wait4(2). Note that auto-reaped tasks (Step 12d) skip zombie state
entirely and transition directly to DEAD.
8.2.1.1.26.4 Step 12c: Send SIGCHLD to Parent¶
// Navigate to parent: Process.parent is AtomicU64 (parent PID).
// Look up the parent Process via PID_TABLE, then its leader Task.
let parent_pid = task.process.parent.load(Acquire);
let parent_process = PID_TABLE.get(parent_pid)
.expect("parent process must exist while child notifies");
let parent = parent_process.thread_group.leader();
let exit_signal = task.exit_signal; // Usually SIGCHLD; may differ for clone(2)
// Three-way si_code determination matching Linux's do_notify_parent():
// code & 0x80 -> CLD_DUMPED (3) -- signal death with core dump
// code & 0x7f -> CLD_KILLED (2) -- signal death without core dump
// otherwise -> CLD_EXITED (1) -- normal exit
let code = task.exit_code.load(Acquire);
let (exit_si_code, exit_si_status) = if code & 0x80 != 0 {
(CLD_DUMPED, code & 0x7f) // signal number
} else if code & 0x7f != 0 {
(CLD_KILLED, code & 0x7f) // signal number
} else {
(CLD_EXITED, (code >> 8) & 0xff) // exit status
};
let info = SigInfo {
si_signo: exit_signal as i32,
si_code: exit_si_code,
si_errno: 0,
_union: SigInfoUnion { sigchld: SigInfoSigchld::new(
task.process.pid,
task.cred.read().uid,
exit_si_status as i32,
cputime_to_clock_t(task.rusage.utime_ns.load(Relaxed)),
cputime_to_clock_t(task.rusage.stime_ns.load(Relaxed)),
)},
};
send_signal_to_task(parent, exit_signal, &info);
wake_up_interruptible(&parent.wait_chldexit);
exit_signal is set at clone(2) time: SIGCHLD for fork(), potentially a different
signal for clone() with an explicit exit_signal argument, or 0 for threads (threads
do not individually notify the parent; only the thread group leader does).
The parent's wait_chldexit wait queue is woken, unblocking any wait4()/waitid()
call.
8.2.1.1.26.5 Step 12d: Auto-Reap (if applicable)¶
if parent_ignores_sigchld(parent) || parent_has_sa_nocldwait(parent) {
// Skip zombie state — release task immediately.
task.state.store(TaskState::DEAD.bits(), Release);
release_task(&task);
}
If the parent has set SIGCHLD to SIG_IGN or has installed a handler with
SA_NOCLDWAIT, the child is auto-reaped: it skips the zombie state entirely and its
Task struct is freed immediately. This matches the POSIX specification for
SA_NOCLDWAIT and the Linux behavior for SIG_IGN on SIGCHLD.
8.2.1.1.27 Step 12-pre (continued): Reparent Children¶
This executes BEFORE the zombie transition in Step 12b. All children of the dying task must be given a new parent. The new parent is selected as follows:
- Subreaper: walk up the ancestor chain looking for a task with PR_SET_CHILD_SUBREAPER set (via prctl(2)). If found, that task becomes the new parent.
- Namespace init: if no subreaper exists, the init process (PID 1) of the task's PID namespace becomes the new parent.
Lock ordering for reparent: when modifying the parent-child relationship between two processes, locks are acquired in ascending PID order to prevent ABBA deadlock. Specifically:
- The dying process holds its own process.lock (already held from earlier steps in do_exit).
- For each child being reparented, the new parent's process.lock must be acquired. If the new parent's PID is lower than the dying process's PID, the dying process's lock must be temporarily released and both locks acquired in PID order before the operation proceeds. In practice, the new parent is either a subreaper (typically a long-lived daemon with a low PID) or PID 1 (init); in both cases its PID is below the dying process's in the vast majority of scenarios.
- Simplification: since the dying process is in PF_EXITING state and no new children can be added to it (fork checks PF_EXITING on the parent), it is safe to iterate children without holding the lock — the collection is frozen. The dying process releases its own lock, acquires the new parent's lock, and performs all reparenting under the new parent's lock; no re-acquisition of its own lock is needed afterwards, since the remaining steps take no process lock.
Process.lock level: All Process.lock instances share the same lock level
(pending assignment in Section 3.4). Nested
acquisition of two Process.lock instances is permitted ONLY when acquired in
ascending PID order. The compile-time Lock<T, LEVEL> mechanism enforces that two
locks at the same level cannot be held simultaneously; the PID-ordered protocol
avoids this by releasing the dying process's lock before acquiring the new parent's.
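The PID-ordered protocol reduces to a small helper. This userspace sketch uses a toy `ProcessLike` type and illustrative names; the point is the invariant that every caller, on every thread, agrees on the same acquisition order, so no ABBA cycle can form.

```rust
use std::sync::{Mutex, MutexGuard};

struct ProcessLike { pid: u64, lock: Mutex<()> }

/// Pure helper: the order in which two same-level locks must be taken.
fn acquisition_order(a: u64, b: u64) -> (u64, u64) {
    if a < b { (a, b) } else { (b, a) }
}

/// Acquire both Process-style locks in ascending PID order. Because all
/// callers sort the same way, two threads can never each hold one lock
/// while waiting on the other's.
fn lock_pair<'a>(a: &'a ProcessLike, b: &'a ProcessLike)
    -> (MutexGuard<'a, ()>, MutexGuard<'a, ()>)
{
    if a.pid < b.pid {
        (a.lock.lock().unwrap(), b.lock.lock().unwrap())
    } else {
        (b.lock.lock().unwrap(), a.lock.lock().unwrap())
    }
}

fn main() {
    let p1 = ProcessLike { pid: 1, lock: Mutex::new(()) };
    let p9 = ProcessLike { pid: 9, lock: Mutex::new(()) };
    assert_eq!(acquisition_order(p9.pid, p1.pid), (1, 9));
    let (_low, _high) = lock_pair(&p9, &p1); // internally takes pid 1 first
}
```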
For each reparented child:
// task.process.children is XArray<Pid, Arc<Process>>.
// Each child is an Arc<Process>; per-task fields (state, exit_code,
// cred, rusage) live on the group leader Task, not on Process.
// Iterate the XArray directly instead of collecting into a fixed-size buffer.
// A process may have more than 256 children (fork-bomb, Apache prefork MPM,
// test harnesses), so ArrayVec<_, 256> would panic. XArray supports lock-free
// iteration. PF_EXITING guarantees no new children are added during iteration.
// Lock protocol: release dying process lock (PF_EXITING protects children
// from mutation), acquire process_tree_write_lock for atomic batch reparent,
// then acquire new parent's Process::lock.
//
// process_tree_write_lock is a global RwLock for cold paths that need atomic
// multi-task mutation (ptrace reparent, process group operations, session
// leadership changes). Hot paths use RCU + per-PID atomic fields.
// Renamed from tasklist_lock to prevent agents from using it on hot paths.
drop(task_process_guard);
let _tree_guard = process_tree_write_lock.write();
let _new_parent_guard = new_parent_process.lock.lock();
for (_pid, child_process) in task.process.children.iter() {
child_process.parent.store(new_parent_pid.as_u64(), Release);
new_parent_process.children.insert(child_process.pid.as_u64(), Arc::clone(child_process));
// Remove from dying process's children XArray. Without this, the dying
// process's children XArray retains Arc references — memory leak.
task.process.children.remove(child_process.pid.as_u64());
// Access per-task fields via the group leader.
let leader = child_process.thread_group.leader(); // Arc<Task>
if leader.state.load(Acquire) == TaskState::ZOMBIE.bits() {
// Child is already a zombie — notify the new parent so it can reap.
// Construct a proper SigInfo matching the 3-arg pattern used in
// Step 12c (exit notification to the original parent).
// Three-way si_code/si_status matching Step 12c pattern:
let ec = leader.exit_code.load(Acquire);
let (sc, ss) = if ec & 0x80 != 0 {
(CLD_DUMPED, ec & 0x7f)
} else if ec & 0x7f != 0 {
(CLD_KILLED, ec & 0x7f)
} else {
(CLD_EXITED, (ec >> 8) & 0xff)
};
let cred = leader.cred.read(); // RcuCell -> RcuRef<Arc<TaskCredential>>
let info = SigInfo {
si_signo: SIGCHLD as i32,
si_code: sc,
si_errno: 0,
_union: SigInfoUnion { sigchld: SigInfoSigchld::new(
child_process.pid.as_raw(), // Pid -> i32
cred.uid,
ss as i32,
cputime_to_clock_t(leader.rusage.utime_ns.load(Relaxed)),
cputime_to_clock_t(leader.rusage.stime_ns.load(Relaxed)),
)},
};
// Use the child's exit_signal (set at clone(2) time), NOT hardcoded
// SIGCHLD. clone(2) allows specifying a custom exit_signal; only fork()
// defaults to SIGCHLD. Matches Linux: do_notify_parent(tsk, tsk->exit_signal).
let child_exit_signal = leader.exit_signal;
let new_parent_leader = new_parent_process.thread_group.leader();
send_signal_to_task(new_parent_leader, child_exit_signal, &info);
wake_up_interruptible(&new_parent_process.wait_chldexit);
}
}
drop(_new_parent_guard);
drop(_tree_guard);
// No re-acquisition of the dying process lock — Step 14 (schedule()) does
// not require any process lock. The dying task is PF_EXITING with no
// further mutations needed.
8.2.1.1.28 Step 13: Orphaned Process Group Check¶
After reparenting (Step 12-pre) and zombie transition (Step 12b), check whether
any process group has become orphaned. An orphaned process group is one where
all members' parents are either in the same group or outside the session. If an
orphaned group contains stopped processes, SIGHUP followed by SIGCONT is
sent to all members (Section 8.6).
8.2.1.1.29 Step 14: Schedule Away¶
The scheduler removes the task from the run queue (it was already marked ZOMBIE or
DEAD in Step 12) and selects the next runnable task. The dying task's kernel stack
and task struct are freed by finish_task_switch()
(Section 7.3)
running on the NEXT task that is scheduled on this CPU. This deferred cleanup is
necessary because the dying task cannot free its own stack while still executing on it.
The finish_task_switch() procedure:
1. Releases the runqueue lock (held across the context switch).
2. Re-enables preemption.
3. Checks if prev (the dying task) is TASK_DEAD. If so, frees the kernel stack
(returned to the kernel stack slab cache) and drops the final Arc<Task> reference
(freeing the Task struct if no other references remain).
4. Fires scheduler-in notifiers (KVM guest re-entry, cgroup tracking update).
Note: Pending signals are NOT flushed during do_exit(). The signal queue
remains intact through the zombie state so that waitid(WNOWAIT) can inspect
the final signal state if needed. Signal queue cleanup occurs in release_task()
when the parent reaps the zombie via waitpid(). This matches Linux behavior:
release_task() calls flush_signals() which frees all SigQueueEntry nodes
back to SIGQUEUE_SLAB, clears the standard signal array, and zeroes the
pending mask.
Zombie → DEAD → free lifecycle: A zombie task retains its kernel stack and Task
struct until the parent calls wait4(2) → release_task().
Arc ownership handoff (release_task() and finish_task_switch()):
The Arc<Task> reference count tracks all live references to the Task struct.
When release_task() completes (called from wait4() or auto-reap), it removes
the Task from PID tables, thread group lists, and parent child lists — eliminating
all structural references. The only remaining Arc<Task> reference is held by the
scheduler's RunQueue.prev field, set during schedule() (Step 14) when the dying
task context-switches away. finish_task_switch(), running on the NEXT scheduled
task, drops this final Arc<Task> via Arc::drop(prev), which triggers
Task::drop() → slab deallocation. The kernel stack is freed separately (also by
finish_task_switch()) because the stack cannot be freed while executing on it.
Summary of the ownership chain:
1. do_exit() Step 14: schedule() stores prev = Arc::clone(&dying_task) in runqueue.
2. Next task runs finish_task_switch(prev).
3. finish_task_switch checks prev.state == DEAD, frees kernel stack, calls
Arc::drop(prev) — the final Arc reference triggers Task::drop().
4. Task::drop() returns the slab slot to the task slab cache.
8.2.1.2 release_task() — Full Pseudocode¶
Called from two sites: (1) wait4() / waitid() after reaping a zombie, and
(2) the auto-reap path for non-last threads and SA_NOCLDWAIT / SIG_IGN children.
Preconditions:
- Task is in ZOMBIE state (reap path) or DEAD state (auto-reap path).
- The caller holds no locks on the target task (release_task acquires what it needs).
/// Release all remaining resources held by a dead/zombie task.
///
/// After this function returns, the caller's `Arc<Task>` reference is the
/// last external reference. When it is dropped, the Task struct is freed
/// by the slab allocator. The kernel stack is freed by `finish_task_switch()`
/// when the next task is scheduled on the same CPU.
fn release_task(task: &Task) {
// 1. State transition: ZOMBIE → DEAD (for zombie reap path).
// For auto-reap (non-last threads), the caller already set DEAD.
// This is idempotent — storing DEAD when already DEAD is harmless.
// The Release ordering pairs with the Acquire in finish_task_switch().
task.state.store(TaskState::DEAD.bits(), Release);
// 2. Return PID to the PID allocator.
// Walk from the task's leaf PID namespace up to the init namespace,
// freeing the PID number at each level. After this, the PID is
// available for reuse by future fork() calls.
pid_allocator_free(task.tid);
// 3. Decrement UID task count (RLIMIT_NPROC accounting).
// The UserEntry.task_count was incremented in do_fork() step 4.
// Decrementing here (not in do_exit) ensures the count remains
// accurate through the zombie phase — a zombie still occupies a
// PID slot and counts toward RLIMIT_NPROC.
let user_entry = user_entry_lookup(task.cred.read().uid);
user_entry.task_count.fetch_sub(1, Release);
// 4. Flush pending signal queue under queue_lock.
// Free all SigQueueEntry nodes back to SIGQUEUE_SLAB. Clear the
// standard signal array and pending mask. The signal queue was
// preserved through the zombie state for waitid(WNOWAIT) inspection.
//
// shared_pending belongs to the Process (thread group), not individual
// threads. Only flush it when reaping the thread group leader — flushing
// it for a non-leader thread would destroy process-wide pending signals
// that surviving sibling threads need to receive. Matches Linux
// kernel/exit.c release_task(): flush_sigqueue(&p->signal->shared_pending)
// is guarded by thread_group_leader(p).
{
let _guard = task.process.sighand.queue_lock.lock();
flush_signals(&task.pending); // always flush per-task pending
if task.tid == task.process.pid {
// Only flush process-wide pending when reaping the thread group leader.
// Non-leader threads share shared_pending with surviving siblings.
flush_signals(&task.process.shared_pending);
}
}
// 5. Unlink from thread group list.
// Remove task.group_node from Process::thread_group.tasks.
// Protected by Process::lock to serialize with concurrent
// /proc/[tgid]/task/ enumeration and kill(-tgid) signal delivery.
{
let _guard = task.process.lock.lock();
task.process.thread_group.tasks.remove(&task.group_node);
}
// 6. Remove from parent's children XArray (for thread group leaders).
// Non-leader threads are not in the parent's children list.
if task.tid == task.process.pid {
let parent_pid = task.process.parent.load(Acquire);
if parent_pid != 0 {
if let Some(parent_proc) = find_process_by_pid(parent_pid) {
let _guard = parent_proc.lock.lock();
parent_proc.children.remove(task.process.pid.as_u64());
}
}
}
// 7. PID namespace cleanup: if this was the last task in a PID namespace,
// the namespace can be freed. The PidNamespace's task_count was
// incremented at alloc_pid() and is decremented here.
pid_ns_release(task);
// The caller's Arc<Task> reference is the last one. When it goes out
// of scope, Arc::drop frees the Task struct to the slab cache.
// The kernel stack is freed by finish_task_switch() on the CPU where
// the dead task last ran — that function observes TASK_DEAD and calls
// free_kernel_stack(prev.stack_base).
}
UID task count decrement: UserEntry.task_count (Section 8.7)
is decremented in release_task(), not in do_exit(). This ensures the count remains
accurate through the zombie phase (the zombie is still a "task" that occupies a PID slot
and contributes to RLIMIT_NPROC accounting).
8.2.2 Core Dump Generation¶
When a process exits due to a signal whose default action is CORE, the kernel generates an ELF core dump before tearing down the address space. This section specifies the complete core dump procedure, ELF format, and per-architecture register sets.
For the high-level core dump filter bitmask and compression options, see Section 20.4.
8.2.2.1 Trigger Signals¶
The following signals trigger a core dump when their default disposition is in effect (i.e., no user-installed handler and the signal is not ignored):
| Signal | Number | Cause |
|---|---|---|
| SIGQUIT | 3 | Keyboard quit (Ctrl+\) |
| SIGILL | 4 | Illegal CPU instruction |
| SIGTRAP | 5 | Trace/breakpoint trap |
| SIGABRT | 6 | abort(3) call |
| SIGBUS | 7 | Bus error |
| SIGFPE | 8 | Arithmetic exception |
| SIGSEGV | 11 | Invalid memory reference |
| SIGSYS | 31 | Invalid syscall (seccomp) |
| SIGXCPU | 24 | CPU time limit exceeded |
| SIGXFSZ | 25 | File size limit exceeded |
8.2.2.2 RLIMIT_CORE Check¶
fn should_dump(task: &Task) -> bool {
let core_limit = task.process.rlimits.limits[RLIMIT_CORE].soft;
core_limit > 0
}
If RLIMIT_CORE (Section 8.7) is 0, the dump is skipped
entirely — no file is created, no pipe is opened. If the limit is less than the
estimated dump size, the dump is truncated to RLIMIT_CORE bytes (the ELF header and
NOTE segments are written first to maximize usefulness of truncated dumps).
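The truncation policy reduces to a priority-ordered write loop. This is an illustrative sketch (the `write_truncated` helper and segment names are demo assumptions): segments are emitted in order of usefulness, and writing stops once the RLIMIT_CORE byte budget is exhausted.

```rust
/// Write segments in priority order until the byte budget is exhausted;
/// returns (segment name, bytes actually written) per segment.
fn write_truncated(segments: &[(&str, usize)], limit: usize) -> Vec<(String, usize)> {
    let mut written = Vec::new();
    let mut budget = limit;
    for (name, size) in segments {
        if budget == 0 { break; }
        let n = (*size).min(budget); // last segment may be cut short
        budget -= n;
        written.push((name.to_string(), n));
    }
    written
}

fn main() {
    // ELF header and notes first, then memory: a 500-byte limit keeps all
    // metadata intact and truncates only the memory segment.
    let out = write_truncated(&[("ehdr", 64), ("notes", 400), ("load0", 4096)], 500);
    assert_eq!(out, vec![
        ("ehdr".to_string(), 64),
        ("notes".to_string(), 400),
        ("load0".to_string(), 36),
    ]);
}
```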
8.2.2.3 Path Resolution — core_pattern¶
The dump destination is determined by /proc/sys/kernel/core_pattern (implemented as
an umkafs tunable). The pattern string supports the following format specifiers, evaluated
at dump time:
| Specifier | Expansion | Example |
|---|---|---|
| %p | PID of the dumped process (in the initial PID namespace) | 12345 |
| %u | Real UID of the dumped process | 1000 |
| %g | Real GID of the dumped process | 1000 |
| %s | Signal number that caused the dump | 11 |
| %t | UNIX timestamp (seconds since epoch) | 1710345600 |
| %e | Executable filename (truncated to 15 chars, no path) | my_program |
| %E | Executable path with / replaced by ! | !usr!bin!my_program |
| %h | Hostname | node01 |
| %% | Literal % | % |
Pipe mode: if core_pattern begins with |, the remainder is parsed as a command
line. The kernel spawns a userspace helper process (e.g., |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h)
and pipes the ELF core data to its stdin. The helper runs with full root capabilities in
the initial namespaces. This is the mechanism used by systemd-coredump, abrt, and
similar crash handlers.
File mode: if core_pattern does not begin with |, it is a filesystem path
(possibly with % specifiers). The kernel creates the file with mode 0600, owned by
the dumped process's real UID/GID. The file is written in the process's current working
directory unless core_pattern contains an absolute path.
Default: core (produces a file named core in the process's cwd).
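An illustrative expander for a subset of the specifiers above. This is string substitution only; the real kernel evaluates specifiers at dump time with namespace-aware values, and the function name here is a demo assumption.

```rust
/// Expand a subset of core_pattern specifiers (%p, %s, %e, %%).
/// Unknown specifiers are kept literally.
fn expand_core_pattern(pattern: &str, pid: u32, sig: u32, comm: &str) -> String {
    let mut out = String::new();
    let mut chars = pattern.chars();
    while let Some(c) = chars.next() {
        if c != '%' { out.push(c); continue; }
        match chars.next() {
            Some('p') => out.push_str(&pid.to_string()),
            Some('s') => out.push_str(&sig.to_string()),
            // %e: executable name, truncated to 15 chars as the table notes.
            Some('e') => out.push_str(&comm.chars().take(15).collect::<String>()),
            Some('%') => out.push('%'),
            Some(other) => { out.push('%'); out.push(other); },
            None => out.push('%'),
        }
    }
    out
}

fn main() {
    let path = expand_core_pattern("core.%e.%p.%s", 12345, 11, "my_program");
    assert_eq!(path, "core.my_program.12345.11");
}
```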
8.2.2.4 ELF Core File Format¶
The core dump is written as a standard ELF file compatible with GDB, lldb, crash,
and all standard ELF analysis tools.
8.2.2.4.1 ELF Header¶
/// ELF64 header for core dump files on LP64 architectures (x86-64, AArch64,
/// RISC-V 64, PPC64LE, s390x, LoongArch64).
///
/// On ILP32 architectures (ARMv7, PPC32), the kernel writes ELF32
/// core dumps using `Elf32CoreHeader` (below). The ELF class byte
/// (`e_ident[EI_CLASS]`) is `ELFCLASS64` (2) for 64-bit or `ELFCLASS32` (1)
/// for 32-bit. GDB uses this byte to select the correct header parser.
#[repr(C)]
#[cfg(target_pointer_width = "64")]
pub struct ElfCoreHeader {
pub e_ident: [u8; 16], // ELF magic + class + endianness
pub e_type: u16, // ET_CORE = 4
pub e_machine: u16, // Per-architecture (see table below)
pub e_version: u32, // EV_CURRENT = 1
pub e_entry: u64, // 0 (no entry point for core files)
pub e_phoff: u64, // Program header table offset
pub e_shoff: u64, // 0 (no section headers in core files)
pub e_flags: u32, // Architecture-specific flags
pub e_ehsize: u16, // ELF header size (64 bytes for ELF64)
pub e_phentsize: u16, // Program header entry size (56 for ELF64)
pub e_phnum: u16, // Number of program headers
pub e_shentsize: u16, // 0
pub e_shnum: u16, // 0
pub e_shstrndx: u16, // 0
}
/// ELF32 header for core dump files on ILP32 architectures (ARMv7, PPC32).
///
/// Address and offset fields are 4 bytes (u32) instead of 8 bytes (u64).
/// The header is 52 bytes (vs 64 for ELF64). Program header entries are
/// 32 bytes (vs 56 for ELF64).
#[repr(C)]
#[cfg(target_pointer_width = "32")]
pub struct ElfCoreHeader {
pub e_ident: [u8; 16], // ELF magic + class(ELFCLASS32) + endianness
pub e_type: u16, // ET_CORE = 4
pub e_machine: u16, // Per-architecture (see table below)
pub e_version: u32, // EV_CURRENT = 1
pub e_entry: u32, // 0 (no entry point for core files)
pub e_phoff: u32, // Program header table offset
pub e_shoff: u32, // 0 (no section headers in core files)
pub e_flags: u32, // Architecture-specific flags
pub e_ehsize: u16, // ELF header size (52 bytes for ELF32)
pub e_phentsize: u16, // Program header entry size (32 for ELF32)
pub e_phnum: u16, // Number of program headers
pub e_shentsize: u16, // 0
pub e_shnum: u16, // 0
pub e_shstrndx: u16, // 0
}
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<ElfCoreHeader>() == 64);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<ElfCoreHeader>() == 52);
e_machine values by architecture:
| Architecture | e_machine | Value |
|---|---|---|
| x86-64 | EM_X86_64 | 62 |
| AArch64 | EM_AARCH64 | 183 |
| ARMv7 | EM_ARM | 40 |
| RISC-V 64 | EM_RISCV | 243 |
| PPC32 | EM_PPC | 20 |
| PPC64LE | EM_PPC64 | 21 |
| s390x | EM_S390 | 22 |
| LoongArch64 | EM_LOONGARCH | 258 |
8.2.2.4.2 PT_NOTE Segment¶
The PT_NOTE segment contains structured metadata about the process and all of its
threads. Each note has the standard ELF note layout: namesz, descsz, type,
name (padded to 4-byte alignment), and descriptor data (padded to 4-byte alignment).
The note name is "CORE\0" (5 bytes) for standard notes and "LINUX\0" (6 bytes)
for Linux-specific extensions.
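The 4-byte alignment rule gives each note record a predictable on-disk size: a 12-byte header (namesz, descsz, type, each a u32), then the padded name, then the padded descriptor. A small sketch (the 336-byte descriptor in the demo is the x86-64 elf_prstatus size):

```rust
/// Round up to the next multiple of 4 (ELF note alignment).
fn align4(n: usize) -> usize { (n + 3) & !3 }

/// Total on-disk size of one ELF note record: 12-byte header plus the
/// 4-byte-aligned name and 4-byte-aligned descriptor.
fn note_record_size(namesz: usize, descsz: usize) -> usize {
    12 + align4(namesz) + align4(descsz)
}

fn main() {
    // "CORE\0" name (5 bytes, padded to 8) + a 336-byte NT_PRSTATUS descriptor.
    assert_eq!(note_record_size(5, 336), 356);
    // "LINUX\0" (6 bytes) also pads to 8.
    assert_eq!(align4(6), 8);
}
```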
Note types (in emission order):
| Note Type | Name | Contents |
|---|---|---|
| NT_PRSTATUS (1) | CORE | Per-thread: register state, signal info, PID, cumulative CPU times. One per thread. |
| NT_PRPSINFO (3) | CORE | Process-level: executable name (16 chars), state char, PPID, PGRP, SID, UID, GID. |
| NT_SIGINFO (0x53494749) | CORE | siginfo_t for the fatal signal (si_signo, si_code, si_errno, si_addr). |
| NT_AUXV (6) | CORE | Auxiliary vector (AT_* entries) copied from the process's stack at exec time. |
| NT_FILE (0x46494c45) | CORE | Memory-mapped file list: count, page_size, then per-entry {start, end, file_offset, filename}. |
| NT_FPREGSET (2) | CORE | Per-thread floating-point register state. Format is architecture-dependent. |
| Arch-specific notes | LINUX | Per-thread extended state (see per-architecture table below). |
8.2.2.4.2.1 NT_PRSTATUS Layout¶
One NT_PRSTATUS note is emitted per thread. The first NT_PRSTATUS corresponds to
the thread that received the fatal signal:
/// Per-thread status note. Layout matches Linux `struct elf_prstatus`.
///
/// FIX-026: `pr_sigpend` and `pr_sighold` are C `unsigned long` in Linux's
/// `struct elf_prstatus`, not `u64`. On ILP32 they are 4 bytes each.
/// Uses `KernelULong` ([Section 19.1](19-sysapi.md#syscall-interface--kernellong--kernelulong)).
///
/// `Timeval` fields also use `KernelLong` internally (see
/// [Section 8.7](#resource-limits-and-accounting--getrusage-wire-format)).
// kernel-internal, not KABI
#[repr(C)]
pub struct ElfPrstatus {
/// Signal information.
pub si_signo: i32,
pub si_code: i32,
pub si_errno: i32,
/// Current signal (short).
pub pr_cursig: u16,
pub _pad0: u16,
/// Set of pending signals (bitmask). C type: `unsigned long`.
pub pr_sigpend: KernelULong,
/// Set of held (blocked) signals (bitmask). C type: `unsigned long`.
pub pr_sighold: KernelULong,
/// Process ID.
pub pr_pid: i32,
/// Parent process ID.
pub pr_ppid: i32,
/// Process group ID.
pub pr_pgrp: i32,
/// Session ID.
pub pr_sid: i32,
/// User CPU time.
pub pr_utime: Timeval,
/// System CPU time.
pub pr_stime: Timeval,
/// Cumulative user time of reaped children.
pub pr_cutime: Timeval,
/// Cumulative system time of reaped children.
pub pr_cstime: Timeval,
/// General-purpose register set (architecture-dependent size).
pub pr_reg: ArchGpRegs,
/// Boolean flag: non-zero if the FP register set (NT_FPREGSET) is valid
/// for this thread. 0 if the thread never used floating-point. Matches
/// Linux's `pr_fpvalid` field in `struct elf_prstatus`.
pub pr_fpvalid: i32,
}
/// Size excluding `pr_reg` (architecture-dependent) and trailing `pr_fpvalid`:
/// 64-bit: 3*i32(12) + u16+pad(4) + 2*KernelULong(16) + 4*i32(16) + 4*Timeval(64) = 112
/// + pr_reg (varies) + pr_fpvalid(4) + possible trailing pad.
/// 32-bit: 3*i32(12) + u16+pad(4) + 2*KernelULong(8) + 4*i32(16) + 4*Timeval(32) = 72
/// + pr_reg (varies) + pr_fpvalid(4).
/// Full size depends on ArchGpRegs — verified per-architecture at build time.
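The fixed-prefix arithmetic above can be cross-checked mechanically. A standalone sketch follows, with the `KernelULong`/`Timeval` widths passed in as plain integers rather than taken from the real kernel type definitions:

```rust
/// Compute the ElfPrstatus fixed-prefix size (everything before pr_reg)
/// for a given unsigned-long width. Sketch only: the widths are parameters,
/// not the kernel's KernelULong / Timeval types.
fn prstatus_prefix_size(ulong: usize) -> usize {
    let timeval = 2 * ulong;        // tv_sec + tv_usec, both KernelLong
    3 * 4                           // si_signo, si_code, si_errno
        + 2 + 2                     // pr_cursig + _pad0
        + 2 * ulong                 // pr_sigpend, pr_sighold
        + 4 * 4                     // pr_pid, pr_ppid, pr_pgrp, pr_sid
        + 4 * timeval               // pr_utime, pr_stime, pr_cutime, pr_cstime
}

fn main() {
    assert_eq!(prstatus_prefix_size(8), 112); // 64-bit prefix
    assert_eq!(prstatus_prefix_size(4), 72);  // 32-bit (ILP32) prefix
    println!("ok");
}
```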
8.2.2.4.3 PT_LOAD Segments¶
One PT_LOAD segment is emitted per VMA that passes the coredump filter
(Section 20.4). Each segment records the
virtual address range and file offset within the core file where the memory contents
are stored.
Inclusion rules:
- The VMA must have read permission (PROT_READ). Execute-only VMAs without read are not dumped (their contents cannot be read by the core dump code).
- The VMA must pass the coredump_filter bitmask test:
  - Bit 0 (anonymous private): MAP_ANONYMOUS | MAP_PRIVATE — always included by default.
  - Bit 1 (anonymous shared): MAP_ANONYMOUS | MAP_SHARED — included by default.
  - Bit 2 (file-backed private): mapped from a file with MAP_PRIVATE — excluded by default.
  - Bit 3 (file-backed shared): mapped from a file with MAP_SHARED — excluded by default.
  - Bit 4 (ELF headers): the first page of ELF-backed mappings — included by default.
  - Bit 5 (private huge pages): MAP_HUGETLB | MAP_PRIVATE — included by default.
  - Bit 6 (shared huge pages): MAP_HUGETLB | MAP_SHARED — excluded by default.
  - Bit 7 (private DAX pages): MAP_PRIVATE on a DAX-backed file — excluded by default.
  - Bit 8 (shared DAX pages): MAP_SHARED on a DAX-backed file — excluded by default.
- VMAs marked with MADV_DONTDUMP (madvise(2)) are unconditionally excluded, regardless of the filter bitmask.
- For file-backed VMAs included by the filter, clean pages are skipped: only dirty (modified) pages are written. Clean pages can be re-read from the backing file, so omitting them reduces dump size without losing information.
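The filter test can be sketched as a pure function over the VMA's mapping flags. The flag constants, the `Vma` struct, and the helper names below are illustrative only; the real check also consults the VMA's file and DAX backing, which this sketch omits:

```rust
// Sketch of the coredump_filter bitmask test. Flag values and the Vma
// struct are illustrative; the kernel's VMA carries this state differently.
const MAP_SHARED: u32 = 0x01;
const MAP_ANONYMOUS: u32 = 0x20;
const MAP_HUGETLB: u32 = 0x40000;

struct Vma { flags: u32, readable: bool, dont_dump: bool }

/// Return the filter bit index governing this VMA. Bits 0-3 and 5-6 only:
/// the ELF-header (4) and DAX (7-8) bits are omitted from this sketch.
fn filter_bit(v: &Vma) -> u32 {
    let shared = v.flags & MAP_SHARED != 0;
    let anon = v.flags & MAP_ANONYMOUS != 0;
    let huge = v.flags & MAP_HUGETLB != 0;
    match (huge, anon, shared) {
        (true, _, false) => 5,  // private huge pages
        (true, _, true) => 6,   // shared huge pages
        (false, true, false) => 0,
        (false, true, true) => 1,
        (false, false, false) => 2,
        (false, false, true) => 3,
    }
}

fn should_dump(v: &Vma, filter: u32) -> bool {
    // Unreadable or MADV_DONTDUMP VMAs are excluded unconditionally.
    v.readable && !v.dont_dump && filter & (1 << filter_bit(v)) != 0
}

fn main() {
    let anon_priv = Vma { flags: MAP_ANONYMOUS, readable: true, dont_dump: false };
    let file_priv = Vma { flags: 0, readable: true, dont_dump: false };
    assert!(should_dump(&anon_priv, 0x33));  // bit 0 set in the default filter
    assert!(!should_dump(&file_priv, 0x33)); // bit 2 clear by default
    println!("ok");
}
```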
8.2.2.5 Per-Architecture Register Sets¶
Each architecture defines its general-purpose register set (written in NT_PRSTATUS),
floating-point register set (written in NT_FPREGSET), and optional extended state
notes. The register count and layout must match what GDB expects for each architecture
so that core files produced by UmkaOS are fully debuggable.
| Architecture | GP Registers | FP Registers | Extra Notes |
|---|---|---|---|
| x86-64 | 27 x u64: r15, r14, r13, r12, rbp, rbx, r11, r10, r9, r8, rax, rcx, rdx, rsi, rdi, orig_rax, rip, cs, eflags, rsp, ss, fs_base, gs_base, ds, es, fs, gs | XSAVE area (up to 8 KB): x87 FPU state, SSE (XMM0-15), AVX (YMM0-15), AVX-512 (ZMM0-31, opmask k0-k7) | NT_X86_XSTATE (0x202): full XSAVE area. NT_386_TLS (0x200): TLS descriptors. |
| AArch64 | 33 x u64: x0–x30, sp, pc, pstate | 32 x u128 (v0–v31) + u32 fpsr + u32 fpcr | NT_ARM_VFP (0x400): FPSIMD state. NT_ARM_SVE (0x405): SVE registers (if SVE enabled, variable length). NT_ARM_PAC_MASK (0x406): PAC masks. NT_ARM_TAGGED_ADDR_CTRL (0x409): MTE tag control. |
| ARMv7 | 18 x u32: r0–r15, cpsr, orig_r0 | 33 x u64: d0–d31 + fpscr (VFPv3) | NT_ARM_VFP (0x400): VFP register state. NT_ARM_HW_BREAK (0x402): hardware breakpoints. NT_ARM_HW_WATCH (0x403): hardware watchpoints. |
| RISC-V 64 | 33 x u64: x0 (zero)–x31, pc | 33 x u64: f0–f31, fcsr | NT_RISCV_CSR (0x900): selected CSR values (sstatus, scause, stval). |
| PPC32 | 48 x u32: r0–r31, nip, msr, orig_r3, ctr, link, xer, ccr, mq, trap, dar, dsisr, result | 33 x u64: fpr0–fpr31, fpscr | NT_PPC_SPE (0x101): SPE accumulator and SPEFSCR (for e500 cores with SPE). |
| PPC64LE | 48 x u64: r0–r31, nip, msr, orig_r3, ctr, link, xer, ccr, softe, trap, dar, dsisr, result | 33 x u64: fpr0–fpr31, fpscr | NT_PPC_VMX (0x100): Altivec/VMX v0–v31, vscr, vrsave. NT_PPC_VSX (0x102): VSX high doublewords (vs0h–vs31h). NT_PPC_TM_SPR (0x108): Transactional Memory SPRs (if TM active). |
| s390x | 16 x u64: r0–r15 + psw_mask + psw_addr | 16 x u64: fpr0–fpr15 + fpc | NT_S390_CTRS (0x304): control registers cr0–cr15. NT_S390_PREFIX (0x305): prefix register. NT_S390_VXRS_LOW (0x309): vector register low halves v0–v15. NT_S390_VXRS_HIGH (0x30a): full vector registers v16–v31. |
| LoongArch64 | 33 x u64: r0–r31, pc | 34 x u64: f0–f31, fcc (8 condition codes packed), fcsr | NT_LOONGARCH_LSX (0xa01): LSX registers (128-bit v0–v31). NT_LOONGARCH_LASX (0xa02): LASX registers (256-bit x0–x31, if LASX present). |
8.2.2.6 Core Dump Procedure¶
The complete procedure, called from do_exit() Step 4a:
do_coredump(task, mm, exit_code):
1. Check RLIMIT_CORE > 0. If not, return immediately.
2. Check that the process is not setuid/setgid (suid_dumpable check):
- suid_dumpable == 0 ("default"): skip dump for SUID/SGID binaries.
- suid_dumpable == 1 ("debug"): dump is written readable only by root.
- suid_dumpable == 2 ("suidsafe"): dump is piped to the core_pattern
handler (pipe mode only); file mode dumps are skipped.
3. Parse core_pattern to determine output destination (file or pipe).
4. Open the output (create file or spawn pipe helper).
5. Write ELF header (e_type = ET_CORE).
6. Write PT_NOTE segment:
a. For each thread: emit NT_PRSTATUS with register snapshot.
b. Emit NT_PRPSINFO (process info).
c. Emit NT_SIGINFO (fatal signal details).
d. Emit NT_AUXV (auxiliary vector).
e. Emit NT_FILE (memory-mapped file list).
f. For each thread: emit NT_FPREGSET and arch-specific notes.
7. For each VMA passing the filter:
a. Write PT_LOAD program header.
b. Copy VMA memory contents to output, page by page.
Skip zero pages (pages that have never been faulted in).
Truncate if RLIMIT_CORE would be exceeded.
8. If umka.coredump_compress is enabled: wrap output in a zstd frame
(compression happens inline as pages are written, using streaming
compression with level 3).
9. Close output file or wait for pipe helper to exit.
10. Set the core-dumped flag in exit_code: exit_code |= 0x80.
(This is the bit tested by WCOREDUMP() in wait status.)
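Step 10's exit-code transformation can be illustrated directly. This is a sketch: the constants mirror the POSIX wait-status encoding (low 7 bits = signal number, bit 7 = core flag), and the function names are illustrative, not UmkaOS internals:

```rust
/// Sketch of the wait-status core flag from step 10: the low 7 bits hold
/// the terminating signal number, and bit 7 (0x80) marks "core dumped".
fn wait_status_for_fatal_signal(signo: i32, dumped: bool) -> i32 {
    let mut status = signo & 0x7f;
    if dumped {
        status |= 0x80; // the bit WCOREDUMP() tests
    }
    status
}

fn wcoredump(status: i32) -> bool { status & 0x80 != 0 }

fn main() {
    let st = wait_status_for_fatal_signal(11 /* SIGSEGV */, true);
    assert_eq!(st, 0x8b);
    assert!(wcoredump(st));
    assert!(!wcoredump(wait_status_for_fatal_signal(15, false)));
    println!("ok");
}
```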
8.2.2.7 /proc/PID/coredump_filter¶
A per-process bitmask readable and writable via /proc/PID/coredump_filter. The default
value is 0x33 (bits 0, 1, 4, 5 set: anonymous private, anonymous shared, ELF headers,
private huge pages). Stored in Process.coredump_filter: AtomicU32.
pub struct Process {
// ... existing fields ...
/// Coredump filter bitmask. Controls which VMA types are included in core
/// dumps. Default: 0x33 (anonymous private + anonymous shared + ELF headers
/// + private huge pages). Writable via /proc/PID/coredump_filter.
/// Bits defined in the PT_LOAD inclusion rules above.
pub coredump_filter: AtomicU32,
}
Writing to /proc/PID/coredump_filter requires either same-process access or
CAP_SYS_PTRACE. The value is inherited across fork() and preserved across exec().
8.2.2.8 Linux Compatibility Notes¶
| Topic | Detail |
|---|---|
| ELF core format | Binary-compatible with Linux; loadable by GDB, lldb, and the crash utility |
| core_pattern specifiers | All Linux specifiers supported (%p, %u, %g, %s, %t, %e, %E, %h) |
| Pipe mode (\| prefix) | Fully supported; helper receives the core on stdin |
| coredump_filter default | 0x33 (matches the Linux default) |
| suid_dumpable | Three modes (0, 1, 2) match Linux /proc/sys/fs/suid_dumpable |
| WCOREDUMP() macro | Bit 7 of the wait status is set when a core was dumped |
| NT_PRSTATUS layout | Binary-identical to Linux struct elf_prstatus per architecture |
| NT_FILE format | Matches the Linux format (count, page_size, entries, filenames) |
| Arch-specific note types | Note type numbers match Linux for all 8 architectures |
| Truncation on RLIMIT_CORE | Notes are written first (maximizes usefulness of truncated dumps) |
8.3 ELF Binary Loader¶
This section specifies the kernel-side ELF binary loader — the component that
parses ELF executables, maps segments into the new address space, sets up the
dynamic linker, constructs the initial stack, and transfers control to userspace.
The dynamic linker itself (ld-linux.so / ld-musl) is a userspace component
and is not specified here; the kernel's only responsibility is to map it and
provide the information it needs via the auxiliary vector.
The high-level execve() flow is described in
Section 8.1. This section details
steps 3 (binary format dispatch) and 6 (segment loading / stack setup) of that
sequence.
8.3.1 Binary Parameters Struct (BinPrm)¶
The BinPrm struct holds the execution context for a binary being loaded. It is
allocated at the start of do_execveat_common() and destroyed (via Drop) when
the exec attempt finishes — either by consuming the fields on success or by
freeing resources on failure.
/// Execution context for a binary being loaded.
///
/// Lifetime: allocated at exec start, consumed at the point of no return.
/// If exec fails before the PNR, `Drop` frees the pre-allocated new mm
/// and any other resources held by this struct.
pub struct BinPrm {
/// The binary file being executed.
pub file: Arc<OpenFile>,
/// The interpreter binary (e.g., ld-linux.so), if PT_INTERP was found.
pub interpreter: Option<Arc<OpenFile>>,
/// New mm allocated before the point of no return. Consumed by
/// `exec_mmap()` at the PNR. If exec fails before PNR, `Drop` calls
/// `mmput()` to free this mm.
pub mm: Option<Arc<MmStruct>>,
/// New credentials prepared for the exec (capability grants, setuid).
/// Committed at `setup_new_exec()` after the PNR.
pub cred: Arc<TaskCredential>,
/// Binary filename as seen by userspace (for /proc/[pid]/exe and auxv).
pub filename: KernelString,
/// Actual executed binary name (may differ if script interpreter).
pub interp: KernelString,
/// Argument count.
pub argc: u32,
/// Environment variable count.
pub envc: u32,
/// Current top-of-stack pointer (grows downward during arg/env push).
pub p: usize,
/// AT_SECURE flag: non-zero if credentials changed during exec.
pub secureexec: u8, // 0 = false, 1 = true
/// Set to 1 after flush_old_exec (PNR). Error after this is fatal.
pub point_of_no_return: u8, // 0 = false, 1 = true
/// First 256 bytes of the binary for magic number detection.
pub buf: [u8; 256],
/// Saved stack rlimit (RLIMIT_STACK may be adjusted during exec).
pub rlim_stack: RlimitPair,
}
impl Drop for BinPrm {
fn drop(&mut self) {
// If the new mm was allocated but not consumed (exec failed before PNR),
// free it now. After PNR, mm is None (consumed by exec_mmap).
if let Some(mm) = self.mm.take() {
mmput(mm);
}
}
}
/// Replace the current task's address space with the new one prepared
/// for the exec'd binary. Called at the point of no return (step 4b in
/// [Section 8.1](#process-and-task-management--program-execution-exec)).
///
/// **Canonical definition**: The full `exec_mmap()` pseudocode with the
/// correct 7-step ordering (vfork completion → futex cleanup → tid pointer
/// clearing → mm swap → page table switch → mmu_notifier → mmput) is
/// specified in [Section 8.1](#process-and-task-management--program-execution-exec)
/// step 4b. This file does NOT duplicate it — see the canonical definition
/// for the complete implementation including the tid pointer clearing that
/// is critical for `CLONE_CHILD_CLEARTID` correctness.
///
/// After this function returns, the old address space is destroyed and
/// returning to the old binary is impossible.
// exec_mmap(task: &Task, bprm: &mut BinPrm) — see canonical definition.
8.3.2 ELF Header Validation¶
When the binfmt dispatch selects the ELF handler, the kernel reads the first
64 bytes (ELF64) or 52 bytes (ELF32) of the file and validates the header
fields. Any validation failure returns -ENOEXEC (before the point of no
return), allowing the binfmt chain to try subsequent handlers.
/// Validation steps applied to the ELF header (Ehdr).
/// All checks happen before any address space modification.
// 1. Magic number: first 4 bytes must be 0x7f 'E' 'L' 'F'
assert!(ehdr.e_ident[EI_MAG0..=EI_MAG3] == [0x7f, b'E', b'L', b'F']);
// 2. Class: must match the running kernel's native word size.
// ELF64 (ELFCLASS64) on 64-bit kernels; ELF32 (ELFCLASS32) on 32-bit.
// Compat mode (32-bit binary on 64-bit kernel) is handled by the
// compat ELF loader, not this path.
assert!(ehdr.e_ident[EI_CLASS] == expected_class());
// 3. Data encoding: must match the architecture's byte order.
// ELFDATA2LSB (little-endian) for x86-64, AArch64, RISC-V, PPC64LE, LoongArch64.
// ELFDATA2MSB (big-endian) for s390x, PPC32 (big-endian mode).
assert!(ehdr.e_ident[EI_DATA] == expected_endian());
// 4. ELF version: must be EV_CURRENT (1).
assert!(ehdr.e_ident[EI_VERSION] == EV_CURRENT);
// 5. OS/ABI: accept ELFOSABI_NONE (0) or ELFOSABI_LINUX (3).
// Other values → ENOEXEC (allows binfmt_misc to catch them).
assert!(matches!(ehdr.e_ident[EI_OSABI], ELFOSABI_NONE | ELFOSABI_LINUX));
// 6. Type: must be ET_EXEC (static executable) or ET_DYN (PIE / shared object).
// ET_REL and ET_CORE are not executable.
assert!(matches!(ehdr.e_type, ET_EXEC | ET_DYN));
// 7. Machine: must match the running architecture.
assert!(ehdr.e_machine == arch::current::elf::EM_NATIVE);
// 8. Program header sanity:
// - e_phentsize must equal sizeof(Elf64_Phdr) (56) or sizeof(Elf32_Phdr) (32).
// - e_phnum must be > 0 and <= ELF_MAX_PHNUM (256, matching Linux).
// - e_phoff must be within the file size.
assert!(ehdr.e_phentsize == core::mem::size_of::<ElfPhdr>());
assert!(ehdr.e_phnum > 0 && ehdr.e_phnum <= 256);
Per-architecture e_machine values:
| Architecture | EM_NATIVE | Value |
|---|---|---|
| x86-64 | EM_X86_64 | 62 |
| AArch64 | EM_AARCH64 | 183 |
| ARMv7 | EM_ARM | 40 |
| RISC-V 64 | EM_RISCV | 243 |
| PPC32 | EM_PPC | 20 |
| PPC64LE | EM_PPC64 | 21 |
| s390x | EM_S390 | 22 |
| LoongArch64 | EM_LOONGARCH | 258 |
Compat mode (32-bit binaries on 64-bit kernels): Each 64-bit architecture
that supports 32-bit compat defines a secondary EM_COMPAT value. The compat
ELF loader accepts ELFCLASS32 + EM_COMPAT and uses 32-bit program headers,
32-bit auxiliary vector entries, and a 32-bit stack layout. Supported pairs:
| 64-bit Kernel | Compat e_machine | Compat class | Compat endianness |
|---|---|---|---|
| x86-64 | EM_386 (3) | ELFCLASS32 | ELFDATA2LSB |
| AArch64 | EM_ARM (40) | ELFCLASS32 | ELFDATA2LSB |
| PPC64LE | EM_PPC (20) | ELFCLASS32 | ELFDATA2LSB |
| s390x | EM_S390 (22) | ELFCLASS32 | ELFDATA2MSB |
PPC64LE compat mode accepts only little-endian 32-bit PPC binaries
(ELFDATA2LSB + ELFCLASS32 + EM_PPC). Big-endian PPC32 binaries
(produced by the powerpc-unknown-linux-gnu target) are not compatible
with PPC64LE. The compat ELF loader validates endianness to match the
host architecture's byte order.
Note: This compat path serves third-party little-endian PPC32 binaries (e.g., from Debian ppc64el multilib or powerpcle-unknown-linux-gnu toolchains). UmkaOS's own PPC32 target (powerpc-unknown-linux-gnu) is big-endian and incompatible with PPC64LE compat mode. No UmkaOS-produced PPC32 binaries will use this path.
RISC-V 64 and LoongArch64 do not support 32-bit compat binaries (no RV32 or LA32 compat layer).
8.3.3 Program Header Processing¶
After header validation, the kernel reads all program headers from the file
(e_phnum entries starting at e_phoff). Each program header of type
PT_LOAD describes a segment to map. Other header types provide metadata:
/// Program header types processed by the ELF loader.
/// Headers with unrecognized p_type are silently ignored.
pub const PT_LOAD: u32 = 1; // Loadable segment → map into address space
pub const PT_INTERP: u32 = 3; // Path to dynamic linker (interpreter)
pub const PT_NOTE: u32 = 4; // Note sections (ignored by loader, used by coredumps)
pub const PT_PHDR: u32 = 6; // Program header table location in memory
pub const PT_TLS: u32 = 7; // Thread-local storage template
pub const PT_GNU_EH_FRAME: u32 = 0x6474e550; // Exception handling frame (ignored by kernel)
pub const PT_GNU_STACK: u32 = 0x6474e551; // Stack permission flags
pub const PT_GNU_RELRO: u32 = 0x6474e552; // Read-only after relocation (handled by ld.so)
pub const PT_GNU_PROPERTY: u32 = 0x6474e553; // GNU property notes (CET, BTI)
Processing order:
1. Scan for PT_INTERP: If present, read the interpreter path from the file (a NUL-terminated string at p_offset, max 4096 bytes). This is the dynamic linker path (e.g., /lib64/ld-linux-x86-64.so.2). Validate that the path is absolute and does not exceed PATH_MAX. If the path cannot be opened or the interpreter itself fails ELF validation, return -ELIBBAD.
2. Scan for PT_GNU_STACK: Determines whether the stack should be executable. If p_flags & PF_X is set, the stack is mapped with PROT_EXEC. If PT_GNU_STACK is absent, the architecture default applies (non-executable on all UmkaOS-supported architectures). Modern toolchains always emit PT_GNU_STACK with PF_R | PF_W (no exec).
3. Scan for PT_GNU_PROPERTY: On architectures that support it, parse GNU property notes for:
   - GNU_PROPERTY_AARCH64_FEATURE_1_BTI — enable Branch Target Identification (AArch64 only, requires FEAT_BTI).
   - GNU_PROPERTY_X86_FEATURE_1_IBT — enable Indirect Branch Tracking (x86-64 CET, requires CR4.CET).
   - GNU_PROPERTY_X86_FEATURE_1_SHSTK — enable Shadow Stack (x86-64 CET).
4. Collect PT_LOAD segments: Gather all PT_LOAD headers for segment mapping (next section).
5. Record PT_TLS: If present, record the TLS template address, size, and alignment. The kernel does not process TLS itself — the values are passed to the dynamic linker via the program headers (accessible via AT_PHDR).
8.3.4 Segment Loading Algorithm¶
Each PT_LOAD segment is mapped into the new address space as one or more
VMAs. The algorithm handles page alignment, BSS regions, and permission
mapping.
/// Map a single PT_LOAD segment into the address space.
///
/// The segment occupies file bytes [p_offset, p_offset + p_filesz) and
/// virtual addresses [p_vaddr, p_vaddr + p_memsz). The region
/// [p_vaddr + p_filesz, p_vaddr + p_memsz) is the BSS (zero-filled).
///
/// For ET_DYN (PIE) binaries, a load_bias is added to all p_vaddr values.
/// For ET_EXEC binaries, load_bias = 0 (segments are at fixed addresses).
fn map_elf_segment(
mm: &mut MmStruct,
file: &OpenFile,
phdr: &ElfPhdr,
load_bias: usize,
) -> Result<(), ElfLoadError> {
let vaddr = phdr.p_vaddr as usize + load_bias;
let filesz = phdr.p_filesz as usize;
let memsz = phdr.p_memsz as usize;
// Page-align the mapping start downward and end upward.
let page_offset = vaddr & (PAGE_SIZE - 1);
let map_start = vaddr - page_offset;
let map_end = page_align_up(vaddr + memsz);
let file_offset = phdr.p_offset as usize - page_offset;
// Convert ELF permission flags to VMA protection bits.
let prot = elf_flags_to_prot(phdr.p_flags);
// PF_R → PROT_READ, PF_W → PROT_WRITE, PF_X → PROT_EXEC
// Step 1: File-backed mapping (covers [map_start, file_end)).
let file_end = page_align_up(vaddr + filesz);
if file_end > map_start {
do_mmap(mm, map_start, file_end - map_start,
prot, MAP_PRIVATE | MAP_FIXED, file, file_offset)?;
}
// Step 2: If memsz > filesz, the remainder is BSS (anonymous, zero-filled).
if map_end > file_end {
do_mmap(mm, file_end, map_end - file_end,
prot, MAP_PRIVATE | MAP_FIXED | MAP_ANONYMOUS, None, 0)?;
}
// Step 3: Zero the partial page between filesz and page boundary.
// The file-backed mmap may have mapped a full page that extends past
// p_filesz. The bytes between p_filesz and the page boundary must be
// zeroed to avoid leaking file content into BSS.
let partial_end = vaddr + filesz;
let partial_page_end = page_align_up(partial_end);
if partial_end < partial_page_end && partial_page_end <= map_end {
// Fault in the page, then zero [partial_end, partial_page_end).
// This is done via a temporary writable mapping or a direct
// memset after faulting the page.
zero_partial_page(mm, partial_end, partial_page_end - partial_end)?;
}
Ok(())
}
Segment validation rules (checked before mapping; any failure → -EINVAL):
- No overlap: PT_LOAD segments must not overlap in virtual address space after page alignment. Overlapping segments indicate a malformed binary.
- Monotonic addresses: Segments must be sorted by p_vaddr (ascending). The kernel does not sort them — a non-monotonic sequence is rejected.
- memsz >= filesz: The in-memory size must be at least as large as the file size (BSS can only extend the segment, not shrink it).
- Address range: p_vaddr + p_memsz (plus load_bias) must not exceed TASK_SIZE (the architecture-specific user address space limit).
- Alignment: p_align must be a power of two. p_vaddr and p_offset must be congruent modulo p_align (i.e., p_vaddr ≡ p_offset (mod p_align)).
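The validation rules can be sketched as a checker over the raw program-header fields. The `Phdr` struct, the hard-coded 4 KiB page mask, and the `TASK_SIZE` value below are illustrative stand-ins; the real loader validates `ElfPhdr` entries in place:

```rust
// Sketch of the PT_LOAD validation rules. Phdr and TASK_SIZE are
// hypothetical stand-ins for the kernel's ElfPhdr and per-arch limit.
const TASK_SIZE: u64 = 0x0000_7fff_ffff_f000;

struct Phdr { p_vaddr: u64, p_memsz: u64, p_filesz: u64, p_offset: u64, p_align: u64 }

fn validate_segments(phdrs: &[Phdr], load_bias: u64) -> Result<(), i32> {
    let mut prev_end = 0u64;
    for p in phdrs {
        // memsz >= filesz: BSS may only extend the segment.
        if p.p_memsz < p.p_filesz { return Err(-22); /* -EINVAL */ }
        // p_align must be a power of two; vaddr/offset congruent mod p_align.
        if p.p_align != 0 {
            if !p.p_align.is_power_of_two() { return Err(-22); }
            if p.p_vaddr % p.p_align != p.p_offset % p.p_align { return Err(-22); }
        }
        let start = (p.p_vaddr + load_bias) & !0xfff;                   // align down
        let end = (p.p_vaddr + load_bias + p.p_memsz + 0xfff) & !0xfff; // align up
        // Address range, monotonic ordering, and no post-alignment overlap.
        if end > TASK_SIZE || start < prev_end { return Err(-22); }
        prev_end = end;
    }
    Ok(())
}

fn main() {
    let ok = [Phdr { p_vaddr: 0x1000, p_memsz: 0x500, p_filesz: 0x500, p_offset: 0x1000, p_align: 0x1000 },
              Phdr { p_vaddr: 0x3000, p_memsz: 0x800, p_filesz: 0x100, p_offset: 0x3000, p_align: 0x1000 }];
    assert!(validate_segments(&ok, 0).is_ok());
    let bad = [Phdr { p_vaddr: 0x1000, p_memsz: 0x100, p_filesz: 0x200, p_offset: 0x1000, p_align: 0x1000 }];
    assert_eq!(validate_segments(&bad, 0), Err(-22));
    println!("ok");
}
```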
Load bias for PIE binaries (ET_DYN):
For position-independent executables, the kernel chooses a random base address
(load_bias) from the ASLR mmap region. The first PT_LOAD segment's
p_vaddr is typically 0 (or a small offset), so the actual load address is
load_bias + p_vaddr. The load_bias is computed once and applied uniformly
to all segments, preserving their relative layout.
8.3.5 Dynamic Linker Setup¶
If PT_INTERP is present, the kernel loads the dynamic linker as a second ELF
binary into the same address space:
1. Open the interpreter file at the PT_INTERP path.
2. Validate its ELF header (the same checks as the main binary: magic, class, endianness, e_machine). The interpreter must be ET_DYN (position-independent); ET_EXEC interpreters are rejected.
3. Choose a separate load bias for the interpreter (randomized, from the mmap region).
4. Map the interpreter's PT_LOAD segments using the same algorithm as above.
5. Record the interpreter's entry point (e_entry + interp_load_bias) — this becomes the initial instruction pointer, not the main binary's entry point.
The dynamic linker receives the main binary's program headers via AT_PHDR and
uses them to locate segments, perform relocations, resolve symbols, and
eventually jump to the main binary's entry point (AT_ENTRY).
If PT_INTERP is absent (statically linked binary), the kernel sets the
initial instruction pointer to the main binary's e_entry (plus load_bias
for PIE) directly.
8.3.6 Script Handler (#!)¶
If the first two bytes of the file are #!, the script handler processes the
file instead of the ELF handler:
/// Parse a #! (shebang) line. Maximum line length: 256 bytes (matching Linux).
/// Format: #!<optional-space><interpreter-path><optional-space><optional-single-arg>
///
/// Examples:
/// #!/bin/sh
/// #!/usr/bin/env python3
/// #! /bin/bash -e
fn parse_shebang(first_line: &[u8]) -> Result<(PathBuf, Option<&[u8]>), ElfLoadError> {
// 1. Skip "#!" prefix and optional leading whitespace.
// 2. Extract interpreter path (first non-whitespace token).
// 3. Extract optional single argument (all remaining text after
// the first space following the interpreter path, trimmed).
// Linux only supports ONE argument — "#!/usr/bin/env -S python3 -u"
// passes "-S python3 -u" as a single string to /usr/bin/env.
// 4. NUL bytes in the shebang line → ENOEXEC.
}
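A userspace-testable sketch of those parsing rules follows. The error type (plain `()`), the `String` return values, and the 256-byte cap handling are illustrative; the kernel version operates on the BinPrm buffer and returns -ENOEXEC:

```rust
/// Sketch of shebang parsing per the rules above: skip "#!" and leading
/// whitespace, take the first token as the interpreter path, and pass all
/// remaining trimmed text as ONE optional argument (Linux semantics).
fn parse_shebang(line: &[u8]) -> Result<(String, Option<String>), ()> {
    let line = &line[..line.len().min(256)]; // cap at 256 bytes
    if !line.starts_with(b"#!") { return Err(()); }
    // Stop at end of the first line; NUL bytes are rejected (ENOEXEC).
    let body = &line[2..];
    let body = &body[..body.iter().position(|&c| c == b'\n').unwrap_or(body.len())];
    if body.contains(&0) { return Err(()); }
    let s = core::str::from_utf8(body).map_err(|_| ())?.trim();
    if s.is_empty() { return Err(()); }
    match s.split_once(|c: char| c == ' ' || c == '\t') {
        Some((path, rest)) => {
            let rest = rest.trim();
            Ok((path.to_string(),
                if rest.is_empty() { None } else { Some(rest.to_string()) }))
        }
        None => Ok((s.to_string(), None)),
    }
}

fn main() {
    assert_eq!(parse_shebang(b"#!/bin/sh\n"), Ok(("/bin/sh".into(), None)));
    // One argument only: everything after the path is a single string.
    assert_eq!(parse_shebang(b"#!/usr/bin/env -S python3 -u\n"),
               Ok(("/usr/bin/env".into(), Some("-S python3 -u".into()))));
    assert_eq!(parse_shebang(b"#! /bin/bash -e\n"),
               Ok(("/bin/bash".into(), Some("-e".into()))));
    println!("ok");
}
```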
Recursive invocation: The kernel calls exec_binprm() which dispatches to
binfmt handlers. If a handler returns "try another format" (script #!
interpreter rewrite), exec_binprm() loops with the rewritten binary. A
local depth counter in exec_binprm() tracks the recursion level:
fn exec_binprm(bprm: &mut BinPrm) -> Result<(), ExecError> {
let mut depth: u32 = 0;
loop {
if depth > 5 {
return Err(ExecError::Loop); // -ELOOP
}
match try_binfmt_handlers(bprm)? {
BinfmtResult::Loaded => return Ok(()),
BinfmtResult::Rewrite => {
// Script handler rewrote bprm.file to the interpreter.
depth += 1;
continue;
}
}
}
}
The depth limit is depth > 5 (matching Linux fs/exec.c exec_binprm()), which permits depths 0 through 5 — six passes total: the original binary plus up to five binfmt rewrites. (The Linux source comment says "This allows 4 levels of binfmt rewrites"; the comment and the actual > 5 check count differently, and UmkaOS matches the check, not the comment.) The depth counter is a local variable, NOT a field in BinPrm — it exists only for the duration of one exec_binprm() call.
8.3.7 Initial Stack Layout¶
After segment loading, the kernel constructs the initial user stack. The stack
grows downward from a randomized base address. The layout is the standard
ELF ABI initial process stack, consumed by the C runtime's _start →
__libc_start_main sequence:
High addresses (stack base, randomized)
┌──────────────────────────────────────────┐
│ Random canary bytes (16 bytes, AT_RANDOM)│
│ Platform string (NUL-terminated) │ ← "x86_64", "aarch64", etc.
│ Executable pathname (NUL-terminated) │ ← AT_EXECFN points here
│ Environment strings (NUL-terminated) │
│ Argument strings (NUL-terminated) │
│ Padding (0-15 bytes for alignment) │
├──────────────────────────────────────────┤
│ AT_NULL (0, 0) │ ← auxiliary vector terminator
│ ... auxiliary vector entries ... │ ← (type, value) pairs
├──────────────────────────────────────────┤
│ NULL (0) │ ← envp terminator
│ envp[n-1] (pointer to env string n-1) │
│ ... │
│ envp[0] (pointer to env string 0) │
├──────────────────────────────────────────┤
│ NULL (0) │ ← argv terminator
│ argv[argc-1] (pointer to arg string) │
│ ... │
│ argv[0] (pointer to program name) │
├──────────────────────────────────────────┤
│ argc (integer) │ ← SP points here on entry
└──────────────────────────────────────────┘
Low addresses (stack grows downward)
Data sizes: On 64-bit architectures, argc is 8 bytes (padded to word
size), each pointer is 8 bytes, and each auxiliary vector entry is a pair of
8-byte values (16 bytes total). On 32-bit architectures (ARMv7, PPC32),
everything is 4 bytes / 8 bytes per auxv entry.
Stack alignment: The stack pointer after pushing all data must be aligned to the architecture's required alignment before transferring control to userspace:
| Architecture | Stack alignment on entry to _start |
|---|---|
| x86-64 | 16 bytes |
| AArch64 | 16 bytes |
| ARMv7 | 8 bytes |
| RISC-V 64 | 16 bytes |
| PPC32 | 16 bytes |
| PPC64LE | 16 bytes (quadword) |
| s390x | 8 bytes |
| LoongArch64 | 16 bytes |
The kernel inserts 0-15 bytes of padding between the string area and the auxiliary vector to achieve the required alignment.
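The padding computation can be sketched as follows. This is a simplified 64-bit model under stated assumptions: all slots below the strings are 8-byte words, and `stack_pad` and its parameters are illustrative names, not the kernel's stack-setup code:

```rust
/// Sketch: choose the 0-15 byte pad between the string area and the auxv
/// so the final stack pointer (pointing at argc) lands on the required
/// alignment. 64-bit model: every slot below the strings is one 8-byte word.
fn stack_pad(strings_bottom: u64, argc: u64, envc: u64, auxv_pairs: u64, align: u64) -> u64 {
    let word = 8;
    // argc + argv[] + NULL + envp[] + NULL + auxv (2 words per pair)
    let words_below = 1 + (argc + 1) + (envc + 1) + 2 * auxv_pairs;
    let unpadded_sp = strings_bottom - words_below * word;
    unpadded_sp % align // shifting everything down by this much aligns SP
}

fn main() {
    // 2 args, 10 env vars, 20 auxv pairs, strings ending at an odd address.
    let pad = stack_pad(0x7fff_ffff_e123, 2, 10, 20, 16);
    let sp = 0x7fff_ffff_e123u64 - pad - (1 + 3 + 11 + 40) * 8;
    assert_eq!(sp % 16, 0); // SP at argc is 16-byte aligned
    assert!(pad < 16);      // pad stays in the 0-15 byte range
    println!("pad = {}", pad);
}
```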
Stack size limits: The initial stack mapping is RLIMIT_STACK (default
8 MiB, matching Linux). The mapping is created with a guard page below it.
The total size of argv + envp strings + auxiliary data is bounded by
ARG_MAX (2 MiB on UmkaOS, one quarter of the default RLIMIT_STACK, matching the effective Linux limit; individual strings are additionally capped at Linux's MAX_ARG_STRLEN of 128 KiB).
If the combined argument + environment size exceeds this limit, execve()
returns -E2BIG before the point of no return.
8.3.8 Auxiliary Vector¶
The auxiliary vector (auxv) is an array of (type, value) pairs pushed onto
the initial stack immediately above the environment pointer array. It provides
the dynamic linker and C runtime with kernel-determined information that would
otherwise require syscalls to obtain.
/// Auxiliary vector entry (64-bit). Matches the Linux ELF ABI.
#[repr(C)]
pub struct Elf64Auxv {
/// Entry type (AT_* constant). AT_NULL (0) terminates the vector.
pub a_type: u64,
/// Entry value: integer, pointer, or flag depending on a_type.
pub a_val: u64,
}
const_assert!(size_of::<Elf64Auxv>() == 16);
/// 32-bit variant for compat mode.
#[repr(C)]
pub struct Elf32Auxv {
pub a_type: u32,
pub a_val: u32,
}
const_assert!(size_of::<Elf32Auxv>() == 8);
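Assembly of the vector can be sketched with the layout above. A minimal sketch: the `build_auxv` helper is hypothetical, and only the ordering and mandatory AT_NULL termination are the point (the kernel writes entries directly onto the user stack):

```rust
// Sketch of auxv assembly using the Elf64Auxv layout. build_auxv is a
// hypothetical helper; the AT_* values shown are the standard constants.
const AT_NULL: u64 = 0;
const AT_PAGESZ: u64 = 6;
const AT_CLKTCK: u64 = 17;

#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(C)]
struct Elf64Auxv { a_type: u64, a_val: u64 }

fn build_auxv(entries: &[(u64, u64)]) -> Vec<Elf64Auxv> {
    let mut v: Vec<Elf64Auxv> = entries.iter()
        .map(|&(a_type, a_val)| Elf64Auxv { a_type, a_val })
        .collect();
    v.push(Elf64Auxv { a_type: AT_NULL, a_val: 0 }); // mandatory terminator
    v
}

fn main() {
    let auxv = build_auxv(&[(AT_PAGESZ, 4096), (AT_CLKTCK, 100)]);
    assert_eq!(auxv.len(), 3); // two entries plus the AT_NULL terminator
    assert_eq!(auxv.last().unwrap().a_type, AT_NULL);
    assert_eq!(core::mem::size_of::<Elf64Auxv>(), 16);
    println!("ok");
}
```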
Entries populated by the kernel (in the order they appear on the stack):
| Constant | Value | Description |
|---|---|---|
| AT_SYSINFO_EHDR (33) | vDSO ELF base address | Pointer to the vDSO ELF header mapped into the process (Section 2.22) |
| AT_HWCAP (16) | CPU feature bitmask | Architecture-dependent hardware capability bits. Used by the dynamic linker for ifunc resolution and by libc for optimized routines |
| AT_HWCAP2 (26) | Extended CPU features | Extension of AT_HWCAP (AArch64, x86-64, PPC, s390x) |
| AT_PAGESZ (6) | Page size in bytes | System page size (4096 on all currently supported configurations) |
| AT_CLKTCK (17) | Clock ticks per second | CONFIG_HZ value, typically 100 or 250. Used by times(2) |
| AT_PHDR (3) | Program header address | Virtual address of the main binary's program header table in memory |
| AT_PHENT (4) | Program header entry size | sizeof(Elf64_Phdr) = 56, or sizeof(Elf32_Phdr) = 32 |
| AT_PHNUM (5) | Program header count | Number of program headers in the main binary |
| AT_BASE (7) | Interpreter base address | Load address of the dynamic linker (0 if statically linked) |
| AT_FLAGS (8) | 0 | Reserved, always 0 |
| AT_ENTRY (9) | Main binary entry point | e_entry + load_bias — the address the dynamic linker jumps to after relocation |
| AT_UID (11) | Real user ID | task.cred.uid |
| AT_EUID (12) | Effective user ID | task.cred.euid |
| AT_GID (13) | Real group ID | task.cred.gid |
| AT_EGID (14) | Effective group ID | task.cred.egid |
| AT_SECURE (23) | Secure mode flag | Non-zero if the binary has elevated privileges (capability grants, or real/effective UID/GID differ). Causes the dynamic linker to ignore LD_PRELOAD, LD_LIBRARY_PATH, etc. |
| AT_RANDOM (25) | Random bytes address | Pointer to 16 bytes of cryptographically random data on the stack. Used by glibc for stack canary initialization |
| AT_EXECFN (31) | Executable pathname | Pointer to the NUL-terminated pathname of the executed file (on the stack) |
| AT_PLATFORM (15) | Platform string | Pointer to a NUL-terminated architecture identification string (e.g., "x86_64", "aarch64", "v7l", "riscv64") |
| AT_NULL (0) | 0 | Terminator — marks the end of the auxiliary vector |
Architecture-specific entries (only present on relevant platforms):
| Constant | Architectures | Description |
|---|---|---|
| AT_SYSINFO (32) | x86-32 compat only | Entry point to the vsyscall page (not used on native 64-bit x86) |
| AT_BASE_PLATFORM (24) | PPC, MIPS | Real platform identifier (may differ from AT_PLATFORM) |
| AT_DCACHEBSIZE (19) | PPC | Data cache block size in bytes |
| AT_ICACHEBSIZE (20) | PPC | Instruction cache block size in bytes |
| AT_UCACHEBSIZE (21) | PPC | Unified cache block size in bytes |
| AT_L1I_CACHESIZE (40) | PPC | L1 instruction cache size |
| AT_L1I_CACHEGEOMETRY (41) | PPC | L1 instruction cache geometry (line size, associativity) |
| AT_L1D_CACHESIZE (42) | PPC | L1 data cache size |
| AT_L1D_CACHEGEOMETRY (43) | PPC | L1 data cache geometry |
| AT_L2_CACHESIZE (44) | PPC | L2 cache size |
| AT_L2_CACHEGEOMETRY (45) | PPC | L2 cache geometry |
| AT_L3_CACHESIZE (46) | PPC | L3 cache size |
| AT_L3_CACHEGEOMETRY (47) | PPC | L3 cache geometry |
| AT_HWCAP3 (29) | AArch64, x86-64 | Third HWCAP word (additional feature bits) |
| AT_HWCAP4 (30) | AArch64, x86-64 | Fourth HWCAP word |
| AT_RSEQ_FEATURE_SIZE (27) | All | Size of supported rseq features (restartable sequences) |
| AT_RSEQ_ALIGN (28) | All | Required alignment for the rseq area |
| AT_MINSIGSTKSZ (51) | All | Minimum signal stack size in bytes. Architecture-dependent: accounts for SVE/SME vector length on AArch64, XSAVE area size on x86-64, and the FP/SIMD register save area on all others. Required by glibc (since 2.34) for correct sigaltstack() allocation. Without this entry, programs using sigaltstack() with AArch64 SVE/SME may allocate too-small alternate signal stacks, causing stack overflow during signal delivery. Linux 5.14+. Cross-reference: Section 8.5 for the signal frame size that determines this value. |
UmkaOS-specific entries (negative a_type values — ignored by unmodified Linux libc):
| Constant | Value | Description |
|---|---|---|
| AT_UMKA_CAPSUMMARY (-0x0100) | CapSummaryPage address | Per-process capability summary page for fast negative permission checks (Section 2.22) |
| AT_UMKA_CGAUGE (-0x0101) | CgroupGaugePage address | Per-cgroup resource gauge for the process's initial cgroup (Section 2.22). Zero if root cgroup with no gauge |
| AT_UMKA_SCHEDHINT (-0x0102) | Feature flag (non-zero = supported) | Scheduler hint page availability. The page itself is per-task and mapped on demand via umka_map_sched_hint() (Section 2.22) |
| AT_UMKA_PROCID (-0x0103) | ProcIdentityPage address | Per-process identity page for fast getpid/getppid/get*id (Section 2.22) |
AT_SECURE computation: The kernel sets AT_SECURE = 1 if any of the
following conditions hold after the execve() credential transformation:
- euid != uid or egid != gid (setuid/setgid effect via capability grants)
- Capability grants were applied during exec
- An LSM policy marks the transition as "secure" (security_bprm_secureexec())
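The three conditions condense into a small predicate. A minimal sketch — `Cred` and its field names are illustrative stand-ins, not the actual UmkaOS credential type:

```rust
/// Illustrative credential snapshot taken after the execve() transformation.
pub struct Cred {
    pub uid: u32,
    pub euid: u32,
    pub gid: u32,
    pub egid: u32,
    pub caps_granted_by_exec: bool, // file capability grants applied during exec
    pub lsm_secure_exec: bool,      // security_bprm_secureexec() verdict
}

/// AT_SECURE value: 1 if the new image must distrust its environment
/// (dynamic linker ignores LD_PRELOAD, LD_LIBRARY_PATH, etc.).
pub fn at_secure(cred: &Cred) -> u64 {
    let id_mismatch = cred.uid != cred.euid || cred.gid != cred.egid;
    (id_mismatch || cred.caps_granted_by_exec || cred.lsm_secure_exec) as u64
}
```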
8.3.9 Address Space Layout and ASLR¶
The kernel randomizes the address space layout for every execve() call. Each
region receives independent randomization from the kernel's CSPRNG
(get_random_bytes()).
Address space regions (64-bit, top-down mmap layout):
0xFFFF_FFFF_FFFF_FFFF ┌────────────────────────────┐
│ Kernel space │
TASK_SIZE ├────────────────────────────┤
│ (guard gap) │
│ Stack (grows ↓) │ ← randomized base
│ (gap) │
│ mmap region (grows ↓) │ ← randomized base
│ - shared libraries │
│ - dynamic linker │
│ - vDSO + VVAR pages │
│ - anonymous mappings │
│ (gap) │
│ Heap / brk (grows ↑) │ ← after last PT_LOAD
│ (gap) │
│ ELF segments (PT_LOAD) │ ← randomized for PIE
│ (unmapped) │
0x0000_0000_0000_0000 └────────────────────────────┘
TASK_SIZE per architecture:
| Architecture | TASK_SIZE (user VA bits) |
|---|---|
| x86-64 | 0x0000_7FFF_FFFF_F000 (47-bit, or 56-bit with 5-level paging) |
| AArch64 | 0x0000_FFFF_FFFF_F000 (48-bit, or 52-bit with LPA) |
| RISC-V 64 (Sv48) | 0x0000_7FFF_FFFF_F000 (47-bit) |
| PPC64LE | 0x0000_1000_0000_0000 (46-bit) |
| s390x | 0x0002_0000_0000_0000 (49-bit) |
| LoongArch64 | 0x0000_7FFF_FFFF_F000 (47-bit) |
| ARMv7 (32-bit) | 0xBF00_0000 (~3 GiB) |
| PPC32 (32-bit) | 0xC000_0000 (3 GiB) |
ASLR entropy per region:
UmkaOS uses the maximum entropy supported by each architecture, matching the
linux-hardened configuration rather than Linux mainline defaults. The
randomization is controlled by the mmap_rnd_bits sysctl (writable by
CAP_SYS_ADMIN):
| Region | x86-64 | AArch64 | RISC-V 64 | PPC64LE | s390x | 32-bit (all) |
|---|---|---|---|---|---|---|
| mmap base | 32 bits | 33 bits | 32 bits | 32 bits | 32 bits | 16 bits |
| Stack mapping | 32 bits | 33 bits | 32 bits | 32 bits | 32 bits | 16 bits |
| Stack offset (within mapping) | 5 bits | 5 bits | 5 bits | 5 bits | 5 bits | 4 bits |
| PIE executable base | 32 bits | 33 bits | 32 bits | 32 bits | 32 bits | 16 bits |
| brk heap offset | 18 bits | 18 bits | 18 bits | 18 bits | 18 bits | 13 bits |
| vDSO | mmap region | mmap region | mmap region | mmap region | mmap region | mmap region |
Design choice: UmkaOS places the vDSO in the mmap region on all
architectures (matching linux-hardened and grsecurity), not near the stack
(as mainline Linux does on x86-64). This provides full mmap-level entropy for
the vDSO rather than the weak ~20 bits in mainline Linux.
ASLR entropy source: All randomization uses get_random_bytes() from the
kernel CSPRNG (ChaCha20-based), not prandom or any weaker generator.
Non-PIE executables (ET_EXEC): The executable is loaded at its fixed
p_vaddr addresses (no randomization). The stack, mmap region, and brk heap
are still randomized.
ASLR disable: The personality(ADDR_NO_RANDOMIZE) flag (Linux-compatible)
disables ASLR for the calling process and its children. Also controllable via
/proc/sys/kernel/randomize_va_space (0 = off, 1 = stack only, 2 = full).
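The per-region randomization reduces to sliding a region top downward by a page-aligned offset drawn from the entropy budget above. A minimal sketch — `randomize_base` and its `random` parameter are hypothetical; the kernel draws the random value from `get_random_bytes()`:

```rust
const PAGE_SHIFT: u32 = 12; // 4 KiB pages

/// Apply `rnd_bits` bits of page-aligned entropy below `top`, mirroring the
/// mmap-base randomization described above. `rnd_bits` must be < 64 (the
/// table caps it at 33); `random` stands in for the kernel CSPRNG output.
pub fn randomize_base(top: u64, rnd_bits: u32, random: u64) -> u64 {
    let mask = (1u64 << rnd_bits) - 1;
    let offset = (random & mask) << PAGE_SHIFT; // page-aligned downward slide
    top - offset
}
```

With `rnd_bits = 0` (ASLR disabled via `ADDR_NO_RANDOMIZE`) the mask is zero and the region lands at its fixed top.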
8.3.10 vDSO and VVAR Mapping¶
As the final step of address space setup (before stack construction), the kernel maps the vDSO and VVAR pages into the process:
- VVAR page (1 page, `PROT_READ`): Contains the `VvarPage` struct (Section 2.22) with timekeeping data (including `clock_tai_offset_sec` for `CLOCK_TAI`) updated by the kernel on every timer tick.
- VcpuPage (1 page, `PROT_READ | PROT_WRITE`, per-CPU): Contains per-CPU CSPRNG state for `getrandom()`, CPU identity for `getcpu()`, and CPU performance hints. All processes on the same CPU share the same physical page. Write permission is required for the vDSO to advance the CSPRNG consumption cursor. See Section 2.22.
- vDSO ELF (1-4 pages, `PROT_READ | PROT_EXEC`): The per-architecture vDSO shared library with fast-path implementations of `clock_gettime`, `gettimeofday`, `time`, `clock_getres`, `getcpu`, and `getrandom`.
All three are mapped in the mmap region at a randomized address. The vDSO base
address is recorded in AT_SYSINFO_EHDR. The VVAR page is placed at a fixed
negative offset from the vDSO (the vDSO code computes the VVAR address
relative to its own load address using a linker-time constant). The VcpuPage
address is internal to the vDSO (no auxv entry); the vDSO discovers it via a
fixed offset from the vDSO base or the vgetrandom_alloc() mechanism.
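The relative-offset discovery can be sketched as plain address arithmetic. The specific offsets below are illustrative assumptions (the real values are per-architecture linker-time constants); the point is that the vDSO needs no syscall and no auxv entry to find its data pages:

```rust
const PAGE_SIZE: u64 = 4096;

/// Hypothetical layout for this sketch: VVAR one page below the vDSO base,
/// VcpuPage two pages below. Real offsets are linker-time constants.
const VVAR_OFFSET_FROM_VDSO: i64 = -(PAGE_SIZE as i64);
const VCPU_OFFSET_FROM_VDSO: i64 = -2 * (PAGE_SIZE as i64);

/// vDSO code locating the VVAR page relative to its own load address
/// (the address published in AT_SYSINFO_EHDR).
pub fn vvar_addr(vdso_base: u64) -> u64 {
    vdso_base.wrapping_add(VVAR_OFFSET_FROM_VDSO as u64)
}

/// Same arithmetic for the VcpuPage (internal to the vDSO, no auxv entry).
pub fn vcpu_page_addr(vdso_base: u64) -> u64 {
    vdso_base.wrapping_add(VCPU_OFFSET_FROM_VDSO as u64)
}
```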
8.3.11 UmkaOS-Specific Page Mappings¶
After vDSO/VVAR mapping and before stack construction, the kernel maps UmkaOS-specific pages into the new address space. Since exec destroys the old mm (all user-space mappings gone) and creates a new mm, these pages must be explicitly re-mapped for every exec:
- CapSummaryPage (1 page, `PROT_READ`): Per-process capability summary for fast negative permission checks (Section 2.22). The same physical page is re-mapped into the new mm. The new VMA address is recorded in `AT_UMKA_CAPSUMMARY`.
- CgroupGaugePage (1 page, `PROT_READ`): Per-cgroup resource gauge for the process's current cgroup (Section 2.22). The physical page is determined by the task's current cgroup membership. Zero (no mapping) if the task is in the root cgroup with no gauge configured. Recorded in `AT_UMKA_CGAUGE`.
- ProcIdentityPage (1 page, `PROT_READ`): Per-process identity page for fast `getpid`/`getppid`/`get*id` without syscalls (Section 2.22). Same physical page, new VMA. Recorded in `AT_UMKA_PROCID`.
- SchedHintPage (optional): Mapped on demand via `umka_map_sched_hint()`. Not automatically mapped at exec; `AT_UMKA_SCHEDHINT` indicates availability.
These mappings are placed in the mmap region alongside the vDSO. The auxv entries
point to the new mapping addresses in the new mm. Failure to map any of these pages
after the point of no return is fatal (SIGKILL), matching the behavior for vDSO
mapping failure.
8.3.12 Error Handling¶
All errors before the point of no return (flush_old_exec) return the error
code to the caller — the original process continues executing unchanged.
Errors after the point of no return are fatal:
| Error | Phase | Recovery |
|---|---|---|
| Segment mapping failure (OOM) | After flush_old_exec | SIGKILL — the old address space is destroyed, no image to return to |
| Stack construction failure (OOM) | After flush_old_exec | SIGKILL |
| Interpreter not found (-ELIBBAD) | Before flush_old_exec | Return error, process continues |
| Bad ELF header (-ENOEXEC) | Before flush_old_exec | Return error, try next binfmt handler |
| Permission denied (-EACCES) | Before flush_old_exec | Return error |
| File too short / truncated | Before flush_old_exec | -ENOEXEC |
| Overlapping PT_LOAD segments | Before flush_old_exec | -EINVAL |
| p_vaddr + p_memsz > TASK_SIZE | Before flush_old_exec | -EINVAL |
| arg + env size exceeds ARG_MAX | Before flush_old_exec | -E2BIG |
Post-point-of-no-return invariant: Once flush_old_exec is called, the
execve() path must either succeed (return to userspace at the new entry
point) or kill the task with SIGKILL. There is no third option — the old
process image no longer exists.
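The two-phase discipline reduces to a single decision: the error code only matters before the point of no return. An illustrative sketch, not the kernel's actual types:

```rust
/// Outcome of an error on the execve() path. Before flush_old_exec the
/// caller's image is intact, so the errno is simply returned; after it,
/// the only remaining option is SIGKILL.
#[derive(Debug, PartialEq)]
pub enum ExecOutcome {
    ReturnError(i32), // caller continues with its old image
    FatalSigkill,     // old mm already destroyed; no image to return to
}

pub fn exec_error_outcome(past_point_of_no_return: bool, errno: i32) -> ExecOutcome {
    if past_point_of_no_return {
        ExecOutcome::FatalSigkill
    } else {
        ExecOutcome::ReturnError(errno)
    }
}
```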
8.4 Real-Time Guarantees¶
8.4.1 Beyond CBS¶
Section 7.6 provides CPU bandwidth guarantees via CBS (Constant Bandwidth Server). This ensures average bandwidth. Real-time workloads need worst-case latency bounds: interrupt-to-response always under a specific ceiling.
8.4.2 Design: Bounded Latency Paths¶
// umka-core/src/rt/mod.rs
/// Real-time configuration (system-wide, set at boot or runtime).
pub struct RtConfig {
/// Maximum interrupt latency guarantee (nanoseconds).
/// The kernel guarantees that ISR entry occurs within this bound
/// after the interrupt fires.
/// Default: 50_000 (50 μs). Achievable on x86 with careful design.
pub max_irq_latency_ns: u64,
/// Maximum scheduling latency for SCHED_DEADLINE tasks (nanoseconds).
/// The kernel guarantees that a runnable DEADLINE task is scheduled
/// within this bound.
/// Default: 100_000 (100 μs).
pub max_sched_latency_ns: u64,
/// Preemption model.
pub preemption: PreemptionModel,
}
#[repr(u32)]
pub enum PreemptionModel {
/// Voluntary preemption (default). Preemption at explicit preempt points.
/// Lowest overhead, highest latency variance.
Voluntary = 0,
/// Full preemption. Preemptible everywhere except hard critical sections.
/// Moderate overhead, good latency bounds.
/// Equivalent to Linux PREEMPT (non-RT).
Full = 1,
/// RT preemption. All `spinlock_t` and `rwlock_t` instances become sleeping locks
/// (mapped to `rt_mutex`). `raw_spinlock_t` remains a true spinning lock with
/// interrupts disabled, used for scheduler internals, interrupt handling, and
/// hardware access paths that must not sleep.
/// Interrupts are threaded. Maximum preemptibility.
/// Equivalent to Linux PREEMPT_RT.
/// Highest overhead (~2-5% throughput), tightest latency bounds.
Realtime = 2,
}
RT Wakeup Latency Budget (x86-64, target ≤ 100 μs):
Component Worst case Basis
──────────────────────────────────────────────────────────────────────
Hardware interrupt delivery ~1 μs LAPIC delivery latency
IRQ handler + EOI ~3 μs Minimal ISR: ACK + flag set
Scheduler wakeup (try_to_wake_up) ~2 μs Runqueue lock + enqueue
Context switch overhead ~3 μs Register save/restore, FPU
TLB flush (if ASID switch) ~2 μs CR3 write + pipeline flush
WRPKRU domain switch ~0.1 μs 23 cycles @ 3 GHz
Cache warm (MADV_CRITICAL pages) ~10 μs ~1000 L3 misses × ~10ns/miss
Cross-socket IPI (if needed) ~5 μs LAPIC IPI round-trip
──────────────────────────────────────────────────────────────────────
Total, same-socket ~21 μs within 100 μs budget
Total, cross-socket ~26 μs within 100 μs budget
Caveats and configuration requirements:
- This budget assumes the RT task pins its working set with `madvise(MADV_CRITICAL)` (Section 4.1) and `mlock()` to prevent page faults on the RT path.
- SMI (System Management Interrupt) from firmware can add 50-500 μs stalls and is outside UmkaOS's control. For hard RT: configure `isolcpus=N nohz_full=N` and ensure BIOS/UEFI does not issue SMIs on isolated CPUs.
- Memory compaction (Section 4.1) is disabled on `nohz_full` CPUs.
- The 100 μs target is for SCHED_RT tasks. SCHED_DEADLINE tasks additionally have their CBS deadline enforcement; actual latency depends on declared parameters.
8.4.3 Key Design Decisions for RT¶
1. Threaded interrupts (when PreemptionModel::Realtime):
All hardware interrupts are handled by kernel threads.
Threads are schedulable — RT tasks can preempt interrupt handlers.
Cost: ~1 μs additional interrupt latency (thread switch).
Linux PREEMPT_RT does the same.
2. Priority inheritance for RtMutex locks:
When a low-priority task holds an RtMutex needed by a high-priority task,
the low-priority task inherits the high-priority task's priority.
Prevents priority inversion (classic RT problem).
Cost: ~5-10 cycles per lock acquire (check/update priority).
Linux PREEMPT_RT does the same.
3. No unbounded loops in kernel paths:
Every loop has a bounded iteration count.
Memory allocation in RT context: from pre-allocated pools (no reclaim).
Page fault in RT context: fails immediately (no I/O wait).
Enforced by coding guidelines + Verus verification ([Section 24.4](24-roadmap.md#formal-verification-readiness)).
4. Deadline admission control (Section 7.1.4):
SCHED_DEADLINE tasks declare (runtime, period, deadline).
Kernel admits the task ONLY if it can guarantee the deadline.
If admission would violate existing guarantees: returns -EBUSY.
Same semantics as Linux SCHED_DEADLINE.
8.4.3.1 Priority Inheritance Protocol¶
8.4.3.1.1 Priority Number Convention¶
All base_priority and effective_priority fields in this section use the
Linux internal convention (also called "kernel priority"):
| Internal value | Meaning | User-facing sched_priority |
|---|---|---|
| 0 | Highest RT priority | 99 (sched_priority = 99 - internal) |
| 99 | Lowest RT priority | 0 |
| 100-139 | CFS/EEVDF (nice -20 to +19) | N/A (non-RT) |
The conversion is: internal_priority = 99 - sched_priority for RT tasks.
PI comparisons use >= to mean "lower or equal scheduling priority" — a
numerically higher internal value is a lower scheduling priority. For example,
owner.effective_priority >= max_waiter_prio means "the owner has equal or
lower priority than the top waiter."
IRQ threads use the internal convention: an IRQ thread at internal priority 49
(= sched_priority 50) is a mid-priority RT task.
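The conversion and the comparison direction can be pinned down in a few lines. A sketch under the convention above (helper names are illustrative):

```rust
/// User-facing SCHED_FIFO/SCHED_RR priority (higher = more urgent) to the
/// internal kernel priority (lower = more urgent) used by all PI fields
/// in this section.
pub fn internal_from_sched(sched_priority: i32) -> i32 {
    99 - sched_priority
}

pub fn sched_from_internal(internal: i32) -> i32 {
    99 - internal
}

/// PI boost check: the owner needs boosting when it is *less* urgent than
/// the top waiter, i.e. its internal value is numerically greater.
pub fn needs_boost(owner_internal: i32, top_waiter_internal: i32) -> bool {
    owner_internal > top_waiter_internal
}
```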
Priority inheritance applies exclusively to RtMutex (the real-time mutex with PI
support). Standard SpinLock and Mutex do not use PI: SpinLock is
non-preemptible (no scheduling occurs while spinning), and sleeping Mutex uses its
own priority boosting. RtMutex is the kernel primitive that replaces SpinLock and
Mutex throughout the kernel when PreemptionModel::Realtime is active.
/// Real-time mutex with priority inheritance support.
///
/// # Data structure choice: intrusive linked list, not BinaryHeap
///
/// The waiter list is stored as a **priority-sorted intrusive doubly-linked
/// list** with nodes embedded directly in `Task` (`Task::rt_waiter`), not as
/// a heap-allocated `BinaryHeap<RtMutexWaiter>`.
///
/// Rationale:
/// - `RtMutex::lock` is a `RawSpinLock` (held with preemption disabled).
/// Heap allocation inside a raw spinlock is prohibited: the allocator may
/// attempt to acquire a lock that is already held, causing deadlock.
/// - An intrusive list node (`RtMutexWaiter` embedded in `Task`) requires
/// zero heap allocation — the node storage comes from the blocked task's
/// own stack frame.
/// - Tasks cannot be freed while they are waiting (the task pins itself by
/// not returning from `rt_mutex_lock`), so intrusive node lifetimes are safe.
/// - A sorted doubly-linked list gives O(n) insert (n = waiter count, typically
/// 1-3 in production RT workloads) and O(1) remove-top (next owner on unlock),
/// which is optimal for the actual workload distribution.
///
/// # Invariants
/// - `owner` is null if and only if the mutex is unlocked.
/// - `waiters` is sorted by ascending internal `effective_priority` value
///   (the highest-priority waiter — numerically smallest — is at the list
///   head; it will be handed ownership first).
/// - The owner is at least as urgent as every waiter: the owner's effective
///   priority (internal value) is ≤ every waiter's priority (maintained by
///   `pi_propagate` on lock, `rt_mutex_unlock` on unlock).
/// - Every `RtMutexWaiter` node in `waiters` is embedded in a `Task` that is
/// currently blocked on this specific mutex.
pub struct RtMutex {
/// Current owner task (`null` if unlocked).
/// Written only while `lock` is held; readable with Acquire load.
pub owner: AtomicPtr<Task>,
/// Priority-sorted intrusive doubly-linked list of blocked waiters.
/// List head = highest-priority waiter (next owner on unlock).
/// Protected by `lock`.
pub waiters: IntrusiveList<RtMutexWaiter>,
/// Internal spinlock protecting `waiters` and `owner` transitions.
/// This is a raw spin (never yields, never a sleeping lock), held for
/// at most ~10–30 instructions during ownership handoff.
pub lock: RawSpinLock,
}
/// Priority inheritance waiter node. One node per `Task`; embedded directly
/// in `Task` as `Task::rt_waiter: Option<RtMutexWaiter>`.
///
/// Using an intrusive node embedded in `Task` instead of a heap-allocated
/// entry avoids all allocation inside `RawSpinLock` critical sections.
/// The node is valid for exactly the duration that the task is blocked on
/// an `RtMutex`; it is initialized before the lock attempt and cleared on
/// acquisition or timeout/signal.
pub struct RtMutexWaiter {
/// Intrusive list links (prev/next pointers into the mutex's waiter list).
pub links: IntrusiveListLinks,
/// The task that owns this waiter node (back-pointer for PI chain walk).
pub task: NonNull<Task>,
/// The task's effective priority when enqueued.
/// Updated in-place by `pi_propagate` if the task's priority changes while waiting.
pub effective_priority: i32,
}
/// Fields added to `Task` to support PI chain traversal and intrusive waiter nodes.
pub struct Task {
// ... (existing fields from Section 8.1.1) ...
/// Embedded waiter node for the RtMutex this task is currently waiting on.
/// `Some(node)` while blocked; `None` otherwise.
/// The node is initialized before `rt_mutex_lock()` sleeps and cleared on
/// acquisition, timeout, or signal delivery. Initialized inside the RtMutex's
/// `lock` critical section, so no additional synchronization is needed.
pub rt_waiter: Option<RtMutexWaiter>,
/// The RtMutex this task is currently blocked waiting to acquire, or
/// `None` if the task is not blocked on any RtMutex.
/// Written under the RtMutex's internal spinlock before the task sleeps.
/// Used by `pi_propagate` to follow the ownership chain.
pub blocked_on_rt_mutex: Option<NonNull<RtMutex>>,
/// The task's base (scheduler-assigned) priority.
///
/// Set by the scheduler when the task is created or when `sched_setparam`
/// changes its scheduling policy. This is the priority the task has before
/// any priority inheritance boosting. PI algorithms use this as the lower
/// bound when unwinding the PI chain.
pub base_priority: i32,
/// The task's current effective priority.
///
/// Normally equals `base_priority` (the scheduler-assigned priority).
/// Raised by PI when this task holds an `RtMutex` that a higher-priority
/// task is waiting on. Lowered back to `base_priority` (or the maximum of
/// all remaining held-mutex waiter priorities) when the blocking task
/// acquires the mutex and this task releases it.
pub effective_priority: i32,
}
PI propagation algorithm (chain walk):
Constants:
MAX_PI_CHAIN_DEPTH = 10 // maximum ownership chain length before
// declaring deadlock (matches Linux rt_mutex)
rt_mutex_lock(mutex, current_task):
mutex.lock.lock_spin()
if mutex.owner.load(Acquire).is_null():
// Uncontended: claim ownership directly.
mutex.owner.store(current_task, Release)
mutex.lock.unlock_spin()
return Ok(())
// Contended: initialize the intrusive waiter node embedded in current_task
// and insert it into the mutex's priority-sorted waiter list. Zero allocation.
current_task.rt_waiter = Some(RtMutexWaiter {
links: IntrusiveListLinks::new(),
task: NonNull::from(current_task),
effective_priority: current_task.effective_priority,
})
// Insert in priority order (O(n) but n is typically 1-3 in RT workloads).
mutex.waiters.insert_sorted(&mut current_task.rt_waiter.as_mut().unwrap().links)
// Record what we are blocked on *while still holding mutex.lock*, so that
// pi_propagate on another CPU can follow the ownership chain through us.
current_task.blocked_on_rt_mutex = Some(mutex)
pi_propagate(mutex, depth=0) // boost owner; may chain-propagate
mutex.lock.unlock_spin()
// Sleep until woken by rt_mutex_unlock.
current_task.sleep(WaitReason::RtMutex)
current_task.blocked_on_rt_mutex = None
current_task.rt_waiter = None // clear the intrusive node after wake
// On wake: we are the new owner (rt_mutex_unlock hands ownership directly).
return Ok(())
pi_propagate(mutex, depth):
if depth > MAX_PI_CHAIN_DEPTH:
// Potential deadlock or excessively deep PI chain (depth heuristic).
// Must roll back before returning the error:
// 1. Remove the current task's waiter from mutex.waiters.
// 2. Walk the PI chain backwards, restoring each boosted task's
// effective_priority to compute_effective_priority() (which
// considers all *other* held mutexes, excluding the deadlocked one).
// 3. Reposition restored tasks in the scheduler via update_priority().
// 4. Wake the blocked task with Err(LockError::Deadlock).
// Without this rollback, the waiter remains in mutex.waiters (leaked
// intrusive node → UAF when the task's stack is freed) and priorities
// remain artificially boosted (priority inversion of a different kind).
pi_deadlock_rollback(mutex, depth)
return Err(LockError::Deadlock)
owner_ptr = mutex.owner.load(Acquire)
if owner_ptr.is_null():
return Ok(()) // mutex became unlocked concurrently; nothing to boost
owner = &mut *owner_ptr
// The list head is the highest-priority waiter (numerically smallest
// internal value; the list is sorted ascending by internal value on insert).
// "max_waiter_prio" = maximum *scheduling* priority = minimum internal value.
max_waiter_prio = mutex.waiters.front().map_or(i32::MAX, |w| w.effective_priority)
if owner.effective_priority <= max_waiter_prio:
return Ok(()) // owner is already at or above the required priority; done
// Boost the owner: a smaller internal value is a higher scheduling priority.
owner.effective_priority = max_waiter_prio
scheduler::update_priority(owner) // reposition in runqueue if running/runnable
// Chain propagation: if the owner is itself blocked on another RtMutex,
// propagate the boosted priority to that mutex's owner as well.
if let Some(upstream_mutex) = owner.blocked_on_rt_mutex:
upstream_mutex.lock.lock_spin()
// Update our waiter entry in the upstream mutex with the new priority.
upstream_mutex.waiters.update_priority(owner, max_waiter_prio)
pi_propagate(upstream_mutex, depth + 1)
upstream_mutex.lock.unlock_spin()
return Ok(())
// Rollback helper: undo PI boosting on deadlock detection.
pi_deadlock_rollback(deadlock_mutex, chain_depth):
// Step 1: Remove the current task's waiter from the mutex where
// the deadlock was detected.
deadlock_mutex.waiters.remove(&current_task.rt_waiter)
// Step 2: Walk the PI chain from deadlock_mutex back toward the
// original mutex (depth levels). At each level, recalculate the
// owner's effective_priority from its remaining held mutexes.
walk_mutex = deadlock_mutex
for _ in 0..chain_depth:
owner = walk_mutex.owner.load(Acquire)
if owner.is_null():
break
// Recompute from base_priority + max waiter across all OTHER held mutexes.
owner.effective_priority = owner.compute_effective_priority()
scheduler::update_priority(owner)
// Move up the chain.
match owner.blocked_on_rt_mutex:
Some(upstream) => walk_mutex = upstream,
None => break,
// Step 3: Wake the task that attempted to lock the deadlocked mutex.
// The task's rt_mutex_lock() call returns Err(LockError::Deadlock).
current_task.blocked_on_rt_mutex = None
current_task.rt_waiter = None
current_task.wake(WakeReason::Deadlock)
Priority restoration on unlock:
rt_mutex_unlock(mutex, current_task):
mutex.lock.lock_spin()
// Remove current_task from the owner role.
mutex.owner.store(null, Release)
// Restore current_task's effective priority to the maximum of:
// (a) its base_priority (scheduler-assigned), and
// (b) the maximum waiter priority across all *other* RtMutexes it holds.
// This correctly handles the case where the task holds multiple mutexes.
current_task.effective_priority = current_task.compute_effective_priority()
scheduler::update_priority(current_task)
// Hand ownership to the highest-priority waiter — the list head — if any.
if let Some(next_waiter) = mutex.waiters.pop_front():
next_task = &mut *next_waiter.task
mutex.owner.store(next_task, Release)
// Wake the new owner. It will find itself the owner on return from sleep.
scheduler::wake_task(next_task)
mutex.lock.unlock_spin()
compute_effective_priority() iterates the task's HeldMutexes container
(Section 8.1) — a SmallVec-style structure with
8 inline NonNull<RtMutex> slots (one cache line) and a rare-case overflow Vec.
Updated on each rt_mutex_lock/rt_mutex_unlock, protected by the task's pi_lock.
Returns the most urgent of base_priority and the top waiter of each held mutex — in internal values, min(base_priority, min over held mutexes of their top-waiter priority).
In the common case (≤8 held mutexes), this is a linear scan of the inline array — 8 pointer dereferences, all in one cache line. For the rare >8 case, the overflow Vec is also scanned. No hard bound is imposed: the scan is O(n) where n is the number of currently held mutexes, but n is virtually always ≤8 in practice.
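The scan itself is a one-line fold. A minimal sketch under the internal convention, with the HeldMutexes container flattened to a slice of top-waiter priorities (illustrative, not the actual HeldMutexes API):

```rust
/// Sketch of compute_effective_priority() under the internal priority
/// convention (lower value = more urgent). `held_top_waiters` stands in
/// for the HeldMutexes scan: the top-waiter internal priority of each
/// RtMutex the task currently holds.
pub fn compute_effective_priority(base_priority: i32, held_top_waiters: &[i32]) -> i32 {
    // The task runs at the most urgent (numerically smallest) of its base
    // priority and every top waiter across its held mutexes.
    held_top_waiters.iter().copied().fold(base_priority, i32::min)
}
```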
Deadlock detection: PI chain propagation in pi_propagate tracks depth. At
depth == MAX_PI_CHAIN_DEPTH, if the ownership chain has not terminated, a potential
deadlock or excessively deep PI chain is detected (depth-based heuristic, matching
Linux rt_mutex behavior). rt_mutex_lock returns Err(LockError::Deadlock). The
kernel logs the chain (owner PIDs and mutex addresses) at WARN level before returning
the error to the caller. Note: a chain of length MAX_PI_CHAIN_DEPTH is not
necessarily a cycle — it could be legitimately deep nesting. True cycle detection
would require checking if the same mutex is visited twice during the walk, which is
more expensive. The depth heuristic is a pragmatic trade-off matching Linux's approach.
Scope: Standard SpinLock and Mutex in UmkaOS do not use PI. They are not sleeping
locks in the PREEMPT_RT sense: SpinLock disables preemption, so no scheduling event
can cause priority inversion while it is held. RtMutex is used wherever a kernel lock
may be held across a context switch, which in PreemptionModel::Realtime is most kernel
mutexes (converted from SpinLock by the RT build).
8.4.3.2 Threaded IRQ Thread Mapping¶
When PreemptionModel::Realtime is active, all device interrupt handlers run in
dedicated kernel threads rather than in hard-IRQ context. This allows RT tasks to
preempt interrupt handlers and gives the scheduler full visibility over IRQ handler
execution time. The mapping from IRQ number to handler thread is defined by IrqThread.
/// Kernel thread that services a single threaded hardware interrupt.
///
/// One `IrqThread` exists per `IrqAction` registered with `IRQF_THREAD`.
/// The thread sleeps on `wait` between interrupt deliveries. The hard-IRQ
/// top half sets `pending` and wakes the thread; the thread calls the
/// device's thread function and re-enables the IRQ line.
pub struct IrqThread {
/// The kernel task backing this IRQ thread.
/// Named `"irq/{irq}/{name}"` (e.g., `"irq/42/eth0"`).
pub task: Arc<Task>,
/// IRQ number this thread services.
pub irq: u32,
/// Descriptive name of the IRQ handler (from `request_irq`).
pub name: Arc<str>,
/// The registered IRQ action (contains the thread function pointer).
pub action: Arc<IrqAction>,
/// Wait queue: the thread sleeps here until the hard-IRQ top half wakes it.
pub wait: WaitQueue,
/// Set by the hard-IRQ top half; cleared by the thread bottom half.
/// `AtomicBool` allows lock-free set from IRQ context and load from
/// thread context.
pub pending: AtomicBool,
/// Scheduling priority of this IRQ thread.
/// Default: `SCHED_FIFO` priority 50. Adjustable via `chrt(1)` or by
/// passing `IRQF_THREAD_PRIORITY(n)` in `request_irq` flags.
    /// Priority 50 places IRQ threads above all `SCHED_OTHER` tasks and above
    /// low-priority RT tasks (sched_priority 1–49), but below high-priority
    /// RT tasks (sched_priority 51–99).
pub priority: u32,
/// CPU affinity mask: which CPUs may run this IRQ thread.
/// Initialised from `/proc/irq/{irq}/smp_affinity` at registration time;
/// adjustable at runtime via the same sysfs path.
pub affinity: CpuSet,
}
Thread creation — invoked by request_irq(irq, handler, flags, name, dev_id) when
flags includes IRQF_THREAD:
1. Allocate a new IrqThread with:
- task name: "irq/{irq}/{name}" (truncated to TASK_COMM_LEN - 1 = 15 visible chars)
- scheduling class: SCHED_FIFO, priority 50 (or IRQF_THREAD_PRIORITY(n))
- CPU affinity: all CPUs (matches current smp_affinity; adjustable later)
- pending: false
- wait: empty WaitQueue
2. Start the kernel thread. The thread body immediately executes:
loop {
wait_event(&irq_thread.wait, irq_thread.pending.load(Acquire))
irq_thread.pending.store(false, Release)
action.thread_fn(irq, action.dev_id) // device bottom-half handler
irq_chip.irq_unmask(irq) // re-enable the IRQ line
}
3. Register the IrqThread in the global irq_thread_table[irq].
If IRQF_SHARED is set and another IrqAction already exists for this IRQ,
each action gets its own IrqThread (one thread per registered handler).
Hard-IRQ → thread handoff (the two-phase split):
Phase 1: Hard-IRQ top half (runs in interrupt context, preemption disabled):
1. Acknowledge the interrupt at the interrupt controller (mask the line,
send EOI, or equivalent — platform-specific).
2. Run the "primary handler" (the fast part of the ISR: read a status
register, record the event, clear a flag). Return IRQ_WAKE_THREAD.
3. Set IrqThread::pending = true (Relaxed store; the wake_up below
provides the Release barrier via the WaitQueue spinlock).
4. wake_up(&irq_thread.wait) (wakes the sleeping IrqThread task).
5. Hard-IRQ exits; preemption re-enabled.
Phase 2: IRQ thread bottom half (runs as a schedulable kernel task):
1. Woken by WaitQueue::wake_up.
2. Loads IrqThread::pending (Acquire); confirms it is true.
3. Clears IrqThread::pending (Release store).
4. Calls action.thread_fn(irq, dev_id) — the device's actual handler
(DMA buffer processing, packet reception, block I/O completion, etc.).
This may block (sleep on mutexes, allocate memory) — it is a normal
kernel thread context.
5. Calls irq_chip.irq_unmask(irq) to re-enable the hardware IRQ line.
6. Returns to the wait_event loop at the top.
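The pending-flag handoff can be modeled in user space with std primitives. A sketch in which `Mutex` + `Condvar` stand in for the kernel `WaitQueue` and the names are illustrative, not kernel API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Condvar, Mutex};

/// User-space model of the hard-IRQ -> thread handoff: `pending` mirrors
/// `IrqThread::pending`, the Mutex+Condvar pair mirrors the WaitQueue.
pub struct IrqThreadModel {
    pending: AtomicBool,
    wait: Mutex<bool>,
    cv: Condvar,
}

impl IrqThreadModel {
    pub fn new() -> Self {
        Self { pending: AtomicBool::new(false), wait: Mutex::new(false), cv: Condvar::new() }
    }

    /// Phase 1 (top half): record the event, then wake the bottom half.
    pub fn raise(&self) {
        self.pending.store(true, Ordering::Release);
        let mut woken = self.wait.lock().unwrap();
        *woken = true;
        self.cv.notify_one();
    }

    /// Phase 2 (bottom half): wait for pending, consume it, run the handler.
    pub fn service<F: FnMut()>(&self, mut thread_fn: F) {
        let mut woken = self.wait.lock().unwrap();
        while !self.pending.load(Ordering::Acquire) {
            woken = self.cv.wait(woken).unwrap(); // tolerates spurious wakeups
        }
        self.pending.store(false, Ordering::Release);
        drop(woken);
        thread_fn(); // may block: this is ordinary thread context
    }
}
```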
Thread lifecycle:
| Event | Action |
|---|---|
| request_irq(IRQF_THREAD) | Create and start IrqThread; register in irq_thread_table |
| free_irq() | Call kthread_stop(irq_thread.task): set stop flag, wake the thread; thread exits its wait loop on the next iteration |
| IRQF_SHARED (multiple handlers) | Each IrqAction gets an independent IrqThread; all threads are woken by the hard-IRQ top half and each tests its own pending flag |
| CPU hotplug (CPU offline) | IRQ affinity is updated to exclude the offline CPU; if the IRQ thread is running on that CPU, the scheduler migrates it before the CPU is taken offline |
| PreemptionModel not Realtime | IRQ threads are not created; handlers run in hard-IRQ context as usual |
Scheduling: IRQ threads run at SCHED_FIFO priority 50 by default. This priority
was chosen so that:
- IRQ threads run before all SCHED_OTHER tasks (priority 0): device events are
processed promptly.
- IRQ threads yield to high-priority RT tasks (SCHED_FIFO priority 51–99): an RT
control loop at priority 80 can preempt any IRQ thread.
- The priority can be raised (e.g., to 70 for a network card in a soft-RT path) via
chrt -f 70 $(pgrep irq/42/eth0) or the IRQF_THREAD_PRIORITY(n) flag in
request_irq. Raising above 99 is rejected (EINVAL).
/proc/irq/{n}/smp_affinity: The affinity mask of IrqThread::task is updated
when the sysfs file is written, using sched_setaffinity() on the kernel thread. This
is the standard Linux interface; irqbalance(8) and tuna(8) work without
modification.
8.4.4 RT + Domain Isolation Interaction¶
The raw WRPKRU instruction takes ~23 cycles on modern Intel microarchitectures (~6ns at 4 GHz). On KABI call boundaries, the domain switch is unconditional (the caller always needs to switch to the callee's domain), so the switch cost is the raw WRPKRU cost: ~23 cycles. The performance budget (Section 1.3) uses this figure: 4 switches × ~23 cycles = ~92 cycles per I/O round-trip.
RT jitter analysis: In a tight RT control loop making KABI calls at 10kHz, each call requires a round-trip domain switch (out and back = ~46 cycles = ~12ns), accumulating to ~120μs/sec of jitter (10,000 × 12ns). See I/O path analysis below.
RT latency policy:
- Tier 0 drivers: run in the Core isolation domain; zero transition cost. RT-critical paths (interrupt handlers, timer callbacks) use Tier 0.
- Tier 1 drivers: ring buffer dispatch (Transport T1). Individual call latency ~50-120ns (ring submit + consumer wake + domain switch + vtable dispatch + completion). With batching at N≥12, amortized to ~23-80 cycles per op. Acceptable for soft-RT (audio, video). RT tasks requiring <10μs single-call determinism should use Tier 0.
- Tier 2 drivers: user-space; context switch cost (~1μs). Not for RT.
Why domain isolation does not cause priority inversion: Unlike mutex-based isolation, domain switching is a single unprivileged instruction (WRPKRU) that executes in constant time with no blocking, no lock acquisition, and no kernel involvement. A high-priority RT task switching domains cannot be blocked by a lower-priority task holding the domain. This is fundamentally different from process-based isolation (Tier 2) where IPC involves a context switch that can be delayed by scheduling.
Shared ring buffers in RT paths: When an RT task communicates with a Tier 1 driver via a shared ring buffer, the ring buffer memory is tagged with the shared PKEY (readable/writable by both core and driver domains). Accessing the ring buffer does not require a domain switch — the shared PKEY is always accessible. Only direct access to driver-private memory requires WRPKRU. Therefore, the typical RT I/O path is:
RT task → write command to shared ring buffer (no WRPKRU)
→ doorbell write to MMIO (requires WRPKRU → driver domain → WRPKRU back: ~12ns)
→ poll completion from shared ring buffer (no WRPKRU)
Total domain switch overhead per I/O op: ~12ns (one domain round-trip: two
WRPKRU instructions at ~23 cycles each = ~46 cycles = ~12ns at 4 GHz).
At 10kHz: 120μs/sec. At 1kHz: 12μs/sec. Negligible.
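The overhead arithmetic above can be reproduced with a small sketch. The cycle count and clock rate are this section's assumed figures, not measured values; integer division gives 11 ns where the text rounds 11.5 ns up to ~12 ns.

```rust
/// Illustrative back-of-envelope model for the domain-switch overhead
/// figures in this section. WRPKRU_CYCLES and CPU_HZ are assumptions
/// about the target microarchitecture, not measured constants.
const WRPKRU_CYCLES: u64 = 23; // ~23 cycles per WRPKRU (assumption)
const CPU_HZ: u64 = 4_000_000_000; // 4 GHz clock (assumption)

/// Nanoseconds for one domain round-trip (two WRPKRU instructions).
fn round_trip_ns() -> u64 {
    let cycles = 2 * WRPKRU_CYCLES; // out + back = ~46 cycles
    cycles * 1_000_000_000 / CPU_HZ // 46/4 = 11 ns (text: ~12 ns)
}

/// Total switch overhead per second at a given I/O rate, in nanoseconds.
fn overhead_ns_per_sec(io_rate_hz: u64) -> u64 {
    io_rate_hz * round_trip_ns()
}

fn main() {
    // 46 cycles at 4 GHz = 11.5 ns; integer math yields 11 ns.
    println!("round trip: {} ns", round_trip_ns());
    // At 10 kHz this accumulates to ~110 µs/sec with the 11 ns figure
    // (the text uses 12 ns per round-trip, giving ~120 µs/sec).
    println!("at 10 kHz: {} ns/sec", overhead_ns_per_sec(10_000));
}
```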
Preemption during domain switch: WRPKRU is a single instruction that cannot be preempted mid-execution. If a timer interrupt arrives between two WRPKRU instructions (e.g., switch to driver domain, then switch back), the interrupt handler saves and restores PKRU as part of the register context. The RT task resumes with its PKRU intact. No special handling is needed — this is the same as any register save/restore on interrupt.
8.4.5 CPU Isolation for Hard RT¶
Standard Linux RT practice — fully supported:
- isolcpus=2-3: Reserve CPUs 2-3; no normal tasks, no load balancing.
- nohz_full=2-3: Tickless on CPUs 2-3; no timer interrupts when idle or running a single RT task.
- rcu_nocbs=2-3: RCU callbacks offloaded from CPUs 2-3; no RCU processing on isolated CPUs.
Isolated CPUs have: no timer ticks, no RCU callbacks, no workqueues, no
kernel threads (except pinned ones). This is required for hard-RT workloads
(LinuxCNC, IEEE 1588 PTP, audio with <1ms latency). Cross-reference:
Section 22.8 lists isolcpus and nohz_full as supported.
8.4.6 Driver Crash During RT-Critical Path¶
If a Tier 1 driver crashes while an RT task depends on it:
Policy: immediate error notification; do NOT wait for recovery.
1. Domain fault detected → crash recovery starts (Section 11.7).
2. RT task blocked on the driver gets IMMEDIATE unblock with error:
- Pending I/O returns -EIO.
- Pending KABI calls return CapError::DriverCrashed.
- Signal SIGBUS delivered if task is in a blocking syscall.
3. RT task handles the error (application-specific failsafe mode).
4. Driver recovery (~100ms) happens in background.
5. RT task can resume normal operation after driver reloads.
Rationale: RT guarantees are more important than waiting for recovery.
An RT task must ALWAYS get a response within its deadline, even if that
response is an error. Blocking an RT task for 100ms violates the RT contract.
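The unblock policy can be sketched as the decision a pending RT wait resolves to. The names here (RtWaitOutcome, resolve_rt_wait, DriverState) are hypothetical illustrations, not the kernel's actual API:

```rust
/// Hypothetical sketch of how a blocked RT wait is resolved when the
/// driver it depends on crashes. A crashed or recovering driver never
/// blocks the RT task: the wait resolves immediately with an error so
/// the task can still meet its deadline.
#[derive(Debug, PartialEq)]
enum DriverState { Healthy, Crashed, Recovering }

#[derive(Debug, PartialEq)]
enum RtWaitOutcome {
    Completed,     // normal I/O completion
    Eio,           // pending I/O returns -EIO
    DriverCrashed, // pending KABI call returns CapError::DriverCrashed
}

fn resolve_rt_wait(state: DriverState, is_kabi_call: bool) -> RtWaitOutcome {
    match state {
        DriverState::Healthy => RtWaitOutcome::Completed,
        // Recovery (~100 ms) runs in the background; the RT task is
        // unblocked NOW with an error rather than waiting for it.
        DriverState::Crashed | DriverState::Recovering => {
            if is_kabi_call { RtWaitOutcome::DriverCrashed } else { RtWaitOutcome::Eio }
        }
    }
}

fn main() {
    assert_eq!(resolve_rt_wait(DriverState::Healthy, false), RtWaitOutcome::Completed);
    assert_eq!(resolve_rt_wait(DriverState::Crashed, false), RtWaitOutcome::Eio);
    assert_eq!(resolve_rt_wait(DriverState::Recovering, true), RtWaitOutcome::DriverCrashed);
}
```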
8.4.7 Linux Compatibility¶
Real-time interfaces are standard Linux:
SCHED_FIFO, SCHED_RR: sched_setscheduler() — supported
SCHED_DEADLINE: sched_setattr() — supported
/proc/sys/kernel/sched_rt_*: RT scheduler tunables — supported
/sys/kernel/realtime: "1" when PREEMPT_RT is active — supported
clock_nanosleep(TIMER_ABSTIME): deterministic wakeup — supported
mlockall(MCL_CURRENT|MCL_FUTURE): prevent page faults — supported
Existing RT applications (JACK audio, ROS2, LinuxCNC, PTP/IEEE 1588) work without modification.
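The standard RT loop built on these interfaces computes each wakeup as an absolute deadline (clock_nanosleep with TIMER_ABSTIME) so per-cycle wakeup jitter never accumulates into drift. The arithmetic can be sketched as follows; plain nanosecond counters stand in for CLOCK_MONOTONIC reads:

```rust
/// Sketch of the drift-free periodic-wakeup arithmetic that
/// clock_nanosleep(TIMER_ABSTIME) enables. A real task would read
/// CLOCK_MONOTONIC once at startup and then sleep to each deadline.
const NSEC_PER_SEC: u64 = 1_000_000_000;

/// The next deadline is computed from the PREVIOUS deadline, not from
/// the actual wakeup time, so wakeup jitter does not accumulate.
fn next_deadline(prev_deadline_ns: u64, period_ns: u64) -> u64 {
    prev_deadline_ns + period_ns
}

fn main() {
    let period = NSEC_PER_SEC / 1_000; // 1 kHz loop: 1 ms period
    let mut deadline = 5 * NSEC_PER_SEC; // arbitrary start time
    for _ in 0..1_000 {
        // ... sleep until `deadline` (TIMER_ABSTIME), run one control step ...
        deadline = next_deadline(deadline, period);
    }
    // After exactly 1000 periods of 1 ms, precisely 1 s has elapsed.
    assert_eq!(deadline, 6 * NSEC_PER_SEC);
}
```

Had each deadline been computed as "now + period" instead, every late wakeup would push all subsequent wakeups later, accumulating unbounded drift.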
NUMA-Aware RT Memory:
Hard real-time tasks must avoid remote NUMA access (unpredictable latency). Standard Linux practice applies:
The kernel enforces: when a process has SCHED_DEADLINE, SCHED_FIFO, or SCHED_RR priority AND
is bound to a NUMA node via set_mempolicy(MPOL_BIND), the memory allocator does NOT
fall back to remote nodes on allocation failure — it returns ENOMEM instead. This
prevents unpredictable remote-access latency spikes.
Additionally, NUMA balancing (automatic page migration based on access patterns) is disabled for RT-priority tasks. Automatic page migration adds unpredictable latency (~50-200μs per migrated page). RT tasks pin their memory explicitly.
8.4.8 Performance Impact¶
When PreemptionModel::Voluntary (default): zero overhead vs Linux. Same model.
When PreemptionModel::Full: ~1% throughput reduction. Same as Linux PREEMPT.
When PreemptionModel::Realtime: ~2-5% throughput reduction. Same as Linux
PREEMPT_RT. This is the unavoidable cost of deterministic scheduling — the
same cost any RT OS pays.
The preemption model is configurable at boot. Debian servers use Voluntary
(default). Embedded/RT deployments use Realtime.
8.4.9 Hardware Resource Determinism¶
Software scheduling (EEVDF, CBS, threaded IRQs, priority inheritance) guarantees CPU execution time. However, on modern multi-core SoCs, shared hardware resources — L3 caches, memory controllers, interconnect bandwidth — introduce unpredictable latency spikes that violate hard real-time deadlines regardless of CPU priority.
A high-priority RT task running on an isolated CPU can still miss its deadline if a background batch job on another core evicts the RT task's data from the shared L3 cache, or saturates the memory controller with streaming writes. This is the "noisy neighbor" problem at the hardware level.
UmkaOS addresses this by extending its Capability Domain model to physically partition shared hardware resources, using platform QoS extensions where available.
8.4.9.1 Cache Partitioning (Intel RDT / ARM MPAM)¶
Modern server CPUs expose hardware Quality of Service (QoS) mechanisms that allow the OS to assign cache and memory bandwidth quotas per workload:
- Intel Resource Director Technology (RDT): Available on Xeon Skylake-SP and later. Provides Cache Allocation Technology (CAT) for L3 partitioning and Memory Bandwidth Allocation (MBA) for memory controller throttling. Controlled via MSRs (IA32_PQR_ASSOC, IA32_L3_MASK_n). Up to 16 Classes of Service (CLOS).
- ARM Memory Partitioning and Monitoring (MPAM): Optional extension introduced in ARMv8.4-A (FEAT_MPAM). Provides Cache Portion Partitioning (CPP) and Memory Bandwidth Partitioning (MBP). Thread-to-partition assignment is via system registers (MPAM0_EL1, MPAM1_EL1), which set the PARTID for the executing thread. Actual resource limits (cache way bitmasks, bandwidth caps) are configured via MMIO registers in each Memory System Component (MSC) — e.g., MPAMCFG_CPBM for cache portions and MPAMCFG_MBW_MAX for bandwidth limits. Up to 256 Partition IDs (PARTIDs).
UmkaOS integrates these into the Capability Domain model (Section 9.1). Each Capability
Domain can optionally carry a ResourcePartition constraint:
// umka-core/src/rt/resource_partition.rs
/// Hardware resource partition assigned to a Capability Domain.
/// Only meaningful when the platform provides QoS extensions (RDT, MPAM).
/// On platforms without QoS support, this struct is ignored.
pub struct ResourcePartition {
/// L3 Cache Allocation bitmask.
/// Each set bit grants the domain access to one cache "way."
/// On Intel RDT, this maps to IA32_L3_MASK_n for the assigned CLOS.
/// On ARM MPAM, this maps to the cache portion bitmap for the assigned PARTID.
/// Example: 0x000F = ways 0-3 (exclusive to this domain).
pub l3_cache_mask: u32,
/// Memory Bandwidth Allocation percentage (1-100).
/// Throttles memory controller traffic generated by this domain.
/// On Intel MBA, this maps to the delay value for the assigned CLOS.
/// On ARM MPAM, this maps to the MBW_MAX control for the assigned PARTID.
/// 100 = no throttling. 50 = limit to ~50% of peak bandwidth.
pub mem_bandwidth_pct: u8,
/// Whether this partition is exclusive (no overlap with other domains).
/// When true, the kernel verifies that no other domain's l3_cache_mask
/// overlaps with this one. Allocation fails with EBUSY if overlap detected.
pub exclusive: bool,
}
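The `exclusive` invariant can be sketched as a plain mask check. `check_exclusive` and `PartitionError` are illustrative names, not the kernel's API; `Ebusy` stands in for the EBUSY error named above:

```rust
/// Sketch of the exclusivity rule: an exclusive partition's
/// l3_cache_mask must not overlap any cache mask already assigned to
/// another domain.
#[derive(Debug, PartialEq)]
enum PartitionError { Ebusy }

/// Validate a new partition mask against the masks of existing domains.
fn check_exclusive(
    new_mask: u32,
    exclusive: bool,
    existing: &[u32],
) -> Result<(), PartitionError> {
    if exclusive && existing.iter().any(|m| m & new_mask != 0) {
        return Err(PartitionError::Ebusy); // overlap with another domain
    }
    Ok(())
}

fn main() {
    let existing = [0x0FF0u32]; // best-effort domain holds ways 4-11
    // RT domain requests exclusive ways 0-3: no overlap, accepted.
    assert_eq!(check_exclusive(0x000F, true, &existing), Ok(()));
    // Ways 3-6 overlap ways 4-11: rejected with EBUSY.
    assert_eq!(check_exclusive(0x0078, true, &existing), Err(PartitionError::Ebusy));
    // Non-exclusive partitions may overlap freely.
    assert_eq!(check_exclusive(0x0078, false, &existing), Ok(()));
}
```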
Determinism strategy:
- RT Domains: Granted exclusive L3 cache ways (e.g., ways 0-3 on a 16-way cache). Their hot data is never evicted by other workloads. Memory bandwidth set to 100%.
- Best-Effort Domains: Restricted to the remaining L3 cache ways (e.g., ways 4-15) and throttled via MBA during contention (e.g., limited to 50% bandwidth).
- Discovery at boot: The kernel queries CPUID (Intel) or MPAM system registers (ARM) to discover the number of available cache ways and CLOS/PARTID slots. If the hardware does not support RDT/MPAM, the ResourcePartition constraint is silently ignored and a warning is logged.
Cache monitoring integration: RDT also provides Cache Monitoring Technology (CMT) and Memory Bandwidth Monitoring (MBM), which report per-CLOS cache occupancy and bandwidth usage. UmkaOS exposes these counters via the observability framework (Section 20.2) as stable tracepoints, enabling operators to verify that RT workloads remain within their allocated cache partition.
8.4.9.2 Strict Memory Pinning for RT Domains¶
Hard real-time tasks cannot tolerate page faults. A single page fault in a 10 kHz control loop adds 3-50 μs of jitter (TLB miss + page table walk + potential disk I/O), which can exceed the entire deadline budget.
UmkaOS provides strict memory pinning semantics for RT Capability Domains:
- Eager allocation: When an RT task calls mmap() or loads an executable, all physical frames are allocated and page table entries populated immediately (equivalent to MAP_POPULATE). No demand paging.
- Pre-faulted stacks: The kernel pre-faults the full stack allocation for RT threads at clone() time. Stack guard pages are still present but the usable stack region is fully backed by physical memory.
- Exempt from reclaim: Pages owned by an RT domain are never targeted by kswapd page reclaim (Section 4.1), never compressed by ZRAM (Section 4.12), and never swapped. The OOM killer will target non-RT domains first; it will only kill an RT task as a last resort after all non-RT tasks have been considered.
- NUMA-local enforcement: When an RT task is bound to a NUMA node via set_mempolicy(MPOL_BIND), the allocator returns ENOMEM rather than falling back to remote NUMA nodes. Remote NUMA access adds 50-200 ns of unpredictable latency per cache miss — unacceptable for hard RT. NUMA auto-balancing (automatic page migration) is disabled for RT-priority tasks.
These properties are activated automatically when a task has SCHED_FIFO,
SCHED_RR, or SCHED_DEADLINE policy AND is assigned to a Capability Domain
with an RtConfig (Section 8.4). They can also be requested explicitly via
mlockall(MCL_CURRENT | MCL_FUTURE), which is the standard Linux RT practice.
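The activation rule reduces to a simple predicate. The names here (Policy, strict_pinning_active) are illustrative, not the kernel's actual types:

```rust
/// Sketch of the activation rule above: strict pinning is automatic
/// when the task has an RT scheduling policy AND its Capability Domain
/// carries an RtConfig; it can also be requested explicitly via
/// mlockall(MCL_CURRENT | MCL_FUTURE).
#[derive(Clone, Copy)]
enum Policy { Other, Fifo, Rr, Deadline }

fn strict_pinning_active(
    policy: Policy,
    domain_has_rt_config: bool,
    mlockall_requested: bool,
) -> bool {
    let rt_policy = matches!(policy, Policy::Fifo | Policy::Rr | Policy::Deadline);
    // Automatic path: RT policy + RT domain. Explicit path: mlockall().
    (rt_policy && domain_has_rt_config) || mlockall_requested
}

fn main() {
    assert!(strict_pinning_active(Policy::Fifo, true, false));
    // An RT policy alone is not enough; the domain must carry RtConfig.
    assert!(!strict_pinning_active(Policy::Fifo, false, false));
    // Explicit mlockall() activates pinning regardless of policy.
    assert!(strict_pinning_active(Policy::Other, false, true));
}
```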
8.4.9.3 Time-Sensitive Networking (TSN)¶
For distributed real-time systems (industrial control, automotive Ethernet, robotics),
determinism must extend beyond the CPU to the network. UmkaOS integrates with hardware
Time-Sensitive Networking (IEEE 802.1) features via umka-net (Section 16.1):
- IEEE 802.1Qbv (Time-Aware Shaper): NICs with hardware TSN support expose gate control lists (GCLs) that schedule packet transmission at precise microsecond intervals. UmkaOS bypasses the software Qdisc layer for TSN-tagged traffic classes, programming the NIC's hardware scheduler directly via KABI. RT packets are never queued in software — they are placed directly in a hardware TX ring whose transmission gate opens at the scheduled time.
- IEEE 802.1AS (Generalized Precision Time Protocol): Hardware PTP timestamps from the NIC's clock are fed directly to the timekeeping subsystem (Section 7.8). The CLOCK_TAI system clock is synchronized to the PTP grandmaster with sub-microsecond accuracy. The CBS scheduler (Section 7.6) uses this PTP-synchronized timebase to align RT task wakeups with hardware transmission windows — the task wakes up, computes, and its output packet hits the NIC exactly when the 802.1Qbv gate is open.
- IEEE 802.1Qci (Per-Stream Filtering and Policing): Ingress traffic is filtered in hardware by stream ID. Non-RT traffic arriving on an RT-reserved stream is dropped at the NIC before it reaches the CPU, preventing interference with RT packet processing.
Architecture note: TSN support requires Tier 1 NIC drivers that implement the
TSN KABI extensions (gate control list programming, PTP clock read, stream filter
configuration). Standard NICs without TSN hardware operate normally but cannot provide
network-level determinism. The umka-net stack detects TSN capability at driver
registration via the device registry (Section 11.4).
8.5 Signal Handling¶
Signals are the primary asynchronous notification mechanism inherited from POSIX.
UmkaOS implements the full Linux-compatible signal model: 31 standard signals
(numbers 1–31), two kernel-reserved real-time slots (32–33), and 31 user-visible
real-time signals SIGRTMIN–SIGRTMAX (numbers 34–64, since glibc's NPTL reserves
the first two RT slots).
The Task struct carries per-task signal state in its signal_mask field
(Section 8.1); process-wide signal disposition is stored in Process::sighand.
8.5.1 Signal Table¶
Every signal has a default action and optionally a user-installed handler.
| Num | Name | Default | Description |
|---|---|---|---|
| 1 | SIGHUP | TERM | Hangup on controlling terminal or parent process death |
| 2 | SIGINT | TERM | Keyboard interrupt (Ctrl+C) |
| 3 | SIGQUIT | CORE | Keyboard quit (Ctrl+\) |
| 4 | SIGILL | CORE | Illegal CPU instruction |
| 5 | SIGTRAP | CORE | Trace/breakpoint trap |
| 6 | SIGABRT | CORE | abort(3) call |
| 7 | SIGBUS | CORE | Bus error (misaligned or unmapped access) |
| 8 | SIGFPE | CORE | Floating-point / arithmetic exception |
| 9 | SIGKILL | TERM | Unconditional termination (unblockable, uncatchable) |
| 10 | SIGUSR1 | TERM | User-defined signal 1 |
| 11 | SIGSEGV | CORE | Invalid virtual-memory reference |
| 12 | SIGUSR2 | TERM | User-defined signal 2 |
| 13 | SIGPIPE | TERM | Write to pipe with no reader |
| 14 | SIGALRM | TERM | alarm(2) real-time timer expiry |
| 15 | SIGTERM | TERM | Graceful termination request |
| 16 | SIGSTKFLT | TERM | Coprocessor stack fault (legacy x86; rarely generated) |
| 17 | SIGCHLD | IGN | Child stopped, continued, or terminated |
| 18 | SIGCONT | CONT | Resume stopped process |
| 19 | SIGSTOP | STOP | Unconditional stop (unblockable, uncatchable) |
| 20 | SIGTSTP | STOP | Keyboard stop (Ctrl+Z); catchable |
| 21 | SIGTTIN | STOP | Background process read from controlling terminal |
| 22 | SIGTTOU | STOP | Background process write to controlling terminal (if TOSTOP) |
| 23 | SIGURG | IGN | Out-of-band data on socket |
| 24 | SIGXCPU | CORE | CPU time limit exceeded (setrlimit(RLIMIT_CPU)) |
| 25 | SIGXFSZ | CORE | File size limit exceeded (setrlimit(RLIMIT_FSIZE)) |
| 26 | SIGVTALRM | TERM | Virtual timer (user-time only) expiry |
| 27 | SIGPROF | TERM | Profiling timer expiry (user + system time) |
| 28 | SIGWINCH | IGN | Terminal window-size change |
| 29 | SIGIO / SIGPOLL | TERM | I/O now possible on fd (same number) |
| 30 | SIGPWR | TERM | Power failure / UPS notification |
| 31 | SIGSYS | CORE | Invalid system call argument (seccomp violation) |
| 32–33 | (reserved) | — | NPTL-internal RT signals (pthread_cancel, SIGSETXID); use RT queuing; not application-usable |
| 34 | SIGRTMIN | TERM | First user-visible real-time signal |
| 35–63 | SIGRTMIN+1 … SIGRTMAX-1 | TERM | Real-time signals; no predefined meaning |
| 64 | SIGRTMAX | TERM | Last user-visible real-time signal |
Signals 32 and 33 are within the RT signal range (32–64) and therefore use RT queuing semantics (per Section 8.5: append SigInfo to the per-signal queue). However, they are allocated internally to NPTL: signal 32 (the kernel's SIGRTMIN) is used for pthread_cancel delivery, and signal 33 (kernel SIGRTMIN+1) for SIGSETXID (thread credential synchronization). They are not available for application use via SIGRTMIN + N calculations — sigrtmin() returns 34 as the first application-usable RT signal.
Default action codes:
- TERM — terminate the process via do_exit().
- CORE — terminate and write a core dump (Section 8.2;
format matches Linux ELF core; respects RLIMIT_CORE and core_pattern).
- STOP — place the process in TaskState::Stopped; notify parent with SIGCHLD.
- CONT — if currently stopped, resume execution; otherwise ignore.
- IGN — discard silently.
SIGKILL and SIGSTOP cannot be caught, blocked, or ignored. All other signals may
have their disposition changed via sigaction().
Real-time signals (34–64) differ from standard signals in three ways:
1. Queued delivery: multiple instances of the same RT signal are individually
queued; standard signals collapse to at most one pending instance.
2. Ordered delivery: when multiple distinct RT signals are pending, the
lowest-numbered signal is delivered first (SIGRTMIN has highest priority).
3. Value attachment: RT signals sent via sigqueue() carry a SigVal payload
(integer or pointer), delivered to SA_SIGINFO handlers in SigInfo::si_value.
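These three properties can be modeled with a toy per-signal queue; the structure below is illustrative only and does not match the kernel's actual pending-signal bookkeeping:

```rust
use std::collections::VecDeque;

/// Toy model of the three RT-signal properties: per-signal FIFO queues,
/// lowest-number-first delivery, and an attached value payload.
struct RtPending {
    // queues[i] holds pending payload values (modeling SigVal) for
    // signal number 34 + i (SIGRTMIN = 34, SIGRTMAX = 64).
    queues: Vec<VecDeque<i32>>,
}

impl RtPending {
    fn new() -> Self { Self { queues: vec![VecDeque::new(); 31] } }

    /// sigqueue(): every instance is queued individually (property 1).
    fn send(&mut self, sig: u8, value: i32) {
        self.queues[(sig - 34) as usize].push_back(value);
    }

    /// Dequeue the next deliverable signal: lowest number first
    /// (property 2), FIFO within a signal, with its payload (property 3).
    fn deliver(&mut self) -> Option<(u8, i32)> {
        for (i, q) in self.queues.iter_mut().enumerate() {
            if let Some(v) = q.pop_front() {
                return Some((34 + i as u8, v));
            }
        }
        None
    }
}

fn main() {
    let mut p = RtPending::new();
    p.send(40, 1);
    p.send(34, 2);
    p.send(34, 3); // NOT collapsed: both instances of 34 stay queued
    assert_eq!(p.deliver(), Some((34, 2))); // lowest number, FIFO order
    assert_eq!(p.deliver(), Some((34, 3)));
    assert_eq!(p.deliver(), Some((40, 1)));
    assert_eq!(p.deliver(), None);
}
```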
8.5.2 Signal Data Structures¶
8.5.2.1.1 SigAction¶
SigAction describes the disposition of a single signal, corresponding to the POSIX
struct sigaction. It is stored in the per-process signal handler table
(Process::sighand), indexed by signal number minus 1.
/// Per-signal disposition record.
///
/// # Invariants
/// - When `sa_flags` contains `SA_RESETHAND`, `handler` is reset to
/// `SigHandler::Default` before the user handler is invoked (POSIX
/// one-shot semantics); the user handler never survives its own delivery.
/// - SIGKILL and SIGSTOP always have `handler == SigHandler::Default`; the
/// kernel enforces this and rejects `sigaction()` calls that attempt to
/// change them.
pub struct SigAction {
/// Signal handler or default/ignore disposition.
pub handler: SigHandler,
/// Signals to add to the thread's signal mask during handler execution.
/// The delivered signal itself is also masked unless SA_NODEFER is set.
pub sa_mask: SignalSet,
/// Modifier flags.
pub sa_flags: SaFlags,
/// Optional user-space trampoline (calls `sigreturn`). If None, the
/// kernel uses the vsyscall/vDSO trampoline.
pub sa_restorer: Option<unsafe extern "C" fn()>,
}
/// Signal handler variant.
pub enum SigHandler {
/// Perform the signal's default action (see [Section 8.5](#signal-handling--signal-table)).
Default,
/// Discard the signal.
Ignore,
/// Classic signal handler: receives signal number only.
Handler(unsafe extern "C" fn(sig: i32)),
/// Extended handler: receives signal number, siginfo pointer, and
/// ucontext pointer. Enabled by SA_SIGINFO.
SigAction(unsafe extern "C" fn(sig: i32, info: *mut SigInfo, ctx: *mut UContext)),
}
bitflags! {
/// Flags for `SigAction::sa_flags`.
pub struct SaFlags: u32 {
/// Do not send SIGCHLD when a child stops (SIGTSTP/SIGTTIN/SIGTTOU),
/// only when it terminates.
const SA_NOCLDSTOP = 0x0000_0001;
/// Do not create zombies: reap children automatically; `wait()` returns
/// ECHILD immediately.
const SA_NOCLDWAIT = 0x0000_0002;
/// Deliver extended siginfo to a `SigAction` handler.
const SA_SIGINFO = 0x0000_0004;
/// Invoke the handler on the alternate signal stack (see `sigaltstack`).
const SA_ONSTACK = 0x0800_0000;
/// Restart slow syscalls interrupted by this signal instead of
/// returning EINTR. See [Section 8.5](#signal-handling--sarestart-and-eintr) for the list of restartable
/// syscalls.
const SA_RESTART = 0x1000_0000;
/// Do not automatically mask the signal during its own handler.
const SA_NODEFER = 0x4000_0000;
/// Reset handler to SIG_DFL after the signal is delivered (POSIX
/// one-shot semantics).
const SA_RESETHAND = 0x8000_0000;
}
}
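The interaction of sa_mask, SA_NODEFER, and SA_RESETHAND at delivery time can be sketched with the flag values above as plain constants (the helper names are illustrative, not the kernel's delivery code):

```rust
/// Sketch of the mask/reset bookkeeping performed at signal delivery,
/// mirroring the SaFlags semantics documented above.
const SA_NODEFER: u32 = 0x4000_0000;
const SA_RESETHAND: u32 = 0x8000_0000;

/// Compute the signal mask in effect while the handler runs:
/// blocked | sa_mask, plus the delivered signal itself unless SA_NODEFER.
fn handler_mask(blocked: u64, sa_mask: u64, sig: u8, sa_flags: u32) -> u64 {
    let mut m = blocked | sa_mask;
    if sa_flags & SA_NODEFER == 0 {
        m |= 1u64 << (sig - 1); // mask the signal during its own handler
    }
    m
}

/// SA_RESETHAND: one-shot POSIX semantics; the disposition reverts to
/// Default before the handler is invoked.
fn reset_after_delivery(sa_flags: u32) -> bool {
    sa_flags & SA_RESETHAND != 0
}

fn main() {
    // Delivering SIGUSR1 (10) with empty sa_mask and no flags:
    // only bit 9 (signal 10) is added to the blocked set.
    assert_eq!(handler_mask(0, 0, 10, 0), 1u64 << 9);
    // With SA_NODEFER the delivered signal stays unmasked.
    assert_eq!(handler_mask(0, 0, 10, SA_NODEFER), 0);
    assert!(reset_after_delivery(SA_RESETHAND));
    assert!(!reset_after_delivery(0));
}
```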
8.5.2.1.2 SignalSet¶
A 64-bit bitmask representing a set of signal numbers. Bit n-1 (zero-indexed)
represents signal n. This layout is identical to the Linux sigset_t for 64-bit
architectures, ensuring ABI compatibility for sigprocmask(), sigaction(), and
sigwaitinfo().
/// Bitmask of signals (bit i-1 = signal i). Matches Linux 64-bit sigset_t.
#[repr(transparent)]
#[derive(Clone, Copy, Default, PartialEq, Eq)]
pub struct SignalSet(pub u64);
impl SignalSet {
/// Return a set containing exactly signal `sig` (1-indexed).
///
/// # Panics
/// Panics if `sig` is 0 or > 64.
pub fn mask(sig: u8) -> Self {
assert!(sig >= 1 && sig <= 64, "signal number out of range");
Self(1u64 << (sig - 1))
}
pub fn empty() -> Self { Self(0) }
pub fn full() -> Self { Self(u64::MAX) }
pub fn contains(self, sig: u8) -> bool {
debug_assert!((1..=64).contains(&sig), "signal number out of range");
self.0 & (1u64 << (sig - 1)) != 0
}
pub fn insert(&mut self, sig: u8) { debug_assert!((1..=64).contains(&sig)); self.0 |= 1u64 << (sig - 1); }
pub fn remove(&mut self, sig: u8) { debug_assert!((1..=64).contains(&sig)); self.0 &= !(1u64 << (sig - 1)); }
pub fn union(self, other: Self) -> Self { Self(self.0 | other.0) }
pub fn intersect(self, other: Self) -> Self { Self(self.0 & other.0) }
pub fn complement(self) -> Self { Self(!self.0) }
/// Return the raw bitmask value.
pub fn bits(self) -> u64 { self.0 }
/// True if any signal in this set is pending and unblocked.
pub fn has_pending_unblocked(self, mask: SignalSet) -> bool {
self.intersect(mask.complement()).0 != 0
}
/// Return the lowest signal number in the set, or None if empty.
pub fn lowest(self) -> Option<u8> {
if self.0 == 0 { None } else { Some(self.0.trailing_zeros() as u8 + 1) }
}
}
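A short usage sketch of the set operations; the type is re-declared minimally here so the snippet stands alone (the full definition is above):

```rust
/// Minimal re-declaration of SignalSet for a self-contained example.
#[derive(Clone, Copy, PartialEq, Debug)]
struct SignalSet(u64);

impl SignalSet {
    fn mask(sig: u8) -> Self { Self(1u64 << (sig - 1)) }
    fn union(self, o: Self) -> Self { Self(self.0 | o.0) }
    fn complement(self) -> Self { Self(!self.0) }
    fn intersect(self, o: Self) -> Self { Self(self.0 & o.0) }
    fn has_pending_unblocked(self, mask: SignalSet) -> bool {
        self.intersect(mask.complement()).0 != 0
    }
    fn lowest(self) -> Option<u8> {
        if self.0 == 0 { None } else { Some(self.0.trailing_zeros() as u8 + 1) }
    }
}

fn main() {
    // Pending: SIGTERM (15) and SIGHUP (1). Blocked: SIGHUP only.
    let pending = SignalSet::mask(15).union(SignalSet::mask(1));
    let blocked = SignalSet::mask(1);
    // SIGTERM is pending and unblocked, so delivery is needed.
    assert!(pending.has_pending_unblocked(blocked));
    // The deliverable subset is pending ∩ ¬blocked; delivery picks the
    // lowest signal number in that subset.
    let deliverable = pending.intersect(blocked.complement());
    assert_eq!(deliverable.lowest(), Some(15));
}
```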
8.5.2.1.3 SigInfo¶
SigInfo matches the Linux siginfo_t ABI exactly. It is passed to SA_SIGINFO
handlers and enqueued for RT signal delivery.
/// Signal origin information (matches Linux siginfo_t ABI).
///
/// The active union variant is determined by the signal number and `si_code`:
/// - SIGCHLD → `sigchld` field
/// - SIGSEGV, SIGBUS, SIGILL, SIGFPE → `sigfault` field
/// - SIGPOLL/SIGIO → `sigpoll` field
/// - RT signals sent via `sigqueue()` → `rt` field
/// - Signals sent via `kill()` or `tgkill()` → `kill` field
/// - POSIX timers → `timer` field
#[repr(C)]
pub struct SigInfo {
pub si_signo: i32,
pub si_errno: i32,
pub si_code: i32,
/// Arch-dependent implicit alignment padding before the union:
/// - 64-bit: si_signo(4) + si_errno(4) + si_code(4) + align_pad(4) + union(112) = 128.
/// (SigInfoSigfault contains usize → union has 8-byte alignment → 4 bytes padding)
/// - 32-bit: si_signo(4) + si_errno(4) + si_code(4) + union(116) = 128.
/// (usize is 4 bytes → union has 4-byte alignment → no padding)
pub _union: SigInfoUnion,
}
// SigInfo must be exactly 128 bytes on both 32-bit and 64-bit targets.
// This is the Linux ABI guarantee for siginfo_t.
const_assert!(core::mem::size_of::<SigInfo>() == 128);
/// `si_code` constants (source of the signal).
/// Complete list per Linux 6.1 `uapi/asm-generic/siginfo.h`.
pub mod si_code {
pub const SI_USER: i32 = 0; // kill() or raise()
pub const SI_KERNEL: i32 = 0x80; // sent by kernel
pub const SI_QUEUE: i32 = -1; // sigqueue()
pub const SI_TIMER: i32 = -2; // POSIX timer
pub const SI_MESGQ: i32 = -3; // POSIX message queue
pub const SI_ASYNCIO: i32 = -4; // AIO completion
// SIGSEGV codes
pub const SEGV_MAPERR: i32 = 1; // address not mapped
pub const SEGV_ACCERR: i32 = 2; // invalid permissions for mapped object
pub const SEGV_BNDERR: i32 = 3; // failed address bound checks
pub const SEGV_PKUERR: i32 = 4; // failed protection key checks
/// Asynchronous ARM MTE tag check failure (Linux 5.10+, siginfo.h).
pub const SEGV_MTEAERR: i32 = 8;
/// Synchronous ARM MTE tag check exception (Linux 5.18+, siginfo.h).
pub const SEGV_MTESERR: i32 = 9;
pub const SEGV_CPERR: i32 = 10; // control protection fault
// SIGBUS codes
pub const BUS_ADRALN: i32 = 1; // invalid address alignment
pub const BUS_ADRERR: i32 = 2; // non-existent physical address
pub const BUS_OBJERR: i32 = 3; // object-specific hardware error
// SIGILL codes
pub const ILL_ILLOPC: i32 = 1; // illegal opcode
pub const ILL_ILLOPN: i32 = 2; // illegal operand
pub const ILL_ILLADR: i32 = 3; // illegal addressing mode
pub const ILL_ILLTRP: i32 = 4; // illegal trap
pub const ILL_PRVOPC: i32 = 5; // privileged opcode
pub const ILL_PRVREG: i32 = 6; // privileged register
pub const ILL_COPROC: i32 = 7; // coprocessor error
pub const ILL_BADSTK: i32 = 8; // internal stack error
// SIGFPE codes
pub const FPE_INTDIV: i32 = 1; // integer divide by zero
pub const FPE_INTOVF: i32 = 2; // integer overflow
pub const FPE_FLTDIV: i32 = 3; // floating-point divide by zero
pub const FPE_FLTOVF: i32 = 4; // floating-point overflow
pub const FPE_FLTUND: i32 = 5; // floating-point underflow
pub const FPE_FLTRES: i32 = 6; // floating-point inexact result
pub const FPE_FLTINV: i32 = 7; // floating-point invalid operation
// SIGCHLD codes
pub const CLD_EXITED: i32 = 1;
pub const CLD_KILLED: i32 = 2;
pub const CLD_DUMPED: i32 = 3;
pub const CLD_TRAPPED: i32 = 4;
pub const CLD_STOPPED: i32 = 5;
pub const CLD_CONTINUED: i32 = 6;
// SIGPOLL codes
pub const POLL_IN: i32 = 1;
pub const POLL_OUT: i32 = 2;
pub const POLL_MSG: i32 = 3;
pub const POLL_ERR: i32 = 4;
pub const POLL_PRI: i32 = 5;
pub const POLL_HUP: i32 = 6;
}
/// Union payload of `SigInfo`. The union's alignment is determined by its
/// largest-aligned variant: on 64-bit, `SigInfoSigfault` contains `usize`
/// (8-byte aligned), so the union is 8-byte aligned. On 32-bit, `usize` is
/// 4-byte aligned, so the union is 4-byte aligned. This alignment difference
/// creates the implicit 4-byte padding between `si_code` and the union on
/// 64-bit (si_code at offset 8, size 4 → pad to offset 16 for 8-byte alignment)
/// vs no padding on 32-bit (union starts at offset 12, 4-byte aligned).
#[repr(C)]
pub union SigInfoUnion {
/// For kill()/tgkill(): sender PID and UID.
pub kill: SigInfoKill,
/// For POSIX timers.
pub timer: SigInfoTimer,
/// For sigqueue() / RT signals.
pub rt: SigInfoRt,
/// For SIGCHLD.
pub sigchld: SigInfoSigchld,
/// For SIGSEGV, SIGBUS, SIGILL, SIGFPE.
pub sigfault: SigInfoSigfault,
/// For SIGPOLL/SIGIO.
pub sigpoll: SigInfoSigpoll,
/// Raw padding to match Linux siginfo_t size (128 bytes total).
/// 64-bit: header(12) + align_pad(4) + union(112) = 128.
/// 32-bit: header(12) + union(116) = 128.
#[cfg(target_pointer_width = "64")]
pub _pad: [u8; 112],
#[cfg(target_pointer_width = "32")]
pub _pad: [u8; 116],
}
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<SigInfoUnion>() == 112);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<SigInfoUnion>() == 116);
#[repr(C)] #[derive(Clone, Copy)]
pub struct SigInfoKill { pub si_pid: i32, pub si_uid: u32 }
const_assert!(size_of::<SigInfoKill>() == 8);
#[repr(C)] #[derive(Clone, Copy)]
pub struct SigInfoTimer {
pub si_timerid: i32, pub si_overrun: i32, pub si_value: SigVal,
}
// 64-bit: 4 + 4 + 8 = 16. 32-bit: 4 + 4 + 4 = 12.
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<SigInfoTimer>() == 16);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<SigInfoTimer>() == 12);
#[repr(C)] #[derive(Clone, Copy)]
pub struct SigInfoRt { pub si_pid: i32, pub si_uid: u32, pub si_value: SigVal }
// 64-bit: 4 + 4 + 8 = 16. 32-bit: 4 + 4 + 4 = 12.
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<SigInfoRt>() == 16);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<SigInfoRt>() == 12);
/// FIX-024: si_utime/si_stime are C `long` (clock_t), not `i64`.
/// Uses `KernelLong` ([Section 19.1](19-sysapi.md#syscall-interface--kernellong--kernelulong)).
///
/// Layout on 64-bit (LP64):
/// offset 0: si_pid (i32, 4B)
/// offset 4: si_uid (u32, 4B)
/// offset 8: si_status (i32, 4B)
/// offset 12: _pad0 (4B, aligns si_utime to 8-byte boundary)
/// offset 16: si_utime (KernelLong = i64, 8B)
/// offset 24: si_stime (KernelLong = i64, 8B)
/// Total: 32 bytes
///
/// Layout on 32-bit (ILP32):
/// offset 0: si_pid (i32, 4B)
/// offset 4: si_uid (u32, 4B)
/// offset 8: si_status (i32, 4B)
/// offset 12: si_utime (KernelLong = i32, 4B)
/// offset 16: si_stime (KernelLong = i32, 4B)
/// Total: 20 bytes (no padding needed — all fields 4-byte aligned)
#[repr(C)] #[derive(Clone, Copy)]
pub struct SigInfoSigchld {
pub si_pid: i32, pub si_uid: u32, pub si_status: i32,
/// Explicit padding to align `si_utime` to 8-byte boundary on LP64.
/// On ILP32, `KernelLong` is 4 bytes and no padding is needed.
/// Without this field, `#[repr(C)]` inserts 4 bytes of implicit padding
/// on 64-bit — an information disclosure vector across the kernel/user
/// boundary (the padding bytes may contain uninitialized kernel stack data).
#[cfg(target_pointer_width = "64")]
pub _pad0: [u8; 4],
pub si_utime: KernelLong, pub si_stime: KernelLong,
}
/// 64-bit: 4 + 4 + 4 + 4(pad) + 8 + 8 = 32
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<SigInfoSigchld>() == 32);
/// 32-bit: 4 + 4 + 4 + 4 + 4 = 20
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<SigInfoSigchld>() == 20);
impl SigInfoSigchld {
/// Construct a `SigInfoSigchld` with correct conditional padding.
/// Callers MUST use this constructor instead of struct literal syntax
/// to avoid omitting the `#[cfg(target_pointer_width = "64")] _pad0`
/// field — which would compile on 32-bit but silently produce a
/// padding gap on 64-bit (information disclosure across kernel/user boundary).
pub fn new(pid: i32, uid: u32, status: i32, utime: KernelLong, stime: KernelLong) -> Self {
Self {
si_pid: pid,
si_uid: uid,
si_status: status,
#[cfg(target_pointer_width = "64")]
_pad0: [0u8; 4],
si_utime: utime,
si_stime: stime,
}
}
}
/// FIX-023: `_pad_lsb` size depends on pointer width. The padding aligns
/// `_u` to `usize` (= C `unsigned long`) boundary. On LP64 `si_addr` is 8
/// bytes, so `si_addr_lsb` sits at offset 8 and `_pad_lsb` needs 6 bytes
/// (offsets 10–15) to bring `_u` to offset 16. On ILP32 `si_addr` is 4
/// bytes, so `si_addr_lsb` sits at offset 4 and `_pad_lsb` needs 2 bytes
/// (offsets 6–7) to bring `_u` to offset 8. Using `size_of::<KernelLong>() - 2`
/// captures this exactly: `8 - 2 = 6` on LP64, `4 - 2 = 2` on ILP32.
/// See [Section 19.1](19-sysapi.md#syscall-interface--kernellong--kernelulong).
#[repr(C)] #[derive(Clone, Copy)]
pub struct SigInfoSigfault {
/// Fault address.
pub si_addr: usize,
/// LSB of the faulting address (for SIGBUS BUS_MCEERR_AR).
pub si_addr_lsb: i16,
/// Explicit padding to usize alignment for the union field.
/// `size_of::<KernelLong>() - 2` = 6 on LP64, 2 on ILP32.
#[cfg(target_pointer_width = "64")]
pub _pad_lsb: [u8; 6],
#[cfg(target_pointer_width = "32")]
pub _pad_lsb: [u8; 2],
/// Discriminated by si_code: SEGV_BNDERR uses addr_bnd, SEGV_PKUERR uses pkey.
pub _u: SigFaultUnion,
}
#[repr(C)]
#[derive(Clone, Copy)]
pub union SigFaultUnion {
/// For SEGV_BNDERR: lower and upper bounds.
pub addr_bnd: SigFaultBounds,
/// For SEGV_PKUERR: protection key that triggered the fault.
pub pkey: u32,
}
// kernel-internal, not KABI
#[repr(C)]
#[derive(Clone, Copy)]
pub struct SigFaultBounds {
pub si_lower: usize,
pub si_upper: usize,
}
/// 64-bit: si_addr(8) + si_addr_lsb(2) + _pad_lsb(6) + _u(16) = 32
/// 32-bit: si_addr(4) + si_addr_lsb(2) + _pad_lsb(2) + _u(8) = 16
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<SigInfoSigfault>() == 32);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<SigInfoSigfault>() == 16);
/// FIX-025: `si_band` is C `long` (not `i64`).
/// Uses `KernelLong` ([Section 19.1](19-sysapi.md#syscall-interface--kernellong--kernelulong)).
///
/// LP64 layout (16 bytes):
/// offset 0: si_band (KernelLong = i64, 8 bytes)
/// offset 8: si_fd (i32, 4 bytes)
/// offset 12: _pad0 ([u8; 4], explicit tail padding to 8-byte alignment)
/// Total: 16 bytes.
///
/// ILP32 layout (8 bytes):
/// offset 0: si_band (KernelLong = i32, 4 bytes)
/// offset 4: si_fd (i32, 4 bytes)
/// Total: 8 bytes. No tail padding (struct alignment = 4).
#[repr(C)] #[derive(Clone, Copy)]
pub struct SigInfoSigpoll {
pub si_band: KernelLong,
pub si_fd: i32,
/// Explicit tail padding to 8-byte alignment on LP64.
/// On ILP32, KernelLong is 4 bytes (same alignment as i32), no tail
/// padding needed. Without this field, the 4 bytes at offset 12-15
/// may contain uninitialized kernel stack data — an information
/// disclosure vector when SigInfo is copied to userspace during
/// signal delivery.
#[cfg(target_pointer_width = "64")]
pub _pad0: [u8; 4],
}
impl SigInfoSigpoll {
/// Construct a SigInfoSigpoll with padding zeroed.
/// Matches the SigInfoSigchld::new() pattern — all padding fields
/// are explicitly zeroed to prevent kernel data leakage to userspace.
pub fn new(si_band: KernelLong, si_fd: i32) -> Self {
Self {
si_band,
si_fd,
#[cfg(target_pointer_width = "64")]
_pad0: [0u8; 4],
}
}
}
/// 64-bit: si_band(8) + si_fd(4) + _pad0(4) = 16.
/// 32-bit: si_band(4) + si_fd(4) = 8.
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<SigInfoSigpoll>() == 16);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<SigInfoSigpoll>() == 8);
/// Value carried by RT signals and POSIX timers.
/// Size equals pointer width: 8 bytes on LP64, 4 bytes on ILP32.
/// Matches Linux `union sigval` layout (largest member determines size).
#[repr(C)] #[derive(Clone, Copy)]
pub union SigVal {
pub sival_int: i32,
pub sival_ptr: usize,
}
/// 64-bit: max(4, 8) = 8 bytes (sival_ptr is usize = 8)
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<SigVal>() == 8);
/// 32-bit: max(4, 4) = 4 bytes (sival_ptr is usize = 4)
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<SigVal>() == 4);
impl SigInfo {
/// Construct a kernel-origin `SigInfo` for the given signal.
///
/// Used by `force_sig()`, `send_signal_to_task()`, and other
/// kernel-internal callers that generate signals without a userspace
/// sender. Sets `si_code = SI_KERNEL` and zeroes the union payload.
///
/// For standard signals that collapse (at-most-one-pending), the dequeue
/// path may find `standard[signo] == None` (bit set in pending_mask but
/// no SigInfo stored — happens when a second instance arrives while the
/// first is already pending). In that case, `dequeue_signal()` synthesizes
/// a default SigInfo via this constructor.
pub fn kernel(sig: u8) -> Self {
Self {
si_signo: sig as i32,
si_errno: 0,
si_code: si_code::SI_KERNEL,
_union: SigInfoUnion { _pad: [0u8; {
#[cfg(target_pointer_width = "64")] { 112 }
#[cfg(target_pointer_width = "32")] { 116 }
}]},
}
}
}
8.5.3 Signal Delivery Algorithm¶
8.5.3.1.1 Pending Signal State¶
Each Task maintains a PendingSignals struct for thread-directed signals, and the
Process holds a separate PendingSignals for process-wide (shared) signals. Signal
delivery checks both: task.pending | task.process.shared_pending.
Standard signals (1-31) follow at-most-one-pending semantics: if already pending, the
duplicate is dropped. RT signals (32-64) are individually queued — each instance is
enqueued unless the per-signal-type cap or RLIMIT_SIGPENDING is reached (EAGAIN is
returned to the sender).
8.5.3.1.2 Two-Tier Pending Signal Storage¶
Problem with a naive per-signal pre-allocated queue: If each of the 65 signal
numbers gets a static ring buffer of 32 SigInfo entries (128 bytes each), the
PendingSignals struct consumes ~268 KB per task — dominated by RT signal ring
buffers that are almost never used. With thousands of threads, this wastes hundreds
of megabytes of kernel memory.
Problem with a single global signal queue (Linux's approach): RLIMIT_SIGPENDING
limits the total number of queued signals across all signal types for a UID. If process
A floods process B with SIGCHLD (rapid child spawning), the global queue fills and B
can no longer receive SIGUSR1, SIGTERM, or any other signal — even one-time signals
that should always get through.
UmkaOS two-tier design: Standard signals and RT signals have fundamentally different queuing semantics, so they use different storage strategies:
- Standard signals (1–31): at-most-one-pending per POSIX. A flat array of
  Option<SigInfo> — one slot per signal number. No ring buffer, no linked
  list. Cost: 32 * 136 bytes = ~4.3 KB on LP64, 32 * 132 = ~4.1 KB on ILP32
  (index 0 unused, 1-indexed for off-by-one safety).
- RT signals (32–64): queued delivery, multiple instances per signal number.
  Entries are slab-allocated on send and freed on dequeue. A per-task
  intrusive FIFO list holds all pending RT SigInfo records. Per-signal-number
  counters (rt_count[signo - 32]) enforce the per-type cap, preventing one RT
  signal from starving others — preserving the anti-flooding property without
  pre-allocating 33 * 32 * 128 bytes.
Memory comparison:
| Design | Per-task cost (idle) | Per-task cost (1 pending SIGUSR1) | Per-task cost (32 RT signals queued) |
|---|---|---|---|
| Old (static ring buffers) | ~268 KB | ~268 KB | ~268 KB |
| New (two-tier) | ~4.3 KB | ~4.3 KB | ~4.3 KB + 32 * 144 B slab = ~8.8 KB |
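The per-slot figures above can be sanity-checked with a layout mock. `MockSigInfo` below is a hypothetical stand-in that reproduces only the size and alignment assumptions of `SigInfo` on LP64 (12-byte header, 4 bytes padding, 112-byte 8-byte-aligned union) — it is not the real type:

```rust
use core::mem::size_of;

/// Hypothetical stand-in for `SigInfo`: same size (128 B) and alignment (8)
/// on LP64, none of the real fields beyond the header.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct MockSigInfo {
    si_signo: i32,
    si_errno: i32,
    si_code: i32,
    _union: [u64; 14], // 112-byte payload; u64 forces 8-byte alignment
}

fn main() {
    // 12-byte header + 4 bytes padding + 112-byte union = 128.
    assert_eq!(size_of::<MockSigInfo>(), 128);
    // No niche is available, so `Option` adds one alignment unit for its
    // discriminant: 128 + 8 = 136 on LP64 — the per-slot figure in the table.
    assert_eq!(size_of::<Option<MockSigInfo>>(), 136);
    // The 32-slot standard array: 32 * 136 = 4352 B, i.e. ~4.3 KB.
    assert_eq!(32 * size_of::<Option<MockSigInfo>>(), 4352);
}
```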
/// Per-task pending signal state.
///
/// Two-tier storage: standard signals use a fixed inline array (one
/// `Option<SigInfo>` per signal, since POSIX mandates at-most-one-pending).
/// RT signals use slab-allocated entries in a per-task intrusive FIFO list,
/// allocated only when an RT signal is actually sent.
///
/// Size: ~4.3 KB per task (vs. ~268 KB with the naive per-signal ring buffer
/// approach). RT signal entries are allocated from `SIGQUEUE_SLAB` on demand.
pub struct PendingSignals {
/// Standard signal info storage (signals 1–31).
/// **1-indexed**: `standard[N]` holds the `SigInfo` for signal number N.
/// Index 0 is unused (one `Option<SigInfo>` slot — 136 bytes on LP64,
/// 132 on ILP32 — wasted per task). This deliberate waste eliminates
/// off-by-one errors: the signal number maps directly to the array index
/// without any `N - 1` arithmetic. The alternative (`[Option<SigInfo>; 31]`
/// with 0-based indexing) saves that one slot but introduces a pervasive
/// `signo - 1` adjustment that is a proven source of kernel bugs.
/// Signals 32–64 (RT signals) are NOT stored here — they use the slab-allocated
/// RT queue below. No code path indexes `standard[32]` or above.
/// At most one entry per signal (POSIX at-most-one-pending).
///
/// `UnsafeCell` for interior mutability: `PendingSignals` is embedded in
/// `Task.pending` and `Process.shared_pending`, accessed through shared
/// references (`&Task` via `Arc<Task>`). All mutations are serialized by
/// `siglock` (SIGLOCK, level 40). SAFETY invariant: siglock MUST be held
/// for all access via `unsafe { &mut *self.standard.get() }`.
standard: UnsafeCell<[Option<SigInfo>; 32]>,
/// Head of the RT signal FIFO queue (slab-allocated `SigQueueEntry` nodes).
/// Entries are ordered by insertion time (oldest first = delivered first).
/// `None` when no RT signals are pending.
/// `UnsafeCell`: SAFETY invariant — siglock held for all access.
rt_head: UnsafeCell<Option<NonNull<SigQueueEntry>>>,
/// Tail of the RT signal FIFO queue, for O(1) append.
/// `None` when no RT signals are pending.
/// `UnsafeCell`: SAFETY invariant — siglock held for all access.
rt_tail: UnsafeCell<Option<NonNull<SigQueueEntry>>>,
/// Per-RT-signal-number pending count. Index 0 = signal 32 (SIGRTMIN),
/// index 32 = signal 64 (SIGRTMAX). Used for O(1) per-type overflow
/// checks: if `rt_count[signo - 32] >= PER_SIGNAL_RT_LIMIT`, the send
/// returns `EAGAIN`. This preserves the anti-flooding property: a flood
/// of one RT signal type cannot prevent delivery of other RT signals.
/// `UnsafeCell`: SAFETY invariant — siglock held for all access.
rt_count: UnsafeCell<[u8; 33]>,
/// Total pending count across standard + RT queues.
/// Used for O(1) RLIMIT_SIGPENDING enforcement without summing all queues.
/// `UnsafeCell`: SAFETY invariant — siglock held for all access.
total_count: UnsafeCell<u32>,
/// Bitmask of pending signals. Signal N occupies bit (N - 1):
/// signal 1 = bit 0, signal 64 = bit 63. Bit 0 thus represents
/// signal 1 (SIGHUP), not an unused slot.
/// Enables O(1) "find first pending signal" via trailing-zero-count.
/// Standard signals (1–31): bit set iff `standard[N].is_some()`.
/// RT signals (32–64): bit set iff `rt_count[N - 32] > 0`.
/// AtomicU64 rationale: (1) prevents torn reads on 32-bit architectures
/// where a plain u64 load could see half-old, half-new; (2) enables
/// lock-free snapshot reads in `sigpending(2)` and `next_pending_signal()`.
/// All mutations to pending_mask are serialized by the per-task siglock;
/// the AtomicU64 is NOT for mutation serialization — it is for read-side
/// torn-read prevention and lock-free snapshots.
pending_mask: AtomicU64,
}
// SAFETY: PendingSignals uses UnsafeCell fields that are always accessed under
// siglock. The AtomicU64 pending_mask is safe for lock-free reads. The struct
// is Send+Sync because all concurrent access is serialized by the siglock.
unsafe impl Send for PendingSignals {}
unsafe impl Sync for PendingSignals {}
/// Maximum queued RT signals per signal number per task.
/// Linux default: min(RLIMIT_SIGPENDING, 1000). UmkaOS: 32 per signal.
/// Stored in `rt_count` (u8), so must be <= 255.
pub const RT_SIGQUEUE_MAX: usize = 32;
/// Per-signal-number limit for real-time signal queues.
/// Standard signals (1–31) are capped at 1 pending instance (coalesced).
/// RT signals (32–64) are queued up to this limit per signal number.
/// The per-queue cap prevents a single RT signal type from consuming the
/// entire UID budget, while `PendingSignals::total_count` enforces the
/// aggregate RLIMIT_SIGPENDING across all queues.
pub const PER_SIGNAL_RT_LIMIT: usize = RT_SIGQUEUE_MAX;
/// Slab-allocated RT signal queue entry.
///
/// Allocated from `SIGQUEUE_SLAB` under `siglock` when an RT signal is sent.
/// Freed back to the slab when the signal is dequeued for delivery or when
/// `flush_signals()` drains all pending signals (called on `exit()` and
/// de-thread `SIGKILL`; NOT called on `exec()` — pending signals survive exec).
///
/// Slab allocation under spinlock is safe: slab `alloc()` is O(1) from the
/// per-CPU magazine layer and never sleeps. The slab is pre-warmed at boot
/// with enough objects for typical workloads.
///
/// **Allocation failure handling**: If `SIGQUEUE_SLAB.alloc()` returns `None`
/// (slab exhausted — extreme memory pressure), the RT signal is silently
/// dropped and `send_signal()` returns `EAGAIN` to the caller. This matches
/// Linux behavior: `__sigqueue_alloc()` returns NULL on failure, and
/// `send_signal_locked()` returns `-EAGAIN`. The signal is NOT coalesced
/// into a single pending instance (unlike standard signals) because RT
/// signal semantics require independent delivery of each instance.
/// Callers (userspace `sigqueue()`) receive EAGAIN and can retry.
pub struct SigQueueEntry {
/// The signal number (32–64) this entry belongs to.
signo: u8,
/// The signal info payload delivered to the handler.
info: SigInfo,
/// Intrusive next pointer for the per-task FIFO list.
/// `None` if this is the tail entry.
next: Option<NonNull<SigQueueEntry>>,
}
/// Global slab cache for `SigQueueEntry` allocations.
/// Object size: 144 bytes on LP64 (1 + 7pad + 128 + 8), 136 bytes on ILP32
/// (1 + 3pad + 128 + 4). Alignment: 8 (LP64), 4 (ILP32).
/// Allocated under `siglock` (IRQ-disabled); slab fast path is lock-free
/// (per-CPU magazine) so no lock ordering issues.
static SIGQUEUE_SLAB: OnceLock<SlabCache<SigQueueEntry>> = OnceLock::new();
Locking discipline: All mutations to PendingSignals (enqueue, dequeue, flush)
are serialized by a per-task IRQ-safe spinlock (siglock):
The siglock is defined as SignalHandlers::queue_lock (SIGLOCK, level 40).
It is shared by all threads with CLONE_SIGHAND — stored on SignalHandlers,
NOT on Task. The task.siglock() convenience accessor delegates to
task.process.sighand.queue_lock.
IRQ-safe: acquired with interrupts disabled because signals may be sent from interrupt context (e.g., SIGSEGV from page fault handler, SIGALRM from timer IRQ).
Locking order: SIGLOCK(40) is acquired inside SIGHAND_LOCK(30) and outside
RQ_LOCK(50) in the global lock ordering table (Section 3.4).
Signal delivery wakes the target task (try_to_wake_up), which acquires
RQ_LOCK(50) — so SIGLOCK MUST have a level between SIGHAND_LOCK(30) and
RQ_LOCK(50).
Operations requiring siglock:
- send_signal() — for standard signals: set standard[signo]; for RT signals:
slab-allocate a SigQueueEntry, append to rt_tail, increment
rt_count[signo - 32]. In both cases: increment total_count, set bit in
pending_mask, wake target if signal is unblocked.
- dequeue_signal() — for standard signals: take from standard[signo]; for RT
signals: walk rt_head to find the first entry matching signo, unlink it,
free to slab, decrement rt_count[signo - 32]. In both cases: decrement
total_count, clear bit in pending_mask when no entries remain for that signal.
- flush_signals() — clear standard[], walk and free all SigQueueEntry nodes
back to SIGQUEUE_SLAB, zero rt_count[] (called on exit() and de-thread
SIGKILL; NOT called on exec() — pending signals survive exec per POSIX).
- sigprocmask() — modify signal_mask and recheck pending (may trigger immediate
delivery of newly-unblocked signals).
The pending_mask field is AtomicU64 for lock-free read in sigpending(2) and
next_pending_signal(), but writes are always under siglock to maintain consistency
with standard[], rt_head/rt_count[], and total_count.
Finding the next pending signal is O(1):
/// Return the lowest-numbered unblocked pending signal, or None.
/// Uses a single hardware CTZ instruction — no iteration over queues.
fn next_pending_signal(pending: &PendingSignals, mask: &SignalSet) -> Option<u32> {
// Mask out blocked signals from the pending bitmask.
// Acquire is conservative; this function runs on the owning CPU during
// do_signal() return-to-user. Cross-CPU visibility is provided by the
// wake IPI path, not by Acquire/Release pairing on pending_mask.
let eligible = pending.pending_mask.load(Ordering::Acquire) & !mask.bits();
if eligible == 0 {
return None;
}
// Lowest-numbered eligible signal: bit 0 = signal 1 (SIGHUP), so add 1
// to bit position to get signal number. Matches SignalSet::lowest() (line 192).
let signo = eligible.trailing_zeros() + 1;
Some(signo)
}
RLIMIT_SIGPENDING enforcement: total_count is incremented on every
send_signal() call that enqueues a signal and checked against the UID's
sigpending_count limit (identical to Linux). The two-tier design does not
change the aggregate cap: a process with RLIMIT_SIGPENDING=100 can have at
most 100 pending signals total across all 64 signal numbers. What changes is
that a SIGCHLD flood consumes budget from total_count but cannot prevent
delivery of RT signals (which are independently tracked via rt_count[]).
EAGAIN is returned to the sender when total_count reaches the UID limit,
or when an RT signal's per-type counter reaches PER_SIGNAL_RT_LIMIT.
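A minimal model of the two admission checks on the RT send path, using hypothetical cut-down types (`RtCounters`, `try_enqueue_rt` are illustrative names; the real code performs these checks under SIGLOCK(40) before slab-allocating the entry):

```rust
const PER_SIGNAL_RT_LIMIT: u8 = 32; // per-RT-signal-number cap
const EAGAIN: i32 = 11;

/// Hypothetical cut-down counters from `PendingSignals`.
pub struct RtCounters {
    pub total_count: u32,   // all pending signals, standard + RT
    pub rt_count: [u8; 33], // index 0 = signal 32 (SIGRTMIN)
}

/// Admission check for an RT signal (32–64) against both limits.
pub fn try_enqueue_rt(c: &mut RtCounters, signo: u8, rlimit: u32) -> Result<(), i32> {
    let idx = (signo - 32) as usize;
    if c.total_count >= rlimit {
        return Err(EAGAIN); // aggregate RLIMIT_SIGPENDING budget exhausted
    }
    if c.rt_count[idx] >= PER_SIGNAL_RT_LIMIT {
        return Err(EAGAIN); // this signal number already at its per-type cap
    }
    c.rt_count[idx] += 1;
    c.total_count += 1;
    Ok(())
}

fn main() {
    let mut c = RtCounters { total_count: 0, rt_count: [0; 33] };
    // Flood signal 34 up to its per-type cap of 32 entries.
    for _ in 0..32 {
        assert!(try_enqueue_rt(&mut c, 34, 100).is_ok());
    }
    // The 33rd instance of signal 34 is refused...
    assert_eq!(try_enqueue_rt(&mut c, 34, 100), Err(EAGAIN));
    // ...but a different RT signal still gets through (anti-flooding).
    assert!(try_enqueue_rt(&mut c, 35, 100).is_ok());
}
```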
Signal delivery priority within the eligible set:
- Standard signals (1–31) have higher delivery priority than RT signals (32–64), matching Linux POSIX behavior.
- Within standard signals: lowest signal number delivered first (SIGHUP=1 before SIGINT=2, etc.).
- Within RT signals: lowest signal number delivered first (SIGRTMIN before SIGRTMIN+1, etc.), which preserves the Linux ordering for equal-priority RT signals.
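Because bit N-1 holds signal N, this entire priority order falls out of a single trailing-zero count: standard signals are numerically lower than RT signals, so they win automatically. A standalone sketch (hypothetical `lowest_eligible` helper, same logic as `next_pending_signal()` on a plain `u64`):

```rust
/// Lowest-numbered eligible signal from a pending mask (bit N-1 = signal N).
pub fn lowest_eligible(pending: u64, blocked: u64) -> Option<u32> {
    let eligible = pending & !blocked;
    if eligible == 0 {
        None
    } else {
        Some(eligible.trailing_zeros() + 1)
    }
}

fn main() {
    let sigterm = 1u64 << (15 - 1); // standard signal 15 (SIGTERM)
    let sigrt = 1u64 << (34 - 1);   // RT signal 34 (SIGRTMIN+2)
    // With both pending, the standard signal wins...
    assert_eq!(lowest_eligible(sigterm | sigrt, 0), Some(15));
    // ...unless it is blocked, in which case the RT signal is picked.
    assert_eq!(lowest_eligible(sigterm | sigrt, sigterm), Some(34));
    // Nothing eligible → None.
    assert_eq!(lowest_eligible(sigterm, sigterm), None);
}
```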
dequeue_signal() pops from standard[signo] for standard signals or walks
the rt_head FIFO for RT signals, clears bit signo in pending_mask when
no entries remain for that signal, and decrements total_count.
Performance note: Dequeuing a specific RT signal from the shared pending
queue is O(N) where N is the number of queued RT signals. This matches Linux
(collect_signal() in kernel/signal.c walks the list). For the common case
(few pending RT signals), this is acceptable. Systems with heavy RT signal
traffic (>1000 queued) should use io_uring or eventfd instead of POSIX signals.
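The O(N) walk can be sketched with an ordinary `Vec` standing in for the intrusive slab list (hypothetical `dequeue_rt` helper; the real code unlinks `NonNull<SigQueueEntry>` nodes and frees them to `SIGQUEUE_SLAB`):

```rust
/// Remove and return the oldest queued instance of `signo`.
/// Each entry is a stand-in for a `SigQueueEntry`: (signo, payload).
pub fn dequeue_rt(queue: &mut Vec<(u8, u32)>, signo: u8) -> Option<u32> {
    // O(N) scan: the first match in insertion order preserves FIFO delivery.
    let pos = queue.iter().position(|&(s, _)| s == signo)?;
    Some(queue.remove(pos).1)
}

fn main() {
    let mut q = vec![(34u8, 1u32), (35, 2), (34, 3)];
    // Oldest instance of signal 34 is delivered first, then the next.
    assert_eq!(dequeue_rt(&mut q, 34), Some(1));
    assert_eq!(dequeue_rt(&mut q, 34), Some(3));
    assert_eq!(dequeue_rt(&mut q, 34), None);
    // Signal 35's entry is unaffected by the walks above.
    assert_eq!(q, vec![(35, 2)]);
}
```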
sigpending(2) return value: reports pending_mask cast to a sigset_t
(O(1) — no iteration required).
Linux compatibility: signal delivery order and coalescing semantics are
identical to Linux. The two-tier storage is an internal implementation detail
invisible to userspace. The only observable difference: a SIGCHLD flood no longer
prevents other signal types from being queued — an improvement that Linux cannot
make without ABI-breaking changes to its task_struct.pending flat-list design.
8.5.3.1.3 Sending a Signal¶
Signal delivery originates from send_signal(target, sig, info, scope) where
scope determines the signal routing:
/// Signal delivery scope. Determines whether the signal targets a specific
/// thread or a process (and thus participates in thread selection).
/// Maps to Linux's PIDTYPE_PID / PIDTYPE_TGID / PIDTYPE_PGID / PIDTYPE_SID.
pub enum SignalScope {
/// Thread-directed: enqueue into `target.pending` (per-task queue).
/// Used by tgkill(), tkill(), fault-generated signals (SIGSEGV, SIGFPE).
Thread,
/// Process-directed: enqueue into `target.process.shared_pending`.
/// Thread selection algorithm (see below) picks the recipient.
/// Used by kill(), SIGCHLD from child, SIGPIPE from pipe write.
Process,
/// Process-group-directed: iterate all processes in the group.
/// Used by kill(-pgid, sig), terminal-generated SIGINT/SIGTSTP.
ProcessGroup,
/// Session-directed: iterate all process groups in the session.
/// Used by SIGHUP on controlling terminal disconnect.
Session,
}
/// Result of `handle_signal()` — tells the `do_signal()` dispatch loop
/// what happened so it can decide whether to break or continue.
///
/// Each variant corresponds to one branch of the default-action or
/// user-handler delivery path. The `do_signal()` match on this enum
/// determines loop control flow (see [Section 8.5](#signal-handling--checking-and-delivering-pending-signals)).
pub enum SignalAction {
/// A user-space signal frame was built on the user stack and the
/// return registers were redirected to the handler entry point.
/// The loop MUST break — delivering a second handler frame would
/// overwrite the first, corrupting saved register state.
UserHandler,
/// Default TERM action: `do_exit()` was called. Unreachable
/// (do_exit never returns).
DefaultTerm,
/// Default CORE action: core dump + `do_exit()`. Unreachable.
DefaultCore,
/// Default STOP action: the task entered `TASK_STOPPED` and
/// `schedule()` yielded the CPU. Execution resumed here after
/// SIGCONT. The loop should continue to re-check pending signals.
DefaultStop,
/// Default IGN action: signal was silently discarded.
DefaultIgn,
/// Default CONT action: stopped threads were resumed.
DefaultCont,
}
The scope parameter controls two key decisions in the algorithm:
- Step 5 (Enqueue): Thread scope enqueues into target.pending;
Process/ProcessGroup/Session scope enqueues into
target.process.shared_pending.
- Step 2b (SIGCONT cancellation): only iterates the thread group for
Process scope or broader (thread-directed SIGCONT does not cancel
pending stop signals on other threads).
The full algorithm:
0. No PF_EXITING check: send_signal() does NOT check PF_EXITING. Linux's
__send_signal_locked() does not have such a check either. An exiting task
naturally does not deliver signals because it never reaches the
return-to-user signal delivery checkpoint. PF_EXITING is checked only in
wants_signal() (for thread selection in complete_signal()) — not for signal
drop. This distinction is critical: a process-directed SIGKILL (e.g., from
the OOM killer) must still be queued to the process-wide pending set even if
one thread has PF_EXITING, so that other non-exiting threads can receive it.
1. Validate: reject signal 0 (used only for kill() permission check). Reject
numbers > 64.
2. Permission check: the sending task must have CAP_KILL in its capability
domain, or euid/uid of the sender must match uid/suid of the target process.
SIGCONT may always be sent within the same session.
2b. SIGCONT/stop mutual cancellation (POSIX requirement, verified against
Linux kernel/signal.c prepare_signal()). This step runs under the sighand
lock (Process::sighand lock, shared by all threads with CLONE_SIGHAND). In
UmkaOS, the per-task siglock is a per-sighand lock (like Linux): all threads
sharing the same SignalHandlers via CLONE_SIGHAND share the same siglock.
This is why the SIGCONT cancellation loop can safely access other threads'
pending sets — they are all protected by the same lock.
Locking clarification: UmkaOS uses two distinct signal-related locks, matching the canonical lock ordering table (Section 3.4):
- SIGHAND_LOCK(30) — SignalHandlers::lock — protects the signal disposition
  table (sigaction() reads and writes). Held briefly by sigaction(), exec()
  signal reset, and the send_signal() path when reading the action for the
  ignore-check and SIGCONT/stop cancellation. NOT IRQ-safe (never taken from
  interrupt context).
- SIGLOCK(40) — the per-sighand signal queue lock — protects PendingSignals
  mutations (enqueue, dequeue, flush), signal_mask, and group stop state.
  IRQ-safe (acquired with interrupts disabled because signals may be sent
  from interrupt context). Acquired under SIGHAND_LOCK(30) during signal
  delivery (nesting: 30 < 40). Chains to RQ_LOCK(50) via try_to_wake_up().
Both locks are stored in SignalHandlers (not in Task): all threads
sharing CLONE_SIGHAND share both lock instances. The per-task references
task.siglock and task.sighand_lock are convenience accessors that
delegate to task.process.sighand.{lock, queue_lock} respectively.
In Linux, sighand->siglock serves both roles (protecting both the
action table and the pending queue). UmkaOS splits these into two locks
to reduce contention: sigaction() does not need to serialize with
send_signal() queue mutations, only with other sigaction() calls.
- If sig == SIGCONT: acquire SIGHAND_LOCK(30) then SIGLOCK(40), then flush
all pending SIGSTOP (19), SIGTSTP (20), SIGTTIN (21), and SIGTTOU (22) from
both task.pending and process.shared_pending for every thread in the
thread group. For each flushed standard signal: clear standard[signo], clear
the corresponding bit in pending_mask, decrement total_count. Also clear
JOBCTL_STOP_PENDING on every thread.
// Acquire SIGHAND_LOCK(30) — needed for disposition consistency.
let _sighand_guard = target.process.sighand.lock.lock();
// Acquire SIGLOCK(40) — protects queue mutations and signal_mask reads.
let _queue_guard = target.process.sighand.queue_lock.lock();
let stop_mask: u64 = (1 << 18) | (1 << 19) | (1 << 20) | (1 << 21);
// SIGSTOP=19 SIGTSTP=20 SIGTTIN=21 SIGTTOU=22
// (bit N-1 for signal N, since bit 0 = signal 1)
// Interior mutability: flush_sigqueue_mask takes &PendingSignals (not &mut)
// and uses UnsafeCell internally. SAFETY: SIGLOCK(40) held.
for t in target.process.thread_group.iter() {
    flush_sigqueue_mask(&t.pending, stop_mask);
    t.jobctl.fetch_and(!JOBCTL_STOP_PENDING, Release);
}
flush_sigqueue_mask(&target.process.shared_pending, stop_mask);
// Wake all threads in TASK_STOPPED so they transition to RUNNING.
// This is the step that actually resumes stopped threads.
// Matches Linux `prepare_signal()` which calls `signal_wake_up(t, true)`
// for each stopped thread on SIGCONT delivery.
//
// SF-098 fix: KEEP locks held during wake-up iteration. The previous
// code dropped SIGHAND_LOCK(30) and SIGLOCK(40) before the wake loop
// based on incorrect lock ordering analysis claiming SIGLOCK(40) →
// RQ_LOCK(50) was invalid. The UmkaOS lock ordering is:
// SIGHAND_LOCK(30) < SIGLOCK(40) < RQ_LOCK(50)
// Holding SIGLOCK(40) while calling try_to_wake_up (which acquires
// RQ_LOCK(50)) is VALID — 40 < 50 satisfies the ordering. Linux
// confirms: prepare_signal() calls wake_up_state() (acquires rq->lock)
// while holding siglock. Dropping locks unnecessarily creates:
// 1. Thread group UAF: concurrent pthread_exit() unlinks a thread
// from thread_group.tasks (protected by Process::lock).
// 2. Signal state race: concurrent send_signal() re-enqueues a stop
// signal between flush and wake, creating a lost-wakeup.
for t in target.process.thread_group.iter() {
    if t.state.load(Acquire) == TaskState::STOPPED.bits() {
        try_to_wake_up(t, TASK_STOPPED, WakeFlags::empty());
    }
}
// Locks (_queue_guard, _sighand_guard) are dropped at end of step 2b
// scope by RAII.
- If sig is SIGSTOP (19), SIGTSTP (20), SIGTTIN (21), or SIGTTOU (22): flush
any pending SIGCONT (18) from both task.pending and process.shared_pending
for every thread in the thread group.
// Acquire SIGHAND_LOCK(30) then SIGLOCK(40) for queue mutation.
let _sighand_guard = target.process.sighand.lock.lock();
let _queue_guard = target.process.sighand.queue_lock.lock();
let cont_mask: u64 = 1 << 17; // SIGCONT=18, bit 17
// Interior mutability: flush_sigqueue_mask takes &PendingSignals (not &mut)
// and uses UnsafeCell internally. SAFETY: SIGLOCK(40) held.
for t in target.process.thread_group.iter() {
    flush_sigqueue_mask(&t.pending, cont_mask);
}
flush_sigqueue_mask(&target.process.shared_pending, cont_mask);
// After flushing, recalculate TIF_SIGPENDING for each affected thread.
// Without this, a thread whose only pending signal was the flushed
// SIGCONT would retain a stale TIF_SIGPENDING, causing a spurious
// do_signal() call on every return-to-user until naturally cleared.
for t in target.process.thread_group.iter() {
    recalc_sigpending_for(t);
}
Both cancellation paths mutate task.pending and process.shared_pending. This
requires holding SIGLOCK(40) (the per-sighand signal queue lock), which is
already held during send_signal(). SIGHAND_LOCK(30) is also held (outer lock)
because send_signal() acquires the sighand lock for disposition checks before
acquiring the signal queue lock. The cancellation loop iterates the standard
signal array (not the RT queue) and clears bits — O(1) per signal number
(4 signal numbers checked: 19, 20, 21, 22). No additional locking is needed
because SIGLOCK(40) serializes all signal queue mutations within the thread
group.
This mutual cancellation is required by POSIX (IEEE Std 1003.1-2024, 2.4.1): "If a
SIGCONT signal is generated for the process, then all pending stop signals shall be
discarded." Without this step, a process that receives SIGCONT while a SIGSTOP is
pending would stop immediately after resuming, violating job control semantics. The
flush_sigqueue_mask() helper takes &PendingSignals (shared reference) and uses
UnsafeCell internally (SAFETY: siglock held). Clears the specified bits from
pending_mask, sets standard[signo] = None for each matched signal, and
decrements total_count.
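A standalone model of that flush, with hypothetical simplified types — no UnsafeCell or siglock, and the `SigInfo` payload reduced to the signal number (`MockPending`/`flush_mask` are illustrative names):

```rust
/// Cut-down pending state: 1-indexed standard array, mask with bit N-1 = N.
pub struct MockPending {
    pub standard: [Option<u8>; 32], // stand-in for [Option<SigInfo>; 32]
    pub pending_mask: u64,
    pub total_count: u32,
}

/// Drop every pending standard signal whose bit is set in `mask`.
/// Caller guarantees `mask` covers standard signals only (bits 0..=30).
pub fn flush_mask(p: &mut MockPending, mask: u64) {
    let mut hits = p.pending_mask & mask;
    while hits != 0 {
        let signo = hits.trailing_zeros() as usize + 1; // bit -> signal number
        if p.standard[signo].take().is_some() {
            p.total_count -= 1;
        }
        hits &= hits - 1; // clear lowest set bit
    }
    p.pending_mask &= !mask;
}

fn main() {
    let mut p = MockPending { standard: [None; 32], pending_mask: 0, total_count: 0 };
    // Pend SIGSTOP (19) and SIGUSR1 (10).
    for s in [19u8, 10] {
        p.standard[s as usize] = Some(s);
        p.pending_mask |= 1 << (s - 1);
        p.total_count += 1;
    }
    // SIGCONT arrives: flush the four stop signals (19-22).
    let stop_mask = (1u64 << 18) | (1 << 19) | (1 << 20) | (1 << 21);
    flush_mask(&mut p, stop_mask);
    assert!(p.standard[19].is_none());    // SIGSTOP discarded
    assert_eq!(p.standard[10], Some(10)); // unrelated signal untouched
    assert_eq!(p.total_count, 1);
    assert_eq!(p.pending_mask, 1 << 9);   // only SIGUSR1's bit remains
}
```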
3. SIGKILL/SIGSTOP fast path: if sig is SIGKILL or SIGSTOP, wake every
thread in the target process immediately; these signals bypass the blocked mask.
4. Check if ignored (lock-free via handler_cache): read the disposition type
from task.process.sighand.handler_cache[signo - 1].load(Relaxed). One byte
is naturally atomic on all architectures — no torn reads, no lock needed.
If the cached disposition is 1 (SigHandler::Ignore) and the signal is not
SIGCHLD with SA_NOCLDWAIT semantics, drop it immediately (return success).
Exception: signals that originated from the kernel (e.g. SIGSEGV from a fault)
are delivered forcibly regardless of disposition. The handler_cache field is
defined in SignalHandlers (Section 8.1)
and updated by sigaction() after updating the full SigAction under
SIGHAND_LOCK.
Locking: Steps 0-4 run without any signal lock. Step 2b (SIGCONT/stop
cancellation) acquires both SIGHAND_LOCK(30) and SIGLOCK(40). Steps 5-8
run under SIGLOCK(40):
fn send_signal(target: &Task, sig: u8, info: &SigInfo, scope: SignalScope) -> Result<()> {
// Steps 0-4: validate, permission check, SIGCONT/stop cancellation,
// SIGKILL/SIGSTOP fast path, ignore check (all above; step 0 is the
// deliberate absence of a PF_EXITING check).
// Acquire SIGLOCK(40) for enqueue + notification steps.
let _queue_guard = target.process.sighand.queue_lock.lock();
// Step 5: Enqueue into the correct PendingSignals (per scope).
let pending = match scope {
SignalScope::Thread => &target.pending,
_ => &target.process.shared_pending,
};
// ... (enqueue logic per step 5 below) ...
// Step 6: signalfd notification.
signalfd_notify(target, sig);
// Step 7: Set TIF_SIGPENDING.
target.thread_flags.fetch_or(TIF_SIGPENDING, Release);
// Step 8: Wake target. queue_guard is dropped first to shorten the
// SIGLOCK(40) hold time — NOT because SIGLOCK(40) → RQ_LOCK(50)
// nesting would be invalid (40 < 50 is a legal ordering; see the
// SF-098 note in step 2b). TIF_SIGPENDING is already set, so the
// wake cannot be lost.
drop(_queue_guard);
if should_wake(target, sig) {
// Wake mask depends on signal: SIGKILL wakes TASK_KILLABLE
// sleepers (e.g., NFS I/O, FUSE); all other signals only wake
// TASK_INTERRUPTIBLE. Matching Linux's signal_wake_up_state().
let wake_mask = if sig == SIGKILL {
TASK_INTERRUPTIBLE | TASK_KILLABLE
} else {
TASK_INTERRUPTIBLE
};
try_to_wake_up(target, wake_mask, WakeFlags::empty());
}
Ok(())
}
/// Determine whether the target task should be woken for signal delivery.
///
/// Returns `true` if the signal is unblocked in the target's `signal_mask`
/// (or if the signal is SIGKILL/SIGSTOP, which bypass the blocked mask).
/// The sleep-state check itself is performed by `try_to_wake_up()` via the
/// wake mask. Loosely modeled on Linux's `wants_signal()`.
fn should_wake(target: &Task, sig: u8) -> bool {
// SIGKILL and SIGSTOP bypass the blocked mask — always wake.
if sig == SIGKILL || sig == SIGSTOP {
return true;
}
// Check if the signal is blocked by the target's signal mask.
!target.signal_mask.contains(sig)
}
5. Enqueue:
  - Standard signals (1–31): store SigInfo in standard[signo] and set the
    corresponding bit in pending_mask. If already set, do nothing
    (at-most-one-pending semantics).
  - RT signals (32–64): slab-allocate a SigQueueEntry from SIGQUEUE_SLAB,
    append to the rt_tail FIFO, increment rt_count[signo - 32], and set the
    bit in pending_mask.
6. Notify signalfd waiters: if the target task has any signalfd file
descriptors whose mask includes sig, wake their wait queues so that
poll()/epoll_wait() reports EPOLLIN and blocked read() calls unblock:
signalfd_notify(target, sig):
    for sfd in target.signalfd_list:  // lock-free list walk (RCU-protected)
        if sfd.mask.load(Acquire) & (1u64 << (sig - 1)) != 0:
            sfd.waiters.wake_up()
The mask uses the SignalSet encoding: bit N-1 represents signal N (bit 0 =
signal 1/SIGHUP, bit 8 = signal 9/SIGKILL, bit 63 = signal 64/SIGRTMAX).
Using 1 << sig instead of 1 << (sig - 1) would be off-by-one for all signals
and would overflow for signal 64 (a shift by 64 on a u64 panics in debug
builds and wraps the shift amount in release builds). The signalfd_list is a
per-task intrusive list of SignalFd objects, updated under the task's signal
lock when signalfds are created or closed. The walk is RCU-protected — no
lock is needed in the signal delivery path. If the task has no signalfds
(the common case), the list is empty and this step is a single null-pointer
check.
7. Set TIF_SIGPENDING: mark the TIF_SIGPENDING flag in the target thread's
thread_flags (fetch_or(TIF_SIGPENDING, Release)). The flag is checked on
every kernel-to-user transition. This must happen BEFORE waking the target:
if the target wakes and reaches the return-to-user check before
TIF_SIGPENDING is set, do_signal() will not be called and the signal
delivery is lost until the next syscall/interrupt. This ordering matches
Linux signal_wake_up_state() (set TIF_SIGPENDING, then wake_up_state()).
8. Wake the target: if the target thread is in TaskState::INTERRUPTIBLE
waiting in an interruptible sleep (e.g. poll, read, nanosleep), and the
signal is not blocked by its signal_mask, reschedule the thread via
try_to_wake_up(). The syscall's slow path will return EINTR (or restart, per
SA_RESTART — see Section 8.5). If the target is not sleeping (already
running), the TIF_SIGPENDING flag set in step 7 will be detected at the next
return-to-user point.
Relationship to send_signal_to_task(): The send_signal() function described
above is the top-level entry point that performs permission checks, SIGCONT/stop
cancellation, ignore filtering, thread selection (for process-directed scope), and
signalfd notification. Steps 5-8 (enqueue, signalfd, TIF_SIGPENDING, wake) are the
low-level primitive also available as send_signal_to_task(task, sig, info)
(Section 8.1). send_signal_to_task() skips permission checks,
ignore checks, and cancellation — it directly enqueues into the specified task's
pending set, sets TIF_SIGPENDING, and wakes the task. It is used by:
- do_signal_stop(): sending SIGCHLD to the parent (no permission check needed).
- do_exit(): sending SIGCHLD to the parent on process termination.
- Kernel-internal signal sources (timer expiry, fault handlers).
The kill() syscall, tgkill(), and tkill() all route through send_signal()
(which calls send_signal_to_task() internally after the high-level checks).
force_sig(sig, task) — Unconditional Signal Delivery:
force_sig() differs from send_signal() in critical ways, used by
hardware fault handlers (SIGBUS, SIGSEGV, SIGFPE, SIGILL), the seccomp
violation handler (SIGSYS), and the exec post-PNR failure path (SIGKILL).
NOTE: the OOM killer does NOT use force_sig() — it uses
do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID) which is
a process-wide signal, not a thread-directed forced signal.
- Enqueues unconditionally: force_sig() calls send_signal_to_task() without the permission checks or ignore check that send_signal() performs. NOTE: send_signal() does NOT check PF_EXITING (see step 0 above), so this distinction is about skipping permission/ignore checks, not about bypassing a PF_EXITING gate. A task in do_exit() can receive signals via either path; the difference is that force_sig() also resets the handler (point 2 below).
- Resets custom handler to SIG_DFL: for non-SIGKILL signals, ensures the signal cannot be caught or ignored — the default disposition (usually process termination) is enforced.
- Sets TIF_SIGPENDING unconditionally: the flag is set even if the signal is masked. For SIGKILL, this is always the case (SIGKILL cannot be masked).
- Wakes TASK_KILLABLE sleepers: for SIGKILL specifically, the task is woken from TASK_KILLABLE sleep (an interruptible sleep that responds only to fatal signals). This ensures that a task sleeping in a long I/O operation (e.g., NFS, FUSE) can be terminated promptly.
pub fn force_sig(sig: u8, task: &Task) {
// Acquire SIGHAND_LOCK(30) to safely mutate the action table.
// The action array is behind Arc<SignalHandlers> (shared reference from
// &Process). Mutation requires the lock.
let _guard = task.process.sighand.lock.lock();
// Reset handler to SIG_DFL (prevents catching the forced signal).
// SIGKILL and SIGSTOP handlers cannot be changed, so skip them.
if sig != SIGKILL && sig != SIGSTOP {
// Interior mutability: `action` is `UnsafeCell<[SigAction; SIGMAX]>`.
// SAFETY: SIGHAND_LOCK(30) is held (via `_guard` above), which
// serializes all writes to the action array. The UnsafeCell allows
// mutation through &SignalHandlers (shared reference via Arc).
let action = unsafe { &mut *task.process.sighand.action.get() };
action[(sig - 1) as usize].handler = SigHandler::Default;
task.process.sighand.handler_cache[(sig - 1) as usize]
.store(0, Relaxed); // 0 = Default
}
// Enqueue unconditionally — bypasses permission/ignore checks.
// Note: send_signal_to_task() does NOT check PF_EXITING (SF-090 fix).
send_signal_to_task(task, sig, SigInfo::kernel(sig));
// Note: SIGKILL wake is handled by send_signal_to_task() step 8
// (which wakes TASK_KILLABLE sleepers for SIGKILL). No redundant
// wake needed here.
}
do_send_sig_info(sig, info, task, type) — Process-Directed Signal Delivery:
do_send_sig_info() sends a signal to a process (thread group), not to a
specific thread. Used by the OOM killer, the reaper, kill() syscall with
PIDTYPE_TGID, and exit_group(). This is the primary interface for
process-wide signal delivery.
/// Send a signal to a process (thread group). The signal is enqueued to
/// the process's shared pending queue (`process.shared_pending`), and the
/// most responsive thread in the thread group is selected for wakeup.
///
/// `sig_type`: `PIDTYPE_TGID` for thread-group-directed (OOM kill, kill()),
/// `PIDTYPE_PID` for thread-directed (tkill).
///
/// For `PIDTYPE_TGID`, the signal goes to `process.shared_pending` and
/// thread selection uses the algorithm below. For `PIDTYPE_PID`, the
/// signal goes to `task.pending` (per-thread).
///
/// # OOM Killer Usage
/// The OOM killer calls:
/// `do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID)`
/// This ensures SIGKILL is delivered to the shared pending queue and
/// wakes the most responsive thread. Thread selection prefers
/// TASK_INTERRUPTIBLE threads over TASK_KILLABLE, ensuring delivery
/// succeeds even when the OOM-selected task is in D-state.
pub fn do_send_sig_info(
sig: u8,
info: &SigInfo,
task: &Task,
sig_type: PidType,
) -> Result<()> {
match sig_type {
PidType::Tgid => {
// Process-directed: enqueue to shared pending.
let _guard = task.process.sighand.queue_lock.lock();
enqueue_signal(&task.process.shared_pending, sig, info)?;
// Select the best thread for wakeup.
let target = select_thread_for_signal(&task.process, sig);
if let Some(thread) = target {
thread.thread_flags.fetch_or(TIF_SIGPENDING, Ordering::Release);
signal_wake_up(thread, sig == SIGKILL);
}
Ok(())
}
PidType::Pid => {
// Thread-directed: delegate to send_signal_to_task.
send_signal_to_task(task, sig, info.clone())
}
}
}
8.5.3.1.4 Thread Selection for Process-Directed Signals¶
A process-directed signal (sent to a PID rather than a TID) can be delivered to any thread that does not have it blocked. The selection algorithm is:
- Prefer a thread that is sleeping in an interruptible state (TASK_INTERRUPTIBLE) and has the signal unblocked.
- Among multiple eligible threads, select the first one found in thread-group order (deterministic but not specified to user space).
- If all threads have the signal blocked, the signal remains pending in pending_process (the process's shared pending set) until a thread unblocks it via sigprocmask() or pthread_sigmask().
- If execve() is in progress, defer until the exec completes.
8.5.3.1.5 Checking and Delivering Pending Signals¶
The kernel checks for pending signals at every kernel → user-mode return:
- Syscall return path: after do_syscall() completes and before SYSRET/IRET.
- Interrupt return path: after the interrupt handler runs, before returning to user mode.
- schedule() return: after a context switch, when returning to the resumed task.
The check is gated on TIF_SIGPENDING. If set, do_signal() is called:
fn do_signal(regs: &mut ArchRegs) {
let task = current_task();
// --- Syscall restart handling (ERESTART*) ---
// Before delivering any signal, check if the interrupted syscall returned
// an internal restart code. These codes never reach userspace — do_signal()
// rewrites them before returning to user mode.
//
// Architecture-specific: the result register is `rax` on x86-64, `x0` on
// AArch64, `a0` on RISC-V, `r3` on PPC, `r2` on s390x, `a0` on LoongArch64.
let syscall_result = arch::get_syscall_result(regs);
let mut restarting = matches!(
syscall_result,
-ERESTARTSYS | -ERESTARTNOHAND | -ERESTARTNOINTR | -ERESTART_RESTARTBLOCK
);
loop {
// Acquire SIGLOCK(40) around dequeue_signal() AND the action table
// read for the restart decision. The siglock protects:
// (1) PendingSignals mutations (enqueue, dequeue, flush),
// (2) signal_mask reads and writes,
// (3) action table reads during restart decisions (prevents race
// with concurrent sigaction() modifying the entry — SIG-14 fix).
//
// The lock is held from dequeue through the SigAction capture
// (SF-093 fix). The captured SigAction snapshot is passed to
// handle_signal() after the lock is dropped, ensuring the action
// table read is race-free against concurrent sigaction(). The
// critical invariant is that ALL action table reads (restart
// decision AND handler dispatch) use data captured under siglock.
let _sig_guard = task.siglock();
let sig = dequeue_signal(&task.pending,
&task.process.shared_pending,
&task.signal_mask);
match sig {
None => {
// No more pending unblocked signals.
// Recalculate TIF_SIGPENDING while siglock is still held.
recalc_sigpending();
drop(_sig_guard);
// If we had a restart code and no handler was set up,
// handle unconditional restarts here. The original syscall
// number is read via arch::current::get_orig_syscall_nr()
// from task.context.orig_syscall_nr (saved at syscall entry;
// see [Section 8.1](#process-and-task-management--archcontext)).
if restarting {
match syscall_result {
-ERESTARTNOINTR => {
// Unconditional restart: rewind IP, restore args.
arch::restart_syscall(regs);
}
-ERESTART_RESTARTBLOCK => {
// Restart via restart_block: rewrite syscall number
// to __NR_restart_syscall (219 on x86-64), rewind IP.
arch::setup_restart_block(regs, task);
}
-ERESTARTNOHAND | -ERESTARTSYS => {
// No handler was invoked (all signals were
// default-ignore or default-continue). Restart
// unconditionally since no handler will be entered.
arch::restart_syscall(regs);
}
_ => {}
}
}
break;
}
Some((signum, info)) => {
// If a user handler is about to be invoked and we have a
// restart code, apply the restart/EINTR decision NOW.
// SIGLOCK(40) is held — action table read is race-free
// against concurrent sigaction() (SIG-14 fix).
if restarting {
// SAFETY: SIGLOCK(40) held. UnsafeCell access serialized.
let action_table = unsafe { &*task.process.sighand.action.get() };
let action = &action_table[signum as usize - 1];
match syscall_result {
-ERESTARTSYS if action.sa_flags.contains(SA_RESTART) => {
// SA_RESTART set: restart the syscall after the
// handler returns (rewind IP in the signal frame).
arch::restart_syscall(regs);
}
-ERESTARTSYS | -ERESTARTNOHAND => {
// No SA_RESTART or ERESTARTNOHAND with user handler:
// convert to EINTR for userspace.
arch::set_syscall_result(regs, -EINTR);
}
-ERESTARTNOINTR => {
// Always restart, even with a handler.
arch::restart_syscall(regs);
}
-ERESTART_RESTARTBLOCK => {
// Handler will be invoked: change to EINTR.
// (The restart_block path only applies when no
// handler is invoked — handled in the None branch.)
arch::set_syscall_result(regs, -EINTR);
}
_ => {}
}
// Clear the restarting flag after the first signal is
// processed. Without this, the restart logic would be
// applied to subsequent signals in the loop (e.g., a
// second default-ignore signal would incorrectly trigger
// restart code rewriting in the result register).
restarting = false;
}
// SF-093 fix: capture the SigAction under SIGLOCK(40) before
// dropping the lock. This prevents a concurrent sigaction()
// from changing the handler between dequeue and dispatch.
// This matches Linux's get_signal() which reads ka under
// siglock. Without this, a race between sigaction() and
// signal delivery can build a frame for a nonexistent handler.
// SAFETY: SIGLOCK(40) held. UnsafeCell access serialized.
let action_table = unsafe { &*task.process.sighand.action.get() };
let ka = action_table[(signum - 1) as usize].clone();
drop(_sig_guard);
let action = handle_signal(task, signum, info, regs, &ka);
match action {
SignalAction::UserHandler => {
// A user-space signal frame has been set up on the user
// stack. We MUST stop processing here — delivering a
// second user handler would overwrite the first frame,
// corrupting the saved register state. The remaining
// pending signals will be delivered when the first
// handler returns (via sigreturn -> do_signal again).
break;
}
SignalAction::DefaultTerm | SignalAction::DefaultCore => {
// do_exit() was called — this function never returns.
unreachable!();
}
SignalAction::DefaultStop => {
// Task was STOPPED — schedule() returned after SIGCONT.
// Continue the loop to deliver any pending signals that
// accumulated during the stop (including the SIGCONT
// handler if one is installed).
continue;
}
SignalAction::DefaultIgn | SignalAction::DefaultCont => {
// Default IGN or CONT: continue loop to process
// additional pending signals.
continue;
}
}
}
}
}
}
Critical invariant: At most ONE user-space signal handler frame is set up per
return-to-user. Default actions that kill (TERM/CORE) never return from do_signal().
Default STOP blocks (schedule returns on SIGCONT, re-entering do_signal). Default IGN
and CONT continue the loop. Only after setting up a user handler frame does the loop
break — remaining signals are delivered when the handler calls sigreturn.
/// Dequeue the highest-priority pending unblocked signal.
///
/// Checks per-task pending first (thread-directed signals), then per-process
/// shared pending. Returns `None` if no unblocked signal is pending.
///
/// # Preconditions
/// - `SIGLOCK(40)` MUST be held by the caller. All PendingSignals mutations
/// (enqueue, dequeue, flush) are serialized by this lock.
///
/// # Returns
/// `Some((signum, SigInfo))` — the dequeued signal number and its info.
/// `None` — no pending unblocked signals.
fn dequeue_signal(
pending_task: &PendingSignals,
pending_process: &PendingSignals,
blocked: &SignalSet,
) -> Option<(u32, SigInfo)> {
// 1. Check per-task pending first (thread-directed signals have priority).
if let Some(signo) = next_pending_signal(pending_task, blocked) {
let info = dequeue_from(pending_task, signo);
return Some((signo, info));
}
// 2. Check per-process shared pending.
if let Some(signo) = next_pending_signal(pending_process, blocked) {
let info = dequeue_from(pending_process, signo);
return Some((signo, info));
}
None
}
/// Dequeue a specific signal from a PendingSignals queue.
///
/// Called under siglock — all field accesses are serialized.
/// `pending_mask` is AtomicU64 for lock-free reads from `sigpending(2)`,
/// not for mutation serialization. `rt_count` and `total_count` are plain
/// non-atomic fields (`[u8; 33]` and `u32` respectively) — no atomic methods
/// are needed because the siglock serializes all mutations.
///
/// # Interior mutability
/// This function takes `&PendingSignals` (shared reference) because
/// `PendingSignals` is embedded in `Task.pending` or `Process.shared_pending`,
/// both accessed through `Arc<Task>` / `Arc<Process>` (shared references).
/// The `siglock` guarantees single-writer access; Rust's type system is
/// satisfied by wrapping mutable fields in `UnsafeCell`:
///
/// - `standard: UnsafeCell<[Option<SigInfo>; 32]>` — SAFETY: siglock held
/// - `rt_head: UnsafeCell<Option<NonNull<SigQueueEntry>>>` — SAFETY: siglock held
/// - `rt_tail: UnsafeCell<Option<NonNull<SigQueueEntry>>>` — SAFETY: siglock held
/// - `rt_count: UnsafeCell<[u8; 33]>` — SAFETY: siglock held
/// - `total_count: UnsafeCell<u32>` — SAFETY: siglock held
/// - `pending_mask: AtomicU64` — atomic for lock-free reads (sigpending(2))
///
/// All `UnsafeCell` accesses use `unsafe { &mut *field.get() }` with the
/// SAFETY invariant: "siglock is held by the caller."
fn dequeue_from(pending: &PendingSignals, signo: u32) -> SigInfo {
// SAFETY: siglock is held by the caller (dequeue_signal holds siglock).
// All UnsafeCell accesses below are serialized by the siglock.
if signo <= 31 {
// Standard signal: take from standard[signo] (1-indexed array —
// signal N maps directly to index N, index 0 unused).
let standard = unsafe { &mut *pending.standard.get() };
let info = standard[signo as usize].take()
.unwrap_or(SigInfo::kernel(signo as u8));
// Clear bit in pending_mask (no more instances of this signal).
// Bit (N-1) = signal N: signal 1 = bit 0, signal 31 = bit 30.
pending.pending_mask.fetch_and(!(1u64 << (signo - 1)), Relaxed);
let total = unsafe { &mut *pending.total_count.get() };
*total -= 1;
info
} else {
// RT signal: walk rt_head to find and unlink the oldest entry
// matching signo. Free it back to SIGQUEUE_SLAB.
let entry = rt_dequeue(pending, signo);
let info = entry.info;
let rt_count = unsafe { &mut *pending.rt_count.get() };
rt_count[(signo - 32) as usize] -= 1;
if rt_count[(signo - 32) as usize] == 0 {
pending.pending_mask.fetch_and(!(1u64 << (signo - 1)), Relaxed);
}
let total = unsafe { &mut *pending.total_count.get() };
*total -= 1;
SIGQUEUE_SLAB.get().unwrap().free(entry);
info
}
}
8.5.3.1.6 recalc_sigpending() and recalc_sigpending_for()¶
Recalculates whether TIF_SIGPENDING should be set for the current task (or an
arbitrary task). Called after do_signal() exhausts the pending queue, after
sigprocmask() changes the blocked mask, after epoll_pwait/ppoll/pselect
restore the signal mask, and after SIGCONT cancels pending stop signals on other
threads.
/// Recalculate TIF_SIGPENDING for the current task.
///
/// Sets TIF_SIGPENDING if there are any pending unblocked signals in
/// either the per-task or per-process pending sets. Clears the flag
/// otherwise — preventing the do_signal() slow path from being entered
/// on every subsequent return-to-userspace.
///
/// # Memory ordering
///
/// Must be called under `SIGLOCK(40)` (or with interrupts disabled on the
/// current CPU). The siglock provides the ordering guarantee:
///
/// - **Race: `send_signal()` on CPU B vs `recalc_sigpending()` on CPU A.**
/// CPU B holds siglock while setting `pending_mask` and then setting
/// `TIF_SIGPENDING` (step 7). CPU A holds siglock while reading
/// `pending_mask` and conditionally clearing `TIF_SIGPENDING`. Since
/// both are serialized by siglock, the race is eliminated: either
/// CPU A sees the new pending signal (and keeps TIF_SIGPENDING set),
/// or CPU A clears TIF_SIGPENDING and CPU B subsequently re-sets it.
///
/// - **`pending_mask` uses `Relaxed` ordering** because the siglock already
/// provides the inter-thread visibility guarantee. The `Relaxed` load is
/// sufficient because it runs within the siglock critical section.
///
/// - **`thread_flags` uses `Release` ordering** so that the cleared
/// TIF_SIGPENDING is visible to other CPUs when they check the flag
/// in the return-to-user path (which uses `Acquire` on thread_flags).
fn recalc_sigpending() {
recalc_sigpending_for(current_task());
}
/// Recalculate TIF_SIGPENDING for an arbitrary task.
///
/// Used by the SIGCONT cancellation path to clear stale TIF_SIGPENDING on
/// other threads in the thread group after removing their pending stop
/// signals. Without this, threads that had a pending stop signal removed
/// by SIGCONT would retain a stale TIF_SIGPENDING, causing a spurious
/// `do_signal()` call on every return-to-userspace until naturally cleared.
///
/// # Preconditions
///
/// Must be called under `SIGLOCK(40)`. The target task must be in the same
/// thread group as the caller (sharing the same sighand, hence the same
/// siglock).
fn recalc_sigpending_for(task: &Task) {
let blocked = task.signal_mask.bits();
let has_pending =
(task.pending.pending_mask.load(Relaxed) & !blocked) != 0
|| (task.process.shared_pending.pending_mask.load(Relaxed) & !blocked) != 0;
if has_pending {
task.thread_flags.fetch_or(TIF_SIGPENDING, Release);
} else {
task.thread_flags.fetch_and(!TIF_SIGPENDING, Release);
}
}
Without recalc_sigpending(), a task that ever received a signal would take the
do_signal() penalty on every subsequent syscall return forever — the flag would
be set but never cleared, wasting cycles on an empty dequeue loop.
8.5.3.1.7 Signal Handler Invocation¶
handle_signal(signum, info, regs) dispatches the dequeued signal: for default
actions it executes the action directly (TERM/CORE/STOP/CONT/IGN); for user
handlers it builds a signal frame on the user stack, updates the blocked mask
under siglock (matching Linux's signal_delivered()), and returns UserHandler.
/// Dispatch a dequeued signal — either execute the default action or
/// build a user signal frame.
///
/// # Preconditions
/// - `SIGLOCK(40)` is held on entry (from `dequeue_signal()` in `do_signal()`).
/// - `info` is the `SigInfo` dequeued from the pending set.
///
/// # Lock interactions
/// - For SA_RESETHAND: acquires `SIGHAND_LOCK(30)` to mutate the action table.
/// This is safe because SIGHAND_LOCK(30) < SIGLOCK(40) in the lock ordering
/// and we DROP siglock before acquiring sighand_lock (no nesting).
/// - For mask update (step 5): re-acquires `SIGLOCK(40)` to update
/// `signal_mask`. The `signal_mask` field is NOT atomic — it is a plain
/// `SignalSet(u64)` protected by siglock. Without the lock, concurrent
/// `send_signal()` reading `signal_mask` for the ignore/blocked check
/// would race. This matches Linux's `signal_delivered()` pattern.
fn handle_signal(
task: &Task,
signum: u32,
info: SigInfo,
regs: &mut ArchRegs,
ka: &SigAction, // SF-093 fix: captured under SIGLOCK(40) by do_signal()
) -> SignalAction {
// Step 1: Determine action. The SigAction was captured under SIGLOCK(40)
// by the caller (do_signal) before dropping the lock, preventing a race
// with concurrent sigaction() that could change the handler between
// dequeue and dispatch. The lock is NOT held here — only the captured
// snapshot is used.
// Step 2: Default action dispatch.
match ka.handler {
SigHandler::Default => {
let default_action = signal_default_action(signum);
match default_action {
DefaultAction::Term => {
do_exit(signal_exit_status(signum));
// do_exit never returns.
return SignalAction::DefaultTerm;
}
DefaultAction::Core => {
do_coredump(&info, regs);
do_exit(signal_exit_status(signum) | 0x80);
return SignalAction::DefaultCore;
}
DefaultAction::Stop => {
do_group_stop(task, signum);
return SignalAction::DefaultStop;
}
DefaultAction::Cont => {
// SIGCONT handling is in send_signal() step 2b.
return SignalAction::DefaultCont;
}
DefaultAction::Ign => {
return SignalAction::DefaultIgn;
}
}
}
SigHandler::Ignore => {
return SignalAction::DefaultIgn;
}
SigHandler::User(_handler_fn) => {
// Fall through to user handler path below.
}
}
// ---- User handler path ----
// Step 3: SA_ONSTACK — select signal stack BEFORE building the frame.
let sp = if ka.sa_flags.contains(SA_ONSTACK)
&& task.sas_ss_sp != 0
&& task.sas_ss_size != 0
&& (task.sas_ss_flags & SS_ONSTACK) == 0
{
if task.sas_ss_flags & SS_AUTODISARM != 0 {
// Prevent nested alt-stack use by clearing registration.
// These fields are plain u64 — only written by the owning
// task in signal delivery context (no concurrency).
// In implementation, use AtomicU64 or document single-writer.
}
task.sas_ss_sp + task.sas_ss_size
} else {
arch::current::signal::user_sp(regs)
};
// Step 4: Build signal frame on the user stack.
// Writes RtSigFrame at `sp`, redirects return to handler entry.
arch::current::signal::setup_rt_frame(
task, signum, &info, ka, regs, sp,
);
// Step 5: Update signal mask under SIGLOCK.
// signal_mask is plain SignalSet(u64) — NOT atomic. Mutation
// requires SIGLOCK(40) to prevent data race with send_signal()
// reads. Matching Linux's signal_delivered() / block_sigmask().
{
let _sig_guard = task.process.sighand.queue_lock.lock();
let new_blocked = if ka.sa_flags.contains(SA_NODEFER) {
task.signal_mask.union(ka.sa_mask)
} else {
task.signal_mask.union(ka.sa_mask).union(SignalSet::mask(signum as u8))
};
task.signal_mask = new_blocked;
recalc_sigpending_for(task);
}
// Step 6: SA_RESETHAND — reset disposition to Default after frame build.
if ka.sa_flags.contains(SA_RESETHAND) {
let _guard = task.process.sighand.lock.lock();
// SAFETY: SIGHAND_LOCK(30) held.
let action = unsafe { &mut *task.process.sighand.action.get() };
action[(signum - 1) as usize].handler = SigHandler::Default;
task.process.sighand.handler_cache[(signum - 1) as usize]
.store(0, Relaxed); // 0 = Default
}
SignalAction::UserHandler
}
8.5.3.1.8 Group Stop Protocol (POSIX Job Control)¶
When any thread in a thread group receives a stop signal (SIGSTOP, SIGTSTP, SIGTTIN,
SIGTTOU) whose default action is STOP, the entire thread group must stop. This is
a POSIX requirement: kill(-pgrp, SIGTSTP) stops all threads in all processes in the
process group, not just one thread.
The group stop protocol uses task.jobctl flags (Section 8.1):
JOBCTL flag constants (canonical definitions in Section 8.1; reproduced here for reference):
pub const JOBCTL_STOP_PENDING: u32 = 1 << 17; // This task must stop
pub const JOBCTL_STOP_CONSUME: u32 = 1 << 18; // Signal consumed, just stop
pub const JOBCTL_TRAP_STOP: u32 = 1 << 19; // ptrace trap on group stop
pub const JOBCTL_TRAP_NOTIFY: u32 = 1 << 20; // ptrace event notification
pub const JOBCTL_TRAPPING: u32 = 1 << 21; // ptrace trapping in progress
pub const JOBCTL_LISTENING: u32 = 1 << 22; // ptrace LISTEN mode
Group stop initiation (in handle_signal() when action is STOP):
fn do_group_stop(task: &Task, signum: u32) {
let process = &task.process;
// Count threads that need to stop (including current).
let mut stop_count: u32 = 0;
// Set JOBCTL_STOP_PENDING on all OTHER threads in the group.
for t in process.thread_group.iter() {
if core::ptr::eq(t.as_ref(), task) {
// Current thread — it will stop below via do_signal_stop().
// Count it but don't set TIF_SIGPENDING on itself (redundant).
stop_count += 1;
continue;
}
t.jobctl.fetch_or(JOBCTL_STOP_PENDING, Release);
// Set TIF_SIGPENDING so the thread checks at next return-to-user.
t.thread_flags.fetch_or(TIF_SIGPENDING, Release);
// If the thread is in INTERRUPTIBLE sleep, wake it so it can stop.
if t.state.load(Acquire) == TaskState::INTERRUPTIBLE.bits() {
try_to_wake_up(t, TASK_INTERRUPTIBLE, WakeFlags::empty());
}
stop_count += 1;
}
// Set the group stop counter. Each thread decrements this in
// do_signal_stop(); the last thread to decrement sends SIGCHLD.
process.thread_group.group_stop_count.store(stop_count, Release);
// The current thread stops immediately, passing the stop signal number
// so SIGCHLD to the parent reports the correct si_status.
do_signal_stop(task, signum);
}
Delayed notification for TASK_UNINTERRUPTIBLE threads: A thread blocked in
TASK_UNINTERRUPTIBLE (e.g., synchronous disk I/O) will not be woken by the
stop signal. It has TIF_SIGPENDING set, so when it eventually returns from
the I/O and reaches the return-to-user path, it will stop. Until then, the
group stop is incomplete and the parent is not notified. This matches Linux
behavior exactly. The parent's wait4() will block until the last thread stops.
Per-thread stop (each thread checks JOBCTL_STOP_PENDING at return-to-user):
/// Perform the per-thread portion of a group stop.
///
/// `stop_signal` is the signal number that initiated the group stop
/// (SIGSTOP=19, SIGTSTP=20, SIGTTIN=21, or SIGTTOU=22). It is passed
/// to `SigInfo::new_sigchld()` so that the parent's SIGCHLD `si_status`
/// field reports which signal caused the stop (POSIX requirement).
fn do_signal_stop(task: &Task, stop_signal: u32) -> SignalAction {
task.jobctl.fetch_and(!JOBCTL_STOP_PENDING, Release);
task.state.store(TaskState::STOPPED.bits(), Release);
// Decrement the group stop counter. The counter was set by
// do_group_stop() to the number of threads that need to stop.
// When it reaches 0, the last thread sends SIGCHLD.
// Using a counter avoids the TOCTOU race of iterating all threads:
// a concurrent SIGCONT could wake threads between the iteration and
// the "all stopped" check, causing a false negative.
let prev = task.process.thread_group.group_stop_count
.fetch_sub(1, AcqRel);
if prev == 1 {
// This is the last thread to stop.
// Last thread to stop: notify parent with SIGCHLD (CLD_STOPPED).
// Navigation: Process.parent is AtomicU64 (parent PID).
// Look up the parent Process via PID_TABLE, then its leader Task.
//
// SF-097 fix: Use a retry loop instead of .expect() to handle
// the TOCTOU race where the parent exits and is reaped between
// our load and the PID_TABLE lookup. Concurrent reparenting
// (exit_notify) updates Process.parent to init's PID. If the
// first lookup fails (stale PID), re-read the parent field
// which will now point to init (PID 1, always valid).
let parent_process = loop {
let parent_pid = task.process.parent.load(Acquire);
match PID_TABLE.get(parent_pid) {
Some(p) => break p,
None => {
// Parent was reaped. Re-read: reparenting has set
// Process.parent to init's PID. The re-read is safe
// because init (PID 1) is never reaped.
continue;
}
}
};
let parent = parent_process.thread_group.leader();
let info = SigInfo {
si_signo: SIGCHLD as i32,
si_code: CLD_STOPPED,
si_errno: 0,
_union: SigInfoUnion { sigchld: SigInfoSigchld::new(
task.process.pid.pid, // Extract i32 from ProcessId
task.cred.read().uid,
stop_signal as i32,
// cputime_to_clock_t(ns) converts nanoseconds to clock ticks
// (USER_HZ = 100). Formula: ns / (1_000_000_000 / USER_HZ).
// Defined in [Section 8.7](#resource-limits-and-accounting).
cputime_to_clock_t(task.rusage.utime_ns.load(Relaxed)),
cputime_to_clock_t(task.rusage.stime_ns.load(Relaxed)),
)},
};
send_signal_to_task(parent, SIGCHLD, &info);
wake_up_interruptible(&parent.wait_chldexit);
}
schedule(); // yield CPU — task is STOPPED until SIGCONT
// After schedule() returns (woken by SIGCONT), the task resumes here.
// Return DefaultStop to the do_signal() loop. The loop should CONTINUE
// (not break) so it re-checks pending signals — the SIGCONT that woke
// us may have a user handler that needs delivery, and other signals may
// have accumulated during the stop.
SignalAction::DefaultStop
}
Group continue (SIGCONT received): The SIGCONT/stop mutual cancellation in
send_signal() (step 2b) clears JOBCTL_STOP_PENDING on all threads. Each
stopped thread is woken by the SIGCONT delivery path (TASK_STOPPED is in the
wake mask for SIGCONT). On wakeup, each thread transitions from STOPPED to
RUNNING. The CLD_CONTINUED notification is sent to the parent by the thread
that processes the SIGCONT.
Ptrace interaction: When a ptraced task enters group stop, it enters
TASK_TRACED instead of TASK_STOPPED (so the tracer can inspect it). The
JOBCTL_TRAP_STOP flag is set, and PTRACE_EVENT_STOP is reported to the
tracer. The tracer can then use PTRACE_LISTEN to allow the task to participate
in the group stop without fully detaching.
8.5.4 Signal Frame Layout (x86-64)¶
On x86-64 the kernel pushes an rt_sigframe onto the user stack (or the alternate
signal stack if SA_ONSTACK is set and active). The frame is 16-byte aligned at the
point where RSP is set for the handler. The layout matches the Linux rt_sigframe
ABI so that unmodified glibc signal trampolines work without modification.
/// User-stack frame built by the kernel when delivering a signal (x86-64).
///
/// Stack grows downward. The kernel writes this struct below the current RSP,
/// aligns the resulting RSP to 16 bytes, then subtracts 8 (simulating a
/// `call` instruction's return address push) before setting RSP for the handler.
///
/// # ABI note
/// This layout is fixed by the Linux x86-64 signal ABI. glibc's `__restore_rt`
/// trampoline (the default `sa_restorer`) issues `syscall` with rax = SYS_rt_sigreturn
/// (15) to re-enter the kernel. The kernel then reads `uc` from this frame to
/// restore all registers and the signal mask.
// Userspace ABI — x86-64 specific. RtSigFrame, UContext, SigAltStack, MContext,
// and FpState are all architecture-dependent signal frame structs. Each
// architecture defines its own layout in umka-core/src/arch/<arch>/signal.rs.
// const_assert verified per-architecture at build time (not here in generic spec).
// The Rust definitions are kernel-internal (not exported KABI); the layouts
// they describe are fixed by the userspace signal ABI.
#[repr(C)]
pub struct RtSigFrame {
/// Return address pushed by the handler call: points to `sa_restorer`
/// (or the vDSO `__restore_rt` trampoline if `sa_restorer` is None).
pub pretcode: *const u8,
/// Extended signal information (only meaningful when SA_SIGINFO is set;
/// present in the frame regardless).
pub info: SigInfo,
/// Saved user-space execution context, restored by `sigreturn`.
pub uc: UContext,
// Architecture-private FP/XSAVE state follows immediately in memory,
// pointed to by uc.uc_mcontext.fpstate. It is not part of this struct
// because its size is runtime-determined (XSAVE area size from CPUID).
}
/// User context saved on signal entry, restored on `sigreturn`.
///
/// x86-64 layout:
/// offset 0: uc_flags (u64, 8 bytes)
/// offset 8: uc_link (*mut UContext, 8 bytes)
/// offset 16: uc_stack (SigAltStack, 24 bytes)
/// offset 40: uc_mcontext (MContext, 256 bytes)
/// offset 296: uc_sigmask (SignalSet = u64, 8 bytes)
/// Total: 304 bytes
// kernel-internal, not KABI
#[repr(C)]
pub struct UContext {
pub uc_flags: u64,
pub uc_link: *mut UContext,
pub uc_stack: SigAltStack,
pub uc_mcontext: MContext,
pub uc_sigmask: SignalSet,
}
// x86-64: 8 + 8 + 24 + 256 + 8 = 304
#[cfg(target_arch = "x86_64")]
const_assert!(size_of::<UContext>() == 304);
/// Alternate signal stack descriptor (`struct stack_t`).
///
/// Layout on LP64:
/// offset 0: ss_sp (*mut u8, 8 bytes)
/// offset 8: ss_flags (i32, 4 bytes)
/// offset 12: _pad0 (4 bytes — implicit #[repr(C)] padding to align ss_size to 8)
/// offset 16: ss_size (usize, 8 bytes)
/// Total: 24 bytes
///
/// Layout on ILP32:
/// offset 0: ss_sp (*mut u8, 4 bytes)
/// offset 4: ss_flags (i32, 4 bytes)
/// offset 8: ss_size (usize, 4 bytes)
/// Total: 12 bytes (no padding — all fields 4-byte aligned)
// kernel-internal, not KABI (but embedded in UContext which is written to user stack)
#[repr(C)]
pub struct SigAltStack {
pub ss_sp: *mut u8,
pub ss_flags: i32,
/// Explicit padding on LP64 to align `ss_size` to 8-byte boundary.
/// On ILP32, `usize` is 4-byte aligned so no padding is needed.
#[cfg(target_pointer_width = "64")]
pub _pad0: [u8; 4],
pub ss_size: usize,
}
/// LP64: 8 + 4 + 4(pad) + 8 = 24
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<SigAltStack>() == 24);
/// ILP32: 4 + 4 + 4 = 12
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<SigAltStack>() == 12);
/// Machine context: all general-purpose registers plus segment and FP state.
/// Matches Linux `struct sigcontext` / `mcontext_t` for x86-64.
// kernel-internal, not KABI
#[repr(C)]
pub struct MContext {
pub r8: u64,
pub r9: u64,
pub r10: u64,
pub r11: u64,
pub r12: u64,
pub r13: u64,
pub r14: u64,
pub r15: u64,
pub rdi: u64,
pub rsi: u64,
pub rbp: u64,
pub rbx: u64,
pub rdx: u64,
pub rax: u64,
pub rcx: u64,
pub rsp: u64,
pub rip: u64,
pub eflags: u64,
pub cs: u16,
pub gs: u16,
pub fs: u16,
pub ss: u16,
pub err: u64,
pub trapno: u64,
pub oldmask: u64,
pub cr2: u64,
/// Pointer to XSAVE area (or null if no FP state).
pub fpstate: *mut FpState,
/// Saved PKRU value (x86-64 only). PKRU is saved/restored separately from the
/// XSAVE area to avoid the XRSTOR PKRU reset issue
/// (see [Section 7.3](07-scheduling.md#context-switch-and-register-state--pkru-management-during-context-switch-x86-64)).
/// On architectures without PKRU (AArch64, RISC-V, etc.), this field is zero.
pub pkru_val: u64,
pub _reserved: [u64; 7], // reduced from 8 to 7 to accommodate pkru_val
}
// x86-64: 18*8(GPRs+flags) + 4*2(seg) + 4*8(err/trapno/oldmask/cr2) + 8(fpstate) + 8(pkru) + 7*8(reserved)
// = 144 + 8 + 32 + 8 + 8 + 56 = 256
#[cfg(target_arch = "x86_64")]
const_assert!(size_of::<MContext>() == 256);
/// Header of the XSAVE state area (variable-length; size from `CPUID.(EAX=0Dh,ECX=0)`).
// kernel-internal, not KABI
#[repr(C)]
pub struct FpState {
pub cwd: u16, // x87 control word
pub swd: u16, // x87 status word
pub twd: u16, // x87 tag word
pub fop: u16, // last FP instruction opcode
pub rip: u64, // last FP instruction RIP
pub rdp: u64, // last FP data RIP
pub mxcsr: u32,
pub mxcsr_mask: u32,
// 8×16-byte st/mm registers, 16×16-byte XMM registers, optional AVX/AVX-512 state
// follow via the standard XSAVE layout.
}
Frame construction sequence:

1. Compute the new RSP: subtract `size_of::<RtSigFrame>()` from the current user RSP.
2. Align down to 16 bytes, then subtract 8 (ABI: the stack must be 16-byte aligned at handler entry, simulating the `call` that pushed a return address).
3. PKRU reset (x86-64 only): Before writing the signal frame to user memory, save the current PKRU value into `uc.uc_mcontext.pkru_val`, then set PKRU to the permissive value (`0x55555550` — user access allowed to all protection keys except key 0). This is required because the signal handler runs in a generic user context that must be able to access all user memory (the handler's stack and code may span multiple protection key regions). The saved PKRU value is restored on `sigreturn`.
4. Write the `RtSigFrame` fields: `pretcode` = `sa_restorer` (or the vDSO trampoline), `info` = the `SigInfo`, `uc.uc_mcontext` = all current user registers, `uc.uc_sigmask` = the task's current signal mask before adding `sa_mask`.
5. Save XSAVE state: call the arch XSAVE routine to write FP/vector state into the area immediately following the frame; set `uc.uc_mcontext.fpstate`. Note: PKRU is excluded from the XSAVE area — it is saved/restored separately via the `pkru_val` field to avoid the XRSTOR PKRU reset bug (see Section 7.3).
6. Set `uc.uc_stack` to the alternate signal stack descriptor (so the handler can inspect it).
7. CET shadow stack (x86-64, when CET-SS is enabled): Push a shadow stack restore token onto the shadow stack (not into `MContext`). The token is an 8-byte value containing the current shadow stack pointer value with bit 1 set (the "restore token" marker). On `sigreturn`, the kernel uses `RSTORSSP` to validate and consume the restore token from the shadow stack — this prevents signal-handler-based shadow stack manipulation attacks. The restore token lives on the shadow stack itself (matching Linux `arch/x86/kernel/shstk.c` `setup_signal_shadow_stack()`), not in the `MContext` struct. `longjmp`/`setjmp` must also be shadow-stack-aware: `setjmp` saves the current shadow stack pointer, `longjmp` uses `RSTORSSP` + `SAVEPREVSSP` to restore it. The `arch_prctl(ARCH_SHSTK_ENABLE)` / `ARCH_SHSTK_DISABLE` / `ARCH_SHSTK_STATUS` family manages per-thread shadow stack state.
8. Modify the kernel → user return: set `rip` = handler entry point, `rdi` = signum, `rsi` = `&frame.info` (if SA_SIGINFO), `rdx` = `&frame.uc` (if SA_SIGINFO), `rsp` = the computed stack pointer from step 2.
9. Set `uc.uc_flags = 0` (reserved for future use by the kernel; glibc checks it).
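The RSP alignment computation above can be sketched as a few lines of arithmetic. This is an illustrative model, not the kernel's actual code; the `FRAME_SIZE` constant is hypothetical (in practice it is `size_of::<RtSigFrame>()` plus the runtime-determined XSAVE area size):

```rust
/// Hypothetical total frame size: RtSigFrame plus the XSAVE area.
const FRAME_SIZE: u64 = 0x4C0;

/// Carve out the frame below the user RSP, align down to 16 bytes,
/// then subtract 8 to simulate the return-address push of a `call`.
fn signal_handler_rsp(user_rsp: u64, frame_size: u64) -> u64 {
    let rsp = user_rsp - frame_size;
    (rsp & !0xF) - 8
}

fn main() {
    let rsp = signal_handler_rsp(0x7fff_ffff_e000, FRAME_SIZE);
    // SysV ABI invariant: at function entry, rsp + 8 is 16-byte aligned.
    assert_eq!((rsp + 8) % 16, 0);
    println!("handler rsp = {rsp:#x}");
}
```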
sigreturn: When the handler returns, the trampoline executes SYS_rt_sigreturn
(syscall number 15 on x86-64). The kernel reads RtSigFrame from RSP, validates
the restored register state (see security validation below), restores all registers
from uc.uc_mcontext, restores the signal mask from uc.uc_sigmask (blocking signals
that the kernel added for handler execution), restores the PKRU value from
uc.uc_mcontext.pkru_val (x86-64), validates the CET shadow stack restore token
(if CET-SS is active), validates FP/SIMD state (see table below), restores
FP state from the save area, and resumes execution at the saved rip.
Security validation in sigreturn: Before restoring the register state from the signal frame, the kernel validates critical fields to prevent privilege escalation:
| Architecture | Register | Validation | Failure |
|---|---|---|---|
| x86-64 | `rip` | Must be below `TASK_SIZE` (userspace range). A frame pointing `rip` into kernel space would execute kernel code with user privilege state. | SIGSEGV |
| x86-64 | `cs` | Must be `USER_CS` (0x33 — GDT index 6, RPL=3, 64-bit code segment; defined in the GDT+TSS initialization, Section 2.2 Phase 1). Prevents restoring a kernel code segment. | SIGSEGV |
| x86-64 | `rflags` | IOPL must be 0, IF must be set, VM and RF must be clear. Prevents user code from disabling interrupts or entering V86 mode. | SIGSEGV |
| x86-64 | `ss` | Must be `USER_DS` (0x2B — GDT index 5, RPL=3, 64-bit data segment; defined in the GDT+TSS initialization, Section 2.2 Phase 1). Prevents restoring a kernel stack segment. | SIGSEGV |
| AArch64 | `pc` | Must be below `TASK_SIZE`. Must have bit 0 clear (AArch64 instructions are 4-byte aligned). | SIGSEGV |
| AArch64 | `pstate` | Must have EL=0 (user mode). DAIF must not mask interrupts beyond the user-permitted set. | SIGSEGV |
| ARMv7 | `arm_pc` | Must be below `TASK_SIZE`. | SIGSEGV |
| ARMv7 | `arm_cpsr` | Mode bits must be USR (0x10). IRQ/FIQ disable bits must be clear. | SIGSEGV |
| RISC-V | `sepc` (in `sc_regs[0]`) | Must be below `TASK_SIZE`. | SIGSEGV |
| PPC32/PPC64LE | `nip` | Must be below `TASK_SIZE`. | SIGSEGV |
| PPC64LE | `msr` | PR bit (problem state) must be set; HV bit must be clear. | SIGSEGV |
| s390x | `psw.addr` | Must be below `TASK_SIZE`. | SIGSEGV |
| s390x | `psw.mask` | Problem state bit must be set; DAT must be enabled. | SIGSEGV |
| LoongArch64 | `era` (in `sc_regs`) | Must be below `TASK_SIZE`. | SIGSEGV |
If any validation fails, the kernel sends SIGSEGV (si_code = SI_KERNEL) to the
task (corrupted frame). This prevents a malicious signal handler from crafting a frame
that elevates privilege on return.
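The x86-64 rows of the table can be collapsed into a single predicate. The sketch below is illustrative, not the kernel's code; the `TASK_SIZE` value is an assumption (a typical x86-64 user-address limit), while the segment selectors and RFLAGS bit positions follow the constants named in the table:

```rust
// Assumed user-address ceiling; USER_CS/USER_DS per Section 2.2 Phase 1.
const TASK_SIZE: u64 = 0x0000_7FFF_FFFF_F000;
const USER_CS: u16 = 0x33;
const USER_DS: u16 = 0x2B;
const RFLAGS_IF: u64 = 1 << 9;        // interrupt enable
const RFLAGS_IOPL: u64 = 3 << 12;     // I/O privilege level
const RFLAGS_RF: u64 = 1 << 16;       // resume flag
const RFLAGS_VM: u64 = 1 << 17;       // virtual-8086 mode

/// True iff the restored frame passes every x86-64 check in the table.
fn sigreturn_frame_ok(rip: u64, cs: u16, ss: u16, rflags: u64) -> bool {
    rip < TASK_SIZE                          // no kernel-space return address
        && cs == USER_CS                     // user code segment only
        && ss == USER_DS                     // user stack segment only
        && rflags & RFLAGS_IOPL == 0         // IOPL must stay 0
        && rflags & RFLAGS_IF != 0           // interrupts must remain enabled
        && rflags & (RFLAGS_VM | RFLAGS_RF) == 0
}

fn main() {
    assert!(sigreturn_frame_ok(0x40_0000, 0x33, 0x2B, RFLAGS_IF));
    // Kernel rip, kernel CS, or cleared IF take the SIGSEGV path:
    assert!(!sigreturn_frame_ok(0xFFFF_8000_0000_0000, 0x33, 0x2B, RFLAGS_IF));
    assert!(!sigreturn_frame_ok(0x40_0000, 0x10, 0x2B, RFLAGS_IF));
    assert!(!sigreturn_frame_ok(0x40_0000, 0x33, 0x2B, 0));
}
```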
FPU/SIMD state validation on rt_sigreturn: The signal frame's FP/SIMD state
is user-writable (the signal handler can modify it). The kernel MUST validate it
before restoring, or a malicious handler can inject invalid FPU state that causes
undefined behavior, information leakage (reading stale kernel FPU state from reserved
bits), or denial of service (#MF/#XF exceptions on the next FP instruction).
| Architecture | FP Save Area | Validation |
|---|---|---|
| x86-64 | XSAVE/FXSAVE area at `uc.uc_mcontext.fpstate` | `xstate_bv & ~XCR0` must be 0 (no unsupported features). `mxcsr & MXCSR_RESERVED_MASK` must be 0. XSAVE area size must not exceed `xstate_size` (from CPUID leaf 0Dh). Each XSAVE component's offset and size must match the CPUID-reported layout. If any check fails: force SIGSEGV. |
| AArch64 | FPSIMD context in `__reserved[]` area (tagged `FPSIMD_MAGIC`) | `fpsr` reserved bits must be 0. `fpcr` reserved bits must be 0. SVE state (if present, tagged `SVE_MAGIC`): VL must match the current thread's VL. |
| ARMv7 | VFP state in `uc_vfp` | `fpscr` reserved bits must be 0. Number of VFP registers must match hardware (16 for VFPv2, 32 for VFPv3/NEON). |
| RISC-V | `__riscv_d_ext_state` in sigcontext | `fcsr & ~0xFF` must be 0 (only `frm` [7:5] and `fflags` [4:0] are valid). |
| PPC32 | FPR save area in mcontext | `fpscr` reserved bits must be 0. Number of FPRs must be 32. |
| PPC64LE | FPR + VSX save area in mcontext | `fpscr` reserved bits must be 0. If VSX state present: `vscr` reserved bits must be 0. |
| s390x | FP control + FPR area in `_sigregs` | `fpc` reserved bits must be 0 (bits 31:16 except DXC, bits 7:3). |
| LoongArch64 | FP context in `sctx_info` chain (`FPU_CTX_MAGIC`) | `fcsr0` reserved bits (31:30, 28:25) must be 0. |
If validation fails on any architecture, the kernel does not restore the invalid
state. Instead it delivers SIGSEGV (si_code = SI_KERNEL) to the process —
the same behavior as Linux when rt_sigreturn encounters a corrupt signal frame.
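The x86-64 row's first two checks can be sketched as a standalone predicate. This is illustrative only: the `XCR0` and `MXCSR_RESERVED_MASK` values below are example assumptions (in the kernel both come from CPUID at boot), and the size/offset checks from the table are elided:

```rust
// Example values — in the kernel, XCR0 and the MXCSR mask come from CPUID.
const XCR0: u64 = 0b111;                      // x87 | SSE | AVX enabled
const MXCSR_RESERVED_MASK: u32 = 0xFFFF_0000; // bits 31:16 reserved

/// True iff the user-supplied FP state passes the xstate_bv / MXCSR checks.
fn fpstate_ok(xstate_bv: u64, mxcsr: u32) -> bool {
    xstate_bv & !XCR0 == 0              // no features beyond what XCR0 enables
        && mxcsr & MXCSR_RESERVED_MASK == 0
}

fn main() {
    assert!(fpstate_ok(0b011, 0x1F80));        // x87+SSE, default MXCSR
    assert!(!fpstate_ok(1 << 9, 0x1F80));      // feature bit not in XCR0
    assert!(!fpstate_ok(0b011, 0xDEAD_0000));  // reserved MXCSR bits set
}
```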
8.5.4.1 Signal Frame Layouts — All Architectures¶
Each architecture's signal frame is ABI-fixed by Linux. UmkaOS must produce binary-identical
frames so that unmodified glibc signal trampolines and sigreturn work correctly. The
implementation generates arch-specific frames; this section defines the ABI contract for each.
All architectures share the same overall structure: rt_sigframe contains a SigInfo, a
UContext (with uc_mcontext holding the register save area), and an FP/vector state region.
The frame is placed on the user stack (or the alternate signal stack if SA_ONSTACK is set).
The handler's return address points to a sigreturn trampoline (in the vDSO or sa_restorer)
that invokes SYS_rt_sigreturn to restore the saved context.
AArch64 (struct sigcontext — Linux arch/arm64/include/uapi/asm/sigcontext.h):
| Field | Type | Description |
|---|---|---|
| `fault_address` | `u64` | Faulting virtual address |
| `regs[31]` | `[u64; 31]` | General-purpose registers x0–x30 |
| `sp` | `u64` | Stack pointer |
| `pc` | `u64` | Program counter |
| `pstate` | `u64` | PSTATE (condition flags, execution state) |
| `__reserved[4096]` | `[u8; 4096]` | Extension context area (16-byte aligned) |
Total fixed size: 280 bytes of fixed fields (padded to 288 so `__reserved` is 16-byte aligned) + 4096 bytes reserved = 4384 bytes. The `__reserved` region
contains a chain of tagged extension blocks, each prefixed by an _aarch64_ctx header
(magic: u32, size: u32). Standard blocks: fpsimd_context (FPSIMD_MAGIC 0x46508001,
528 bytes: fpsr, fpcr, 32×128-bit vregs), esr_context (ESR_MAGIC 0x45535201, 16
bytes), optional sve_context for SVE state, optional za_context for SME ZA state,
optional zt_context for SME2 ZT0 state. The chain is terminated by a null
_aarch64_ctx (magic = 0, size = 0). If the total extension context exceeds
__reserved, an extra_context record points to additional stack-allocated space.
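The tagged-block chain can be modeled in a few lines. This is an illustrative sketch, not kernel code: `collect_magics` is a hypothetical helper that walks the `(magic, size)` headers until it hits the null terminator, using the magic values quoted above:

```rust
const FPSIMD_MAGIC: u32 = 0x4650_8001;
const ESR_MAGIC: u32 = 0x4553_5201;

fn read_u32(b: &[u8], off: usize) -> u32 {
    u32::from_le_bytes([b[off], b[off + 1], b[off + 2], b[off + 3]])
}

/// Walk the extension chain, returning each block's magic in order.
/// Each block's `size` covers its 8-byte header plus payload.
fn collect_magics(reserved: &[u8]) -> Vec<u32> {
    let mut out = Vec::new();
    let mut off = 0;
    loop {
        let magic = read_u32(reserved, off);
        let size = read_u32(reserved, off + 4);
        if magic == 0 && size == 0 {
            return out; // null _aarch64_ctx terminates the chain
        }
        out.push(magic);
        off += size as usize;
    }
}

fn main() {
    // Minimal chain: fpsimd (528 bytes), esr (16 bytes), then terminator.
    let mut buf = vec![0u8; 4096];
    buf[0..4].copy_from_slice(&FPSIMD_MAGIC.to_le_bytes());
    buf[4..8].copy_from_slice(&528u32.to_le_bytes());
    buf[528..532].copy_from_slice(&ESR_MAGIC.to_le_bytes());
    buf[532..536].copy_from_slice(&16u32.to_le_bytes());
    // bytes 544..552 stay zero = terminator
    assert_eq!(collect_magics(&buf), vec![FPSIMD_MAGIC, ESR_MAGIC]);
}
```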
SVE signal frame extension (sve_context, SVE_MAGIC 0x53564501):
| Field | Type | Description |
|---|---|---|
| `vl` | `u16` | Vector length in bytes (from ZCR_EL1.LEN × 128 / 8) |
| `flags` | `u16` | Bit 0 (`SVE_SIG_FLAG_SM`): 1 = streaming SVE mode (SME SSVE) |
| `__reserved[2]` | `[u16; 2]` | Padding |
| `vregs[...]` | variable | Z0–Z31 (each `vl` bytes), P0–P15 (each `vl/8` bytes), FFR (`vl/8` bytes) |
Total size: SVE_SIG_CONTEXT_SIZE(vl) = header + 32×vl + 16×vl/8 + vl/8 bytes.
For VL=512 (SVE 512-bit): 32×64 + 16×8 + 8 = 2184 bytes + header.
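The size formula is easy to check mechanically. A minimal sketch (the payload only; the `_aarch64_ctx` header and `sve_context` fixed fields are excluded):

```rust
/// Payload bytes of an SVE signal context for a vector length of `vl` bytes:
/// 32 Z registers (vl each) + 16 predicate registers (vl/8 each) + FFR (vl/8).
fn sve_sig_payload_size(vl: usize) -> usize {
    32 * vl + 16 * (vl / 8) + vl / 8
}

fn main() {
    // VL = 512 bits => vl = 64 bytes => 2048 + 128 + 8 = 2184 (as above)
    assert_eq!(sve_sig_payload_size(64), 2184);
    // VL = 128 bits (architectural minimum) => 512 + 32 + 2 = 546
    assert_eq!(sve_sig_payload_size(16), 546);
}
```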
SME signal frame extensions:
- za_context (ZA_MAGIC 0x54366345): SVL (streaming vector length) + full ZA tile
state (SVL × SVL bytes). Only saved when PSTATE.ZA = 1 (ZA is enabled). If
PSTATE.ZA = 0, ZA context is omitted to avoid saving potentially large zero state.
- zt_context (ZT_MAGIC 0x5a544e01, SME2 only): ZT0 register (512 bytes).
SVE/SME flag validation on sigreturn: The kernel validates all SVE/SME flag
combinations on rt_sigreturn. Invalid combinations (e.g., SVE_SIG_FLAG_SM set
without FEAT_SME, ZA context present without PSTATE.ZA in the saved PSTATE, or
VL/SVL exceeding the hardware maximum) cause sigreturn to return -EINVAL and the
task receives SIGSEGV. On SME-only systems (FEAT_SME without FEAT_SVE — rare but
allowed by the architecture), the signal frame allocates space for the SVE context
header even though SVE is not independently usable — the streaming SVE (SSVE) state
is saved in the sve_context with SVE_SIG_FLAG_SM=1.
Sigreturn trampoline: vDSO __kernel_rt_sigreturn executes svc #0 with x8 = __NR_rt_sigreturn (139).
ARMv7 (struct sigcontext — Linux arch/arm/include/uapi/asm/sigcontext.h):
| Field | Type | Description |
|---|---|---|
| `trap_no` | `u32` | Trap number |
| `error_code` | `u32` | Error code |
| `oldmask` | `u32` | Old signal mask |
| `arm_r0`–`arm_r10` | `[u32; 11]` | General-purpose registers r0–r10 |
| `arm_fp` | `u32` | Frame pointer (r11) |
| `arm_ip` | `u32` | Intra-procedure-call scratch register (r12) |
| `arm_sp` | `u32` | Stack pointer (r13) |
| `arm_lr` | `u32` | Link register (r14) |
| `arm_pc` | `u32` | Program counter (r15) |
| `arm_cpsr` | `u32` | Current program status register |
| `fault_address` | `u32` | Data fault address |
Total: 21 fields × 4 bytes = 84 bytes (without VFP). VFP/NEON state (struct
vfp_sigframe: 32×64-bit double registers + fpscr + magic) follows the rt_sigframe
on the stack when VFP is present.
Sigreturn trampoline: vDSO __kernel_rt_sigreturn executes svc #0 with r7 = __NR_rt_sigreturn (173).
RISC-V 64 (struct sigcontext — Linux arch/riscv/include/uapi/asm/sigcontext.h):
| Field | Type | Description |
|---|---|---|
| `sc_regs` | `[u64; 32]` | [0] = PC, [1] = x1 (ra), [2] = x2 (sp), …, [31] = x31 (t6). Follows the Linux `user_regs_struct` layout. |
| `sc_fpregs` | `[u64; 66]` | FP state: 32 double regs + fcsr (16-byte aligned) |
Total: 256 + 528 = 784 bytes (with alignment padding, actual size may be slightly larger).
The sc_regs array follows the Linux user_regs_struct layout: sc_regs[0] is the saved
PC (not x0, which is hardwired to zero and has no entry), sc_regs[1] = x1 (ra),
sc_regs[2] = x2 (sp), through sc_regs[31] = x31 (t6). Vector extension state (V)
uses an extensible __riscv_ctx_hdr chain appended after the base context (analogous to
AArch64's extension mechanism).
RISC-V Vector (RVV) signal frame extension:
When the task has RVV state (detected via riscv,isa containing v or _v), the signal
frame includes a __riscv_v_ext_state block appended after the base FP context in the
__riscv_ctx_hdr chain:
| Field | Type | Description |
|---|---|---|
| `vl` | `u64` | Vector length (CSR_VL) |
| `vtype` | `u64` | Vector type (CSR_VTYPE — encodes SEW, LMUL, vta, vma) |
| `vcsr` | `u64` | Vector CSR (CSR_VCSR — vxrm[2:1] + vxsat[0]) |
| `vlenb` | `u64` | Vector register byte length (CSR_VLENB = VLEN/8) |
| `vstart` | `u64` | Vector start position (CSR_VSTART) |
| `datap` | `*mut u8` | Pointer to vector register data (v0–v31, each `vlenb` bytes) |
Total vector data: 32 × vlenb bytes (e.g., 32 × 16 = 512 bytes for VLEN=128,
32 × 64 = 2048 bytes for VLEN=512). The datap pointer references additional stack
space allocated beyond the base signal frame.
Critical: when saving RVV state to the signal frame, the kernel must not use
vector-accelerated memory copies (vse*.v instructions) for the vstate_save operation.
The RVV registers hold the task's live vector state — using vector instructions to save
them would corrupt the values being saved. The save path uses scalar sd/ld loops
to copy the register file to memory. The same restriction applies to vstate_restore
during sigreturn.
RVV 0.7.1 (XTheadVector) signal frame handling: On T-Head C906/C910 cores that
implement the non-standard XTheadVector (draft V 0.7.1), the signal frame uses the
same __riscv_v_ext_state layout but the CSR addresses differ
(th.vl = 0xC20, th.vtype = 0xC21, etc.). The kernel detects XTheadVector via
the errata flag RiscvErrata::RVV_071_COMPAT and uses the vendor-specific CSR
addresses in the save/restore path. Standard V (1.0) and XTheadVector are mutually
exclusive — a core implements one or the other, never both.
Sigreturn trampoline: vDSO __vdso_rt_sigreturn executes ecall with a7 = __NR_rt_sigreturn (139).
PPC32 (struct mcontext — Linux arch/powerpc/include/uapi/asm/sigcontext.h):
| Field | Type | Description |
|---|---|---|
| `gp_regs[48]` | `[u32; 48]` | 32 GPRs + nip, msr, orig_r3, ctr, lr, xer, ccr, mq, trap, dar, dsisr, result + pad |
| `fp_regs[66]` | `[f64; 33]` | 32 FP double regs + fpscr (represented as `[u64; 33]`) |
| `v_regs` | `*mut vrregset_t` | Pointer to AltiVec register state (optional) |
Total mcontext: approximately 480 bytes (192 GPR + 264 FP + padding + pointer).
The sigcontext wraps mcontext with additional fields (signal, handler, oldmask,
regs pointer). The rt_sigframe contains siginfo_t, ucontext_t (with embedded
mcontext), and optional AltiVec/VSX state appended on the stack.
Sigreturn trampoline: vDSO __kernel_sigtramp_rt32 executes sc with r0 = __NR_rt_sigreturn (172).
PPC64LE (struct mcontext — Linux arch/powerpc/include/uapi/asm/sigcontext.h):
| Field | Type | Description |
|---|---|---|
| `gp_regs[48]` | `[u64; 48]` | 32 GPRs + nip, msr, orig_r3, ctr, lr, xer, ccr, softe, trap, dar, dsisr, result + pad |
| `fp_regs[66]` | `[f64; 33]` | 32 FP double regs + fpscr (represented as `[u64; 33]`) |
| `v_regs` | `*mut vrregset_t` | Pointer to VMX (AltiVec) register state (optional) |
Total mcontext: approximately 696 bytes (384 GPR + 264 FP + padding + pointer).
PPC64LE uses ELFv2 ABI; the rt_sigframe contains siginfo_t, ucontext_t (with
uc_mcontext holding the full mcontext), and optional VMX/VSX state. The transactional
memory (TM) checkpoint context, if active, is saved in a second ucontext on the stack.
Sigreturn trampoline: vDSO __kernel_sigtramp_rt64 executes sc with r0 = __NR_rt_sigreturn (172).
s390x and LoongArch64 signal frame layouts follow the same structural pattern
(siginfo + ucontext + FP state) with architecture-specific register save areas.
LoongArch64 uses 16-byte stack alignment per the LoongArch ABI, and the FP save area
uses a sctx_info chain format (FPU_CTX_MAGIC header, 32×64-bit FP registers + fcsr0).
s390x uses the _sigregs layout (PSW + 16 GPRs + 16 access registers + FP control +
16 FP registers). Full struct definitions are in the respective
umka-core/src/arch/<arch>/signal.rs files.
Implementation note: Each architecture implements arch::current::signal::build_signal_frame()
and arch::current::signal::restore_signal_frame(). The frame struct definitions live in
umka-core/src/arch/<arch>/signal.rs and are #[repr(C)] to match the Linux binary layout.
The generic signal delivery code in umka-core/src/signal.rs calls these arch functions
without knowledge of the specific register set or frame format.
8.5.5 Signal-Related System Calls¶
8.5.5.1.1 kill(pid, sig) → Result<(), Errno>¶
Send signal sig to a target specified by pid:
- pid > 0: send to the process with that PID.
- pid == 0: send to every process in the sender's process group.
- pid == -1: send to every process the sender has permission to signal,
except PID 1 (init).
- pid < -1: send to process group |pid|.
Returns ESRCH if no target process exists, EPERM if the caller lacks permission. Signal 0 performs a permission check only (no signal delivered); this is used to test process existence.
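The four `pid` cases above can be sketched as a small decoder. This is illustrative only — `KillTarget` and `decode_kill_pid` are hypothetical names, and the permission checks and actual delivery are elided:

```rust
/// Hypothetical target decoding for kill(pid, sig).
#[derive(Debug, PartialEq)]
enum KillTarget {
    Pid(i32),     // pid > 0: one specific process
    SenderGroup,  // pid == 0: sender's process group
    All,          // pid == -1: every signalable process except init
    Group(i32),   // pid < -1: process group |pid|
}

fn decode_kill_pid(pid: i32) -> KillTarget {
    match pid {
        p if p > 0 => KillTarget::Pid(p),
        0 => KillTarget::SenderGroup,
        -1 => KillTarget::All,
        p => KillTarget::Group(-p),
    }
}

fn main() {
    assert_eq!(decode_kill_pid(1234), KillTarget::Pid(1234));
    assert_eq!(decode_kill_pid(0), KillTarget::SenderGroup);
    assert_eq!(decode_kill_pid(-1), KillTarget::All);
    assert_eq!(decode_kill_pid(-5), KillTarget::Group(5));
}
```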
8.5.5.1.2 tgkill(tgid, tid, sig) → Result<(), Errno>¶
Send sig to thread tid within thread group tgid. This is the correct way to
direct a signal to a specific thread. Returns ESRCH if tid does not exist within
tgid, preventing the race in tkill() where the TID may have been recycled.
8.5.5.1.3 tkill(tid, sig) → Result<(), Errno>¶
Legacy single-argument form: send sig to thread tid without verifying the
thread group. Retained for compatibility; new code should use tgkill().
8.5.5.1.4 sigqueue(pid, sig, value) → Result<(), Errno>¶
Send a real-time signal sig to process pid with an attached SigVal payload
value. Semantics are the same as kill() for the pid argument. The value is
stored in SigInfo::si_value (union: sival_int or sival_ptr), and si_code is
set to SI_QUEUE. Only meaningful for RT signals (34–64); for standard signals, the
payload is carried in the queued SigInfo but only one instance is queued.
8.5.5.1.5 sigaction(sig, act, oldact) → Result<(), Errno>¶
Install a new disposition act for signal sig, returning the previous disposition
in oldact (if non-null). Rejects attempts to change SIGKILL or SIGSTOP (EINVAL).
The disposition is process-wide and shared among all threads. The act.sa_mask is
sanitized: SIGKILL and SIGSTOP bits are cleared.
8.5.5.1.6 sigprocmask(how, set, oldset) → Result<(), Errno>¶
Modify the calling thread's signal mask. how is one of:
- SIG_BLOCK: add set to mask.
- SIG_UNBLOCK: remove set from mask.
- SIG_SETMASK: replace mask with set.
SIGKILL and SIGSTOP bits in set are silently ignored. After modification, if any
previously blocked signal is now unblocked and pending, TIF_SIGPENDING is set so
the signal is delivered at the next kernel exit.
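The three `how` modes and the SIGKILL/SIGSTOP sanitization can be sketched on a 64-bit signal-set word (bit N−1 representing signal N). An illustrative model, not the kernel's code; the `how` values 0/1/2 match the Linux `SIG_BLOCK`/`SIG_UNBLOCK`/`SIG_SETMASK` constants:

```rust
const SIGKILL: u64 = 1 << (9 - 1);
const SIGSTOP: u64 = 1 << (19 - 1);

/// Return the new mask given the current mask, `how`, and the user's set.
fn sigprocmask(mask: u64, how: u32, set: u64) -> u64 {
    let set = set & !(SIGKILL | SIGSTOP); // silently ignored per POSIX
    match how {
        0 => mask | set,  // SIG_BLOCK: add set to mask
        1 => mask & !set, // SIG_UNBLOCK: remove set from mask
        2 => set,         // SIG_SETMASK: replace mask with set
        _ => mask,        // EINVAL path elided in this sketch
    }
}

fn main() {
    let sigint = 1u64 << (2 - 1);
    // SIGKILL bit is dropped before the mask is updated:
    assert_eq!(sigprocmask(0, 0, sigint | SIGKILL), sigint);
    assert_eq!(sigprocmask(sigint, 1, sigint), 0);
    assert_eq!(sigprocmask(sigint, 2, SIGSTOP), 0);
}
```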
8.5.5.1.7 sigaltstack(ss, old_ss) → Result<(), Errno>¶
Register an alternate signal stack. If ss is non-null, sets the alternate stack
to the region [ss.ss_sp, ss.ss_sp + ss.ss_size) with flags ss.ss_flags. Flag
SS_DISABLE disables the alternate stack; SS_AUTODISARM clears the alternate
stack flag on signal entry (prevents recursive use without explicit re-arm).
Minimum size is MINSIGSTKSZ (architecture-dependent; typically 2 KiB for x86-64,
defined in the umka-sysapi header shim).
8.5.5.1.8 sigwaitinfo(set, info) / sigtimedwait(set, info, timeout)¶
Synchronously wait for any signal in set to become pending. Atomically dequeues
and returns it in info. These are thread-directed: only signals pending on the
calling thread or its process are considered. sigtimedwait adds a timespec
timeout; returns EAGAIN if it expires.
8.5.6 SA_RESTART and EINTR¶
When a signal interrupts a blocking syscall:
- If the installed handler has `SA_RESTART` set: the kernel automatically restarts the syscall by setting RIP back to the `syscall` instruction and re-entering the syscall handler. This is transparent to the user-space process.
- Otherwise: the syscall returns `-EINTR`. The process must restart manually (or use `TEMP_FAILURE_RETRY`-style looping).
Restartable syscalls (SA_RESTART causes automatic restart):
read, readv, write, writev, ioctl (when marked restartable by the driver),
wait4, waitpid, waitid, nanosleep, clock_nanosleep, pause, sigsuspend,
sigtimedwait, sigwaitinfo, poll, select, pselect6, ppoll,
epoll_wait, epoll_pwait, futex(FUTEX_WAIT), msgrcv, msgsnd,
semop, semtimedop, recvfrom, recvmsg, recvmmsg, sendto,
sendmsg, sendmmsg, connect, accept, accept4, open (if blocking on
a FIFO or device), openat.
Non-restartable syscalls (always return EINTR even with SA_RESTART):
sleep (returns remaining time), usleep, clock_nanosleep with
TIMER_ABSTIME (returns EINTR with no restart), io_getevents,
io_uring_enter (with IORING_ENTER_GETEVENTS when interrupted before any
completion), getrandom with GRND_NONBLOCK.
The distinction reflects whether the syscall can safely re-enter from the top without
corrupting partial progress. Syscalls that have already partially consumed data
(e.g. a partial read) complete and return the partial count; they are not restarted.
SA_RESTART and io_uring_enter: When io_uring_enter() is interrupted by a
signal with SA_RESTART, the syscall is NOT automatically restarted. This matches
Linux behavior: io_uring_enter with IORING_ENTER_GETEVENTS uses
ERESTARTSYS internally, but the actual restart decision depends on whether
the signal handler has SA_RESTART set AND whether the io_uring context allows
restart (it does, unless IORING_SETUP_IOPOLL is set). The standard restart
mechanism (below) handles this.
8.5.6.1 Syscall Restart Mechanism¶
When a syscall is interrupted by a signal, the kernel must decide whether to restart the syscall or return an error. This decision is mediated by internal error codes that never reach userspace:
1. The interrupted syscall returns one of four internal error codes:
   - `-ERESTARTSYS`: restart if SA_RESTART is set on the interrupting signal's handler, else return `-EINTR` to userspace.
   - `-ERESTARTNOHAND`: restart only if the signal's disposition is Default or Ignore (no user handler installed). If a handler is invoked, return `-EINTR`.
   - `-ERESTARTNOINTR`: always restart unconditionally, regardless of SA_RESTART or handler disposition. Used by `nanosleep` internals.
   - `-ERESTART_RESTARTBLOCK`: restart via the task's `restart_block` (see below).
2. The signal delivery path in `do_signal()` checks the interrupted syscall's return value (stored in the architecture's result register: `rax` on x86-64, `x0` on AArch64, `a0` on RISC-V):
   - If `-ERESTARTSYS` and the handler has SA_RESTART: rewind the instruction pointer to the syscall entry point (`rip -= 2` on x86-64 for the 2-byte `syscall` instruction; adjust `pc` on other architectures), restore the original syscall arguments in registers, and let the kernel re-execute the syscall after the handler returns via `sigreturn`.
   - If `-ERESTARTSYS` and the handler does NOT have SA_RESTART: change the result register to `-EINTR`, deliver the signal normally.
   - If `-ERESTARTNOHAND` and a user handler is being invoked: change the result register to `-EINTR`.
   - If `-ERESTARTNOINTR`: always rewind IP and re-execute (unconditional restart).
   - If `-ERESTART_RESTARTBLOCK`: set `task.restart_block` to the restart function provided by the interrupted syscall. Rewrite the syscall number register to `__NR_restart_syscall` (x86-64: 219) and rewind IP. After the signal handler returns via `sigreturn`, the kernel enters `sys_restart_syscall()`, which calls `task.restart_block.execute()`.
3. RestartBlock variants: used by syscalls that need to adjust their timeout or other parameters on restart (a simple IP rewind would restart with the original, now-stale timeout):
/// Per-task restart block for `ERESTART_RESTARTBLOCK` syscalls.
/// Stored in `Task.restart_block`. Set by the interrupted syscall,
/// consumed by `sys_restart_syscall()` after signal handler returns.
///
/// **Canonical definition** — the only definition of `RestartBlock` in the
/// spec. See [Section 8.1](#process-and-task-management) for the `Task.restart_block`
/// field that stores this enum.
///
/// Linux equivalent: `struct restart_block` in `include/linux/restart_block.h`.
pub enum RestartBlock {
/// No restart pending (default state).
None,
/// `nanosleep()` / `clock_nanosleep()`: restart with remaining time.
/// Linux: `restart_block.nanosleep`.
Nanosleep {
/// Clock ID (`CLOCK_MONOTONIC`, `CLOCK_REALTIME`, etc.).
/// Linux: `clockid_t` (`int` = `i32`).
clockid: i32,
/// Nanosleep type flags (absolute vs relative, etc.).
/// Linux: `restart_block.nanosleep.type`.
flags: u32,
/// Remaining sleep time in nanoseconds.
remaining_ns: u64,
},
/// `futex(FUTEX_WAIT)` / `futex_wait_bitset()`: restart with adjusted timeout.
/// Linux: `restart_block.futex`.
FutexWait {
/// Userspace futex address (stored as `usize` — kernel-internal,
/// not a raw pointer to avoid lifetime issues in the restart block).
uaddr: usize,
/// Expected value at `uaddr`.
val: u32,
/// Futex operation flags (FUTEX_PRIVATE_FLAG, FUTEX_CLOCK_REALTIME, etc.).
/// Linux: `restart_block.futex.flags`.
flags: u32,
/// Bitset mask (0xFFFFFFFF for plain `futex_wait`).
/// Linux: `restart_block.futex.bitset`.
bitset: u32,
/// Absolute timeout in nanoseconds (0 = no timeout).
timeout_ns: u64,
},
/// `poll()` / `ppoll()`: restart with adjusted timeout.
/// Linux: `restart_block.poll`.
PollTimeout {
/// Absolute timeout in nanoseconds (monotonic).
timeout_ns: u64,
},
/// POSIX timer: re-arm with remaining interval.
/// UmkaOS addition — Linux uses `it_requeue_pending` instead.
PosixTimer {
/// POSIX timer ID.
timer_id: i32,
},
}
/// sys_restart_syscall() — entered when sigreturn detects __NR_restart_syscall.
fn sys_restart_syscall(task: &Task) -> isize {
match task.restart_block.take() {
RestartBlock::None => -EINTR as isize,
RestartBlock::Nanosleep { clockid, flags, remaining_ns } => {
sys_clock_nanosleep_restart(clockid, flags, remaining_ns)
}
RestartBlock::FutexWait { uaddr, val, flags, bitset, timeout_ns } => {
sys_futex_wait_restart(uaddr, val, flags, bitset, timeout_ns)
}
RestartBlock::PollTimeout { timeout_ns } => {
sys_poll_restart(timeout_ns)
}
RestartBlock::PosixTimer { timer_id } => {
sys_timer_restart(timer_id)
}
}
}
The restart block is a per-task field, not per-CPU, because the signal handler
may sleep and be context-switched before calling sigreturn. Only one restart
block can be active at a time (a task can only be in one interrupted syscall).
task.restart_block is set to RestartBlock::None after consumption.
8.5.6.2 Architecture-Specific Syscall Restart Functions¶
Each architecture defines the following functions in arch::current::signal:
/// Read the original syscall number from the per-arch ArchContext.
/// Each architecture's syscall entry path saves the original syscall number
/// (before the syscall handler may modify the result register) into
/// `ArchContext.orig_syscall_nr: u64`.
///
/// | Architecture | Saved from register | Field name |
/// |---|---|---|
/// | x86-64 | RAX (original, before handler overwrites RAX with result) | orig_syscall_nr |
/// | AArch64 | X8 (syscall number register) | orig_syscall_nr |
/// | ARMv7 | R7 (syscall number register) | orig_syscall_nr |
/// | RISC-V 64 | A7 (syscall number register) | orig_syscall_nr |
/// | PPC32/PPC64LE | R0 (syscall number register) | orig_syscall_nr |
/// | s390x | R1 (SVC code) | orig_syscall_nr |
/// | LoongArch64 | A7 (syscall number register) | orig_syscall_nr |
pub fn get_orig_syscall_nr(ctx: &ArchContext) -> u64;
/// Read the syscall result from the architecture's result register in
/// the saved pt_regs frame.
pub fn get_syscall_result(regs: &ArchRegs) -> isize;
/// Write a new syscall result into the architecture's result register.
/// Used to convert ERESTART* codes to EINTR before returning to userspace.
pub fn set_syscall_result(regs: &mut ArchRegs, val: isize);
/// Rewind the instruction pointer to the syscall entry point and restore
/// the original syscall number in the syscall number register. This causes
/// the syscall to be re-executed when the task returns to userspace.
///
/// Architecture-specific IP rewind:
/// - x86-64: `regs.rip -= 2` (2-byte `syscall` instruction)
/// - AArch64: `regs.pc -= 4` (4-byte `svc #0` instruction)
/// - ARMv7: `regs.pc -= 4` (4-byte `svc #0`)
/// - RISC-V: `regs.pc -= 4` (4-byte `ecall`)
/// - PPC32/PPC64LE: `regs.nip -= 4` (4-byte `sc`)
/// - s390x: `regs.psw.addr -= 4` (4-byte `svc`)
/// - LoongArch64: `regs.era -= 4` (4-byte `syscall 0`)
pub fn restart_syscall(regs: &mut ArchRegs);
/// Set up restart via restart_block: rewrite the syscall number register to
/// `__NR_restart_syscall` and rewind IP. The original arguments are preserved
/// in the restart_block, not in registers.
pub fn setup_restart_block(regs: &mut ArchRegs, task: &Task);
The orig_syscall_nr field is added to each architecture's ArchContext struct
(the per-task kernel stack frame). It is populated by the syscall entry path BEFORE
the handler is called, so that do_signal() can read it to determine the restart
behavior. This is the UmkaOS equivalent of Linux's orig_rax (x86-64),
orig_x0 (AArch64), orig_a7 (RISC-V), etc.
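The restart decision that do_signal() makes from these primitives can be sketched as a pure function. The ERESTART* sentinel values below follow Linux conventions (negated, outside the valid errno range) and are assumptions for illustration, not UmkaOS constants:

```rust
// Illustrative sketch of the restart decision in do_signal(). The
// ERESTART* values are assumed Linux-style sentinels, not UmkaOS constants.
const ERESTARTSYS: i64 = -512; // restart if SA_RESTART, else -EINTR
const ERESTARTNOINTR: i64 = -513; // always restart
const ERESTARTNOHAND: i64 = -514; // restart only if no handler ran
const ERESTART_RESTARTBLOCK: i64 = -516; // restart via restart_block

#[derive(Debug, PartialEq)]
enum RestartAction {
    Restart,      // rewind IP and re-execute (restart_syscall())
    ReturnEintr,  // rewrite the result register to -EINTR
    RestartBlock, // re-enter via __NR_restart_syscall (setup_restart_block())
    None,         // not a restart sentinel; leave the result alone
}

/// `handler_run` is true when a userspace handler is about to be invoked;
/// `sa_restart` reflects that handler's SA_RESTART flag.
fn restart_action(result: i64, handler_run: bool, sa_restart: bool) -> RestartAction {
    match result {
        ERESTARTNOINTR => RestartAction::Restart,
        ERESTARTSYS if !handler_run || sa_restart => RestartAction::Restart,
        ERESTARTSYS => RestartAction::ReturnEintr,
        ERESTARTNOHAND if !handler_run => RestartAction::Restart,
        ERESTARTNOHAND => RestartAction::ReturnEintr,
        ERESTART_RESTARTBLOCK if !handler_run => RestartAction::RestartBlock,
        ERESTART_RESTARTBLOCK => RestartAction::ReturnEintr,
        _ => RestartAction::None,
    }
}

fn main() {
    // read() interrupted while a SA_RESTART handler runs: transparent restart.
    assert_eq!(restart_action(ERESTARTSYS, true, true), RestartAction::Restart);
    // Same interruption without SA_RESTART: userspace sees EINTR.
    assert_eq!(restart_action(ERESTARTSYS, true, false), RestartAction::ReturnEintr);
    // nanosleep()-style syscalls resume with adjusted arguments via the block.
    assert_eq!(restart_action(ERESTART_RESTARTBLOCK, false, false), RestartAction::RestartBlock);
}
```

When no signal is delivered at all, none of the sentinels survive to the return path and the result register is left untouched.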
8.5.7 Signal Inheritance Across fork() and exec()¶
fork() / clone(): The child inherits a complete copy of the parent's signal
disposition table (sighand), the current signal mask of the cloning thread, and the
alternate signal stack descriptor. The child's pending signal sets are cleared: signals
pending in the parent are not delivered to the child. This prevents fork-bomb-style
cascades and matches POSIX semantics.
exec(): When a process calls execve(), caught signals (those with a user handler
installed) are reset to their default disposition; signals set to SIG_IGN remain
ignored. (POSIX leaves the post-exec disposition of an ignored SIGCHLD unspecified;
UmkaOS follows Linux and keeps it ignored.) The signal mask is unchanged.
Pending signals that were sent to the old image are retained and delivered to the
new image after exec completes. The alternate signal stack is cleared (the old stack
region is no longer mapped).
Thread creation (clone with CLONE_SIGHAND): The new thread inherits the process's
signal handler table (shared, not copied) and begins with an empty pending set. The
new thread's signal mask is copied from the creating thread's mask. The new thread
has no alternate signal stack; it must call sigaltstack() independently.
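The exec-time reset rule above can be sketched over a toy disposition table. The `Disposition` type and signal numbering are illustrative stand-ins for the kernel's real sighand table:

```rust
// Sketch of the exec()-time disposition reset. `Disposition` and the
// signal numbering are illustrative, not the kernel's actual types.
const NSIG: usize = 64;
const SIGCHLD: usize = 17;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Disposition {
    Default,
    Ignore,
    Handler(usize), // userspace handler address -- stale after exec
}

/// Reset for the new image: caught signals fall back to Default (their
/// handler addresses are meaningless in the new address space), Ignore is
/// preserved (including SIGCHLD, following Linux). The signal mask and
/// pending set are deliberately not touched here.
fn reset_on_exec(table: &mut [Disposition; NSIG]) {
    for d in table.iter_mut() {
        if let Disposition::Handler(_) = *d {
            *d = Disposition::Default;
        }
    }
}

fn main() {
    let mut table = [Disposition::Default; NSIG];
    table[SIGCHLD - 1] = Disposition::Ignore;      // ignored before exec
    table[1] = Disposition::Handler(0xdead_beef);  // SIGINT handler
    reset_on_exec(&mut table);
    assert_eq!(table[SIGCHLD - 1], Disposition::Ignore); // SIG_IGN survives exec
    assert_eq!(table[1], Disposition::Default);          // handler reset
}
```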
8.5.8 SIGCHLD and wait()¶
SIGCHLD is sent to a parent process whenever a child:
- Terminates (normal exit or killed by signal).
- Stops due to a job control signal (SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU), unless the
parent has set SA_NOCLDSTOP in its SIGCHLD disposition.
- Continues after being stopped (SIGCONT), unless the parent has set SA_NOCLDSTOP;
the parent observes the transition through wait4() with WCONTINUED.
wait4(pid, status, options, rusage) / waitpid(pid, status, options):
These syscalls block until a child matching pid changes state:
- pid > 0: wait for the specific child.
- pid == -1: wait for any child.
- pid == 0: wait for any child in the same process group.
- pid < -1: wait for any child in process group |pid|.
Options: WNOHANG (non-blocking; returns 0 if no child is waitable),
WUNTRACED (report stopped children), WCONTINUED (report continued children),
__WALL (wait for any child regardless of clone flags).
When a child transitions to a waitable state, the kernel:
1. Sets the child's exit status in Process::exit_status.
2. Sends SIGCHLD to the parent (unless parent set SA_NOCLDWAIT).
3. Wakes any parent blocked in wait4() / waitpid() / waitid().
4. If the child is a zombie (terminated, not yet reaped), the wait() call
consumes the zombie and releases the child's PID.
If the parent has set SA_NOCLDWAIT in its SIGCHLD action, or has set the SIGCHLD
disposition to SIG_IGN, children are automatically reaped on termination without
becoming zombies. A subsequent wait() blocks until all children have terminated and
then fails with ECHILD (matching Linux); if there are no children at all, ECHILD is
returned immediately.
waitid(idtype, id, infop, options): Extended form that fills a SigInfo struct
with child status information (exit code, signal, stop/continue cause) rather than
encoding it in a status word. WNOWAIT option peeks at the waitable child without
consuming it.
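The status word that wait4()/waitpid() return (and that waitid() avoids by filling SigInfo instead) follows the classic encoding. A minimal sketch of the encode/decode helpers, mirroring the glibc WIFEXITED macro family and shown here purely for illustration:

```rust
// Classic wait() status-word encoding, mirrored after the glibc macros.
fn encode_exited(code: u32) -> u32 { (code & 0xff) << 8 }
fn encode_signaled(sig: u32, core: bool) -> u32 { (sig & 0x7f) | if core { 0x80 } else { 0 } }
fn encode_stopped(sig: u32) -> u32 { (sig << 8) | 0x7f }
const CONTINUED: u32 = 0xffff;

fn wifexited(s: u32) -> bool { s & 0x7f == 0 }
fn wexitstatus(s: u32) -> u32 { (s >> 8) & 0xff }
// The signed-shift trick excludes both the exited (0x00) and stopped (0x7f) cases.
fn wifsignaled(s: u32) -> bool { ((((s & 0x7f) + 1) as u8 as i8) >> 1) > 0 }
fn wtermsig(s: u32) -> u32 { s & 0x7f }
fn wifstopped(s: u32) -> bool { s & 0xff == 0x7f }
fn wstopsig(s: u32) -> u32 { (s >> 8) & 0xff }
fn wifcontinued(s: u32) -> bool { s == CONTINUED }

fn main() {
    assert!(wifexited(encode_exited(3)));
    assert_eq!(wexitstatus(encode_exited(3)), 3);
    assert!(wifsignaled(encode_signaled(9, false)));  // killed by SIGKILL
    assert_eq!(wtermsig(encode_signaled(9, false)), 9);
    assert!(wifstopped(encode_stopped(19)));          // stopped by SIGSTOP
    assert_eq!(wstopsig(encode_stopped(19)), 19);
    assert!(wifcontinued(CONTINUED));
}
```

A stopped status (low byte 0x7f) is reported only when WUNTRACED was passed, and the CONTINUED sentinel only with WCONTINUED, per the options table above.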
8.6 Process Groups and Sessions¶
Process groups and sessions implement the POSIX job control model: the mechanism by which a shell manages sets of processes, routes terminal I/O signals, and controls which process group has foreground access to the controlling terminal.
8.6.1 Structures¶
8.6.1.1.1 ProcessGroup¶
A process group is a collection of processes that share a process group ID (PGID). Every process belongs to exactly one process group.
/// A collection of processes sharing a common process group ID.
///
/// # Invariants
/// - `pgid` equals the PID of the process group leader at the time the group
/// was created. The leader may subsequently exit; the group persists until
/// all members have exited.
/// - `session` never changes after creation.
/// - Every task linked via `members` corresponds to a live process whose
/// `Process::pgid` field equals this group's `pgid`.
pub struct ProcessGroup {
/// Process group identifier (PGID).
pub pgid: Pid,
/// Session this group belongs to.
pub session: Arc<Session>,
/// Members of this process group as an intrusive linked list.
/// Uses `Task::pid_group_node` embedded in each `Task` struct —
/// no heap allocation occurs under the spinlock.
pub members: SpinLock<IntrPidList>,
/// True if this group is the foreground process group of its session's
/// controlling terminal.
pub foreground: AtomicBool,
/// Number of member processes whose parent is in a *different* process
/// group within the *same* session — the "anchor" condition that prevents
/// the group from being orphaned. The counter is maintained incrementally
/// so that orphan detection on process exit is O(1) instead of O(members).
///
/// Incremented when a member gains such a parent (via `fork()` or
/// `setpgid()`). Decremented when such a parent exits or changes group.
/// Each `setpgid()`, `fork()`, or `exit()` operation touches at most
/// two groups (old and new).
pub parent_anchor_count: AtomicU32,
}
ProcessGroup and Session use intrusive linked lists for member tracking:
Task contains a pid_group_node: ListNode field that links it into its
process group's member chain. IntrPidList is a singly-linked intrusive list
through this node. The spinlock protects list pointer manipulation only; no
heap allocation occurs under the lock. Nodes are embedded in Task structs
(slab-allocated at task creation) and unlinked at task exit before the task's
slab memory is freed.
8.6.1.1.2 Session¶
A session is a collection of process groups sharing a controlling terminal and a
session ID (SID). The SID equals the PID of the session leader (the process that
called setsid()).
/// A collection of process groups sharing a controlling terminal and session ID.
///
/// # Invariants
/// - `sid` equals the PID of the process that created this session via `setsid()`.
/// - `leader` is None if the session leader has exited; the session persists
/// as long as any member process is alive.
/// - At most one process group in `process_groups` has `foreground == true`.
/// - `controlling_terminal` is None until `TIOCSCTTY` succeeds or a session
/// leader opens a terminal that becomes the controlling terminal.
pub struct Session {
/// Session identifier.
pub sid: Pid,
/// PID of the session leader, or None if it has exited.
pub leader: Option<Pid>,
/// The controlling terminal for this session, if any.
pub controlling_terminal: Option<Arc<Tty>>,
/// All process groups in this session, keyed by PGID (integer PID).
///
/// XArray: O(1) lookup by integer PID key with native RCU-compatible reads.
/// Process group changes are rare (only on `setpgid()`, `setsid()`, and
/// process exit); a typical session has O(1)–O(10) groups.
pub process_groups: XArray<Pid, Arc<ProcessGroup>>,
}
The Process struct (Section 8.1.1) carries two additional fields for group/session
membership:
pub struct Process {
// ... (existing fields from Section 8.1.1) ...
/// Process group ID.
pub pgid: Pid,
/// Session ID.
pub sid: Pid,
}
The global kernel state holds two lookup tables (XArray provides internal locking and RCU-compatible lock-free reads):
/// All live process groups, keyed by PGID (integer PID).
/// XArray: O(1) lookup by integer PID with native RCU-compatible reads.
/// XArray's internal lock replaces the external RwLock.
static PROCESS_GROUPS: XArray<Pid, Arc<ProcessGroup>>;
/// All live sessions, keyed by SID (integer PID).
/// XArray: O(1) lookup by integer PID with native RCU-compatible reads.
/// XArray's internal lock replaces the external RwLock.
static SESSIONS: XArray<Pid, Arc<Session>>;
8.6.2 System Calls¶
8.6.2.1.1 setpgid(pid, pgid) → Result<(), Errno>¶
Move process pid into process group pgid. If pid is 0, the caller is the
target. If pgid is 0, the target's own PID is used as the new PGID (creating a
new process group with the target as leader).
Preconditions enforced by the kernel:
- The target process must be either the caller itself, or a child of the caller
that has not yet called execve() (EACCES is returned after exec).
- The target process must be in the same session as the caller (EPERM if not).
- If pgid refers to an existing group, that group must be in the same session
(EPERM if not).
- A session leader cannot change its own PGID (EPERM).
Procedure:
1. Validate the preconditions above.
2. Determine the destination group: if pgid names an existing ProcessGroup, use
it; otherwise create a new ProcessGroup with pgid = requested PGID, inheriting the
target's session, and insert it into PROCESS_GROUPS and the session's
process_groups.
3. Remove the target PID from its previous ProcessGroup::members. If the old
group is now empty, remove it from PROCESS_GROUPS and Session::process_groups.
4. Add the target PID to the destination group's members and update Process::pgid.
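The precondition checks can be sketched as a pure function over a toy process table. `Proc` mirrors only the Process fields the checks need; the lookup plumbing is simplified, and a target that is neither the caller nor its child yields ESRCH here, per POSIX:

```rust
// Sketch of the setpgid() precondition checks over a toy process table.
#[derive(Clone, Copy)]
struct Proc {
    pid: u32,
    ppid: u32,
    pgid: u32,
    sid: u32,
    did_exec: bool, // has this process called execve()?
}

#[derive(Debug, PartialEq)]
enum Errno { Eacces, Eperm, Esrch }

fn check_setpgid(procs: &[Proc], caller: u32, pid: u32, pgid: u32) -> Result<(), Errno> {
    let find = |p: u32| procs.iter().find(|x| x.pid == p);
    let caller_p = find(caller).ok_or(Errno::Esrch)?;
    let target_pid = if pid == 0 { caller } else { pid };
    let target = find(target_pid).ok_or(Errno::Esrch)?;
    let new_pgid = if pgid == 0 { target.pid } else { pgid };

    // Target must be the caller or a not-yet-exec'd child of the caller.
    if target.pid != caller {
        if target.ppid != caller { return Err(Errno::Esrch); }
        if target.did_exec { return Err(Errno::Eacces); }
    }
    // Target must be in the caller's session.
    if target.sid != caller_p.sid { return Err(Errno::Eperm); }
    // A session leader may not change its own PGID.
    if target.pid == target.sid { return Err(Errno::Eperm); }
    // An existing destination group must be in the same session.
    if let Some(member) = procs.iter().find(|x| x.pgid == new_pgid) {
        if member.sid != target.sid { return Err(Errno::Eperm); }
    }
    Ok(())
}

fn main() {
    let shell = Proc { pid: 100, ppid: 1, pgid: 100, sid: 100, did_exec: true };
    let child = Proc { pid: 101, ppid: 100, pgid: 100, sid: 100, did_exec: false };
    let procs = [shell, child];
    // Shell moves its freshly fork()ed child into a new group led by the child.
    assert_eq!(check_setpgid(&procs, 100, 101, 0), Ok(()));
    // The session leader may not change its own PGID.
    assert_eq!(check_setpgid(&procs, 100, 0, 200), Err(Errno::Eperm));
}
```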
8.6.2.1.2 getpgid(pid) → Result<Pid, Errno>¶
Return the PGID of process pid. If pid is 0, return the caller's PGID.
Returns ESRCH if pid does not exist.
8.6.2.1.3 setsid() → Result<Pid, Errno>¶
Create a new session with the caller as session leader.
Precondition: The caller must not already be a process group leader (EPERM if it is, because allowing it would create a session whose SID conflicts with an existing PGID in another session).
Procedure:
1. Create a new Session with sid = caller's PID, leader = caller's PID,
controlling_terminal = None.
2. Create a new ProcessGroup with pgid = caller's PID, session = new session.
3. Add the new group to the new session's process_groups.
4. Remove the caller from its old process group (see the removal step of the
setpgid procedure above).
5. Update caller's Process::pgid = caller's PID, Process::sid = caller's PID.
6. Insert session and group into SESSIONS and PROCESS_GROUPS.
7. Return the new SID.
The caller is now isolated from its former controlling terminal: no controlling
terminal is associated with the new session. The caller must open a terminal device
and acquire it as the controlling terminal via TIOCSCTTY if required.
8.6.2.1.4 getsid(pid) → Result<Pid, Errno>¶
Return the SID of process pid. If pid is 0, return the caller's SID. Returns
ESRCH if pid does not exist. Some implementations return EPERM if pid is in a
different session; UmkaOS returns the SID unconditionally (matches Linux behavior).
8.6.2.1.5 tcsetpgrp(fd, pgid) → Result<(), Errno>¶
Set the foreground process group of the terminal referred to by fd to pgid.
fd must refer to the controlling terminal of the calling process's session.
Preconditions:
- fd must be an open file descriptor for a terminal (ENOTTY otherwise).
- The terminal must be the controlling terminal of the calling process's session
(ENOTTY otherwise).
- The process group pgid must exist and be in the same session (EPERM if not).
- The calling process must be in the foreground group, or have SIGTTOU unblocked
and not ignored (otherwise SIGTTOU is sent to the calling process's group first).
Procedure:
1. Clear foreground on the current foreground process group (if any).
2. Set foreground on the ProcessGroup with the given pgid.
3. Update Tty::foreground_pgid (see Section 21.1).
8.6.2.1.6 tcgetpgrp(fd) → Result<Pid, Errno>¶
Return the PGID of the foreground process group of the terminal fd. Returns ENOTTY
if fd is not a terminal or not the session's controlling terminal.
8.6.3 Job Control Signals¶
Job control signals mediate access between process groups and the controlling terminal. The TTY layer (Section 21.1) generates these signals in response to hardware events or process I/O attempts.
8.6.3.1.1 SIGTSTP (signal 20)¶
Sent to the foreground process group of the controlling terminal when the
terminal's ISIG flag is set and the user types the SUSP character (typically
Ctrl+Z, character code 26). All processes in the foreground group receive SIGTSTP
simultaneously.
Default action: STOP. Processes may catch SIGTSTP to perform cleanup before stopping (e.g., restore terminal settings). A handler that catches SIGTSTP must still stop the process afterward, typically by restoring the default disposition and re-raising SIGTSTP (or by raising SIGSTOP); otherwise the shell never observes the stop.
8.6.3.1.2 SIGTTIN (signal 21)¶
Sent to a background process group when one of its members attempts to read()
from the session's controlling terminal while not in the foreground group.
If the process group is orphaned (Section 8.6.4), or SIGTTIN is blocked or ignored
in the reading process, read() fails with EIO instead of stopping the process.
8.6.3.1.3 SIGTTOU (signal 22)¶
Sent to a process group when a background process attempts to write() to the
controlling terminal, but only when the terminal's TOSTOP output discipline flag
is set (via tcsetattr(TOSTOP)). If TOSTOP is not set, background writes proceed
without signal.
Same orphan and block/ignore exceptions as SIGTTIN apply: EIO is returned instead of stopping an orphaned or signal-ignoring process.
8.6.3.1.4 SIGCONT (signal 18)¶
Resumes a stopped process group. Sent explicitly by the shell (via kill -CONT or
the fg/bg builtins) or by the kernel as part of the SIGTSTP handling path.
When SIGCONT is delivered to a stopped process:
1. All tasks in the process group transition from TaskState::Stopped to
TaskState::Running (no state flags set).
2. SIGCHLD is sent to the parent process (with si_code = CLD_CONTINUED) unless
the parent has set SA_NOCLDSTOP.
3. Any pending SIGSTOP or SIGTSTP for the same process is discarded (SIGCONT
cancels pending stop signals within the same delivery cycle).
SIGCONT may be sent from any process that has permission to signal the target, not only the session leader or TTY.
8.6.3.1.5 Interaction with SIGKILL and SIGSTOP¶
SIGKILL terminates a stopped process without resuming it first. SIGSTOP can be sent to any process group regardless of foreground status; it bypasses the job control machinery and cannot be caught or ignored.
8.6.4 Orphaned Process Groups¶
A process group is orphaned (per POSIX.1-2017 definition) if it has no member process whose parent is in a different process group within the same session.
In other words, a group G in session S is orphaned when every process p in
G has its parent either:
- Outside session S entirely (e.g. the original parent exited and p was
reparented to init, which lives in a different session), or
- Also inside group G.
If any process in G has its parent in a different group G' ≠ G within the same
session S, then G is non-orphaned and the parent group is responsible for
signaling it.
Why orphaning matters for job control: An orphaned process group has no parent process in the session that could deliver SIGCONT to resume it. A stopped orphaned group would be permanently stuck. POSIX therefore requires the kernel to send SIGHUP followed immediately by SIGCONT to the orphaned group, giving processes a chance to handle the hangup or resume.
8.6.4.1.1 Detection Algorithm¶
The check runs in two situations:
- On exit(): when a process P exits, the kernel examines each process group G
in the same session that contains a child of P. If G was non-orphaned because
P was in a different group within the same session, G becomes orphaned after
P's exit, and any process in G is stopped, the kernel delivers SIGHUP then
SIGCONT to every process in G.
- On setpgid(): when a process leaves a group, the same check is performed for
any group that may now have become orphaned due to the membership change.
Implementation in do_exit():
fn check_orphaned_pgrps_on_exit(exiting: &Process) {
for child_pid in exiting.children.iter() {
let child = get_process(child_pid);
let child_pgrp = get_pgroup(child.pgid);
// Only examine groups in the same session.
if child.sid != exiting.sid { continue; }
// Only care about groups where the exiting process was the "anchor"
// (i.e., exiting was in a *different* group within the session).
if exiting.pgid == child.pgid { continue; }
// Decrement the anchor counter: this exiting parent was an anchor
// for the child's process group. Use Release on decrement + check
// the return value directly to avoid TOCTOU with a separate load.
let prev = child_pgrp.parent_anchor_count.fetch_sub(1, Release);
// prev == 1 means we were the last anchor; group just became orphaned.
//
// TOCTOU note: between the `fetch_sub` above and the
// `group_has_stopped_member()` check below, a concurrent `fork()` or
// `setpgid()` could increment parent_anchor_count back to non-zero,
// making the group no longer orphaned. The consequence is a spurious
// SIGHUP + SIGCONT delivery to a non-orphaned group — annoying but
// harmless: processes must handle or ignore SIGHUP, and SIGCONT to a
// running process is a no-op. Linux has the same inherent race
// (serialized by `tasklist_lock`, which is coarser but still has
// contention windows). Eliminating this race would require holding a
// global lock across the decrement + stopped-member check + signal
// delivery, which is disproportionate for a benign consequence.
if prev == 1 {
// If any member is stopped, deliver SIGHUP + SIGCONT.
if group_has_stopped_member(&child_pgrp) {
send_signal_to_pgrp(child.pgid, SIGHUP);
send_signal_to_pgrp(child.pgid, SIGCONT);
}
}
}
}
/// O(1) orphan check via the incrementally-maintained parent anchor counter.
/// A group is orphaned when no member has a parent in a different group
/// within the same session. The counter tracks exactly those relationships,
/// updated on `fork()`, `setpgid()`, and `exit()`.
fn is_orphaned(pgrp: &ProcessGroup) -> bool {
pgrp.parent_anchor_count.load(Relaxed) == 0
}
A group with no stopped members that becomes orphaned is not signaled — there is no need, because it will not be permanently stuck. The SIGHUP+SIGCONT pair is sent only to prevent unrecoverable stop.
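The increment side of the anchor counter (the decrement side appears in the exit path above) can be sketched as follows; `Pg` and `account_new_member` are illustrative stand-ins holding only the field this bookkeeping touches:

```rust
use std::sync::atomic::{AtomicU32, Ordering::{AcqRel, Relaxed}};

// Stand-in for ProcessGroup, reduced to the anchor-counter field.
struct Pg {
    parent_anchor_count: AtomicU32,
}

/// Called when a process becomes a member of `group` (at fork(), or when
/// setpgid() moves it there): its parent anchors the group only if the
/// parent sits in a different process group within the same session.
fn account_new_member(
    parent_pgid: u32,
    parent_sid: u32,
    member_pgid: u32,
    member_sid: u32,
    group: &Pg,
) {
    if parent_sid == member_sid && parent_pgid != member_pgid {
        group.parent_anchor_count.fetch_add(1, AcqRel);
    }
}

fn main() {
    let g = Pg { parent_anchor_count: AtomicU32::new(0) };
    // Plain fork(): the child starts in the parent's own group -- no anchor.
    account_new_member(100, 100, 100, 100, &g);
    assert_eq!(g.parent_anchor_count.load(Relaxed), 0);
    // setpgid() moved the child into group 101; its parent (group 100,
    // same session) now anchors group 101.
    account_new_member(100, 100, 101, 100, &g);
    assert_eq!(g.parent_anchor_count.load(Relaxed), 1);
}
```

The symmetric decrement runs when such a parent exits or changes group, keeping `is_orphaned()` an O(1) load as shown above.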
8.6.5 Controlling Terminal Association¶
A controlling terminal is a TTY device associated with a session. At most one terminal may be the controlling terminal for a given session; a given terminal may be the controlling terminal for at most one session.
8.6.5.1.1 Acquisition¶
A terminal becomes the controlling terminal of a session by one of two means:
- Implicit acquisition (non-POSIX extension, enabled by default on Linux and
UmkaOS): when a session leader opens a terminal device that does not already have
a controlling terminal and the `O_NOCTTY` open flag is not set, the terminal is
automatically assigned as the session's controlling terminal.
- Explicit acquisition via the `TIOCSCTTY` ioctl: the session leader issues
`TIOCSCTTY` on an open terminal file descriptor, with argument 1 to steal the
terminal from another session or 0 for a non-stealing assignment. Only the session
leader may issue `TIOCSCTTY`. If another session currently controls the terminal:
  - Argument 0: returns EPERM.
  - Argument 1 and the caller has `CAP_SYS_ADMIN`: sends SIGHUP to the former
controlling session's foreground process group, then transfers the terminal.
8.6.5.1.2 Disassociation via setsid()¶
setsid() always disassociates the calling process from its current controlling
terminal. After setsid():
- The new session has no controlling terminal (Session::controlling_terminal = None).
- The old session retains its controlling terminal unchanged.
- The caller can no longer send or receive job control signals via the old terminal.
8.6.5.1.3 Disassociation on Terminal Hangup¶
When a controlling terminal is closed by its last opener (e.g., a modem hangs up,
or the terminal emulator closes the PTY master):
1. SIGHUP is sent to the session's foreground process group.
2. SIGCONT is sent to the foreground process group (to resume stopped processes
so they can handle SIGHUP).
3. Session::controlling_terminal is set to None.
4. The session no longer has a controlling terminal; processes that subsequently
attempt tcgetpgrp() on any fd receive ENOTTY.
The TTY layer initiates this sequence via tty_hangup() (Section 21.1).
8.6.5.1.4 TIOCNOTTY ioctl¶
A process in the session (not necessarily the leader) may call ioctl(fd, TIOCNOTTY)
to disassociate the calling process's session from its controlling terminal. This is
the traditional BSD-derived (not POSIX-specified) way for a daemon to relinquish its
controlling terminal after forking from a session leader. After TIOCNOTTY:
- If the caller was the session leader: same effect as the hangup procedure above
(SIGHUP + SIGCONT to foreground group, terminal disassociated from session).
- If the caller was not the session leader: the call succeeds but has no effect on
the session's controlling terminal (matches Linux behavior).
8.7 Resource Limits and Accounting¶
Resource limits (rlimit) and resource usage accounting (rusage) provide the
POSIX-standard mechanism for constraining per-process resource consumption and for
reporting how much of each resource a process has consumed. UmkaOS implements the full
Linux rlimit/rusage interface with wire-compatible struct layouts, exact signal
semantics, and /proc/PID/limits output format.
Internally UmkaOS improves on Linux in two respects:
- Lock-free accounting: all `RusageAccum` fields are `AtomicU64`, updated in the
scheduler hot path without taking any lock (Linux uses `task_lock()` for some
rusage paths).
- UID-level enforcement via atomics: `RLIMIT_NPROC`, `RLIMIT_SIGPENDING`, and
`RLIMIT_MSGQUEUE` are enforced against per-UID atomic counters in the user
namespace rather than scanning process lists.
8.7.1 Resource Limit Types¶
UmkaOS supports all 16 standard Linux resource limit types. The numeric values match Linux
exactly so that getrlimit/setrlimit/prlimit64 wire calls are binary-compatible.
| Constant | Value | Resource | Unit | Enforcement point |
|---|---|---|---|---|
| `RLIMIT_CPU` | 0 | CPU time | seconds | Scheduler tick |
| `RLIMIT_FSIZE` | 1 | Max file size | bytes | `vfs_write()` |
| `RLIMIT_DATA` | 2 | Data segment size | bytes | `brk()` / `mmap()` |
| `RLIMIT_STACK` | 3 | Stack size | bytes | Page fault handler |
| `RLIMIT_CORE` | 4 | Core dump size | bytes | Core dump path |
| `RLIMIT_RSS` | 5 | Resident set size | bytes | Advisory; cgroup integration |
| `RLIMIT_NPROC` | 6 | Processes/threads per UID | count | `do_fork()` |
| `RLIMIT_NOFILE` | 7 | Open file descriptors | count | `alloc_fd()` |
| `RLIMIT_MEMLOCK` | 8 | Locked memory | bytes | `mlock()` / `mmap(MAP_LOCKED)` |
| `RLIMIT_AS` | 9 | Virtual address space | bytes | `mmap()`, `mremap()`, `brk()` |
| `RLIMIT_LOCKS` | 10 | File locks (obsolete) | count | Always `RLIM_INFINITY` |
| `RLIMIT_SIGPENDING` | 11 | Pending signals per UID | count | `send_signal()` |
| `RLIMIT_MSGQUEUE` | 12 | POSIX MQ bytes per UID | bytes | MQ create/open |
| `RLIMIT_NICE` | 13 | Maximum nice value | — | Scheduler (`20 - rlim` = min nice) |
| `RLIMIT_RTPRIO` | 14 | Max RT scheduling priority | — | Scheduler |
| `RLIMIT_RTTIME` | 15 | Max RT CPU time | microseconds | RT scheduler tick |
Signal semantics for exceeded limits:
- `RLIMIT_CPU` soft: SIGXCPU is delivered repeatedly (every second) after the soft limit is crossed. At the hard limit, SIGKILL is delivered (matches Linux behavior).
- `RLIMIT_FSIZE`: SIGXFSZ is delivered and `vfs_write()` returns EFBIG. Both the signal and the error are raised, matching Linux exactly.
- `RLIMIT_RTTIME` soft: SIGXCPU. Hard: SIGKILL.
- `RLIMIT_STACK`: the page fault handler rejects stack growth beyond the limit; the task receives SIGSEGV (stack overflow).
- All other limits: the syscall returns EAGAIN or ENOMEM as appropriate; no signal is delivered for non-CPU/non-file limits.
RLIMIT_LOCKS (value 10) is present for ABI completeness but is always RLIM_INFINITY
in UmkaOS. The Linux file lock limit was never meaningfully enforced in modern kernels and
UmkaOS does not implement it.
RLIM_INFINITY on the wire is u64::MAX (0xffff_ffff_ffff_ffff), matching Linux.
8.7.2 Wire Format and Syscalls¶
8.7.2.1.1 struct rlimit wire layout¶
/// Wire format — binary-identical to Linux `struct rlimit64`.
/// Copied to/from userspace by `getrlimit`/`setrlimit`/`prlimit64`.
#[repr(C)]
pub struct RlimitWire {
pub rlim_cur: u64, // soft limit (RLIM_INFINITY = u64::MAX)
pub rlim_hard: u64, // hard limit (RLIM_INFINITY = u64::MAX)
}
const_assert!(core::mem::size_of::<RlimitWire>() == 16);
The 32-bit struct rlimit (with unsigned long fields) is handled by the
getrlimit/setrlimit compat path; the 64-bit form is used by prlimit64 and
internally.
8.7.2.1.2 Syscalls¶
getrlimit(resource: u32, rlim: *mut RlimitWire) -> Result<(), Errno>
Returns the soft and hard limits for resource in the calling process's RlimitSet.
On 64-bit architectures, getrlimit reads the pair without a lock because each u64
field is naturally aligned and loaded atomically in a single instruction. On 32-bit
architectures (ARMv7, PPC32), getrlimit acquires rlimit_lock to prevent torn reads
(u64 loads require two 32-bit instructions). setrlimit always holds rlimit_lock
while writing both fields.
Returns EINVAL if resource >= 16.
setrlimit(resource: u32, rlim: *const RlimitWire) -> Result<(), Errno>
Sets the soft and hard limits for resource. Rules:
- The soft limit must not exceed the hard limit.
- A non-privileged process (no `CAP_SYS_RESOURCE`) may only lower the hard limit,
not raise it. A lowered hard limit is irreversible.
- A process with `CAP_SYS_RESOURCE` may raise both limits up to the system maximum
(`/proc/sys/fs/nr_open` for `RLIMIT_NOFILE`, etc.).
- The `RLIMIT_NOFILE` hard limit may not exceed `NR_OPEN` (1,048,576 by default,
tunable).
- The `RLIMIT_NPROC` hard limit may not exceed `PID_MAX_LIMIT` (4,194,304).
Returns EINVAL for invalid resource, bad limit ordering, or out-of-range values.
Returns EPERM for privilege violations.
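The setrlimit rules can be condensed into a validation sketch over plain values; capability checking and the per-resource system maxima are reduced to parameters here purely for illustration:

```rust
// Sketch of the setrlimit() validation rules. Assumes the per-resource
// system maximum is passed in (RLIM_INFINITY when unconstrained).
const RLIM_INFINITY: u64 = u64::MAX;

#[derive(Debug, PartialEq)]
enum Errno { Einval, Eperm }

fn validate_setrlimit(
    cur_hard: u64,          // the process's current hard limit
    new_soft: u64,
    new_hard: u64,
    cap_sys_resource: bool, // caller holds CAP_SYS_RESOURCE
    system_max: u64,        // e.g. NR_OPEN for RLIMIT_NOFILE
) -> Result<(), Errno> {
    if new_soft > new_hard {
        return Err(Errno::Einval); // bad limit ordering
    }
    if new_hard > cur_hard && !cap_sys_resource {
        return Err(Errno::Eperm); // unprivileged raise of the hard limit
    }
    if new_hard > system_max {
        return Err(Errno::Einval); // above the per-resource system maximum
    }
    Ok(())
}

fn main() {
    const NR_OPEN: u64 = 1_048_576;
    // Lowering the hard limit never needs privilege.
    assert_eq!(validate_setrlimit(4096, 1024, 2048, false, NR_OPEN), Ok(()));
    // Raising it back does.
    assert_eq!(validate_setrlimit(2048, 1024, 4096, false, NR_OPEN), Err(Errno::Eperm));
    assert_eq!(validate_setrlimit(2048, 1024, 4096, true, NR_OPEN), Ok(()));
    // soft > hard is rejected outright.
    assert_eq!(validate_setrlimit(4096, 9999, 4096, false, NR_OPEN), Err(Errno::Einval));
    // Unconstrained resources leave infinity untouched.
    assert_eq!(validate_setrlimit(RLIM_INFINITY, 0, RLIM_INFINITY, false, RLIM_INFINITY), Ok(()));
}
```

Note the "irreversible" property falls out naturally: once `cur_hard` is lowered, an unprivileged caller fails the second check on any attempt to raise it.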
prlimit64(pid: pid_t, resource: u32, new_limit: *const RlimitWire, old_limit: *mut RlimitWire) -> Result<(), Errno>
The prlimit64 syscall (Linux 2.6.36+, x86-64 syscall number 302) extends
getrlimit/setrlimit with a pid argument for operating on another process.
- `pid == 0`: operates on the calling process (identical to `getrlimit`/`setrlimit`).
- `pid != 0`: the caller must either have `CAP_SYS_PTRACE` with
`PTRACE_MODE_ATTACH_REALCREDS`, or the caller's real/effective UID must match the
target process's real/saved UID and the caller must not be cross-namespace.
- `new_limit == NULL`: read-only (equivalent to `getrlimit` on the target).
- `old_limit == NULL`: write-only (equivalent to `setrlimit` on the target).
- Both `new_limit` and `old_limit` non-NULL: atomic read-then-write under
`rlimit_lock`.
Returns ESRCH if pid does not name a live process. Returns EPERM if the credential
check fails.
getrusage(who: i32, rusage: *mut RusageWire) -> Result<(), Errno>
Returns accumulated resource usage. who values:
| Constant | Value | Meaning |
|---|---|---|
| `RUSAGE_SELF` | 0 | Current process (sum of all live threads) |
| `RUSAGE_CHILDREN` | -1 | Sum of all waited-for (reaped) children |
| `RUSAGE_THREAD` | 1 | Current thread only (Linux extension) |
| `RUSAGE_BOTH` | -2 | Self + children (used internally by wait4) |
RUSAGE_BOTH is not a valid who value from user space; getrusage returns EINVAL
for it. It is used internally by the wait4/waitid exit path to atomically collect
and add child usage to the parent's children_rusage accumulator.
times(buf: *mut Tms) -> Result<clock_t, Errno>
Legacy POSIX interface. Returns elapsed real time (in clock ticks since an arbitrary
epoch) and fills buf with:
pub struct Tms {
pub tms_utime: clock_t, // user time of calling process, in USER_HZ ticks
pub tms_stime: clock_t, // system time of calling process, in USER_HZ ticks
pub tms_cutime: clock_t, // user time of waited-for children, in USER_HZ ticks
pub tms_cstime: clock_t, // system time of waited-for children, in USER_HZ ticks
}
USER_HZ = 100. Values are derived from RusageAccum.utime_ns /
RusageAccum.stime_ns divided by 1_000_000_000 / USER_HZ = 10_000_000.
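The tick conversion stated above is a one-line division; sketched with the stated USER_HZ = 100:

```rust
// Nanosecond-to-tick conversion used by times(), with USER_HZ = 100 as
// stated above.
const USER_HZ: u64 = 100;
const NS_PER_TICK: u64 = 1_000_000_000 / USER_HZ; // 10_000_000 ns per tick

fn ns_to_clock_t(ns: u64) -> u64 {
    ns / NS_PER_TICK
}

fn main() {
    // 1.5 s of accumulated utime_ns reports as 150 USER_HZ ticks.
    assert_eq!(ns_to_clock_t(1_500_000_000), 150);
    // Sub-tick remainders are truncated, not rounded.
    assert_eq!(ns_to_clock_t(9_999_999), 0);
}
```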
8.7.3 Internal Structures¶
8.7.3.1.1 RlimitSet — per-process limit storage¶
/// Per-process resource limit set.
///
/// Stored in `Process`. Inherited verbatim on `fork()`. Not modified by `exec()`.
/// Protected by `Process.rlimit_lock` for write access; read access is unsynchronized
/// (see Section 8.5.2 for correctness argument).
pub struct RlimitSet {
/// Indexed by RLIMIT_* constant (0–15).
pub limits: [RlimitPair; 16],
}
/// A single soft/hard limit pair.
///
/// # Safety: lockless `getrlimit` reads
///
/// On 64-bit architectures, each `u64` load is atomic (single-instruction),
/// so lockless readers never see a torn value. A `getrlimit` racing with
/// `setrlimit` may observe a mixed {old_soft, new_hard} pair; this is
/// tolerated because enforcement sites read only a single field (almost
/// always `soft`), and a racing reader is entitled to report either the
/// old or the new limit set. Writers update both fields under
/// `rlimit_lock`, so the stored pair always satisfies `soft <= hard`.
///
/// **32-bit architectures (ARMv7, PPC32)**: u64 fields are NOT single-instruction
/// atomic on 32-bit targets. Reads and writes of `cur`/`max` use two separate
/// 32-bit loads/stores. Concurrent `getrlimit` and `setrlimit` are protected by
/// the task's `rlimit_lock` (SpinLock): `setrlimit` holds the lock while writing
/// both halves; `getrlimit` also acquires the lock on 32-bit targets to prevent
/// torn reads. On 64-bit architectures, `getrlimit` reads without the lock
/// (naturally aligned u64 loads are atomic).
#[repr(C)]
pub struct RlimitPair {
/// Soft limit. RLIM_INFINITY = u64::MAX.
pub soft: u64,
/// Hard limit. RLIM_INFINITY = u64::MAX. Invariant: soft <= hard.
pub hard: u64,
}
const_assert!(core::mem::size_of::<RlimitPair>() == 16);
Process is extended with the following fields (additions to the struct shown in
Section 8.1.1):
pub struct Process {
// ... existing fields ...
/// Resource limits for this process (Section 8.5).
pub rlimits: RlimitSet,
/// Held only during setrlimit / prlimit64 writes.
pub rlimit_lock: SpinLock<()>, // SpinLock per canonical Process def in §8.1
/// Locked memory byte count for RLIMIT_MEMLOCK enforcement.
pub locked_pages: AtomicU64,
/// Accumulated resource usage of waited-for children (updated on wait).
pub children_rusage: RusageAccum,
}
8.7.3.1.2 RusageAccum — per-task accounting¶
/// Lock-free resource usage accumulator, one per Task (thread).
///
/// All fields are AtomicU64 updated in the scheduler hot path and page fault handler
/// without taking any lock. Process-level totals are computed by summing across all
/// live threads plus the per-process children_rusage accumulator.
pub struct RusageAccum {
/// Accumulated user-mode CPU time in nanoseconds.
pub utime_ns: AtomicU64,
/// Accumulated kernel-mode CPU time in nanoseconds.
pub stime_ns: AtomicU64,
/// Minor (non-I/O) page faults.
pub minflt: AtomicU64,
/// Major (I/O-requiring) page faults.
pub majflt: AtomicU64,
/// Voluntary context switches (task called schedule() explicitly).
pub nvcsw: AtomicU64,
/// Involuntary context switches (preempted by scheduler).
pub nivcsw: AtomicU64,
/// Block device read operations (incremented by the block layer).
pub inblock: AtomicU64,
/// Block device write operations (incremented by the block layer).
pub oublock: AtomicU64,
/// Peak RSS in kilobytes. Updated with fetch_max() when RSS grows.
pub peak_rss_kb: AtomicU64,
}
Task is extended with:
pub struct Task {
// ... existing fields ...
/// Per-thread resource usage accumulator (Section 8.5).
pub rusage: RusageAccum,
/// Tick at which last SIGXCPU was delivered. 0 = never delivered.
/// Used to ensure SIGXCPU is sent at most once per second (POSIX requirement).
pub last_sigxcpu_tick: u64,
}
8.7.3.1.3 Update discipline¶
- `utime_ns` / `stime_ns`: updated at every context switch. The scheduler records the timestamp at switch-in (`Task.last_switched_in: Instant`), computes the elapsed time at switch-out, and adds it to either `utime_ns` (if the task was in user mode) or `stime_ns` (if in kernel mode). Addition uses `fetch_add` with `Relaxed` ordering; the value is only read by `getrusage`, which is not on a critical path.
- `minflt` / `majflt`: incremented in the page fault handler with `fetch_add(1, Relaxed)`.
- `nvcsw` / `nivcsw`: incremented in `schedule()`. Voluntary switches come from explicit `schedule()` calls (blocked on I/O, futex, etc.); involuntary switches from timer preemption or `yield_to_scheduler()`.
- `inblock` / `oublock`: incremented by the block I/O completion path (Section 15.2) after each read/write bio completes.
- `peak_rss_kb`: updated in the page fault handler when a new page is faulted in and the resulting RSS (in KB) exceeds the current peak. Uses `fetch_max(new_rss_kb, Relaxed)`.
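The two atomic patterns this discipline relies on, monotonic accumulation with `fetch_add` and high-water-mark tracking with `fetch_max`, can be exercised in a minimal standalone sketch (`Accum` is a reduced stand-in for `RusageAccum`):

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

// Reduced stand-in for RusageAccum: one accumulating field, one peak field.
struct Accum {
    utime_ns: AtomicU64,
    peak_rss_kb: AtomicU64,
}

impl Accum {
    fn charge_user_time(&self, elapsed_ns: u64) {
        // Relaxed suffices: readers (getrusage) tolerate slightly stale
        // values, and no other data is published through this field.
        self.utime_ns.fetch_add(elapsed_ns, Relaxed);
    }
    fn note_rss(&self, rss_kb: u64) {
        // fetch_max never lets the peak move backwards, even under races.
        self.peak_rss_kb.fetch_max(rss_kb, Relaxed);
    }
}

fn main() {
    let a = Accum {
        utime_ns: AtomicU64::new(0),
        peak_rss_kb: AtomicU64::new(0),
    };
    a.charge_user_time(4_000_000);
    a.charge_user_time(6_000_000);
    a.note_rss(2048);
    a.note_rss(1024); // RSS shrank; the recorded peak must not
    assert_eq!(a.utime_ns.load(Relaxed), 10_000_000);
    assert_eq!(a.peak_rss_kb.load(Relaxed), 2048);
}
```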
8.7.4 Enforcement Points¶
Each limit is enforced at a specific point in the kernel. This section documents the exact call site, the check performed, and the consequence of exceeding the limit.
8.7.4.1.1 RLIMIT_NOFILE — file descriptor count¶
Checked in alloc_fd() before a slot is assigned in the task's FdTable. The check
is performed without holding rlimit_lock:
fn alloc_fd(task: &Task) -> Result<Fd, Errno> {
let soft = task.process.rlimits.limits[RLIMIT_NOFILE].soft;
let current_count = task.files.count.load(Relaxed);
if current_count >= soft {
return Err(Errno::EMFILE);
}
// ... allocate slot ...
}
The FdTable.count atomic is the authoritative open-fd count. Reading rlimits.soft
and fd_table.count separately (without a combined lock) can permit a small race window
during concurrent open() calls; UmkaOS accepts this because Linux has the same behavior
and the practical consequence is that the limit may be exceeded by at most
(number of concurrent open() calls - 1) descriptors, which is bounded and harmless.
8.7.4.1.2 RLIMIT_NPROC — process/thread count per UID¶
Checked in do_fork() before a new task is created. The count is maintained in the
UserEntry for the calling task's UID:
fn do_fork(parent: &Task, flags: CloneFlags) -> Result<Arc<Task>, Errno> {
// Snapshot credentials under the task's cred_guard_mutex to prevent
// TOCTOU with concurrent setuid(). The cred pointer is RCU-protected;
// rcu_read_lock() ensures the cred is not freed during the check.
// This matches Linux's copy_creds() model: the child's credentials
// are copied from the parent's at fork time under RCU protection.
let _rcu = rcu_read_lock();
let cred = parent.process.cred(); // RCU-dereferenced
let uid = cred.uid;
let user_entry = parent.nsproxy.load().user_ns.get_user_entry(uid);
let count = user_entry.task_count.fetch_add(1, AcqRel);
let soft = parent.process.rlimits.limits[RLIMIT_NPROC].soft;
// Init (PID 1) is always exempt from RLIMIT_NPROC to prevent system
// deadlock if init drops CAP_SYS_RESOURCE. Matches Linux is_global_init().
if count >= soft && !parent.process.is_global_init()
&& !cred.has_capability(CAP_SYS_ADMIN)
&& !cred.has_capability(CAP_SYS_RESOURCE) {
user_entry.task_count.fetch_sub(1, Relaxed);
return Err(Errno::EAGAIN);
}
// ... create task with cloned cred ...
// On failure after this point, decrement task_count before returning.
}
task_count is decremented in release_task(), when the parent reaps the zombie
via wait4(2). The zombie continues to count toward RLIMIT_NPROC because it
still occupies a PID slot. See Section 8.2.
The optimistic fetch_add/check/fetch_sub pattern permits a bounded transient overshoot
of at most (concurrent_callers - 1) tasks; matches RLIMIT_NOFILE pattern and Linux
behavior.
8.7.4.1.3 RLIMIT_AS — virtual address space¶
Checked in mmap(), mremap(), and brk() before committing a new VMA:
The AddressSpace struct maintains a running total of all mapped VMA sizes for O(1)
limit checking:
pub struct AddressSpace {
// ... other fields ...
/// Running total of all mapped VMA sizes in bytes.
/// Updated atomically on every mmap(), munmap(), and mremap().
/// Enables O(1) RLIMIT_AS checks without walking the VMA tree.
pub vm_total_bytes: AtomicUsize,
}
The RLIMIT_AS check is O(1): compare addr_space.vm_total_bytes.load(Acquire) + new_size
against task.rlimit[RLIMIT_AS]. No VMA walk required. vm_total_bytes is updated
atomically (with fetch_add/fetch_sub) on every mmap(), munmap(), and mremap().
fn check_rlimit_as(process: &Process, add_bytes: usize) -> Result<(), Errno> {
let soft = process.rlimits.limits[RLIMIT_AS].soft;
if soft == u64::MAX {
return Ok(());
}
let current_as = process.address_space.vm_total_bytes.load(Acquire);
if current_as + add_bytes > soft as usize {
return Err(Errno::ENOMEM);
}
Ok(())
}
8.7.4.1.4 RLIMIT_MEMLOCK — locked memory¶
mlock() and mmap(MAP_LOCKED) increment Process.locked_pages atomically. The check
and update are:
/// Check and atomically add to the process's locked page count.
///
/// # Arguments
/// - `add_pages`: number of pages to lock (NOT bytes). The caller converts
/// from bytes: `len / PAGE_SIZE`.
///
/// # Units
/// `Process.locked_pages` counts pages. `RLIMIT_MEMLOCK` is in bytes.
/// The comparison converts the limit to pages: `soft >> PAGE_SHIFT`.
/// This matches Linux `mm/mlock.c` which computes
/// `lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT` and compares
/// against `mm->locked_vm` (also in pages).
fn check_and_add_locked(process: &Process, add_pages: u64) -> Result<(), Errno> {
let soft = process.rlimits.limits[RLIMIT_MEMLOCK].soft;
if soft == u64::MAX {
return Ok(());
}
// Convert byte limit to page limit (matching Linux mm/mlock.c).
let limit_pages = soft >> PAGE_SHIFT;
let prev = process.locked_pages.fetch_add(add_pages, AcqRel);
if prev + add_pages > limit_pages {
process.locked_pages.fetch_sub(add_pages, Relaxed);
return Err(Errno::ENOMEM);
}
Ok(())
}
munlock() and the unmapping of MAP_LOCKED regions decrement locked_pages by the
region size in pages. locked_pages never goes negative; the decrement is guarded by a
debug assertion.
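The decrement path might look like the following sketch; `sub_locked` is an illustrative name, not the actual UmkaOS function.

```rust
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};

/// Release locked pages on munlock() or unmap of a MAP_LOCKED region.
/// `locked_pages` stands in for `Process.locked_pages`.
fn sub_locked(locked_pages: &AtomicU64, remove_pages: u64) {
    let prev = locked_pages.fetch_sub(remove_pages, Relaxed);
    // The counter must never underflow: callers only unlock pages they
    // previously charged via check_and_add_locked().
    debug_assert!(prev >= remove_pages, "locked_pages underflow");
}
```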
The optimistic fetch_add/check/fetch_sub pattern permits a bounded transient overshoot
of at most (concurrent_callers - 1) * max_add_pages; matches RLIMIT_NOFILE pattern
and Linux behavior.
8.7.4.1.5 RLIMIT_SIGPENDING — pending signal count per UID¶
Checked in send_signal() for real-time signals (signals 34–64) and for sigqueue().
Standard signals (1–31) are not subject to this limit because they are not queued per
occurrence. The check uses the UserEntry.sigpending_count atomic:
fn queue_rt_signal(target_uid: Uid, ns: &UserNamespace, ...) -> Result<(), Errno> {
let user = ns.get_user_entry(target_uid);
let soft = /* resolved from target process's RlimitSet[RLIMIT_SIGPENDING] */;
let prev = user.sigpending_count.fetch_add(1, AcqRel);
if prev >= soft {
user.sigpending_count.fetch_sub(1, Relaxed);
return Err(Errno::EAGAIN);
}
// enqueue signal ...
}
sigpending_count is decremented when the signal is consumed by the target task in
dequeue_signal().
The optimistic fetch_add/check/fetch_sub pattern permits a bounded transient overshoot
of at most (concurrent_callers - 1) signals; matches RLIMIT_NOFILE pattern and Linux
behavior.
8.7.4.1.6 RLIMIT_MSGQUEUE — POSIX MQ bytes per UID¶
Tracked in UserNamespace.users per UID via UserEntry.mq_bytes: AtomicU64. Checked
when a new message queue is created or when a message is sent that would increase the
total byte count. The limit applies to the creator's UID:
fn mq_check_and_add(uid: Uid, ns: &UserNamespace,
add_bytes: u64, soft: u64) -> Result<(), Errno> {
let user = ns.get_user_entry(uid);
let prev = user.mq_bytes.fetch_add(add_bytes, AcqRel);
if prev + add_bytes > soft {
user.mq_bytes.fetch_sub(add_bytes, Relaxed);
return Err(Errno::EMFILE); // matches Linux (EMFILE for mq_open)
}
Ok(())
}
mq_bytes is decremented when a message is received (dequeued) or when a queue is
unlinked and drained.
8.7.4.1.7 RLIMIT_CPU — CPU time¶
Checked at every scheduler tick in the per-CPU run loop. The check compares accumulated CPU time (in seconds) against the soft and hard limits:
fn tick_check_rlimit_cpu(task: &mut Task) {
let cpu_secs = (task.rusage.utime_ns.load(Relaxed)
+ task.rusage.stime_ns.load(Relaxed)) / 1_000_000_000;
let soft = task.process.rlimits.limits[RLIMIT_CPU].soft;
let hard = task.process.rlimits.limits[RLIMIT_CPU].hard;
if soft != u64::MAX && cpu_secs >= soft {
// SIGXCPU is delivered at most once per second after soft limit crossing
// (POSIX/Linux behavior). `last_sigxcpu_tick` prevents signal flooding
// on every timer tick. HZ = scheduler tick rate (Section 7.1.2.1).
let now = current_tick();
if now.saturating_sub(task.last_sigxcpu_tick) >= HZ as u64 {
send_signal_to_process(&task.process, SIGXCPU);
task.last_sigxcpu_tick = now;
}
}
if hard != u64::MAX && cpu_secs >= hard {
send_signal_to_process(&task.process, SIGKILL);
}
}
SIGXCPU is delivered repeatedly (once per second after the soft limit is crossed) until the process responds (catches the signal and reduces CPU usage, or is killed by reaching the hard limit). This matches Linux behavior.
8.7.4.1.8 RLIMIT_RTTIME — RT CPU time¶
Checked in the RT scheduler tick path for tasks with SchedPolicy::Fifo or
SchedPolicy::RoundRobin. The rt_runtime_us field in SchedEntity accumulates
real-time CPU time in microseconds. At each RT tick:
fn rt_tick_check_rlimit(task: &Task) {
let rt_us = task.sched_entity.rt_runtime_us.load(Relaxed);
let soft = task.process.rlimits.limits[RLIMIT_RTTIME].soft;
let hard = task.process.rlimits.limits[RLIMIT_RTTIME].hard;
if soft != u64::MAX && rt_us >= soft {
send_signal_to_process(&task.process, SIGXCPU);
}
if hard != u64::MAX && rt_us >= hard {
send_signal_to_process(&task.process, SIGKILL);
}
}
rt_runtime_us resets to zero when the task transitions out of an RT scheduling class
(e.g., via sched_setscheduler() to SCHED_OTHER).
8.7.4.1.9 RLIMIT_FSIZE — maximum file size¶
Checked in vfs_write() before each write that would extend a file beyond its current
size. When the file size after the write would exceed the soft limit:
- SIGXFSZ is delivered to the calling process.
- `vfs_write()` returns EFBIG.

Both the signal and the error are delivered, matching Linux exactly. The hard limit is
treated the same as the soft limit: there is no separate hard-limit behavior for file
size, and SIGXFSZ is delivered regardless.
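A sketch of the check, assuming a hypothetical `check_rlimit_fsize` helper that the write path calls before extending a file (on error, the caller also delivers SIGXFSZ):

```rust
const RLIM_INFINITY: u64 = u64::MAX;
const EFBIG: i32 = 27; // errno value on x86-64

/// Returns Ok(()) if the write may proceed, Err(EFBIG) otherwise.
fn check_rlimit_fsize(file_size: u64, offset: u64, len: u64, soft: u64) -> Result<(), i32> {
    if soft == RLIM_INFINITY {
        return Ok(());
    }
    let end = offset.saturating_add(len);
    // Only writes that EXTEND the file are limited; overwrites entirely
    // within the current size always succeed.
    if end > file_size && end > soft {
        return Err(EFBIG); // caller sends SIGXFSZ before returning
    }
    Ok(())
}
```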
8.7.4.1.10 RLIMIT_STACK — stack size¶
The page fault handler checks RLIMIT_STACK when expanding the stack region downward
(on architectures where the stack grows down, i.e., all supported UmkaOS targets). If the
proposed new stack bottom would place the stack size beyond the soft limit:
- The fault is not satisfied (page is not mapped).
- The task receives SIGSEGV.
This is the only limit enforced by the fault handler rather than a syscall entry point.
The stack VMA is annotated with VmaFlags::STACK so the fault handler can identify it.
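A minimal sketch of the fault-handler decision, assuming a downward-growing stack and hypothetical parameter names (the real handler works on the stack VMA identified by `VmaFlags::STACK`):

```rust
/// Decide whether a fault below the current stack bottom may be satisfied.
/// `stack_top` is the fixed upper end of the stack VMA; the caller guarantees
/// `fault_addr <= stack_top` for faults attributed to the stack region.
fn may_expand_stack(stack_top: u64, fault_addr: u64, rlimit_stack_soft: u64) -> bool {
    // Proposed stack size if the faulting page were mapped in.
    let proposed_size = stack_top.saturating_sub(fault_addr);
    // If this exceeds the soft limit, the fault is not satisfied and the
    // task receives SIGSEGV.
    proposed_size <= rlimit_stack_soft
}
```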
8.7.4.1.11 RLIMIT_DATA — data segment¶
Checked in brk() when expanding the heap:
fn do_brk(process: &Process, new_brk: usize) -> Result<usize, Errno> {
let soft = process.rlimits.limits[RLIMIT_DATA].soft;
if soft != u64::MAX {
let data_size = (new_brk - process.address_space.start_data) as u64;
if data_size > soft {
return Err(Errno::ENOMEM);
}
}
// ... extend heap VMA ...
}
RLIMIT_DATA is also checked in mmap(MAP_ANONYMOUS | MAP_PRIVATE) when the mapping
is created in the data/heap region (below the stack, above the text segment). It is not
checked for file-backed mappings or shared mappings.
8.7.5 Inheritance Across fork() and exec()¶
Fork: the child inherits an exact byte-for-byte copy of the parent's RlimitSet.
No limits are reset or modified. Both the rlimits array and the locked_pages counter
start as copies of the parent's values. The child gets its own rlimit_lock.
Thread creation (clone with CLONE_VM): threads of the same process share the process
RlimitSet directly (all threads reference the same Process struct via task.process).
There is no per-thread limit set; all threads in a process share one RlimitSet and
one rlimit_lock.
Exec: execve() does not modify any resource limits. Limits survive across exec.
This matches POSIX and Linux.
Exception — stack limit on exec: if the program's initial stack size (as determined
by the ELF loader and the kernel's stack setup) would exceed RLIMIT_STACK, the exec
fails with ENOMEM before the new address space is committed. This check occurs after the
new address space is built but before the old address space is torn down, so a failed
exec leaves the calling process intact.
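The fork-time copy can be sketched as below. The shapes are deliberately simplified for illustration (no `rlimit_lock`, a fixed 16-entry array); the point is that the child's set is a plain clone with nothing reset:

```rust
/// Soft/hard pair for one resource.
#[derive(Clone, Copy, PartialEq, Debug)]
struct Rlimit { soft: u64, hard: u64 }

/// Simplified stand-in for the process RlimitSet.
#[derive(Clone, PartialEq, Debug)]
struct RlimitSet { limits: [Rlimit; 16] }

/// Fork inheritance: the child receives a byte-for-byte copy of the
/// parent's limits. No limit is reset or modified.
fn fork_rlimits(parent: &RlimitSet) -> RlimitSet {
    parent.clone()
}
```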
UID counter on fork: UserEntry.task_count is incremented in do_fork() (for each
new task) and decremented in release_task() (when the zombie is reaped via wait4(2)).
The decrement happens in release_task(), not in do_exit(), because the zombie is
still a "task" occupying a PID slot and contributing to RLIMIT_NPROC accounting — the
count must remain accurate through the zombie phase. This matches Linux where free_uid()
is called from __put_task_struct() after release_task().
A task's UID may change after fork (for example via setuid() or a set-user-ID exec);
in that case, the decrement targets the UserEntry for the task's UID at the time
release_task() runs, not the UID it had at fork time.
8.7.6 UID-Level Accounting¶
Several limits (RLIMIT_NPROC, RLIMIT_SIGPENDING, RLIMIT_MSGQUEUE) are enforced
per-UID rather than per-process. The counters live in UserEntry objects stored in the
user namespace:
/// Per-UID accounting entry within a user namespace.
pub struct UserEntry {
/// Number of live tasks (threads) whose real UID matches this entry.
/// Incremented in do_fork(), decremented in release_task() (when the
/// zombie is reaped via wait4(2), not in do_exit() — the zombie still
/// counts as a task for RLIMIT_NPROC purposes).
/// u64 to correctly compare against RLIMIT values which are u64; u32
/// would silently pass the limit check if soft > 4 billion.
pub task_count: AtomicU64,
/// Number of queued real-time signals for tasks with this real UID.
/// Incremented in queue_rt_signal(), decremented in dequeue_signal().
/// u64 to correctly compare against RLIMIT_SIGPENDING which is u64.
pub sigpending_count: AtomicU64,
/// Total bytes allocated for POSIX message queues owned by this UID.
/// Incremented on mq_open/msgsnd, decremented on mq_unlink/msgrcv.
pub mq_bytes: AtomicU64,
}
UserEntry objects are stored in UserNamespace.users: RcuHashMap<Uid, Arc<UserEntry>>.
Lookup is O(1) average under RCU read lock (no blocking). New entries are created on
first task creation for a UID and are removed when task_count reaches zero and all
associated resources are released.
The RcuHashMap uses the UmkaOS RCU implementation (Section 3.1) for reads: readers
acquire an RcuReadGuard (a single CpuLocal flag write, ~1-3 cycles), look up the
entry, clone the Arc, and release the guard. Writers take a per-map mutex, insert
or remove, publish the new table under RCU, then wait for a grace period before freeing
the old table.
8.7.7 getrusage Wire Format¶
The rusage struct written to user space is binary-identical to the Linux
struct rusage:
/// Wire format for getrusage(2). Binary-identical to Linux `struct rusage`.
///
/// All `long` fields use `KernelLong` ([Section 19.1](19-sysapi.md#syscall-interface--kernellong--kernelulong))
/// to ensure correct layout on both ILP32 (ARMv7, PPC32) and LP64 architectures.
/// On LP64: each field is 8 bytes, total = 144 bytes.
/// On ILP32: each field is 4 bytes, total = 72 bytes.
#[repr(C)]
pub struct RusageWire {
/// User CPU time.
pub ru_utime: Timeval,
/// System CPU time.
pub ru_stime: Timeval,
/// Peak RSS in kilobytes (max over the process lifetime).
pub ru_maxrss: KernelLong,
/// Integral shared memory size. Not tracked by UmkaOS; always 0.
pub ru_ixrss: KernelLong,
/// Integral unshared data size. Not tracked by UmkaOS; always 0.
pub ru_idrss: KernelLong,
/// Integral unshared stack size. Not tracked by UmkaOS; always 0.
pub ru_isrss: KernelLong,
/// Minor (non-I/O) page faults.
pub ru_minflt: KernelLong,
/// Major (I/O-requiring) page faults.
pub ru_majflt: KernelLong,
/// Number of times the process was swapped out. Not tracked; always 0.
pub ru_nswap: KernelLong,
/// Block input operations (reads from block devices).
pub ru_inblock: KernelLong,
/// Block output operations (writes to block devices).
pub ru_oublock: KernelLong,
/// IPC messages sent. Not tracked by UmkaOS; always 0.
pub ru_msgsnd: KernelLong,
/// IPC messages received. Not tracked by UmkaOS; always 0.
pub ru_msgrcv: KernelLong,
/// Signals received. Not tracked by UmkaOS; always 0.
pub ru_nsignals: KernelLong,
/// Voluntary context switches.
pub ru_nvcsw: KernelLong,
/// Involuntary context switches.
pub ru_nivcsw: KernelLong,
}
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<RusageWire>() == 144);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<RusageWire>() == 72);
/// C `struct timeval`. Used in `RusageWire`, `ElfPrstatus`, and `select(2)`.
/// Fields are C `long` (not `time_t`) per POSIX — `KernelLong` ensures correct
/// width on ILP32 vs LP64.
#[repr(C)]
pub struct Timeval {
pub tv_sec: KernelLong,
pub tv_usec: KernelLong,
}
#[cfg(target_pointer_width = "64")]
const_assert!(size_of::<Timeval>() == 16);
#[cfg(target_pointer_width = "32")]
const_assert!(size_of::<Timeval>() == 8);
Fields marked "always 0" correspond to metrics that Linux also does not reliably
populate (ru_ixrss, ru_idrss, ru_isrss, ru_nswap, ru_msgsnd, ru_msgrcv,
ru_nsignals). Returning zero for these fields matches Linux behavior and is correct
for RUSAGE_SELF, RUSAGE_CHILDREN, and RUSAGE_THREAD.
Assembly from RusageAccum:
fn fill_rusage_wire(accum: &RusageAccum, out: &mut RusageWire) {
let utime_us = accum.utime_ns.load(Relaxed) / 1_000;
let stime_us = accum.stime_ns.load(Relaxed) / 1_000;
out.ru_utime = Timeval { tv_sec: (utime_us / 1_000_000) as KernelLong,
tv_usec: (utime_us % 1_000_000) as KernelLong };
out.ru_stime = Timeval { tv_sec: (stime_us / 1_000_000) as KernelLong,
tv_usec: (stime_us % 1_000_000) as KernelLong };
// NOTE: Internal accumulators are AtomicU64 for 50-year correctness.
// The wire format truncates to KernelLong (i32 on ILP32 / i64 on LP64)
// because Linux's `struct rusage` uses `long` for these fields.
// On ILP32 (ARMv7, PPC32), values >2^31-1 will wrap — this matches Linux.
out.ru_maxrss = accum.peak_rss_kb.load(Relaxed) as KernelLong;
out.ru_minflt = accum.minflt.load(Relaxed) as KernelLong;
out.ru_majflt = accum.majflt.load(Relaxed) as KernelLong;
out.ru_inblock = accum.inblock.load(Relaxed) as KernelLong;
out.ru_oublock = accum.oublock.load(Relaxed) as KernelLong;
out.ru_nvcsw = accum.nvcsw.load(Relaxed) as KernelLong;
out.ru_nivcsw = accum.nivcsw.load(Relaxed) as KernelLong;
// Unused fields already zero-initialized.
}
For RUSAGE_SELF, the kernel iterates all live threads in the process thread group,
sums their RusageAccum fields into a temporary accumulator, then calls
fill_rusage_wire. For RUSAGE_CHILDREN, it reads Process.children_rusage directly
(already a summed accumulator updated atomically on each wait()). For
RUSAGE_THREAD, it uses only the calling thread's Task.rusage.
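The RUSAGE_SELF summation can be sketched as follows; `ThreadAccum` is a simplified stand-in for `RusageAccum` carrying only two counters:

```rust
/// Simplified per-thread accumulator (subset of RusageAccum).
#[derive(Default)]
struct ThreadAccum { utime_ns: u64, minflt: u64 }

/// RUSAGE_SELF: sum all live threads' accumulators into a temporary
/// before handing the result to fill_rusage_wire().
fn sum_threads(threads: &[ThreadAccum]) -> ThreadAccum {
    threads.iter().fold(ThreadAccum::default(), |mut acc, t| {
        acc.utime_ns += t.utime_ns;
        acc.minflt += t.minflt;
        acc
    })
}
```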
8.7.8 /proc/PID/limits Format¶
UmkaOS generates /proc/PID/limits in the exact format that Linux uses, enabling
unmodified tools (bash ulimit -a, prlimit(1), container runtimes) to read it.
Format specification:
- Header line (fixed width):
  `"Limit                     Soft Limit           Hard Limit           Units     "`
- One data line per resource, in ascending RLIMIT order (0 through 15).
- Column layout (left-aligned fixed-width fields):
  - `Limit` column: 26 characters
  - `Soft Limit` column: 21 characters (value or `"unlimited"`)
  - `Hard Limit` column: 21 characters (value or `"unlimited"`)
  - `Units` column: remainder
Example output:
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 31672 62193 processes
Max open files 1024 1048576 files
Max locked memory 67108864 67108864 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 31672 31672 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
The display names, ordering, and units strings must match Linux exactly. The umkafs
virtual filesystem handler for /proc/PID/limits
(Section 20.5) reads the process
RlimitSet under rlimit_lock (since it is a read of the full 32-field set and we
want a consistent snapshot) and formats the output using a static table:
static RLIMIT_DISPLAY: [RlimitDisplay; 16] = [
RlimitDisplay { name: "Max cpu time", unit: "seconds", resource: RLIMIT_CPU },
RlimitDisplay { name: "Max file size", unit: "bytes", resource: RLIMIT_FSIZE },
RlimitDisplay { name: "Max data size", unit: "bytes", resource: RLIMIT_DATA },
RlimitDisplay { name: "Max stack size", unit: "bytes", resource: RLIMIT_STACK },
RlimitDisplay { name: "Max core file size", unit: "bytes", resource: RLIMIT_CORE },
RlimitDisplay { name: "Max resident set", unit: "bytes", resource: RLIMIT_RSS },
RlimitDisplay { name: "Max processes", unit: "processes", resource: RLIMIT_NPROC },
RlimitDisplay { name: "Max open files", unit: "files", resource: RLIMIT_NOFILE },
RlimitDisplay { name: "Max locked memory", unit: "bytes", resource: RLIMIT_MEMLOCK },
RlimitDisplay { name: "Max address space", unit: "bytes", resource: RLIMIT_AS },
RlimitDisplay { name: "Max file locks", unit: "locks", resource: RLIMIT_LOCKS },
RlimitDisplay { name: "Max pending signals", unit: "signals", resource: RLIMIT_SIGPENDING },
RlimitDisplay { name: "Max msgqueue size", unit: "bytes", resource: RLIMIT_MSGQUEUE },
RlimitDisplay { name: "Max nice priority", unit: "", resource: RLIMIT_NICE },
RlimitDisplay { name: "Max realtime priority", unit: "", resource: RLIMIT_RTPRIO },
RlimitDisplay { name: "Max realtime timeout", unit: "us", resource: RLIMIT_RTTIME },
];
Values of u64::MAX are rendered as "unlimited". All other values are rendered as
decimal integers. The output is generated on every read of the /proc file; no caching
is performed (the file is small and infrequently read).
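The fixed-width layout can be reproduced with standard formatting. `fmt_limit` and `limits_line` below are illustrative helper names, not the actual handler's, but the column widths (26/21/21) match the specification above:

```rust
/// Render a limit value: u64::MAX displays as "unlimited".
fn fmt_limit(v: u64) -> String {
    if v == u64::MAX { "unlimited".into() } else { v.to_string() }
}

/// One /proc/PID/limits data line: 26-char name column, two 21-char
/// value columns, then the units string.
fn limits_line(name: &str, soft: u64, hard: u64, unit: &str) -> String {
    format!("{:<26}{:<21}{:<21}{}", name, fmt_limit(soft), fmt_limit(hard), unit)
}
```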
8.7.9 /proc/PID/stat Field Mapping¶
UmkaOS's /proc/PID/stat output is generated directly from Task struct fields, maintaining
Linux field order and encoding for compatibility with tools like ps, top, htop, and
procps.
Linux /proc/PID/stat has 52 space-separated fields in a fixed order (per man 5 proc).
The table below documents the authoritative mapping from each field to the UmkaOS Task struct
field that populates it (field numbers are 1-indexed):
| Field # | Name | Format | UmkaOS Task field | Notes |
|---|---|---|---|---|
| 1 | `pid` | `%d` | `task.pid` | Process ID |
| 2 | `comm` | `%s` | `task.comm` | Command name, truncated to 15 chars, surrounded by `()` |
| 3 | `state` | `%c` | `task.state` → char | R=Running, S=Sleeping, D=Disk sleep, Z=Zombie, T=Stopped, t=Tracing, X=Dead |
| 4 | `ppid` | `%d` | `task.parent.pid` | Parent PID (0 for init) |
| 5 | `pgrp` | `%d` | `task.pgrp` | Process group ID |
| 6 | `session` | `%d` | `task.session` | Session ID |
| 7 | `tty_nr` | `%d` | `task.tty` | Controlling terminal (encoded as `(major<<8)\|minor`), 0 if none |
| 8 | `tpgid` | `%d` | `task.tty.foreground_pgrp` | Foreground process group of controlling terminal, -1 if none |
| 9 | `flags` | `%u` | `task.flags` | Kernel flags bitmask (`PF_*` values, Linux-compatible) |
| 10 | `minflt` | `%lu` | `task.min_faults` | Minor page faults (no I/O needed) |
| 11 | `cminflt` | `%lu` | `task.children_min_faults` | Minor faults of waited-for children |
| 12 | `majflt` | `%lu` | `task.maj_faults` | Major page faults (I/O required) |
| 13 | `cmajflt` | `%lu` | `task.children_maj_faults` | Major faults of waited-for children |
| 14 | `utime` | `%lu` | `task.utime_ticks` | User mode time in clock ticks |
| 15 | `stime` | `%lu` | `task.stime_ticks` | Kernel mode time in clock ticks |
| 16 | `cutime` | `%ld` | `task.children_utime` | User time of waited-for children |
| 17 | `cstime` | `%ld` | `task.children_stime` | Kernel time of waited-for children |
| 18 | `priority` | `%ld` | `task.prio` | Kernel scheduling priority (negated nice-20 for RT) |
| 19 | `nice` | `%ld` | `task.nice` | Nice value: -20 (high priority) to 19 (low priority) |
| 20 | `num_threads` | `%ld` | `task.thread_group.count` | Number of threads in thread group |
| 21 | `itrealvalue` | `%ld` | 0 | Always 0 (obsolete, was jiffies before next SIGALRM) |
| 22 | `starttime` | `%llu` | `task.start_time_ticks` | Start time after boot in clock ticks |
| 23 | `vsize` | `%lu` | `task.mm.total_vm * PAGE_SIZE` | Virtual memory size in bytes |
| 24 | `rss` | `%ld` | `task.mm.rss_pages` | Resident set size in pages |
| 25 | `rsslim` | `%lu` | `task.rlimit[RLIMIT_RSS]` | RSS soft limit in bytes |
| 26 | `startcode` | `%lu` | `task.mm.start_code` | Start address of program text |
| 27 | `endcode` | `%lu` | `task.mm.end_code` | End address of program text |
| 28 | `startstack` | `%lu` | `task.mm.start_stack` | Start address of stack |
| 29 | `kstkesp` | `%lu` | `task.regs.sp` | Current ESP/SP value from pt_regs |
| 30 | `kstkeip` | `%lu` | `task.regs.ip` | Current EIP/PC value from pt_regs |
| 31 | `signal` | `%lu` | `task.pending_signals.bitmap` | Bitmap of pending signals (obsolete; use /proc/PID/status) |
| 32 | `blocked` | `%lu` | `task.signal_mask` | Bitmap of blocked signals (obsolete; use /proc/PID/status) |
| 33 | `sigignore` | `%lu` | `task.sighand.ignored` | Bitmap of ignored signals (obsolete) |
| 34 | `sigcatch` | `%lu` | `task.sighand.caught` | Bitmap of caught signals (obsolete) |
| 35 | `wchan` | `%lu` | `task.wchan` | Wait channel (kernel address, 0 if not waiting) |
| 36 | `nswap` | `%lu` | 0 | Not maintained (always 0, matches Linux) |
| 37 | `cnswap` | `%lu` | 0 | Not maintained (always 0, matches Linux) |
| 38 | `exit_signal` | `%d` | `task.exit_signal` | Signal sent to parent on death (usually SIGCHLD) |
| 39 | `processor` | `%d` | `task.cpu_id` | Last CPU on which task ran |
| 40 | `rt_priority` | `%u` | `task.rt_priority` | RT priority 1-99 (0 for non-RT) |
| 41 | `policy` | `%u` | `task.sched_policy` | Scheduling policy (SCHED_NORMAL=0, SCHED_FIFO=1, etc.) |
| 42 | `delayacct_blkio_ticks` | `%llu` | `task.io_accounting.blkio_delay` | Aggregated block I/O delay in clock ticks |
| 43 | `guest_time` | `%lu` | `task.guest_time_ticks` | Guest time (virtual CPU for guest OS) in clock ticks |
| 44 | `cguest_time` | `%ld` | `task.children_guest_time` | Guest time of waited-for children |
| 45 | `start_data` | `%lu` | `task.mm.start_data` | Start address of initialized data |
| 46 | `end_data` | `%lu` | `task.mm.end_data` | End address of initialized data |
| 47 | `start_brk` | `%lu` | `task.mm.start_brk` | Start address of heap (brk) |
| 48 | `arg_start` | `%lu` | `task.mm.arg_start` | Start address of argv |
| 49 | `arg_end` | `%lu` | `task.mm.arg_end` | End address of argv |
| 50 | `env_start` | `%lu` | `task.mm.env_start` | Start address of environment |
| 51 | `env_end` | `%lu` | `task.mm.env_end` | End address of environment |
| 52 | `exit_code` | `%d` | `task.exit_code` | Exit status as reported by waitpid(2) (WEXITSTATUS encoding) |
All 52 fields match the Linux man 5 proc specification. Field numbers are verified against
Linux kernel fs/proc/array.c. Fields 31-34 (signal bitmaps) are rendered as single %lu
values for compatibility; /proc/PID/status provides the full 64-signal format.
Field path clarifications: The task.* paths in the table above are logical
references to the Task struct (Section 8.1) and its sub-objects.
Several commonly referenced fields are resolved through the Process struct, not
directly on Task:
| Logical path | Actual resolution |
|---|---|
| `task.pid` | `task.process.pid` (PID namespace-aware) |
| `task.pgrp` | `task.process.pgid` |
| `task.session` | `task.process.session_id` |
| `task.tty` | `task.process.controlling_tty` |
| `task.parent.pid` | `task.process.parent.pid` |
| `task.flags` | `task.process.flags` (`PF_*` bitmask) |
| `task.exit_signal` | `task.process.exit_signal` |
| `task.exit_code` | `task.process.exit_code.load(Acquire)` |
| `task.min_faults` | `task.rusage.minflt` |
| `task.maj_faults` | `task.rusage.majflt` |
| `task.utime` | `task.rusage.utime_ns / tick_ns` |
| `task.stime` | `task.rusage.stime_ns / tick_ns` |
| `task.children_utime` | `task.process.signal_struct.cutime` |
| `task.children_stime` | `task.process.signal_struct.cstime` |
| `task.prio` | `task.sched_entity.prio` |
| `task.nice` | `task.sched_entity.nice` |
| `task.rt_priority` | `task.sched_entity.rt_priority` |
| `task.sched_policy` | `task.sched_entity.policy` |
| `task.mm.*` | `task.process.mm.*()` |
| `task.pending_signals.bitmap` | `task.pending.signal.bits[0]` |
| `task.sighand.*` | `task.process.sighand.*` |
| `task.wchan` | `task.context.wchan` (arch-specific) |
| `task.cpu_id` | `task.sched_entity.cpu` |
| `task.start_time_ticks` | `task.start_time_ns / tick_ns` |
| `task.guest_time_ticks` | `task.rusage.guest_time_ns / tick_ns` |
| `task.io_accounting.blkio_delay` | `task.rusage.blkio_delay_ns / tick_ns` |
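As a minimal illustration of the field order and encoding, the first four fields could be emitted as follows. `stat_prefix` is a hypothetical helper; the real generator emits all 52 fields in one pass:

```rust
/// Emit the first four /proc/PID/stat fields: pid, comm, state, ppid.
/// comm is truncated to 15 characters and wrapped in parentheses;
/// fields are separated by single spaces.
fn stat_prefix(pid: i32, comm: &str, state: char, ppid: i32) -> String {
    let comm: String = comm.chars().take(15).collect();
    format!("{} ({}) {} {}", pid, comm, state, ppid)
}
```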
8.7.10 Linux Compatibility Notes¶
| Topic | Detail |
|---|---|
| `RLIMIT_*` numeric values | Identical to Linux (0–15) |
| `RLIM_INFINITY` wire value | `u64::MAX` on 64-bit; `u32::MAX` on 32-bit compat path |
| `prlimit64` syscall number | 302 on x86-64 (verified against Linux 6.x) |
| `getrlimit` / `setrlimit` | Syscall numbers 97 / 160 on x86-64 |
| `getrusage` syscall number | 98 on x86-64 |
| `times` syscall number | 100 on x86-64 |
| `RUSAGE_SELF` / `RUSAGE_CHILDREN` / `RUSAGE_THREAD` | Values 0, -1, 1 match Linux |
| `struct rusage` layout | Binary-identical, including unused zero fields |
| `struct rlimit64` layout | Binary-identical (two u64 fields) |
| `USER_HZ` | 100 (matches Linux default; not runtime-configurable) |
| `NR_OPEN` default | 1,048,576 (matches fs.nr_open default in Linux) |
| `PID_MAX_LIMIT` | 4,194,304 (matches Linux pid_max hard ceiling) |
| `RLIMIT_LOCKS` | Always RLIM_INFINITY; enforcement not implemented (matches modern Linux) |
| `/proc/PID/limits` format | Character-for-character identical to Linux |
| `prlimit64` credential check | Requires PTRACE_MODE_ATTACH_REALCREDS or matching real/saved UID |
| `RLIMIT_RSS` enforcement | Advisory in Linux; UmkaOS integrates with cgroup memory controller |
| `RLIMIT_NICE` encoding | 20 - rlim_cur = minimum nice value; rlim_cur 0 → nice min 20 (no privilege), rlim_cur 20 → nice min 0 |
| SIGXCPU on CPU limit | Delivered repeatedly (every second) after soft limit; SIGKILL at hard limit |
| SIGXFSZ on file size | Signal + EFBIG returned, matching Linux exactly |