Sections 10–16 of the ISLE Architecture. For the full table of contents, see README.md.
Part III: Core Kernel Subsystems
Concurrency, security, memory management, scheduling, and real-time guarantees.
10. Concurrency Model
10.1 Rust Ownership for Lock-Free Paths
Rust's ownership system provides compile-time guarantees that replace many of Linux's runtime-only checking tools (lockdep, KASAN, KCSAN).
/// Per-CPU data: compile-time prevention of cross-CPU access.
/// Access requires a proof token that can only be obtained with
/// preemption disabled on the current CPU.
///
/// The backing storage is dynamically sized at boot based on the actual
/// CPU count discovered from ACPI MADT / device tree / firmware tables.
/// No compile-time MAX_CPUS constant — the kernel adapts to the hardware.
/// Allocation uses the boot allocator (Section 12.1) during early init,
/// before the general-purpose allocator is available.
pub struct PerCpu<T> {
/// Pointer to dynamically allocated array of per-CPU slots.
/// Length = num_possible_cpus(), discovered at boot.
data: *mut UnsafeCell<T>,
/// Number of CPU slots (set once at boot, never changes).
count: usize,
/// Per-slot borrow state tracking for runtime aliasing detection.
/// Length = count, parallel to data array. Each slot tracks:
/// - 0 = free (no active borrows)
/// - 1..=MAX-1 = that many active read borrows
/// - u32::MAX = mutably borrowed
/// This array enables detection of aliased borrows across multiple
/// PreemptGuard instances, which the type system cannot prevent alone.
borrow_state: *mut AtomicU32,
}
impl<T> PerCpu<T> {
/// Returns a reference to the borrow_state atomic for the given CPU slot.
/// # Panics
/// Panics if cpu >= count (bounds check).
fn borrow_state(&self, cpu: usize) -> &'static AtomicU32 {
assert!(cpu < self.count, "PerCpu: cpu {} out of range (count {})", cpu, self.count);
// SAFETY: borrow_state was allocated with `count` elements at boot,
// and cpu < count is verified above. The pointer remains valid for
// the kernel's lifetime (static allocation). The returned reference
// is 'static because the borrow_state array outlives all PerCpu usages.
unsafe { &*self.borrow_state.add(cpu) }
}
/// # Safety contract
///
/// `PerCpu<T>` is a global singleton per data type, accessed via a static
/// reference. Soundness depends on preventing aliased mutable references
/// to the same CPU's slot. Three distinct hazards must be addressed:
///
/// 1. **Cross-thread aliasing**: Two threads on different CPUs access
/// *different* slots, so no aliasing occurs. A thread cannot migrate
/// mid-access because preemption is disabled by the guard.
///
/// 2. **Interrupt aliasing**: Interrupt handlers (including NMI) may fire
/// while preemption is disabled. If an interrupt handler accesses the
/// same per-CPU variable mutably, this creates aliased `&mut T`
/// references — undefined behavior in Rust. Therefore:
/// - **Read-only access via `get()`** is safe with only preemption
/// disabled, because multiple `&T` references are permitted.
/// `get()` returns a `PerCpuRefGuard` that only disables preemption.
/// - **Mutable access via `get_mut()`** requires BOTH preemption
/// disabled AND local interrupts disabled. `get_mut()` returns a
/// `PerCpuMutGuard` that disables local interrupts (via
/// `local_irq_save`/`local_irq_restore`, defined in Section 18 per-arch)
/// for the duration of the borrow, in addition to disabling
/// preemption. This prevents any maskable interrupt handler from
/// observing or mutating the slot while the caller holds `&mut T`.
///
/// 3. **NMI constraint**: NMIs (non-maskable interrupts) cannot be
/// disabled by `local_irq_save`. An NMI that fires during a
/// `get_mut()` borrow will see `borrow_state == u32::MAX`. If the
/// NMI handler calls `get()` on the **same** PerCpu variable, the
/// borrow_state check detects the conflict and panics. This is the
/// intended safety mechanism — panicking is preferable to silent
/// data corruption.
///
/// **NMI handler rules**:
/// - NMI handlers MUST NOT call `get_mut()` on any PerCpu variable.
/// - NMI handlers SHOULD avoid calling `get()` on PerCpu variables
/// that are commonly accessed via `get_mut()` elsewhere, because
/// the conflict causes a panic (which in NMI context halts the
/// CPU).
/// - NMI handlers that need per-CPU state MUST use dedicated
/// NMI-specific PerCpu variables that are only read (never
/// mutated) from non-NMI contexts, or use raw atomic fields
/// outside the PerCpu abstraction (e.g., the pre-allocated
/// per-CPU crash buffer used by the panic NMI handler).
/// - ISLE's NMI handlers (panic coordinator, watchdog, perf sampling)
/// follow this rule: they write only to dedicated pre-allocated
/// per-CPU buffers and never access the general PerCpu<T> API.
///
/// The `PreemptGuard` is `!Send`, `!Clone`, and pin-bound to the issuing
/// CPU, proving that the caller is on the correct CPU. The borrow checker
/// prevents calling `get()` and `get_mut()` on the same guard
/// simultaneously, preventing aliased `&T` + `&mut T` within the same
/// non-interrupt context.
///
/// **Aliasing safety**: The `&mut PreemptGuard` borrow prevents aliasing
/// through the *same* guard, but nothing in the type system prevents a
/// caller from creating two separate `PreemptGuard`s and using them to
/// obtain aliased references (e.g., `&T` from one guard and `&mut T` from
/// another, or two `&mut T`). To close this hole, `PerCpu<T>` maintains a
/// per-slot `borrow_state: AtomicU32` with the following encoding:
/// - `0` = slot is free (no active borrows)
/// - `1..=MAX-1` = slot has that many active read borrows (`get()` in use)
/// - `u32::MAX` = slot is mutably borrowed (`get_mut()` in use)
///
/// `get()` atomically increments the reader count (fails if currently
/// `u32::MAX`, i.e., a writer is active). `get_mut()` atomically
/// transitions from `0` to `u32::MAX` (fails if any readers OR another
/// writer are active). Both `PerCpuRefGuard` and `PerCpuMutGuard` restore
/// the borrow_state on drop. Violations panic unconditionally — in both
/// debug AND release builds. This is a kernel: undefined behavior from
/// aliased references is never acceptable, even in release. The cost of
/// one atomic operation per `get()` / `get_mut()` call is negligible
/// compared to the cost of memory corruption.
/// See `get()` and `get_mut()` documentation below.
/// Read-only access to the current CPU's slot. Takes `&PreemptGuard`,
/// which only requires preemption to be disabled. Multiple `&T` references
/// are sound when `T: Sync`, which guarantees `&T` may be shared across
/// concurrent contexts; interrupt handlers may also hold `&T` to the same
/// slot without causing UB.
/// Returns a `PerCpuRefGuard<T>` that derefs to `&T`.
///
/// The lifetime `'g` ties the returned guard to the `PreemptGuard`, not
/// to `&self`. This prevents dropping the `PreemptGuard` (re-enabling
/// preemption and allowing migration) while still holding a reference
/// to this CPU's slot — which would be use-after-migrate unsoundness.
///
/// The `T: Sync` bound is required because interrupt handlers may
/// concurrently hold `&T` to the same slot. Without `Sync`, types like
/// `Cell<u32>` would allow data races between the caller and interrupt
/// handlers sharing the same per-CPU slot.
///
/// **Borrow-state protocol**: `get()` atomically increments the per-slot
/// `borrow_state` counter, provided the current value is not `u32::MAX`
/// (which indicates an active mutable borrow). If the slot is currently
/// mutably borrowed, `get()` panics. This prevents the `&T` + `&mut T`
/// aliasing UB that would arise if a caller obtained an immutable borrow
/// via one `PreemptGuard` while another context held a mutable borrow via
/// a second `PreemptGuard`. The `PerCpuRefGuard` decrements the counter
/// on drop, returning the slot to a lower reader count or back to 0.
pub fn get<'g>(&self, guard: &'g PreemptGuard) -> PerCpuRefGuard<'g, T>
where
T: Sync,
{
let cpu = guard.cpu_id();
assert!(cpu < self.count, "PerCpu: cpu_id {} out of range (count {})", cpu, self.count);
// Atomically increment the reader count. Fail if a writer holds the
// slot (borrow_state == u32::MAX). Use a compare-exchange loop so we
// can distinguish "writer active" from a successful increment.
// SAFETY: local IRQ state is not required for get() — read-only access
// is sound from interrupt context (T: Sync). Preemption is disabled by
// the PreemptGuard, so the CPU cannot migrate between the increment and
// the slot access.
let state = self.borrow_state(cpu);
let prev = state.fetch_update(Ordering::Acquire, Ordering::Relaxed, |v| {
// Reject if a writer is active (u32::MAX), or if one more reader would
// overflow the count into the writer sentinel (u32::MAX - 1 readers).
if v >= u32::MAX - 1 { None } else { Some(v + 1) }
});
if prev.is_err() {
panic!("PerCpu: slot {} already mutably borrowed — cannot take shared borrow", cpu);
}
// SAFETY: PreemptGuard guarantees we are on cpu_id() and cannot
// migrate. borrow_state increment above ensures no mutable borrow is
// active. Read-only access is safe even if interrupt handlers also
// hold &T, because T: Sync guarantees shared references are sound
// across concurrent contexts (caller + interrupt handlers).
unsafe {
PerCpuRefGuard {
value: &*(*self.data.add(cpu)).get(),
borrow_state: self.borrow_state(cpu),
_guard: PhantomData,
}
}
}
/// Mutable access to the current CPU's slot. Takes `&mut PreemptGuard`
/// to prevent aliasing with any concurrent `get()` or `get_mut()` on
/// the same guard in the caller's context. Additionally, the returned
/// `PerCpuMutGuard` disables local interrupts (via `local_irq_save`)
/// to prevent interrupt handlers from accessing this slot while `&mut T`
/// is live. Interrupts are restored when the guard is dropped.
///
/// The lifetime `'g` ties the returned guard to the `PreemptGuard`,
/// preventing use-after-migrate (same rationale as `get()`).
///
/// This two-layer protection is necessary because preemption-disable
/// alone does NOT prevent interrupt handlers from firing and potentially
/// accessing the same per-CPU variable.
///
/// **Aliasing safety**: The `&mut PreemptGuard` borrow prevents aliasing
/// through the *same* guard, but callers could theoretically create a
/// second `PreemptGuard` and call `get()` or `get_mut()` again on the
/// same slot, producing `&T` + `&mut T` or two `&mut T` — both are UB.
/// To close this hole, `get_mut()` atomically transitions the per-slot
/// `borrow_state` from exactly `0` (free) to `u32::MAX` (writer) using
/// a compare-exchange. If the slot has any active readers (count > 0) or
/// another writer (count == u32::MAX), the CAS fails and `get_mut()`
/// panics unconditionally — in both debug AND release builds. This is a
/// kernel: UB from aliased references is never acceptable, even in release.
/// The cost of one CAS per `get_mut()` call is negligible compared to the
/// cost of memory corruption. `PerCpu<T>` is designed to be used with a
/// single `PreemptGuard` per critical section — creating multiple guards
/// and using them to access the same `PerCpu<T>` is a logic error detected
/// at runtime.
pub fn get_mut<'g>(&self, guard: &'g mut PreemptGuard) -> PerCpuMutGuard<'g, T> {
let cpu = guard.cpu_id();
assert!(cpu < self.count, "PerCpu: cpu_id {} out of range (count {})", cpu, self.count);
// SAFETY: local_irq_save() MUST be called BEFORE updating borrow_state.
// If we update borrow_state first and an interrupt fires before IRQs
// are disabled, an interrupt handler calling get() on the same PerCpu
// variable would see borrow_state == u32::MAX and panic. Disabling IRQs
// first ensures no interrupt handler can observe or race with the
// borrow_state transition.
let saved_flags = local_irq_save();
// Runtime check (always active, including release builds): atomically
// transition borrow_state from 0 (free) to u32::MAX (writer). Fails if
// any reader (count > 0) or another writer (u32::MAX) is active. Fires
// both for &mut T + &mut T (two writers) and for &T + &mut T (reader
// + writer) aliasing. This is a kernel — UB is never acceptable.
// This check is safe from interrupt-handler races because IRQs are
// already disabled above.
let state = self.borrow_state(cpu);
// Note: the kernel is compiled with `panic = "abort"` (no unwinding), so a
// failed check cannot leak borrow_state: the panic immediately halts the
// core. For Tier 1 drivers sharing the address space, driver panics are
// caught by the driver fault handler (Section 9), which resets the
// driver's state, including any held per-CPU borrows.
if state.compare_exchange(0, u32::MAX, Ordering::Acquire, Ordering::Relaxed).is_err() {
local_irq_restore(saved_flags);
panic!("PerCpu: slot {} already borrowed (reader or writer active)", cpu);
}
// SAFETY: PreemptGuard prevents migration. local_irq_save() prevents
// interrupt handlers from accessing this slot. The CAS above guarantees
// no active readers or writers on this slot. Together, these guarantee
// exclusive access to this CPU's slot.
unsafe {
PerCpuMutGuard {
value: &mut *(*self.data.add(cpu)).get(),
saved_flags,
borrow_state: self.borrow_state(cpu),
_guard: PhantomData,
}
}
}
}
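The borrow-state protocol above can be exercised in isolation. The sketch below is a userspace approximation using only `std` atomics: `PreemptGuard`, interrupts, and the per-CPU storage are elided, and the helper names (`try_shared`, `try_exclusive`, and so on) are illustrative, not the kernel's actual API.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const WRITER: u32 = u32::MAX;

/// Register a shared borrow: increment the reader count unless a writer
/// holds the slot or the count would overflow into the writer sentinel.
fn try_shared(state: &AtomicU32) -> bool {
    state
        .fetch_update(Ordering::Acquire, Ordering::Relaxed, |v| {
            if v >= WRITER - 1 { None } else { Some(v + 1) }
        })
        .is_ok()
}

/// Register an exclusive borrow: transition 0 (free) -> WRITER.
fn try_exclusive(state: &AtomicU32) -> bool {
    state
        .compare_exchange(0, WRITER, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
}

fn release_shared(state: &AtomicU32) {
    state.fetch_sub(1, Ordering::Release);
}

fn release_exclusive(state: &AtomicU32) {
    state.store(0, Ordering::Release);
}

fn main() {
    let state = AtomicU32::new(0);

    // Multiple readers may coexist; a writer is rejected while they are active.
    assert!(try_shared(&state));
    assert!(try_shared(&state));
    assert!(!try_exclusive(&state)); // readers active: writer CAS fails

    release_shared(&state);
    release_shared(&state);

    // With the slot free, the writer succeeds and excludes readers.
    assert!(try_exclusive(&state));
    assert!(!try_shared(&state)); // writer active: reader increment fails

    release_exclusive(&state);
    assert_eq!(state.load(Ordering::Relaxed), 0);
}
```

In the kernel, a failed `try_shared`/`try_exclusive` corresponds to an unconditional panic rather than a recoverable `false`.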
/// Guard for read-only per-CPU access. Only requires preemption disabled.
/// Implements `Deref<Target = T>` for ergonomic read access.
///
/// The lifetime `'a` is tied to the `PreemptGuard`, not to the `PerCpu<T>`
/// container. This ensures the guard cannot outlive the preemption-disabled
/// critical section, preventing use-after-migrate.
///
/// On creation, the per-slot `borrow_state` reader count was incremented by
/// `get()`. On drop, `PerCpuRefGuard` decrements the reader count, returning
/// the slot to its prior borrow state. This is necessary so that `get_mut()`
/// can detect when readers are no longer active and safely transition to
/// writer mode.
pub struct PerCpuRefGuard<'a, T> {
value: &'a T,
/// Reference to the per-slot borrow_state counter so Drop can decrement
/// the reader count. Always present (not debug-only) because the runtime
/// borrow check is active in all build profiles.
borrow_state: &'a AtomicU32,
/// Ties this guard's lifetime to the `PreemptGuard`, not to `PerCpu<T>`.
_guard: PhantomData<&'a PreemptGuard>,
}
/// `PerCpuRefGuard` must NOT be sent to another CPU/thread. It holds a
/// reference to a per-CPU slot that is only valid on the CPU where the
/// `PreemptGuard` was obtained. Sending it to another thread (which runs
/// on a different CPU) would allow reading another CPU's slot without any
/// synchronization, violating the per-CPU data invariant.
impl<T> !Send for PerCpuRefGuard<'_, T> {}
impl<'a, T> core::ops::Deref for PerCpuRefGuard<'a, T> {
type Target = T;
fn deref(&self) -> &T { self.value }
}
impl<'a, T> Drop for PerCpuRefGuard<'a, T> {
fn drop(&mut self) {
// Decrement the reader count with underflow detection.
//
// CORRECTNESS: fetch_sub on 0 wraps to u32::MAX (the writer sentinel),
// which would corrupt the borrow state. We detect this case after the
// fact and panic rather than silently corrupting state.
//
// Normal case: borrow_state is 1..=u32::MAX-1 (one or more readers).
// Underflow case: borrow_state is 0 (no readers — this is a bug).
// Writer-corruption case: borrow_state is u32::MAX (impossible if get()
// correctly checked for writer before incrementing).
//
// We panic on error because it indicates a logic error in the PerCpu
// borrow tracking, and continuing would corrupt state. The recovery
// store attempts to restore a consistent state for debugging, but
// the kernel will halt anyway (panic = "abort" for kernel code).
//
// **Why no IRQ protection?** In the normal case the decrement leaves a
// valid reader count, so an interrupt handler firing mid-drop observes a
// consistent state. In the underflow case, a handler that fires between
// fetch_sub and the check and calls get() on the same PerCpu sees
// borrow_state == u32::MAX (writer sentinel) and panics inside get(); it
// does NOT proceed with corrupted state. Adding IRQ save/restore to every
// guard drop (hot path) to protect code paths that already panic is
// unnecessary overhead.
let old = self.borrow_state.fetch_sub(1, Ordering::Release);
if old == 0 {
// Underflow: no reader was registered. This is a kernel bug.
// The fetch_sub wrapped to u32::MAX. Restore to 0 and panic.
self.borrow_state.store(0, Ordering::Release);
panic!("PerCpuRefGuard::drop: borrow_state underflow — double-drop or missing get()");
}
if old == u32::MAX {
// Was in writer mode — this should be impossible since get() fails
// when a writer is active. If we reach here, state is corrupted.
// The fetch_sub wrapped to u32::MAX-1. Restore sentinel and panic.
self.borrow_state.store(u32::MAX, Ordering::Release);
panic!("PerCpuRefGuard::drop: borrow_state was u32::MAX (writer sentinel) — corrupted state");
}
// Normal case: old was 1..=u32::MAX-1, now decremented successfully.
// No further action needed.
}
}
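The underflow branch relies on the wrapping semantics of `fetch_sub`. A minimal standalone check of that behavior, using plain `std` atomics and no kernel types:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

fn main() {
    // Atomic fetch_sub wraps on underflow: subtracting 1 from 0 yields
    // u32::MAX, which is exactly the writer sentinel. The Drop impl above
    // exploits this: `old == 0` identifies the underflow after the fact,
    // and the stored value is repaired before panicking.
    let state = AtomicU32::new(0);
    let old = state.fetch_sub(1, Ordering::Release);
    assert_eq!(old, 0);                                  // underflow detected
    assert_eq!(state.load(Ordering::Relaxed), u32::MAX); // wrapped to sentinel

    // Repair step mirroring PerCpuRefGuard::drop before its panic.
    state.store(0, Ordering::Release);
    assert_eq!(state.load(Ordering::Relaxed), 0);
}
```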
/// Guard for mutable per-CPU access. Disables local interrupts on creation,
/// restores them on drop. This prevents interrupt handlers from creating
/// aliased references to the same per-CPU slot.
///
/// The lifetime `'a` is tied to the `PreemptGuard` (via `PhantomData`),
/// not to the `PerCpu<T>` container. Same rationale as `PerCpuRefGuard`.
///
/// On creation, the per-slot `borrow_state` was set to `u32::MAX` (writer
/// sentinel) by `get_mut()`. On drop, `PerCpuMutGuard` resets `borrow_state`
/// to `0` (free) before restoring local IRQs, allowing subsequent `get()` or
/// `get_mut()` calls on this slot.
pub struct PerCpuMutGuard<'a, T> {
value: &'a mut T,
saved_flags: usize,
/// Reference to the per-slot borrow_state counter so Drop can reset it
/// to 0 (free), enabling detection of concurrent borrows in subsequent
/// callers. Always present (not debug-only) because aliasing UB is never
/// acceptable in a kernel.
borrow_state: &'a AtomicU32,
/// Ties this guard's lifetime to the `PreemptGuard`, not to `PerCpu<T>`.
_guard: PhantomData<&'a mut PreemptGuard>,
}
/// `PerCpuMutGuard` must NOT be sent to another CPU/thread. The `saved_flags`
/// field contains the interrupt state of the originating CPU (saved via
/// `local_irq_save`). Restoring these flags on a different CPU would corrupt
/// that CPU's interrupt state — potentially enabling interrupts that should
/// be disabled or vice versa, leading to missed interrupts or unsafe re-entry.
impl<T> !Send for PerCpuMutGuard<'_, T> {}
impl<'a, T> core::ops::Deref for PerCpuMutGuard<'a, T> {
type Target = T;
fn deref(&self) -> &T { self.value }
}
impl<'a, T> core::ops::DerefMut for PerCpuMutGuard<'a, T> {
fn deref_mut(&mut self) -> &mut T { self.value }
}
impl<'a, T> Drop for PerCpuMutGuard<'a, T> {
fn drop(&mut self) {
// Reset borrow_state from u32::MAX (writer) back to 0 (free) BEFORE
// restoring IRQs. This ensures that any interrupt handler firing
// immediately after IRQ restoration sees the slot as free and can
// call get() or get_mut() without a false-positive panic.
self.borrow_state.store(0, Ordering::Release);
local_irq_restore(self.saved_flags);
}
}
/// Per-CPU locks: type-safe sharded locking without array-indexing aliasing hazards.
///
/// A common pattern in scalable kernels is to give each CPU its own lock protecting
/// a shard of data, avoiding contention on a single global lock. Naively implementing
/// this as `[SpinLock<T>; N]` creates a type-safety hole: Rust's aliasing rules permit
/// holding `&mut T` from `locks[i]` and `&mut T` from `locks[j]` simultaneously if
/// `i != j`, but nothing in the type system prevents accessing `locks[other_cpu]`
/// when the caller intended to access only `locks[this_cpu]`. Furthermore, if two
/// CPUs simultaneously hold their respective locks, the `&mut T` references are
/// derived from the same array allocation, which can violate LLVM's noalias assumptions
/// in ways that are difficult to reason about.
///
/// ISLE provides `PerCpuLock<T>` to enforce the per-CPU locking invariant at the
/// type level:
///
/// 1. **Separate allocations per CPU**: Each CPU's lock+data is a separate slab
/// allocation (one `PerCpuLockSlot<T>` per CPU), NOT an array element. This ensures
/// that `&mut T` references from different CPUs are derived from disjoint
/// allocations, satisfying Rust's aliasing rules without relying on the
/// "different array indices" reasoning that LLVM may not honor.
///
/// 2. **Access restricted to current CPU**: `PerCpuLock::lock()` takes a
/// `&PreemptGuard` (the same proof token used by `PerCpu<T>`) and only returns
/// a guard for the current CPU's lock. There is no API to access another CPU's
/// lock — the type system makes it impossible.
///
/// 3. **Cache-line aligned**: Each `PerCpuLockSlot<T>` is `#[repr(align(64))]` to
/// prevent false sharing. The alignment is part of the type, not a runtime hint.
///
/// 4. **Safe composition with `PerCpu<T>`**: For cases where per-CPU data needs
/// both a lock and lock-free access, use `PerCpu<UnsafeCell<T>>` with explicit
/// `PerCpuMutGuard` for writes, or wrap the data in `PerCpu<SpinLock<T>>` where
/// the lock itself is per-CPU. The key invariant is that `PerCpuLock<T>` never
/// exposes `&mut T` from one CPU while another CPU's lock is held — each CPU's
/// lock guards only that CPU's data shard.
///
/// **Use case**: Per-CPU statistics counters that need occasional atomic updates
/// across all CPUs (e.g., network Rx packet counts). Each CPU locks its own slot
/// for local updates; aggregating across all CPUs requires iterating without holding
/// any locks (the counters are `AtomicU64`, so reads are consistent).
pub struct PerCpuLock<T> {
/// Array of pointers to independently-allocated lock slots.
/// Each pointer points to a separately-allocated `PerCpuLockSlot<T>`.
/// Length = num_possible_cpus(), discovered at boot.
/// The indirection ensures each slot is a separate allocation for aliasing purposes.
///
/// **Allocation timing**: `PerCpuLock<T>` is initialized during phase 2 of boot
/// (after the slab allocator is available, Section 12.3). Each slot is allocated
/// via the slab allocator (not `Box`/global allocator). The pointer array itself
/// is allocated from the boot bump allocator during early init.
slots: *mut *mut PerCpuLockSlot<T>,
/// Number of CPU slots (set once at boot, never changes).
count: usize,
}
/// A single CPU's lock slot, cache-line aligned to prevent false sharing.
/// This is allocated independently (via boot allocator) for each CPU at boot.
///
/// `#[repr(align(64))]` guarantees:
/// - The struct's starting address is 64-byte aligned.
/// - The compiler pads the struct to a multiple of 64 bytes automatically.
/// - `size_of::<PerCpuLockSlot<T>>()` >= 64 regardless of `SpinLock<T>` size.
///
/// No explicit `_pad` field is needed — the compiler handles padding.
/// This works correctly for any `SpinLock<T>` size: small types get padded
/// up to 64 bytes, large types round up to the next 64-byte boundary.
#[repr(align(64))]
struct PerCpuLockSlot<T> {
/// The spinlock protecting this CPU's data shard.
lock: SpinLock<T>,
}
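The layout guarantees claimed for `#[repr(align(64))]` can be checked directly. The sketch below uses stand-in `SpinLock` and slot types (not the kernel's) to confirm the padding and round-up behavior for both small and large payloads:

```rust
use std::cell::UnsafeCell;
use std::mem::{align_of, size_of};
use std::sync::atomic::AtomicBool;

// Stand-in for the kernel's SpinLock<T>, just to give the slot a plausible body.
#[allow(dead_code)]
struct SpinLock<T> {
    locked: AtomicBool,
    data: UnsafeCell<T>,
}

#[repr(align(64))]
struct PerCpuLockSlot<T> {
    #[allow(dead_code)]
    lock: SpinLock<T>,
}

fn main() {
    // A small payload is padded up to one full cache line...
    assert_eq!(align_of::<PerCpuLockSlot<u8>>(), 64);
    assert_eq!(size_of::<PerCpuLockSlot<u8>>(), 64);
    // ...and a large payload (72-byte array plus the lock flag) rounds up
    // to the next 64-byte multiple.
    assert_eq!(align_of::<PerCpuLockSlot<[u64; 9]>>(), 64);
    assert_eq!(size_of::<PerCpuLockSlot<[u64; 9]>>(), 128);
}
```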
impl<T> PerCpuLock<T> {
/// Lock the current CPU's data shard.
///
/// # Safety contract
///
/// - Takes `&PreemptGuard` to prove the caller is pinned to a specific CPU.
/// - Returns `PerCpuLockGuard<'_, T>` that derefs to `&T` and `&mut T`.
/// - The guard holds the spinlock for this CPU's slot.
/// - No API exists to lock another CPU's shard — the only access path is
/// through the current CPU, enforced by the `PreemptGuard` proof token.
///
/// **Aliasing safety**: Because each slot is a separate heap allocation,
/// holding `&mut T` on CPU 0 and `&mut T` on CPU 1 simultaneously is sound —
/// the references are derived from disjoint allocations, not from different
/// indices of the same array. This satisfies Rust's aliasing rules and
/// LLVM's noalias semantics without subtle reasoning about array indexing.
///
/// **Interrupt safety**: `SpinLock::lock()` disables preemption for the
/// critical section. If the caller needs to also disable interrupts
/// (to prevent interrupt handlers from deadlocking on the same lock),
/// they must wrap the call in `local_irq_save()`/`local_irq_restore()`.
/// For per-CPU locks, this is rarely needed because interrupt handlers
/// typically access different data or use lock-free patterns.
pub fn lock<'g>(&self, guard: &'g PreemptGuard) -> PerCpuLockGuard<'g, T> {
let cpu = guard.cpu_id();
assert!(cpu < self.count, "PerCpuLock: cpu_id {} out of range", cpu);
// SAFETY: slots was allocated with count elements at boot. cpu is in bounds.
let slot_ptr = unsafe { *self.slots.add(cpu) };
// SAFETY: slot_ptr was obtained from slab allocation at boot and is never
// freed during kernel operation. It points to a valid PerCpuLockSlot<T>.
let slot = unsafe { &*slot_ptr };
PerCpuLockGuard {
inner: slot.lock.lock(),
_cpu_pin: PhantomData,
}
}
/// Try to lock the current CPU's data shard without blocking.
///
/// Returns `Some(PerCpuLockGuard)` if the lock was acquired, `None` if
/// the lock is currently held (e.g., by an interrupt handler on this CPU).
///
/// Same safety contract as `lock()`, but non-blocking.
pub fn try_lock<'g>(&self, guard: &'g PreemptGuard) -> Option<PerCpuLockGuard<'g, T>> {
let cpu = guard.cpu_id();
assert!(cpu < self.count, "PerCpuLock: cpu_id {} out of range", cpu);
let slot_ptr = unsafe { *self.slots.add(cpu) };
let slot = unsafe { &*slot_ptr };
slot.lock.try_lock().map(|inner| PerCpuLockGuard {
inner,
_cpu_pin: PhantomData,
})
}
/// Access all slots for cross-CPU aggregation (read-only, no locks held).
///
/// This is the ONLY way to access another CPU's slot, and it only provides
/// read-only access to the lock structure itself — NOT to the protected data.
/// The typical use case is iterating over all CPUs' atomic counters.
///
/// # Safety
///
/// The caller must ensure no CPU is currently mutating its data shard through
/// a `PerCpuLockGuard`. For atomic counter aggregation, this is safe because
/// the counters are read atomically without holding the lock. For non-atomic
/// data, the caller must use external synchronization (e.g., pause all CPUs
/// via IPI) before calling this method.
///
/// Returns an iterator over `&SpinLock<T>` for each CPU. The caller can
/// read the lock state or use `try_lock()` on each, but cannot obtain `&mut T`
/// through this path without holding the lock.
pub unsafe fn iter_slots<'a>(&'a self) -> impl Iterator<Item = &'a SpinLock<T>> + 'a {
(0..self.count).map(move |i| {
// SAFETY: `slots` was allocated with `count` elements at boot and
// i < count. Each slot pointer is valid for the kernel's lifetime
// (allocated at boot, never freed during normal operation).
unsafe {
let slot_ptr = *self.slots.add(i);
&(*slot_ptr).lock
}
})
}
}
/// Guard for a per-CPU lock. Implements `Deref`/`DerefMut` to access the protected data.
///
/// The guard holds a `SpinLockGuard` derived from the current CPU's lock slot.
/// The `PhantomData<&'a PreemptGuard>` ties the guard's lifetime to the CPU pin,
/// preventing use-after-migrate if the caller drops the `PreemptGuard` while
/// still holding the lock guard.
pub struct PerCpuLockGuard<'a, T> {
inner: SpinLockGuard<'a, T>,
_cpu_pin: PhantomData<&'a PreemptGuard>,
}
impl<'a, T> core::ops::Deref for PerCpuLockGuard<'a, T> {
type Target = T;
fn deref(&self) -> &T { &*self.inner }
}
impl<'a, T> core::ops::DerefMut for PerCpuLockGuard<'a, T> {
fn deref_mut(&mut self) -> &mut T { &mut *self.inner }
}
/// `PerCpuLockGuard` must NOT be sent to another CPU/thread.
/// The guard holds a `SpinLockGuard` from a specific CPU's lock slot.
/// Sending it to another thread would allow that thread to access a lock
/// that may be concurrently acquired by the original CPU (e.g., in an
/// interrupt handler), causing deadlock or data corruption.
impl<T> !Send for PerCpuLockGuard<'_, T> {}
Why separate allocations matter for type soundness:
The naive [SpinLock<T>; N] approach has a subtle aliasing problem. When CPU 0
holds locks[0].lock() and CPU 1 holds locks[1].lock(), both CPUs have
derived their &mut T references from the same array allocation. While Rust's
reference rules permit disjoint array element access, LLVM's noalias attribute
and the optimizer's alias analysis may not distinguish between "different indices
of the same array" and "same allocation." In practice, this is unlikely to cause
miscompilation with current LLVM, but the ISLE architecture takes a conservative
approach: each CPU's lock+data is a separate heap allocation, guaranteeing that
the &mut T references are truly disjoint at the allocation level.
This design also simplifies reasoning about memory reclamation: if a per-CPU lock slot needs to be freed (e.g., during CPU hot-unplug), the individual slab allocation can be returned without affecting other CPUs' slots.
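A userspace sketch of the one-allocation-per-CPU layout, with `Box` and `Vec` standing in for the slab and boot bump allocators (the kernel uses neither), and a `Mutex` standing in for `SpinLock<T>`:

```rust
use std::sync::Mutex;

#[repr(align(64))]
struct Slot {
    #[allow(dead_code)]
    lock: Mutex<u64>, // stand-in for SpinLock<T>
}

fn main() {
    let num_cpus = 4; // in the kernel: num_possible_cpus() from firmware tables
    let slots: Vec<Box<Slot>> = (0..num_cpus)
        .map(|_| Box::new(Slot { lock: Mutex::new(0) }))
        .collect();

    // Each slot is its own allocation: addresses are distinct and
    // cache-line aligned, so mutable references into different slots are
    // derived from disjoint allocations rather than from different
    // indices of one shared array.
    for (i, a) in slots.iter().enumerate() {
        let addr = &**a as *const Slot as usize;
        assert_eq!(addr % 64, 0, "slot {} not cache-line aligned", i);
        for b in &slots[i + 1..] {
            assert_ne!(addr, &**b as *const Slot as usize);
        }
    }
}
```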
/// RCU in Rust: zero-lock read path, deferred reclamation.
/// Readers hold an RcuReadGuard (analogous to rcu_read_lock).
/// Writers swap the pointer atomically and defer freeing the old
/// value until all readers have exited their critical sections.
pub struct RcuCell<T: Send + Sync> {
ptr: AtomicPtr<T>,
}
impl<T: Send + Sync> RcuCell<T> {
/// Create a new RcuCell with an initial value. The value is heap-allocated
/// via `Box::try_new` (fallible) and the RcuCell takes ownership of the
/// raw pointer. Returns `Err(KernelError::OutOfMemory)` if allocation
/// fails. The pointer is always non-null after successful construction.
///
/// All RcuCell allocation is fallible. Callers must handle OutOfMemory.
pub fn new(value: T) -> Result<Self, KernelError> {
let ptr = Box::try_new(value).map_err(|_| KernelError::OutOfMemory)?;
Ok(Self {
ptr: AtomicPtr::new(Box::into_raw(ptr)),
})
}
pub fn read<'a>(&'a self, _guard: &'a RcuReadGuard) -> &'a T {
unsafe { &*self.ptr.load(Ordering::Acquire) }
}
/// Atomically replace the value. The old value is scheduled for deferred
/// freeing after all current RCU read-side critical sections complete
/// (grace period). The caller must NOT access the old value after this
/// call — RCU owns it and will free it asynchronously.
///
/// **Writer serialization**: Takes `&self` (not `&mut self`) plus a
/// `MutexGuard` proof token that demonstrates the caller holds the
/// write-side lock. This is the standard RCU pattern: readers are
/// lock-free via `read()`, writers serialize through an external mutex.
///
/// **Why `&self` instead of `&mut self`**: RCU cells are typically
/// stored in global/static data structures (routing tables, config
/// registries, module lists) accessed concurrently by many readers.
/// Requiring `&mut self` would force the caller to hold an exclusive
/// reference to the entire `RcuCell`, which is impractical for global
/// state — it would require wrapping the `RcuCell` in a `Mutex` or
/// `RwLock` that also blocks readers, defeating the purpose of RCU.
/// With `&self` + lock proof, the `RcuCell` can live in a shared
/// context (e.g., `static`, `Arc`, or behind `&`), readers access it
/// without any lock, and writers prove serialization by passing the
/// `MutexGuard` token. The `MutexGuard` lifetime ensures the lock is
/// held for the duration of the `update()` call but does not restrict
/// access to the `RcuCell` itself.
///
/// Concurrent writers without the lock would race on the swap: writer A
/// swaps old->new_A, writer B swaps new_A->new_B, then writer B defers
/// freeing new_A — which was just published and may have active readers.
/// The `MutexGuard` proof prevents this at compile time.
///
/// All RcuCell allocation is fallible. Returns `Err(KernelError::OutOfMemory)`
/// if allocation fails; the existing value is left unchanged in that case.
pub fn update(
&self,
new_value: T,
_writer_lock: &MutexGuard<'_, ()>,
) -> Result<(), KernelError> {
let new_box = Box::try_new(new_value).map_err(|_| KernelError::OutOfMemory)?;
let old = self.ptr.swap(Box::into_raw(new_box), Ordering::Release);
// Schedule old value for deferred freeing after grace period.
// rcu_defer_free takes ownership of the raw pointer and will
// reconstruct the Box and drop it after the grace period elapses.
// SAFETY: `old` was created by a previous `Box::into_raw()` call
// (either in `new()` or a prior `update()`). The writer lock
// guarantees no concurrent `update()` can swap the same pointer
// twice. After this call, the caller must not access `old`.
unsafe { rcu_defer_free(old) };
Ok(())
}
}
// Implementation note on MutexGuard proof tokens: The `_writer_lock` parameter
// ensures single-writer semantics at compile time. In the actual implementation,
// the `Mutex<()>` must be embedded in the same struct that contains the `RcuCell`,
// or in a per-instance wrapper, so that each RcuCell has its own dedicated mutex.
// Passing an unrelated mutex guard would compile but violate the invariant. This is
// enforced structurally: the kernel's RCU-protected data structures always pair their
// `RcuCell` and `Mutex` in the same struct (e.g., `struct RcuProtected<T> { cell:
// RcuCell<T>, writer_lock: Mutex<()> }`), and the `update()` call site acquires the
// co-located lock. This pattern is standard in Rust kernel design (similar to how
// Linux's `struct rcu_head` is always embedded in the protected struct).
// Note: `RcuCell<T>` does NOT block in its `Drop` implementation.
// Calling `rcu_synchronize()` (a blocking wait) from `Drop` would be
// unsafe in contexts where blocking is illegal: interrupt handlers, code
// executing under a spinlock, NMI handlers, or any other atomic context.
// Because `RcuCell` values can be dropped from any of these contexts
// (e.g., a global `RcuCell` freed during module unload while holding a lock),
// `Drop` uses `rcu_call()` (deferred callback) instead. The current pointer
// is enqueued for deferred freeing via the RCU callback mechanism; the actual
// `Box::drop` runs in the RCU grace period worker thread, which executes in
// a fully schedulable, non-atomic context. This matches the pattern used by
// `update()`, which also defers old-value freeing via `rcu_defer_free()`.
impl<T: Send + Sync> Drop for RcuCell<T> {
fn drop(&mut self) {
// SAFETY: We have &mut self (exclusive access). This guarantees no
// concurrent writers (update() takes &self + MutexGuard, but &mut self
// is incompatible with any shared reference). Readers may still hold
// &T references obtained via read(); rcu_call() defers the actual
// Box::drop until all pre-existing RCU read-side critical sections
// complete (the grace period), ensuring no live references to the
// pointed-to value remain before it is freed. The pointer was created
// by Box::into_raw() in new() or update() and has not yet been passed
// to rcu_defer_free() (only old values replaced by update() are
// deferred there). This Drop path covers only the *current* (final)
// value still held by the RcuCell at destruction time.
//
// Using rcu_call() (non-blocking enqueue) rather than rcu_synchronize()
// (blocking wait) is mandatory here: Drop can be invoked from atomic
// contexts (interrupt handlers, spinlock-held paths, etc.) where
// blocking would deadlock or corrupt kernel state.
let ptr = self.ptr.load(Ordering::Relaxed);
if !ptr.is_null() {
// SAFETY: ptr was created by Box::into_raw() and is non-null.
// rcu_call takes ownership of the raw pointer and will reconstruct
// the Box and drop it after the grace period elapses in the RCU
// callback worker thread.
unsafe { rcu_call(ptr) };
}
}
}
RCU Grace Period Detection:
The grace period mechanism determines when all pre-existing RCU read-side critical
sections have completed, making it safe to execute deferred callbacks (such as freeing
old values swapped out by RcuCell::update()).
ISLE uses per-CPU quiescent state tracking:
- Each CPU maintains a per-CPU quiescent state counter (`rcu_qs_count`), incremented each time the CPU passes through a quiescent point — a state where the CPU is guaranteed not to hold any RCU read-side references.
- Quiescent points occur at: (1) context switch (the outgoing task releases all `RcuReadGuard`s before being descheduled), (2) idle entry (a CPU entering the idle loop has no active critical sections), and (3) explicit `rcu_quiescent_state()` calls in long-running kernel loops that do not hold RCU references.
- A grace period begins when `rcu_synchronize()` or `rcu_defer_free()` is called. The RCU subsystem snapshots all per-CPU quiescent state counters at that moment.
- The grace period ends when every CPU has passed through at least one quiescent point since the snapshot (i.e., every CPU's counter has advanced). At that point, no CPU can still hold a reference to data that was visible before the grace period started.
- Once the grace period ends, all deferred callbacks registered before or during that grace period are executed (freeing old pointers, running destructors, etc.).
- Grace period detection is batched: multiple `rcu_defer_free()` calls are coalesced into the same grace period to amortize the per-CPU polling overhead.
- KABI boundary crossing constitutes an RCU quiescent state: every KABI vtable call entry and return is treated as a quiescent point for the calling CPU. This ensures that Tier 1 drivers that return from KABI calls within bounded time (enforced by the per-call timeout watchdog, Section 7.5.3) cannot block RCU grace period completion indefinitely. Drivers that perform long-polling loops must call `rcu_quiescent_state()` at each poll iteration, or use the KABI polling helper `kabi_poll_wait()`, which includes an implicit quiescent state.
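The snapshot-and-poll scheme above can be sketched as a small, self-contained simulation. The `GracePeriod` type and its method names are illustrative, not the kernel's actual API; real per-CPU counters live in the `PerCpu` storage of Section 10.1:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Illustrative model of per-CPU quiescent state tracking.
struct GracePeriod {
    qs_count: Vec<AtomicU64>, // one counter per CPU (sized at boot in the kernel)
}

impl GracePeriod {
    fn new(num_cpus: usize) -> Self {
        Self { qs_count: (0..num_cpus).map(|_| AtomicU64::new(0)).collect() }
    }

    /// Called by a CPU at a quiescent point (context switch, idle entry,
    /// or an explicit rcu_quiescent_state() call).
    fn quiescent_state(&self, cpu: usize) {
        self.qs_count[cpu].fetch_add(1, Ordering::Release);
    }

    /// Snapshot all counters at grace-period start.
    fn snapshot(&self) -> Vec<u64> {
        self.qs_count.iter().map(|c| c.load(Ordering::Acquire)).collect()
    }

    /// The grace period has ended once every CPU's counter has advanced
    /// past its snapshotted value — every CPU passed a quiescent point.
    fn has_ended(&self, snap: &[u64]) -> bool {
        self.qs_count.iter().zip(snap).all(|(c, &s)| c.load(Ordering::Acquire) > s)
    }
}

fn main() {
    let gp = GracePeriod::new(4);
    let snap = gp.snapshot();
    assert!(!gp.has_ended(&snap)); // no CPU has passed a quiescent point yet
    for cpu in 0..4 {
        gp.quiescent_state(cpu);
    }
    assert!(gp.has_ended(&snap)); // every CPU advanced: callbacks may now run
    println!("grace period complete");
}
```

In the kernel, `rcu_synchronize()` blocks polling `has_ended()`, while `rcu_defer_free()` enqueues its callback and lets the worker thread run it once the same condition holds.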
/// RCU read-side guard. Obtained via `rcu_read_lock()`, released via Drop.
///
/// This is a zero-cost marker type — it does not perform any atomic operations
/// or memory barriers on acquisition. The RCU read-side critical section is
/// purely a contract with the grace period detection mechanism: as long as any
/// CPU holds an `RcuReadGuard`, the current grace period cannot complete.
///
/// The guard is `!Send` because RCU read-side sections are per-CPU — the quiescent
/// state tracking (Section 10.1, "RCU Grace Period Detection") is CPU-local.
/// Sending an `RcuReadGuard` to another thread would allow the grace period
/// detection to miss an active reader.
///
/// # Example
/// ```rust
/// let guard = rcu_read_lock();
/// // Within this scope, any RCU-protected data can be safely read.
/// // The grace period will not complete until this guard is dropped.
/// let value = rcu_cell.read(&guard);
/// // guard dropped here; this CPU may now pass through a quiescent point
/// ```
pub struct RcuReadGuard {
/// CPU ID on which this guard was acquired. Used for debug assertions
/// and the KRL timeout mitigation (Section 22.6.1). Not used for
/// grace period tracking — that is handled by per-CPU quiescent state counters.
_cpu_id: u32,
/// Marker to prevent Send/Sync auto-traits.
_not_send: PhantomData<*const ()>,
}
impl Drop for RcuReadGuard {
fn drop(&mut self) {
// No explicit action needed. The quiescent state counter is incremented
// when the CPU passes through a quiescent point (context switch, idle,
// explicit rcu_quiescent_state() call), not when this guard is dropped.
// The guard's purpose is to prevent the compiler from reordering
// RCU-protected accesses outside the critical section.
}
}
impl !Send for RcuReadGuard {}
impl !Sync for RcuReadGuard {}
/// Acquire an RCU read-side critical section guard.
///
/// The returned guard prevents the current RCU grace period from completing
/// until it is dropped. This is a zero-cost operation — no atomic operations
/// or memory barriers are performed on entry.
///
/// # Safety invariants
/// - Must be paired with a `Drop` (RAII pattern — cannot be leaked).
/// - Must not be held across a blocking operation (sleep, mutex acquisition)
/// unless the holder is prepared for an extended grace period latency.
/// - The KRL timeout mitigation (Section 22.6.1) enforces a maximum critical
/// section duration for KRL access to prevent DoS.
pub fn rcu_read_lock() -> RcuReadGuard {
RcuReadGuard {
_cpu_id: cpu_id(),
_not_send: PhantomData,
}
}
/// A slice wrapper that can only be dereferenced within an RCU read-side
/// critical section. Used for RCU-protected arrays (e.g., KRL revoked_keys).
///
/// This is a zero-cost newtype around a raw pointer. The `Deref` implementation
/// is gated on an `RcuReadGuard` reference, ensuring that the pointed-to data
/// cannot be accessed after its backing memory is freed by the RCU callback.
///
/// # Type parameter
/// - `T`: The element type of the slice. Typically `[u8; 32]` for hash arrays.
///
/// # Safety contract
/// - The pointer must have been obtained from memory that will remain valid
/// until at least the next RCU grace period.
/// - The `RcuReadGuard` passed to `deref()` must have been acquired after the
/// last RCU update that could have freed this memory.
/// - Callers must not hold the `Deref` result across an RCU grace period
/// boundary (e.g., must not call `rcu_synchronize()` while holding a
/// reference derived from this slice).
///
/// # Example
/// ```rust
/// pub struct KeyRevocationList {
/// pub revoked_keys: RcuSlice<[u8; 32]>,
/// pub revoked_count: u32,
/// }
///
/// fn is_revoked(krl: &RcuCell<KeyRevocationList>, fingerprint: &[u8; 32]) -> bool {
/// let guard = rcu_read_lock();
/// let krl_ref = krl.read(&guard);
/// // Deref RcuSlice within the guard's lifetime
/// let keys: &[[u8; 32]] = krl_ref.revoked_keys.deref(&guard);
/// keys[..krl_ref.revoked_count as usize]
/// .binary_search(fingerprint)
/// .is_ok()
/// }
/// ```
#[repr(C)]
pub struct RcuSlice<T> {
ptr: *const T,
len: usize,
}
impl<T> RcuSlice<T> {
/// Create a new RcuSlice from a raw pointer and length.
///
/// # Safety
/// The caller must ensure that:
/// 1. The pointer is valid and properly aligned for type `T`.
/// 2. `ptr` points to `len` contiguous initialized elements of type `T`.
/// 3. The memory pointed to will remain valid until at least the next RCU
/// grace period after the last access via this slice.
pub unsafe fn from_raw(ptr: *const T, len: usize) -> Self {
Self { ptr, len }
}
}
impl<T> RcuSlice<T> {
/// Dereference the slice within an RCU read-side critical section.
///
/// The returned reference is valid only for the lifetime of the guard.
/// Accessing the reference after the guard is dropped is undefined behavior.
///
/// # Arguments
/// - `_guard`: A reference to an `RcuReadGuard`, proving that the caller
/// is within an RCU read-side critical section. The guard's lifetime
/// bounds the returned reference.
///
/// # Returns
/// A shared reference to the underlying slice of `T` elements.
/// For `RcuSlice<[u8; 32]>`, this returns `&[[u8; 32]]`.
pub fn deref<'a>(&self, _guard: &'a RcuReadGuard) -> &'a [T] {
// SAFETY: The caller has provided an RcuReadGuard, proving they are
// within an RCU read-side critical section. The memory pointed to by
// self.ptr was allocated by an RCU-protected update and will remain
// valid until at least the next grace period. Since the guard prevents
// grace period completion, the memory is valid for the guard's lifetime.
// The len field was set at construction time and is invariant.
unsafe { core::slice::from_raw_parts(self.ptr, self.len) }
}
}
// RcuSlice does NOT implement Send or Sync directly. It can only be accessed
// through an RcuReadGuard, which is itself !Send. The containing struct
// (e.g., KeyRevocationList) provides the necessary Send/Sync impls when
// accessed through RcuCell, which enforces the RCU lifetime contract.
/// Wait for an RCU grace period to complete (blocking).
///
/// This function blocks the calling thread until all pre-existing RCU
/// read-side critical sections have completed. Use this when you need
/// to free memory that may be referenced by RCU readers.
///
/// **MUST NOT be called from atomic context** (interrupt handler, spinlock-held,
/// preempt-disabled, or NMI). In atomic contexts, use `rcu_call()` instead.
///
/// # Implementation
/// On a busy system with many CPUs, this may block for milliseconds.
/// The implementation polls per-CPU quiescent state counters until all
/// CPUs have passed through at least one quiescent point.
pub fn rcu_synchronize() {
// Implementation details in isle-core/src/rcu.rs
// Records the current grace period epoch, then blocks until all CPUs
// have advanced their quiescent state counters past that epoch.
}
/// Defer a callback to run after an RCU grace period (non-blocking).
///
/// This is the preferred way to free memory from atomic contexts. The callback
/// will be invoked in the RCU worker thread context after all pre-existing
/// readers have completed.
///
/// # Safety
/// - The callback must not access any data that was freed before the callback runs.
/// - The callback runs in a kernel thread context (not interrupt context), so
/// it may block, but should complete quickly to avoid delaying other callbacks.
///
/// # Example
/// ```rust
/// // Defer Box::drop after grace period
/// let ptr = Box::into_raw(old_value);
/// unsafe { rcu_call(ptr) }; // callback will reconstruct Box and drop it
/// ```
pub unsafe fn rcu_call<T>(ptr: *mut T) {
// Enqueue the callback on the per-CPU RCU callback list.
// The RCU worker thread will execute it after the grace period.
}
/// Defer Box::drop after an RCU grace period (non-blocking).
///
/// This is a convenience wrapper around `rcu_call()` for the common case
/// of freeing a Box<T> after an RCU grace period.
///
/// # Safety
/// Same as `rcu_call()` — ptr must have been obtained from `Box::into_raw()`.
pub unsafe fn rcu_defer_free<T>(ptr: *mut T) {
rcu_call(ptr);
}
RCU Slice Lifetime Safety Example (KRL):
The KeyRevocationList in Section 22.6 demonstrates correct RcuSlice usage:
// Boot-time KRL allocation (bump-allocated, 'static lifetime):
let boot_krl = KeyRevocationList {
revoked_keys: unsafe { RcuSlice::from_raw(boot_keys_ptr, boot_count as usize) }, // lives forever
revoked_count: boot_count,
// ... other fields
};
// Runtime KRL allocation (slab-allocated, RCU-managed):
let rt_keys = Box::try_new_in_slice(count, slab_allocator)?;
let rt_krl = Box::try_new(KeyRevocationList {
revoked_keys: unsafe { RcuSlice::from_raw(Box::as_slice_ptr(rt_keys), count as usize) },
revoked_count: count,
// ... other fields
})?;
// rt_krl is published via RcuCell::update(). Old KRL (if any) is freed
// by the RCU callback after the grace period, including its revoked_keys array.
/// Compile-time lock ordering: deadlock prevention via the type system.
/// A Lock<T, 3> can only be acquired while holding a Lock<_, N> where N < 3.
/// Attempting to acquire locks out of order is a compile-time error.
pub struct Lock<T, const LEVEL: u32> {
inner: SpinLock<T>, // or MutexLock<T> for sleeping locks
}
impl<T, const LEVEL: u32> Lock<T, LEVEL> {
pub fn lock<const HELD: u32>(&self, _proof: &LockGuard<HELD>) -> LockGuard<LEVEL>
where
Assert<{ HELD < LEVEL }>: IsTrue,
{
LockGuard::new(self.inner.lock())
}
/// Acquiring the first lock in a chain requires no proof.
/// Any level can be the starting lock — the constraint is that no other
/// lock is currently held. The returned `LockGuard<LEVEL>` then constrains
/// all subsequent lock acquisitions to levels strictly greater than LEVEL.
///
/// **Enforcement**: The type system alone cannot prevent a caller from
/// invoking `lock_first()` while already holding a `LockGuard` from a
/// different scope. Therefore, `lock_first()` performs a **runtime check**
/// in all build configurations: it reads the per-CPU `max_held_level`
/// variable and panics if any lock is currently held. This is a cheap
/// check (one per-CPU variable read + branch, ~2-3 cycles) and is always
/// enabled — not only in debug mode. The `lock()` method's compile-time
/// ordering guarantee (via `HELD < LEVEL`) remains the primary enforcement;
/// the runtime check in `lock_first()` closes the only loophole.
///
/// **Cross-session ABBA prevention**: Lock ordering is enforced by a global
/// total order defined at compile time via the `LEVEL` const type parameter.
/// Two sessions acquiring lock A (level 2) then lock B (level 5) always
/// acquire in the same order because the type system enforces `HELD < LEVEL`
/// at every step.
pub fn lock_first(&self) -> LockGuard<LEVEL> {
assert_no_locks_held(); // per-CPU max_held_level == NONE
LockGuard::new(self.inner.lock())
}
}
Known limitation — total ordering only: The const-generic approach enforces a
total order on lock levels (level 0 < level 1 < level 2 < ...). This prevents
deadlocks caused by circular lock chains, but it cannot express a partial order
where two locks at the same conceptual level are safe to acquire together (because
they protect independent subsystems). In practice, this means some subsystems that
Linux allows to lock "in parallel" must be assigned adjacent but distinct levels in
ISLE, potentially over-constraining the lock graph. If this becomes a scalability
issue, the fallback is to introduce a lock_independent() API that takes two locks
at the same level with a static proof that their domains are disjoint (e.g., per-CPU
locks on different CPUs). For now, the total-order approach covers all known subsystem
interactions, and runtime lockdep (debug-mode only) validates that no partial-order
case is missed.
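The always-on runtime check in `lock_first()` can be approximated in stable Rust as follows. Because the `Assert<{ HELD < LEVEL }>` bound requires unstable const-generic expressions, this sketch enforces the ordering at runtime for both paths; the per-CPU `max_held_level` variable is modeled with a thread-local, and all names here are illustrative:

```rust
use std::cell::Cell;
use std::sync::Mutex;

thread_local! {
    // Stand-in for the per-CPU max_held_level variable.
    static MAX_HELD_LEVEL: Cell<Option<u32>> = Cell::new(None);
}

struct Lock<T, const LEVEL: u32> {
    inner: Mutex<T>,
}

impl<T, const LEVEL: u32> Lock<T, LEVEL> {
    fn new(value: T) -> Self {
        Self { inner: Mutex::new(value) }
    }

    /// First lock in a chain: panics if any lock is already held
    /// (the runtime check described above, enabled in all builds).
    fn lock_first(&self) -> std::sync::MutexGuard<'_, T> {
        MAX_HELD_LEVEL.with(|l| {
            assert!(l.get().is_none(), "lock_first() called with a lock held");
            l.set(Some(LEVEL));
        });
        self.inner.lock().unwrap()
    }

    /// Subsequent lock: LEVEL must be strictly greater than the highest
    /// level currently held. (The real kernel rejects violations at
    /// compile time via the HELD < LEVEL bound; this sketch panics.)
    fn lock(&self) -> std::sync::MutexGuard<'_, T> {
        MAX_HELD_LEVEL.with(|l| {
            let held = l.get().expect("lock() requires a lock already held");
            assert!(held < LEVEL, "lock ordering violation: {} >= {}", held, LEVEL);
            l.set(Some(LEVEL));
        });
        self.inner.lock().unwrap()
    }
}

fn main() {
    let a: Lock<u32, 2> = Lock::new(1);
    let b: Lock<u32, 5> = Lock::new(2);
    let ga = a.lock_first(); // level 2 first: OK, nothing held
    let gb = b.lock();       // level 5 after 2: OK, 2 < 5
    println!("sum = {}", *ga + *gb);
    drop(gb);
    drop(ga);
    // This sketch omits guard Drop tracking, so reset manually.
    MAX_HELD_LEVEL.with(|l| l.set(None));
}
```

Acquiring `b.lock_first()` while `ga` is live would panic here, mirroring the kernel's runtime check; acquiring a level-2 lock after a level-5 lock would likewise panic where the real design fails to compile.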
10.2 Locking Strategy
ISLE eliminates all "big kernel lock" patterns that plague Linux scalability:
| Linux Problem | ISLE Solution |
|---|---|
| RTNL global mutex | Per-table RCU + per-route fine-grained locks |
| `dcache_lock` contention | Per-directory RCU + per-inode locks |
| `zone->lock` on NUMA | Per-CPU page lists + per-NUMA-node pools |
| `tasklist_lock` | RCU-protected process table + per-PID locks |
| `files_lock` (file table) | Per-fdtable RCU + per-fd locks |
| `inode_hash_lock` | Per-bucket RCU-protected hash chains |
10.3 Lock-Free Data Structures
Where possible, lock-free data structures replace locked ones:
- MPSC ring buffers: For all cross-domain communication (io_uring-style)
- RCU-protected radix trees: For page cache lookups
- Per-CPU freelists: For slab allocation (no cross-CPU contention)
- Atomic reference counts: For capability tokens and shared objects
- Sequence locks (seqlock): For rarely-written, frequently-read data (e.g., system time, mount table snapshot)
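The seqlock protocol listed above (used for system time and mount table snapshots) can be sketched as follows. This is a minimal single-writer illustration; the kernel's seqlock additionally serializes writers with a spinlock, and the field layout here is invented for the example:

```rust
use std::sync::atomic::{fence, AtomicU64, Ordering};

/// Minimal seqlock sketch protecting a two-word value.
/// An odd sequence number means a write is in progress.
struct SeqLock {
    seq: AtomicU64,
    data: [AtomicU64; 2],
}

impl SeqLock {
    /// Writer (single writer assumed): bump seq to odd, write, bump to even.
    fn write(&self, a: u64, b: u64) {
        self.seq.fetch_add(1, Ordering::Release); // seq becomes odd
        self.data[0].store(a, Ordering::Relaxed);
        self.data[1].store(b, Ordering::Relaxed);
        self.seq.fetch_add(1, Ordering::Release); // seq becomes even again
    }

    /// Reader: retry until a stable, even sequence number brackets the reads.
    fn read(&self) -> (u64, u64) {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 & 1 != 0 {
                continue; // writer active: retry
            }
            let a = self.data[0].load(Ordering::Relaxed);
            let b = self.data[1].load(Ordering::Relaxed);
            fence(Ordering::Acquire); // order data loads before the re-check
            if self.seq.load(Ordering::Relaxed) == s1 {
                return (a, b); // no writer intervened: values are consistent
            }
        }
    }
}

fn main() {
    let sl = SeqLock {
        seq: AtomicU64::new(0),
        data: [AtomicU64::new(0), AtomicU64::new(0)],
    };
    sl.write(123, 456);
    assert_eq!(sl.read(), (123, 456));
    println!("seqlock read ok");
}
```

Readers never write shared state, which is why seqlocks suit rarely-written, frequently-read data: read cost is two loads plus a sequence re-check, with no cache-line ping-pong between readers.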
10.3a Scalability Analysis: Hot-Path Metadata on 256+ Cores
Moving from a monolithic kernel to a hybrid one can introduce new contention bottlenecks in the "core" layer — specifically in metadata structures that every I/O operation must touch. This section analyzes ISLE's three highest-contention metadata subsystems on many-core machines (256-512 cores, 4-8 NUMA nodes).
1. Capability Table (Section 11)
Every syscall and driver invocation performs a capability check. On a 512-core machine running 10,000 concurrent processes, the capability table sees millions of lookups per second.
Design for contention:
- Per-process capability tables: each process has its own capability table (a small array indexed by capability handle, typically <256 entries). Lookups are process-local with no cross-process contention. Two processes on different CPUs never touch the same capability table.
- Capability creation/delegation (write path): uses a per-process lock. Only contends when the same process creates capabilities from multiple threads simultaneously (rare — capability creation is a cold path).
- Capability revocation: RCU-deferred. Revoking a capability marks it as invalid (atomic store), then defers memory reclamation to an RCU grace period. No lock on the read path.
Contention profile: none on read path (per-process, indexed by local handle). Comparable to Linux's file descriptor table — per-process, never a global bottleneck.
2. Unified Object Namespace / islefs (Section 41)
The object namespace provides /sys/kernel/isle/ introspection. On a busy system,
monitoring tools (isle-top, prometheus-exporter) may read thousands of objects
per second.
Design for contention:
- Registry reads: the device registry (Section 7) is RCU-protected. Reads are
lock-free. Enumeration snapshots use a seqlock to detect concurrent modifications.
- Attribute reads (sysfs-style): most attributes are backed by atomic counters
or per-CPU counters that are aggregated on read. Example: a NIC's rx_packets
counter is per-CPU; reading it sums all per-CPU values. No lock on the write path
(per-CPU increment), brief aggregation on the read path.
- Object creation/destruction (hot-plug, process exit): per-subsystem lock, not
global. Creating a network device takes the network subsystem lock; creating a block
device takes the block subsystem lock. No global namespace lock.
- Pathological case: a monitoring tool reading every object attribute on every CPU
at 1Hz on a 512-core, 10,000-device system. The aggregation overhead is proportional
to num_cpus × num_objects — at 512 × 10,000, this is ~5M atomic reads per scan.
At ~5ns each, one scan takes ~25ms. This is acceptable for 1Hz monitoring; for
higher-frequency monitoring, tools should read only the objects they need.
Contention profile: lock-free reads (RCU + atomics), per-subsystem writes. Bottleneck only if monitoring tools read aggressively (configurable rate limiting).
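The per-CPU counter pattern described above (lock-free increments, aggregation on read) can be sketched as follows. The `PerCpuCounter` type is illustrative; in the kernel the slots live in the dynamically sized `PerCpu` storage of Section 10.1:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Per-CPU statistical counter: writers touch only their own slot
/// (no cross-CPU cache-line contention); readers sum all slots.
struct PerCpuCounter {
    slots: Vec<AtomicU64>, // one slot per CPU, sized at boot in the kernel
}

impl PerCpuCounter {
    fn new(num_cpus: usize) -> Self {
        Self { slots: (0..num_cpus).map(|_| AtomicU64::new(0)).collect() }
    }

    /// Hot path: lock-free increment of the local CPU's slot.
    /// Relaxed ordering is permitted here — pure statistics (Section 10.5).
    fn add(&self, cpu: usize, n: u64) {
        self.slots[cpu].fetch_add(n, Ordering::Relaxed);
    }

    /// Cold path (attribute read): aggregate all per-CPU values.
    fn read(&self) -> u64 {
        self.slots.iter().map(|s| s.load(Ordering::Relaxed)).sum()
    }
}

fn main() {
    let rx_packets = PerCpuCounter::new(4);
    rx_packets.add(0, 10); // CPU 0 receives 10 packets
    rx_packets.add(3, 32); // CPU 3 receives 32 packets
    assert_eq!(rx_packets.read(), 42);
    println!("rx_packets = {}", rx_packets.read());
}
```

This is the source of the ~5M-atomic-reads-per-scan figure above: each attribute read costs `num_cpus` relaxed loads, while the increment path never leaves the local CPU's cache line.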
3. Page Cache and Memory Management (Section 12)
The page cache is the single highest-contention structure in any kernel. Every file read, every mmap fault, every page writeback touches it.
Design for contention:
- Page cache lookups: RCU-protected radix tree, per-inode (same as Linux's xarray).
Two threads faulting pages from different files never contend. Two threads faulting
different pages from the same file contend only at the xarray level (fine-grained,
per-node locking within the radix tree).
- LRU lists: per-NUMA-node LRU lists with per-CPU pagevec batching (same design as
Linux). Pages are added to a per-CPU buffer; the buffer is drained to the LRU list
in batches of 15 pages. This amortizes the per-NUMA LRU lock to 1 acquisition per
15 page faults.
- Buddy allocator: per-NUMA-node lock, with per-CPU page caches absorbing >95%
of allocations (Section 12.2). On a 512-core, 8-NUMA-node system, each buddy
allocator sees ~1/8 of the allocation traffic, and per-CPU caches absorb most of it.
- Known scaling limit: extreme mmap/munmap churn (thousands of VMAs per second
per process) contends on the per-process VMA lock. ISLE uses the same maple tree
(lockless reads, locked writes) as Linux 6.1+. For workloads that create/destroy
VMAs at extreme rates (JVMs, some databases), this is a known bottleneck in all
current kernels. ISLE does not claim to solve it — the contention is fundamental to
the VMA data structure, not to the hybrid architecture.
Contention profile: per-inode, per-NUMA, per-CPU on all hot paths. No new bottlenecks introduced by the hybrid architecture — ISLE's page cache follows the same design as Linux (which already runs on 512+ core machines).
Summary — the hybrid architecture's overhead (isolation domain switches) is orthogonal to metadata contention. The "core" layer's metadata structures use the same per-CPU, per- NUMA, RCU-based techniques that make Linux scale to 256+ cores. No new global locks are introduced. The capability system is per-process (no shared structure), the object namespace is RCU-read / per-subsystem-write, and the page cache follows Linux's proven xarray + pagevec design.
10.4 Interrupt Handling
- All device interrupts are routed to threaded interrupt handlers (same as Linux IRQF_THREADED).
- Top-half handlers are kept minimal: acknowledge interrupt, wake thread.
- This ensures all interrupt processing is preemptible and schedulable.
- MSI/MSI-X is preferred for all PCI devices (per-queue interrupts, no sharing).
- High-rate interrupt paths (100GbE at line-rate, NVMe at millions of IOPS): the threaded-handler model introduces scheduling latency (~1-5μs per interrupt) that could limit throughput if each packet triggered a separate interrupt. ISLE addresses this the same way Linux does — NAPI-style polling: the first interrupt wakes the thread, the thread then polls the device ring in a busy loop until the ring is drained, then re-enables interrupts. For 100GbE, this means the thread is woken once per batch of 64-256 packets, not once per packet. Combined with MSI-X per-queue affinity (one interrupt thread per CPU, one queue per CPU), the scheduling overhead is amortized to <100ns per packet, which is within the performance budget (Section 4).
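The interrupt amortization described above can be illustrated with a small simulation: the first interrupt masks further interrupts and wakes the poll path, which drains the whole ring before re-enabling. The `Nic` type and its fields are invented for the example, and the poll runs inline rather than in a woken thread:

```rust
use std::collections::VecDeque;

/// NAPI-style polling sketch: one wakeup drains a whole batch of packets,
/// so per-packet scheduling overhead is amortized away (illustrative only).
struct Nic {
    ring: VecDeque<u32>, // received packets awaiting processing
    irq_enabled: bool,
    wakeups: u32,        // how many times the handler thread was woken
}

impl Nic {
    /// Top half: acknowledge, mask the interrupt, wake the poll thread.
    fn interrupt(&mut self) {
        if self.irq_enabled {
            self.irq_enabled = false;
            self.wakeups += 1;
            self.poll(); // in the kernel this runs in the woken thread
        }
        // Interrupts arriving while masked are absorbed: the poll loop
        // will pick up their packets from the ring anyway.
    }

    /// Poll loop: drain the ring, then re-enable interrupts.
    fn poll(&mut self) -> usize {
        let mut processed = 0;
        while let Some(_pkt) = self.ring.pop_front() {
            processed += 1;
        }
        self.irq_enabled = true;
        processed
    }
}

fn main() {
    let mut nic = Nic { ring: (0..64u32).collect(), irq_enabled: true, wakeups: 0 };
    nic.interrupt();            // the first packet's interrupt wakes the thread...
    assert_eq!(nic.wakeups, 1); // ...which drains all 64 packets in one wakeup
    assert!(nic.ring.is_empty());
    println!("64 packets, {} wakeup(s)", nic.wakeups);
}
```

With one queue (and one such poll loop) per CPU via MSI-X affinity, the per-wakeup cost divides across the batch, which is how the <100ns-per-packet figure is reached.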
10.5 Cross-Architecture Memory Ordering Hazards
Critical Implementation Warning: ISLE relies extensively on lock-free concurrency, asynchronous ring buffers, and RCU to meet its strict performance budget. However, developers must strictly decouple their mental model of concurrency from x86 hardware behavior.
The x86 TSO Trap:
x86_64 implements Total Store Ordering (TSO), a very strong memory model. On x86, loads are not reordered with other loads, and stores are not reordered with other stores. Consequently, lock-free code that omits explicit memory barriers (or uses Ordering::Relaxed incorrectly) will often execute perfectly and pass all tests on x86 hardware.
Weak Memory Models (ARM, RISC-V, PowerPC):
AArch64, ARMv7, RISC-V, and PowerPC implement Weak Memory Models. The CPU is free to aggressively reorder independent memory reads and writes. A missing memory barrier on these architectures will cause sequence locks to read torn data, ring buffer consumers to see the published pointer advance before the actual payload data is visible, and RCU readers to dereference pointers to uninitialized memory.
Implementation Mandates: To ensure lock-free algorithms are safe across all target architectures, the following rules apply:
- Explicit Rust Atomics: All shared memory synchronization must use Rust's `core::sync::atomic` types with mathematically correct memory orderings (`Acquire`, `Release`, `AcqRel`, `SeqCst`). Never rely on implicit hardware ordering.
- Release-Acquire Semantics: The standard pattern for lock-free publishing in ISLE (e.g., advancing a ring buffer head, or updating an RCU pointer) MUST pair an `Ordering::Release` store on the producer with an `Ordering::Acquire` load on the consumer.
- No `Relaxed` in Control Flow: `Ordering::Relaxed` may only be used for pure statistical counters (e.g., `rx_packets`). It must never be used to synchronize visibility of other data.
- Mandatory Multi-Arch CI: Lock-free primitives (MPSC queues, RCU, seqlocks) must be subjected to heavy stress-testing natively on AArch64 and RISC-V hardware (or QEMU/emulators with memory-reordering fuzzing enabled). x86_64 test passes are considered insufficient to prove the correctness of lock-free code.
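The mandated Release/Acquire publishing pattern can be shown in a minimal two-thread example (written against `std` for testability; kernel code would use the identical `core::sync::atomic` API):

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Producer writes the payload, then publishes with a Release store;
/// the consumer spins on an Acquire load. The Release/Acquire pairing
/// guarantees the payload write is visible once the flag is observed —
/// on x86 TSO *and* on weakly ordered AArch64/RISC-V/PowerPC.
fn publish_and_receive() -> u64 {
    let payload = Arc::new(AtomicU64::new(0));
    let published = Arc::new(AtomicBool::new(false));

    let producer = {
        let (payload, published) = (payload.clone(), published.clone());
        thread::spawn(move || {
            payload.store(42, Ordering::Relaxed);     // 1. write the data
            published.store(true, Ordering::Release); // 2. then publish
        })
    };

    while !published.load(Ordering::Acquire) {        // pairs with the Release
        std::hint::spin_loop();
    }
    let v = payload.load(Ordering::Relaxed);          // guaranteed to see 42
    producer.join().unwrap();
    v
}

fn main() {
    assert_eq!(publish_and_receive(), 42);
    println!("release-acquire publish observed payload");
}
```

If the `Release`/`Acquire` pair were downgraded to `Relaxed`, this code would still pass on x86 (the TSO trap described above) while permitting the consumer on a weakly ordered CPU to observe `published == true` before the payload store — exactly the class of bug the multi-arch CI mandate targets.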
11. Security Architecture
11.1 Capability-Based Foundation
Every resource in ISLE is accessed through unforgeable capability tokens. This is the native security model -- not a bolt-on.
pub struct Capability {
/// Unique identifier of the target object
pub object_id: ObjectId,
/// Bitfield of permitted operations
pub permissions: PermissionBits,
/// Monotonically increasing generation counter for revocation.
/// When an object's generation advances, all capabilities with
/// older generations become invalid.
pub generation: u64,
/// Additional constraints on this capability
pub constraints: CapConstraints,
}
pub struct CapConstraints {
/// Capability expires after this timestamp (0 = no expiry)
pub expires_at: u64,
/// Can this capability be delegated to other processes?
pub delegatable: bool,
/// Maximum delegation depth (0 = no further delegation)
pub max_delegation_depth: u32,
/// Restrict to specific CPU set. Uses the same `CpuMask` type as the
/// scheduler (Section 14) and NUMA topology (Section 12.7). `CpuMask`
/// is a dynamically-sized bitmask allocated at boot to fit the actual
/// CPU count discovered from firmware. A default-constructed (all-zeros)
/// mask means "any CPU" (no affinity constraint).
pub cpu_affinity_mask: CpuMask,
}
Permission Bits:
The PermissionBits type defines fine-grained access rights that can be granted on
any capability. Different object types interpret these bits differently, but the
base set is universal:
bitflags! {
/// Fine-grained permission bits that can be set on any capability.
/// These are orthogonal to SystemCaps (administrative capabilities) —
/// PermissionBits control what operations a specific capability permits
/// on its target object, while SystemCaps control what system-wide
/// operations a process can perform.
pub struct PermissionBits: u32 {
/// READ: Read access to the target object's data or state.
/// For memory objects: read data.
/// For files: read file contents.
/// For processes: read registers, memory, status.
const READ = 1 << 0;
/// WRITE: Write access to the target object's data or state.
/// For memory objects: write data.
/// For files: write file contents, append.
/// For processes: modify registers, memory.
const WRITE = 1 << 1;
/// EXECUTE: Execute access or control flow modification.
/// For memory objects: execute code from this region.
/// For files: execute as a program.
/// For processes: single-step, continue execution.
const EXECUTE = 1 << 2;
/// DEBUG: Debugging access to the target object.
/// For processes: attach via ptrace, inspect/modify state.
/// For /proc entries: access to private (non-public) fields.
/// Implies READ unless explicitly excluded. See Section 40a.1.
const DEBUG = 1 << 3;
/// SYSCALL_TRACE: Trace syscall entry/exit for a process.
/// Used with PTRACE_SYSCALL. Requires DEBUG to also be set.
/// See Section 40a.1.
const SYSCALL_TRACE = 1 << 4;
/// DELEGATE: Delegate this capability to another process.
/// Subject to CapConstraints::delegatable and max_delegation_depth.
const DELEGATE = 1 << 5;
/// ADMIN: Administrative access to the target object.
/// Object-specific meaning: for devices, may allow reset;
/// for filesystems, may allow unmount; etc.
const ADMIN = 1 << 6;
/// MAP_READ: Map the object read-only into address space.
/// Used for memory objects, file-backed mappings, DMA buffers.
const MAP_READ = 1 << 7;
/// MAP_WRITE: Map the object read-write into address space.
const MAP_WRITE = 1 << 8;
/// MAP_EXECUTE: Map the object with execute permission.
const MAP_EXECUTE = 1 << 9;
/// KERNEL_READ: Read kernel-side data associated with a userspace object.
/// For processes: read kernel stack trace (/proc/pid/stack).
/// Requires CAP_DEBUG on the target in addition to this bit.
/// See Section 40a.5.
const KERNEL_READ = 1 << 10;
}
}
Permission composition: Multiple bits can be combined. For example, a debugging
capability on a process might have DEBUG | READ | WRITE | SYSCALL_TRACE to allow
full ptrace-style debugging including syscall interception.
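Permission checks over composed bits reduce to a mask test. A plain-`u32` sketch (the kernel uses the `bitflags!` definition above; the constants here mirror its values, and `has_all` is an illustrative helper, not the kernel's API):

```rust
// Bit values copied from the PermissionBits definition in Section 11.1.
const READ: u32 = 1 << 0;
const WRITE: u32 = 1 << 1;
const DEBUG: u32 = 1 << 3;
const SYSCALL_TRACE: u32 = 1 << 4;

/// A capability grants an operation only if every required bit is set.
fn has_all(perms: u32, required: u32) -> bool {
    perms & required == required
}

fn main() {
    // The full ptrace-style debugging capability from the text:
    let debug_cap = DEBUG | READ | WRITE | SYSCALL_TRACE;

    assert!(has_all(debug_cap, DEBUG | SYSCALL_TRACE)); // syscall interception OK
    assert!(!has_all(READ | WRITE, DEBUG));             // plain r/w cannot ptrace
    println!("debug_cap = {:#b}", debug_cap);
}
```

Note that SYSCALL_TRACE is only meaningful alongside DEBUG (per the definition above); a validation layer would reject a capability carrying SYSCALL_TRACE without DEBUG.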
Capabilities are stored in kernel-managed capability tables, indexed by per-process capability handles. User space never sees raw capability data -- only opaque handles.
Capability Revocation Semantics:
ISLE supports two revocation mechanisms, chosen per-object-type:
- Generation-based (for bulk invalidation): Each object has its own monotonic generation counter (`object.generation`). Each capability records the generation of the object at the time the capability was created (`cap.generation`). Validation checks `cap.generation == object.generation` (exact match). Revoking all capabilities for an object increments `object.generation`; all existing capabilities now have a stale generation and fail validation. O(1), no table scanning.
Slot persistence: The generation counter lives in the kernel's object
registry slot, not in the object itself. When an object is freed, its registry
slot retains the last generation value. When a new object is allocated in the same
slot, the slot's generation is incremented before the new object becomes visible.
This prevents ABA vulnerabilities: an old capability with generation == N cannot
validate against a new object in the same slot at generation == N+1. Object IDs
are slot indices — they may be reused, but the generation makes each (ObjectId,
generation) pair unique over time.
Trade-off: this is per-object all-or-nothing. Incrementing the generation for one
PhysMemory object invalidates only capabilities pointing to that object — it
does NOT affect capabilities for other PhysMemory objects (they have independent
generation counters). If a single object needs fine-grained revocation (revoke one
capability without affecting others for the same object), use indirection-based
revocation instead.
- Indirection-based (for fine-grained control): Capabilities point to an indirection entry in a per-object revocation table. Revoking sets the entry to "revoked." Allows individual revocation without affecting other capabilities for the same object. Cost: one extra pointer dereference per validation (~2-3ns).
Synchronization: Indirection entries are RCU-protected. The validate path runs
inside an RCU read-side critical section: it dereferences the indirection pointer,
checks the "revoked" flag, and proceeds — all under rcu_read_lock(). The revoke
path sets the entry to "revoked" (atomic store), then defers entry reclamation to an
RCU grace period via rcu_call(). This closes the TOCTOU window: a thread that has
already dereferenced the indirection entry and sees "not revoked" is guaranteed to be
within an RCU read-side critical section, so the entry cannot be freed until that
thread exits the critical section. The indirection entry itself is never freed while
any reader may hold a reference — only after the RCU grace period completes.
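The indirection mechanism can be sketched in userspace Rust. This is a minimal model, not the isle-core implementation: `Arc` stands in for the RCU-protected pointer (the refcount keeps the entry alive where the kernel would rely on a grace period), and the type names (`IndirectionEntry`, `IndirectCap`) are illustrative.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Indirection entry between a capability and its object. In-kernel this
/// lives in the per-object revocation table and is RCU-protected; here
/// `Arc` models the "cannot be freed while a reader holds it" guarantee.
pub struct IndirectionEntry {
    pub revoked: AtomicBool,
    pub object_id: u64,
}

/// A capability using indirection-based revocation (illustrative type).
pub struct IndirectCap {
    pub entry: Arc<IndirectionEntry>,
}

impl IndirectCap {
    /// Validate path: one extra pointer dereference plus a flag check.
    /// In-kernel this runs inside an RCU read-side critical section.
    pub fn validate(&self) -> Result<u64, &'static str> {
        if self.entry.revoked.load(Ordering::Acquire) {
            Err("capability revoked")
        } else {
            Ok(self.entry.object_id)
        }
    }
}

/// Revoke one capability without touching other caps to the same object:
/// flip the flag; entry reclamation is deferred (rcu_call in-kernel).
pub fn revoke(entry: &IndirectionEntry) {
    entry.revoked.store(true, Ordering::Release);
}
```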
| Object type | Revocation model | Rationale |
|---|---|---|
| PhysMemory | Generation (per-object) | Bulk revocation on free — all caps for this region die |
| DeviceIo | Generation (per-object) | Device reset invalidates all handles for this device |
| FileDescriptor | Indirection | Individual fd close must not affect other fds to the same inode |
| IpcChannel | Indirection | Can revoke one endpoint independently |
| Process | Generation (per-object) | Process exit invalidates all handles to this process |
Validation rule:
`is_valid()` in `isle-core/src/cap/mod.rs` must use exact match: `self.generation == object.generation`. A capability is valid only when its generation matches the object's current generation exactly. Less-than-or-equal (`<=`) is incorrect because it would allow stale capabilities from earlier generations to pass validation — a capability from generation 3 must not be valid when the object is at generation 5.
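A minimal sketch of the generation scheme and the exact-match rule, using illustrative types (`ObjectSlot` and `Capability` here are not the actual isle-core definitions):

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};

/// Registry slot for a kernel object (illustrative type). The generation
/// counter belongs to the slot, not the object, so it survives
/// free/realloc cycles and defeats ABA-style handle reuse.
pub struct ObjectSlot {
    pub generation: AtomicU64,
    pub live: AtomicBool,
}

impl ObjectSlot {
    /// Revoke every capability to this object: one atomic increment,
    /// no table scan. All caps holding the old generation now fail.
    pub fn revoke_all(&self) {
        self.generation.fetch_add(1, Ordering::Release);
    }
}

/// A capability records the slot index and a generation snapshot taken
/// at creation time.
pub struct Capability {
    pub slot: usize,
    pub generation: u64,
}

/// Exact-match validation. `<=` would wrongly accept a generation-3
/// capability against a generation-5 object.
pub fn is_valid(cap: &Capability, slots: &[ObjectSlot]) -> bool {
    let slot = &slots[cap.slot];
    slot.live.load(Ordering::Acquire)
        && cap.generation == slot.generation.load(Ordering::Acquire)
}
```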
11.1.1 Capability Table Lifecycle and Garbage Collection
Per-process capability tables (CapSpace) are small arrays indexed by local handle (typically fewer than 256 live entries per process, though the configurable maximum is much higher). Lifecycle:
- Allocation: Capability table entries are allocated from the process's `CapSpace` when a capability is created (e.g., `open()` → new `FileDescriptor` capability) or received via IPC delegation. Each entry is reference-counted: the `CapEntry` holds a strong reference to the underlying kernel object.
- Deallocation: When a capability handle is explicitly closed (e.g., `close(fd)`), the entry is removed from the `CapSpace`, and the kernel object's reference count is decremented. If the reference count reaches zero, the kernel object is freed.
- Process exit: On process exit (`do_exit`), the kernel iterates the process's `CapSpace` and drops every entry. For generation-based objects, this decrements the reference count (the generation counter is untouched — other processes' capabilities remain valid if the object is still alive). For indirection-based objects, the indirection entry is marked "revoked" and scheduled for RCU-deferred reclamation.
- Table exhaustion prevention: `CapSpace` has a configurable per-process maximum (default: 65536 entries, matching Linux's `RLIMIT_NOFILE` default hard limit). Attempts to allocate beyond this limit fail with `-EMFILE`. The system-wide total of live capability entries is bounded by the slab allocator's memory pressure feedback — under memory pressure, capability creation fails with `-ENOMEM` like any other kernel allocation. There is no unbounded capability table growth.
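The lifecycle above can be sketched with a handle table that reuses the lowest free slot, fd-style. `Arc` models the strong reference each entry holds; the structure and error values are illustrative, not the isle-core types.

```rust
use std::sync::Arc;

pub const EMFILE: i32 = 24;
pub const EBADF: i32 = 9;

/// Per-process capability table: a handle-indexed array of refcounted
/// entries with a hard per-process maximum.
pub struct CapSpace<T> {
    pub entries: Vec<Option<Arc<T>>>,
    pub max_entries: usize, // default 65536 in the text above
}

impl<T> CapSpace<T> {
    pub fn new(max_entries: usize) -> Self {
        Self { entries: Vec::new(), max_entries }
    }

    /// Install a capability, reusing the lowest free slot.
    pub fn install(&mut self, obj: Arc<T>) -> Result<usize, i32> {
        if let Some(h) = self.entries.iter().position(Option::is_none) {
            self.entries[h] = Some(obj);
            return Ok(h);
        }
        if self.entries.len() >= self.max_entries {
            return Err(EMFILE); // table exhaustion: fail, never grow unbounded
        }
        self.entries.push(Some(obj));
        Ok(self.entries.len() - 1)
    }

    /// Close a handle: drop the entry's strong reference. The object is
    /// freed when the last reference anywhere goes away.
    pub fn close(&mut self, handle: usize) -> Result<(), i32> {
        match self.entries.get_mut(handle) {
            Some(slot) if slot.is_some() => { *slot = None; Ok(()) }
            _ => Err(EBADF),
        }
    }
}
```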
11.2 Linux Permission Emulation
Traditional Unix permissions (UIDs, GIDs, file modes, POSIX capabilities) are emulated on top of the ISLE capability model. The translation is transparent to applications:
- UIDs/GIDs: Mapped to capability sets. `uid == 0` grants a broad (but still bounded) capability set, not unlimited access.
- File modes: Translated to per-file capability checks at open time.
- POSIX capabilities (`CAP_NET_RAW`, `CAP_SYS_ADMIN`, etc.): Each maps to a specific set of ISLE capabilities.
- Supplementary groups: Expand the effective capability set.
Applications see standard `getuid()`, `stat()`, `access()` behavior.
11.2.1 System Administration Capabilities
ISLE defines a set of core capabilities that map to Linux's POSIX capabilities. These are the native ISLE capabilities that form the basis for permission checks:
bitflags! {
/// Core system administration and access control capabilities.
/// These are the ISLE-native capabilities that correspond to Linux's
/// POSIX capability set, but with a more granular and typed design.
/// **Backing type**: `u128` provides 128 bits — enough for all Linux POSIX
/// capabilities (currently 41, bits 0-40) and ISLE-native capabilities,
/// with room for growth. Linux has added ~1 capability per release cycle;
/// 128 bits gives decades of headroom. On 64-bit platforms, `u128` is
/// two registers — capability checks remain fast (two AND+CMP operations).
///
/// **Layout**: Bits 0-63 are reserved for POSIX-compatible capabilities
/// (matching Linux numbering exactly — bit N = Linux `CAP_*` number N).
/// Bits 64-127 are ISLE-native capabilities that have no Linux equivalent.
/// Bits 41-63 are reserved for future Linux capabilities.
pub struct SystemCaps: u128 {
// ===== POSIX capabilities (bits 0-40, matching Linux exactly) =====
// These bit positions are 1:1 with Linux's capability numbers.
// The syscall compat layer (Section 20) passes Linux cap numbers
// through directly: capable(N) → has_cap(1 << N). No translation needed.
/// CAP_CHOWN (Linux 0): Allow arbitrary file ownership changes.
const CAP_CHOWN = 1 << 0;
/// CAP_DAC_OVERRIDE (Linux 1): Bypass all DAC checks.
const CAP_DAC_OVERRIDE = 1 << 1;
/// CAP_DAC_READ_SEARCH (Linux 2): Bypass file read and directory search checks.
const CAP_DAC_READ_SEARCH = 1 << 2;
/// CAP_FOWNER (Linux 3): Bypass file-owner permission checks.
const CAP_FOWNER = 1 << 3;
/// CAP_FSETID (Linux 4): Set set-user-ID and set-group-ID bits.
const CAP_FSETID = 1 << 4;
/// CAP_KILL (Linux 5): Send signals to arbitrary processes.
const CAP_KILL = 1 << 5;
/// CAP_SETGID (Linux 6): Set arbitrary group IDs.
const CAP_SETGID = 1 << 6;
/// CAP_SETUID (Linux 7): Set arbitrary user IDs.
const CAP_SETUID = 1 << 7;
/// CAP_SETPCAP (Linux 8): Modify capability sets.
const CAP_SETPCAP = 1 << 8;
/// CAP_LINUX_IMMUTABLE (Linux 9): Set immutable and append-only file attributes.
const CAP_LINUX_IMMUTABLE = 1 << 9;
/// CAP_NET_BIND_SERVICE (Linux 10): Bind to privileged ports (< 1024).
const CAP_NET_BIND_SERVICE = 1 << 10;
/// CAP_NET_BROADCAST (Linux 11): Socket broadcasting and multicast.
const CAP_NET_BROADCAST = 1 << 11;
/// CAP_NET_ADMIN (Linux 12): Network administration operations.
const CAP_NET_ADMIN = 1 << 12;
/// CAP_NET_RAW (Linux 13): Raw and packet sockets.
const CAP_NET_RAW = 1 << 13;
/// CAP_IPC_LOCK (Linux 14): Lock memory (mlock, mlockall, SHM_LOCK).
const CAP_IPC_LOCK = 1 << 14;
/// CAP_IPC_OWNER (Linux 15): Override IPC ownership checks.
const CAP_IPC_OWNER = 1 << 15;
/// CAP_SYS_MODULE (Linux 16): Load and unload kernel modules.
const CAP_SYS_MODULE = 1 << 16;
/// CAP_SYS_RAWIO (Linux 17): Raw I/O operations.
const CAP_SYS_RAWIO = 1 << 17;
/// CAP_SYS_CHROOT (Linux 18): Use chroot(2).
const CAP_SYS_CHROOT = 1 << 18;
/// CAP_SYS_PTRACE (Linux 19): Trace arbitrary processes.
/// Note: ISLE also provides CAP_DEBUG (bit 68) as the native debugging
/// capability. The compat layer maps CAP_SYS_PTRACE checks to CAP_DEBUG.
const CAP_SYS_PTRACE = 1 << 19;
/// CAP_SYS_PACCT (Linux 20): Configure process accounting.
const CAP_SYS_PACCT = 1 << 20;
/// CAP_SYS_ADMIN (Linux 21): Broad system administration.
/// Note: ISLE also provides CAP_ADMIN (bit 64) as the native admin
/// capability. The compat layer: `capable(CAP_SYS_ADMIN)` checks
/// `has_cap(CAP_SYS_ADMIN) || has_cap(CAP_ADMIN)`.
const CAP_SYS_ADMIN = 1 << 21;
/// CAP_SYS_BOOT (Linux 22): Use reboot(2) and kexec_load(2).
const CAP_SYS_BOOT = 1 << 22;
/// CAP_SYS_NICE (Linux 23): Set scheduling policies, nice values.
const CAP_SYS_NICE = 1 << 23;
/// CAP_SYS_RESOURCE (Linux 24): Override resource limits (RLIMIT).
const CAP_SYS_RESOURCE = 1 << 24;
/// CAP_SYS_TIME (Linux 25): Set system clock and adjtime.
const CAP_SYS_TIME = 1 << 25;
/// CAP_SYS_TTY_CONFIG (Linux 26): Configure virtual terminal settings.
const CAP_SYS_TTY_CONFIG = 1 << 26;
/// CAP_MKNOD (Linux 27): Create special files using mknod(2).
const CAP_MKNOD = 1 << 27;
/// CAP_LEASE (Linux 28): Establish leases on arbitrary files.
const CAP_LEASE = 1 << 28;
/// CAP_AUDIT_WRITE (Linux 29): Write records to audit log.
const CAP_AUDIT_WRITE = 1 << 29;
/// CAP_AUDIT_CONTROL (Linux 30): Configure audit rules.
const CAP_AUDIT_CONTROL = 1 << 30;
/// CAP_SETFCAP (Linux 31): Set file capabilities.
const CAP_SETFCAP = 1 << 31;
/// CAP_MAC_OVERRIDE (Linux 32): Override MAC enforcement.
const CAP_MAC_OVERRIDE = 1 << 32;
/// CAP_MAC_ADMIN (Linux 33): MAC configuration changes.
const CAP_MAC_ADMIN = 1 << 33;
/// CAP_SYSLOG (Linux 34): Privileged syslog operations.
const CAP_SYSLOG = 1 << 34;
/// CAP_WAKE_ALARM (Linux 35): Set system wakeup alarms.
const CAP_WAKE_ALARM = 1 << 35;
/// CAP_BLOCK_SUSPEND (Linux 36): Prevent system suspending.
const CAP_BLOCK_SUSPEND = 1 << 36;
/// CAP_AUDIT_READ (Linux 37): Read audit log messages.
const CAP_AUDIT_READ = 1 << 37;
/// CAP_PERFMON (Linux 38): Performance monitoring and observability.
const CAP_PERFMON = 1 << 38;
/// CAP_BPF (Linux 39): Load eBPF programs and create maps.
const CAP_BPF = 1 << 39;
/// CAP_CHECKPOINT_RESTORE (Linux 40): Checkpoint/restore operations.
const CAP_CHECKPOINT_RESTORE = 1 << 40;
// Bits 41-63: Reserved for future Linux capabilities.
// ===== ISLE-native capabilities (bits 64-127) =====
// These have no Linux equivalent. They provide finer-grained
// control for ISLE-specific subsystems.
/// CAP_ADMIN: ISLE-native administrative capability. Grants broad
/// administrative access including: mounting/unmounting filesystems,
/// modifying firewall/routing, loading kernel modules, sysctl changes,
/// privilege escalation, cgroup/namespace administration, device
/// management. The compat layer maps `capable(CAP_SYS_ADMIN)` to
/// check both CAP_SYS_ADMIN (bit 21) and CAP_ADMIN (bit 64) — a
/// process holding either one passes the check. CAP_ADMIN is the
/// preferred capability for new ISLE-native code paths; CAP_SYS_ADMIN
/// exists purely for Linux application compatibility.
const CAP_ADMIN = 1 << 64;
/// CAP_P2P_DMA: Peer-to-peer DMA operations between devices.
/// Required for drivers that initiate P2P transactions.
const CAP_P2P_DMA = 1 << 65;
/// CAP_NET_LOOKUP: BPF socket/connection table lookups via
/// bpf_sk_lookup(). Required for BPF load balancers. See §35.2.
const CAP_NET_LOOKUP = 1 << 66;
/// CAP_NET_ROUTE_READ: BPF FIB (routing table) lookups via
/// bpf_fib_lookup(). Required for XDP forwarding. See §35.2.
const CAP_NET_ROUTE_READ = 1 << 67;
/// CAP_DEBUG: Debug arbitrary processes via ptrace or /proc/pid.
/// ISLE-native debugging capability. CAP_SYS_PTRACE (bit 19) is
/// the Linux compat alias. See §40a.1 for ptrace capability model.
const CAP_DEBUG = 1 << 68;
/// CAP_NS_TRAVERSE: Traverse namespace boundaries for cross-namespace
/// operations. Each boundary crossing requires this capability.
/// Combined with CAP_DEBUG, enables cross-namespace ptrace. See §40a.1.
const CAP_NS_TRAVERSE = 1 << 69;
/// CAP_MOUNT: Mount/unmount filesystems. Required for mount(2),
/// umount(2), pivot_root(2). Scoped to caller's mount namespace. See §27.2.
const CAP_MOUNT = 1 << 70;
/// CAP_VMX: Execute VMX instructions for KVM host operation.
/// Maps to /dev/kvm open and VMXON/VMXOFF. See §37.
const CAP_VMX = 1 << 71;
/// CAP_CGROUP_ADMIN: Manage cgroup hierarchies — create, modify,
/// destroy within delegated subtree. See §20.5.
const CAP_CGROUP_ADMIN = 1 << 72;
/// CAP_TPM_SEAL: Seal/unseal data to TPM. See §23.2.
const CAP_TPM_SEAL = 1 << 73;
/// CAP_NET_CONNTRACK: BPF conntrack state query/modify via
/// bpf_ct_lookup(), bpf_ct_insert(), bpf_ct_set_nat(). See §35.2.
const CAP_NET_CONNTRACK = 1 << 74;
/// CAP_NET_REDIRECT: XDP packet redirect to another interface's
/// isolation domain. See §35.2.
const CAP_NET_REDIRECT = 1 << 75;
/// CAP_TTY_DIRECT: Zero-copy PTY mode (PtyRingPage mapped into both
/// master/slave). For container logging optimization. See §59.2.
const CAP_TTY_DIRECT = 1 << 76;
/// CAP_ACCEL_ADMIN: Accelerator device admin (firmware update, reset,
/// perf counter reset, privileged debug). See §42.2.2.
const CAP_ACCEL_ADMIN = 1 << 77;
// ZFS-specific capabilities (scoped to datasets via object_id, §28.2).
/// CAP_ZFS_MOUNT: Mount a ZFS dataset as a filesystem.
const CAP_ZFS_MOUNT = 1 << 78;
/// CAP_ZFS_SNAPSHOT: Create and destroy snapshots.
const CAP_ZFS_SNAPSHOT = 1 << 79;
/// CAP_ZFS_SEND: Generate a send stream for replication.
const CAP_ZFS_SEND = 1 << 80;
/// CAP_ZFS_RECV: Receive a send stream into a dataset.
const CAP_ZFS_RECV = 1 << 81;
/// CAP_ZFS_CREATE: Create child datasets within a parent.
const CAP_ZFS_CREATE = 1 << 82;
/// CAP_ZFS_DESTROY: Destroy a dataset (highest ZFS privilege).
const CAP_ZFS_DESTROY = 1 << 83;
// DLM (Distributed Lock Manager) capabilities. See §31a.14.
/// CAP_DLM_LOCK: Acquire, convert, release locks in permitted lockspaces.
const CAP_DLM_LOCK = 1 << 84;
/// CAP_DLM_ADMIN: Create/destroy lockspaces, configure, view cluster-wide.
const CAP_DLM_ADMIN = 1 << 85;
/// CAP_DLM_CREATE: Create new lock resources (app-level via /dev/dlm).
const CAP_DLM_CREATE = 1 << 86;
// Bits 87-127: Reserved for future ISLE-native capabilities.
}
}
Key design notes:
- Bit layout: Bits 0-40 match Linux's capability numbering exactly (bit N = Linux `CAP_*` number N). This means the compat layer needs no translation for POSIX capability checks — `capable(N)` simply checks `caps & (1 << N)`. Bits 41-63 are reserved for future Linux capabilities. Bits 64-127 are ISLE-native.
- CAP_ADMIN vs CAP_SYS_ADMIN: `CAP_SYS_ADMIN` (bit 21) is the Linux-compatible capability at its exact Linux bit position. `CAP_ADMIN` (bit 64) is the ISLE-native administrative capability. The compat layer checks both: `capable(CAP_SYS_ADMIN)` succeeds if the process holds either `CAP_SYS_ADMIN` or `CAP_ADMIN`. New ISLE-native code should check `CAP_ADMIN`; the POSIX bits exist purely for unmodified Linux application compatibility.
- CAP_SYS_PTRACE vs CAP_DEBUG: Similar pattern. `CAP_SYS_PTRACE` (bit 19) is the Linux compat bit. `CAP_DEBUG` (bit 68) is the ISLE-native debugging capability. The compat layer maps ptrace capability checks to `CAP_SYS_PTRACE || CAP_DEBUG`.
- Granularity over blanket permissions: Individual capabilities (e.g., `CAP_NET_ADMIN`, `CAP_DAC_OVERRIDE`) allow least-privilege assignment. A web server needs only `CAP_NET_BIND_SERVICE` and `CAP_SETUID`, not `CAP_ADMIN`.
- No root-equivalent unlimited access: Even `CAP_ADMIN` is bounded. It grants the operations listed above but does not bypass hardware isolation domains or allow arbitrary code execution in isle-core. There is no capability that grants "ignore all security checks" — administrative operations are still mediated through the capability system.
POSIX capability mapping: The syscall entry point (Section 20) converts Linux capability checks (`capable(N)`) to `SystemCaps` bit checks. For POSIX capabilities (0-40), this is a direct `1 << N` check with no translation. For caps that have ISLE-native equivalents (e.g., `CAP_SYS_ADMIN`/`CAP_ADMIN`), the check is `has_cap(1 << N) || has_cap(ISLE_NATIVE_EQUIVALENT)`.
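The compat-layer check can be sketched directly on the bit layout. This is a minimal standalone model (not the actual Section 20 dispatch code); the bit positions shown are a subset of the full `SystemCaps` definition above.

```rust
// Illustrative subset of the bit layout: bits 0-40 match Linux
// capability numbers exactly; bits 64+ are ISLE-native.
pub const CAP_SYS_PTRACE: u32 = 19; // Linux numbering
pub const CAP_SYS_ADMIN: u32 = 21;  // Linux numbering
pub const ISLE_CAP_ADMIN: u32 = 64; // ISLE-native
pub const ISLE_CAP_DEBUG: u32 = 68; // ISLE-native

pub fn has_cap(caps: u128, bit: u32) -> bool {
    caps & (1u128 << bit) != 0
}

/// Compat-layer `capable(N)`: a direct bit check for POSIX caps, OR'd
/// with the ISLE-native equivalent where one exists.
pub fn capable(caps: u128, linux_cap: u32) -> bool {
    let native_equiv = match linux_cap {
        CAP_SYS_ADMIN => Some(ISLE_CAP_ADMIN),
        CAP_SYS_PTRACE => Some(ISLE_CAP_DEBUG),
        _ => None,
    };
    has_cap(caps, linux_cap)
        || native_equiv.map_or(false, |b| has_cap(caps, b))
}
```

A process holding only the ISLE-native `CAP_ADMIN` bit passes a Linux `capable(CAP_SYS_ADMIN)` check, while plain POSIX bits pass through with no translation at all.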
11.3 Dual ACL Model: POSIX Draft ACLs + NFSv4 ACLs
ISLE supports two ACL models as first-class citizens, designed into the VFS layer from the start:
POSIX Draft ACLs (IEEE 1003.1e/1003.2c):
- Standard Linux ACL model (setfacl/getfacl)
- ACL_USER, ACL_GROUP, ACL_MASK, ACL_OTHER entry types
- Default ACLs for directory inheritance
- Required for ext4, XFS, btrfs compatibility
NFSv4 ACLs (RFC 7530 / RFC 8881):
- Richer model used by NFS, ZFS, and most modern Unix systems (FreeBSD, Solaris/illumos)
- Explicit ALLOW/DENY ACE ordering (access control entries processed in order)
- Fine-grained permissions: READ_DATA, WRITE_DATA, APPEND_DATA, READ_NAMED_ATTRS,
WRITE_NAMED_ATTRS, EXECUTE, DELETE_CHILD, READ_ATTRIBUTES, WRITE_ATTRIBUTES,
DELETE, READ_ACL, WRITE_ACL, WRITE_OWNER, SYNCHRONIZE
- Inheritance flags: FILE_INHERIT, DIRECTORY_INHERIT, NO_PROPAGATE_INHERIT,
INHERIT_ONLY
- Automatic inheritance tracking (for efficient subtree ACL changes)
VFS ACL Abstraction:
pub enum AclModel {
PosixDraft, // POSIX.1e draft ACLs
Nfsv4, // NFSv4/ZFS-style rich ACLs
}
pub trait VfsAcl {
/// Which ACL model this filesystem uses
fn acl_model(&self) -> AclModel;
/// Get the effective ACL for an inode
fn get_acl(&self, inode: InodeId, acl_type: AclType) -> Result<Acl, Error>;
/// Set an ACL on an inode
fn set_acl(&self, inode: InodeId, acl_type: AclType, acl: &Acl) -> Result<(), Error>;
/// Check access (called by the permission check path)
fn check_access(&self, inode: InodeId, who: &Principal, mask: AccessMask) -> Result<(), Error>;
}
Filesystem support:
- ext4, XFS, btrfs: POSIX draft ACLs (native) + NFSv4 via translation layer
- ZFS: NFSv4 ACLs (native)
- NFS client: NFSv4 ACLs (native, passed through to server)
- tmpfs, procfs, sysfs: POSIX draft ACLs (simple model sufficient)
Translation layer: When a filesystem natively uses one model but the user/application requests the other, a translation layer converts between them. This is similar to how FreeBSD and illumos handle mixed ACL environments. The translation is lossy in some edge cases (NFSv4 DENY entries have no POSIX equivalent), but covers common use cases.
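The lossless direction of the translation can be sketched as a mask expansion: one POSIX rwx triplet maps onto the NFSv4 access-mask bits it implies (mask values per RFC 7530). The reverse direction is the lossy one, since DENY entries and the finer-grained bits have no POSIX equivalent. The function name is illustrative, not the actual translation-layer API.

```rust
// NFSv4 access-mask bits (values from RFC 7530, ACE4_* access masks).
pub const ACE4_READ_DATA: u32 = 0x0000_0001;
pub const ACE4_WRITE_DATA: u32 = 0x0000_0002;
pub const ACE4_APPEND_DATA: u32 = 0x0000_0004;
pub const ACE4_EXECUTE: u32 = 0x0000_0020;

/// Expand one POSIX rwx triplet (e.g. 0b101 = r-x) into the NFSv4
/// mask bits it implies. POSIX "write" implies both WRITE_DATA and
/// APPEND_DATA, since POSIX cannot distinguish them.
pub fn posix_to_nfsv4_mask(rwx: u8) -> u32 {
    let mut mask = 0;
    if rwx & 0b100 != 0 { mask |= ACE4_READ_DATA; }
    if rwx & 0b010 != 0 { mask |= ACE4_WRITE_DATA | ACE4_APPEND_DATA; }
    if rwx & 0b001 != 0 { mask |= ACE4_EXECUTE; }
    mask
}
```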
Linux compatibility: The getxattr/setxattr syscalls support both
system.posix_acl_access/system.posix_acl_default (POSIX) and system.nfs4_acl (NFSv4)
extended attribute names. Tools like nfs4_getfacl/nfs4_setfacl work unmodified.
11.4 Driver Sandboxing
Each driver tier has a different sandboxing model:
Tier 1 (domain-isolated):
- Memory isolation via hardware domain keys (cannot access core or other drivers)
- DMA fencing via IOMMU (cannot DMA to arbitrary physical addresses)
- Capability restrictions: driver only receives capabilities for its own devices
- No access to: page tables, capability tables, scheduler state, other driver state

Tier 2 (process-isolated):
- Full address space isolation (separate page tables)
- IOMMU DMA fencing
- seccomp-like syscall filtering: driver can only invoke KABI syscalls, not arbitrary Linux syscalls
- Resource limits: memory, CPU time, file descriptors
- Mandatory capability-based access control

All drivers:
- Least-privilege capability grants based on driver manifest and system policy
- Cryptographic signature verification (optional, configurable)
- Audit logging of all capability-mediated operations (when audit is enabled)
See also: Section 26 (Confidential Computing) extends capability-based isolation to TEE enclaves with hardware-encrypted memory (SEV-SNP, TDX, ARM CCA). Section 25 (Post-Quantum Cryptography) ensures capability tokens and distributed credentials remain secure against quantum attacks via algorithm-agile crypto abstractions.
11.5 Security by Default
Linux problem: MAC (SELinux/AppArmor) exists but is complex and usually permissive by default. POSIX capabilities were supposed to replace setuid but are clunky and poorly adopted.
ISLE design:
- The capability model is the foundation, not an add-on. Every resource access goes
through capability checks — there is no "bypass" path.
- Default-deny: Processes start with an empty capability set. They receive
capabilities explicitly from their parent or from the exec-time capability grants
(replacing setuid).
- No setuid binaries: The setuid bit on executables maps to "grant these
capabilities on exec" — the process never actually runs as uid 0 with unlimited power.
This is transparent to applications that check geteuid() == 0 (the compat layer
handles this).
- LSM hooks at all security-relevant points: Pre-integrated, always active, not
optional kernel config. SELinux and AppArmor policy engines are loadable modules that
attach to these hooks.
- Application profiles: Ship default confinement profiles for common services
(sshd, nginx, postgres) — applications are sandboxed out of the box.
11a. Error Handling and Fault Containment
11a.1 Kernel Error Model
Linux problem: Kernel functions return negative integers (-ENOMEM, -EINVAL) for
errors, sometimes stuffed into pointers via ERR_PTR(). The type system enforces nothing —
callers can silently ignore errors, confuse pointers with error pointers, or propagate the
wrong errno. Unchecked kmalloc returns and missing error propagation are endemic.
ISLE design: All kernel-internal functions return Result<T, KernelError>. The ?
operator propagates errors up the call stack. The Rust type system makes it impossible to
silently ignore an error — no integer error codes, no sentinel values, no ERR_PTR.
/// Canonical kernel error type. All isle-core and driver-facing
/// kernel functions return `Result<T, KernelError>`.
#[repr(u32)]
pub enum KernelError {
OutOfMemory = 1, // Physical or virtual memory exhausted
InvalidCapability = 2, // Capability handle missing or revoked
PermissionDenied = 3, // Capability lacks required permission bits
InvalidArgument = 4, // Syscall argument out of range
DeviceError = 5, // Device error or device in error state
Timeout = 6, // Operation timed out
WouldBlock = 7, // Non-blocking operation would block
NotFound = 8, // Requested object does not exist
AlreadyExists = 9, // Object or resource already exists
Interrupted = 10, // Operation interrupted by signal
IoError = 11, // Generic I/O error (disk, network, DMA)
// Extensible: new variants added at end, values are stable.
}
POSIX errno mapping: The syscall entry point (Section 20) converts
KernelError to POSIX errno values at the syscall boundary — the only place integer
error codes exist:
| KernelError | POSIX errno | Value |
|---|---|---|
| OutOfMemory | ENOMEM | 12 |
| InvalidCapability | EBADF | 9 |
| PermissionDenied | EPERM / EACCES | 1 / 13 |
| InvalidArgument | EINVAL | 22 |
| DeviceError | EIO | 5 |
| Timeout | ETIMEDOUT | 110 |
| WouldBlock | EAGAIN | 11 |
| NotFound | ENOENT | 2 |
| AlreadyExists | EEXIST | 17 |
| Interrupted | EINTR | 4 |
| IoError | EIO | 5 |
Some variants map to different errnos depending on context (PermissionDenied becomes
EPERM for capability operations, EACCES for filesystem operations). The translation
is handled by the syscall dispatch layer, not by the originating subsystem.
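The boundary translation can be sketched as a single match at the dispatch layer. This is a standalone model, not the actual Section 20 code; the `ErrCtx` tag is an illustrative way to express the context-dependent PermissionDenied mapping.

```rust
/// Context tag passed by the syscall dispatch layer so a single variant
/// can map to different errnos (the PermissionDenied split above).
pub enum ErrCtx { Capability, Filesystem }

pub enum KernelError {
    OutOfMemory, InvalidCapability, PermissionDenied, InvalidArgument,
    DeviceError, Timeout, WouldBlock, NotFound, AlreadyExists,
    Interrupted, IoError,
}

/// The only place integer error codes exist: the syscall boundary.
pub fn to_errno(e: KernelError, ctx: ErrCtx) -> i32 {
    use KernelError::*;
    match e {
        OutOfMemory => 12,          // ENOMEM
        InvalidCapability => 9,     // EBADF
        PermissionDenied => match ctx {
            ErrCtx::Capability => 1,  // EPERM
            ErrCtx::Filesystem => 13, // EACCES
        },
        InvalidArgument => 22,      // EINVAL
        DeviceError | IoError => 5, // EIO
        Timeout => 110,             // ETIMEDOUT
        WouldBlock => 11,           // EAGAIN
        NotFound => 2,              // ENOENT
        AlreadyExists => 17,        // EEXIST
        Interrupted => 4,           // EINTR
    }
}

/// `Result` inside the kernel becomes `-errno` in the Linux ABI
/// return register (rax/x0/a0).
pub fn syscall_return(r: Result<u64, KernelError>, ctx: ErrCtx) -> i64 {
    match r {
        Ok(v) => v as i64,
        Err(e) => -(to_errno(e, ctx) as i64),
    }
}
```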
11a.2 Fault Containment Boundaries
ISLE has four fault containment domains. A fault in one domain does not propagate to domains above it in the hierarchy:
| Domain | Failure scope | Recovery |
|---|---|---|
| isle-core (Tier 0) | Kernel panic — entire system | Reboot (same as Linux) |
| Tier 1 driver (domain-isolated) | Single driver crash | Automatic restart, device FLR (Section 9) |
| Tier 2 driver (process) | Single driver process crash | Automatic restart (Section 9) |
| Userspace process | Single process terminated | Application-level recovery |
Linux problem: Linux has exactly one fault domain for the entire kernel. A null dereference in an obscure USB driver is indistinguishable from a bug in the scheduler — both trigger the same kernel panic. The only containment boundary is kernel vs. userspace.
ISLE design: The isolation model (Section 3) gives each Tier 1 driver its own
isolation domain. When a CPU exception fires (page fault, general protection fault,
divide-by-zero), isle-core's exception handler inspects the faulting context's isolation
domain ID (architecture-specific: PKRU on x86, page table base on ARM/RISC-V — see
arch::current::isolation::current_domain_id()):
- Domain 0 (isle-core): The fault is in the trusted kernel. This is a genuine kernel panic — proceed to the panic handler (Section 11a.3).
- Domain 1-N (Tier 1 driver): The fault is in an isolated driver. The exception handler identifies the driver from the faulting domain ID, marks it as crashed, and invokes the crash recovery sequence (Section 9). The rest of the kernel continues running.
Tier 2 driver faults are even simpler: the driver runs in a separate address space, so a fault (SIGSEGV, SIGBUS, SIGFPE) terminates the driver process. The driver supervisor detects the exit and restarts it.
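The routing decision above reduces to a check on the faulting domain ID. A minimal sketch with illustrative names:

```rust
/// What the exception handler does with a fault, decided purely by the
/// faulting context's isolation domain ID (illustrative enum).
#[derive(Debug, PartialEq)]
pub enum FaultAction {
    KernelPanic,                    // domain 0: isle-core itself
    DriverRecovery { domain: u32 }, // domain 1..N: Tier 1 driver
}

/// Domain 0 faults are genuine kernel panics; any other domain routes
/// to the Section 9 driver crash recovery path. Tier 2 drivers never
/// reach this handler: they fault as ordinary userspace processes and
/// are restarted by the driver supervisor.
pub fn route_fault(faulting_domain: u32) -> FaultAction {
    if faulting_domain == 0 {
        FaultAction::KernelPanic
    } else {
        FaultAction::DriverRecovery { domain: faulting_domain }
    }
}
```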
11a.3 Panic Handling
A kernel panic means a bug in isle-core itself — the small trusted computing base. This is the only code whose failure is fatal.
Panic sequence:
1. DISABLE INTERRUPTS — local CLI on the faulting CPU.
NMI IPI broadcast to all other CPUs. Each CPU receiving the NMI executes
the NMI panic handler, which:
(a) Saves the current register context to a pre-allocated per-CPU crash buffer
(allocated at boot, never freed, immune to OOM — one 4KB page per CPU).
(b) Disables local interrupts (preventing further preemption or nested exceptions).
(c) Spins on an atomic flag waiting for the panic coordinator (the faulting CPU).
This ensures all CPUs are in a known-safe state before the coordinator reads
system data structures. The NMI handler is NMI-safe — it uses no locks, no
allocation, and no printk. It writes only to the pre-allocated crash buffer.
Architecture-specific NMI delivery: On x86-64, this uses the APIC NMI delivery
mode. On AArch64 (GICv3.3+), the GIC NMI mechanism (GICD_INMIR) is used; on
older GIC implementations, a highest-priority FIQ is used as a pseudo-NMI (same
technique as Linux's CONFIG_ARM64_PSEUDO_NMI). On RISC-V, sbi_send_ipi() with a
dedicated panic IPI vector is used. **Limitation**: standard RISC-V supervisor
interrupts are maskable — sstatus.SIE=0 blocks all supervisor interrupts
regardless of AIA priority. If the target CPU has interrupts disabled, the IPI
will not be delivered. Mitigation: the panic coordinator uses a 100 ms timeout
per CPU; CPUs that do not respond are marked "unavailable" in the crash dump.
On systems implementing the Smrnmi extension (Resumable Non-Maskable Interrupts),
ISLE uses RNMI delivery instead, which is truly non-maskable. On PPC64LE, the OPAL
opal_signal_system_reset() call triggers a system reset interrupt on target CPUs.
2. CAPTURE STATE — faulting CPU registers, stack backtrace (.eh_frame),
per-CPU crash buffers (from step 1), key data structures (process list,
cap table, driver registry, last 64KB klog)
3. SERIAL FLUSH — panic message + backtrace to serial (Tier 0, polled, always works)
4. CRASH DUMP — if configured, write ELF core dump to reserved memory region;
if NVMe panic-write path registered, polled-mode write to disk (Section 9)
5. HALT — default halt (isle.panic=halt), or reboot (isle.panic=reboot)
Driver panic vs. kernel panic: A panic!() inside Tier 1 driver code does NOT
panic the kernel. The kernel is compiled with panic = "abort", so there is no stack
unwinding. Instead, panic!() calls abort(), which executes an illegal instruction
(ud2 on x86-64, udf on AArch64/ARMv7, unimp on RISC-V, trap on PPC). This
triggers a CPU exception (invalid opcode / undefined instruction) within the driver's
isolation domain. The exception handler identifies the faulting domain as a non-core
driver domain and routes the fault to driver crash recovery (Section 9) — not to the
kernel panic path.
OOM policy: When physical memory is exhausted, ISLE applies pressure in stages:
- Reclaim page cache: Clean pages are evicted immediately (no I/O cost). Dirty pages are written back and then evicted.
- Compress to zpool: Inactive anonymous pages are compressed and moved to the in-kernel compression tier (Section 13), reducing physical memory usage 2-3x.
- Swap to disk: If a swap device is configured, compressed pages that haven't been accessed spill to disk.
- OOM killer: If all of the above fail to free enough memory, the OOM killer selects a process to terminate. Heuristic: largest RSS, not marked `OOM_SCORE_ADJ=-1000`, not system-critical, not recently started. The selected process receives `SIGKILL`.
isle-core itself is never OOM-killed. Core kernel allocations draw from a reserved memory pool (configured at boot, default 64MB) that is excluded from the general-purpose allocator. If the reserved pool is exhausted — a symptom of a kernel memory leak — this is a kernel panic, not an OOM kill.
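The victim-selection heuristic can be sketched as a filter-then-max over candidate processes. The `Proc` fields are illustrative, and the "not recently started" criterion is omitted for brevity.

```rust
/// Candidate process as seen by the OOM killer (illustrative fields).
pub struct Proc {
    pub pid: u32,
    pub rss_pages: u64,
    pub oom_score_adj: i32,
    pub system_critical: bool,
}

/// Victim selection per the heuristic above: among killable processes,
/// choose the one with the largest RSS.
pub fn select_oom_victim(procs: &[Proc]) -> Option<u32> {
    procs
        .iter()
        .filter(|p| p.oom_score_adj != -1000) // opted out of OOM kill
        .filter(|p| !p.system_critical)       // never kill system-critical
        .max_by_key(|p| p.rss_pages)
        .map(|p| p.pid)
}
```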
11a.4 Error Reporting to Userspace
Syscall error returns: Standard Linux ABI — negative errno in the return register
(rax on x86-64, x0 on AArch64, a0 on RISC-V). Applications, glibc, and musl all
work unmodified.
Extended error information: For complex failures where a single errno is insufficient (e.g., "which capability was invalid?" or "which device returned an error?"), ISLE provides a per-thread extended error buffer:
/// Per-thread extended error context, populated on syscall failure.
#[repr(C)]
pub struct ExtendedError {
pub errno: i32, // POSIX errno (same as syscall return)
pub subsystem: u32, // Kernel subsystem that generated the error
pub detail_code: u32, // Subsystem-specific detail code
pub object_id: u64, // Related capability/device/inode ID (0 = N/A)
}
Queried via prctl(PR_GET_EXTENDED_ERROR, &buf) — entirely optional. Applications that
don't use it see standard errno behavior with zero overhead (the buffer is only written on
error). The subsystem and detail_code fields are stable, allowing diagnostic tools to
produce messages like "capability 0x3f revoked by generation advance" instead of "EBADF".
Kernel log messages: Errors are logged to the kernel ring buffer (dmesg) with
structured fields. The stable tracepoint ABI (Section 40) exposes these as
machine-parseable events for external monitoring tools.
11a.5 Error Escalation Paths
Errors escalate through a five-level hierarchy. Each level is attempted before moving to the next:
retry → log → degrade → isolate → panic
1 2 3 4 5
1. Retry: Transient hardware errors (bus timeout, CRC mismatch, link retrain) are retried with exponential backoff. Maximum retries are configured per error class (default: 3 retries, 1ms / 10ms / 100ms backoff). If the retry succeeds, no further escalation occurs — the event is logged at `DEBUG` level for trending.
2. Log: Persistent errors that survive retries are logged to the kernel ring buffer and recorded as stable tracepoint events (Section 40). The Fault Management engine (Section 39) ingests these events for threshold-based diagnosis. No state change yet — the subsystem continues operating.
3. Degrade: Repeated errors from the same subsystem trigger graceful degradation. Examples: storage path failover to a redundant controller, NIC fallback from hardware offload to software path, memory controller marking a DIMM rank as degraded (Section 39 RetirePages action). The subsystem continues at reduced capacity. Degradation is reported via uevent to userspace.
4. Isolate: A misbehaving driver is crashed and restarted via the recovery sequence (Section 9). If the same driver crashes repeatedly — 3 times within 60 seconds — it is demoted to Tier 2 (full process isolation). If it continues crashing at Tier 2 (5 cumulative crashes within one hour), the driver is disabled entirely and its device is marked offline in the device registry (Section 7).
5. Panic: Reserved for corrupted isle-core state where continued operation risks data loss or silent corruption. Examples: invalid page table entries in kernel mappings, corrupted capability table metadata, scheduler invariant violations. Any state that cannot be recovered by isolating a single driver triggers a kernel panic (Section 11a.3).
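The level-4 demotion policy can be sketched as a small state machine over crash timestamps. Thresholds are taken from the text (3 crashes in 60 seconds demotes to Tier 2; 5 cumulative crashes in an hour disables); the types and the choice to count all crashes toward the one-hour window are illustrative assumptions.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum DriverState { Tier1, Tier2, Disabled }

/// Crash-rate tracking for one driver. Timestamps are monotonic seconds.
pub struct CrashPolicy {
    pub state: DriverState,
    pub crash_times: Vec<u64>,
}

impl CrashPolicy {
    pub fn new() -> Self {
        Self { state: DriverState::Tier1, crash_times: Vec::new() }
    }

    /// Record a crash and return the (possibly demoted) state the
    /// driver should be restarted in.
    pub fn on_crash(&mut self, now: u64) -> DriverState {
        self.crash_times.push(now);
        let in_window =
            |w: u64| self.crash_times.iter().filter(|&&t| now - t < w).count();
        let recent_60 = in_window(60);
        let recent_3600 = in_window(3600);
        self.state = match self.state {
            DriverState::Tier1 if recent_60 >= 3 => DriverState::Tier2,
            DriverState::Tier2 if recent_3600 >= 5 => DriverState::Disabled,
            s => s, // restart at the current tier
        };
        self.state
    }
}
```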
See also: Section 9 (Crash Recovery) for the full driver restart sequence. Section 39 (Fault Management) for proactive, telemetry-driven error handling before faults occur. Section 40 (Stable Tracepoints) for the machine-parseable event format used at escalation levels 2-4.
12. Memory Management
Memory management runs entirely within ISLE Core. The common case (page fault handling, allocation, deallocation) involves zero protection domain crossings.
12.1 Physical Memory Allocator
Per-CPU Page Caches (hot path, no locks)
|
v
Per-NUMA-Node Buddy Allocator (warm path, per-node lock)
|
v
Global Buddy Allocator (cold path, rare fallback)
- Per-CPU page caches: Each CPU maintains a private cache of free pages. Allocation and deallocation from this cache require no locking and no atomic operations.
- Per-NUMA buddy allocator: When per-CPU caches are exhausted, pages are allocated from the NUMA-local buddy allocator. This uses a per-node lock (minimal contention because per-CPU caches absorb most traffic).
- Page order: Buddy allocator manages orders 0-10 (4KB to 4MB).
- NUMA awareness: Allocations prefer the requesting CPU's NUMA node. Cross-node allocation is a fallback with configurable policy.
See also: Section 19 (Hardware Memory Safety) hooks into the physical and slab allocators to assign MTE tags on allocation and clear them on free, providing hardware-assisted use-after-free and buffer overflow detection.
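The three-level allocation path above can be sketched in miniature. This is an illustrative model, not ISLE's implementation: `PcpCache`, `NodeBuddy`, and the `BATCH` refill size are hypothetical names, and the real buddy free lists and per-node lock are elided.

```rust
// Illustrative sketch of the hot/warm allocation path: a per-CPU cache
// serves pages lock-free and refills in batches from the per-node buddy
// allocator, amortizing the node lock over BATCH pages.
type PageAddr = usize;

const BATCH: usize = 16; // pages moved per refill/drain

struct NodeBuddy {
    free: Vec<PageAddr>, // stand-in for the real buddy free lists (lock elided)
}

struct PcpCache {
    pages: Vec<PageAddr>, // private to one CPU: no locks, no atomics
}

impl PcpCache {
    /// Hot path: pop from the private cache; refill from the node on miss.
    fn alloc(&mut self, node: &mut NodeBuddy) -> Option<PageAddr> {
        if self.pages.is_empty() {
            // Warm path: take a whole batch under the (elided) per-node lock.
            let take = BATCH.min(node.free.len());
            let start = node.free.len() - take;
            self.pages.extend(node.free.drain(start..));
        }
        self.pages.pop()
    }

    /// Free returns to the private cache; drain back to the node when full.
    fn free(&mut self, page: PageAddr, node: &mut NodeBuddy) {
        self.pages.push(page);
        if self.pages.len() > 2 * BATCH {
            let start = self.pages.len() - BATCH;
            node.free.extend(self.pages.drain(start..));
        }
    }
}
```

The batching is the point: the per-node lock is touched once per `BATCH` pages rather than once per page, which is why per-CPU caches absorb most traffic.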
12.2 Slab Allocator
For kernel object allocation (capabilities, file descriptors, inodes, etc.):
- Per-CPU slab caches with magazine-based design
- Per-NUMA partial slab lists
- Object constructors/destructors for complex types
- SLUB-style merging of similarly-sized caches
SlabRef<T> — Owned handle to a slab-allocated object:
/// A reference to a slab-allocated object. Provides O(1) allocation
/// and deallocation from the owning slab cache. Implements `Deref<Target = T>`
/// for transparent access. `Drop` returns the object to the slab cache.
/// Not `Clone` — each `SlabRef` represents unique ownership.
pub struct SlabRef<T: ?Sized> {
ptr: NonNull<T>,
cache: *const SlabCache,
}
12.3 Page Cache
- RCU-protected radix tree for page lookups (lock-free reads)
- NUMA-aware page placement: Pages cached on the node closest to the requesting CPU
- Writeback: Per-device writeback threads, dirty page ratio thresholds
- Page reclaim: LRU lists (active/inactive) with per-CPU drain batching
- Transparent huge pages: Automatic promotion of 4KB pages to 2MB when aligned runs are available, automatic splitting under memory pressure
12.4 Virtual Memory Manager
- Maple tree for VMA (Virtual Memory Area) management (same as Linux 6.1+)
- Demand paging: Pages allocated on first access (page fault handler in ISLE Core)
- Copy-on-write (COW): Fork shares pages read-only, copies on write fault
- Memory-mapped files: `mmap` backed by the page cache, supports `MAP_SHARED` and `MAP_PRIVATE`
- Huge pages: Both explicit (`mmap` with `MAP_HUGETLB`) and transparent (THP)
- ASLR: Full address space layout randomization for user processes
12.5 PCID / ASID Management
Process Context IDentifiers tag TLB entries with a process identifier, avoiding TLB flushes on context switches. The mechanism is architecture-specific but the policy is shared:
| Architecture | Mechanism | ID Width | Usable IDs | Register |
|---|---|---|---|---|
| x86-64 | PCID | 12 bits | 4096 | CR3[11:0] |
| AArch64 | ASID | 8 or 16 bits | 256 or 65536 | TTBR0_EL1[63:48] or TCR_EL1.AS=1 for 16-bit |
| ARMv7 | ASID (CONTEXTIDR) | 8 bits | 256 | CONTEXTIDR[7:0] |
| RISC-V | ASID | 0-16 bits (WARL) | 0-65536 | satp[59:44] (same position for Sv39, Sv48, Sv57; width is implementation-defined, discovered at boot) |
| PPC32 | PID | 8 bits | 256 | PID SPR (via mtspr) |
| PPC64LE | PID (Radix) / LPIDR (HPT) | 20 bits (Radix) / 12 bits (HPT) | ~1M (Radix) / 4096 (HPT) | PIDR SPR (POWER9+) / LPIDR (POWER8) |
Common policy across all architectures:
- LRU-based ID allocation: the least-recently-used ID is evicted when all slots are full. x86 has 4096 slots; ISLE uses the full space with LRU eviction.
- TLB flush avoidance: a context switch to a process that still holds a valid ID requires zero TLB flushes.
- Isolation domain switches require no ID change (same address space, different permissions): MPK/POE/DACR operate independently of address space IDs, and page-table isolation on RISC-V uses separate page-table mappings within the same address space.
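The shared LRU policy can be sketched as follows. This is an illustrative model: `AsidAllocator` and its method names are hypothetical, and the slot count is whatever the architecture provides (4096 for x86-64 PCID, 256 or 65536 for AArch64 ASID, and so on).

```rust
// Illustrative sketch of LRU-based address-space ID allocation shared by
// all architectures. Returns whether the switch requires a TLB flush.
use std::collections::HashMap;

type ProcessId = u64;

struct AsidAllocator {
    slots: usize,                      // usable IDs for this architecture
    owner: HashMap<u16, ProcessId>,    // ASID -> current owner
    assigned: HashMap<ProcessId, u16>, // reverse map
    lru: Vec<u16>,                     // front = least recently used
}

impl AsidAllocator {
    fn new(slots: usize) -> Self {
        AsidAllocator { slots, owner: HashMap::new(), assigned: HashMap::new(), lru: Vec::new() }
    }

    /// Returns (asid, needs_tlb_flush). A process that still holds a valid
    /// ID is switched to with zero TLB flushes; evicting the LRU ID
    /// requires flushing that ID's stale translations.
    fn switch_to(&mut self, pid: ProcessId) -> (u16, bool) {
        if let Some(&asid) = self.assigned.get(&pid) {
            self.lru.retain(|&a| a != asid);
            self.lru.push(asid); // mark most recently used
            return (asid, false);
        }
        let (asid, flush) = if self.owner.len() < self.slots {
            (self.owner.len() as u16, false) // fresh ID, never used before
        } else {
            let victim = self.lru.remove(0); // evict least recently used
            let old = self.owner.remove(&victim).unwrap();
            self.assigned.remove(&old);
            (victim, true)
        };
        self.owner.insert(asid, pid);
        self.assigned.insert(pid, asid);
        self.lru.push(asid);
        (asid, flush)
    }
}
```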
Architecture-specific notes:
- AArch64: 16-bit ASID (TCR_EL1.AS=1) requires ID_AA64MMFR0_EL1.ASIDBits == 0b0010
(available on most ARMv8.2+ cores; some ARMv8.0 cores only support 8-bit ASID).
ISLE discovers ASID width at boot and falls back to 8-bit with more frequent TLB
flushes. 16-bit ASID gives 65536 IDs — effectively unlimited. 8-bit ASID (256 IDs)
requires more aggressive LRU eviction. TLB maintenance uses TLBI ASIDE1IS for
targeted invalidation by ASID.
- ARMv7: Only 8-bit ASID via CONTEXTIDR, limiting to 256 concurrent address spaces.
TLB maintenance uses MCR p15, 0, Rd, c8, c7, 2 (TLBIASID). The DACR domain
mechanism (Section 3) is orthogonal to ASID — domain switches never invalidate TLB.
- RISC-V: ASID occupies bits [63:60]=mode, [59:44]=ASID in the satp CSR (same
position for Sv39, Sv48, and Sv57). The ASID width is implementation-defined and
discovered at boot (by writing all-ones to satp.ASID and reading back). SiFive U74:
9-bit (512 IDs). Other implementations may support up to 16-bit. TLB flush via
sfence.vma with ASID argument for targeted invalidation.
RISC-V ASID availability: The RISC-V specification allows implementations with an
ASID width of 0 (the satp.ASID field is WARL — writes of non-zero values may be
ignored). ISLE detects the available width at boot by writing all-ones to
satp.ASID and reading back the value. If the width is 0, ISLE falls back to a
full TLB flush on every context switch (issuing sfence.vma with rs1=x0, rs2=x0
after every satp write). This is functionally correct but incurs higher
context-switch overhead on ASID-less implementations. The detected width is
stored in a boot-time constant (RISCV_ASID_BITS: u32) and consulted by the TLB
management code to select between ASID-tagged invalidation (sfence.vma with an
ASID argument) and global invalidation (sfence.vma with rs1=x0, rs2=x0).
- PPC32: 8-bit PID via the PID SPR, supporting 256 concurrent address spaces. TLB
management uses tlbie (invalidate by effective address) or tlbia (invalidate all).
The 16 segment registers provide an orthogonal isolation mechanism independent of PID.
- PPC64LE: On POWER9+ with Radix MMU, 20-bit PID via the PIDR SPR supports up to
~1M concurrent address spaces — effectively unlimited. On POWER8 with HPT, the 12-bit
LPIDR provides 4096 logical partition IDs. TLB management uses tlbie with targeting
by PID (Radix) or tlbie with LPID (HPT).
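The RISC-V boot-time width probe described above reduces to a small decoding step. On hardware the sequence is: save satp, `csrw` satp with all ASID bits set, `csrr` it back, restore satp; the pure decoding of the read-back value is sketched here (`asid_width_from_readback` is an illustrative name, not an ISLE API).

```rust
// Decoding step of the satp.ASID WARL probe: count how many ASID bits
// "stuck" after writing all-ones. Unimplemented WARL bits read back as
// zero, so a 0-ASID implementation yields width 0 and the kernel falls
// back to full TLB flushes on context switch.
const SATP_ASID_SHIFT: u32 = 44;
const SATP_ASID_MASK: u64 = 0xFFFF << SATP_ASID_SHIFT; // satp[59:44]

fn asid_width_from_readback(readback: u64) -> u32 {
    ((readback & SATP_ASID_MASK) >> SATP_ASID_SHIFT).count_ones()
}
```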
12.6 Memory Tagging (Hardware-Assisted)
- ARM MTE (Memory Tagging Extension): 4-bit tags on every 16-byte granule for use-after-free and buffer overflow detection in kernel allocations
- Intel LAM (Linear Address Masking): Top address bits available for metadata
- Used in debug/development builds and optionally in production for high-security deployments
See also: Section 19 (Hardware Memory Safety) provides the full MTE/LAM integration design including tag-aware allocators and CHERI future-proofing. Section 32 (Persistent Memory) adds DAX-mapped PMEM as an additional memory tier with crash-consistency guarantees.
12.7 NUMA Topology and Policy
Modern servers are NUMA (Non-Uniform Memory Access): memory access latency depends on which CPU socket is requesting and which physical memory bank is being accessed. A 2-socket server has ~80ns local DRAM access and ~150ns remote access (via QPI/UPI interconnect). At 4+ sockets or with CXL-attached memory (Section 47), the penalty grows further. The kernel must be NUMA-aware at every level — allocation, placement, scheduling, and rebalancing.
Topology Discovery
At boot, ISLE parses platform firmware tables to build a NUMA distance matrix:
| Architecture | Firmware Source | Tables Parsed |
|---|---|---|
| x86-64 | ACPI | SRAT (System Resource Affinity), SLIT (System Locality Information), HMAT (Heterogeneous Memory Attributes) |
| AArch64 | ACPI or Device Tree | SRAT/SLIT (ACPI servers), numa-node-id property (DT-based SoCs) |
| ARMv7 | Device Tree | numa-node-id property (rare; most ARMv7 is UMA) |
| RISC-V 64 | Device Tree | numa-node-id property per memory and CPU node |
| PPC32 | Device Tree | numa-node-id property (rare; most PPC32 is UMA) |
| PPC64LE | Device Tree | ibm,associativity property per CPU and memory node (PAPR NUMA) |
The result is a symmetric distance matrix where distance[i][j] represents the
relative access cost from node i to node j. Local access is always distance 10
(by ACPI convention). Cross-socket is typically 20-30. CXL-attached memory tiers
(Section 47) appear as higher-distance NUMA nodes.
/// NUMA topology, populated at boot from SRAT/SLIT or device tree.
///
/// All arrays are dynamically sized at boot based on the number of NUMA nodes
/// discovered from firmware tables. No compile-time `MAX_NUMA_NODES` constant —
/// the kernel adapts to the hardware, following the same pattern as `PerCpu<T>`.
/// Allocation uses the boot allocator (Section 12.1) during early init.
///
/// On a system with CXL-attached memory (Section 47), each CXL memory device
/// appears as an additional NUMA node. A 4-socket server with 8 CXL memory pools
/// may have 12+ NUMA nodes. The dynamic sizing handles this without recompilation.
pub struct NumaTopology {
/// Number of NUMA nodes discovered.
pub nr_nodes: usize,
/// Distance matrix: distance[i * nr_nodes + j] = relative access cost from
/// node i to node j. distance[i * nr_nodes + i] = 10 (local). Higher = slower.
/// Allocated as a flat array of size nr_nodes * nr_nodes from the boot allocator.
///
/// Uses `u16` to accommodate both SLIT (0-255 range) and HMAT (0-65535 range)
/// representations. ACPI HMAT (Heterogeneous Memory Attribute Table) uses
/// values 0-65535 for memory latency/bandwidth attributes, which exceeds the
/// u8 range of SLIT. SLIT-sourced distances are scaled by the HMAT
/// normalization factor when both tables are present.
pub distance: &'static [u16],
/// Per-node memory ranges (physical address start, length).
/// Allocated as an array of nr_nodes entries from the boot allocator.
/// Each entry holds up to 4 memory ranges (ArrayVec is stack-allocated).
pub node_mem: &'static [ArrayVec<MemRange, 4>],
/// Per-node CPU sets.
pub node_cpus: &'static [CpuMask],
}
impl NumaTopology {
/// Look up the distance between two NUMA nodes.
pub fn distance(&self, from: NumaNodeId, to: NumaNodeId) -> u16 {
self.distance[from.0 as usize * self.nr_nodes + to.0 as usize]
}
}
Memory Allocation Policy
Per-process and per-VMA memory policies control which NUMA nodes are used for page
allocation. These are set via the set_mempolicy(2) and mbind(2) syscalls
(Linux-compatible):
| Policy | Behavior | Typical Use |
|---|---|---|
| `MPOL_DEFAULT` | Allocate on the faulting CPU's local node | General-purpose (default) |
| `MPOL_BIND` | Restrict allocation to the specified node set; OOM if all are full | Database buffer pools, pinned workloads |
| `MPOL_INTERLEAVE` | Round-robin page allocation across the specified nodes | Hash tables, large shared mappings |
| `MPOL_PREFERRED` | Try the specified node first, fall back to others if full | Soft affinity |
| `MPOL_LOCAL` | Always the local node (explicit, not inherited) | Latency-sensitive paths |
MPOL_INTERLEAVE distributes pages across nodes at page granularity (4KB or 2MB for
huge pages), amortizing bandwidth across all memory controllers. This is optimal for
large data structures accessed uniformly (e.g., hash maps, columnar stores).
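The interleave placement rule can be sketched in a few lines. This is an illustrative sketch (`interleave_node` is a hypothetical name): the target node is derived from the page-aligned offset within the mapping, so a given page always lands on the same node no matter which CPU faults it in.

```rust
// MPOL_INTERLEAVE sketch: round-robin node selection by page index.
const PAGE_SHIFT: u32 = 12; // 4 KiB base pages

/// Pick the node for the page containing `offset` (bytes from the start
/// of the mapping), cycling through the policy's node set.
fn interleave_node(nodes: &[u32], offset: u64) -> u32 {
    let page_index = offset >> PAGE_SHIFT;
    nodes[(page_index as usize) % nodes.len()]
}
```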
Automatic NUMA Balancing
ISLE implements automatic NUMA balancing (same approach as Linux's numa_balancing):
- Scan: A periodic scanner walks process page tables and clears the present bit on a fraction of pages (making them trigger faults on next access). Scan rate is adaptive: faster for processes with high cross-node access, slower for well-placed processes.
- Trap: When a task accesses a not-present page, the NUMA fault handler records which CPU (and thus which NUMA node) caused the fault.
- Decide: A cost-benefit analysis compares the expected savings from reduced remote access latency against the migration cost (~50-200 microseconds per 4KB page, dominated by TLB shootdown and memcpy). Migration proceeds only if the net benefit is positive over a configurable window (default: 10 accesses saved per migration cost).
- Migrate: The page is migrated to the accessing CPU's node. During migration the PTE is updated atomically under the page table lock — the application sees no inconsistency.
/// NUMA balancing decision for a single page.
pub fn should_migrate_page(
page: &Page,
accessing_node: NumaNodeId,
current_node: NumaNodeId,
access_count: u32,
) -> bool {
if accessing_node == current_node {
return false; // Already local.
}
let distance = numa_distance(current_node, accessing_node);
let migration_cost_ns = page_migration_cost_ns(page.order());
let expected_savings_ns = access_count as u64 * remote_penalty_ns(distance);
expected_savings_ns > migration_cost_ns * 2 // Conservative: require 2x payoff.
}
NUMA-Aware Kernel Allocations
The buddy allocator (Section 12.1) and slab allocator (Section 12.2) are NUMA-aware:
- Buddy allocator: Per-NUMA-node free lists. Allocation prefers the local node; cross-node fallback uses the SLIT distance matrix to pick the nearest alternative.
- Slab allocator: Per-node partial slab lists. Frequently allocated objects (inodes, dentries, socket buffers, capability entries) are served from node-local slabs, avoiding cross-node cache line bouncing on the hot allocation/free paths.
- Per-CPU caches: Already NUMA-local by construction (each CPU's cache draws from its node's buddy allocator). No additional logic needed.
NUMA Balancing and Isolation Domain Memory
The NUMA scanner (automatic NUMA balancing above) must respect hardware isolation domain boundaries (Section 6, Tier 1 isolation):
- Tier 1 driver memory: Pages mapped in a Tier 1 driver's isolation domain are tagged with the domain's protection key (MPK PKEY, ARM POE key, etc.). The NUMA scanner skips pages whose protection key does not match the current process's default domain — it will not clear the present bit on driver-private pages, because the resulting NUMA fault would fire in the wrong protection domain. Tier 1 driver memory is migrated only when the driver explicitly requests it (e.g., via a KABI call) or when the driver's domain is active on the faulting CPU.
- DMA buffers: Pages marked with `PG_dma_pinned` (allocated via the DMA API, Section 7.3.7) are unconditionally exempt from NUMA migration. Moving a DMA buffer while a device holds its physical address would cause DMA to the wrong location. The NUMA scanner checks this flag before clearing the present bit and skips pinned pages entirely.
- Kernel-internal per-CPU structures: Per-CPU run queues, slab magazines, and `PerCpu` slots are allocated on their home node at boot and are never candidates for NUMA migration (they have no user-space PTE to scan).
Memory Tier Classification
ISLE classifies memory sources by semantic type via the TierKind enum. This allows
code to reference memory tiers by purpose rather than by ordinal position (which shifts
as tiers are added or removed, e.g., when distributed mode introduces remote DRAM or
CXL-attached memory — see Section 47.6.1).
/// Semantic classification of a memory tier.
/// Used by the memory subsystem to identify tier types independent of their
/// ordinal position in the latency hierarchy.
pub enum TierKind {
/// Local DRAM on the same NUMA node as the accessing CPU.
LocalDram,
/// Remote DRAM on a different NUMA node (same physical machine).
RemoteDram,
/// CXL-attached memory (Type 2 or Type 3 devices).
CxlMem,
/// GPU VRAM (accessible via BAR or unified memory).
GpuVram,
/// Persistent memory (NVDIMM, Intel Optane DCPMM, CXL PMEM).
PersistentMem,
/// Remote node DRAM via DSM (distributed shared memory, Section 47.5).
DsmRemote,
/// Compressed in-memory pages (zswap/zram tier).
Compressed,
/// Local swap (block device backed).
Swap,
}
The memory subsystem maintains a runtime mapping from TierKind to the current ordinal
tier number. Code that needs to compare tier performance uses TierKind and queries
mem::tier_ordinal(kind) rather than hardcoding numeric values.
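A minimal model of that runtime mapping, assuming a hypothetical `TierMap` type (the real kernel's `mem::tier_ordinal` is the lookup described above; the `Vec`-backed table here is purely illustrative):

```rust
// Sketch of the TierKind -> ordinal mapping: the map is rebuilt when
// tiers appear or vanish (e.g., CXL hotplug), so code comparing tier
// performance never hardcodes ordinal positions.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum TierKind {
    LocalDram,
    RemoteDram,
    CxlMem,
    Compressed,
    Swap,
}

struct TierMap {
    /// Tiers present on this machine, ordered fastest-first.
    order: Vec<TierKind>,
}

impl TierMap {
    /// Ordinal position in the latency hierarchy (0 = fastest), or None
    /// if the tier is absent on this machine.
    fn tier_ordinal(&self, kind: TierKind) -> Option<usize> {
        self.order.iter().position(|&k| k == kind)
    }
}
```

Note how the ordinal of `Swap` shifts when a CXL tier is inserted, while code keyed on `TierKind::Swap` keeps working unchanged.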
12a. Process and Task Management
This section defines how ISLE represents runnable entities, creates and destroys processes, loads programs, and manages the virtual address space operations that user space relies on. The scheduler that decides when tasks run is in Section 14; this section covers what tasks are and how they are born, transformed, and reaped.
12a.1 Task Model
ISLE uses task as the schedulable unit. Tasks are grouped into processes that share
an address space, capability table, and file descriptor table. This mirrors the Linux
thread-group model: a "process" is a collection of tasks that share CLONE_VM |
CLONE_FILES | CLONE_SIGHAND, and a "thread" is simply another task within the same
process.
Task descriptor:
pub struct Task {
/// Kernel-unique task identifier.
pub tid: TaskId,
/// Owning process (shared with sibling tasks).
pub process: Arc<Process>,
/// Scheduling state machine.
pub state: TaskState, // Running, Runnable, Blocked, Zombie
/// Scheduler bookkeeping (vruntime, deadline params, etc.).
pub sched_entity: SchedEntity,
/// Which CPUs this task may run on.
pub cpu_affinity: CpuSet,
/// Blocked and pending signal masks.
pub signal_mask: SignalSet,
/// Saved architectural registers (Section 14.6).
pub context: ArchContext,
/// Per-task capability restrictions.
pub capabilities: CapHandle,
/// Embedded futex waiter node (Section 20a.1). A task can block on at
/// most one futex at a time, so a single embedded node is sufficient.
/// Linked into a FutexHashTable bucket when the task calls futex_wait;
/// unlinked on wake or timeout. Uses intrusive linking to avoid heap
/// allocation under the bucket spinlock.
///
/// **Task exit safety**: When a task exits (including SIGKILL) while
/// linked into a futex bucket, the exit path (`do_exit`) must unlink
/// this node before freeing the Task struct. The cleanup sequence is:
/// 1. Acquire the futex bucket spinlock for this waiter's bucket.
/// 2. If `futex_waiter.is_linked()`, remove the node from the bucket
/// list (O(1) for intrusive doubly-linked list).
/// 3. Release the bucket spinlock.
/// This runs before address space teardown (step 3 of `do_exit`) and
/// before the Task struct is freed. The bucket spinlock serializes
/// against concurrent `futex_wake()` operations — a wakeup that races
/// with task exit will either see the node (and wake it, which is
/// harmless for a dying task) or see it already removed.
pub futex_waiter: FutexWaiter,
}
pub struct Process {
/// Kernel-unique process identifier.
pub pid: ProcessId,
/// Page tables and VMA tree (Section 12.4).
pub address_space: AddressSpace,
/// Process-wide capability table (Section 11.1).
pub cap_table: CapSpace,
/// Open file descriptor table.
pub fd_table: FdTable,
/// Parent process (None for init).
pub parent: Option<ProcessId>,
/// Children of this process. Intrusive doubly-linked list provides O(1)
/// insertion at fork() and O(1) removal at child exit. Each Process embeds
/// the sibling link node, avoiding per-child heap allocation.
pub children: IntrusiveList<ProcessId>,
/// All tasks sharing this address space.
pub thread_group: ThreadGroup,
/// Namespace membership (Section 12a.6).
pub namespaces: NamespaceSet,
/// uid, gid, supplementary groups (Section 11.2).
pub cred: Credentials,
}
Each task holds its own CapHandle that can further restrict the process-wide
CapSpace but never widen it. This allows individual threads to voluntarily drop
privileges -- for example, a worker thread that processes untrusted input can shed
network capabilities before entering its main loop.
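The narrow-but-never-widen rule can be modeled as pure set intersection. This is a toy bitmask model, not ISLE's capability representation: `CapSet`, `CAP_NET`, and `CAP_FS` are hypothetical names for this example.

```rust
// Toy model of per-task capability narrowing: restrict() intersects,
// so the result is always a subset of the starting set. There is no
// inverse operation -- a task cannot re-grant itself anything the
// process-wide set did not already hold.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct CapSet(u64);

const CAP_NET: CapSet = CapSet(1 << 0);
const CAP_FS: CapSet = CapSet(1 << 1);

impl CapSet {
    fn union(self, other: CapSet) -> CapSet { CapSet(self.0 | other.0) }
    fn contains(self, other: CapSet) -> bool { self.0 & other.0 == other.0 }
    /// Drop capabilities: keep only the intersection with `keep`.
    fn restrict(self, keep: CapSet) -> CapSet { CapSet(self.0 & keep.0) }
}
```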
12a.2 Process Creation
Linux problem: fork() copies the entire process state -- page tables, file
descriptor table, signal handlers, credentials -- then the child almost always
immediately calls exec(), discarding everything that was just copied. The clone()
syscall provides fine-grained control via a combinatorial flag space (CLONE_VM,
CLONE_FILES, CLONE_FS, CLONE_SIGHAND, CLONE_NEWPID, ...) that is powerful but
difficult to use correctly.
ISLE native model: Capability-based spawn(). A new process is created with an
explicit set of capabilities, an address space, and an entry point. Nothing is inherited
implicitly -- the parent must grant each resource (memory regions, file descriptors,
capabilities) that the child should receive. This makes the child's authority set visible
and auditable at creation time.
pub struct SpawnArgs {
/// ELF binary or entry point address.
pub entry: EntrySpec,
/// Capabilities to grant to the child (subset of caller's CapSpace).
pub granted_caps: Vec<CapHandle>,
/// File descriptors to pass (remapped into child's fd table).
pub fds: Vec<(Fd, Fd)>,
/// Initial address space configuration.
pub address_space: AddressSpaceSpec,
/// CPU affinity for the initial task.
pub cpu_affinity: CpuSet,
/// Namespace set (inherit parent's or create new).
pub namespaces: NamespaceSpec,
}
Linux compatibility: fork() and clone() are implemented in the compat layer
(Section 20) by translating to the underlying task/process primitives:
- `fork()` = `clone(SIGCHLD)` = COW address space copy + fd table copy + signal handler copy. Page table entries are marked read-only and reference-counted; the actual page copy is deferred to the write-fault handler (Section 12.4).
- `clone(CLONE_VM | CLONE_FILES | ...)` = create a new task within the same process (i.e., a thread). No address space copy, no fd table copy.
- `vfork()` = the parent blocks until the child calls `exec()` or `_exit()`. The child temporarily shares the parent's address space (no COW overhead). Implemented by setting a completion flag that the parent waits on.
- `clone3()` = the modern extensible version. Supported with the same `struct clone_args` layout as Linux 5.3+.
12a.3 Program Execution (exec)
execve() replaces the current task's address space with a new program image. The
previous mappings, signal dispositions, and pending signals are discarded; the file
descriptor table is preserved (minus CLOEXEC descriptors).
ELF loading sequence:
1. Parse the ELF header and program headers from the file.
2. Verify the ELF machine type matches the running architecture.
3. For each `PT_LOAD` segment: create a VMA with the specified permissions and map the file region (demand-paged via the page cache).
4. If an ELF interpreter is specified (`PT_INTERP`), map it as well. This is typically `ld-linux-x86-64.so.2` or `ld-musl-x86_64.so.1`.
5. Allocate a new user stack. Push `auxv` (auxiliary vector), `envp`, and `argv` onto the stack in the standard layout expected by the C runtime.
6. Set the instruction pointer to the interpreter's entry point (or the binary's entry point if statically linked).
7. Return to user space.
Capability grants on exec replace the traditional setuid/setgid mechanism
(Section 11.5). Instead of running the new program as a different UID, the kernel
consults a per-binary capability grant table and adds the specified capabilities to the
task's CapHandle. The process never gains uid 0 -- it gains precisely the capabilities
the binary needs (e.g., CAP_NET_BIND_SERVICE for a web server on port 80).
Security cleanup on exec:
- File descriptors with `CLOEXEC` are closed.
- Signal dispositions are reset to `SIG_DFL` (except `SIG_IGN`).
- Pending signals are cleared.
- The process dumpable flag is re-evaluated (non-dumpable if capabilities were gained).
- Address space layout randomization (ASLR) re-randomizes all base addresses.
12a.4 Task Exit and Resource Cleanup
A task exits via exit() (single task) or exit_group() (all tasks in the process).
The latter is what the C library's exit() actually calls.
Cleanup order for a single-task exit:
- Cancel pending asynchronous I/O (io_uring SQEs, AIO requests).
- If this is the last task in the process, proceed to process cleanup (below). Otherwise, release per-task resources (stack, `ArchContext`) and remove the task from the thread group.
Process cleanup (when the last task exits):
- Close all file descriptors in the fd table.
- Release all capabilities in the `CapSpace`.
- Tear down the address space: unmap all VMAs, release page table pages, decrement page reference counts.
- Deliver `SIGCHLD` to the parent process.
- Reparent children: any surviving child processes are reparented to the nearest subreaper (a process that set `PR_SET_CHILD_SUBREAPER`) or to init (pid 1).
- Transition to the zombie state. The task remains in the task table with its exit status until the parent calls `wait()`/`waitpid()`/`waitid()`.
Zombie reaping: The zombie consumes only a small task-table slot (no address space,
no fd table, no capabilities). The parent retrieves the exit status and resource usage
via wait4() or waitid(), which frees the slot. If the parent exits without reaping,
the reparented-to ancestor (init or subreaper) is responsible.
Session and process group lifecycle: When a session leader exits, SIGHUP is
delivered to the foreground process group of the controlling terminal. The terminal is
disassociated from the session. This matches POSIX semantics required by sshd,
tmux, and shell job control.
12a.5 Address Space Operations
User space manipulates its address space through these syscalls, all of which go through
capability checks on the calling process's CapSpace:
- `mmap()`/`munmap()`: Create and destroy virtual memory regions. Anonymous mappings allocate from the physical allocator on demand (Section 12.1). File-backed mappings go through the page cache (Section 12.3). `MAP_SHARED` mappings are backed by a shared page cache entry; `MAP_PRIVATE` mappings COW on write.
- `mprotect()`: Change page permissions on an existing mapping. On x86-64, this integrates with hardware domain isolation (Section 3): Tier 1 driver memory regions can be made accessible only when the driver's protection key is active.
- `brk()`/`sbrk()`: Legacy heap expansion interface. Supported for compatibility with applications that do not use `mmap`-based allocators. Implemented as a resizable anonymous VMA at the process's break address.
Capability-mediated memory sharing provides a secure alternative to POSIX shared
memory. Instead of a global namespace (/dev/shm/name), memory regions are shared by
explicitly granting a capability to the target process:
/// Grant access to a memory region to another process.
/// Returns a transferable capability handle.
pub fn mem_grant(
target: ProcessId,
region: VmaId,
perms: Permissions,
) -> Result<CapHandle>;
/// Map a previously granted region into the current address space.
/// The grant capability is consumed (single-use) or retained
/// depending on the grant's delegation policy.
pub fn mem_map(
grant: CapHandle,
hint_addr: Option<usize>,
) -> Result<*mut u8>;
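A toy model of this grant/map flow, illustrating the single-use delegation policy. All names and the `HashMap`-backed table are hypothetical — the real kernel stores grants in the capability table (Section 11.1) with generation-counter revocation.

```rust
// Toy grant table: mem_map consumes a single-use grant on first map;
// a retained grant survives until explicitly revoked.
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct CapHandle(u64);

struct Grant {
    region: u64,    // VmaId stand-in
    retained: bool, // false = single-use: consumed on first mem_map
}

struct GrantTable {
    next: u64,
    grants: HashMap<CapHandle, Grant>,
}

impl GrantTable {
    fn new() -> Self { GrantTable { next: 1, grants: HashMap::new() } }

    /// Grant access to `region`; returns the transferable handle.
    fn mem_grant(&mut self, region: u64, retained: bool) -> CapHandle {
        let h = CapHandle(self.next);
        self.next += 1;
        self.grants.insert(h, Grant { region, retained });
        h
    }

    /// Map a granted region; Err(()) if the handle is invalid or revoked.
    fn mem_map(&mut self, grant: CapHandle) -> Result<u64, ()> {
        let g = self.grants.get(&grant).ok_or(())?;
        let region = g.region;
        if !g.retained {
            self.grants.remove(&grant); // single-use: a second map fails
        }
        Ok(region)
    }

    /// Revoke: the target's next mem_map (or access) fails.
    fn revoke(&mut self, grant: CapHandle) {
        self.grants.remove(&grant);
    }
}
```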
This model has several advantages over POSIX shm_open:
- No global namespace: Shared regions are not visible to unrelated processes.
- Fine-grained permissions: The granter specifies read, write, or execute -- not just "owner/group/other" file modes.
- Revocable: The granter can revoke the capability (Section 11.1 generation counter), and the next access by the target faults.
- Auditable: Every grant and map operation flows through capability checks and can be logged.
POSIX shared memory (shm_open / mmap MAP_SHARED) is implemented on top of this
mechanism via the compat layer, with a tmpfs-backed /dev/shm namespace that
translates names to capability grants.
12a.6 Namespaces
Each process belongs to a set of namespaces that isolate its view of system resources. ISLE implements all 8 Linux namespace types (see Section 20.4 for the full table):
| Namespace | Isolates | Key syscall flags |
|---|---|---|
| `pid` | Process ID space | `CLONE_NEWPID` |
| `net` | Network stack, interfaces, routes | `CLONE_NEWNET` |
| `mnt` | Mount table | `CLONE_NEWNS` |
| `user` | UID/GID mappings | `CLONE_NEWUSER` |
| `uts` | Hostname, domainname | `CLONE_NEWUTS` |
| `ipc` | SysV IPC, POSIX message queues | `CLONE_NEWIPC` |
| `cgroup` | Cgroup root directory | `CLONE_NEWCGROUP` |
| `time` | `CLOCK_MONOTONIC` / `CLOCK_BOOTTIME` offsets | `CLONE_NEWTIME` |
Capability gating: Creating a new namespace requires the appropriate capability --
the ISLE equivalent of CAP_SYS_ADMIN (for most namespaces) or an unprivileged
CLONE_NEWUSER followed by capabilities within the new user namespace. This matches
Linux semantics so that rootless containers (Podman, Docker rootless mode) work
unmodified.
Process creation integration: clone3() and the native spawn() both accept a
NamespaceSpec that specifies whether each namespace is inherited from the parent or
freshly created. Namespaces are reference-counted; they are destroyed when the last
process in the namespace exits and no external references (bind mounts, open
/proc/[pid]/ns/* file descriptors) remain.
See also: Section 20.4 provides the full namespace implementation details and container runtime compatibility requirements.
12a.7 User-Mode Scheduling (Fibers and M:N Threading)
ISLE provides an opt-in scheduler upcall mechanism that enables userspace libraries to implement M:N threading — multiplexing many lightweight fibers (cooperative coroutines) onto fewer OS threads, with correct behaviour when a fiber blocks in a syscall. This is the only kernel-level primitive needed for fibers; the fiber context switch itself (saving/restoring registers, swapping stacks) is purely a userspace library operation and requires no syscall.
This feature is native-ISLE-only. It does not exist on Linux. Applications that use it are not portable to Linux without a shim. Existing Linux-compatible applications that do not opt in are completely unaffected.
12a.7.1 Motivation
A fiber (cooperative coroutine, user-mode thread) is a save/restore of the integer and FPU register state plus a stack pointer swap. No kernel involvement is needed for the switch itself. The hard problem is a fiber calling a blocking syscall: without kernel cooperation, the entire OS thread blocks, starving all other fibers running on it.
Three approaches exist:
1. Async-only I/O (restrict fibers to io_uring/epoll): Works, but
requires all callees to be async-aware. Incompatible with legacy synchronous
code.
2. One OS thread per fiber (1:1): Works, but eliminates the efficiency
advantage of fibers and limits parallelism to the thread count.
3. Scheduler upcalls (this design): The kernel calls a registered userspace
function before blocking, allowing the fiber scheduler to park the current
fiber and immediately run another. The OS thread never actually blocks while
runnable fibers exist.
This is the scheduler activations model (Anderson et al., SOSP 1991),
implemented in Solaris LWPs, early NetBSD, and macOS pthreads internals. It
is the correct kernel primitive for M:N scheduling.
12a.7.2 Scheduler Upcall Registration
A thread registers an upcall handler via a new ISLE syscall:
/// Register a scheduler upcall handler for the calling thread.
///
/// When the calling thread is about to enter a blocking state (blocking
/// syscall, futex wait, page fault that requires I/O), the kernel saves
/// the thread's full register state into `upcall_stack_top - sizeof(UpcallFrame)`
/// and transfers control to `handler`.
///
/// The handler runs on `upcall_stack` (a separate dedicated stack of
/// `upcall_stack_size` bytes) to avoid corrupting the fiber's stack.
///
/// # Arguments
/// - `handler`: Upcall entry point (see UpcallFrame below).
/// - `upcall_stack`: Userspace-allocated stack for upcall execution.
/// - `upcall_stack_size`: Size of that stack in bytes (minimum 8 KiB).
///
/// # Returns
/// `Ok(())` on success. `EINVAL` if `upcall_stack` is not mapped writable
/// or `upcall_stack_size` is below the minimum.
SYS_register_scheduler_upcall(
handler: extern "C" fn(*mut UpcallFrame),
upcall_stack: *mut u8,
upcall_stack_size: usize,
) -> Result<()>;
/// Deregister the upcall handler. Thread reverts to standard 1:1 blocking.
SYS_deregister_scheduler_upcall() -> Result<()>;
Re-entrancy protection: While the upcall handler is executing, the kernel sets an
internal in_upcall flag on the thread. If a blocking condition occurs during upcall
execution (e.g., the handler takes a page fault while reading its run queue), the
kernel does not issue a nested upcall — doing so would overwrite the UpcallFrame
at the top of the upcall stack, destroying the original fiber's saved state. Instead,
the kernel handles the fault synchronously by blocking the OS thread directly (standard
1:1 blocking), exactly as if no upcall handler were registered. The in_upcall flag
is cleared when the handler calls SYS_scheduler_upcall_resume() or
SYS_scheduler_upcall_block(). This is analogous to how POSIX signal handlers mask
the same signal during delivery to prevent re-entrant corruption.
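The re-entrancy rule above can be captured as a small state machine. The sketch below (all names hypothetical, not kernel code) models the kernel's per-thread decision: deliver an upcall only when a handler is registered and no upcall is already in flight.

```rust
/// Hypothetical sketch of the per-thread upcall state and the decision
/// the kernel makes when the thread hits a blocking condition.
#[derive(Debug, PartialEq)]
pub enum BlockAction {
    /// Save registers into an UpcallFrame and jump to the handler.
    DeliverUpcall,
    /// Standard 1:1 blocking: park the OS thread in the kernel.
    BlockThread,
}

pub struct UpcallState {
    handler_registered: bool,
    in_upcall: bool, // set while the handler runs, cleared on resume/block
}

impl UpcallState {
    pub fn new() -> Self {
        Self { handler_registered: false, in_upcall: false }
    }

    /// SYS_register_scheduler_upcall.
    pub fn register(&mut self) {
        self.handler_registered = true;
    }

    /// Called when the thread is about to enter a blocking state.
    pub fn on_blocking_condition(&mut self) -> BlockAction {
        if self.handler_registered && !self.in_upcall {
            self.in_upcall = true;
            BlockAction::DeliverUpcall
        } else {
            // Nested upcalls are never issued: they would overwrite the
            // UpcallFrame at the top of the upcall stack.
            BlockAction::BlockThread
        }
    }

    /// SYS_scheduler_upcall_resume / SYS_scheduler_upcall_block.
    pub fn on_upcall_exit(&mut self) {
        self.in_upcall = false;
    }
}
```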
/// Saved register state of the fiber that triggered the upcall.
/// Passed by pointer to the upcall handler; the fiber is resumed by
/// restoring these registers (see SYS_scheduler_upcall_resume).
#[repr(C)]
pub struct UpcallFrame {
/// Saved general-purpose registers (architecture-specific layout).
pub regs: ArchRegs,
/// Why the fiber is blocking.
pub reason: BlockReason,
/// Opaque kernel handle — pass back to SYS_scheduler_upcall_resume
/// or SYS_scheduler_upcall_block.
pub fiber_token: u64,
}
#[repr(u32)]
pub enum BlockReason {
/// Entering a blocking syscall (e.g., read, write, futex_wait).
BlockingSyscall = 1,
/// Page fault requiring disk I/O (demand paging).
PageFaultIo = 2,
/// Waiting for a kernel lock (unlikely; most kernel waits are brief).
KernelLock = 3,
}
12a.7.3 Upcall Handler Flow
Fiber A calls read(fd, buf, len) → would block
↓
Kernel saves Fiber A registers into UpcallFrame on upcall stack
↓
Kernel transfers control to handler(frame) on upcall stack
↓
Handler (fiber scheduler):
- Parks Fiber A: stores frame->fiber_token, records Fiber A as "blocked on read"
- Submits read to io_uring for non-blocking completion
- Picks Fiber B from the run queue
- Calls SYS_scheduler_upcall_resume(fiber_b_frame) to restore Fiber B
↓
Fiber B runs on the OS thread
↓
io_uring completion arrives → event loop wakes Fiber A
- Handler receives io_uring completion
- Reconstructs Fiber A's UpcallFrame with the result
- Calls SYS_scheduler_upcall_resume(fiber_a_frame) to restore Fiber A
↓
Fiber A resumes with read() returning the result
Two new syscalls control fiber resumption:
/// Restore a fiber that was parked by an upcall.
/// Restores the registers from `frame` and returns to the fiber's PC.
/// The `result` value is placed in the return register (rax / x0 / a0).
/// This call never returns to the caller — control goes to the fiber.
SYS_scheduler_upcall_resume(frame: *const UpcallFrame, result: i64) -> !;
/// Tell the kernel it is safe to block the OS thread now.
/// Used when all fibers are waiting and there is nothing to run.
/// The thread blocks until a previously registered io_uring completion,
/// futex wake, or signal arrives. This call never returns to the caller;
/// the wakeup is delivered as a fresh upcall carrying the newly
/// unblocked fiber.
SYS_scheduler_upcall_block() -> !;
12a.7.4 Interaction with io_uring
For BlockingSyscall upcalls, the handler typically converts the blocking
operation to a non-blocking io_uring submission (IORING_OP_READ,
IORING_OP_WRITE, IORING_OP_FUTEX_WAIT, etc.) and calls
SYS_scheduler_upcall_block() when the run queue is empty. The io_uring
completion ring provides the wakeup. This combination fully replaces the
blocking syscall with an async equivalent, transparent to the fiber.
For PageFaultIo upcalls, the handler typically has no alternative — a page
must be faulted in from disk. The handler parks the faulting fiber and runs
others, then calls SYS_scheduler_upcall_block() until a wakeup arrives.
Page fault completion wakeup: When the kernel completes the I/O for a demand
page fault, it posts a synthetic completion event to the thread's registered
io_uring completion queue (if present) or writes an 8-byte counter increment
to the thread's registered eventfd (if configured via
SYS_register_scheduler_upcall). The completion event carries the
fiber_token of the faulting fiber, allowing the handler to identify which
parked fiber is now runnable. If neither io_uring nor eventfd is registered, the
kernel wakes the thread from SYS_scheduler_upcall_block() directly and issues
a new upcall with reason = PageFaultIo and the original fiber_token, allowing
the handler to resume the fiber.
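On the handler side, the bookkeeping implied above is a map from fiber_token to parked fiber. A minimal userspace sketch (names hypothetical; a &str stands in for a real Fiber object):

```rust
use std::collections::HashMap;

/// Sketch of handler-side fiber parking: the upcall handler records each
/// blocked fiber under its kernel-issued fiber_token, then looks it up when
/// a completion event (io_uring CQE or synthetic page-fault completion)
/// carries that token back.
pub struct ParkedFibers {
    parked: HashMap<u64, &'static str>,
}

impl ParkedFibers {
    pub fn new() -> Self {
        Self { parked: HashMap::new() }
    }

    /// Called from the upcall handler when a fiber blocks.
    pub fn park(&mut self, fiber_token: u64, fiber: &'static str) {
        self.parked.insert(fiber_token, fiber);
    }

    /// Called when a completion carrying `fiber_token` arrives. Returns the
    /// fiber to hand to SYS_scheduler_upcall_resume, if the token is known.
    pub fn wake(&mut self, fiber_token: u64) -> Option<&'static str> {
        self.parked.remove(&fiber_token)
    }
}
```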
12a.7.5 Fiber Library Design
ISLE ships a userspace fiber library (isle-fiber) in the standard library.
The library provides:
// isle-fiber (userspace library, not kernel code)
pub struct Fiber { /* stack, register save area, FLS slot table */ }
pub struct FiberScheduler { /* per-OS-thread run queue, upcall stack */ }
impl FiberScheduler {
/// Initialize the scheduler on the current OS thread.
/// Allocates an upcall stack and calls SYS_register_scheduler_upcall.
pub fn init() -> Self;
/// Create a fiber with the given entry point and stack size.
pub fn spawn(&self, f: impl FnOnce() + 'static, stack_size: usize) -> FiberId;
/// Cooperatively yield to the next runnable fiber.
/// If no other fiber is runnable, returns immediately.
pub fn yield_now(&self);
/// Run the scheduler loop. Returns when all fibers have completed.
pub fn run(&mut self);
}
Fiber Local Storage (FLS): Each Fiber has a private FLS table (array of
*mut () slots, analogous to Windows FlsAlloc/FlsGetValue). The library
swaps the FLS table pointer on every SwitchToFiber — no kernel involvement.
Thread-local storage (#[thread_local]) continues to work as normal and is
shared across all fibers on the same OS thread (matching Windows TLS semantics;
FLS is distinct).
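The FLS table swap can be illustrated in plain Rust. This sketch (assumed layout; slot values are modeled as usize rather than *mut ()) shows why the swap needs no kernel involvement — switching fibers only repoints which table is current:

```rust
/// Sketch of Fiber Local Storage: each fiber owns a private slot table;
/// switching fibers swaps which table is "current".
const FLS_SLOTS: usize = 8;

pub struct FiberFls {
    slots: [usize; FLS_SLOTS],
}

pub struct FlsContext {
    current: usize,        // index of the running fiber
    tables: Vec<FiberFls>, // one slot table per fiber
}

impl FlsContext {
    pub fn new(fibers: usize) -> Self {
        Self {
            current: 0,
            tables: (0..fibers).map(|_| FiberFls { slots: [0; FLS_SLOTS] }).collect(),
        }
    }

    /// FlsSetValue: writes into the *current* fiber's table only.
    pub fn set(&mut self, slot: usize, value: usize) {
        self.tables[self.current].slots[slot] = value;
    }

    /// FlsGetValue.
    pub fn get(&self, slot: usize) -> usize {
        self.tables[self.current].slots[slot]
    }

    /// SwitchToFiber: pure userspace, just repoint the current table.
    pub fn switch_to(&mut self, fiber: usize) {
        self.current = fiber;
    }
}
```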
12a.7.6 WEA Integration
Windows Fiber support in WEA (§20c.5, 05-linux-compat.md) maps directly onto this mechanism:
- `ConvertThreadToFiber()` → `FiberScheduler::init()` on the calling thread.
- `CreateFiber(size, fn, param)` → `FiberScheduler::spawn(...)`.
- `SwitchToFiber(fiber)` → cooperative yield to a specific fiber; pure userspace register swap, no syscall.
- `FlsAlloc`/`FlsGetValue`/`FlsSetValue` → read/write into the current fiber's FLS slot table; implemented in ntdll by WINE, no WEA syscall needed.
- Blocking syscall inside a fiber → scheduler upcall converts to io_uring; the OS thread runs other fibers while waiting.
The TEB NtTib.FiberData field is updated by WINE on every SwitchToFiber
call (userspace write to the TEB in user address space). The kernel's role is
only to provide the fast NtCurrentTeb() path via the per-thread GS base
mapping (§20c.5, 05-linux-compat.md) and the scheduler upcall mechanism above.
13. Memory Compression Tier
Inspired by: macOS (2013), Windows 10 (2015), Linux zswap/zram. IP status: Clean — academic concept from the 1990s, BSD-licensed algorithms.
13.1 Problem
When the system is under memory pressure, the page reclaim path must free pages. The options in Section 12.3 are:
- Evict clean page cache pages (free, but re-read from disk on next access)
- Write dirty pages to swap (expensive: NVMe ~10us per 4KB, HDD ~5ms)
There is a third option, cheaper than swap: compress the page in memory. Modern CPUs running LZ4 compress a 4KB page in ~1-2 microseconds. If the page compresses to <2KB (typical for most workloads), you've freed more than 2KB without any I/O. Decompression on access is ~0.5 microseconds.
This is 5-10x faster than NVMe swap and 1000x faster than HDD swap.
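Those figures can be turned into a checkable back-of-envelope comparison. The constants below use this section's quoted latencies (midpoints assumed where the text gives a range):

```rust
/// Latencies quoted in this section, in nanoseconds per 4 KB page.
pub const COMPRESSED_FAULT_NS: u64 = 2_000; // alloc + LZ4 decompress + TLB update
pub const NVME_SWAP_READ_NS: u64 = 10_000;  // ~10 us
pub const HDD_SWAP_READ_NS: u64 = 5_000_000; // ~5 ms

/// How many times faster restoring from the compressed tier is than a
/// swap read from the given device.
pub fn speedup_vs(swap_read_ns: u64) -> u64 {
    swap_read_ns / COMPRESSED_FAULT_NS
}
```

With the conservative 2 µs fault cost this reproduces the low end of the "5-10x faster than NVMe" claim; the HDD speedup is three orders of magnitude.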
13.2 Architecture
Insert a compressed tier between the LRU inactive list and swap:
Per-CPU Page Caches (hot path)
|
v
Per-NUMA Buddy Allocator
|
v
Page Cache (RCU radix tree)
|
v
LRU Active List --evict--> LRU Inactive List
|
+-----------+-----------+
| |
[compress] [swap out]
| (existing)
v
Compressed Pool Swap Device
(zpool in memory) (NVMe/HDD)
|
[decompress]
|
v
Page restored
to active LRU
13.3 Compressed Page Pool
// Kernel-internal (isle-core/src/mem/zpool.rs)
/// A pool of compressed pages stored in contiguous memory.
pub struct ZPool {
/// Backing memory regions (allocated from buddy allocator).
/// Each region is a contiguous allocation (order 4-6, 64KB-256KB).
regions: Vec<ZPoolRegion>,
/// Index: maps (address_space_id, page_offset) -> compressed location.
/// Pre-allocated open-addressed hash table with linear probing and a fixed
/// capacity determined at init time: `capacity = max_size / PAGE_SIZE * 5 / 3`
/// (load factor ~0.60 target). The table is backed by a virtually-contiguous
/// allocation (`vmalloc`-style: contiguous in kernel virtual address space,
/// physically scattered across individual pages) obtained during `ZPool::init()`
/// before memory pressure exists. On a 256 GB server with `max_pool_percent=25`,
/// the table may reach ~28M entries (~900 MB) — far exceeding the buddy
/// allocator's maximum contiguous order (4 MB). Using vmalloc avoids this
/// constraint while keeping O(1) index arithmetic via contiguous virtual
/// addresses. The table never grows or reallocates — under severe memory
/// pressure (the exact scenario where zpool is most active), a growable HashMap
/// would fail to resize, causing the compression tier to fail precisely when
/// it is needed most.
///
/// **Performance note**: Linear probing degrades rapidly above ~70% load factor
/// due to primary clustering. The 60% load factor target (capacity = max_entries × 5/3)
/// ensures the table stays well below the degradation threshold even at maximum
/// occupancy, keeping O(1) amortized lookups without runtime resizing. For hot-path hash tables (e.g., PFN
/// lookup), use Robin Hood hashing or cuckoo hashing for better worst-case
/// behavior. The ZPool index tolerates linear probing because lookups are on
/// the page-fault path (~1-2μs total), where the hash table probe cost is a
/// small fraction of the overall decompression latency.
///
/// **Capacity exhaustion handling**: When the hash table reaches capacity
/// (which happens before ZPool reaches max_pool_percent whenever the average
/// compression ratio exceeds the 5/3 sizing factor, since well-compressing
/// pages pack more than one entry per PAGE_SIZE of pool), the compression
/// path rejects new entries with `ZPoolError::IndexFull` and the page is
/// sent directly to swap. This is intentional: capping occupancy keeps index
/// operations fast even when ZPool is at its configured memory limit. Batch
/// compression operations that would temporarily exceed index capacity are
/// throttled by the page reclaim path, which stops compressing when
/// `pages_stored >= index_capacity * 3 / 4`.
index: FixedHashTable<CompressedPageKey, CompressedEntry>,
/// Free space tracker (first-fit allocator within regions).
free_list: FreeList,
/// Statistics.
stats: ZPoolStats,
/// Compression algorithm (compile-time selected).
/// LZ4 is default: ~1-2us compress, ~0.5us decompress per 4KB.
algorithm: CompressionAlgorithm,
/// Maximum pool size (fraction of total RAM, default 25%).
max_size: usize,
/// Current pool size.
current_size: AtomicUsize,
/// NUMA-aware: one ZPool per NUMA node.
numa_node: NumaNodeId,
}
#[derive(Clone, Copy)]
pub struct CompressedEntry {
/// Region index within the ZPool. u32 supports up to 4 billion regions;
/// with 256KB regions, that is ~1 PB of ZPool capacity — well beyond any
/// foreseeable server. (u16 would limit to 65536 × 256KB = 16GB, which
/// overflows on servers with >64GB RAM at 25% ZPool allocation.)
region: u32,
/// Offset within region (max 256KB region, so u32 is generous).
offset: u32,
/// Compressed size in bytes (at the default 1.5x threshold, pages exceeding
/// ~2730 bytes are rejected; the field itself is u16, max 65535,
/// accommodating any compression policy).
compressed_size: u16,
/// Original page checksum (CRC32C, for integrity verification on
/// decompression). CRC32C is chosen because it is cheap on all supported
/// architectures: x86 SSE4.2 `crc32` instruction, ARMv8 CRC extension, and
/// on RISC-V a carry-less-multiply (Zbc) or table-driven software fallback
/// (RISC-V defines no dedicated CRC instruction).
/// A 32-bit checksum has a ~1-in-4-billion collision probability per
/// page. For a system with 1 million compressed pages, the expected
/// number of undetected corruptions is ~0.0002 — acceptable for
/// compressed swap (where corruption causes a single process crash,
/// not kernel-wide data loss). If stronger integrity is needed, the
/// storage layer provides per-block checksums (Section 27a).
checksum: u32,
}
pub struct ZPoolStats {
/// Pages stored in compressed form.
pub pages_stored: AtomicU64,
/// Total compressed bytes (sum of compressed sizes).
pub compressed_bytes: AtomicU64,
/// Total original bytes (pages_stored * PAGE_SIZE).
pub original_bytes: AtomicU64,
/// (Derived metric, not a stored field: compression ratio =
/// original_bytes / compressed_bytes. Typical: 2.5-4.0x for most workloads.)
/// Pages rejected (incompressible — compression ratio < 1.5x).
pub pages_rejected: AtomicU64,
/// Pages written back to swap (pool full or evicted from pool).
pub pages_writeback: AtomicU64,
/// Decompressions (page faults on compressed pages).
pub decompressions: AtomicU64,
}
13.4 Compression Policy
Not every page should be compressed. The policy:
pub struct CompressionPolicy {
/// Minimum compression ratio to accept a page, encoded as
/// ratio * 100 (fixed-point to avoid FPU use in kernel).
/// Default: 150 (meaning 1.5x). A 4KB page must compress to
/// fewer than PAGE_SIZE * 100 / min_ratio_x100 = 2730 bytes
/// to be accepted. Pages that compress worse go directly to swap.
pub min_ratio_x100: u32,
/// Maximum zpool size as percentage of total RAM (default 25).
pub max_pool_percent: u32,
/// When zpool is full, evict oldest compressed pages to swap.
/// This is the "writeback" path.
pub writeback_on_full: bool,
/// Page types eligible for compression.
pub compress_anonymous: bool, // true (default)
pub compress_file_backed: bool, // false (default — just evict clean pages)
pub compress_shmem: bool, // true (default — tmpfs/shm pages)
/// Pages with active DMA mappings are never eligible for compression
/// or swap-out. The page reclaim path checks the DMA pin count
/// (tracked in the device registry's DeviceResources, Section 7)
/// before attempting reclaim. This field is always false and exists
/// as a documented invariant — it cannot be overridden.
pub compress_dma_pinned: bool, // false (invariant, never set to true)
}
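The fixed-point threshold encoded by min_ratio_x100 works out as follows (a minimal sketch of the acceptance predicate, not the kernel's actual code):

```rust
pub const PAGE_SIZE: usize = 4096;

/// Fixed-point acceptance test from CompressionPolicy: a page enters the
/// ZPool only if it compressed to fewer than
/// PAGE_SIZE * 100 / min_ratio_x100 bytes. The ratio is carried as
/// ratio * 100 to avoid FPU use in the kernel.
pub fn accepts(min_ratio_x100: usize, compressed_size: usize) -> bool {
    compressed_size < PAGE_SIZE * 100 / min_ratio_x100
}
```

At the default min_ratio_x100 = 150, the cutoff is 4096 × 100 / 150 = 2730 bytes.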
Decision flow when the page reclaim path needs to free a page:
Page to reclaim:
|
Is it DMA-pinned (active device mapping)?
|-- Yes -> Skip. Not eligible for reclaim. Done.
|
Is it a clean file-backed page?
|-- Yes -> Simply evict (will re-read from disk). Done.
|
Is compression enabled for this page type?
|-- No -> Write to swap (existing path). Done.
|
Attempt LZ4 compression:
|
Compressed to < (PAGE_SIZE * 100 / min_ratio_x100)?
|-- No -> Incompressible. Write to swap. Done.
|
Is ZPool below max_pool_percent?
|-- No -> Evict oldest compressed pages to swap, then store. Done.
|
Store compressed page in ZPool.
Free original page frame.
Done.
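The flow above can be sketched as a pure decision function (field names are hypothetical, not the kernel's real page metadata):

```rust
/// Outcome of the reclaim decision flow.
#[derive(Debug, PartialEq)]
pub enum ReclaimAction {
    Skip,            // DMA-pinned: not eligible for reclaim
    Evict,           // clean file-backed: drop, re-read from disk later
    SwapOut,         // compression disabled or page incompressible
    StoreCompressed, // store in ZPool, free the original frame
}

pub struct PageInfo {
    pub dma_pinned: bool,
    pub clean_file_backed: bool,
    pub compression_enabled: bool,      // per-page-type flags in CompressionPolicy
    pub compressed_size: Option<usize>, // result of the LZ4 attempt, if any
}

pub const PAGE_SIZE: usize = 4096;

pub fn reclaim_action(page: &PageInfo, min_ratio_x100: usize) -> ReclaimAction {
    if page.dma_pinned {
        return ReclaimAction::Skip;
    }
    if page.clean_file_backed {
        return ReclaimAction::Evict;
    }
    if !page.compression_enabled {
        return ReclaimAction::SwapOut;
    }
    match page.compressed_size {
        // Accept only pages beating the minimum ratio (default 1.5x -> 2730 bytes).
        Some(size) if size < PAGE_SIZE * 100 / min_ratio_x100 => ReclaimAction::StoreCompressed,
        _ => ReclaimAction::SwapOut,
    }
}
```

(The "pool full → evict oldest, then store" step is folded into StoreCompressed here; the real path performs writeback first.)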
13.5 Decompression Path
When a page fault occurs on a compressed page:
1. Page fault handler (ISLE Core) checks if the faulting address
maps to a compressed page (index lookup).
2. Allocate a fresh page frame.
3. Decompress from ZPool into the fresh page.
4. Update page tables to point to the fresh page.
5. Remove the compressed entry from ZPool.
6. Return from page fault.
Total time: ~1-2us (page allocation + LZ4 decompress + TLB update).
Compare: swap read from NVMe = ~10-15us, from HDD = ~5-10ms.
13.6 NUMA Awareness
One ZPool per NUMA node. Compressed pages stay on the same NUMA node as their original allocation. This avoids cross-NUMA decompression latency.
NUMA Node 0: NUMA Node 1:
Buddy Allocator 0 Buddy Allocator 1
Page Cache 0 Page Cache 1
LRU Lists 0 LRU Lists 1
ZPool 0 <-local-> ZPool 1
(compresses node-0 pages) (compresses node-1 pages)
13.7 Compression Algorithm Selection
#[repr(u32)]
pub enum CompressionAlgorithm {
/// LZ4: ~1-2us compress, ~0.5us decompress. Best latency.
/// Default choice.
Lz4 = 0,
/// Zstd (level 1): ~3-5us compress, ~1us decompress. Better ratio.
/// Use when memory pressure is high and slightly more latency is OK.
Zstd1 = 1,
/// Zstd (level 3): ~5-10us compress, ~1us decompress. Best ratio.
/// Use when swap I/O is very expensive (HDD) and CPU is available.
Zstd3 = 2,
}
Both LZ4 and Zstd are BSD-licensed. We include our own no_std implementations (no external C library dependency in kernel).
13.7a Latency Spikes and Fragmentation
Transparent memory compression introduces two risks that must be explicitly managed: latency spikes during decompression and fragmentation within the compressed pool.
Decompression latency spikes — when a process accesses a compressed page, the page fault handler must decompress it before the access can proceed. This adds ~1-2μs (LZ4) or ~1-5μs (Zstd) to the page fault latency. While small in absolute terms, this can cause tail latency spikes for latency-sensitive workloads:
- Worst case: a process accesses 100 compressed pages in rapid succession (e.g., scanning a large array that was mostly evicted). Each page fault takes ~1-2μs, totaling ~100-200μs of stall time. This is comparable to a single NVMe read but much better than swap (~10-15μs per page from NVMe × 100 pages = ~1-1.5ms).
- Mitigation — prefetch on decompression: when decompressing page N, the kernel speculatively decompresses pages N+1 through N+3 (sequential readahead into the compressed pool). This converts 4 serial page faults into 1 fault + 3 prefetches, reducing total latency by ~75% for sequential access patterns.
- Mitigation — per-cgroup opt-out: latency-sensitive cgroups (real-time workloads, databases) can disable compression entirely via `memory.zpool.enabled = 0` in the cgroup controller. Their pages are never compressed — they go directly to swap (or are simply not reclaimed until OOM).
- Mitigation — priority-aware compression: pages belonging to high-priority tasks (RT scheduling class, latency-sensitive cgroups) are placed last on the compression candidate list, ensuring they are compressed only under severe memory pressure.
ZPool fragmentation — the compressed pool stores variable-size compressed pages (a 4KB page might compress to 500 bytes, or 2KB, or 3800 bytes). This creates internal fragmentation:
- Buddy-within-zpool: the ZPool divides each region (allocated as order-4 to order-6 pages from the buddy allocator, i.e., 64KB-256KB contiguous blocks) into variable-size slots. Slots are managed with a simple first-fit allocator within each region.
- Fragmentation metric: the kernel tracks `compressed_bytes / pool_total_size` as the pool utilization ratio. When utilization drops below 50% (meaning more than half the pool is fragmented waste), the kernel triggers compaction: live compressed pages are copied to fill gaps, and empty regions are returned to the buddy allocator.
- Compaction cost: moving compressed pages requires updating the reverse-mapping index (which virtual address points to this compressed slot). This is ~100ns per page move. Compaction runs in a background kthread at low priority, processing ~10,000 pages per second (~1ms of CPU time per second). It does not block page faults.
- Worst case: highly heterogeneous compression ratios (some pages compress 10:1, others 2:1) create severe fragmentation. The compaction kthread keeps up with steady-state workloads but may fall behind during allocation bursts. If the pool is >90% fragmented and compaction cannot keep up, the kernel temporarily stops compressing new pages (sending them directly to swap) until compaction catches up.
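A hedged sketch of the thresholds above (the exact trigger points are assumptions consistent with the text: compact below 50% utilization; stop compressing when the pool is >90% fragmented, i.e. utilization below 10%):

```rust
/// Pool health states derived from the utilization ratio.
#[derive(Debug, PartialEq)]
pub enum PoolPressure {
    Healthy,
    Compact,         // background kthread fills gaps, returns empty regions
    StopCompressing, // new pages go directly to swap until compaction catches up
}

/// utilization = compressed_bytes / pool_total_size, computed in integer
/// percent to stay FPU-free.
pub fn pool_pressure(compressed_bytes: u64, pool_total_size: u64) -> PoolPressure {
    let util_pct = compressed_bytes * 100 / pool_total_size;
    if util_pct < 10 {
        PoolPressure::StopCompressing
    } else if util_pct < 50 {
        PoolPressure::Compact
    } else {
        PoolPressure::Healthy
    }
}
```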
Interaction with buddy allocator — the ZPool allocates backing memory from the buddy allocator in large contiguous blocks (64KB-256KB). These allocations are infrequent (one allocation per ~16-64 compressed pages) and always order-4 or larger. This avoids polluting the buddy allocator's small-order freelists. When the ZPool shrinks (memory pressure eases), regions are returned to the buddy allocator as whole contiguous blocks, avoiding external fragmentation.
Buddy allocator view:
Order-4+ allocations → ZPool regions (stable, long-lived)
Order-0 allocations → Regular page cache, anonymous pages (high churn)
These two allocation classes don't interfere: ZPool uses large blocks,
everything else uses small blocks. The buddy allocator's per-order freelists
keep them naturally separated.
13.8 Linux Interface Exposure
procfs (compatible with Linux zswap interface):
/proc/isle/zpool/
enabled # "1" or "0" (read/write)
algorithm # "lz4", "zstd1", "zstd3" (read/write)
max_pool_percent # 25 (read/write, percentage of total RAM; can only
# decrease at runtime — increasing is rejected because
# the hash table index was sized at init for the
# original max_pool_percent and cannot grow)
pool_total_size # Current pool size in bytes
stored_pages # Number of pages in pool
compressed_bytes # Total compressed bytes
original_bytes # Total original bytes
compression_ratio # current ratio (e.g., "3.21")
rejected_pages # Incompressible pages sent to swap
writeback_pages # Pages evicted from pool to swap
decompressions # Total decompress operations
/sys/kernel/mm/zpool/ # Alternative sysfs path
# Same attributes, for tools that prefer sysfs
/proc/meminfo additions (matching Linux zswap format):
Zswapped: 512000 kB (original size of compressed pages)
Zswap: 180000 kB (actual compressed size in memory)
These are additive fields. free, top, htop and other tools that parse /proc/meminfo
simply ignore fields they don't recognize.
14. Scheduler
14.1 Multi-Policy Design
The scheduler supports three scheduling policies simultaneously:
| Policy | Algorithm | Use case | Priority range |
|---|---|---|---|
| Normal | EEVDF | General-purpose workloads | Nice -20 to 19 |
| Real-Time | FIFO / RR | Latency-sensitive applications | RT 1-99 |
| Deadline | EDF (CBS) | Guaranteed CPU time (audio, etc.) | Runtime/Period |
14.2 Architecture
Global Load Balancer
(runs every ~4ms)
|
+-------------+-------------+
| | |
CPU 0 CPU 1 CPU N
+----------+ +----------+ +----------+
| RT Queue | | RT Queue | | RT Queue | <- Highest priority
+----------+ +----------+ +----------+
| DL Queue | | DL Queue | | DL Queue | <- Deadline tasks
+----------+ +----------+ +----------+
|EEVDF Tree| |EEVDF Tree| |EEVDF Tree| <- Normal tasks (red-black tree, EEVDF)
+----------+ +----------+ +----------+
14.3 Key Properties
- Preemptible locks by default: Mutexes and rwlocks are always sleeping locks with priority inheritance. Under `PreemptionModel::Realtime` (Section 16.2), spinlocks also become sleeping locks (RT-safe). Under `Voluntary` and `Full` preemption modes, `SpinLock` is a true spinlock that disables preemption for its critical section — but all spinlock-protected critical sections are bounded and O(1) in duration. Per-CPU data is protected by short IRQ-disabling guards (`PerCpuMutGuard`) that hold for bounded durations only — never across blocking operations. There are no unbounded preemption-disabled regions.
- NUMA-aware load balancing: The load balancer models migration cost (cache invalidation, memory latency) and only migrates tasks when the imbalance exceeds the migration cost.
- Per-CPU run queues: No global run queue lock. Each CPU manages its own queues independently. Each per-CPU run queue is protected by a per-CPU spinlock (`rq->lock`).
- Work stealing: Idle CPUs steal tasks from busy CPUs at low frequency (~4ms interval) to avoid thundering-herd effects.
- Run queue lock ordering: The load balancer (work stealing) acquires remote run queue locks. To prevent ABBA deadlocks, the lock ordering is: always lock the lower-numbered CPU's run queue first. When the load balancer on CPU A needs to examine or steal from CPU B, it acquires `min(A,B)->rq->lock` first, then `max(A,B)->rq->lock`. This is the same protocol used by Linux's CFS load balancer. The load balancer never holds more than two run queue locks simultaneously. The compile-time lock level hierarchy (Section 10.2) places `RQ_LOCK` below `PI_LOCK` and above `TASK_LOCK`: `TASK_LOCK (level 1) < RQ_LOCK (level 2) < PI_LOCK (level 3)`.
- Real-time guarantees: Dedicated RT cores can be reserved (isolcpus equivalent). Threaded interrupts ensure deterministic scheduling latency.
- CPU frequency/power: Integration with cpufreq governors for power management.
14.4 Scheduler Classes
The scheduler is modular. Each scheduling class implements a standard interface:
pub trait SchedClass: Send + Sync {
fn enqueue_task(&mut self, task: &mut Task, flags: EnqueueFlags);
fn dequeue_task(&mut self, task: &mut Task, flags: DequeueFlags);
fn pick_next_task(&mut self, cpu: CpuId) -> Option<&mut Task>;
fn check_preempt(&self, current: &Task, incoming: &Task) -> bool;
fn task_tick(&mut self, task: &mut Task, cpu: CpuId, queued: bool);
fn balance(&mut self, cpu: CpuId, flags: BalanceFlags) -> BalanceResult;
}
Classes are checked in priority order: Deadline > RT > EEVDF. The highest-priority class with a runnable task wins.
See also: Section 16 (Real-Time Guarantees) extends deadline scheduling with bounded-latency paths, threaded interrupts, and PREEMPT_RT-style priority inheritance for hard real-time workloads.
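The priority-order check can be sketched as a fall-through dispatch (queues of task IDs stand in for the SchedClass trait objects):

```rust
/// Per-CPU queues, one per scheduling class.
pub struct ClassQueues {
    pub deadline: Vec<u64>,
    pub rt: Vec<u64>,
    pub eevdf: Vec<u64>,
}

#[derive(Debug, PartialEq)]
pub enum Picked {
    Deadline(u64),
    Rt(u64),
    Eevdf(u64),
    Idle,
}

/// Deadline > RT > EEVDF: the highest-priority class with a runnable
/// task wins; fall through to idle when everything is empty.
pub fn pick_next(q: &mut ClassQueues) -> Picked {
    if let Some(t) = q.deadline.pop() {
        Picked::Deadline(t)
    } else if let Some(t) = q.rt.pop() {
        Picked::Rt(t)
    } else if let Some(t) = q.eevdf.pop() {
        Picked::Eevdf(t)
    } else {
        Picked::Idle
    }
}
```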
14.5 Heterogeneous CPU Support (big.LITTLE / Intel Hybrid / RISC-V)
Modern SoCs are no longer symmetric. ARM big.LITTLE (2011+), Intel Alder Lake P-core/E-core (2021+), and RISC-V platforms with mixed hart types all present the scheduler with CPUs of different performance, power, and ISA capabilities. A scheduler that treats all CPUs as identical will either waste power (running background tasks on performance cores) or starve throughput (placing compute-heavy tasks on efficiency cores).
This section extends the scheduler with Energy-Aware Scheduling (EAS), per-CPU capacity tracking, and heterogeneous topology awareness.
14.5.1 CPU Capacity Model
Every CPU has a capacity value normalized to a 0–1024 scale, where the fastest core at its highest frequency = 1024. This is the fundamental abstraction that makes the scheduler heterogeneity-aware.
// isle-core/src/sched/capacity.rs
/// Per-CPU capacity descriptor.
/// Populated at boot from firmware tables (ACPI PPTT, devicetree, CPPC).
/// Updated at runtime when frequency changes.
pub struct CpuCapacity {
/// Maximum capacity of this CPU at its highest OPP (Operating Performance Point).
/// Normalized: fastest core in the system = 1024.
/// An efficiency core might be 512 (half the throughput of a performance core).
pub capacity: u32,
/// Original (boot-time) maximum capacity. Does not change.
pub capacity_max: u32,
/// Current capacity, adjusted for current frequency.
/// If a 1024-capacity core is running at 50% frequency, capacity_curr = 512.
/// Updated by cpufreq governor on frequency change.
pub capacity_curr: AtomicU32,
/// Core type classification.
pub core_type: CoreType,
/// Frequency domain this CPU belongs to.
/// All CPUs in a frequency domain share the same clock.
pub freq_domain: FreqDomainId,
/// ISA capabilities of this CPU.
/// On heterogeneous ISA systems (RISC-V), different cores may support
/// different extensions.
pub isa_caps: IsaCapabilities,
/// Microarchitecture ID (for Intel Thread Director).
/// Different core types have different uarch IDs.
pub uarch_id: u32,
}
/// Core type classification.
#[repr(u32)]
pub enum CoreType {
/// ARM Cortex-X/A7x, Intel P-core.
/// High single-thread performance, high power.
Performance = 0,
/// ARM Cortex-A5x, Intel E-core.
/// Lower performance, significantly lower power.
Efficiency = 1,
/// ARM Cortex-A7x mid-tier (e.g., Cortex-A78 in a system with X3 and A510).
Mid = 2,
/// Traditional SMP — all cores identical.
/// When all cores are Symmetric, EAS is disabled (unnecessary).
Symmetric = 3,
}
/// ISA capability flags.
/// On heterogeneous ISA systems, the scheduler must ensure a task only runs
/// on a CPU that supports the ISA features the task uses.
bitflags! {
pub struct IsaCapabilities: u64 {
// ARM
const ARM_SVE = 1 << 0; // Scalable Vector Extension
const ARM_SVE2 = 1 << 1; // SVE2
const ARM_SME = 1 << 2; // Scalable Matrix Extension
const ARM_MTE = 1 << 3; // Memory Tagging Extension
// x86
const X86_AVX512 = 1 << 16; // AVX-512 (P-cores only on some Intel)
const X86_AMX = 1 << 17; // Advanced Matrix Extensions (P-cores only)
const X86_AVX10 = 1 << 18; // AVX10 (unified AVX across core types)
// RISC-V
const RV_V = 1 << 32; // Vector extension
const RV_B = 1 << 33; // Bit manipulation
const RV_H = 1 << 34; // Hypervisor extension
const RV_CRYPTO = 1 << 35; // Cryptography extensions
}
}
Key design property: On a fully symmetric system (all cores CoreType::Symmetric),
the capacity model is a no-op. All CPUs have capacity 1024, all have the same ISA
capabilities. The scheduler fast path sees capacity_curr == 1024 on every CPU and
skips all heterogeneous logic. Zero overhead on symmetric systems.
14.5.2 Energy Model
The energy model describes the power cost of running a workload at each performance level on each core type. It is the foundation of Energy-Aware Scheduling.
// isle-core/src/sched/energy.rs
/// Energy model for one frequency domain.
/// A frequency domain is a group of CPUs that share the same clock.
/// All CPUs in a domain have the same core type and OPP table.
pub struct EnergyModel {
/// Which frequency domain this model covers.
pub freq_domain: FreqDomainId,
/// Core type of CPUs in this domain.
pub core_type: CoreType,
/// Number of CPUs in this domain.
pub cpu_count: u32,
/// Operating Performance Points, sorted by frequency (ascending).
/// Each OPP maps a frequency to a capacity and power cost.
/// Fixed-capacity inline array avoids heap allocation and keeps OPP
/// data cache-local. 16 entries accommodates all known hardware OPP
/// tables.
pub opps: ArrayVec<OppEntry, MAX_OPP_ENTRIES>, // MAX_OPP_ENTRIES = 16
}
/// One Operating Performance Point.
pub struct OppEntry {
/// Frequency in kHz.
pub freq_khz: u32,
/// Capacity at this frequency (0–1024 scale).
/// Capacity scales linearly with frequency within a core type.
pub capacity: u32,
/// Power consumption at this frequency (milliwatts).
/// This is the DYNAMIC power for one CPU running at 100% utilization.
/// Power scales roughly as V²×f (voltage² × frequency).
pub power_mw: u32,
}
Example: ARM big.LITTLE system (Cortex-X3 + Cortex-A510)
Performance cores (Cortex-X3), freq_domain 0:
OPP 0: 600 MHz, capacity 256, power 80 mW
OPP 1: 1200 MHz, capacity 512, power 280 mW
OPP 2: 1800 MHz, capacity 768, power 650 mW
OPP 3: 2400 MHz, capacity 1024, power 1200 mW
Efficiency cores (Cortex-A510), freq_domain 1:
OPP 0: 400 MHz, capacity 100, power 15 mW
OPP 1: 800 MHz, capacity 200, power 50 mW
OPP 2: 1200 MHz, capacity 300, power 110 mW
OPP 3: 1600 MHz, capacity 400, power 200 mW
Observation: running a task with utilization 200 (out of 1024):
On performance core at OPP 0 (capacity 256): power = 80 mW
On efficiency core at OPP 3 (capacity 400): power = 200 mW — overkill for util 200
On efficiency core at OPP 1 (capacity 200): power = 50 mW
→ Efficiency core at OPP 1 (50 mW) beats performance core at OPP 0 (80 mW).
→ Efficiency core wins. EAS places the task on the efficiency core.
But a task with utilization 500:
→ Doesn't fit on any efficiency core OPP (max capacity 400).
→ Must go to performance core. EAS picks lowest OPP that fits: OPP 1 (512), 280 mW.
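The worked example above can be reproduced in code. This sketch hardcodes the assumed OPP tables and implements only the lowest-fitting-OPP step of EAS:

```rust
/// One Operating Performance Point (capacity on the 0-1024 scale,
/// dynamic power in milliwatts).
#[derive(Clone, Copy)]
pub struct Opp {
    pub capacity: u32,
    pub power_mw: u32,
}

/// Lowest OPP whose capacity covers `util`; None if the task does not fit
/// on this core type at all (misfit — must migrate to a bigger core).
/// Assumes the OPP table is sorted by frequency (ascending), as specified.
pub fn lowest_fitting_opp(opps: &[Opp], util: u32) -> Option<Opp> {
    opps.iter().copied().find(|o| o.capacity >= util)
}

/// Performance cores (Cortex-X3), from the example table above.
pub const X3_OPPS: [Opp; 4] = [
    Opp { capacity: 256, power_mw: 80 },
    Opp { capacity: 512, power_mw: 280 },
    Opp { capacity: 768, power_mw: 650 },
    Opp { capacity: 1024, power_mw: 1200 },
];

/// Efficiency cores (Cortex-A510).
pub const A510_OPPS: [Opp; 4] = [
    Opp { capacity: 100, power_mw: 15 },
    Opp { capacity: 200, power_mw: 50 },
    Opp { capacity: 300, power_mw: 110 },
    Opp { capacity: 400, power_mw: 200 },
];
```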
14.5.3 Energy-Aware Scheduling Algorithm
EAS runs at task wakeup time (the most impactful scheduling decision). It answers: "Which CPU should this task run on to minimize total system energy while meeting performance requirements?"
// isle-core/src/sched/eas.rs
pub struct EnergyAwareScheduler {
/// Energy models for all frequency domains.
energy_models: Vec<EnergyModel>,
/// Per-CPU utilization (PELT, see §14.5.4).
cpu_util: Vec<AtomicU32>,
/// Threshold: a task is "misfit" if its utilization exceeds
/// the capacity of the CPU it's running on.
/// Misfit tasks are migrated to higher-capacity CPUs.
misfit_threshold: u32,
/// EAS is disabled on fully symmetric systems (no benefit).
enabled: bool,
}
impl EnergyAwareScheduler {
/// Find the most energy-efficient CPU for a waking task.
/// Called from EEVDF enqueue path when EAS is enabled.
///
/// Algorithm:
/// 1. For each frequency domain:
/// a. Compute the new utilization if this task were placed here.
/// b. Find the lowest OPP that can handle the new utilization.
/// c. Compute energy cost = OPP power × (new_util / capacity).
/// 2. Pick the frequency domain with the lowest energy cost.
/// 3. Within that domain, pick the CPU with the most spare capacity
/// (to avoid unnecessary frequency increases).
///
/// Complexity: O(domains × OPPs). Typically 2-3 domains × 4-6 OPPs = 8-18 iterations.
/// Time: ~200-500ns. Acceptable for task wakeup path (~2000ns total).
pub fn find_energy_efficient_cpu(&self, task_util: u32) -> CpuId {
let mut best_energy = u64::MAX;
let mut best_cpu = CpuId(0);
for model in &self.energy_models {
// Can this domain handle the task at all?
let max_capacity = model.opps.last().map(|o| o.capacity).unwrap_or(0);
if task_util > max_capacity {
continue; // Task doesn't fit on this core type
}
// Compute energy cost for placing task in this domain.
let energy = self.compute_energy(model, task_util);
if energy < best_energy {
best_energy = energy;
best_cpu = self.find_idlest_cpu_in_domain(model.freq_domain);
}
}
best_cpu
}
/// Estimate energy cost of adding `task_util` to a frequency domain.
///
/// OPP selection uses the maximum per-CPU utilization in the domain (not
/// the aggregate), because frequency is shared across all CPUs in a DVFS
/// domain — the OPP must be high enough for the most loaded CPU.
fn compute_energy(&self, model: &EnergyModel, task_util: u32) -> u64 {
// Find max per-CPU utilization in this domain. Assumes the task
// will be placed on the idlest CPU (same heuristic as
// find_idlest_cpu_in_domain), so task_util is added to that
// CPU's utilization when computing the domain's max.
let max_cpu_util = self.max_cpu_utilization(model.freq_domain, task_util);
// Find lowest OPP whose capacity can handle the busiest CPU.
let opp = model.opps.iter()
.find(|o| o.capacity >= max_cpu_util)
.unwrap_or(model.opps.last().unwrap());
// Energy = power × (sum of all CPU utilizations) / capacity.
// Power is determined by the OPP (selected by max CPU), but energy
// is proportional to total work done across all CPUs in the domain.
let domain_util = self.domain_utilization(model.freq_domain) + task_util;
(opp.power_mw as u64) * (domain_util as u64) / (opp.capacity as u64)
}
}
When EAS is not used (symmetric systems, where all cores are the same type),
the standard EEVDF load balancer runs instead. EAS adds zero overhead in that case:
enabled == false, and the check is a single branch at the top of the wakeup path.
14.5.4 Per-Entity Load Tracking (PELT)
EAS needs accurate, up-to-date utilization data for each task and each CPU. PELT provides this with an exponentially-decaying average that balances responsiveness with stability.
// isle-core/src/sched/pelt.rs
/// Per-Entity Load Tracking state.
/// Attached to every task AND every CPU run queue.
///
/// PELT computes a running average of utilization:
/// util_avg(t) = util_avg(t-1) × decay + sample(t) × (1 - decay)
///
/// The decay factor is tuned so that the half-life is ~32ms.
/// This means: if a task was running at 100% and goes idle,
/// its util_avg drops to 50% after 32ms, 25% after 64ms, etc.
pub struct PeltState {
/// Average utilization (0–1024).
/// 1024 = this entity used 100% of a CPU over the averaging window.
pub util_avg: u32,
/// Average runnable time (0–1024).
/// Includes time spent waiting in the run queue (not just running).
/// util_avg <= runnable_avg (runnable includes queued time).
pub runnable_avg: u32,
/// Weighted load (for load balancing).
/// load_avg = weight × runnable_avg.
/// Weight comes from nice/priority (higher priority = higher weight).
pub load_avg: u64,
/// Timestamp of last update (nanoseconds).
pub last_update_time: u64,
/// Accumulated time in current period (for sub-period precision).
pub period_contrib: u32,
}
impl PeltState {
/// Update PELT state with new sample.
/// Called at scheduler tick and at task state transitions (wake, sleep, migrate).
///
/// **State-transition contract**: The caller MUST call `update()` at every
/// state transition (running→sleeping, sleeping→runnable, runnable→running,
/// etc.). Between consecutive calls, the `running` and `runnable` parameters
/// must reflect a constant state — the state the entity was in during the
/// entire `delta_ns` interval. This ensures that `period_contrib` carry-forward
/// is correctly attributed: the carried-forward time belongs to the same
/// state as the new delta, because the state didn't change between updates.
/// Violating this contract (e.g., calling update with running=false after the
/// task was actually running for part of the interval) misattributes the
/// carried-forward time to the wrong state. This matches Linux's PELT, which
/// calls `update_curr()` at every scheduling event (task switch, wake, sleep,
/// tick), ensuring constant state between updates.
///
/// `running`: was this entity actually executing on a CPU during the elapsed time?
/// `runnable`: was this entity runnable (running OR waiting in the run queue)?
/// Invariant: `running` implies `runnable` — a running task is always runnable.
/// `delta_ns`: time since last update.
pub fn update(&mut self, running: bool, runnable: bool, delta_ns: u64) {
debug_assert!(!running || runnable, "running implies runnable");
// Split delta into complete periods (1024 μs each) and remainder.
// Period = 1024 μs = 1,024,000 ns. Chosen as a power of two (in μs)
// so that division becomes a shift.
const PERIOD_NS: u64 = 1_024_000;
// Carry forward sub-period time from the previous update. This ensures
// that small deltas (< 1 period) accumulate correctly across updates
// rather than being repeatedly truncated. Without this, a task updated
// every 500μs would never complete a full period, and its sub-period
// contributions would systematically undercount utilization.
let total_ns = delta_ns + self.period_contrib as u64;
let periods = (total_ns / PERIOD_NS) as u32;
let remainder = (total_ns % PERIOD_NS) as u32;
// Decay existing averages by the number of elapsed periods.
// decay_factor = 0.5^(periods/32) — precomputed in lookup table.
if periods > 0 {
self.util_avg = decay_load(self.util_avg, periods);
self.runnable_avg = decay_load(self.runnable_avg, periods);
self.load_avg = decay_load_64(self.load_avg, periods);
// Each complete period where the entity was running/runnable
// contributes a full period (1024 utilization units). Accumulate
// the decayed sum of complete periods (geometric series).
if running {
self.util_avg += accumulate_sum(periods);
}
if runnable {
self.runnable_avg += accumulate_sum(periods);
}
}
// Sub-period contribution: the remainder is NOT added to util_avg or
// runnable_avg immediately. Instead, it is carried forward via
// period_contrib and counted when it eventually completes a full
// period on the next update. This prevents double-counting: if we
// added remainder/1000 to util_avg here AND stored remainder in
// period_contrib, the same time would be counted again when
// carry-forward crosses the next period boundary.
//
// The tradeoff is 1024μs quantization — sub-period contributions are
// not visible until the period completes. This is acceptable for a
// scheduler heuristic (Linux updates PELT at 1-4ms tick intervals,
// and the geometric decay operates over a 32ms half-life).
self.period_contrib = remainder;
self.last_update_time += delta_ns;
}
}
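The `decay_load` helper referenced above is not shown; here is a minimal sketch of the decay half, assuming the 32-period (~32 ms) half-life described earlier. A kernel implementation would use a precomputed 32-entry fixed-point lookup table rather than floating point; the float here is for clarity only.

```rust
/// Number of 1024-μs periods per half-life (~32 ms).
const LOAD_AVG_PERIOD: u32 = 32;

/// Decay `val` by `n` elapsed periods: val × y^n, where y^32 = 1/2.
fn decay_load(mut val: u32, mut n: u32) -> u32 {
    // After 32 half-lives any 32-bit value has decayed below 1.
    if n / LOAD_AVG_PERIOD >= 32 {
        return 0;
    }
    // Whole half-lives are exact shifts: y^32 = 1/2.
    val >>= n / LOAD_AVG_PERIOD;
    n %= LOAD_AVG_PERIOD;
    // Fractional part: y^n for n in 0..32. The kernel would read this
    // from a fixed-point table; approximated here with floating point.
    let y = 0.5f64.powf(1.0 / LOAD_AVG_PERIOD as f64);
    (val as f64 * y.powi(n as i32)) as u32
}
```

This reproduces the half-life behavior stated above: a value of 1024 decays to 512 after 32 periods and 256 after 64.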
Relationship to EAS: When a task wakes up, the scheduler reads task.pelt.util_avg
to know the task's CPU demand. EAS uses this to find the core type where the task fits
most efficiently. Without PELT, EAS would have no utilization data to work with.
14.5.5 Frequency Domain Awareness and Cpufreq Integration
CPUs within a frequency domain share a clock — changing one CPU's frequency changes all of them. The scheduler must be aware of this grouping.
// isle-core/src/sched/cpufreq.rs
/// Frequency domain: a group of CPUs sharing a clock source.
pub struct FreqDomain {
/// Domain identifier.
pub id: FreqDomainId,
/// CPUs in this domain.
pub cpus: CpuMask,
/// Core type of all CPUs in this domain (always uniform within a domain).
pub core_type: CoreType,
/// Available OPPs for this domain. Fixed-capacity inline array avoids
/// heap allocation and keeps OPP data cache-local. 16 entries
/// accommodates all known hardware OPP tables. If hardware exposes more
/// than MAX_OPP_ENTRIES OPPs, the driver selects the 16 entries with
/// the widest frequency spread (lowest, highest, and 14 evenly
/// distributed intermediate points), which preserves DVFS fidelity for
/// all practical workloads.
pub opps: ArrayVec<OppEntry, MAX_OPP_ENTRIES>, // MAX_OPP_ENTRIES = 16
/// Current OPP index (into `opps`).
pub current_opp: AtomicU32,
/// Aggregate utilization of all CPUs in this domain (sum of PELT util_avg).
/// Updated at scheduler tick.
pub domain_util: AtomicU32,
/// Cpufreq governor for this domain.
pub governor: CpufreqGovernor,
}
/// Cpufreq governor — decides when to change frequency.
pub enum CpufreqGovernor {
/// Schedutil: frequency tracks utilization (default for EAS).
/// New frequency = (util / capacity) × max_freq.
/// Tight integration with scheduler — runs from scheduler context.
Schedutil,
/// Performance: always run at max frequency.
Performance,
/// Powersave: always run at min frequency.
Powersave,
/// Ondemand: legacy timer-sampling governor (Linux compat).
Ondemand,
/// Conservative: like ondemand but ramps gradually.
Conservative,
}
Schedutil integration: On every scheduler tick, the schedutil governor reads the domain's aggregate utilization and adjusts frequency:
new_freq = (domain_util / domain_capacity) × max_freq
If domain has 4 CPUs at capacity 1024 each:
domain_capacity = 4096
If domain_util = 2048 (50% utilized):
new_freq = (2048 / 4096) × max_freq = 50% of max_freq
Frequency change latency: ~10-50 μs (hardware-dependent).
The governor rate-limits changes to avoid oscillation (~4ms minimum interval).
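The frequency formula above reduces to one integer expression; a sketch (illustrative name; the real governor also applies utilization headroom and the ~4 ms rate limit):

```rust
/// next_freq = (domain_util / domain_capacity) × max_freq.
/// Multiply before dividing so integer truncation doesn't zero the result.
fn schedutil_next_freq_khz(domain_util: u32, domain_capacity: u32, max_freq_khz: u32) -> u32 {
    (max_freq_khz as u64 * domain_util as u64 / domain_capacity as u64) as u32
}
```

For the worked example above (4 CPUs × capacity 1024, domain_util = 2048), a hypothetical 3.2 GHz max frequency yields 1.6 GHz, i.e. 50% of max.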
14.5.6 Intel Thread Director (ITD) Integration
Intel Thread Director is a hardware feature on Alder Lake+ that classifies running workloads and provides hints about which core type is optimal. The hardware monitors instruction mix in real-time and reports via an MSR.
// isle-core/src/sched/itd.rs
/// Intel Thread Director hint (read from HW_FEEDBACK_THREAD_DIR MSR).
pub struct ItdHint {
/// Hardware's assessment: how much this task benefits from a P-core.
/// 0 = no benefit (pure memory-bound), 255 = maximum benefit (compute-bound).
pub perf_capability: u8,
/// Hardware's assessment: energy efficiency on an E-core.
/// 0 = poor efficiency on E-core, 255 = excellent efficiency on E-core.
pub energy_efficiency: u8,
/// Workload classification.
pub workload_class: ItdWorkloadClass,
}
#[repr(u8)]
pub enum ItdWorkloadClass {
/// Scalar integer code — runs well on E-cores.
ScalarInt = 0,
/// Scalar floating-point — runs well on E-cores.
ScalarFp = 1,
/// Vectorized (SSE/AVX) — may benefit from P-cores (wider execution units).
Vector = 2,
/// AVX-512 / AMX — P-core only (E-cores may lack these).
HeavyVector = 3,
/// Branch-heavy — benefits from P-core branch predictor.
BranchHeavy = 4,
/// Memory-bound — core type doesn't matter, memory is the bottleneck.
MemoryBound = 5,
}
Integration with EAS: ITD hints are a refinement. EAS uses PELT utilization to pick the energy-optimal core. ITD overrides this when hardware detects a mismatch:
EAS decision: task has low utilization (200/1024) → place on E-core.
ITD override: task is HeavyVector (AVX-512) → E-core lacks AVX-512 → force P-core.
EAS decision: task has high utilization (800/1024) → place on P-core.
ITD override: task is MemoryBound → P-core is wasted → suggest E-core.
ITD hints are read from the MSR at scheduler tick (~4ms). Cost: one RDMSR (~30ns). On non-Intel or pre-Alder Lake: ITD is disabled, zero overhead.
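The override rules listed above can be modeled as a small refinement step applied after the EAS decision. The types here are illustrative, not the kernel's actual interface:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum CoreChoice {
    Performance,
    Efficiency,
}

enum WorkloadClass {
    HeavyVector, // AVX-512 / AMX instruction mix
    MemoryBound,
    Other,
}

/// Apply the ITD hardware hint as a veto over the EAS placement.
fn apply_itd_override(eas_choice: CoreChoice, class: WorkloadClass) -> CoreChoice {
    match class {
        // E-cores lack AVX-512/AMX: the task must run on a P-core.
        WorkloadClass::HeavyVector => CoreChoice::Performance,
        // Memory-bound: a P-core is wasted; suggest an E-core.
        WorkloadClass::MemoryBound => CoreChoice::Efficiency,
        // Otherwise keep the energy-optimal EAS placement.
        WorkloadClass::Other => eas_choice,
    }
}
```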
Other architectures: ARM and RISC-V do not have an ITD equivalent. On ARM
big.LITTLE/DynamIQ, the scheduler relies on the capacity-dmips-mhz device tree
property and runtime PELT utilization to make core placement decisions (Section
14.5.3). On RISC-V heterogeneous harts, the riscv,isa device tree property and
per-hart ISA capability flags drive placement (Section 14.5.9). Neither architecture
provides hardware-level workload classification hints — the scheduler's software
heuristics (PELT + EAS) perform this role. This is architecturally acceptable: ITD
is an optimization (~15-25% better placement for mixed workloads on Intel hybrid),
not a correctness requirement.
14.5.7 Asymmetric Packing
On heterogeneous systems, idle CPU selection must be topology-aware:
Symmetric (traditional):
Spread tasks across all CPUs evenly for maximum parallelism.
Asymmetric (big.LITTLE):
Pack tasks onto efficiency cores first.
Only spill onto performance cores when efficiency cores are full
or when a task is too large (misfit).
Why: an idle performance core at its lowest OPP still draws more power
than a busy efficiency core. Packing onto efficiency cores first
minimizes total system power.
Misfit migration: A task is "misfit" if its utilization (pelt.util_avg) exceeds
the capacity of the CPU it's currently running on. Misfit tasks are migrated to a
higher-capacity CPU at the next load balance opportunity.
/// Check if a task is misfit on its current CPU.
pub fn is_misfit(task: &Task, cpu: &CpuCapacity) -> bool {
task.pelt.util_avg > cpu.capacity_curr.load(Ordering::Relaxed)
}
/// Misfit migration is checked at every load balance interval (~4ms).
/// If a task is misfit:
/// 1. Find the closest (topology-wise) CPU with enough capacity.
/// 2. Migrate the task.
/// 3. Mark the source CPU's rq->misfit_task flag for faster detection.
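Step 1 of the migration reduces to a capacity scan; a minimal sketch, assuming CPUs are presented in topology order (nearest first). The real balancer also weighs current load and migration cost.

```rust
/// Return the index of the first CPU whose capacity fits `util`,
/// scanning in topology order (topologically closer CPUs first).
fn find_fit_cpu(util: u32, capacities_by_distance: &[u32]) -> Option<usize> {
    capacities_by_distance.iter().position(|&cap| cap >= util)
}
```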
14.5.8 Cgroup Integration
Cgroups can constrain which core types a group of tasks may use:
/sys/fs/cgroup/<group>/cpu.core_type
# Allowed core types for this cgroup.
# "all" — any core type (default)
# "performance" — only P-cores (latency-critical workloads)
# "efficiency" — only E-cores (background/batch workloads)
# Multiple: "performance mid" — P-cores and mid-tier cores
/sys/fs/cgroup/<group>/cpu.capacity_min
# Minimum per-CPU capacity for tasks in this cgroup.
# Tasks will not be placed on CPUs with capacity below this value.
# Default: 0 (no minimum).
# Example: "512" — only run on CPUs with at least half maximum capacity.
/sys/fs/cgroup/<group>/cpu.capacity_max
# Maximum per-CPU capacity for tasks in this cgroup.
# Tasks will not be placed on CPUs with capacity above this value.
# Default: 1024 (no maximum).
# Example: "400" — only run on efficiency cores.
Use case examples:
# Kubernetes: latency-critical pod on P-cores only
echo "performance" > /sys/fs/cgroup/k8s-pod-frontend/cpu.core_type
# Background log processing: E-cores only (save P-cores for real work)
echo "efficiency" > /sys/fs/cgroup/k8s-pod-logshipper/cpu.core_type
# ML training: needs AVX-512 (P-cores on Intel, ISA-gated)
echo "512" > /sys/fs/cgroup/k8s-pod-training/cpu.capacity_min
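During CPU selection, the capacity_min/capacity_max knobs reduce to a per-CPU eligibility filter; a sketch with hypothetical names:

```rust
/// A CPU is eligible for a cgroup's tasks iff its capacity lies
/// within the cgroup's [capacity_min, capacity_max] window.
fn cgroup_allows_cpu(cpu_capacity: u32, cap_min: u32, cap_max: u32) -> bool {
    cpu_capacity >= cap_min && cpu_capacity <= cap_max
}
```

With the defaults (min 0, max 1024) every CPU passes; "capacity_max = 400" restricts placement to efficiency cores, exactly as in the sysfs examples above.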
14.5.9 RISC-V Heterogeneous Hart Support
RISC-V takes heterogeneity further: different harts (hardware threads) may have different ISA extensions. One hart may have the Vector extension (RVV), another may not. One hart may support the Hypervisor extension (H), another may not.
// isle-core/src/sched/riscv.rs
/// RISC-V ISA extension discovery per hart.
/// Read from the devicetree `riscv,isa` property for each hart.
///
/// Example devicetree:
/// cpu@0 { riscv,isa = "rv64imafdc_zba_zbb_v"; }; // Vector-capable
/// cpu@1 { riscv,isa = "rv64imafd"; }; // No vector
pub fn discover_hart_capabilities(dt: &DeviceTree) -> Vec<IsaCapabilities> {
// Parse each hart's ISA string.
// Set IsaCapabilities flags accordingly.
// The scheduler uses these to ensure tasks that use Vector
// instructions only run on Vector-capable harts.
}
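The ISA-string parsing that `discover_hart_capabilities` performs can be sketched as follows. The flag constants are illustrative (the real IsaCapabilities bitflags are defined elsewhere), and a production parser must also handle extension versions and the full multi-letter extension set.

```rust
const ISA_F: u32 = 1 << 0; // single-precision FP
const ISA_D: u32 = 1 << 1; // double-precision FP
const ISA_V: u32 = 1 << 2; // Vector extension (RVV)

/// Parse a devicetree `riscv,isa` string, e.g. "rv64imafdc_zba_zbb_v".
fn parse_riscv_isa(isa: &str) -> u32 {
    let mut caps = 0;
    let mut parts = isa.split('_');
    // Base string: single-letter extensions after the "rv64"/"rv32" prefix.
    let base = parts.next().unwrap_or("");
    let letters = base.trim_start_matches("rv64").trim_start_matches("rv32");
    for ch in letters.chars() {
        match ch {
            'f' => caps |= ISA_F,
            'd' => caps |= ISA_D,
            'v' => caps |= ISA_V,
            _ => {} // i/m/a/c: baseline or not tracked in this sketch
        }
    }
    // Underscore-separated extensions ("v", "zba", "zbb", ...).
    for ext in parts {
        if ext == "v" {
            caps |= ISA_V;
        }
    }
    caps
}
```

Applied to the devicetree example above, "rv64imafdc_zba_zbb_v" yields a Vector-capable hart and "rv64imafd" does not.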
ISA gating in the scheduler:
Task affinity includes ISA requirements:
struct TaskAffinityHint {
/// ISA extensions this task requires (detected from ELF header
/// or set by userspace via prctl).
pub isa_required: IsaCapabilities,
/// Core type preference (from cgroup or auto-detected).
pub core_preference: CorePreference,
/// PELT utilization (for EAS).
pub util_avg: u32,
}
Scheduler check:
if !cpu.isa_caps.contains(task.affinity.isa_required) {
// This CPU lacks ISA extensions the task needs.
// Skip this CPU. Do NOT schedule here.
continue;
}
This prevents illegal-instruction faults caused by scheduling a Vector task on a non-Vector hart, which remains a silent correctness hazard on Linux today: Linux began adding per-hart ISA detection in 6.2+ and capability tracking in 6.4+, but does not yet integrate per-hart ISA awareness into the scheduler's task placement decisions, so a Vector-tagged task can still be scheduled on a non-Vector hart.
14.5.10 Topology Discovery
The scheduler builds its heterogeneous topology model from firmware:
Sources (checked in order):
1. ACPI PPTT (Processor Properties Topology Table):
— Provides core type, cache hierarchy, frequency domain.
— Available on ARM SBSA servers and Intel Alder Lake+.
2. ACPI CPPC (Collaborative Processor Performance Control):
— Provides per-CPU performance range (highest/lowest/nominal).
— The ratio highest_perf / nominal_perf indicates core type:
P-cores have higher highest_perf than E-cores.
— Used on Intel hybrid platforms.
3. Devicetree:
— ARM and RISC-V embedded systems.
— `capacity-dmips-mhz` property gives relative core performance.
— RISC-V `riscv,isa` property gives per-hart ISA extensions.
4. Intel CPUID leaf 0x1A (Hybrid Information):
— Reports core type (Atom = E-core, Core = P-core) for the running CPU.
— Each CPU reads its own CPUID at boot.
5. Fallback: runtime measurement
— If no firmware data: run a calibration loop on each CPU at boot.
— Measure instructions-per-second to derive relative capacity.
— Last resort. ~100ms at boot.
14.5.11 Linux Compatibility
All Linux interfaces for heterogeneous CPU systems are supported:
/sys/devices/system/cpu/cpuN/cpu_capacity
# Read-only. Capacity of CPU N (0–1024).
# Written by the kernel at boot. Used by userspace tools.
# e.g., "1024" for P-core, "512" for E-core.
/sys/devices/system/cpu/cpuN/topology/core_type
# "performance", "efficiency", or "unknown".
# New in Linux 6.x, used by systemd and schedulers.
/sys/devices/system/cpu/cpufreq/policyN/
# Standard cpufreq interface (per frequency domain):
# scaling_governor, scaling_cur_freq, scaling_max_freq, etc.
sched_setattr(pid, &attr):
# SCHED_FLAG_UTIL_CLAMP: set min/max utilization clamp.
# util_min/util_max affect EAS placement decisions.
# Fully supported with same semantics as Linux.
prctl(PR_SCHED_CORE, ...):
# Core scheduling (co-scheduling related tasks on the same core).
# Supported.
Kernel command line:
# isolcpus=2-3 (reserve CPUs, same as Linux)
# nohz_full=2-3 (tickless for RT, same as Linux)
# nosmt (disable SMT, same as Linux)
sched_ext compatibility: Linux 6.12+ allows BPF scheduling policies via sched_ext. ISLE provides the foundation for sched_ext through the eBPF subsystem (Section 20.4, 05-linux-compat.md) and the sched_setattr interface (Section 20). Full sched_ext support requires the BPF struct_ops framework and sched_ext-specific kfuncs (scx_bpf_dispatch, scx_bpf_consume, etc.), which are part of the eBPF subsystem implementation. BPF schedulers that use sysfs topology files and sched_setattr for configuration will work without modification once the struct_ops infrastructure is in place.
14.5.12 Performance Impact
Symmetric systems (all cores identical):
EAS: disabled (enabled == false). One branch check at task wakeup: ~1 cycle.
Capacity model: all CPUs = 1024. No capacity checks affect scheduling decisions.
PELT: runs regardless (already exists in the standard EEVDF scheduler). Zero additional cost.
Total overhead vs Linux on symmetric: ZERO.
Heterogeneous systems (big.LITTLE, Intel hybrid):
EAS: ~200-500ns per task wakeup (iterate 2-3 domains × 4-6 OPPs).
Task wakeup total (without EAS): ~1500-2000ns.
Task wakeup total (with EAS): ~1700-2500ns.
Overhead: ~15-25% of wakeup path. Same as Linux EAS.
ITD MSR read: ~30ns per scheduler tick (4ms). Overhead: 0.00075%.
Misfit check: one comparison per load balance (~4ms). Negligible.
Benefit: 20-40% power reduction for mixed workloads (measured on
Linux EAS vs non-EAS on ARM big.LITTLE). Same benefit expected.
This is not overhead — it's a power optimization. The CPU time spent
on EAS decisions is recovered many times over in power savings.
Summary: ISLE's heterogeneous scheduling has identical performance
to Linux EAS on the same hardware. The algorithms are the same (PELT,
EAS energy computation, schedutil). The implementation is clean-sheet
Rust but the scheduling mathematics are equivalent.
See also: Section 49 (Power Budgeting) extends EAS with system-level power caps and per-domain throttling. Section 54 (Unified Compute Model) generalizes the CpuCapacity scalar into a multi-dimensional capacity vector spanning CPUs, GPUs, and accelerators.
14.6 Extended Register State Management
Modern x86 CPUs carry large amounts of extended register state beyond the basic GPRs and x87 FPU. Blindly saving and restoring all of this on every context switch is wasteful — most threads never touch AVX-512 or AMX.
The cost problem:
| State component | Size | Save/restore cost |
|---|---|---|
| x87 + SSE (XMM) | 576 bytes | ~20 ns |
| AVX (YMM) | 256 bytes | ~10 ns |
| AVX-512 (ZMM) | 2048 bytes | ~80 ns |
| Intel AMX (tiles) | 8192 bytes | ~300 ns |
| ARM SVE (Z regs) | 256–8192 bytes (VL-dependent) | ~100–500 ns |
On a server running thousands of threads with microsecond-scale scheduling, 300ns of AMX save/restore overhead per switch is significant.
Lazy XSAVE policy:
ISLE tracks per-thread which extended state components have actually been used via
an xstate_used bitmap that mirrors the hardware XSTATE_BV field:
/// Per-thread extended state tracking.
struct ThreadXState {
/// Bitmap of XSAVE components this thread has used since creation.
/// Mirrors hardware XSTATE_BV layout (bit 0 = x87, bit 1 = SSE, bit 2 = AVX, etc.)
xstate_used: u64,
/// Dynamically-allocated XSAVE area. Starts as None; allocated on first use.
/// Size depends on which components are enabled (CPUID leaf 0xD).
xsave_area: Option<XSaveArea>,
}
Context switch optimization:
On context_switch(prev, next):
1. Determine prev's dirty components: xstate_dirty = prev.xstate_used & XSTATE_MODIFIED_BITS
2. XSAVE only the dirty components (XSAVES with prev.xstate_used as the mask)
— If prev never used AVX-512, the ZMM state is NOT saved (zero cost)
3. XRSTOR next's components (XRSTORS with next.xstate_used as the mask)
— Components not in next's mask are initialized to their reset state by hardware
Modified Optimization:
- If a thread hasn't executed any AVX-512/AMX instruction since the last context switch,
the corresponding XSTATE_BV bits are clear — XSAVES skips those components automatically.
- The kernel does NOT need to track this manually; it falls out of the hardware XSAVE
optimized mode (XSAVES/XRSTORS with INIT optimization).
Init optimization (demand allocation):
Threads that never use extended SIMD pay zero XSAVE cost:
1. New thread starts with xstate_used = 0, xsave_area = None
2. CR0.TS bit is set (or XCR0 is restricted) — any use of SIMD triggers #NM
(Device Not Available exception)
3. #NM handler:
a. Allocate xsave_area (sized per CPUID leaf 0xD for the used components)
b. Set appropriate bits in xstate_used
c. Clear CR0.TS (or extend XCR0)
d. Return — the faulting SIMD instruction re-executes successfully
4. Subsequent SIMD use proceeds without trapping
This means a thread that only does integer arithmetic and memory copies never allocates an XSAVE area and never incurs XSAVE/XRSTOR cost during context switch.
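The demand-allocation path can be modeled as a toy state machine (illustrative; real code programs CR0.TS / XCR0, uses XSAVES, and sizes the area from CPUID leaf 0xD):

```rust
/// Toy model of per-thread lazy XSAVE-area allocation.
struct ThreadXState {
    xstate_used: u64,            // mirrors XSTATE_BV component bits
    xsave_area: Option<Vec<u8>>, // None until first SIMD use
}

const XSTATE_SSE: u64 = 1 << 1;

impl ThreadXState {
    fn new() -> Self {
        // Step 1: no extended state used yet, no allocation.
        ThreadXState { xstate_used: 0, xsave_area: None }
    }

    /// Step 3: the #NM handler body. Allocate on first use, record the
    /// component, then the faulting SIMD instruction re-executes.
    fn on_nm_trap(&mut self, component: u64, area_size: usize) {
        if self.xsave_area.is_none() {
            self.xsave_area = Some(vec![0u8; area_size]); // demand allocation
        }
        self.xstate_used |= component;
    }
}
```

A thread that never triggers the trap keeps `xsave_area = None` for its whole lifetime and is skipped entirely at context switch.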
AMX special case:
Intel AMX tile registers (8KB) are especially expensive. Additional optimization:
- AMX has a TILERELEASE instruction that explicitly marks tile state as unused.
- ISLE's kernel scheduler can hint userspace (via prctl or arch_prctl) to call
TILERELEASE when exiting a compute-intensive section, so the next context switch
doesn't save 8KB of dead tile state.
- If a thread hasn't used AMX tiles in the last N context switches (configurable,
default N=8), the kernel deallocates the AMX portion of the XSAVE area to reclaim
memory.
ARM SVE/SME (AArch64):
ARM's Scalable Vector Extension has a variable vector length (128–2048 bits). The same lazy-allocation strategy applies, with ARM-specific mechanisms:
SVE state components and sizes (VL-dependent):
| Component | Size at VL=128 | Size at VL=512 | Size at VL=2048 |
|------------------|----------------|----------------|-----------------|
| Z registers (Z0-Z31) | 512 bytes | 2048 bytes | 8192 bytes |
| P predicates (P0-P15) | 32 bytes | 128 bytes | 512 bytes |
| FFR (first-fault) | 2 bytes | 8 bytes | 32 bytes |
| Total | 546 bytes | 2184 bytes | 8736 bytes |
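The table rows follow directly from the register counts: 32 Z registers of VL/8 bytes each, 16 predicates of one bit per Z byte each, and an FFR the size of one predicate. A sketch of the computation:

```rust
/// SVE save-area size in bytes for a vector length given in bits.
fn sve_state_bytes(vl_bits: u32) -> u32 {
    let z_bytes = vl_bits / 8;  // one Z register
    let p_bytes = z_bytes / 8;  // one predicate: 1 bit per Z byte
    32 * z_bytes                // Z0–Z31
        + 16 * p_bytes          // P0–P15
        + p_bytes               // FFR
}
```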
Lazy SVE allocation policy:
1. New thread starts with SVE disabled (CPACR_EL1.ZEN = 0b00).
2. First SVE instruction triggers #UND (EL1 undefined instruction trap).
3. #UND handler:
a. Read current VL from ZCR_EL1 (or inherit from parent thread).
b. Allocate SVE save area sized for current VL.
c. Set CPACR_EL1.ZEN = 0b11 (untrap SVE for EL0 and EL1).
d. Return — the faulting SVE instruction re-executes.
4. Context switch saves/restores only if thread has SVE enabled:
a. Check CPACR_EL1.ZEN — if SVE was never used, skip (zero cost).
b. If used: SVE_ST (store Z/P/FFR) and SVE_LD (load) to save area.
c. Cost: proportional to VL, not fixed. VL=128: ~50ns. VL=512: ~200ns.
VL management:
- Per-thread VL via prctl(PR_SVE_SET_VL, new_vl).
- VL change takes effect on next context restore (no immediate effect).
- If new_vl > old_vl, the save area is reallocated (grown, not shrunk).
- System default VL set via sysctl: kernel.sve_default_vl = 256.
SME (Scalable Matrix Extension, ARMv9.2) provides matrix tiles analogous to AMX:
SME state:
- ZA tile register: (SVL/8) x (SVL/8) bytes, where SVL is the Streaming
Vector Length in BITS. At SVL=512 bits: 64x64 = 4KB. At the maximum
SVL=2048 bits: 256x256 = 64KB.
- Streaming SVE mode (SSVE): uses SVE registers at streaming VL.
Lazy SME allocation:
1. SMSTART (enter streaming mode) traps if SMCR_EL1.ENA = 0.
2. Handler allocates ZA storage and enables SME.
3. SMSTOP (exit streaming mode) marks ZA as inactive.
If ZA is inactive for N switches (default 4), deallocate ZA storage.
This matters for memory pressure: ZA at SVL=2048 is 64KB per thread.
Context switch for SME:
- Check PSTATE.SM (streaming mode active) and PSTATE.ZA (ZA active).
- If neither: zero cost.
- If ZA active: save ZA tile (up to 64KB at max SVL=2048). This is expensive.
- The scheduler deprioritizes SME-heavy threads from migration to minimize
ZA save/restore on context switch (locality preference).
ARMv7 VFP/NEON:
ARMv7 extended register state is simpler than x86 or AArch64 — only VFP (Vector Floating Point) and NEON (SIMD) use extended registers:
ARMv7 extended state:
| Component | Size | Save/restore cost |
|------------------|------------|-------------------|
| VFP/NEON (D0-D31)| 256 bytes | ~15-30 ns |
| FPSCR | 4 bytes | ~5 ns |
Total: 260 bytes per thread. Always the same size (no variable-length
extensions like SVE or AVX-512).
Lazy allocation policy for ARMv7:
1. New thread starts with VFP/NEON disabled (FPEXC.EN = 0).
2. First VFP/NEON instruction triggers #UND trap.
3. Handler:
a. Allocate 260-byte VFP save area.
b. Set FPEXC.EN = 1 (enable VFP/NEON).
c. Return — faulting instruction re-executes.
4. Context switch: VSTM/VLDM to save/restore D0-D31 + FPSCR.
Cost is fixed (~30ns) regardless of which registers were used.
ARMv7 has no equivalent of XSAVE's selective save — all 32 double-word
registers are saved/restored as a unit. The fixed 260-byte size means no
dynamic allocation complexity. Threads that never use floating-point or
NEON pay zero VFP save/restore cost (same lazy trap approach as x86 #NM).
RISC-V Vector Extension (RVV):
RISC-V Vector extension (RVV 1.0, ratified 2021) has variable vector length (VLEN: 128–65536 bits), making it the most flexible — and most complex to manage — of any supported architecture:
RVV state components:
| Component | Size at VLEN=128 | Size at VLEN=256 | Size at VLEN=1024 |
|---------------------|------------------|------------------|-------------------|
| V registers (v0-v31)| 512 bytes | 1024 bytes | 4096 bytes |
| vtype, vl, vstart | 24 bytes | 24 bytes | 24 bytes |
| vcsr (vxrm, vxsat) | 8 bytes | 8 bytes | 8 bytes |
| Total | 544 bytes | 1056 bytes | 4128 bytes |
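The same size computation for RVV, matching the table above (the 24- and 8-byte CSR overheads are the vtype/vl/vstart and vcsr rows):

```rust
/// RVV save-area size in bytes for a given VLEN in bits.
fn rvv_state_bytes(vlen_bits: u32) -> u32 {
    let v_bytes = vlen_bits / 8;
    32 * v_bytes // v0–v31
        + 24     // vtype, vl, vstart
        + 8      // vcsr (vxrm, vxsat)
}
```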
Lazy RVV allocation policy:
1. New thread starts with Vector disabled (mstatus.VS = Off).
2. First vector instruction triggers illegal-instruction trap.
3. Handler:
a. Read VLEN from hart capabilities (discovered at boot from DT or CSR).
b. Allocate vector save area: 32 × (VLEN/8) + overhead bytes.
c. Set mstatus.VS = Initial (vector state clean).
d. Return — faulting instruction re-executes.
4. Context switch:
a. Check mstatus.VS — if Off, skip entirely (zero cost).
b. If Dirty: save all 32 V registers using 4× vs8r.v (v0-v7, v8-v15,
v16-v23, v24-v31). The RVV spec maximum whole-register store is 8
registers per instruction; there is no vs32r.v.
c. If Clean: skip save (state hasn't changed since last restore).
d. Restore: 4× vl8re8.v to load all 32 registers from next thread's area.
Total: 8 whole-register instructions (4 stores + 4 loads).
Set mstatus.VS = Clean after restore.
Per-hart VLEN variation:
On heterogeneous RISC-V systems, different harts may have different VLEN.
The scheduler tracks per-hart VLEN (discovered at boot). A thread that
uses vector instructions with VLEN=256 can only run on harts with
VLEN >= 256. This is integrated with the ISA-gating mechanism
(Section 14.5.9): IsaCapabilities includes VLEN range.
Alignment with scheduler hints:
The scheduler already tracks ISA capability usage per-thread (Section 14.5, lines
3053–3059: IsaCapabilities bitflags including X86_AVX512, X86_AMX, ARM_SME).
The XSAVE policy uses the same flags:
- A thread with X86_AVX512 set in its IsaCapabilities is known to use AVX-512.
The scheduler places it on a P-core (avoiding frequency throttling on E-cores).
- The XSAVE subsystem uses the same flag to pre-allocate the ZMM XSAVE area,
avoiding the #NM trap latency on the first AVX-512 instruction for threads that
are known to use it (e.g., after an execve of a binary with AVX-512 in its ELF
.note.gnu.property).
14a. Platform Power Management
Standards: ACPI 6.5 §4 (Power Management), Intel SDM Vol 3B §14.9 (RAPL MSRs), AMD PPR (Zen2+) §2.1.9 (RAPL), ARM Energy Model (Documentation/power/energy-model.rst), IPMI v2.0 §3 / DCMI v1.5. IP status: All interfaces are open standards or documented hardware interfaces. No proprietary implementations referenced.
14a.1 Problem and Scope
Power management is a kernel responsibility — not because policy belongs in the kernel, but because the mechanisms that enforce policy require ring-0 privileges and sub-millisecond response latency:
Why ring-0 is required for power management mechanisms:
- RAPL MSR writes require ring-0 access. Intel and AMD RAPL power-limit registers (e.g., MSR_PKG_POWER_LIMIT at 0x610) are privileged MSRs. A WRMSR instruction executed from ring-3 causes a #GP(0) fault. There is no userspace API that provides equivalent direct hardware control; powercap sysfs writes go through the kernel driver.
- Thermal trip point response must be sub-millisecond. A thermal Critical trip point (typically 5–10 °C below the hardware PROCHOT shutdown temperature) requires an immediate forced poweroff. The kernel cannot wait for a userspace daemon to wake up, read a netlink event, and issue a shutdown ioctl — that path has unbounded latency. The kernel's thermal interrupt handler must act directly.
- cgroup power accounting requires kernel-side energy counter integration. Attributing energy consumption to a cgroup requires reading RAPL energy counters at the same scheduler tick that records CPU time — these are indivisible from a correctness standpoint. A userspace poller cannot atomically correlate energy deltas to the task that was running.
- VM power budgets must be enforced even if the VM misbehaves. A guest OS cannot be trusted to self-limit its power consumption. The hypervisor (isle-kvm) must enforce power caps from outside the VM, using kernel-level RAPL and cgroup mechanisms.
Scope: This section covers mechanisms only:
- RAPL hardware interface abstraction (RaplInterface trait, §14a.2)
- Thermal zone and trip-point framework (ThermalZone, §14a.3)
- Powercap sysfs hierarchy (§14a.4)
- cgroup power accounting and per-cgroup power limits (§14a.5)
- VM watt-budget enforcement (§14a.6)
- DCMI/IPMI rack-level power management (§14a.7)
Policy — which power profile a user selects, when to throttle a VM for economic
reasons, how to balance performance against energy cost — is a userspace/orchestrator
concern. The kernel provides the enforcement hooks; daemons (e.g., tuned, power-profiles-daemon,
isle-kvm's scheduler) invoke them.
14a.2 RAPL — Running Average Power Limit
14a.2.1 Domain Taxonomy
RAPL partitions the platform into named power domains. Each domain has independent power-limit registers and energy-status counters:
| Domain | Scope | Availability |
|---|---|---|
| `Pkg` | Entire CPU socket including uncore (LLC, memory controller, PCIe root complex, and integrated graphics where present) | Intel SNB+, AMD Zen2+ |
| `Core` (PP0) | CPU cores only (excluding uncore). Useful for isolating compute vs memory-bandwidth workloads. | Intel SNB+ |
| `Uncore` (PP1) | Integrated GPU / GT on Intel client SKUs. Not present on server SKUs (Xeon). | Intel client only |
| `Dram` | Memory controller and attached DIMMs. Separate power rail on server platforms. | Intel IVB-EP+, AMD Zen2+ server |
| `Platform` (PSYS) | Entire platform as measured from the charger/PSU side. Introduced on Intel Skylake client. Captures power not visible to PKG (PCH, NVMe, display). | Intel SKL+ client only |
The Core domain is always ≤ Pkg. Platform ≥ Pkg because it includes
peripheral power not counted by the socket energy counter.
14a.2.2 MSR Interface (x86-64 / x86)
Intel RAPL is exposed via Model-Specific Registers readable/writable with
RDMSR/WRMSR from ring-0. The register layout is documented in Intel SDM Vol 3B §14.9.
Key registers for the Pkg domain (other domains follow the same pattern at
different base addresses):
| MSR Address | Name | Direction | Purpose |
|---|---|---|---|
| `0x610` | `MSR_PKG_POWER_LIMIT` | R/W | Set short-window and long-window power limits |
| `0x611` | `MSR_PKG_ENERGY_STATUS` | R | Read cumulative energy counter (32-bit; wrap range depends on the energy unit in `MSR_RAPL_POWER_UNIT`) |
| `0x613` | `MSR_PKG_PERF_STATUS` | R | Throttle duty cycle (fraction of time spent in power throttle) |
| `0x614` | `MSR_PKG_POWER_INFO` | R | Thermal Design Power (TDP), minimum, and maximum power |
MSR_PKG_POWER_LIMIT bit layout:
- Bits 14:0 — Long-window power limit (in hardware power units from MSR_RAPL_POWER_UNIT)
- Bit 15 — Enable long-window limit
- Bit 16 — Clamping enable (allow limit to go below TDP; requires CLAMPING_SUPPORT flag)
- Bits 23:17 — Long-window time window (tau_x, encoded as 2^Y × (1 + Z/4) × time unit, typically ≤ 28 s)
- Bits 31:24 — Reserved
- Bits 46:32 — Short-window power limit
- Bit 47 — Enable short-window limit
- Bit 48 — Short-window clamping enable
- Bits 55:49 — Short-window time window (tau_y, ≤ 10 ms)
- Bits 62:56 — Reserved
- Bit 63 — Lock bit (locks the entire register until next RESET; kernel must not set this)
The short-window limit (tau_y ≤ 10 ms) is the primary mechanism for burst
suppression. The long-window limit (tau_x ≈ 28 s) enforces sustained average power.
Setting both gives a two-tier policy: allow short bursts up to short_limit_W for
up to 10 ms, but enforce long_limit_W on average.
Energy units are encoded in MSR_RAPL_POWER_UNIT (address 0x606). The driver
must read this at boot and convert all values accordingly.
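As an illustration of this conversion, here is a minimal decoder for the `MSR_RAPL_POWER_UNIT` fields (field layout per Intel SDM Vol 3B; the `rdmsr` that would produce `raw` is omitted, and the struct and function names are illustrative, not part of the ISLE API):

```rust
/// Decoded scaling units from MSR_RAPL_POWER_UNIT (0x606).
/// Layout: bits 3:0 = power unit (1/2^n W), bits 12:8 = energy unit
/// (1/2^n J), bits 19:16 = time unit (1/2^n s).
#[derive(Debug, PartialEq)]
pub struct RaplUnits {
    pub power_uw: f64,  // microwatts per hardware power unit
    pub energy_uj: f64, // microjoules per hardware energy unit
    pub time_us: f64,   // microseconds per hardware time unit
}

pub fn decode_rapl_units(raw: u64) -> RaplUnits {
    let pu = (raw & 0xF) as u32;         // bits 3:0
    let eu = ((raw >> 8) & 0x1F) as u32; // bits 12:8
    let tu = ((raw >> 16) & 0xF) as u32; // bits 19:16
    RaplUnits {
        power_uw: 1_000_000.0 / (1u64 << pu) as f64,
        energy_uj: 1_000_000.0 / (1u64 << eu) as f64,
        time_us: 1_000_000.0 / (1u64 << tu) as f64,
    }
}
```

With the common default raw value `0x000A0E03`, this yields 125 000 µW power units, ~61.04 µJ energy units, and ~976.6 µs time units.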
14a.2.3 AMD Equivalent
AMD Zen2 and later processors implement RAPL-compatible energy counters in AMD-specific MSRs (`MSR_AMD_RAPL_POWER_UNIT` at `0xC0010299`, core and package energy status at `0xC001029A`/`0xC001029B`) using the same unit-encoding scheme as Intel's `MSR_RAPL_POWER_UNIT`. Energy reading therefore shares a driver core with Intel; power-limit writes on AMD go through the SMU rather than a writable MSR.
Older AMD processors (pre-Zen2) lack these energy MSRs and expose power telemetry only through the System Management Unit (SMU), a co-processor reached via PCI config space (the exact device/function varies by generation). The SMU interface is not publicly documented; ISLE follows the same interface as the corresponding Linux hwmon drivers (drivers/hwmon/amd_energy.c for Zen MSR energy, drivers/hwmon/fam15h_power.c for older parts).
The RaplInterface abstraction (§14a.2.5) hides these differences from upper layers.
14a.2.4 ARM and RISC-V Equivalents
ARM Energy Model (EM): ARM SoCs do not expose hardware energy counters equivalent
to RAPL. Instead, the ARM Energy Model framework provides estimated power consumption
based on empirically measured power coefficients per CPU frequency operating point
(OPP). Each OPP has a power_mW coefficient stored in the device tree
(operating-points-v2 table). The kernel integrates over active OPPs to estimate
energy. This is less accurate than RAPL but enables the same cgroup accounting
interface (§14a.5).
RISC-V: There is no standardised RAPL equivalent in the RISC-V ISA or the SBI
specification as of SBI v2.0. Platform-specific power management is exposed via
vendor SBI extensions (e.g., T-HEAD/Alibaba extensions for their RISC-V SoCs).
ISLE implements a NoopRaplInterface for RISC-V that returns
PowerError::NotSupported for all limit-setting operations and provides zero energy
readings. Cgroup accounting falls back to CPU-time-weighted estimation.
14a.2.5 Kernel Abstraction
All RAPL consumers (cgroup accounting, VM power budgets, thermal passive cooling,
DCMI enforcement) interact with power domains through the RaplInterface trait,
never touching MSRs directly:
/// The type of a RAPL power domain.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum PowerDomainType {
    /// Entire CPU socket including uncore (LLC, memory controller, PCIe root).
    Pkg,
    /// CPU cores only (PP0). Excludes uncore.
    Core,
    /// Integrated GPU / uncore accelerator (PP1). Intel client only.
    Uncore,
    /// Memory controller and attached DIMMs. Server platforms only.
    Dram,
    /// Entire platform power as measured from charger/PSU. Intel SKL+ client only.
    Platform,
}
/// A single power domain with its hardware interface and energy accumulator.
pub struct PowerDomain {
    /// The type of power domain this represents.
    pub domain_type: PowerDomainType,
    /// The hardware driver implementing this domain's register interface.
    pub hw_interface: Arc<dyn RaplInterface>,
    /// Cumulative energy consumed by this domain in microjoules.
    ///
    /// Updated periodically by the kernel power accounting thread (§14a.5).
    /// Wraps at `u64::MAX`; readers must handle wrap-around by tracking deltas.
    pub energy_uj: AtomicU64,
    /// Socket index (0-based) this domain belongs to.
    pub socket_id: u32,
}
/// Hardware interface for reading and controlling a RAPL power domain.
///
/// Implementations exist for: Intel MSR (`IntelRaplMsr`), AMD MSR/SMU
/// (`AmdRaplInterface`), ARM Energy Model (`ArmEmInterface`), and no-op
/// (`NoopRaplInterface` for platforms without hardware support).
///
/// # Safety
///
/// Implementations that write MSRs must only do so from ring-0 kernel context.
/// MSR writes from interrupt context are permitted but must be idempotent and
/// must not acquire locks that could be held by non-interrupt code.
pub trait RaplInterface: Send + Sync {
    /// Set a power limit on the given domain.
    ///
    /// `limit_mw` is the power limit in milliwatts.
    /// `window_ms` is the averaging window in milliseconds. Hardware may
    /// round to the nearest supported window; callers must not assume exact values.
    ///
    /// Returns `PowerError::NotSupported` if the domain or windowed limiting
    /// is not available on this platform.
    fn set_power_limit(
        &self,
        domain: PowerDomainType,
        limit_mw: u32,
        window_ms: u32,
    ) -> Result<(), PowerError>;

    /// Remove a previously set power limit on the given domain, restoring
    /// the hardware default (TDP-derived limit).
    ///
    /// Returns `PowerError::NotSupported` if the domain is not available.
    fn clear_power_limit(&self, domain: PowerDomainType) -> Result<(), PowerError>;

    /// Read the cumulative energy consumed by the given domain in microjoules.
    ///
    /// The counter wraps at `max_energy_range_uj()`. Callers must track
    /// previous values and compute deltas to handle wrap-around correctly.
    ///
    /// Returns `PowerError::NotSupported` if the domain is not available.
    fn read_energy_uj(&self, domain: PowerDomainType) -> Result<u64, PowerError>;

    /// Read the Thermal Design Power (TDP) of the given domain in milliwatts.
    ///
    /// This is the sustained power level the platform is designed to dissipate.
    /// It is used as the upper bound for VM admission control (§14a.6).
    ///
    /// Returns `PowerError::NotSupported` if TDP information is not available.
    fn read_tdp_mw(&self, domain: PowerDomainType) -> Result<u32, PowerError>;

    /// Return the maximum value of the energy counter before it wraps, in microjoules.
    ///
    /// Callers use this to correctly handle wrap-around in `read_energy_uj`.
    fn max_energy_range_uj(&self, domain: PowerDomainType) -> Result<u64, PowerError>;
}
/// Errors returned by `RaplInterface` operations.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PowerError {
    /// The requested domain or operation is not supported on this platform.
    NotSupported,
    /// The requested power limit is below the hardware minimum or above the TDP.
    OutOfRange { min_mw: u32, max_mw: u32 },
    /// MSR or SMU access failed (hardware error or driver not initialised).
    HardwareFault,
    /// The domain's power limit register is locked until next RESET.
    Locked,
}
The platform boot sequence probes for available RAPL domains (by attempting
RDMSR and checking for #GP) and registers each discovered domain with the
global PowerDomainRegistry. Upper layers iterate the registry rather than
hard-coding which domains exist.
14a.3 Thermal Framework
14a.3.1 Thermal Zones
A thermal zone is a region of the system that has one or more temperature sensors and a set of trip points. Physical examples:
- CPU die (one per socket; typically uses the `TCONTROL` MSR or PECI for temperature)
- GPU die (integrated or discrete)
- Battery (reported via ACPI `_BTP` or Smart Battery System)
- Skin/chassis (NTC thermistor on laptop lid; used to prevent burns)
- NVMe drive (SMART temperature, reported via hwmon §11.2.1)
/// A thermal zone: a named region with a temperature sensor and trip points.
pub struct ThermalZone {
    /// Human-readable name (e.g., `"cpu0-die"`, `"battery"`, `"skin"`).
    /// Must be unique within the system. Used as the sysfs directory name.
    pub name: &'static str,
    /// The temperature sensor for this zone.
    pub sensor: Arc<dyn TempSensor>,
    /// Ordered list of trip points, sorted by `temp_mc` ascending.
    ///
    /// The thermal monitor evaluates all trip points on each poll cycle and
    /// fires actions for any whose threshold has been crossed.
    pub trip_points: Vec<TripPoint>,
    /// Cooling devices bound to this zone with their maximum cooling state
    /// and the trip point(s) that activate them.
    pub cooling_devices: Vec<CoolingBinding>,
    /// Current polling interval in milliseconds.
    ///
    /// Starts at 1000 ms (normal), drops to 100 ms when the zone temperature
    /// is within 5 °C of any trip point, and drops to 10 ms when within 1 °C
    /// of a `Hot` or `Critical` trip point.
    pub polling_interval_ms: AtomicU32,
}
14a.3.2 Trip Points
A trip point is a temperature threshold with an associated action type:
/// A temperature threshold that triggers a thermal action when crossed.
pub struct TripPoint {
    /// Temperature at which this trip point fires, in millidegrees Celsius.
    ///
    /// For example, 95000 = 95 °C.
    pub temp_mc: i32,
    /// The action to take when this trip point is crossed.
    pub trip_type: TripType,
    /// Hysteresis in millidegrees Celsius.
    ///
    /// The trip point is considered cleared only when the temperature drops
    /// below `temp_mc - hysteresis_mc`. This prevents oscillation around the
    /// threshold. Typical value: 2000 (2 °C).
    pub hysteresis_mc: i32,
}
/// The action taken when a thermal trip point threshold is crossed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TripType {
    /// Reduce power consumption by notifying the cpufreq governor to lower
    /// the maximum CPU frequency. Does not forcefully reduce frequency;
    /// relies on the governor to converge. This is the primary mechanism
    /// for sustained thermal management.
    Passive,
    /// Activate a cooling device (e.g., spin up a fan to a higher speed).
    /// The bound `CoolingDevice` is set to its next higher state.
    Active,
    /// The temperature has reached a dangerous level. Post a `ThermalEvent`
    /// to userspace monitoring daemons via the thermal netlink socket.
    /// Userspace may respond by reducing workload. No kernel-side action.
    Hot,
    /// Emergency condition. The kernel immediately forces a system poweroff
    /// (equivalent to `kernel_power_off()`). This happens synchronously in
    /// the thermal interrupt handler or poll loop — userspace is not consulted.
    /// Data integrity is not guaranteed; this is a last resort before hardware
    /// thermal shutdown.
    Critical,
}
The Critical trip point is typically set 5–10 °C below the hardware's own
PROCHOT# shutdown temperature to give the kernel a chance to shut down cleanly
(flushing journal, unmounting filesystems) before the hardware forcibly powers off.
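A minimal sketch of how a poll cycle might evaluate one trip point with hysteresis — the `TripState` and `evaluate_trip` names are hypothetical helpers, not part of the API above:

```rust
/// Hysteresis state for one trip point: fires when the temperature rises to
/// `temp_mc`, clears only when it falls below `temp_mc - hysteresis_mc`.
pub struct TripState {
    pub temp_mc: i32,
    pub hysteresis_mc: i32,
    pub active: bool,
}

/// Returns Some(true) on a rising crossing, Some(false) when the trip
/// clears, and None when the state is unchanged.
pub fn evaluate_trip(state: &mut TripState, current_mc: i32) -> Option<bool> {
    if !state.active && current_mc >= state.temp_mc {
        state.active = true;
        Some(true)
    } else if state.active && current_mc < state.temp_mc - state.hysteresis_mc {
        state.active = false;
        Some(false)
    } else {
        None // inside the hysteresis band, or far from the threshold
    }
}
```

Note that with a 2 °C hysteresis, a zone that peaks at 95.5 °C and settles at 94 °C stays in the active state, so the bound cooling device is not stepped down prematurely.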
14a.3.3 Cooling Devices
A cooling device is something the kernel can actuate to reduce heat generation or increase heat dissipation:
/// A device that can reduce thermal load on a thermal zone.
///
/// Cooling states are represented as integers from 0 (no cooling) to
/// `max_state()` (maximum cooling). The mapping from state number to physical
/// action is device-specific.
///
/// # Examples
///
/// - `CpufreqCooler`: state 0 = max frequency, state N = minimum frequency.
/// - `FanCooler`: state 0 = fan off, state N = 100% PWM duty cycle.
pub trait CoolingDevice: Send + Sync {
    /// Return the maximum cooling state this device supports.
    ///
    /// The device can be set to any state in `[0, max_state()]`.
    fn max_state(&self) -> u32;

    /// Return the current cooling state.
    fn current_state(&self) -> u32;

    /// Set the cooling state to `state`.
    ///
    /// Must be idempotent if `state == current_state()`.
    /// Returns `ThermalError::OutOfRange` if `state > max_state()`.
    fn set_state(&self, state: u32) -> Result<(), ThermalError>;

    /// Human-readable name for this cooling device (e.g., `"cpufreq-cpu0"`,
    /// `"fan-chassis0"`). Used as the sysfs `type` file content.
    fn name(&self) -> &'static str;
}
/// Binding between a thermal zone and a cooling device.
pub struct CoolingBinding {
    /// The cooling device to actuate.
    pub device: Arc<dyn CoolingDevice>,
    /// The trip point index (into `ThermalZone::trip_points`) that activates
    /// this binding. The cooling device is stepped up one state each time the
    /// thermal zone crosses this trip point.
    pub trip_point_index: usize,
    /// The cooling state to apply when the trip point is in the active
    /// (crossed) state. When the zone cools below `temp_mc - hysteresis_mc`,
    /// the device is stepped back down toward 0.
    pub target_state: u32,
}
Standard cooling device types provided by ISLE:
| Type | Description | State mapping |
|---|---|---|
| `CpufreqCooler` | Limits max CPU frequency via cpufreq (§14.5.5) | 0 = cpu_max_freq, N = cpu_min_freq, linear interpolation |
| `GpufreqCooler` | Limits max GPU frequency via drm/gpu driver | Same as above |
| `FanCooler` | Sets fan PWM duty cycle via hwmon (§11.2.1) | 0 = fan off, max_state() = 100% PWM |
| `UsbCurrentCooler` | Reduces USB charging current to lower battery heat | 0 = max current, N = 0 mA |
| `RaplCooler` | Reduces RAPL PKG limit directly | 0 = TDP, N = minimum supported limit |
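The linear interpolation used by `CpufreqCooler` can be sketched as a pure function (illustrative only; the real driver would push the result through cpufreq policy limits):

```rust
/// Map a cooling state in [0, max_state] onto a CPU frequency cap,
/// linearly interpolated between max_khz (state 0, no cooling) and
/// min_khz (state max_state, maximum cooling).
pub fn state_to_freq_khz(state: u32, max_state: u32, min_khz: u32, max_khz: u32) -> u32 {
    assert!(state <= max_state && min_khz <= max_khz && max_state > 0);
    let span = (max_khz - min_khz) as u64;
    // Widen to u64 so the multiply cannot overflow for realistic kHz values.
    max_khz - (span * state as u64 / max_state as u64) as u32
}
```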
14a.3.4 Cooling Map Discovery
The binding between thermal zones and cooling devices is discovered at boot from:
- ACPI: `_TZD` (thermal zone devices), `_PSL` (passive cooling list), `_AL0`–`_AL9` (active cooling lists). The ACPI thermal driver evaluates these control methods and populates the `cooling_devices` list in each `ThermalZone`.
- Device tree: `cooling-maps` node under the thermal zone node (binding documented in the Linux kernel's `Documentation/devicetree/bindings/thermal/thermal-zones.yaml`). ISLE parses this during DTB processing (§3.2).
- Static board description: For platforms without ACPI or DTB thermal tables, a board-specific Rust module in `isle-kernel/src/arch/` can register zones and bindings at compile time.
14a.3.5 Polling and Interrupt-Driven Monitoring
The thermal monitor uses two mechanisms:
Polling (always available): A kernel timer fires periodically to call
TempSensor::read_temp_mc() and evaluate all trip points. The polling interval
is adaptive:
| Temperature distance from nearest trip point | Polling interval |
|---|---|
| > 5 °C below any trip point | 1000 ms |
| 1–5 °C below a `Passive` or `Active` trip | 100 ms |
| < 1 °C below a `Hot` or `Critical` trip | 10 ms |
Interrupt-driven (when available): Some platforms provide hardware thermal interrupts that fire when a temperature threshold is crossed:
- Intel PROCHOT interrupt: the CPU asserts `PROCHOT#` when the die temperature reaches the factory-programmed limit. The kernel registers an interrupt handler on APIC vector `0xFA` (Linux convention for the thermal LVT). This fires before RAPL-based throttling takes effect.
- AMD SB-TSI alert: an SMBus alert from the SB-TSI temperature sensor on AMD platforms. Handled by the `amd_sb_tsi` I2C driver.
- ACPI `_HOT`/`_CRT` notify: the firmware sends an ACPI notify event when a thermal zone crosses its `Hot` or `Critical` temperature. The ACPI event handler evaluates the zone immediately rather than waiting for the next poll cycle.
Interrupt-driven monitoring reduces the latency from temperature threshold crossing to kernel response from ≤ 1000 ms (polling) to ≤ 100 µs (interrupt).
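The adaptive interval selection from the table above might look like this (hypothetical helper mirroring the `ThermalZone::polling_interval_ms` policy; distances are in millidegrees and go negative once a trip has been crossed, which correctly selects the fastest interval):

```rust
/// Pick the polling interval from the distance to the nearest trip point.
/// `dist_to_any_mc`: millidegrees below the nearest trip of any type.
/// `dist_to_crit_mc`: millidegrees below the nearest Hot/Critical trip.
pub fn polling_interval_ms(dist_to_any_mc: i32, dist_to_crit_mc: i32) -> u32 {
    if dist_to_crit_mc < 1_000 {
        10 // within 1 °C of a Hot/Critical trip
    } else if dist_to_any_mc < 5_000 {
        100 // within 5 °C of any trip
    } else {
        1_000 // normal operation
    }
}
```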
14a.3.6 Temperature Sensor Abstraction
/// A hardware temperature sensor.
///
/// Implementations include: x86 PECI (Platform Environment Control Interface),
/// ACPI `_TMP` control method, I2C/SMBus sensors (LM75, TMP102, etc.),
/// and ARM SoC on-die sensors.
pub trait TempSensor: Send + Sync {
    /// Read the current temperature in millidegrees Celsius.
    ///
    /// Returns `ThermalError::SensorFault` if the hardware sensor reports
    /// an error condition (e.g., I2C NACK, PECI timeout).
    fn read_temp_mc(&self) -> Result<i32, ThermalError>;

    /// Human-readable name for this sensor (e.g., `"peci-cpu0"`, `"acpi-tz0"`).
    fn name(&self) -> &'static str;
}
/// Errors returned by thermal framework operations.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ThermalError {
    /// The sensor or cooling device is not available or not initialised.
    NotAvailable,
    /// The sensor returned an error condition (hardware fault or communication error).
    SensorFault,
    /// The requested cooling state is outside `[0, max_state()]`.
    OutOfRange,
    /// The cooling device is currently locked by another subsystem.
    DeviceBusy,
}
14a.3.7 Linux sysfs Compatibility
ISLE exposes the thermal framework under the same sysfs paths as the Linux kernel thermal framework, enabling unmodified Linux monitoring tools:
/sys/class/thermal/
    thermal_zone0/
        type                # zone name (e.g., "x86_pkg_temp")
        temp                # current temperature in millidegrees (e.g., "52000")
        mode                # "enabled" or "disabled"
        trip_point_0_temp   # first trip point temperature
        trip_point_0_type   # "passive", "active", "hot", or "critical"
        trip_point_0_hyst   # hysteresis in millidegrees
        policy              # cooling policy: "step_wise" or "user_space"
    cooling_device0/
        type                # cooling device name (e.g., "Processor")
        max_state           # maximum cooling state
        cur_state           # current cooling state
The type file content for CPU cooling devices uses the string "Processor"
for compatibility with lm_sensors, thermald, and similar tools that match
on this string.
14a.4 Powercap Interface (sysfs)
The powercap sysfs hierarchy provides a unified interface for reading energy
counters and setting power limits. ISLE's layout is byte-for-byte compatible with
Linux's intel_rapl_msr driver output, ensuring that existing power monitoring
and management tools work without modification.
14a.4.1 Directory Structure
/sys/devices/virtual/powercap/
    intel-rapl/                          # Control type: Intel RAPL
        intel-rapl:0/                    # Socket 0 PKG domain
            name                         # "package-0"
            energy_uj                    # Cumulative energy (µJ, read-only, wraps)
            max_energy_range_uj          # Counter wrap value in µJ
            constraint_0_name            # "long_term"
            constraint_0_power_limit_uw  # Long-window limit in µW (read-write)
            constraint_0_time_window_us  # Long-window duration in µs (read-write)
            constraint_0_max_power_uw    # Maximum settable limit (TDP) in µW
            constraint_1_name            # "short_term"
            constraint_1_power_limit_uw  # Short-window limit in µW (read-write)
            constraint_1_time_window_us  # Short-window duration in µs (read-write)
            constraint_1_max_power_uw    # Maximum settable short-term limit
            enabled                      # "1" to enable limits, "0" to disable
            intel-rapl:0:0/              # Socket 0 Core (PP0) sub-domain
                name                     # "core"
                energy_uj
                max_energy_range_uj
                constraint_0_name        # "long_term"
                constraint_0_power_limit_uw
                constraint_0_time_window_us
                constraint_0_max_power_uw
                enabled
            intel-rapl:0:1/              # Socket 0 Uncore (PP1) sub-domain (client only)
                name                     # "uncore"
                ...
        intel-rapl:1/                    # Socket 1 PKG domain (dual-socket servers)
            ...
The DRAM domain appears as an additional sub-domain on server platforms:
            intel-rapl:0:2/              # DRAM sub-domain of socket 0
                name                     # "dram"
On AMD Zen2+ systems, the same layout is used with the control type still named
intel-rapl for compatibility (Linux uses the same driver name). AMD-specific
extensions (if any) appear in an amd-rapl control type directory.
14a.4.2 Tool Compatibility
The following tools work against ISLE's powercap hierarchy without modification:
| Tool | Use |
|---|---|
| `powerstat` | Per-socket power consumption over time |
| `turbostat` | CPU frequency, power, and temperature combined |
| `s-tui` | Terminal UI showing frequency and power |
| `powertop` | Process-level power attribution (uses /proc, not powercap, but reads `energy_uj`) |
| Prometheus `node_exporter` | `--collector.powersupplyclass` and powercap collector |
| `rapl-read` | Low-level RAPL register dump |
14a.4.3 Write Semantics
Writing constraint_N_power_limit_uw calls RaplInterface::set_power_limit() on the
corresponding PowerDomain. Writes from unprivileged userspace are rejected with
EPERM. Root (or a process with CAP_SYS_ADMIN) may write any domain.
Writing a limit that exceeds the domain's constraint_N_max_power_uw returns EINVAL.
Writing 0 is equivalent to calling RaplInterface::clear_power_limit() (removes the
software limit, restoring hardware default).
14a.5 Cgroup Power Accounting
14a.5.1 Design
Energy consumption is attributed to cgroups using a sampling-based model that
parallels CPU time accounting. A dedicated kernel thread (the power accounting
thread) wakes every 10 ms (configurable via /proc/sys/kernel/power_sample_interval_ms,
range 1–1000 ms) and:
- Reads `energy_uj` from all active RAPL domains (all sockets, all sub-domains).
- Computes the delta from the previous sample, handling counter wrap-around.
- Queries the scheduler to get, for each cgroup, the CPU time consumed in the last 10 ms interval.
- Distributes the energy delta across cgroups proportional to their CPU time share.
- Accumulates the attributed energy into each cgroup's `power.energy_uj` counter.
This is the same time-weighted attribution model used by Linux's cpuacct cgroup
controller for CPU time and, more recently, by the Energy Aware Scheduling (EAS) work in the Linux scheduler.
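The wrap-around handling in the delta step can be sketched as follows (assuming at most one wrap per sampling interval, which the 10 ms sample rate guarantees in practice; the function name is illustrative):

```rust
/// Compute the energy delta between two counter samples, handling a single
/// wrap-around of a counter that wraps at `max_range_uj`
/// (cf. RaplInterface::max_energy_range_uj).
pub fn energy_delta_uj(prev: u64, now: u64, max_range_uj: u64) -> u64 {
    if now >= prev {
        now - prev
    } else {
        // Counter wrapped between samples: remaining headroom plus new count.
        (max_range_uj - prev) + now
    }
}
```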
14a.5.2 Attribution Model
Let E_delta be the total PKG energy delta in the current interval (µJ),
and let T_i be the CPU time consumed by cgroup i in the interval (µs).
The energy attributed to cgroup i is:
E_i = E_delta × (T_i / Σ T_j)
where the sum is over all cgroups with T_j > 0. Idle time (no cgroup running)
is attributed to a synthetic idle cgroup and not charged to any user cgroup.
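A sketch of the attribution step in integer arithmetic (no FPU in kernel context; the function name and slice-based interface are illustrative):

```rust
/// Distribute a total energy delta across cgroups in proportion to CPU time:
/// E_i = E_delta × T_i / Σ T_j. Rounding remainders are dropped, consistent
/// with the accounting (not precise metering) accuracy goal.
pub fn attribute_energy(e_delta_uj: u64, cpu_time_us: &[u64]) -> Vec<u64> {
    let total: u128 = cpu_time_us.iter().map(|&t| t as u128).sum();
    if total == 0 {
        // Idle interval: charge nothing to user cgroups.
        return vec![0; cpu_time_us.len()];
    }
    cpu_time_us
        .iter()
        // Widen to u128 so the multiply cannot overflow.
        .map(|&t| (e_delta_uj as u128 * t as u128 / total) as u64)
        .collect()
}
```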
Limitation: This model has two known imprecisions:
- It does not account for memory bandwidth differences between cgroups sharing a socket. A cgroup running a memory-bandwidth-intensive workload consumes more power per CPU cycle than one running a compute-bound workload, but they receive the same energy charge per CPU time unit. This is acceptable for accounting and billing; it is not suitable for precise per-process energy metering.
- On a multi-socket server, `Pkg` energy from socket 0 may be attributed to a cgroup whose threads ran on socket 1 if the sampling window captures a migration. The error is bounded by one sampling interval (10 ms default).
14a.5.3 Cgroup Interface
The power cgroup controller provides the following files:
| File | Mode | Description |
|---|---|---|
| `power.energy_uj` | R | Cumulative energy attributed to this cgroup in µJ. Wraps at `u64::MAX`. |
| `power.stat` | R | Per-domain energy breakdown: `pkg_energy_uj`, `core_energy_uj`, `dram_energy_uj`. |
| `power.limit_uw` | RW | Power limit for this cgroup in µW. 0 = no limit. Setting a non-zero value enables power cap enforcement (§14a.5.4). |
| `power.limit_window_ms` | RW | Averaging window for `power.limit_uw` enforcement, in ms. Default: 100. |
These files are created under the cgroup hierarchy directory, e.g.:
/sys/fs/cgroup/my-vm/power.energy_uj
/sys/fs/cgroup/my-vm/power.limit_uw
14a.5.4 Per-Cgroup Power Limit Enforcement
When power.limit_uw is non-zero, the power accounting thread checks, at each
sample interval, whether the cgroup's rolling-average power consumption (calculated
from power.energy_uj deltas over power.limit_window_ms) exceeds the limit.
If the limit is exceeded:
- The cgroup's effective power limit is reduced proportionally to bring the cgroup's power consumption within budget. This is implemented by adjusting the `cpu.max` bandwidth (§15) for the cgroup's tasks — reducing their CPU time allocation reduces their power consumption.
- A `PowerLimitEvent` is posted to the cgroup's event fd (readable via `cgroup.events`), allowing userspace monitoring daemons to observe throttling.
Limitation: RAPL enforcement at sub-PKG granularity (per-cgroup, per-core) is
not directly supported by hardware. Per-cgroup limits are enforced indirectly via
CPU time throttling (§15). True per-cgroup hardware power isolation would require
per-core RAPL (available on some Intel Xeon generations as MSR_PP0_POWER_LIMIT)
combined with strict core pinning — a configuration that isle-kvm uses for VM
power budgets (§14a.6), but which is not the general case.
14a.6 VM Power Budget Enforcement
14a.6.1 Motivation
Traditional VM resource accounting models (CPU cores, RAM) do not capture actual power consumption. A VM running a STREAM memory-bandwidth benchmark or a dense linear algebra kernel (e.g., BLAS DGEMM with AVX-512) can consume 2–3× the power of a VM running a web server at equivalent CPU utilisation. In a datacenter where the binding constraint is rack PDU amperage, not CPU cores, CPU-count quotas systematically mis-model the actual cost of workloads.
Watt-based quotas reflect actual rack power budget more honestly:
- A 500W rack PDU can host either ten 50W VMs or five 100W VMs regardless of how many vCPUs each is assigned.
- A burst-capable VM (bursty ML inference job) can be allocated 80W sustained with a 150W burst cap for 10 ms — mirroring the RAPL two-tier limit model.
- Overcommit is detectable and rejectable at admission time by comparing `sum(vm_power_limit_mw)` against measured or rated socket TDP.
14a.6.2 Mechanism
When isle-kvm creates a VM with a vm_power_limit_mw budget:
- Dedicated cgroup: A cgroup is created at `/sys/fs/cgroup/isle-vms/<vm-id>/` for the VM's vCPU threads. `power.limit_uw` is set to `vm_power_limit_mw * 1000`.
- Core pinning: The VM's vCPU threads are pinned to a CPU set on a single socket (or across sockets if `vm_numa_topology` specifies multi-socket). This ensures energy counter attribution is accurate (§14a.5.2, limitation 2).
- Socket RAPL coordination: The PKG short-window limit for each socket is set to `sum(vm_power_limit_mw for all VMs pinned to that socket) + headroom_mw`, where `headroom_mw` is a configurable per-socket constant (default: 10% of TDP) reserved for host kernel overhead.
- Monitoring: The `VmPowerBudget::update()` method is called by isle-kvm's power accounting thread every 100 ms. If the VM exceeds its budget, vCPU scheduling quota is reduced.
/// Power budget tracking for a single VM.
pub struct VmPowerBudget {
    /// Allocated sustained power budget for this VM in milliwatts.
    pub limit_mw: u32,
    /// Allocated burst power limit in milliwatts, enforced for windows ≤ 10 ms.
    ///
    /// Maps to the RAPL short-window limit. Set to `limit_mw` if no burst
    /// allowance is configured (conservative mode).
    pub burst_limit_mw: u32,
    /// Measured average power consumption over the last 1-second sliding window,
    /// in milliwatts. Updated by `update()` every 100 ms.
    pub measured_mw: AtomicU32,
    /// Number of times vCPU quota was reduced due to power budget violation.
    ///
    /// Monotonically increasing. Used for throttle-rate monitoring and alerting.
    pub throttle_count: AtomicU64,
    /// Handle to the cgroup backing this VM's power accounting and enforcement.
    cgroup: CgroupHandle,
}
impl VmPowerBudget {
    /// Called every 100 ms by isle-kvm's power accounting thread.
    ///
    /// Reads the energy delta from the VM's cgroup, updates `measured_mw`,
    /// and reduces vCPU scheduling quota if the budget is exceeded.
    ///
    /// `sched` is the CPU bandwidth controller for this VM's vCPU threads (§15).
    pub fn update(&self, sched: &CpuBandwidth) {
        let delta_uj = self.cgroup.read_energy_delta_uj();
        // 100 ms window: delta_uj / 100 → µJ/ms = mW.
        let measured = (delta_uj / 100) as u32;
        self.measured_mw.store(measured, Ordering::Relaxed);
        if measured > self.limit_mw {
            // Reduce vCPU CBS quota proportional to overage fraction.
            // Uses fixed-point arithmetic (per-mille = percentage × 10) to avoid FPU.
            // Example: measured = 120 mW, limit = 100 mW →
            // overage = 20, reduction_permille = 20 * 1000 / 120 = 166 (16.6%).
            let overage = measured - self.limit_mw;
            let reduction_permille = overage * 1000 / measured;
            sched.reduce_quota_permille(reduction_permille);
            self.throttle_count.fetch_add(1, Ordering::Relaxed);
        }
    }
}
14a.6.3 Admission Control
Before isle-kvm creates a new VM with a vm_power_limit_mw budget, it checks
the admission constraint:
sum(vm.limit_mw for all VMs on target socket) + new_vm.limit_mw
≤ socket_tdp_mw - host_headroom_mw
where socket_tdp_mw is read from RaplInterface::read_tdp_mw(PowerDomainType::Pkg)
at boot, and host_headroom_mw defaults to 10% of TDP (configurable via
/sys/module/isle_kvm/parameters/power_headroom_pct).
If the constraint is violated, isle-kvm returns ENOSPC to the VM creation ioctl.
The caller (e.g., an orchestrator) must either reduce the requested budget, migrate
an existing VM to another socket/host, or reject the workload.
This is an admission control gate, not a guarantee. Actual power consumption may exceed TDP temporarily due to:
- Turbo Boost / AMD Precision Boost (transient power above TDP for ≤ 10 ms)
- RAPL enforcement latency (the hardware enforces limits over the configured window; instantaneous power can spike)
These transient overages are expected and handled by the hardware's own thermal and power-delivery circuitry. ISLE's admission control operates at the sustained (long-window) level.
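The admission constraint above can be sketched as a free function (hypothetical; in ISLE this logic would live in isle-kvm's VM-creation path, with `Err` mapping to `ENOSPC` at the ioctl boundary):

```rust
/// Admission check for a new VM power budget on a socket:
///   Σ existing + new ≤ socket_tdp − headroom,
/// with headroom defaulting to 10% of TDP.
pub fn admit_vm(existing_mw: &[u32], new_mw: u32, socket_tdp_mw: u32) -> Result<(), ()> {
    let headroom_mw = socket_tdp_mw / 10; // default 10% host headroom
    let committed: u64 = existing_mw.iter().map(|&m| m as u64).sum();
    if committed + new_mw as u64 <= (socket_tdp_mw - headroom_mw) as u64 {
        Ok(())
    } else {
        Err(()) // caller reduces the budget, migrates a VM, or rejects
    }
}
```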
14a.6.4 Observability
Per-VM power accounting is exposed via:
/sys/fs/cgroup/isle-vms/<vm-id>/power.energy_uj # Cumulative energy (µJ)
/sys/fs/cgroup/isle-vms/<vm-id>/power.stat # Per-domain breakdown
/sys/fs/cgroup/isle-vms/<vm-id>/power.limit_uw # Current limit (µW)
isle-kvm also exposes power metrics via the KVM statistics interface
(/sys/bus/event_source/devices/kvm/), enabling Prometheus node_exporter
KVM collector to report per-VM power consumption.
14a.7 DCMI / IPMI Rack Power Management
14a.7.1 Overview
In server deployments managed by a Baseboard Management Controller (BMC), the BMC may impose a platform-level power cap via the Data Center Manageability Interface (DCMI), an extension of IPMI v2.0 (specification: DCMI v1.5, published by Intel/DMTF).
DCMI provides the following power management commands over the IPMI channel:
| DCMI Command | NetFn/Cmd | Description |
|---|---|---|
| Get Power Reading | `2C/02h` | Current platform power in watts (instantaneous, min, max, average over a rolling window) |
| Get Power Limit | `2C/03h` | Read the currently configured platform power cap |
| Set Power Limit | `2C/04h` | Set a platform power cap and exception action (hard power-off or OEM-defined) |
| Activate/Deactivate Power Limit | `2C/05h` | Enable or disable the configured power cap |
| Get DCMI Capabilities | `2C/01h` | Enumerate which DCMI features the BMC supports |
These commands are sent by the datacenter management infrastructure (e.g., OpenBMC, Redfish, Dell iDRAC, HP iLO) to impose a rack-level power budget on individual servers.
14a.7.2 ISLE Integration
The ISLE IPMI driver (Tier 1; KCS, SMIC, or BT host interfaces, attached via
I2C or LPC, as described in §11.2) handles incoming DCMI commands from the BMC.
When the BMC asserts a power cap via Set Power Limit + Activate Power Limit,
the kernel responds as follows:
BMC sets cap C_bmc (watts)
│
▼
isle-ipmi driver receives DCMI Set/Activate Power Limit
│
├─► Reduce aggregate RAPL PKG limits across all sockets
│ new_pkg_limit = C_bmc / num_sockets (naive; TODO: NUMA-aware split)
│ → RaplInterface::set_power_limit(Pkg, new_pkg_limit, long_window_ms)
│
├─► Notify isle-kvm to reduce VM watt budgets proportionally
│ reduction_factor = C_bmc / current_total_vm_budget
│ → for each VM: VmPowerBudget::limit_mw *= reduction_factor
│ → Re-run admission control check (may trigger VM migration signal)
│
└─► Post PowerCapEvent to userspace monitoring channel
→ sysfs: /sys/bus/platform/drivers/dcmi/power_cap_uw (updated)
→ netlink thermal event (for `thermald` compatibility)
→ KVM statistics update (for Prometheus collector)
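The proportional scale-down in the flow above can be sketched with integer milliwatt math; `rescale_vm_budgets` is an illustrative name, not the real isle-kvm entry point:

```rust
/// When the BMC cap is below the committed VM budget total, scale every
/// VM's limit by cap/total. If the cap does not bind, budgets are unchanged.
fn rescale_vm_budgets(budgets_mw: &mut [u64], bmc_cap_mw: u64) {
    let total: u64 = budgets_mw.iter().sum();
    if total == 0 || bmc_cap_mw >= total {
        return; // cap does not bind
    }
    for b in budgets_mw.iter_mut() {
        // new = old * cap / total, rounded down
        *b = *b * bmc_cap_mw / total;
    }
}
```

After rescaling, the admission check of §14a.6.3 is re-run, which may trigger a VM migration signal.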
14a.7.3 Escalation Hierarchy
Power management operates at three levels, each enforced by a different actor:
| Level | Mechanism | Enforced by | Override possible? |
|---|---|---|---|
| Software power limit | RAPL MSR_PKG_POWER_LIMIT | ISLE kernel | Yes (root can raise within TDP) |
| BMC power cap | DCMI Set Power Limit | BMC firmware | Only by BMC admin |
| Physical current limit | PSU OCP / PDU circuit breaker | Hardware | No |
The kernel controls only the first level. The BMC cap (second level) is
communicated to the kernel via DCMI but ultimately enforced by the BMC's power
management controller, which can throttle the server via SYS_THROT# or force
a hard power-off regardless of OS state. The kernel's DCMI integration is
cooperative, not authoritative.
14a.7.4 DcmiPowerCap Interface
/// Interface for the DCMI power cap enforcement callback.
///
/// Implemented by the IPMI driver. Called when the BMC asserts or modifies
/// a DCMI power limit.
pub trait DcmiPowerCap: Send + Sync {
/// Called when the BMC sets a new platform power cap.
///
/// `cap_mw` is the new cap in milliwatts. `0` indicates the cap has been
/// deactivated (no limit). Implementors must update RAPL limits and notify
/// isle-kvm within this call or schedule it for immediate async processing.
fn on_cap_set(&self, cap_mw: u32);
/// Return the currently active BMC-imposed cap in milliwatts.
///
/// Returns `None` if no cap is currently active.
fn current_cap_mw(&self) -> Option<u32>;
/// Return the last measured platform power reading from the BMC in milliwatts.
///
/// This is the BMC's own measurement, which may differ from RAPL's
/// (BMC measures at the PSU, RAPL measures at the socket).
fn last_reading_mw(&self) -> u32;
}
14a.8 Battery and SMBus Monitoring
SMBus (System Management Bus) is a subset of I2C used for battery/charger chips. Example: Smart Battery System (SBS) batteries expose registers at I2C address 0x0B:
- 0x08: Temperature (in 0.1K units).
- 0x09: Voltage (in mV).
- 0x0A: Current (in mA, signed).
- 0x0D: Relative State of Charge (0-100%).
- 0x0F: Remaining Capacity (in mAh).
The battery driver (Tier 1, probed via ACPI PNP0C0A device) reads these registers periodically (every 5 seconds when on battery, every 60 seconds when on AC) and exposes them via sysfs/islefs (see §14a.9.5 for the userspace interface).
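Decoding the raw register words above is simple fixed-point arithmetic. A minimal sketch (helper names are illustrative; 273.15 K is rounded to 2732 deci-kelvin):

```rust
/// Register 0x08 is unsigned tenths of a kelvin; convert to tenths of a
/// degree Celsius for sysfs reporting.
fn sbs_temp_deci_celsius(raw_deci_kelvin: u16) -> i32 {
    raw_deci_kelvin as i32 - 2732 // 273.15 K at 0.1 K resolution
}

/// Register 0x0A is a signed two's-complement milliamp value:
/// negative = discharging, positive = charging (SBS convention).
fn sbs_current_ma(raw: u16) -> i16 {
    raw as i16
}
```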
14a.9 Consumer Power Profiles
This subsection defines the user-facing policy layer; §14a.1–14a.7 define the underlying mechanisms.
14a.9.1 Power Profile Enumeration
// isle-core/src/power/profile.rs
/// User-facing power profile (consumer policy; translates to §14a mechanisms).
#[repr(u32)]
pub enum PowerProfile {
/// Maximum performance. AC adapter expected.
Performance = 0,
/// Balanced performance and power. Default on AC.
Balanced = 1,
/// Aggressive power saving. Default on battery.
BatterySaver = 2,
/// User-defined constraints loaded from /System/Power/CustomProfile.
Custom = 3,
}
14a.9.2 Profile → Mechanism Translation
Each profile maps to concrete §14a parameters:
| Profile | RAPL PKG limit | CPU turbo | GPU freq cap | WiFi PSM | Display brightness |
|---|---|---|---|---|---|
| Performance | None (HW TDP) | Enabled | 100% | Disabled | 100% |
| Balanced | 80% TDP | Enabled | 80% | PSM | 75% |
| BatterySaver | 50% TDP | Disabled | 40% | Aggressive | 40% |
set_profile() calls RaplInterface::set_power_limit() (§14a.2), the cpufreq
governor, and WirelessDriver::set_power_save() (§12.1). No RAPL MSR writes
happen in consumer-layer code — all hardware access goes through §14a.
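The translation table above can be expressed as data. The struct and field names here are illustrative, not the kernel's actual types:

```rust
/// Illustrative parameter bundle for one row of the profile table.
#[derive(Debug, PartialEq)]
struct ProfileParams {
    rapl_pct_of_tdp: Option<u32>, // None = hardware TDP, no software cap
    turbo: bool,
    gpu_freq_pct: u32,
    brightness_pct: u32,
}

/// Map a profile name to its §14a mechanism parameters.
fn translate(profile: &str) -> ProfileParams {
    match profile {
        "performance" => ProfileParams { rapl_pct_of_tdp: None, turbo: true, gpu_freq_pct: 100, brightness_pct: 100 },
        "balanced" => ProfileParams { rapl_pct_of_tdp: Some(80), turbo: true, gpu_freq_pct: 80, brightness_pct: 75 },
        _ /* battery-saver */ => ProfileParams { rapl_pct_of_tdp: Some(50), turbo: false, gpu_freq_pct: 40, brightness_pct: 40 },
    }
}
```

A real set_profile() would feed these values into RaplInterface, the cpufreq governor, and the wireless driver, as described above.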
14a.9.3 AC/Battery Auto-Switch
The PowerManager listens for ACPI AC adapter events (from the battery driver,
§14a.8) and automatically applies the user's preferred profile for each power
source. On critical battery (≤5%), BatterySaver is forced. A BatteryCritical
event is posted to the §16b event ring so userspace can display a notification.
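The selection rule above can be sketched as a pure function; the enum mirrors §14a.9.1 and the function name is hypothetical:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Profile { Performance, Balanced, BatterySaver }

/// Pick the user's preferred profile for the active power source, but force
/// BatterySaver when discharging at or below the critical threshold (5%).
fn select_profile(on_ac: bool, battery_pct: u8, ac_pref: Profile, batt_pref: Profile) -> Profile {
    if !on_ac && battery_pct <= 5 {
        return Profile::BatterySaver; // critical battery always wins
    }
    if on_ac { ac_pref } else { batt_pref }
}
```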
14a.9.4 Per-Process Power Attribution
Per-process energy attribution (for desktop power managers like GNOME Settings,
KDE Powerdevil) is provided by the §14a.5 cgroup power accounting. Per-process
granularity: each process lives in a cgroup; power.energy_uj on that cgroup
gives its energy consumption. Exposed via /proc/<pid>/power_consumed_uj in
isle-compat procfs.
14a.9.5 Userspace Interface
Kernel exposes power management state via the following paths:
| Path | Description |
|---|---|
| /sys/kernel/isle/power/profile | Read/write power profile selection (performance, balanced, battery-saver). Writable by processes with CAP_SYS_ADMIN. |
| /proc/<pid>/power_consumed_uj | Per-process energy consumption in microjoules (RAPL cgroup attribution, §14a.5.3). |
| /sys/class/power_supply/BAT0/capacity | Battery charge percentage (0–100). |
| /sys/class/power_supply/BAT0/energy_now | Remaining energy in µWh. |
| /sys/class/power_supply/BAT0/current_now | Discharge/charge current in µA. |
| /sys/class/power_supply/BAT0/cycle_count | Charge cycle count. |
| /sys/class/power_supply/BAT0/status | Charging, Discharging, Full, Unknown. |
Time-remaining estimation, low-battery notifications, and battery health display are handled by userspace daemons (UPower or equivalent) reading these paths.
14a.10 Suspend and Resume Protocol
Context: the S4Hibernate variant in SleepState (§14a.10.1) covers S4 hibernate. This section specifies S3 (Suspend-to-RAM) and S0ix (Modern Standby), the primary suspend mechanisms on consumer laptops.
14a.10.1 Sleep State Enumeration
// isle-core/src/power/suspend.rs
/// ACPI sleep state.
#[repr(u32)]
pub enum SleepState {
/// S3: Suspend-to-RAM. CPU powered off, DRAM refreshing, ~2-5W, wake in <2s.
S3SuspendToRam = 3,
/// S0ix: Modern Standby (S0 Low Power Idle). CPU in deep C-states, OS "running",
/// network alive, <1W, instant wake. Intel 6th gen+, AMD Ryzen 3000+, ARM.
S0ixModernStandby = 0x0F, // Not a standard ACPI state, vendor-specific
/// S4: Hibernate (already specified in §9).
S4Hibernate = 4,
}
/// Power state machine states.
#[repr(u32)]
pub enum SuspendPhase {
/// System running normally.
Running = 0,
/// Pre-suspend: freeze userspace, sync filesystems.
PreSuspend = 1,
/// Device suspend: call driver suspend callbacks.
DeviceSuspend = 2,
/// CPU suspend: save CPU state, enter ACPI sleep state.
CpuSuspend = 3,
/// (system asleep, this state is never observed by running code)
Asleep = 4,
/// CPU resume: restore CPU state.
CpuResume = 5,
/// Device resume: call driver resume callbacks.
DeviceResume = 6,
/// Post-resume: thaw userspace.
PostResume = 7,
}
14a.10.2 Power State Machine
impl SuspendManager {
/// Initiate suspend to a given sleep state.
pub fn suspend(&self, state: SleepState) -> Result<(), SuspendError> {
// Phase 1: PreSuspend
self.set_phase(SuspendPhase::PreSuspend);
self.freeze_userspace()?; // Stop all userspace tasks
self.sync_filesystems()?; // Flush all dirty pages, journal commits
// Phase 2: DeviceSuspend
self.set_phase(SuspendPhase::DeviceSuspend);
self.suspend_devices(state)?; // Call driver suspend callbacks in reverse probe order
// Phase 3: CpuSuspend
self.set_phase(SuspendPhase::CpuSuspend);
self.save_cpu_state()?; // Save registers, page tables, GDT, IDT
self.enter_acpi_sleep_state(state)?; // Write to ACPI PM1a_CNT, CPU halts
// ... (system asleep, wake event occurs) ...
// Phase 4: CpuResume (code resumes here after wake)
self.set_phase(SuspendPhase::CpuResume);
self.restore_cpu_state()?; // Restore registers, reload CR3, GDTR, IDTR
// Phase 5: DeviceResume
self.set_phase(SuspendPhase::DeviceResume);
self.resume_devices(state)?; // Call driver resume callbacks in probe order
// Phase 6: PostResume
self.set_phase(SuspendPhase::PostResume);
self.thaw_userspace()?; // Unfreeze userspace tasks
self.set_phase(SuspendPhase::Running);
Ok(())
}
}
14a.10.3 Driver Suspend/Resume Callbacks
Every driver (Tier 1 and Tier 2) must implement suspend/resume:
// isle-driver-sdk/src/suspend.rs
/// Driver suspend/resume trait.
pub trait SuspendResume {
/// Suspend the device to the given sleep state.
///
/// # Contract
/// - Flush all pending I/O to the device.
/// - Disable interrupts (deregister interrupt handler or mask at device level).
/// - Power down the device (write to PCI PM registers, or device-specific power control).
/// - Save any device state that cannot be reconstructed (e.g., firmware upload not repeatable).
///
/// # Timeout
/// If suspend does not complete within 500ms, the kernel may force-kill the driver
/// (Tier 2) or mark it as failed (Tier 1).
fn suspend(&self, target: SleepState) -> Result<(), SuspendError>;
/// Resume the device from the given sleep state.
///
/// # Contract
/// - Restore device state (e.g., re-upload firmware, reconfigure registers).
/// - Re-enable interrupts.
/// - Re-establish any connections (WiFi: reconnect to AP, NVMe: reinitialize controller).
///
/// # Failure handling
/// If resume fails, return Err. The kernel will attempt recovery (§14a.10.6).
fn resume(&self, from: SleepState) -> Result<(), SuspendError>;
}
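A minimal illustration of the contract above with a mock device; the trait and types are re-declared locally (and slightly simplified) so the sketch is self-contained:

```rust
use std::cell::Cell;

#[derive(Clone, Copy, Debug, PartialEq)]
enum SleepState { S3SuspendToRam }

#[derive(Debug)]
struct SuspendError;

trait SuspendResume {
    fn suspend(&self, target: SleepState) -> Result<(), SuspendError>;
    fn resume(&self, from: SleepState) -> Result<(), SuspendError>;
}

/// A toy NIC that records whether its "firmware" is loaded and its IRQ armed.
struct MockNic { irq_enabled: Cell<bool>, fw_loaded: Cell<bool> }

impl SuspendResume for MockNic {
    fn suspend(&self, _target: SleepState) -> Result<(), SuspendError> {
        self.irq_enabled.set(false); // mask interrupts before powering down
        self.fw_loaded.set(false);   // device RAM contents are lost
        Ok(())
    }
    fn resume(&self, _from: SleepState) -> Result<(), SuspendError> {
        self.fw_loaded.set(true);    // re-upload firmware first
        self.irq_enabled.set(true);  // re-arm interrupts last
        Ok(())
    }
}
```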
14a.10.4 Device Suspend Ordering
Devices must suspend in reverse dependency order and resume in dependency order:
Suspend order (leaves first, roots last):
1. Display controller (depends on GPU for framebuffer)
2. GPU (no dependencies)
3. Filesystem (depends on NVMe)
4. NVMe (no dependencies)
5. Network stack (depends on NIC)
6. Network interface (WiFi, Ethernet)
Resume order (roots first, leaves last):
1. GPU
2. Display controller
3. NVMe
4. Filesystem
5. Network interface
6. Network stack
The device registry (§7) tracks dependencies. Before suspend, the registry computes a topological sort of the device tree and calls suspend callbacks in that order.
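The registry's ordering computation can be sketched as a Kahn topological sort over "depends on" edges; the function name and edge representation are illustrative, not the registry API:

```rust
use std::collections::{BTreeMap, VecDeque};

/// Compute resume order (providers before dependents) from
/// (dependent, provider) edges, e.g. ("filesystem", "nvme").
/// Suspend order is this list reversed.
fn resume_order(
    deps: &[(&'static str, &'static str)],
    devices: &[&'static str],
) -> Vec<&'static str> {
    let mut indegree: BTreeMap<&str, usize> = devices.iter().map(|d| (*d, 0)).collect();
    let mut consumers: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for &(dependent, provider) in deps {
        *indegree.get_mut(dependent).unwrap() += 1;
        consumers.entry(provider).or_default().push(dependent);
    }
    // Start from devices with no providers (roots), release dependents as
    // their providers are ordered.
    let mut ready: VecDeque<&str> =
        devices.iter().copied().filter(|d| indegree[*d] == 0).collect();
    let mut order = Vec::new();
    while let Some(d) = ready.pop_front() {
        order.push(d);
        if let Some(cs) = consumers.get(d) {
            for &c in cs {
                let n = indegree.get_mut(c).unwrap();
                *n -= 1;
                if *n == 0 { ready.push_back(c); }
            }
        }
    }
    order
}
```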
14a.10.5 Tier 2 Driver Suspend
Tier 2 drivers are separate processes. Suspending them requires IPC:
impl SuspendManager {
/// Suspend a Tier 2 driver (send message via ring buffer).
fn suspend_tier2_driver(&self, driver_pid: Pid, state: SleepState) -> Result<(), SuspendError> {
// Send DRIVER_SUSPEND message to the driver's control ring (§8a.2).
let msg = DriverControlMessage::Suspend { state };
self.control_ring(driver_pid).push(msg)?;
// Wait for response (DRIVER_SUSPEND_ACK) with 500ms timeout.
match self.wait_for_response(driver_pid, Duration::from_millis(500)) {
Ok(DriverControlMessage::SuspendAck) => Ok(()),
Ok(other) => Err(SuspendError::UnexpectedReply(other)),
Err(TimeoutError) => {
// Driver did not respond: force terminate.
process::kill(driver_pid, Signal::SIGKILL)?;
// Mark device as unavailable until resume attempts to restart the driver.
self.device_registry.mark_unavailable(driver_pid)?;
Ok(()) // Continue suspend; the device is orphaned but the system suspends
}
Err(e) => Err(SuspendError::DriverFailed(e)),
}
}
}
14a.10.6 Resume Failure Recovery
If a driver fails to resume:
Tier 1 driver failure:
1. Attempt Function Level Reset (FLR) via PCI config space (PCI_EXP_DEVCTL_BCR_FLR).
2. If FLR succeeds, reload the driver module (call probe() again).
3. If FLR fails or reload fails, mark the device unavailable and continue the resume. Log the error to the console and /var/log/kernel.log.
Tier 2 driver failure:
1. Kill the driver process.
2. Restart the driver process (spawn a new process, re-initialize ring buffers).
3. If restart succeeds, the device resumes normal operation (~10-50ms total recovery).
4. If restart fails 3 times, mark the device unavailable and continue the resume.
Critical device failure (NVMe root filesystem, or the display controller of the only display):
- If the root NVMe fails to resume, resume fails and the system must reboot (no recovery is possible without the root filesystem).
- If the display fails to resume, the system continues to run but displays a VT panic message (via the Tier 0 VGA fallback).
14a.10.7 S0ix Modern Standby
S0ix is not a true suspend state — the OS remains "running", but the CPU enters deep C-states (C10 on Intel, CC6 on AMD) where cores are powered off but the SoC stays alive.
Differences from S3:
- CPU does not power off: the scheduler still runs and interrupts still fire, but all tasks are idle (blocked on I/O or sleeping).
- WiFi stays live: the driver keeps the radio in D3hot (low power but still connected to the AP), wakes on packet.
- Display off: the panel enters DPMS Off (§62), backlight off, but the display controller stays powered.
- Devices enter D3hot, not D3cold: low power with fast wake, instead of fully unpowered.
Enter S0ix:
impl SuspendManager {
pub fn enter_s0ix(&self) -> Result<(), SuspendError> {
// 1. Notify all drivers to enter D3hot (not D3cold).
for driver in &self.drivers {
driver.set_power_state(PciPowerState::D3hot)?;
}
// 2. Set CPU P-state to minimum frequency.
self.cpu_freq_governor.set_min_freq()?;
// 3. Program CPU package C-state limit to C10 (deepest).
self.cpu_cstate_governor.set_max_cstate(CState::C10)?;
// 4. Idle all CPUs (all tasks blocked or sleeping).
// Scheduler tick timer set to 1 Hz (extremely long idle periods).
self.scheduler.set_idle_mode(true)?;
// System now in S0ix. CPUs enter C10, wake on interrupt (timer, GPIO, PCIe PME).
Ok(())
}
}
Exit S0ix: Any interrupt (lid open, network packet, RTC alarm, USB device activity) wakes the CPU from C10 back to C0, resume is instant (~1-5ms).
14a.11 Integration Points
The following table maps each §14a mechanism to its consumers in other sections:
| Mechanism | Defined in | Consumed by |
|---|---|---|
| RaplInterface::set_power_limit() | §14a.2.5 | §14a.9 consumer power profiles, §14a.6 VM power budgets, §14a.7 DCMI enforcement, thermal passive cooling (§14a.3.3 RaplCooler) |
| PowerDomainRegistry | §14a.2.5 | powercap sysfs (§14a.4), cgroup power accounting (§14a.5), VM admission control (§14a.6.3) |
| ThermalZone trip points | §14a.3.1–3.2 | Scheduler passive cooling (§14): Passive trips reduce cpufreq max; hwmon fan control (§11.2.1): Active trips actuate FanCooler |
| TripType::Critical handler | §14a.3.2 | kernel_power_off() — no other dependencies |
| cgroup power.energy_uj | §14a.5.3 | Billing/monitoring userspace agents, isle-kvm per-VM accounting (§14a.6.2), §14a.9.4 per-process attribution |
| cgroup power.limit_uw | §14a.5.3–5.4 | isle-kvm VmPowerBudget (sets this file on VM creation), §14a.9 power profiles (sets this on cgroup creation) |
| VmPowerBudget struct | §14a.6.2 | isle-kvm/src/power.rs; interacts with §15 CpuBandwidth::reduce_quota() |
| DcmiPowerCap::on_cap_set() | §14a.7.4 | BMC-driven cap propagation to RAPL (§14a.2) and isle-kvm (§14a.6) |
| SMBus battery registers | §14a.8 | Battery driver, sysfs/islefs (§14a.9.5), §14a.9.3 AC/battery auto-switch |
| PowerProfile enum | §14a.9.1 | §14a.9.2 profile translation, userspace power managers |
| SuspendManager::suspend() | §14a.10.2 | System suspend/resume, §14a.10.5 Tier 2 driver IPC |
| SuspendResume trait | §14a.10.3 | Tier 1/Tier 2 driver implementations |
15. CPU Bandwidth Guarantees
Inspired by: QNX Adaptive Partitioning (concept only). IP status: Built from academic scheduling theory (CBS/EDF, 1998) and Linux cgroup v2 interface. QNX-specific implementation NOT referenced. Term "adaptive partitioning" NOT used.
15.1 Problem
Section 14 defines three scheduler classes: EEVDF (normal), RT
(FIFO/RR), and Deadline (EDF/CBS). Cgroups v2 provides resource limits (cpu.max caps
the ceiling, cpu.weight sets relative priority).
What is missing: guaranteed minimum CPU bandwidth under overload. Current mechanisms:
- cpu.weight is proportional sharing — if one cgroup has weight 100 and another has weight 900, the first gets 10% of whatever is available. But if the system is fully loaded, "10% of available" might not meet the minimum requirement.
- cpu.max is a ceiling, not a floor. It limits the maximum; it does not guarantee a minimum.
- The Deadline scheduler provides guarantees, but only for individual tasks, not for groups.
Use case: A server runs a database (needs guaranteed 40% CPU), a web frontend (needs guaranteed 20%), and batch jobs (uses the rest). Under overload, the batch jobs must not be able to starve the database below 40%, even if the batch jobs are numerous.
15.2 Design: CBS-Based Group Bandwidth Reservation
The solution combines the existing Deadline scheduler's Constant Bandwidth Server (CBS) algorithm with cgroup v2's group hierarchy.
CBS (Abeni & Buttazzo, 1998) provides bandwidth isolation: each server (task or group) is assigned a bandwidth Q/P (Q microseconds of CPU time every P microseconds). The server is guaranteed this bandwidth regardless of other servers' behavior. Unused bandwidth is redistributed (work-conserving).
This is a well-established academic algorithm with no IP encumbrance.
15.3 Cgroup v2 Interface
New control file in the cpu controller, additive to the existing interface:
/sys/fs/cgroup/<group>/cpu.guarantee
Format: $QUOTA $PERIOD (microseconds), identical to cpu.max format.
Example:
# Database cgroup: guaranteed 40% CPU bandwidth
echo "400000 1000000" > /sys/fs/cgroup/database/cpu.guarantee
# = 400ms of CPU time every 1000ms = 40% guaranteed
# Web frontend: guaranteed 20%
echo "200000 1000000" > /sys/fs/cgroup/web/cpu.guarantee
# Batch jobs: no guarantee (uses whatever is left)
# cpu.guarantee defaults to "max" (no guarantee)
# Total guaranteed: 60%. Remaining 40% is shared by weight among all.
Semantics:
- A group with cpu.guarantee set is backed by a CBS server at the specified bandwidth.
- The CBS server ensures the group receives at least its guaranteed bandwidth even under full system load.
- When the group is idle, its unused bandwidth is redistributed to other groups (work-conserving). This is inherent to CBS — no special logic needed.
- cpu.guarantee cannot exceed cpu.max (if both are set, guarantee <= max).
- The sum of all cpu.guarantee values across the system must not exceed total CPU bandwidth. Attempting to overcommit returns -ENOSPC.
- Nested cgroups: a child's guarantee comes out of its parent's guarantee budget.
Multi-core accounting: cpu.guarantee specifies system-wide bandwidth, not
per-CPU. The implementation uses a global budget pool with per-CPU runtime slices (same
pattern as Linux EEVDF bandwidth throttling for cpu.max). A CBS server with a 40%
guarantee on a 4-CPU system gets a global budget of 400ms per 1000ms period. This budget
is drawn down as tasks in the group run on any CPU. When exhausted, all tasks in the
group are throttled until the next period. Per-NUMA-node variants are tracked in
BandwidthAccounting.per_node_guaranteed for NUMA-aware scheduling hints, but the
guarantee is enforced globally.
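The global-pool/per-CPU-slice pattern described above can be sketched with a single atomic. The 5 ms slice size and all names are illustrative assumptions, not the kernel's actual values:

```rust
use std::sync::atomic::{AtomicI64, Ordering};

const SLICE_US: i64 = 5_000; // per-CPU grab size (assumed)

/// A group's shared budget for one period; each CPU draws small runtime
/// slices from it, and the group throttles once the pool is exhausted.
struct GlobalBudget { remaining_us: AtomicI64 }

impl GlobalBudget {
    fn new(quota_us: i64) -> Self {
        GlobalBudget { remaining_us: AtomicI64::new(quota_us) }
    }
    /// Some(n) grants n microseconds of runtime to the calling CPU;
    /// None means the group is throttled until the next replenishment.
    fn take_slice(&self) -> Option<i64> {
        let prev = self.remaining_us.fetch_sub(SLICE_US, Ordering::AcqRel);
        if prev <= 0 {
            // Pool was already empty: undo and report throttled.
            self.remaining_us.fetch_add(SLICE_US, Ordering::AcqRel);
            return None;
        }
        let granted = prev.min(SLICE_US);
        if granted < SLICE_US {
            // Final partial slice: refund the over-subtraction.
            self.remaining_us.fetch_add(SLICE_US - granted, Ordering::AcqRel);
        }
        Some(granted)
    }
}
```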
RT and DL task handling: RT and DL tasks within a CBS-guaranteed cgroup bypass the
CBS server's budget accounting. Their guarantees come from the RT/DL schedulers directly.
The CBS guarantee applies only to EEVDF-class (normal) tasks within the group. This matches Linux's
behavior where cpu.max throttling does not apply to RT tasks.
15.4 Kernel-Internal Design
// isle-core/src/sched/cbs_group.rs (kernel-internal)
/// A CBS bandwidth server attached to a cgroup.
pub struct CbsGroupServer {
/// Bandwidth: quota microseconds per period microseconds.
pub quota_us: u64,
pub period_us: u64,
/// Current budget remaining in this period. Signed because CBS
/// allows transient overspend: if a task's time slice crosses the
/// budget boundary (e.g., budget was 50μs remaining but the tick
/// granularity is 1ms), the budget goes negative. When negative:
/// 1. The server is immediately throttled (`throttled` set to true).
/// 2. All tasks in this server's runqueue are dequeued from the
/// CPU's run queue and parked until the next period.
/// 3. The deficit carries forward: at period replenishment,
/// `budget_remaining_us = quota_us + budget_remaining_us`
/// (adding a negative value reduces the next period's budget).
/// 4. The deadline is pushed back by one period regardless of
/// deficit magnitude — CBS guarantees temporal isolation by
/// limiting each server to its declared bandwidth over any
/// sliding window.
/// Admission control (Section 15.3) ensures the sum of all servers'
/// quota/period ratios does not exceed 1.0 per CPU, preventing
/// starvation even when individual servers overshoot within a period.
///
/// **Deficit cap**: To prevent a buggy or malicious task from accumulating
/// an unbounded deficit that would starve it for many periods, the budget
/// is clamped to a minimum of `-quota_us` (one full period's worth of deficit).
/// At replenishment, if `budget_remaining_us < -quota_us`, it is set to
/// `-quota_us` before adding `quota_us`, ensuring the server always receives
/// at least `max(0, quota_us - |deficit|)` budget. This bounds the recovery
/// time to at most one period regardless of how negative the deficit became.
/// (Implementation note: since `quota_us` is `u64`, the comparison is
/// `budget_remaining_us < -(quota_us as i64)` — the cast is safe because
/// `quota_us` never exceeds `i64::MAX` in practice; a 1-second period uses
/// 1_000_000, well within range.)
pub budget_remaining_us: AtomicI64,
/// Absolute deadline of current server period.
pub current_deadline_ns: u64,
/// Whether this server is currently throttled (budget exhausted).
pub throttled: AtomicBool,
/// Run queue of EEVDF-class (normal) tasks belonging to this cgroup.
/// RT/DL tasks bypass this server and are scheduled by their
/// respective schedulers directly (see Section 15.3).
pub runqueue: EevdfRunQueue, // Reuses existing EEVDF tree
/// Total CPU time consumed by this server (for accounting).
pub total_runtime_ns: AtomicU64,
}
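The replenishment rule in the comment above reduces to a small piece of arithmetic. This sketch also assumes, beyond what the comment states, that unused surplus is not carried into the next period (the standard CBS behavior):

```rust
/// Period replenishment with the deficit cap: the carried deficit is clamped
/// to at most one period's quota before the new quota is added, so recovery
/// takes at most one period; positive leftover is not carried.
fn replenish(budget_remaining_us: i64, quota_us: u64) -> i64 {
    let q = quota_us as i64;
    let carried = budget_remaining_us.clamp(-q, 0);
    carried + q
}
```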
Integration with existing scheduler (Section 14):
Per-CPU run queue structure (updated):
+------------------+
| RT Queue | <- Highest priority (unchanged)
+------------------+
| DL Queue | <- Deadline tasks (unchanged)
+------------------+
| CBS Group Servers| <- NEW: CBS servers for guaranteed groups
| +-- db_server | Each server has its own EEVDF tree inside
| +-- web_server |
+------------------+
| EEVDF Tree | <- Normal tasks without guarantee (unchanged)
+------------------+
Scheduling decision:
1. Check the RT queue (highest priority) — unchanged.
2. Check the DL queue (deadline tasks) — unchanged.
3. Check CBS group servers (ordered by earliest deadline): if a server has budget and runnable tasks, pick its next task. CBS guarantees each server receives its bandwidth.
4. Check the EEVDF tree (normal tasks without guarantee) — unchanged.
Unguaranteed tasks (step 4) run when all CBS servers are idle or throttled. In addition, CBS servers that under-utilize their budget donate the slack back to step 4 (work-conserving).
15.5 Overcommit Prevention
// isle-core/src/sched/cbs_group.rs
/// System-wide guarantee accounting.
pub struct BandwidthAccounting {
/// Total guaranteed bandwidth across all CBS servers.
/// Stored as fraction * 1_000_000 (e.g., 400000 = 40%).
pub total_guaranteed: AtomicU64,
/// Maximum allowable guarantee (default: 95%).
/// Reserves 5% for kernel threads, interrupts, housekeeping.
pub max_guarantee: u64,
/// Per-NUMA-node guaranteed bandwidth (for NUMA-aware scheduling).
/// Dynamically sized at boot based on discovered NUMA node count,
/// following the same pattern as `NumaTopology` (Section 12.7).
pub per_node_guaranteed: &'static [AtomicU64],
}
Setting cpu.guarantee fails with -ENOSPC if total_guaranteed + new_guarantee would
exceed max_guarantee. This prevents overcommit and guarantees all promises are
simultaneously satisfiable.
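The -ENOSPC admission test can be sketched as a compare-and-swap loop over total_guaranteed, so that concurrent writers to cpu.guarantee cannot jointly overcommit. Units are parts-per-million, matching BandwidthAccounting; the function name is illustrative:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Atomically reserve `new_ppm` of guaranteed bandwidth, failing if the
/// system-wide total would exceed `max_ppm` (default 950_000 = 95%).
fn reserve_guarantee(total: &AtomicU64, new_ppm: u64, max_ppm: u64) -> Result<(), &'static str> {
    let mut cur = total.load(Ordering::Relaxed);
    loop {
        if cur + new_ppm > max_ppm {
            return Err("ENOSPC");
        }
        match total.compare_exchange_weak(cur, cur + new_ppm, Ordering::AcqRel, Ordering::Relaxed) {
            Ok(_) => return Ok(()),
            Err(observed) => cur = observed, // another writer raced us; retry
        }
    }
}
```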
15.6 Interaction with Existing Controls
| Control | Meaning | Interaction with cpu.guarantee |
|---|---|---|
| cpu.weight | Relative share of excess CPU | Distributes CPU beyond guaranteed minimums |
| cpu.max | Maximum CPU ceiling | Guarantee cannot exceed max; max still enforced |
| cpu.guarantee | Minimum CPU floor | NEW: CBS-backed guaranteed bandwidth |
| cpu.pressure | PSI pressure info | Reports pressure relative to guarantee |
Example: a cgroup with cpu.guarantee=40%, cpu.max=60%, cpu.weight=100:
- Always gets at least 40% CPU (even under full system load)
- Never gets more than 60% CPU (even if system is idle — ceiling applies)
- Between 40% and 60%, shares proportionally with other cgroups by weight
15.7 Use Case: Driver Tier Isolation
CPU guarantees integrate naturally with the driver tier model:
# Ensure Tier 1 drivers always have CPU bandwidth for I/O processing
echo "200000 1000000" > /sys/fs/cgroup/isle-tier1/cpu.guarantee # 20%
# Ensure Tier 2 drivers have some guaranteed bandwidth
echo "50000 1000000" > /sys/fs/cgroup/isle-tier2/cpu.guarantee # 5%
A misbehaving Tier 2 driver process spinning in a loop cannot starve Tier 1 NVMe or NIC drivers of CPU time.
16. Real-Time Guarantees
16.1 Beyond CBS
Section 15 provides CPU bandwidth guarantees via CBS (Constant Bandwidth Server). This ensures average bandwidth. Real-time workloads need worst-case latency bounds: interrupt-to-response always under a specific ceiling.
16.2 Design: Bounded Latency Paths
// isle-core/src/rt/mod.rs
/// Real-time configuration (system-wide, set at boot or runtime).
pub struct RtConfig {
/// Maximum interrupt latency guarantee (nanoseconds).
/// The kernel guarantees that ISR entry occurs within this bound
/// after the interrupt fires.
/// Default: 50_000 (50 μs). Achievable on x86 with careful design.
pub max_irq_latency_ns: u64,
/// Maximum scheduling latency for SCHED_DEADLINE tasks (nanoseconds).
/// The kernel guarantees that a runnable DEADLINE task is scheduled
/// within this bound.
/// Default: 100_000 (100 μs).
pub max_sched_latency_ns: u64,
/// Preemption model.
pub preemption: PreemptionModel,
}
#[repr(u32)]
pub enum PreemptionModel {
/// Voluntary preemption (default). Preemption at explicit preempt points.
/// Lowest overhead, highest latency variance.
Voluntary = 0,
/// Full preemption. Preemptible everywhere except hard critical sections.
/// Moderate overhead, good latency bounds.
/// Equivalent to Linux PREEMPT (non-RT).
Full = 1,
/// RT preemption. All `spinlock_t` and `rwlock_t` instances become sleeping locks
/// (mapped to `rt_mutex`). `raw_spinlock_t` remains a true spinning lock with
/// interrupts disabled, used for scheduler internals, interrupt handling, and
/// hardware access paths that must not sleep.
/// Interrupts are threaded. Maximum preemptibility.
/// Equivalent to Linux PREEMPT_RT.
/// Highest overhead (~2-5% throughput), tightest latency bounds.
Realtime = 2,
}
16.3 Key Design Decisions for RT
1. Threaded interrupts (when PreemptionModel::Realtime):
All hardware interrupts are handled by kernel threads.
Threads are schedulable — RT tasks can preempt interrupt handlers.
Cost: ~1 μs additional interrupt latency (thread switch).
Linux PREEMPT_RT does the same.
2. Priority inheritance for all locks:
When a low-priority task holds a lock needed by a high-priority task,
the low-priority task inherits the high-priority task's priority.
Prevents priority inversion (classic RT problem).
Cost: ~5-10 cycles per lock acquire (check/update priority).
Linux PREEMPT_RT does the same.
3. No unbounded loops in kernel paths:
Every loop has a bounded iteration count.
Memory allocation in RT context: from pre-allocated pools (no reclaim).
Page fault in RT context: fails immediately (no I/O wait).
Enforced by coding guidelines + Verus verification (§50).
4. Deadline admission control (Section 14.4):
SCHED_DEADLINE tasks declare (runtime, period, deadline).
Kernel admits the task ONLY if it can guarantee the deadline.
If admission would violate existing guarantees: returns -EBUSY.
Same semantics as Linux SCHED_DEADLINE.
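Decision 4 can be sketched as a utilization test in parts-per-million. This is a simplified necessary condition (Σ runtime/period ≤ CPU count), not the kernel's full schedulability analysis; the names are illustrative:

```rust
/// Admit a new (runtime_us, period_us) deadline reservation only if total
/// utilization stays at or below the CPU count. Integer ppm math keeps the
/// sketch exact at the boundary.
fn admit_deadline(tasks: &[(u64, u64)], new: (u64, u64), num_cpus: u64) -> Result<(), &'static str> {
    let util_ppm: u64 = tasks.iter().chain(std::iter::once(&new))
        .map(|&(runtime_us, period_us)| runtime_us * 1_000_000 / period_us)
        .sum();
    if util_ppm <= num_cpus * 1_000_000 { Ok(()) } else { Err("EBUSY") }
}
```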
16.3.1 RT + Domain Isolation Interaction
The raw WRPKRU instruction takes ~23 cycles on modern Intel microarchitectures (~6ns at 4 GHz). On KABI call boundaries, the domain switch is unconditional (the caller always needs to switch to the callee's domain), so the switch cost is the raw WRPKRU cost: ~23 cycles. The performance budget (Section 4) uses this figure: 4 switches × ~23 cycles = ~92 cycles per I/O round-trip.
RT jitter analysis: In a tight RT control loop making KABI calls at 10kHz, each call requires a round-trip domain switch (out and back = ~46 cycles = ~12ns), accumulating to ~120μs/sec of jitter (10,000 × 12ns). See I/O path analysis below.
RT latency policy:
Tier 0 drivers: run in Core isolation domain. Zero transition cost.
RT-critical paths (interrupt handlers, timer callbacks) use Tier 0.
Tier 1 drivers: one-way domain switch adds ~6ns per WRPKRU
(~23 cycles at 4 GHz). Round-trip (out+back) = ~12ns.
Acceptable for soft-RT (audio, video). Not for hard-RT (<10μs).
RT tasks requiring <10μs determinism should only use Tier 0 paths.
Tier 2 drivers: user-space. Context switch cost (~1μs). Not for RT.
Why domain isolation does not cause priority inversion: unlike mutex-based isolation, domain switching is a single unprivileged instruction (WRPKRU) that executes in constant time with no blocking, no lock acquisition, and no kernel involvement. There is no lock to hold, so a high-priority RT task switching domains can never be made to wait on a lower-priority task. This is fundamentally different from process-based isolation (Tier 2), where IPC involves a context switch that can be delayed by scheduling.
Shared ring buffers in RT paths: When an RT task communicates with a Tier 1 driver via a shared ring buffer, the ring buffer memory is tagged with the shared PKEY (readable/writable by both core and driver domains). Accessing the ring buffer does not require a domain switch — the shared PKEY is always accessible. Only direct access to driver-private memory requires WRPKRU. Therefore, the typical RT I/O path is:
RT task → write command to shared ring buffer (no WRPKRU)
→ doorbell write to MMIO (requires WRPKRU → driver domain → WRPKRU back: ~12ns)
→ poll completion from shared ring buffer (no WRPKRU)
Total domain switch overhead per I/O op: ~12ns (one domain round-trip: two
WRPKRU instructions at ~23 cycles each = ~46 cycles = ~12ns at 4 GHz).
At 10kHz: 120μs/sec. At 1kHz: 12μs/sec. Negligible.
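The figures above reduce to two conversions, shown here as checkable arithmetic (the document's "~12ns" and "~120μs/sec" round the exact 11.5ns and 115μs values):

```rust
/// A domain round-trip is two WRPKRU instructions; cycles convert to
/// nanoseconds at the core clock. 2 × 23 cycles at 4 GHz = 11.5 ns.
fn wrpkru_roundtrip_ns(cycles_per_wrpkru: f64, clock_ghz: f64) -> f64 {
    2.0 * cycles_per_wrpkru / clock_ghz
}

/// Accumulated jitter per second is the round-trip latency times the
/// KABI call rate. At 10 kHz with 11.5 ns per call: 115 µs/s.
fn jitter_ns_per_sec(roundtrip_ns: f64, calls_per_sec: f64) -> f64 {
    roundtrip_ns * calls_per_sec
}
```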
Preemption during domain switch: WRPKRU is a single instruction that cannot be preempted mid-execution. If a timer interrupt arrives between two WRPKRU instructions (e.g., switch to driver domain, then switch back), the interrupt handler saves and restores PKRU as part of the register context. The RT task resumes with its PKRU intact. No special handling is needed — this is the same as any register save/restore on interrupt.
16.3.2 CPU Isolation for Hard RT
Standard Linux RT practice — fully supported:
isolcpus=2-3 Reserve CPUs 2-3: no normal tasks, no load balancing.
nohz_full=2-3 Tickless on CPUs 2-3: no timer interrupts when idle
or running a single RT task.
rcu_nocbs=2-3 RCU callbacks offloaded from CPUs 2-3: no RCU
processing on isolated CPUs.
Isolated CPUs have: no timer ticks, no RCU callbacks, no workqueues, no
kernel threads (except pinned ones). This is required for hard-RT workloads
(LinuxCNC, IEEE 1588 PTP, audio with <1ms latency). Cross-reference:
Section 14.5.11 lists isolcpus and nohz_full as supported.
16.3.3 Driver Crash During RT-Critical Path
If a Tier 1 driver crashes while an RT task depends on it:
Policy: immediate error notification, NOT wait for recovery.
1. Domain fault detected → crash recovery starts (Section 9).
2. RT task blocked on the driver gets IMMEDIATE unblock with error:
- Pending I/O returns -EIO.
- Pending KABI calls return CapError::DriverCrashed.
- Signal SIGBUS delivered if task is in a blocking syscall.
3. RT task handles the error (application-specific failsafe mode).
4. Driver recovery (~100ms) happens in background.
5. RT task can resume normal operation after driver reloads.
Rationale: RT guarantees are more important than waiting for recovery.
An RT task must ALWAYS get a response within its deadline, even if that
response is an error. Blocking an RT task for 100ms violates the RT contract.
16.4 Linux Compatibility
Real-time interfaces are standard Linux:
SCHED_FIFO, SCHED_RR: sched_setscheduler() — supported
SCHED_DEADLINE: sched_setattr() — supported
/proc/sys/kernel/sched_rt_*: RT scheduler tunables — supported
/sys/kernel/realtime: "1" when PREEMPT_RT is active — supported
clock_nanosleep(TIMER_ABSTIME): deterministic wakeup — supported
mlockall(MCL_CURRENT|MCL_FUTURE): prevent page faults — supported
Existing RT applications (JACK audio, ROS2, LinuxCNC, PTP/IEEE 1588) work without modification.
NUMA-Aware RT Memory:
Hard real-time tasks must avoid remote NUMA access (unpredictable latency). Standard Linux practice applies:
numactl --membind=0 --cpunodebind=0 ./rt_application
The kernel enforces: when a process has SCHED_DEADLINE or SCHED_FIFO priority AND
is bound to a NUMA node via set_mempolicy(MPOL_BIND), the memory allocator does NOT
fall back to remote nodes on allocation failure — it returns ENOMEM instead. This
prevents unpredictable remote-access latency spikes.
Additionally, NUMA balancing (automatic page migration based on access patterns) is disabled for RT-priority tasks. Automatic page migration adds unpredictable latency (~50-200μs per migrated page). RT tasks pin their memory explicitly.
16.5 Performance Impact
When PreemptionModel::Voluntary (default): zero overhead vs Linux. Same model.
When PreemptionModel::Full: ~1% throughput reduction. Same as Linux PREEMPT.
When PreemptionModel::Realtime: ~2-5% throughput reduction. Same as Linux
PREEMPT_RT. This is the unavoidable cost of deterministic scheduling — the
same cost any RT OS pays.
The preemption model is configurable at boot. Debian servers use Voluntary
(default). Embedded/RT deployments use Realtime.
16.6 Hardware Resource Determinism
Software scheduling (EEVDF, CBS, threaded IRQs, priority inheritance) guarantees CPU execution time. However, on modern multi-core SoCs, shared hardware resources — L3 caches, memory controllers, interconnect bandwidth — introduce unpredictable latency spikes that violate hard real-time deadlines regardless of CPU priority.
A high-priority RT task running on an isolated CPU can still miss its deadline if a background batch job on another core evicts the RT task's data from the shared L3 cache, or saturates the memory controller with streaming writes. This is the "noisy neighbor" problem at the hardware level.
ISLE addresses this by extending its Capability Domain model to physically partition shared hardware resources, using platform QoS extensions where available.
16.6.1 Cache Partitioning (Intel RDT / ARM MPAM)
Modern server CPUs expose hardware Quality of Service (QoS) mechanisms that allow the OS to assign cache and memory bandwidth quotas per workload:
- Intel Resource Director Technology (RDT): Available on Xeon Skylake-SP and later.
Provides Cache Allocation Technology (CAT) for L3 partitioning and Memory Bandwidth
Allocation (MBA) for memory controller throttling. Controlled via MSRs (IA32_PQR_ASSOC, IA32_L3_MASK_n). Up to 16 Classes of Service (CLOS).
- ARM Memory Partitioning and Monitoring (MPAM): Optional extension introduced in ARMv8.4-A (FEAT_MPAM). Provides Cache Portion Partitioning (CPP) and Memory Bandwidth Partitioning (MBP). Thread-to-partition assignment is via system registers (MPAM0_EL1, MPAM1_EL1), which set the PARTID for the executing thread. Actual resource limits (cache way bitmasks, bandwidth caps) are configured via MMIO registers in each Memory System Component (MSC) — e.g., MPAMCFG_CPBM for cache portions and MPAMCFG_MBW_MAX for bandwidth limits. Up to 256 Partition IDs (PARTIDs).
ISLE integrates these into the Capability Domain model (Section 11). Each Capability
Domain can optionally carry a ResourcePartition constraint:
// isle-core/src/rt/resource_partition.rs
/// Hardware resource partition assigned to a Capability Domain.
/// Only meaningful when the platform provides QoS extensions (RDT, MPAM).
/// On platforms without QoS support, this struct is ignored.
pub struct ResourcePartition {
/// L3 Cache Allocation bitmask.
/// Each set bit grants the domain access to one cache "way."
/// On Intel RDT, this maps to IA32_L3_MASK_n for the assigned CLOS.
/// On ARM MPAM, this maps to the cache portion bitmap for the assigned PARTID.
/// Example: 0x000F = ways 0-3 (exclusive to this domain).
pub l3_cache_mask: u32,
/// Memory Bandwidth Allocation percentage (1-100).
/// Throttles memory controller traffic generated by this domain.
/// On Intel MBA, this maps to the delay value for the assigned CLOS.
/// On ARM MPAM, this maps to the MBW_MAX control for the assigned PARTID.
/// 100 = no throttling. 50 = limit to ~50% of peak bandwidth.
pub mem_bandwidth_pct: u8,
/// Whether this partition is exclusive (no overlap with other domains).
/// When true, the kernel verifies that no other domain's l3_cache_mask
/// overlaps with this one. Allocation fails with EBUSY if overlap detected.
pub exclusive: bool,
}
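The exclusivity rule described in the struct's comments reduces to a bitmask intersection. A minimal sketch (`check_exclusive` is a hypothetical helper, not the isle-core API):

```rust
/// Returns true if `new_mask` can be granted exclusively, i.e. it
/// overlaps no L3 way mask already granted to another domain.
/// (Hypothetical helper; in ISLE this check runs wherever domains
/// register a ResourcePartition, failing with EBUSY on overlap.)
fn check_exclusive(granted_masks: &[u32], new_mask: u32) -> bool {
    granted_masks.iter().all(|&m| m & new_mask == 0)
}
```

An RT domain requesting ways 0-3 (0x000F) coexists with a best-effort domain confined to ways 4-15 (0xFFF0); any overlap fails the check.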
Determinism strategy:
- RT Domains: Granted exclusive L3 cache ways (e.g., ways 0-3 on a 16-way cache). Their hot data is never evicted by other workloads. Memory bandwidth set to 100%.
- Best-Effort Domains: Restricted to the remaining L3 cache ways (e.g., ways 4-15) and throttled via MBA during contention (e.g., limited to 50% bandwidth).
- Discovery at boot: The kernel queries CPUID (Intel) or MPAM system registers (ARM) to discover the number of available cache ways and CLOS/PARTID slots. If the hardware does not support RDT/MPAM, the ResourcePartition constraint is ignored and a warning is logged.
Cache monitoring integration: RDT also provides Cache Monitoring Technology (CMT) and Memory Bandwidth Monitoring (MBM), which report per-CLOS cache occupancy and bandwidth usage. ISLE exposes these counters via the observability framework (Section 40) as stable tracepoints, enabling operators to verify that RT workloads remain within their allocated cache partition.
16.6.2 Strict Memory Pinning for RT Domains
Hard real-time tasks cannot tolerate page faults. A single page fault in a 10 kHz control loop adds 3-50 μs of jitter (TLB miss + page table walk + potential disk I/O), which can exceed the entire deadline budget.
ISLE provides strict memory pinning semantics for RT Capability Domains:
- Eager allocation: When an RT task calls mmap() or loads an executable, all physical frames are allocated and page table entries populated immediately (equivalent to MAP_POPULATE). No demand paging.
- Pre-faulted stacks: The kernel pre-faults the full stack allocation for RT threads at clone() time. Stack guard pages are still present, but the usable stack region is fully backed by physical memory.
- Exempt from reclaim: Pages owned by an RT domain are never targeted by kswapd page reclaim (Section 12), never compressed by ZRAM (Section 13), and never swapped. The OOM killer will target non-RT domains first; it will only kill an RT task as a last resort after all non-RT tasks have been considered.
- NUMA-local enforcement: When an RT task is bound to a NUMA node via set_mempolicy(MPOL_BIND), the allocator returns ENOMEM rather than falling back to remote NUMA nodes. Remote NUMA access adds 50-200 ns of unpredictable latency per cache miss — unacceptable for hard RT. NUMA auto-balancing (automatic page migration) is disabled for RT-priority tasks.
These properties are activated automatically when a task has SCHED_FIFO,
SCHED_RR, or SCHED_DEADLINE policy AND is assigned to a Capability Domain
with an RtConfig (Section 16.2). They can also be requested explicitly via
mlockall(MCL_CURRENT | MCL_FUTURE), which is the standard Linux RT practice.
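The activation rule above can be sketched as a predicate (the Policy enum and parameter names are illustrative, not the isle-core API):

```rust
/// Scheduling policies relevant here (names mirror the Linux API).
enum Policy { Normal, Batch, Fifo, Rr, Deadline }

/// Strict pinning activates automatically when a task has an RT policy
/// AND its Capability Domain carries an RtConfig, or on explicit request
/// via mlockall(MCL_CURRENT | MCL_FUTURE). (Sketch of the stated rule;
/// the parameters stand in for kernel-internal state.)
fn strict_pinning_active(policy: &Policy, has_rt_config: bool, mlockall_future: bool) -> bool {
    let rt_policy = matches!(policy, Policy::Fifo | Policy::Rr | Policy::Deadline);
    (rt_policy && has_rt_config) || mlockall_future
}
```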
16.6.3 Time-Sensitive Networking (TSN)
For distributed real-time systems (industrial control, automotive Ethernet, robotics),
determinism must extend beyond the CPU to the network. ISLE integrates with hardware
Time-Sensitive Networking (IEEE 802.1) features via isle-net (Section 34):
- IEEE 802.1Qbv (Time-Aware Shaper): NICs with hardware TSN support expose gate control lists (GCLs) that schedule packet transmission at precise microsecond intervals. ISLE bypasses the software Qdisc layer for TSN-tagged traffic classes, programming the NIC's hardware scheduler directly via KABI. RT packets are never queued in software — they are placed directly in a hardware TX ring whose transmission gate opens at the scheduled time.
- IEEE 802.1AS (Generalized Precision Time Protocol): Hardware PTP timestamps from the NIC's clock are fed directly to the timekeeping subsystem (Section 16a). The CLOCK_TAI system clock is synchronized to the PTP grandmaster with sub-microsecond accuracy. The CBS scheduler (Section 15) uses this PTP-synchronized timebase to align RT task wakeups with hardware transmission windows — the task wakes up, computes, and its output packet hits the NIC exactly when the 802.1Qbv gate is open.
- IEEE 802.1Qci (Per-Stream Filtering and Policing): Ingress traffic is filtered in hardware by stream ID. Non-RT traffic arriving on an RT-reserved stream is dropped at the NIC before it reaches the CPU, preventing interference with RT packet processing.
Architecture note: TSN support requires Tier 1 NIC drivers that implement the
TSN KABI extensions (gate control list programming, PTP clock read, stream filter
configuration). Standard NICs without TSN hardware operate normally but cannot provide
network-level determinism. The isle-net stack detects TSN capability at driver
registration via the device registry (Section 7).
16a. Timekeeping and Clock Management
Accurate, low-latency timekeeping is foundational: the scheduler needs monotonic
timestamps for CBS deadlines (Section 15), real-time tasks need bounded timer
latency (Section 16), and userspace applications call clock_gettime() millions
of times per second. This section describes how ISLE reads hardware clocks,
maintains system time, exposes fast timestamps to userspace, and manages timer
events.
16a.1 Clock Source Hierarchy
Each architecture provides one or more hardware cycle counters. ISLE selects the best available source at boot and can switch at runtime if a source proves unstable (Section 16a.5).
| Architecture | Primary Source | Secondary | Resolution | Access |
|---|---|---|---|---|
| x86-64 | TSC (Time Stamp Counter) | HPET, ACPI PM Timer | sub-ns | rdtsc (user/kernel) |
| AArch64 | Generic Timer (CNTPCT_EL0) | — | typically 1-10 ns | mrs (EL0 if enabled) |
| ARMv7 | Generic Timer (CNTPCT via cp15) | — | typically 1-10 ns | mrc (PL0 if enabled) |
| RISC-V | mtime (MMIO) | rdtime CSR | implementation-defined | rdtime (U-mode) |
| PPC32 | Timebase (TBL/TBU) | Decrementer (DEC) | typically 1-10 ns | mftb / mfspr |
| PPC64LE | Timebase (TB) | Decrementer (DEC) | sub-ns (POWER9: 512 MHz) | mftb (user/kernel) |
x86-64 notes: Modern processors (Intel Nehalem+, AMD Zen+) provide an
invariant TSC that runs at a constant rate regardless of frequency scaling or
C-state transitions. CPUID leaf 0x8000_0007 EDX bit 8 advertises this. When
invariant TSC is available, it is the preferred source: zero-cost reads
(rdtsc is unprivileged), sub-nanosecond resolution, and monotonicity
guaranteed across cores. When invariant TSC is absent, ISLE falls back to HPET
(~100 ns read latency, MMIO) or the ACPI PM Timer (~800 ns read latency,
port I/O).
AArch64 / ARMv7 notes: The ARM Generic Timer is architecturally defined and
always present. The kernel configures CNTKCTL_EL1 to allow EL0 (userspace)
reads of CNTPCT_EL0, enabling a vDSO fast path identical in spirit to x86
rdtsc.
RISC-V notes: The rdtime pseudo-instruction reads the platform-provided
real-time counter. Frequency is discoverable from the device tree
(timebase-frequency property). Resolution varies by implementation.
All clock sources implement a common abstraction:
// isle-core/src/time/clocksource.rs
/// Hardware clock source abstraction.
/// Implementations are per-architecture; the best source is selected at boot.
pub trait ClockSource: Send + Sync {
/// Read the current cycle count from hardware.
fn read_cycles(&self) -> u64;
/// Nominal frequency of this clock source in Hz.
fn frequency_hz(&self) -> u64;
/// Quality rating: higher values are preferred when multiple sources exist.
/// TSC invariant = 350, HPET = 250, ACPI PM Timer = 100.
fn rating(&self) -> u32;
/// Whether this source continues counting through CPU sleep states.
fn is_continuous(&self) -> bool;
/// Upper bound on single-read uncertainty in nanoseconds.
/// Accounts for read latency and synchronization jitter.
fn uncertainty_ns(&self) -> u32;
}
At boot, isle-core enumerates available sources, sorts by rating(), and
activates the highest-rated continuous source. The secondary source (if any) is
retained for watchdog cross-validation (Section 16a.5).
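The selection policy can be sketched with two mock sources; the `name()` method is added here for illustration and is not part of the trait above:

```rust
/// Reduced ClockSource covering only what the selection policy needs.
trait ClockSource {
    fn rating(&self) -> u32;
    fn is_continuous(&self) -> bool;
    fn name(&self) -> &'static str;
}

struct InvariantTsc; // rating 350, continuous
struct Hpet;         // rating 250, continuous

impl ClockSource for InvariantTsc {
    fn rating(&self) -> u32 { 350 }
    fn is_continuous(&self) -> bool { true }
    fn name(&self) -> &'static str { "tsc" }
}
impl ClockSource for Hpet {
    fn rating(&self) -> u32 { 250 }
    fn is_continuous(&self) -> bool { true }
    fn name(&self) -> &'static str { "hpet" }
}

/// Boot-time policy: the highest-rated continuous source becomes the
/// primary; the runner-up is retained as the watchdog reference.
fn select(mut sources: Vec<&dyn ClockSource>) -> (&'static str, Option<&'static str>) {
    sources.sort_by_key(|s| std::cmp::Reverse(s.rating()));
    let mut continuous = sources.into_iter().filter(|s| s.is_continuous());
    let primary = continuous.next().expect("no continuous clocksource").name();
    let secondary = continuous.next().map(|s| s.name());
    (primary, secondary)
}
```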
16a.2 Timekeeping Subsystem
ISLE maintains four clocks, matching POSIX semantics:
| Clock | Semantics | Adjustable? |
|---|---|---|
| CLOCK_MONOTONIC | Time since boot, NTP-adjusted rate | No (monotonic) |
| CLOCK_MONOTONIC_RAW | Time since boot, raw hardware rate | No |
| CLOCK_REALTIME | Wall clock (UTC), NTP-adjusted | Yes (clock_settime, NTP) |
| CLOCK_BOOTTIME | Like CLOCK_MONOTONIC but includes suspend time | No |
Timestamp representation: All internal timestamps use a (seconds: u64,
nanoseconds: u64) tuple. Both fields are 64-bit to avoid overflow in
intermediate arithmetic (nanoseconds may temporarily exceed 10^9 during
computation and are normalized before storage).
Global timekeeper state is protected by a seqlock — the same pattern used in
Linux timekeeping.c. Readers (including the vDSO) retry if they observe a torn
update. Writers (the timer interrupt handler) are serialized by holding the
seqlock write side.
// isle-core/src/time/timekeeper.rs
/// Global timekeeping state, updated on every tick or clocksource event.
pub struct Timekeeper {
pub seq: SeqLock, // seqlock protecting all fields
pub clock: &'static dyn ClockSource, // active clock source
pub cycle_last: u64, // last cycle count at update
pub mask: u64, // counter wrap bitmask
pub mult: u32, // ns = (cycles * mult) >> shift
pub shift: u32,
pub wall_sec: u64, // CLOCK_REALTIME
pub wall_nsec: u64,
pub mono_sec: u64, // CLOCK_MONOTONIC
pub mono_nsec: u64,
pub boot_offset_sec: u64, // CLOCK_BOOTTIME delta
pub boot_offset_nsec: u64,
pub freq_adj: i64, // NTP/PTP frequency correction (scaled ppm)
pub phase_adj: i64, // NTP/PTP phase correction (ns)
}
NTP/PTP discipline: An adjtimex()-compatible interface accepts frequency
and phase corrections from userspace NTP or PTP daemons. Frequency adjustment
modifies mult slightly so that cycles-to-nanoseconds conversion drifts at the
requested rate. Phase adjustment is applied as a slew (at most 500 ppm rate
adjustment) to avoid wall clock jumps. CLOCK_MONOTONIC_RAW is immune to both
adjustments — it reflects raw hardware cycles converted at the nominal rate.
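A sketch of the mult-based frequency discipline, assuming corrections are applied as a simple scaled adjustment (rounding and clamping details are illustrative):

```rust
/// Cycles → nanoseconds via the fixed-point (mult, shift) pair used by
/// Timekeeper and VdsoData: ns = (cycles * mult) >> shift.
fn cyc_to_ns(cycles: u64, mult: u32, shift: u32) -> u64 {
    ((cycles as u128 * mult as u128) >> shift) as u64
}

/// Apply a frequency correction of `ppm` parts-per-million to `mult`.
/// Positive ppm speeds the clock up. (Sketch of the adjtimex-style
/// discipline described above, not the real isle-core code.)
fn adjust_mult(mult: u32, ppm: i64) -> u32 {
    let delta = (mult as i64).saturating_mul(ppm) / 1_000_000;
    (mult as i64 + delta) as u32
}
```

For a 1 GHz source at shift = 24, mult = 1 << 24 makes one cycle exactly one nanosecond; a +100 ppm correction stretches one second's worth of cycles by roughly 100 μs.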
16a.3 vDSO Fast Path
Linux problem: clock_gettime() is the most frequently invoked syscall in
many workloads (databases, trading systems, telemetry). A kernel entry costs
~100-200 ns due to mode switch, KPTI page table reload, and speculative
execution mitigations. At millions of calls per second, this adds up.
ISLE design: Like Linux, ISLE maps a vDSO (virtual Dynamic Shared Object)
into every process's address space. The vDSO contains userspace implementations
of clock_gettime(), gettimeofday(), and time() that read the hardware
clocksource directly and apply precomputed conversion parameters — no syscall
needed.
The kernel maintains a read-only shared page (the vDSO data page) that it updates on every timer tick and on NTP adjustments. Userspace vDSO code reads this page under seqlock protection.
// isle-core/src/time/vdso.rs
/// Shared page mapped read-only into every process.
/// Updated by the kernel under seqlock protection.
#[repr(C)]
pub struct VdsoData {
// Maps to the `SeqLock` counter in `Timekeeper`; the kernel writes this
// field as part of the seqlock protocol (odd = update in progress,
// even = consistent). Userspace reads `seq` before and after reading
// fields to detect torn reads and retry.
pub seq: u32,
pub clock_mode: u32, // which clocksource (TSC, HPET, Generic Timer, ...)
pub cycle_last: u64, // cycle count at last kernel update
pub mask: u64, // clocksource bitmask
pub mult: u32, // ns = ((cycles - cycle_last) & mask) * mult >> shift
pub shift: u32,
pub wall_time_sec: u64, // CLOCK_REALTIME base
pub wall_time_nsec: u64,
pub monotonic_time_sec: u64, // CLOCK_MONOTONIC base
pub monotonic_time_nsec: u64,
pub boottime_sec: u64, // CLOCK_BOOTTIME base
pub boottime_nsec: u64,
}
vDSO read path (userspace, per-architecture):
1. Read seq. If odd, spin (kernel is mid-update).
2. Read cycle_last, mult, shift, mask, and the relevant base time.
3. Read the hardware counter (rdtsc / mrs CNTPCT_EL0 / rdtime).
4. Compute delta = (now - cycle_last) & mask.
5. Compute ns = base_nsec + ((delta * mult) >> shift). Normalize into seconds.
6. Re-read seq. If it changed, go to step 1.
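The retry loop in runnable form, with a single base value standing in for the full VdsoData fields (a sketch of the seqlock protocol only):

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Stand-in for the vDSO data page: the seq counter plus one value.
/// (Illustrative; the real page carries the full mult/shift/base set.)
struct VdsoPage {
    seq: AtomicU32,
    mono_ns: AtomicU64,
}

impl VdsoPage {
    /// Kernel side: seq is odd while updating, even when consistent.
    fn write(&self, ns: u64) {
        self.seq.fetch_add(1, Ordering::Release); // now odd
        self.mono_ns.store(ns, Ordering::Relaxed);
        self.seq.fetch_add(1, Ordering::Release); // even again
    }

    /// Userspace side: retry until an even, unchanged seq brackets
    /// the read, proving no update raced with it.
    fn read(&self) -> u64 {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 % 2 == 1 {
                continue; // update in progress
            }
            let val = self.mono_ns.load(Ordering::Relaxed);
            let s2 = self.seq.load(Ordering::Acquire);
            if s1 == s2 {
                return val;
            }
        }
    }
}
```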
Cost: ~5-20 ns depending on architecture (dominated by the clocksource read instruction itself). This is 10-40x faster than a syscall path.
Fallback: If clock_mode indicates no userspace-readable source is
available (e.g., HPET on x86, which requires MMIO the kernel has not mapped
into user address space), the vDSO falls back to a real syscall instruction.
16a.4 Timer Infrastructure
ISLE provides two timer mechanisms, matching the Linux split between coarse and high-resolution timers.
Timer wheel (coarse-grained, jiffies resolution):
Used for network retransmission timeouts, poll/epoll timeouts, and other events where millisecond precision is sufficient. Implemented as a hierarchical timer wheel with O(1) insertion and O(1) per-tick processing (cascading is amortized). The wheel uses 8 levels with 256 slots each, covering timeouts from 1 tick to ~50 days at HZ=250.
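The level/slot indexing can be sketched as repeated 8-bit shifts (illustrative only; the real wheel also offsets by the current tick and cascades timers downward on expiry):

```rust
/// Map a timeout delta (in ticks) to its wheel level and slot.
/// Level k has slot granularity 256^k ticks: 8 levels × 256 slots.
fn wheel_position(delta_ticks: u64) -> (u32, usize) {
    let mut level = 0;
    let mut d = delta_ticks;
    // Shift by 8 bits per level until the delta fits in one level's
    // 256 slots, capping at the top level.
    while d >= 256 && level < 7 {
        d >>= 8;
        level += 1;
    }
    (level, (d & 0xFF) as usize)
}
```

Insertion is O(1): compute the position, link the timer into that slot's list. The coarser granularity at higher levels is what makes slack-tolerant coalescing natural.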
High-resolution timers (hrtimers, nanosecond precision):
Used for timer_create() (POSIX per-process timers), nanosleep() /
clock_nanosleep(), timerfd_create(), and scheduler deadline enforcement.
Implemented as a per-CPU red-black tree keyed by absolute expiry time. The
nearest expiry programs the hardware timer (local APIC on x86, Generic Timer on
ARM, mtimecmp on RISC-V) to fire at the exact time.
// isle-core/src/time/hrtimer.rs
/// A high-resolution timer.
pub struct HrTimer {
/// Absolute expiry time (CLOCK_MONOTONIC nanoseconds).
pub expires_ns: u64,
/// Callback invoked on expiry. Runs in hard-IRQ context.
pub callback: fn(&mut HrTimer),
/// Opaque context value passed to the callback. Typically a pointer to
/// the enclosing structure (cast via `as usize`), allowing the callback
/// to recover its context via `unsafe { &*(context as *const T) }`.
/// This is the Rust equivalent of Linux's `container_of` pattern for
/// timer callbacks.
///
/// # Safety
///
/// Using `context` as a pointer requires the following invariants:
///
/// 1. **Pinning**: The `HrTimer` must be embedded in a `Pin`-ned
/// allocation. The enclosing structure must not move while the timer
/// is armed, since `context` stores a raw pointer to it.
/// 2. **Drop ordering**: The enclosing structure's `Drop` implementation
/// must cancel the timer (`hrtimer_cancel()`) before deallocation,
/// ensuring the callback never fires with a dangling `context`.
/// 3. **Type agreement**: The callback is responsible for casting
/// `context` back to the correct type via
/// `unsafe { &*(self.context as *const T) }`. The caller that sets
/// `context` and the callback must agree on the type `T`.
/// 4. These invariants match Linux's `container_of` + `hrtimer` pattern,
/// adapted for Rust's ownership model. The timer subsystem enforces
/// invariant (1) by requiring `Pin<&mut HrTimer>` for
/// `hrtimer_start()`.
pub context: usize,
/// Timer state.
pub state: HrTimerState,
/// Owning CPU (timers are per-CPU to avoid cross-CPU synchronization).
pub cpu: u32,
}
Per-CPU timer queues: Each CPU maintains its own timer wheel and hrtimer tree. Timer insertion targets the local CPU by default. Expiry processing happens in the local timer interrupt — no cross-CPU IPI is needed. This eliminates contention and provides deterministic latency on isolated CPUs (Section 16.3.2).
Timer coalescing: When a timer is inserted with a slack tolerance (e.g.,
a 100 ms timeout with 10 ms acceptable slack), the kernel may delay it to
coalesce with nearby timers. This reduces wakeups on idle CPUs, improving power
efficiency. Coalescing is disabled for hrtimers with zero slack (RT workloads).
The timer_slack_ns per-process tunable controls default slack, identical to
the Linux interface.
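The slack policy can be sketched as rounding within the allowed window (illustrative: the real kernel coalesces against actual neighboring timers rather than a fixed grain):

```rust
/// Coalesce a timer within its slack window: round the expiry up to the
/// next multiple of `grain_ns`, but never past expiry + slack. Zero
/// slack (RT hrtimers) leaves the expiry untouched.
fn apply_slack(expiry_ns: u64, slack_ns: u64, grain_ns: u64) -> u64 {
    if slack_ns == 0 {
        return expiry_ns; // RT: no coalescing permitted
    }
    let rounded = (expiry_ns + grain_ns - 1) / grain_ns * grain_ns;
    if rounded <= expiry_ns + slack_ns { rounded } else { expiry_ns }
}
```

Timers rounded onto the same grain boundary fire in one wakeup instead of several, which is the whole power benefit.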
16a.5 Clocksource Watchdog
A clocksource that reports incorrect time is worse than a slow one — it causes silent data corruption in timestamps, incorrect scheduler decisions, and broken network protocols.
Cross-validation: Every 500 ms (configurable), the kernel reads both the primary and secondary clocksource and compares the elapsed interval. If the primary's elapsed time deviates from the secondary's by more than a threshold (default: 100 ppm sustained over 5 consecutive checks), the primary is marked unstable.
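A sketch of the drift computation and the five-consecutive-checks rule, using the defaults quoted above (helper names are hypothetical):

```rust
/// One watchdog check: compare the elapsed nanoseconds reported by the
/// primary and secondary clocksources over the same interval; return
/// the deviation in parts-per-million of the secondary's reading.
fn drift_ppm(primary_elapsed_ns: u64, secondary_elapsed_ns: u64) -> u64 {
    let diff = primary_elapsed_ns.abs_diff(secondary_elapsed_ns);
    diff * 1_000_000 / secondary_elapsed_ns
}

/// Marked unstable only after 5 consecutive checks exceed the
/// threshold, so a single noisy sample never demotes the source.
fn is_unstable(drift_history: &[u64], threshold_ppm: u64) -> bool {
    drift_history.windows(5).any(|w| w.iter().all(|&d| d > threshold_ppm))
}
```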
TSC instability detection: On x86, the TSC can be unreliable in several scenarios:
- Non-invariant TSC (pre-Nehalem Intel, pre-Zen AMD): frequency changes with P-state transitions.
- TSC halts during deep C-states on some older processors.
- TSC desynchronization across sockets on early multi-socket systems.
The watchdog detects all three cases. When the TSC is marked unstable:
- The kernel logs a warning: clocksource: TSC marked unstable (drift >100ppm vs HPET).
- The active clocksource switches to HPET (or ACPI PM Timer if HPET is absent).
- The vDSO clock_mode is updated so userspace falls back to the syscall path (HPET is not readable from userspace without a kernel MMIO mapping).
- The switch is atomic from the perspective of seqlock readers — one consistent snapshot uses TSC parameters, the next uses HPET parameters.
Capability-gated calibration: TSC frequency calibration (reading MSRs like
MSR_PLATFORM_INFO or calibrating against PIT/HPET) requires privileged
operations. Only isle-core holds the capability to read/write MSRs. Tier 1
drivers cannot influence clocksource selection — a compromised driver cannot
subvert system timekeeping.
16a.6 Interaction with RT and Power Management
RT timer latency: Real-time tasks (Section 16) depend on bounded timer
expiry. On CPUs designated for RT workloads (isolcpus, nohz_full), hrtimer
expiry is serviced directly in hard-IRQ context with a preemption-disabled path
of bounded length. The worst-case path from hardware interrupt to hrtimer
callback execution is: interrupt entry (~200 cycles) + hrtimer tree lookup (O(1)
for the nearest timer) + callback invocation. On x86 with a local APIC timer
and an isolated CPU (no frequency scaling, shallow C-states, nohz_full),
the software path completes in under 1 μs. However, hardware-level
non-determinism (DRAM refresh cycles ~350ns worst-case, cache miss penalties,
memory controller contention) means the end-to-end observed latency on
real hardware is typically 1-5 μs under favorable conditions and up to 10 μs
under worst-case memory pressure. These figures match measured PREEMPT_RT Linux
performance on isolated cores. Section 16.6 details the hardware resource
partitioning (CAT, MBA, RDT) that ISLE uses to minimize hardware-level jitter.
When PreemptionModel::Realtime is active (Section 16.2), softirq-context
timers are promoted to hard-IRQ context for RT-priority hrtimers, ensuring they
cannot be delayed by threaded interrupt processing.
C-state interaction with clocksources: CPU power states affect timer behavior:
| C-state | Invariant TSC | Non-Invariant TSC | Generic Timer (ARM) | mtime (RISC-V) | Timebase (PPC) |
|---|---|---|---|---|---|
| C1 (halt) | Continues | May stop | Continues | Continues | Continues |
| C3+ (deep sleep) | Continues | Stops | Continues | Continues | Continues |
When a non-invariant TSC is detected and the system supports deep C-states, the kernel forces HPET as the clocksource and disables the vDSO fast path for timestamp reads. This is a correctness requirement, not a performance choice.
Tickless (nohz) mode: When a CPU has no pending timers and is running a single task (or is idle), the periodic tick is stopped entirely. The kernel reprograms the hardware timer to fire at the next actual event (nearest hrtimer expiry, or infinity if none). This eliminates unnecessary wakeups on isolated RT CPUs and idle CPUs.
Resuming the tick happens when: (a) a new timer is inserted on the CPU, (b) a
second task becomes runnable (the scheduler needs periodic load balancing), or
(c) an interrupt wakes the CPU from idle. The nohz implementation reuses Linux's
nohz_full semantics: user code on an isolated CPU can run for arbitrarily long
periods without a single kernel interrupt.
Power-aware timer placement: When timer coalescing (Section 16a.4) groups timers, the kernel prefers placing them on CPUs that are already awake. Waking a CPU from C3+ costs ~100 μs and defeats the purpose of coalescing. The timer subsystem queries the per-CPU idle state before choosing a coalescing target.
16b. System Event Bus
The event bus is a core kernel facility that enables kernel subsystems and drivers to notify userspace of hardware and system state changes via a capability-gated, lock-free ring buffer mechanism. Netlink compatibility (for udev/systemd) is implemented in isle-compat (§20b, 05-linux-compat.md).
16b.1 Event Subscription Model
// isle-core/src/event/mod.rs
/// System event types.
#[repr(u32)]
pub enum EventType {
/// Battery level changed.
BatteryLevelChanged = 1,
/// AC adapter state changed (plugged/unplugged).
AcStateChanged = 2,
/// WiFi connection state changed.
WifiStateChanged = 3,
/// Bluetooth device paired/unpaired.
BluetoothDeviceChanged = 4,
/// USB device inserted/removed.
UsbDeviceChanged = 5,
/// Display hotplug (connected/disconnected).
DisplayHotplug = 6,
/// Thermal event (warning, critical).
ThermalEvent = 7,
/// Power profile changed.
PowerProfileChanged = 8,
}
/// Event payload (max 256 bytes, cache-line friendly).
#[repr(C)]
pub struct Event {
/// Event type.
pub event_type: EventType,
/// Timestamp (monotonic ns).
pub timestamp_ns: u64,
/// Event-specific data.
pub data: EventData,
}
/// Event-specific data (union of all possible payloads).
#[repr(C)]
pub union EventData {
pub battery: BatteryEvent,
pub ac: AcEvent,
pub wifi: WifiEvent,
pub bluetooth: BluetoothEvent,
pub usb: UsbEvent,
pub display: DisplayEvent,
pub thermal: ThermalEvent,
pub power_profile: PowerProfileEvent,
_pad: [u8; 240], // Pad to 256 bytes total with event_type and timestamp
}
/// Battery event data.
#[repr(C)]
pub struct BatteryEvent {
/// Battery percentage (0-100).
pub percent: u8,
/// Charging state (0=discharging, 1=charging, 2=full).
pub charging: u8,
/// Time remaining in minutes (0xFFFF = unknown).
pub time_remaining_min: u16,
}
/// Display hotplug event data.
#[repr(C)]
pub struct DisplayEvent {
/// Connector ID.
pub connector_id: u32,
/// Event subtype (0=disconnected, 1=connected).
pub connected: u8,
}
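Under #[repr(C)], the layout above pins Event at exactly 256 bytes: a 4-byte discriminant, 4 bytes of alignment padding, an 8-byte timestamp, and the 240-byte union. A reduced sketch checking the arithmetic (payload variants other than battery omitted for brevity):

```rust
use std::mem::size_of;

#[repr(C)]
#[derive(Clone, Copy)]
struct BatteryEvent {
    percent: u8,
    charging: u8,
    time_remaining_min: u16,
}

/// Reduced EventData: one real payload plus the 240-byte pad that
/// fixes the union's size regardless of which payloads exist.
#[repr(C)]
union EventData {
    battery: BatteryEvent,
    _pad: [u8; 240],
}

#[repr(C)]
struct Event {
    event_type: u32,   // offset 0 (stand-in for the repr(u32) enum)
    timestamp_ns: u64, // offset 8: u64 alignment inserts 4 pad bytes
    data: EventData,   // offset 16
}
```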
16b.2 Subscription via Capability
Processes subscribe to events via the capability system (§11):
impl EventManager {
/// Subscribe to a class of events. Returns an EventSubscription capability.
///
/// # Security
/// Requires `CAP_SYS_ADMIN` for system-wide events (thermal, power profile).
/// Requires `CAP_NET_ADMIN` for network events (WiFi, Bluetooth).
/// Battery, AC, USB, display events are unrestricted (visible to all processes).
pub fn subscribe(&self, event_type: EventType, process: &Process) -> Result<CapabilityToken, EventError> {
// Check capability grants.
match event_type {
EventType::ThermalEvent | EventType::PowerProfileChanged => {
if !process.has_capability(Capability::SysAdmin) {
return Err(EventError::PermissionDenied);
}
}
EventType::WifiStateChanged | EventType::BluetoothDeviceChanged => {
if !process.has_capability(Capability::NetAdmin) {
return Err(EventError::PermissionDenied);
}
}
_ => {} // Unrestricted
}
// Allocate event ring buffer (per-process, 4 KB = ~16 events).
let ring = EventRing::allocate(process)?;
// Mint capability token.
let cap_token = self.capability_manager.mint(
CapabilityType::EventSubscription,
CapabilityRights::READ,
ring.ring_id(),
)?;
// Register subscription. Store a weak handle so a subscriber that
// exits (freeing its ring) does not leak the slot; post_event()
// upgrades it before each push. The dropped-event counter backs the
// ring-full accounting below.
self.subscriptions.insert(event_type, SubscriptionInfo {
process_id: process.pid(),
ring: ring.downgrade(),
dropped_events: AtomicU64::new(0),
});
Ok(cap_token)
}
/// Post an event to all subscribers.
pub fn post_event(&self, event: Event) {
let subs = self.subscriptions.get(&event.event_type);
for sub in subs {
if let Some(ring) = sub.ring.upgrade() {
// Write event to subscriber's ring buffer (lock-free push).
if ring.push(event).is_err() {
// Ring full: drop event (subscriber is too slow).
sub.dropped_events.fetch_add(1, Ordering::Relaxed);
}
}
}
}
}
For Netlink compatibility (udev, systemd integration), see §20b (05-linux-compat.md).
16b.3 Integration Points
| Subsystem | Events posted |
|---|---|
| Battery driver (§14a.8) | BatteryLevelChanged, AcStateChanged |
| WiFi driver (§12.1, 03-drivers.md) | WifiStateChanged |
| Bluetooth (§12a, 03-drivers.md) | BluetoothDeviceChanged |
| USB bus (§10) | UsbDeviceChanged |
| Display driver (§62) | DisplayHotplug |
| Thermal framework (§14a.3) | ThermalEvent |
| Power profiles (§14a.9) | PowerProfileChanged |