Chapter 22: AI/ML and Accelerators¶
Unified accelerator framework, accelerator memory/P2P DMA, isolation/scheduling, in-kernel inference, accelerator networking, unified compute model
A unified framework for heterogeneous compute accelerators: GPUs, NPUs, DPUs, and inference engines. Accelerator memory management supports P2P DMA across devices without host CPU copies. The in-kernel inference engine runs lightweight ML models for kernel policy decisions. All accelerator drivers use the standard KABI isolation tiers.
22.1 Unified Accelerator Framework¶
This section defines kernel-level AI/ML infrastructure: unified accelerator management, heterogeneous memory, peer-to-peer DMA, multi-tenant isolation, in-kernel inference, and distributed ML networking. These are kernel-internal capabilities — training frameworks and model serving remain in userspace.
Design Constraints:
- Drop-in compatibility: Existing Linux GPU/accelerator userspace (CUDA runtime, ROCm, OpenCL, Vulkan compute, /dev/dri, /dev/nvidia*) must work through the compatibility layer. Existing tools must not break.
- Superset, not replacement: UmkaOS exposes additional accelerator management capabilities beyond what Linux provides. Userspace software that wants better scheduling, isolation, or memory management can opt in. Software that doesn't care sees standard Linux behavior.
- IP-clean: Built from public specifications (PCIe, VFIO, DRM/KMS interfaces), academic algorithms, and open standards. No proprietary API reimplementation.
22.1.1 Motivation¶
22.1.1.1 The Current State¶
In 2026, accelerators (GPUs, NPUs, TPUs, FPGAs, custom ASICs) are as fundamental to computing as network interfaces. Every cloud instance, every phone, every laptop has at least one. AI/ML workloads are the dominant growth driver in datacenter compute.
Yet the Linux kernel treats accelerators as dumb peripherals. The kernel's relationship to a GPU is roughly:
Linux kernel:
- Maps PCI BARs (MMIO)
- Sets up IOMMU
- Routes interrupts
- That's it. The driver owns everything else.
GPU driver (e.g., nvidia.ko):
- Owns the command queue
- Owns memory management (VRAM allocation, page tables)
- Owns scheduling (which process runs on which compute unit)
- Owns isolation (who can see whose memory)
- Owns power management (GPU clock, thermal)
- The kernel knows nothing about any of this.
This is equivalent to how operating systems managed CPUs before preemptive multitasking — the hardware vendor's runtime owns the resource, the kernel is blind.
22.1.1.2 What This Causes¶
| Problem | Impact |
|---|---|
| No kernel-visible GPU scheduling | Two containers sharing a GPU have no fairness guarantees. One can starve the other. |
| No cgroup integration | cpu.max limits CPU time. Nothing limits GPU time. Kubernetes has no way to enforce GPU QoS. |
| No memory accounting | GPU VRAM usage is invisible to the OOM killer. A GPU memory leak crashes the GPU driver, not just the leaking process. |
| No preemption | A long-running GPU kernel (training batch) blocks interactive inference. Linux has solved this for CPUs since 1991. |
| No unified memory management | CUDA UVM, AMD SVM, Intel SVM — each vendor's "unified memory" is driver-internal. The kernel's page fault handler doesn't know about GPU page tables. |
| No isolation | GPU memory between processes is isolated by the driver, not the kernel. Driver bugs = security holes. |
| Driver crash = system crash | NVIDIA driver hang requires full system reboot. No crash recovery. |
| No P2P DMA management | GPUDirect is vendor-specific. No kernel facility for "DMA from NVMe directly to GPU VRAM." |
22.1.1.3 The Opportunity¶
A kernel designed from scratch in 2026 can treat accelerators as first-class scheduled resources, the same way it treats CPUs, memory, and I/O bandwidth. This is not about running PyTorch in the kernel — it is about the kernel managing accelerator hardware the way it manages all other hardware: scheduling, memory, isolation, recovery.
UmkaOS's existing architecture is uniquely suited for this:
- KABI: Stable driver ABI means GPU driver updates are decoupled from kernel updates
- Crash recovery: GPU driver crashes can be survived (driver binary reload in ~50-150ms; total recovery including GPU hardware reset ~100ms–5s, varying by hardware: simple GPU reset ~100-500ms, datacenter GPUs with GSP firmware reload such as NVIDIA H100 may take 2-5s)
- Capability system: Natural fit for fine-grained accelerator access control
- Device registry: Models accelerator topology (GPU → engines, VRAM, display, encode)
- cgroup integration: CPU bandwidth guarantees (Section 7.6) extend naturally to accelerator time
- Zero-copy I/O paths: Generalize to device-to-device DMA
22.1.2 Unified Accelerator Framework¶
22.1.2.1 Design: umka-accel¶
A new KABI interface family for accelerator devices. Every accelerator driver — GPU, NPU, TPU, FPGA, DSP, custom ASIC — implements the same base interface. Hardware-specific capabilities are exposed through versioned extension vtables.
VTable cross-reference — all KABI vtables defined in this chapter:
| VTable | Section | Purpose |
|---|---|---|
| AccelBaseVTable | Section 22.1 | Core device lifecycle: init, reset, power, migration, fencing |
| AccelComputeVTable | Section 22.3 | Compute dispatch: context create/destroy, submit, preempt |
| AccelDisplayVTable | Section 22.3 | Display/KMS: mode set, framebuffer, hotplug, cursor |
| InferenceServiceVTable | Section 22.6 | Tier 2 inference service: async query, model update, metrics |
| RdmaDeviceVTable | Section 22.7 | RDMA operations: PD, MR, QP, CQ, post send/recv |
All vtables follow the KABI versioning protocol (Section 12.5):
vtable_size + version fields enable forward/backward compatibility. Extension vtables
(Compute, Display) are registered independently of AccelBase — a driver may implement any
combination. Accelerator drivers that depend on KABI helper services (e.g.,
accel-display-helpers, accel-compute-helpers) declare those dependencies in their
driver manifest and the kernel resolves them at probe time
(Section 12.7).
umka-accel (KABI interface)
        |
        +---------------+---------------+
        |               |               |
   AccelBase       AccelCompute    AccelDisplay
   (all accel)     (compute-       (display/render
        |           capable)        capable)
        |               |               |
  +-----+-----+     +---+---+       +---+---+
  |     |     |     |       |       |       |
nvidia amdgpu xe  nvidia  amdgpu  nvidia  amdgpu
 GPU    GPU  GPU   GPU     GPU     GPU     GPU
  |
intel_npu
  |
custom_tpu
22.1.2.2 AccelBaseVTable — The Universal Interface¶
Every accelerator driver provides this vtable. It covers what the kernel needs to manage the device, not what userspace needs to program it.
/// Base vtable that every accelerator driver must implement.
/// This is the kernel-management interface, not the compute API.
// KABI vtable — crosses driver-kernel boundary. Layout is stable across kernel versions.
#[repr(C)]
pub struct AccelBaseVTable {
/// Bounds-safety check: total size of this vtable struct in bytes. Allows the
/// kernel to detect when a driver provides a newer (larger) vtable and safely
/// ignore trailing fields it does not understand.
pub vtable_size: usize,
/// Primary version discriminant: `KabiVersion::as_u64()`.
/// See [Section 12.2](12-kabi.md#kabi-abi-rules-and-lifecycle) Rule 6.
/// Encoding: `(major << 32) | (minor << 16) | patch`.
/// See [Section 12.7](12-kabi.md#kabi-service-dependency-resolution) for the full encoding
/// definition (`KabiVersion::as_u64()`) and compatibility rules.
pub kabi_version: u64,
// === Device Info ===
/// Return device capabilities and resource inventory.
pub get_info: unsafe extern "C" fn(
ctx: *mut c_void,
out_info: *mut AccelDeviceInfo,
) -> IoResultCode,
// === Context Management ===
/// Create an execution context (one per process/tenant).
/// Returns an opaque context handle.
pub create_context: unsafe extern "C" fn(
ctx: *mut c_void,
// LONGEVITY: u32 is correct — Linux `pid_t` is i32 by POSIX/kernel ABI.
// PID space is recycled (default max 4M, CONFIG_PID_MAX=4194304).
// This is an external ABI constraint, not an internal counter.
owner_pid: u32,
priority: AccelPriority,
limits: *const AccelContextLimits,
out_context: *mut AccelContextHandle,
) -> IoResultCode,
/// Destroy an execution context and free all its resources.
pub destroy_context: unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
) -> IoResultCode,
// === Command Submission ===
/// Submit a command buffer for execution.
/// The kernel calls this after validating capabilities and
/// scheduling the submission according to policy.
///
/// **Note**: Command buffers are pre-created via `create_cmd_buffer` and
/// recorded via `ACCEL_IOCTL_CMD_RECORD` before submission. The driver
/// receives the opaque `AccelCmdBufferHandle` (not raw command data).
/// This allows the driver to manage command buffer memory and validation
/// independently of the submission path.
pub submit_commands: unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
cmd_buffer: AccelCmdBufferHandle, // Opaque handle; length is implicit in handle
fences: *const DmaFence,
fence_count: u32,
out_submission: *mut AccelSubmissionHandle,
) -> IoResultCode,
/// Poll for completion of a submitted command buffer.
pub poll_completion: unsafe extern "C" fn(
ctx: *mut c_void,
submission: AccelSubmissionHandle,
out_status: *mut AccelCompletionStatus,
) -> IoResultCode,
/// Request preemption or cooperative yield of a running context.
///
/// Behavior depends on `AccelDeviceInfo::preemption_granularity`:
/// - `InstructionLevel` or `DrawBoundary`: True preemption. The device saves
/// context state and stops the running workload within ~50μs-10ms
/// (depends on preemption granularity and workload; instruction-level
/// preemption on modern compute GPUs is typically 50-100μs, while
/// draw-boundary preemption can take milliseconds).
/// The context can be resumed later via a new `submit_commands`.
/// - `CommandBoundary`: The device finishes the current command buffer
/// but does not start the next queued one. Latency is bounded by
/// the longest in-flight command buffer.
/// - `None`: Cooperative yield only. The driver stops submitting new
/// command buffers after the current one completes. Cannot interrupt
/// a running dispatch.
///
/// Returns `IO_OK` if the preemption/yield was initiated (completion
/// is asynchronous — the scheduler polls for the context to become
/// idle). Returns `IO_NOT_SUPPORTED` if the driver does not implement
/// any form of preemption or yield.
pub preempt_context: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
reason: PreemptReason,
) -> IoResultCode>,
// === Memory Management ===
/// Allocate device-local memory (VRAM, local SRAM, etc.).
pub alloc_device_memory: unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
size: u64,
alignment: u64,
page_size: AccelPageSize,
flags: AccelMemFlags,
out_handle: *mut AccelMemHandle,
) -> IoResultCode,
/// Free device-local memory.
pub free_device_memory: unsafe extern "C" fn(
ctx: *mut c_void,
handle: AccelMemHandle,
) -> IoResultCode,
/// Map device memory into a CPU-visible virtual address.
pub map_device_memory: unsafe extern "C" fn(
ctx: *mut c_void,
handle: AccelMemHandle,
offset: u64,
size: u64,
out_cpu_addr: *mut u64,
) -> IoResultCode,
/// Migrate pages between CPU RAM and device memory.
/// Direction determined by flags.
///
/// **Fallback**: If `migrate_pages` is `None` (driver does not support
/// hardware page migration), the framework falls back to explicit
/// copy-based transfer: allocate a staging buffer on the destination
/// side, DMA-copy page contents via the driver's `dma_memcpy` vtable
/// entry, then update the page table mappings. This is slower (~2x
/// latency) but functionally correct for all devices.
///
/// **Partial failure**: If migration fails for a subset of pages
/// (e.g., device memory full), the function returns the number of
/// successfully migrated pages in `IoResultCode::partial_count`.
/// Unmigrated pages retain their original location. The caller
/// (kernel memory manager or userspace via `SYS_move_pages`) can
/// retry the remaining pages.
pub migrate_pages: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
pages: *const AccelMigrationEntry,
page_count: u32,
flags: MigrationFlags,
) -> IoResultCode>,
// === Utilization Reporting ===
/// Report current device utilization to the kernel scheduler.
pub get_utilization: unsafe extern "C" fn(
ctx: *mut c_void,
out_util: *mut AccelUtilization,
) -> IoResultCode,
// === Power/Thermal ===
/// Get current power and thermal state.
pub get_power_state: unsafe extern "C" fn(
ctx: *mut c_void,
out_state: *mut AccelPowerState,
) -> IoResultCode,
/// Set performance level / clock frequency.
pub set_performance_level: Option<unsafe extern "C" fn(
ctx: *mut c_void,
level: AccelPerfLevel,
) -> IoResultCode>,
// === Initialization and Reset ===
/// Called once after device enumeration to perform hardware initialization.
/// The driver must configure firmware, set up internal queues, and return
/// only after the device is ready to accept commands.
///
/// Also called by HROT ([Section 22.5](#accelerator-isolation-and-scheduling--watchdog-implementation))
/// after a full device reset to restore the device to a clean, fully operational
/// state equivalent to a fresh driver load.
///
/// Returns 0 on success, negative errno on failure. On failure the device
/// is placed in `AccelDeviceState::Error` and must be re-probed by the operator.
///
/// **KABI version note**: Added in KABI version 1.1. Callers must check
/// `vtable_size >= offset_of!(AccelBaseVTable, device_init) + size_of::<Option<fn()>>()`
/// before invoking; treat as `None` if not present.
pub device_init: Option<unsafe extern "C" fn(ctx: *mut c_void) -> IoResultCode>,
/// Reset a single execution context (not the whole device).
pub reset_context: unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
) -> IoResultCode,
/// Abort a running context: discard all in-flight commands and return
/// the device to an idle state for this context. Unlike `reset_context`
/// (which reinitializes the context for reuse), `abort_context` is a
/// last-resort teardown used when a non-preemptible submission exceeds
/// `ACCEL_MAX_SUBMISSION_NS` and the scheduler cannot wait any longer
/// (see [Section 22.5](#accelerator-isolation-and-scheduling--scheduling-model)). After abort, the
/// context is in `AccelContextStatus::Error` and must be destroyed.
///
/// Returns `IO_OK` on successful abort, `IO_NOT_SUPPORTED` if the
/// device hardware cannot abort mid-execution (scheduler falls back
/// to demoting the context to `Background` priority).
pub abort_context: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
) -> IoResultCode>,
/// Full device reset.
pub reset_device: unsafe extern "C" fn(
ctx: *mut c_void,
) -> IoResultCode,
// === Completion Notification ===
/// Register a kernel callback for command completion notification.
/// Replaces polling for latency-sensitive contexts.
pub register_completion_callback: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
callback: unsafe extern "C" fn(
context: AccelContextHandle,
submission: AccelSubmissionHandle,
status: AccelCompletionStatus,
),
) -> IoResultCode>,
// === Vendor-Private Passthrough ===
/// Handle vendor-private ioctls (range 0xC0-0xFF). The kernel validates
/// buffer bounds and capability permissions, then passes the raw buffer
/// to this handler. Returns 0 on success, negative errno on failure.
/// Drivers that do not support vendor-private ioctls set this to `None`.
pub vendor_ioctl: Option<unsafe extern "C" fn(
ctx: *mut c_void,
ioctl_nr: u8,
user_buf: *mut u8,
buf_len: u32,
) -> IoResultCode>,
// === Memory Handle Resolution (added in KABI 1.2) ===
/// Resolve a device memory handle to a CPU-accessible pointer.
pub resolve_mem_handle: Option<unsafe extern "C" fn(
ctx: *mut c_void, handle: u64,
) -> *mut c_void>,
/// Export a device memory handle as a dma-buf fd for cross-device sharing.
pub export_dma_buf: Option<unsafe extern "C" fn(
ctx: *mut c_void, handle: u64,
) -> i32>,
/// Import a dma-buf fd and return a device memory handle.
pub import_dma_buf: Option<unsafe extern "C" fn(
ctx: *mut c_void, fd: i32,
) -> u64>,
}
// AccelBaseVTable: usize(ptr_width) + u64(8) + 22 fn pointers.
// KABI vtable — size is pointer-width dependent.
#[cfg(target_pointer_width = "64")]
const _: () = assert!(core::mem::size_of::<AccelBaseVTable>() == 192);
#[cfg(target_pointer_width = "32")]
const _: () = assert!(core::mem::size_of::<AccelBaseVTable>() == 104);
Callback restrictions: Completion callbacks execute in interrupt context (hardirq
on x86, IRQ on ARM). They MUST NOT:
- Allocate memory (no Box, Vec, or slab allocation)
- Acquire sleeping locks (no Mutex, only SpinLock with try_lock)
- Perform I/O (no disk, network, or MMIO beyond the accelerator's own registers)
- Call schedule() or any function that may sleep
Callbacks that need to perform complex work (e.g., chaining dependent dispatches,
updating shared data structures) must defer to a per-accelerator workqueue thread
via accel_defer(closure). The workqueue runs in process context with full kernel
capabilities. The completion callback's job is limited to: (1) recording completion
status, (2) waking the workqueue if deferred work is pending, and (3) updating
per-context statistics (fence value, timing).
22.1.2.3 Key Data Types¶
/// Device capability and resource inventory.
// KABI struct — crosses driver-kernel boundary via AccelBaseVTable::get_info(). Layout is stable.
#[repr(C)]
pub struct AccelDeviceInfo {
/// Device class.
pub device_class: AccelDeviceClass,
/// Number of hardware compute units in this device.
/// The definition of "compute unit" is device-class-specific:
/// - GPU: Streaming Multiprocessors (SMs) for NVIDIA, Compute Units (CUs) for AMD
/// - NPU: neural processing cores
/// - FPGA: configurable logic blocks
/// - DSP: DSP cores
/// This is the count reported by the driver via `get_info()` and used by the
/// AccelScheduler for proportional resource accounting. For MIG/partitioned
/// devices, this is the compute units assigned to the partition, not the
/// full device total.
pub compute_units: u32,
/// Device-local memory size (bytes). 0 if no local memory.
pub local_memory_bytes: u64,
/// Maximum concurrent execution contexts.
pub max_contexts: u32,
/// Hardware preemption support.
pub preemption_granularity: AccelPreemptionGranularity,
/// Maximum command buffer size (bytes).
pub max_cmd_buffer_size: u32,
/// Supported memory types.
pub memory_types: AccelMemTypeFlags,
/// PCIe atomics support (needed for fine-grained SVM).
/// Uses `u8` (0=false, 1=true) instead of `bool` for stable KABI:
/// C `bool` has implementation-defined size and alignment, making it
/// unsuitable for cross-compilation-unit ABI boundaries.
pub pcie_atomics: u8,
/// Peer-to-peer DMA capability.
/// `u8` for KABI stability (see `pcie_atomics` comment).
pub p2p_capable: u8,
/// Unified virtual addressing (CPU and device share address space).
/// `u8` for KABI stability (see `pcie_atomics` comment).
pub unified_addressing: u8,
/// Hardware page fault support (device can fault on unmapped pages).
/// `u8` for KABI stability (see `pcie_atomics` comment).
pub hw_page_faults: u8,
/// NUMA node affinity for memory allocation and scheduling.
/// Discovered at device probe time from ACPI SRAT / device tree.
/// Used by the accelerator scheduler and memory allocator to prefer
/// node-local memory and CPU affinity for submissions to this device.
pub numa_node: u32,
/// PCIe slot identifier for topology-aware scheduling.
/// Encodes the BDF (Bus:Device.Function) address of the device's
/// PCIe endpoint. Used by the AccelScheduler for placement decisions:
/// co-located devices (same PCIe switch) are preferred for P2P workloads.
/// Set to `PcieSlotId::NONE` for non-PCIe devices (e.g., SoC-integrated NPUs).
pub pcie_slot: PcieSlotId,
/// Negotiated PCIe link bandwidth in GB/s: the link's effective data
/// rate after encoding overhead, expressed in gigabytes per second.
/// Reflects the actual link speed x width (e.g., Gen4 x16 = 32 GB/s).
/// Used by the scheduler to avoid cross-link bottlenecks: if a P2P
/// workload requires more bandwidth than the link can provide, the
/// scheduler prefers CPU bounce over saturating a narrow link.
/// 0 for non-PCIe devices.
///
/// Note: field name uses `_gbps` for brevity but the unit is GB/s
/// (gigabytes per second), not Gbps (gigabits per second).
pub pcie_bandwidth_gbps: u32,
/// Peer group identifier for P2P DMA affinity.
/// Devices that can perform direct P2P DMA (same PCIe switch or
/// root complex, ACS routing verified) share the same `peer_group`.
/// The AccelScheduler uses this for multi-device workload placement:
/// prefer co-grouped devices for P2P-heavy workloads (e.g., multi-GPU
/// training with gradient all-reduce). Assigned during device probe
/// by `dma_p2p_distance()` ([Section 4.14](04-memory.md#dma-subsystem)): devices with
/// distance 0 share a peer group. 0 = no peer group (isolated device).
pub peer_group: u32,
/// Explicit padding for AccelPeerLink alignment (8 bytes, due to
/// DeviceNodeId: u64). peer_group ends at offset 52; AccelPeerLink
/// needs 8-byte alignment at offset 56. Must be zeroed.
pub _pad_inner: [u8; 4],
/// Inter-device connectivity for multi-GPU placement decisions.
/// The AccelScheduler uses this for P2P-aware workload placement:
/// prefer devices with high-bandwidth direct links (NVLink, xGMI)
/// over PCIe-only paths for multi-device training workloads.
/// Populated at device probe time by querying peer link topology.
/// Empty if the device has no direct P2P links.
///
/// Capacity 16: DGX A100/H100 has 8 GPUs with 7 NVLink peers each
/// (fully connected). AMD MI300X in OAM form factor has 8 GPUs with
/// 7 xGMI links. NVSwitch-based systems (DGX SuperPOD) still have
/// at most 8 GPUs per node with direct NVLink, and the NVSwitch
/// fabric is represented as a single peer link per switch chip
/// (up to 6 NVSwitch chips = 6 additional peers). 16 covers all
/// current and announced topologies with headroom for future
/// interconnects (UALink, CXL 3.0 fabric). Systems exceeding 16
/// direct peers are not expected before Phase 4; if they arise,
/// `AccelPeerLink` can be moved to a slab-allocated `Box<[AccelPeerLink]>`
/// without ABI change (this field is kernel-internal, not exposed
/// to userspace).
///
/// Fixed-size array (not ArrayVec) because this struct is `#[repr(C)]`
/// and crosses the KABI boundary via `get_info()`. ArrayVec is not
/// repr(C)-compatible. Only `p2p_peer_count` entries are valid.
pub p2p_peers: [AccelPeerLink; 16],
/// Number of valid entries in `p2p_peers` (0..=16).
pub p2p_peer_count: u8,
/// Trailing padding to 8-byte alignment boundary. Must be zeroed.
/// p2p_peer_count at offset 440 + 1 = 441; next 8-byte boundary = 448.
pub _pad: [u8; 7],
}
const_assert!(core::mem::size_of::<AccelDeviceInfo>() == 448);
/// Inter-device peer link descriptor. Describes a direct P2P connection
/// between two accelerator devices, used by the AccelScheduler for
/// topology-aware multi-device placement.
/// Layout (24 bytes): peer_device_id(8) + bandwidth_gbps(4) + latency_ns(4) +
/// link_type(4) + _pad(4) = 24. Alignment = 8 (from DeviceNodeId: u64).
#[repr(C)]
pub struct AccelPeerLink {
/// DeviceNodeId of the peer device.
pub peer_device_id: DeviceNodeId,
/// Available bandwidth on this link in GB/s.
pub bandwidth_gbps: u32,
/// One-way latency in nanoseconds.
pub latency_ns: u32,
/// Type of interconnect.
pub link_type: AccelLinkType,
/// Explicit trailing padding (struct alignment = 8 due to u64 field;
/// 20 bytes of fields round up to 24). Must be zeroed. This struct is
/// embedded in AccelDeviceInfo.p2p_peers[16], so 16 x 4 = 64 bytes
/// of padding across the array must all be zero-initialized.
pub _pad: [u8; 4],
}
const_assert!(core::mem::size_of::<AccelPeerLink>() == 24);
/// Interconnect type between two accelerator devices.
///
/// **Extensibility**: New interconnect types (e.g., UALink, CXL 3.0 fabric)
/// are added as new enum variants with the next available discriminant.
/// Vendor-specific interconnects use `Vendor(u32)` with a vendor-assigned
/// sub-type in the lower 16 bits. The `Unknown` variant is the catch-all
/// for unrecognized link types from future driver versions.
#[repr(u32)]
pub enum AccelLinkType {
/// NVIDIA NVLink (proprietary GPU-to-GPU interconnect).
NvLink = 0,
/// AMD xGMI / Infinity Fabric (GPU-to-GPU interconnect).
Xgmi = 1,
/// PCIe peer-to-peer (standard bus, lower bandwidth than dedicated links).
Pcie = 2,
/// CXL (Compute Express Link) for memory-semantic interconnect.
Cxl = 3,
/// Unknown or vendor-specific interconnect. Kernel treats as opaque
/// — topology is still usable for scheduling, but no protocol-specific
/// optimizations are applied.
Unknown = 0xFFFF_FFFF,
}
#[repr(u32)]
pub enum AccelDeviceClass {
/// General-purpose GPU (compute + graphics + display).
Gpu = 0,
/// Compute-only GPU (no display, e.g., datacenter SKUs).
GpuCompute = 1,
/// Neural Processing Unit (fixed-function inference).
Npu = 2,
/// Tensor Processing Unit / AI ASIC.
Tpu = 3,
/// FPGA with compute overlay.
Fpga = 4,
/// Digital Signal Processor.
Dsp = 5,
/// Media processor (video encode/decode).
MediaProcessor = 6,
/// Computational storage device with embedded compute capability.
ComputeStorage = 7,
// Values 8-254 are reserved for future device classes (e.g., photonic
// accelerators, neuromorphic processors, quantum co-processors). New
// variants must be added here with sequential values and registered in
// the AccelDeviceClass registry at boot time.
/// Other / vendor-specific. Kernel treats as opaque — no protocol-specific
/// optimizations are applied, but the device is still schedulable via the
/// unified accelerator framework.
Other = 255,
}
/// See `AccelPreemptionGranularity` ([Section 22.3](#accelerator-vtables-and-integration)) for full enum definition with design rationale.
pub type PreemptionGranularity = AccelPreemptionGranularity;
/// PCIe slot identifier (encodes BDF address).
/// `PcieSlotId::NONE` (u32::MAX) represents non-PCIe devices (SoC-integrated NPUs).
#[repr(transparent)]
pub struct PcieSlotId(pub u32);
impl PcieSlotId {
pub const NONE: Self = Self(u32::MAX);
}
/// Reason for requesting preemption of a running accelerator context.
/// Passed to `preempt_context` so the driver can log/report the cause
/// and, on devices that support it, choose an appropriate preemption
/// strategy (e.g., urgent drain vs. graceful yield).
#[repr(u32)]
pub enum PreemptReason {
/// A higher-priority submission is waiting behind this context.
/// The scheduler detected priority inversion and is preempting the
/// lower-priority context to unblock the higher-priority one.
PriorityInversion = 0,
/// The context has exceeded its CBS bandwidth budget for the current
/// scheduling period. Fairness enforcement requires yielding the
/// device so other contexts can use their guaranteed bandwidth.
FairnessTimeout = 1,
/// The context's current submission has exceeded its
/// `AccelContextLimits::max_execution_us` timeout. The scheduler is
/// preempting to enforce the per-submission execution time limit.
ExecutionTimeout = 2,
/// Administrative eviction: the context is being forcibly removed
/// from the device. Causes include driver unload, device reset,
/// cgroup removal, or process exit cleanup.
AdminEvict = 3,
/// A soft watchdog timer fired because the submission has been
/// executing longer than `AccelDeviceHrotCaps::soft_timeout_ms`.
/// This is a warning-level preemption request: on preemptible
/// hardware the driver should attempt a graceful yield; if the
/// submission does not complete within the hard timeout window
/// the kernel escalates to a full device reset via `accel_hard_reset`.
/// Distinct from `ExecutionTimeout` (which enforces the per-submission
/// `AccelContextLimits::max_execution_us` CBS budget); this variant
/// is the HROT watchdog path.
WatchdogSoftTimeout = 4,
}
/// Per-context resource limits (enforced by kernel).
/// KABI struct — passed across Tier 1 driver boundary via create_context().
#[repr(C)]
pub struct AccelContextLimits {
/// Maximum device memory this context can allocate (bytes).
/// 0 = no limit (subject to device capacity).
pub max_memory_bytes: u64,
/// Maximum compute time per submission (microseconds).
/// 0 = no limit. Submissions exceeding this are preempted.
pub max_execution_us: u64,
/// Bandwidth guarantee: guaranteed microseconds of compute
/// per scheduling period. See Section 22.3 (cgroup integration).
/// 0 = best-effort (no guarantee).
pub guaranteed_bandwidth_us: u64,
/// Scheduling period for bandwidth accounting (microseconds).
/// Default: 1_000_000 (1 second).
pub bandwidth_period_us: u64,
pub _pad: [u8; 32],
}
// AccelContextLimits: u64(8)*4 + [u8;32] = 64 bytes. KABI struct.
const_assert!(core::mem::size_of::<AccelContextLimits>() == 64);
/// Scheduling priority for accelerator contexts.
#[repr(u32)]
pub enum AccelPriority {
/// Background / batch. Lowest priority. Preempted by all others.
Background = 0,
/// Normal interactive workload.
Normal = 1,
/// High priority (e.g., real-time inference with SLO).
High = 2,
/// Realtime. Highest priority. Preempts all others immediately.
Realtime = 3,
}
/// Entry describing a page to migrate between CPU and device memory.
/// Used by migrate_pages() in AccelBaseVTable.
/// Layout (16 bytes): vaddr(8) + current_location(1) + target_location(1) +
/// _pad(2) + result(4) = 16.
#[repr(C)]
pub struct AccelMigrationEntry {
/// Virtual address of the page to migrate (must be page-aligned).
pub vaddr: u64,
/// Current location: 0 = CPU RAM, 1 = device memory.
pub current_location: u8,
/// Desired location after migration.
pub target_location: u8,
/// Explicit padding for i32 alignment. Must be zeroed at construction.
pub _pad: [u8; 2],
/// Result of migration (filled by driver): 0 = success, negative errno on failure.
pub result: i32,
}
const_assert!(core::mem::size_of::<AccelMigrationEntry>() == 16);
bitflags! {
/// Flags controlling page migration behavior.
// KABI struct — passed via AccelBaseVTable::migrate_pages(). Layout is stable.
#[repr(C)]
pub struct MigrationFlags: u32 {
/// Migrate from CPU RAM to device memory.
const TO_DEVICE = 1 << 0;
/// Migrate from device memory to CPU RAM.
const TO_CPU = 1 << 1;
/// Force migration even if page is hot/pinned.
const FORCE = 1 << 2;
/// Asynchronous migration (return immediately, complete via callback).
const ASYNC = 1 << 3;
/// Prefetch hint (migrate pages likely to be accessed soon).
const PREFETCH = 1 << 4;
}
}
/// Device utilization report.
#[repr(C)]
pub struct AccelUtilization {
/// Compute utilization 0-100%.
pub compute_percent: u32,
/// Memory bandwidth utilization 0-100%.
pub memory_bw_percent: u32,
/// Memory usage (bytes allocated).
pub memory_used_bytes: u64,
/// Memory total (bytes).
pub memory_total_bytes: u64,
/// Number of active contexts.
pub active_contexts: u32,
/// Current temperature (millidegrees Celsius).
pub temperature_mc: u32,
/// Current power draw (milliwatts).
pub power_mw: u32,
/// Current clock frequency (MHz).
pub clock_mhz: u32,
}
// AccelUtilization: u32(4)*2 + u64(8)*2 + u32(4)*4 = 40 bytes. KABI struct.
const_assert!(core::mem::size_of::<AccelUtilization>() == 40);
/// Bitflags indicating supported memory types for an accelerator device.
/// Used in `AccelDeviceInfo::memory_types` to describe what kinds of memory
/// the device can allocate or access.
#[repr(transparent)]
pub struct AccelMemTypeFlags(pub u32);
impl AccelMemTypeFlags {
/// Device-local VRAM (GPU-attached HBM, GDDR, etc.).
pub const DEVICE_LOCAL: Self = Self(1 << 0);
/// Host-visible memory (CPU-accessible, uncached on device).
pub const HOST_VISIBLE: Self = Self(1 << 1);
/// Host-coherent memory (no explicit flush/invalidate needed).
pub const HOST_COHERENT: Self = Self(1 << 2);
/// System memory accessible via SVM/unified addressing.
pub const SYSTEM_SVM: Self = Self(1 << 3);
/// Peer-to-peer accessible memory (other devices can DMA directly).
pub const PEER_ACCESSIBLE: Self = Self(1 << 4);
}
/// Current power and thermal state of an accelerator device.
/// Returned by `AccelBaseVTable::get_power_state()`.
#[repr(C)]
pub struct AccelPowerState {
/// Current power consumption in milliwatts.
pub power_mw: u32,
/// Current temperature in millidegrees Celsius (e.g., 75000 = 75.0°C).
pub temperature_mc: u32,
/// Current ACPI-style device power state.
pub device_state: AccelDevicePowerLevel,
/// Thermal throttling active (0 = no, 1 = yes).
/// `u8` for KABI stability (C `bool` has implementation-defined size).
pub throttled: u8,
pub _pad: [u8; 3],
}
// AccelPowerState: u32(4)*3 + u8(1) + [u8;3] = 16 bytes. KABI struct.
const_assert!(core::mem::size_of::<AccelPowerState>() == 16);
/// ACPI-style device power levels for accelerators.
#[repr(u32)]
pub enum AccelDevicePowerLevel {
/// D0: Fully operational.
D0Active = 0,
/// D1: Low-power idle (fast resume, ~microseconds).
D1LowPower = 1,
/// D2: Deeper sleep (slower resume, ~milliseconds).
D2Standby = 2,
/// D3: Off (full re-initialization required on resume).
D3Off = 3,
}
/// Performance level hint passed to `AccelBaseVTable::set_performance_level()`.
/// The driver maps this to device-specific clock/voltage settings.
#[repr(u32)]
pub enum AccelPerfLevel {
/// Minimum clocks — lowest power, suitable for idle or light desktop compositing.
Low = 0,
/// Balanced — driver chooses mid-range clocks.
Medium = 1,
/// Maximum clocks — full performance, highest power/thermal.
High = 2,
/// Boost — temporary overclock above nominal max (thermal-limited duration).
Boost = 3,
/// Adaptive — driver uses internal DVFS; kernel does not pin a level.
Adaptive = 4,
}
// Opaque handle types (all u64 newtypes, #[repr(transparent)] for
// zero-cost FFI — the newtype has the same ABI as the inner u64).
//
// **Handle allocation protocol**:
// - Handles are allocated by the **kernel** (not by the driver). The
// kernel maintains a per-device `Idr<HandleEntry>` for each handle
// type (context, mem, submission, p2p). The Idr assigns a u64 ID;
// the handle newtype wraps this ID.
// - Handle scope: per-device. Handle values are unique within a single
// accelerator device but may collide across devices. The kernel
// prefixes the device_id internally when needed for cross-device
// lookups (e.g., P2P mappings).
// - Validation: On every `destroy_context`, `free_memory`, etc. call,
// the kernel looks up the handle in the per-device Idr. If not found,
// the call returns `IoResultCode::EBADF`. This prevents use-after-free
// and double-free by the driver or userspace.
// - Recycling: Handle IDs are recycled after deallocation (Idr reuse).
// To prevent ABA problems, the upper 32 bits of the handle encode a
// generation counter that is incremented on each allocation at the
// same Idr slot. Stale handles with a wrong generation fail validation.
#[repr(transparent)]
pub struct AccelContextHandle(pub u64);
#[repr(transparent)]
pub struct AccelMemHandle(pub u64);
#[repr(transparent)]
pub struct AccelSubmissionHandle(pub u64);
#[repr(transparent)]
pub struct P2pMappingHandle(pub u64);
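The generation-counter scheme described above can be sketched as follows (hypothetical helper names; the spec fixes only the bit split: generation in the upper 32 bits, Idr slot ID in the lower 32):

```rust
// Hypothetical helpers for the handle encoding described above:
// upper 32 bits = per-slot generation counter, lower 32 bits = Idr slot ID.
fn encode_handle(slot: u32, generation: u32) -> u64 {
    ((generation as u64) << 32) | (slot as u64)
}

fn handle_slot(handle: u64) -> u32 {
    handle as u32 // truncating cast keeps the low 32 bits
}

fn handle_generation(handle: u64) -> u32 {
    (handle >> 32) as u32
}

/// A stale handle fails validation when its slot was recycled and the
/// slot's generation counter advanced: this is the ABA protection.
fn validate_handle(handle: u64, generation_at_slot: u32) -> bool {
    handle_generation(handle) == generation_at_slot
}
```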
/// Fence for synchronizing command submissions.
// The (device_id, context_id, value) tuple uniquely identifies a fence point.
// Cross-device fence signaling requires both devices to be registered in the
// same DmaFenceRegistry (Section 22.1.2.4).
#[repr(C)]
pub struct DmaFence {
pub fence_type: DmaFenceType,
pub value: u64,
pub device_id: DeviceNodeId, // Owning device (prevents cross-device forgery)
pub context_id: u32, // Owning context within the device.
// Lower 32 bits of AccelContextHandle (u64).
// The upper 32 bits (generation) are NOT stored
// in the fence — ABA safety relies on: (1) the
// generation in AccelContextHandle disambiguates
// context reuse at submission time, and (2) fence
// wait/signal uses the exact (device_id, context_id,
// value) tuple, not context_id ordering.
//
// **INVARIANT**: `destroy_context` MUST signal ALL
// outstanding fences for the destroyed context with
// `AccelCompletionStatus::DeviceLost` BEFORE the
// Idr slot is recycled. The kernel enforces this:
// the context's Idr slot is freed only AFTER the
// fence registry confirms all pending fences for
// this (device_id, context_id) are signaled.
// Without this invariant, a cross-device waiter
// could match a stale fence to a recycled context.
//
// 0 = device-global scope: used by non-accelerator
// devices (camera, media pipeline) that have no
// per-context GPU-style state. For these devices,
// all fences share context_id=0 and are ordered
// solely by (device_id, value).
}
// DmaFence: DmaFenceType(4) + 4pad + u64(8) + u64(8) + u32(4) + 4pad = 32 bytes.
// KABI struct — passed across driver boundary in submit_commands().
const_assert!(core::mem::size_of::<DmaFence>() == 32);
#[repr(u32)]
pub enum DmaFenceType {
/// Device-local fence (GPU timeline semaphore).
DeviceLocal = 0,
/// Cross-device fence (for P2P synchronization).
CrossDevice = 1,
/// CPU-signalable fence (host-side event).
CpuSignal = 2,
}
#[repr(u32)]
pub enum AccelCompletionStatus {
/// Command completed successfully.
Success = 0,
/// Command failed with device error.
Error = 1,
/// Command timed out (exceeded max_execution_us).
Timeout = 2,
/// Command was preempted by higher-priority context.
Preempted = 3,
/// Partial error: some work completed, some failed.
PartialError = 4,
/// Owning device crashed. Fence will never signal naturally.
/// Cross-device fence waiters receive this status during crash
/// recovery (step 4d in [Section 22.3](#accelerator-vtables-and-integration--crash-recovery)).
/// Waiters should retry after device recovery or propagate error.
DeviceLost = 5,
}
/// Global registry for cross-device fence synchronization.
/// Devices must be registered in the same DmaFenceRegistry to share
/// cross-device fences. The registry is kernel-internal; userspace
/// interacts through /dev/umka-accel-N ioctls.
///
/// There is one global DmaFenceRegistry per accelerator topology domain
/// (typically one per NUMA node or PCIe domain). Devices in different
/// domains cannot share cross-device fences.
///
/// **Device lookup:** XArray provides O(1) lookup by integer key
/// (DeviceNodeId is a u64 newtype) with native RCU-compatible reads,
/// eliminating the need for a separate flat-array optimization.
pub struct DmaFenceRegistry {
/// Registered devices and their per-device fence state.
///
/// XArray keyed by DeviceNodeId (u64). O(1) lookup for any device ID,
/// regardless of magnitude or density. RCU-protected: cross-device fence
/// lookups read the map lock-free under `rcu_read_lock()`; registration
/// mutations use XArray's internal locking.
devices: XArray<Arc<DeviceFenceState>>,
/// Serializes device registration / unregistration (cold path only).
devices_update_lock: Mutex<()>,
}
/// Maximum concurrent fence waiters per device (capacity of
/// `DeviceFenceState::waiters`). Exceeding this limit returns EAGAIN to the caller.
/// `usize` because it is used as a const-generic capacity parameter.
pub const MAX_FENCE_WAITERS: usize = 64;
/// Per-device fence state. Sharded by device: cross-device fence polling
/// only contends with operations targeting the *same* device, not all
/// devices globally. A multi-GPU training job with 8 GPUs has 8
/// independent fence tables instead of one global bottleneck.
pub struct DeviceFenceState {
/// Fence protocol capabilities for this device.
pub protocol: FenceProtocolSupport,
/// Active fences exported by this device, two-level XArray:
/// outer keyed by context_id (u32), inner keyed by fence value (u64).
/// XArray provides O(1) integer-keyed lookup per collection policy.
/// Per-device RwLock: polling a fence on GPU-A takes GPU-A's read lock
/// only — GPU-B's fence operations are uncontended.
pub fences: spin::RwLock<XArray<XArray<Arc<AtomicBool>>>>,
/// Waiters registered for this device's fences. Callback-based:
/// when a fence signals, the owning device iterates its waiter list.
pub waiters: spin::RwLock<ArrayVec<FenceWaiterEntry, MAX_FENCE_WAITERS>>,
}
// **Cross-device fence polling path** (hot path for multi-GPU):
// 1. rcu_read_lock() — lock-free
// 2. XArray lookup by device_id — O(1) for any device ID
// 3. Arc::clone the DeviceFenceState — one atomic increment
// 4. rcu_read_unlock()
// 5. Take per-device `fences.read()` — only contends with same-device ops
// 6. Look up context_id in outer XArray — O(1), then fence value in inner XArray — O(1)
// 7. Read AtomicBool — single atomic load
//
// XArray provides O(1) lookup for all integer keys (sparse or dense),
// with native RCU read compatibility. No separate flat-array optimization needed.
// Total contention: per-device only. No global serialization.
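Steps 5–7 can be modeled in miniature (an illustration only: `HashMap` stands in for `XArray`, and the RCU section and per-device read lock are elided):

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

/// Simplified stand-in for `DeviceFenceState::fences`: outer map keyed by
/// context_id, inner map keyed by fence value.
struct FenceTable {
    fences: HashMap<u32, HashMap<u64, Arc<AtomicBool>>>,
}

impl FenceTable {
    /// Steps 6-7 of the polling path: two integer-keyed lookups, then a
    /// single atomic load. Returns None if the fence does not exist.
    fn poll(&self, context_id: u32, value: u64) -> Option<bool> {
        self.fences
            .get(&context_id)?
            .get(&value)
            .map(|signaled| signaled.load(Ordering::Acquire))
    }
}
```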
/// Fence protocol capabilities for a registered device.
#[repr(C)]
pub struct FenceProtocolSupport {
/// Device supports timeline semaphores (monotonically increasing value).
/// u8 (0=false, 1=true) for KABI stability (bool size is not guaranteed
/// across compiler versions; see Section 22.1.2.3 KABI rules).
pub timeline_semaphores: u8,
/// Device supports cross-device signaling via hardware sync objects.
pub hw_cross_device: u8,
/// Device supports CPU-signaling of device fences (for host wait).
pub cpu_signal: u8,
/// Explicit padding for u64 alignment.
pub _pad: [u8; 5],
/// Maximum fence value before wrap (0 = no limit).
/// **Wrap behavior**: When `fence_value == max_fence_value`, the next
/// fence creation wraps the value to 1 (not 0, which is reserved for
/// "unsignaled"). Existing waiters on pre-wrap fence values remain
/// valid — they are matched by the exact (device_id, context_id, value)
/// tuple, not by ordering. The context's generation counter (in
/// `AccelContextHandle` upper bits) disambiguates pre-wrap and
/// post-wrap fences with the same numeric value.
pub max_fence_value: u64,
}
const_assert!(core::mem::size_of::<FenceProtocolSupport>() == 16);
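The wrap rule reduces to a small helper (hypothetical name; the behavior, wrapping to 1 because 0 is reserved for "unsignaled", is what the `max_fence_value` doc above specifies):

```rust
/// Next fence value under the wrap rule described above.
/// `max_fence_value == 0` means the device imposes no limit.
fn next_fence_value(current: u64, max_fence_value: u64) -> u64 {
    if max_fence_value != 0 && current >= max_fence_value {
        1 // wrap past 0, which is reserved for "unsignaled"
    } else {
        current + 1
    }
}
```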
/// Entry in the fence waiter table.
struct FenceWaiterEntry {
/// Fence being waited on.
fence: DmaFence,
/// Device waiting for the fence.
waiter_device: DeviceNodeId,
/// Callback to invoke when fence signals (driver-provided).
///
/// **Lock ordering constraint**: Same-domain fence callbacks are invoked
/// under the `DeviceFenceState.waiters` read lock. Callbacks must NOT
/// acquire any lock that nests outside `waiters` — doing so risks
/// deadlock. Cross-domain callbacks are invoked via ring buffer + eventfd
/// (no lock held) and have no such restriction.
callback: unsafe extern "C" fn(device_id: DeviceNodeId, fence: DmaFence),
}
Cross-device fence registration authentication (E3 fix):
To prevent unauthorized cross-device fence waiting (which could leak information about other devices' workloads), the kernel requires the `ACCEL_P2P` capability (0x0102) for both devices involved in a cross-device fence wait:
/// Register a cross-device fence waiter.
///
/// This allows `waiter_device` to wait on a fence owned by `fence.device_id`.
/// Authentication is required to prevent information leakage and fence forgery.
///
/// **Security requirement**: The caller must hold `ACCEL_P2P` capability (0x0102)
/// for BOTH the waiter_device AND the fence.device_id. This ensures that only
/// processes authorized for peer-to-peer access can establish cross-device
/// synchronization.
///
/// Returns:
/// - `IO_OK` on success
/// - `EACCES` if caller lacks ACCEL_P2P for either device
/// - `ENOENT` if the fence does not exist
/// - `EAGAIN` if the waiter table is full
fn register_cross_device_fence_wait(
registry: &DmaFenceRegistry,
fence: &DmaFence,
waiter_device: DeviceNodeId,
callback: unsafe extern "C" fn(DeviceNodeId, DmaFence),
caller_caps: &CapabilitySet, // Caller's current capabilities
) -> IoResultCode {
// Authentication check (E3 fix):
// Require ACCEL_P2P for BOTH devices to prevent unauthorized cross-device sync.
if !caller_caps.has_cap_for_device(Capability::ACCEL_P2P, fence.device_id) {
return IoResultCode::EACCES; // Not authorized for fence.device_id
}
if !caller_caps.has_cap_for_device(Capability::ACCEL_P2P, waiter_device) {
return IoResultCode::EACCES; // Not authorized for waiter_device
}
// Look up the target device's fence state
let Some(target_device_state) = registry.get_device(fence.device_id) else {
return IoResultCode::ENOENT; // Device not registered
};
// Look up the specific fence in the two-level table:
// outer XArray keyed by context_id, inner keyed by fence value.
{
let fences = target_device_state.fences.read();
let found = fences
.get(fence.context_id as u64)
.and_then(|per_context| per_context.get(fence.value));
if found.is_none() {
return IoResultCode::ENOENT; // Fence does not exist
}
}
// Add waiter (with capacity check)
let mut waiters = target_device_state.waiters.write();
if waiters.len() >= MAX_FENCE_WAITERS as usize {
return IoResultCode::EAGAIN; // Waiter table full
}
waiters.push(FenceWaiterEntry {
fence: *fence,
waiter_device,
callback,
});
IO_OK
}
Rationale for dual-device capability check:
- Without the check: A compromised Tier 1 driver for device A could register waiters on device B's fences (owned by a different process/cgroup), learning when device B completes work (timing side-channel) or potentially signaling fences it doesn't own.
- With the check: Only processes that already have `ACCEL_P2P` for both devices can establish cross-device synchronization. This is consistent with the P2P DMA authorization model (Section 22.4).
- Capability binding: The `ACCEL_P2P` capability is bound to specific device pairs at grant time, preventing replay attacks (same as P2P DMA ACL anti-replay).
Cross-domain fence signaling mechanism: When a fence on device A signals and
device B has a registered waiter, the signaling path is: (1) device A's interrupt
handler (or completion poll) sets the AtomicBool in DeviceFenceState.fences;
(2) the handler iterates DeviceFenceState.waiters under a read lock; (3) for each
matching waiter, the callback is invoked with the DeviceNodeId and DmaFence.
If device B runs in a different Tier 1 isolation domain, the callback posts a
notification to device B's ring buffer (shared memory + eventfd doorbell), which
wakes device B's driver. This avoids requiring a direct function call across
isolation domain boundaries — the signaling device writes to shared memory and
rings the doorbell; the waiting device's domain polls or is interrupted.
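The fan-out decision can be sketched as follows (an illustrative model: a `Vec` stands in for the shared-memory ring, a counter stands in for direct callback invocation, and the eventfd doorbell is elided):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Simplified waiter record (domain id instead of DeviceNodeId; the real
/// entry stores a full DmaFence and a callback pointer).
struct Waiter {
    context_id: u32,
    value: u64,
    domain: u32,
}

/// Fan-out on fence signal, per the three-step path above: set the flag,
/// then notify each matching waiter, directly if same-domain, otherwise
/// by posting a record to the waiter domain's ring (modeled as a Vec).
/// Returns the number of direct (same-domain) callback invocations.
fn signal_fence(
    signaled: &AtomicBool,
    waiters: &[Waiter],
    context_id: u32,
    value: u64,
    signaler_domain: u32,
    ring: &mut Vec<(u32, u64)>, // cross-domain notification records
) -> usize {
    signaled.store(true, Ordering::Release); // step 1: mark fence signaled
    let mut direct = 0;
    for w in waiters.iter().filter(|w| w.context_id == context_id && w.value == value) {
        if w.domain == signaler_domain {
            direct += 1; // step 3: direct callback (modeled as a count)
        } else {
            ring.push((w.context_id, w.value)); // ring write + doorbell
        }
    }
    direct
}
```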
22.2 Kernel-Side Accelerator Scheduler¶
The kernel maintains an accelerator scheduler that sits between userspace submissions
and the driver's submit_commands. This is the core innovation — the kernel sees
and controls accelerator time.
// umka-core/src/accel/scheduler.rs (kernel-internal)
/// Per-accelerator scheduler.
pub struct AccelScheduler {
/// Device this scheduler manages.
device_id: DeviceNodeId,
/// Device capabilities (cached from get_info).
device_info: AccelDeviceInfo,
/// DMA device handle for this accelerator. Used by the P2P DMA path
/// ([Section 22.4](#accelerator-memory-and-p2p-dma)) for DMA mapping of command
/// buffer pages, not directly in the scheduling algorithm. All DMA
/// operations (command buffer mapping, VRAM ↔ CPU transfers, P2P DMA)
/// use this handle. Obtained during device probe from the bus driver's
/// DMA configuration ([Section 4.14](04-memory.md#dma-subsystem)).
/// The DMA device outlives the scheduler (registered at probe, removed at
/// driver unload). `&'static dyn` avoids refcount overhead on every DMA
/// operation vs `Arc<dyn>`.
dma_device: &'static dyn DmaDevice,
/// Active contexts, ordered by priority and deadline.
/// Uses a fixed-capacity sorted array (no heap allocation).
/// Capacity is set to `device_info.max_contexts` at initialization.
contexts: FixedSortedArray<AccelContextHandle, AccelContextState>,
/// Per-context CBS bandwidth servers (same algorithm as CPU scheduler,
/// see Section 7.6; the `AccelCbsServer` struct is defined below in this section).
/// Pre-allocated at initialization with capacity equal to `device_info.max_contexts`.
bandwidth_servers: FixedVec<AccelCbsServer>,
/// CBS bandwidth admission control. Tracks total allocated bandwidth
/// across all active CBS servers. New CBS server creation is rejected
/// if `total_allocated_bw + requested_bw > max_allocable_bw`.
/// `max_allocable_bw` defaults to 95% of the device's compute capacity
/// (matching the CPU CBS admission cap in [Section 7.6](07-scheduling.md#cpu-bandwidth-guarantees)).
/// The remaining 5% ensures that best-effort contexts always make
/// progress (avoids complete starvation by guaranteed contexts).
total_allocated_bw_ns: AtomicU64,
max_allocable_bw_ns: u64,
/// Hardware command queue depth (how many submissions are in-flight).
hw_queue_depth: AtomicU32,
/// Maximum concurrent in-flight submissions.
max_inflight: u32,
/// Scheduling policy.
policy: AccelSchedPolicy,
}
// Kernel-internal, not KABI: contains Option, FixedRingBuffer, AtomicU64 (no stable C layout).
pub struct AccelContextState {
/// Owner process/cgroup.
// LONGEVITY: u32 is correct — Linux `pid_t` is i32 by POSIX/kernel ABI.
// PID space is recycled (default max 4M). External protocol constraint.
owner_pid: u32,
cgroup_id: u64,
/// Priority class.
priority: AccelPriority,
/// Resource limits.
limits: AccelContextLimits,
/// Index into `AccelScheduler::bandwidth_servers` (if guaranteed
/// bandwidth is set). The CBS server state is owned by the scheduler's
/// `bandwidth_servers` array — storing it here as well would create
/// a consistency hazard (two copies of the same server state). The
/// scheduler uses this index to look up the context's CBS server
/// during scheduling decisions.
cbs_server_index: Option<u32>,
/// Pending command buffers waiting to be submitted to hardware.
/// Fixed-capacity ring buffer, pre-allocated at context creation.
pending_queue: FixedRingBuffer<PendingSubmission>,
/// In-flight submissions (on hardware).
/// Fixed-capacity array, sized to `max_inflight` at initialization.
inflight: FixedVec<InflightSubmission>,
/// Accounting: total compute time consumed (nanoseconds).
total_compute_ns: AtomicU64,
/// Accounting: total memory allocated (bytes).
total_memory_bytes: AtomicU64,
/// Last estimated_ns charged by charge_submission_cost(). Saved so that
/// on_submission_complete() can compute the actual-vs-estimated delta
/// for CBS budget correction.
last_estimated_ns: u64,
/// Number of consecutive rounds where this context had pending work but
/// was not selected by `pick_next_submission` (starved by higher-priority
/// contexts). Reset to 0 when the context is selected. Used by the
/// starvation prevention logic to boost `effective_priority`.
starvation_rounds: u32,
/// Number of submissions that exceeded `max_submission_ns` (hardware
/// timeout threshold). Used by `on_submission_complete` to demote
/// misbehaving contexts to `AccelPriority::Background` after 3 timeouts.
timeout_count: u32,
/// Effective scheduling priority, which may differ from the configured
/// `priority` field due to starvation boosting or timeout demotion.
/// Reset to `priority` when the context is re-created or explicitly reset.
effective_priority: i32,
}
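The interplay of the last three fields might look like this (a sketch: the 3-timeout demotion threshold comes from the field docs above, while the +1-per-starved-round boost, its +4 cap, and the numeric value of `Background` are assumptions for illustration):

```rust
// Assumed numeric value for AccelPriority::Background (illustration only).
const BACKGROUND_PRIORITY: i32 = -2;

/// Recompute effective_priority from the configured priority and the two
/// misbehavior/starvation counters tracked per context.
fn effective_priority(configured: i32, starvation_rounds: u32, timeout_count: u32) -> i32 {
    if timeout_count >= 3 {
        // Demotion after 3 hardware timeouts (see timeout_count doc above).
        return BACKGROUND_PRIORITY;
    }
    // Starvation boost: assumed +1 per starved round, capped at +4.
    configured + (starvation_rounds.min(4) as i32)
}
```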
/// Pending submission waiting to be dispatched to hardware.
/// Stored in the context's `pending_queue` until the scheduler submits it.
// kernel-internal, not KABI — scheduler internal state.
#[repr(C)]
pub struct PendingSubmission {
/// Handle to the command buffer (driver-allocated).
pub cmd_buffer: AccelCmdBufferHandle,
/// Submission ID assigned by the driver for tracking.
pub driver_submit_id: u64,
/// Timestamp when submission was queued (for timeout detection).
pub queued_timestamp_ns: u64,
/// Number of fences this submission depends on (0 = immediate submit).
pub dependency_count: u32,
/// Userspace-provided estimate of GPU execution time (nanoseconds).
/// Set via `AccelSubmitParams::estimated_ns` at submit time.
/// Used by `charge_submission_cost` for CBS budget accounting.
/// Zero means "unknown" — the scheduler substitutes `period_ns / 4`.
/// Capped at `period_ns / 4` regardless of userspace value.
pub estimated_ns: u64,
/// Whether `completion_semaphore` is valid (1 = present, 0 = none).
/// Separate flag used instead of `Option<T>` because `Option` has no
/// stable layout guarantee in repr(C) structs.
pub has_completion_semaphore: u8,
/// Padding for alignment.
pub _pad: [u8; 7],
/// Semaphore to signal on completion. Valid only when
/// `has_completion_semaphore == 1`.
pub completion_semaphore: AccelSemaphoreHandle,
}
/// Opaque handle identifying a fence for synchronization at the KABI boundary.
/// Drivers receive this handle when submitting work and poll or wait on it
/// to determine when the hardware has completed the submission.
/// Corresponds to a `(device_id, context_id, value)` `DmaFence` tuple on the kernel side.
pub type DmaFenceHandle = u64;
/// An in-flight submission tracked by the scheduler until completion.
/// Stored in the context's `inflight` array. Not to be confused with
/// `AccelSubmissionHandle` (the opaque u64 KABI handle returned to drivers).
// kernel-internal, not KABI — scheduler internal tracking.
#[repr(C)]
pub struct InflightSubmission {
/// Driver-assigned submission ID (matches PendingSubmission::driver_submit_id).
pub driver_submit_id: u64,
/// Timestamp when submitted to hardware (for timeout detection).
pub submitted_timestamp_ns: u64,
/// Fence that will be signaled on completion.
pub completion_fence: DmaFenceHandle,
}
#[repr(u32)]
pub enum AccelSchedPolicy {
/// Simple round-robin between contexts. Default for NPU/FPGA.
/// Each context gets one submission slot per scheduling round,
/// regardless of priority or weight. Used when all contexts have
/// equal priority and no CBS guarantees are configured. Round-robin
/// is the default when no explicit policy is set.
RoundRobin = 0,
/// Priority-based with preemption. Default for GPU.
Priority = 1,
/// CBS bandwidth guarantee + priority. Default when cgroup limits set.
Guaranteed = 2,
}
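A hypothetical `pick_next` dispatch over these three policies (simplified: plain slices instead of the scheduler's fixed-capacity containers, and the CBS budget/deadline state reduced to scalars):

```rust
/// Simplified candidate record for illustration (the real scheduler walks
/// FixedSortedArray<AccelContextHandle, AccelContextState>).
struct Candidate {
    id: u32,
    effective_priority: i32,
    budget_remaining_ns: i64,
    deadline_ns: u64,
    last_served_round: u64,
}

/// Hypothetical selection over the three policies defined above.
fn pick_next(policy: u32, candidates: &[Candidate]) -> Option<u32> {
    match policy {
        // RoundRobin: least recently served context first.
        0 => candidates.iter().min_by_key(|c| c.last_served_round).map(|c| c.id),
        // Priority: highest effective priority wins.
        1 => candidates.iter().max_by_key(|c| c.effective_priority).map(|c| c.id),
        // Guaranteed: among contexts with CBS budget left, earliest deadline
        // first; throttled contexts (budget <= 0) are skipped until replenishment.
        2 => candidates
            .iter()
            .filter(|c| c.budget_remaining_ns > 0)
            .min_by_key(|c| c.deadline_ns)
            .map(|c| c.id),
        _ => None,
    }
}
```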
/// CBS (Constant Bandwidth Server) state for per-context bandwidth enforcement.
/// Uses the same algorithm as the CPU scheduler ([Section 7.6](07-scheduling.md#cpu-bandwidth-guarantees)),
/// adapted for accelerator time. Each context with a bandwidth guarantee gets
/// its own AccelCbsServer instance in the scheduler's bandwidth_servers array.
///
/// **Unit convention**: Accelerator CBS uses nanoseconds for all time fields
/// (budget, period, deadline). The CPU CBS ([Section 7.6](07-scheduling.md#cpu-bandwidth-guarantees))
/// also uses nanoseconds internally. Both share the same CBS algorithm:
/// deficit-capped replenishment at period boundaries. The only behavioral
/// difference is that accelerator command batches are non-preemptible
/// (the budget may go transiently negative), whereas CPU CBS can preempt
/// at instruction granularity. This non-preemptibility requires the
/// deficit-cap mechanism (clamped to `-bandwidth_ns`) defined below.
///
/// **FIX-030 — 32-bit architecture note**: `AtomicI64` and `AtomicU64` require
/// native 64-bit atomic support. On ILP32 architectures (ARMv7, PPC32) that
/// lack native 64-bit atomics, these fields must use a spinlock-protected `i64`
/// / `u64` instead (or `core::sync::atomic::AtomicI64` which falls back to a
/// CAS loop on architectures with `ATOMIC_I64_LOCK_FREE < 2`). The accelerator
/// scheduler is a warm path (context switch frequency, not per-packet), so the
/// spinlock fallback cost is acceptable. The implementation must compile-time
/// select between native atomics and spinlock-protected wrappers based on
/// `cfg(target_has_atomic = "64")`.
// kernel-internal, not KABI — CBS bandwidth enforcement state.
#[repr(C)]
pub struct AccelCbsServer {
/// Context this server is associated with.
pub context_id: AccelContextHandle,
/// Maximum bandwidth (nanoseconds of accelerator time per period).
pub bandwidth_ns: u64,
/// Period in nanoseconds (e.g., 10ms = 10_000_000).
pub period_ns: u64,
/// Budget remaining in this period (nanoseconds). Signed because accelerator
/// command batches are non-preemptible — a batch that starts within budget
/// may finish over budget, creating transient debt carried to the next period.
/// Matches CPU CBS deficit-tracking pattern ([Section 7.6](07-scheduling.md#cpu-bandwidth-guarantees)).
///
/// On architectures with native 64-bit atomics (`target_has_atomic = "64"`):
/// `AtomicI64`. On ILP32 without native 64-bit atomics: `SpinLockProtected<i64>`.
pub budget_remaining_ns: AtomicI64,
/// Absolute deadline (nanoseconds since boot). Computed as
/// period_start + period_ns when the context is activated.
/// Same atomic/fallback policy as `budget_remaining_ns`.
pub deadline: AtomicU64,
/// Start of the current period (nanoseconds since boot).
/// Same atomic/fallback policy as `budget_remaining_ns`.
pub period_start: AtomicU64,
/// High-resolution timer for period-boundary budget replenishment.
///
/// Armed to fire at `deadline` (= `period_start + period_ns`). On fire,
/// the replenishment callback resets the budget and advances the period:
///
/// ```text
/// // Replenishment via CAS loop (races with concurrent charge_submission_cost):
/// loop {
/// let current = budget_remaining_ns.load(Relaxed);
/// if current >= bandwidth_ns as i64 { break; } // already replenished
/// // Clamp deficit: debt cannot exceed one full period
/// let clamped = max(current, -(bandwidth_ns as i64));
/// let new_budget = clamped + bandwidth_ns as i64;
/// if budget_remaining_ns.compare_exchange_weak(
/// current, new_budget, Release, Relaxed
/// ).is_ok() { break; }
/// }
/// period_start.store(now, Relaxed);
/// deadline.store(now + period_ns, Relaxed);
/// if context is runnable: re-enqueue in accelerator scheduler
/// rearm replenish_timer to fire at new deadline
/// ```
///
/// The CAS loop ensures atomicity with concurrent `charge_submission_cost`
/// (which does `fetch_sub` on `budget_remaining_ns` from the submission path).
/// Without CAS, a non-atomic clamp-then-add sequence races with concurrent
/// deductions, potentially losing budget updates.
///
/// This follows the same CBS replenishment pattern as the CPU scheduler
/// ([Section 7.6](07-scheduling.md#cpu-bandwidth-guarantees)): budgets are replenished at period
/// boundaries regardless of whether the context exhausted its budget
/// (timer-triggered) or was actively running when the budget hit zero
/// (exhaustion-triggered). Without this timer, a context that idles
/// mid-period would never have its budget restored, causing permanent
/// starvation.
///
/// **Deficit handling**: Non-preemptible accelerator commands may cause
/// transient overspend (negative `budget_remaining_ns`). The replenishment
/// callback uses a CAS loop (see the replenishment pseudocode above) to
/// atomically: (1) clamp the deficit to `-bandwidth_ns` (one full period of
/// debt), (2) add `bandwidth_ns` to the clamped value. This carries forward
/// exactly the bounded debt — matching the CPU CBS deficit-cap mechanism.
/// Recovery is bounded to at most one period.
pub replenish_timer: HrTimer,
}
CBS Replenishment.
When an AccelCbsServer's budget is exhausted (budget_remaining_ns reaches zero
or goes negative due to a non-preemptible command overrun):
- The context is marked throttled and removed from the accelerator scheduler's ready queue.
- The `replenish_timer` is armed (or was already armed) to fire at the current `deadline`.
- On timer fire, the replenishment callback:
  - Clamps deficit: if `budget_remaining_ns < -(bandwidth_ns as i64)`, sets it to `-(bandwidth_ns as i64)`.
  - Replenishes: `budget_remaining_ns += bandwidth_ns as i64`.
  - Advances: `period_start += period_ns`, `deadline += period_ns`.
  - Un-throttles the context and re-enqueues it if it has pending work.
  - Re-arms the timer to fire at the new `deadline`.
This ensures budgets are never permanently drained. The guaranteed bandwidth
fraction is bandwidth_ns / period_ns over any sliding window, identical to
the CPU CBS invariant (Section 7.6).
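The budget arithmetic from the steps above, in runnable form (simplified from `charge_submission_cost` and the replenishment callback; single device, no timer):

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Submission-path deduction (simplified from charge_submission_cost):
/// returns true if the context must be throttled (budget now <= 0).
fn charge(budget: &AtomicI64, estimated_ns: i64) -> bool {
    // fetch_sub returns the previous value; previous - estimated is the new budget.
    budget.fetch_sub(estimated_ns, Ordering::AcqRel) - estimated_ns <= 0
}

/// Period-boundary replenishment: the CAS loop from the AccelCbsServer
/// doc comment, which clamps debt to one period and adds one period's bandwidth.
fn replenish(budget: &AtomicI64, bandwidth_ns: i64) {
    loop {
        let current = budget.load(Ordering::Relaxed);
        if current >= bandwidth_ns {
            break; // already replenished
        }
        let clamped = current.max(-bandwidth_ns); // debt capped at one period
        if budget
            .compare_exchange_weak(current, clamped + bandwidth_ns, Ordering::Release, Ordering::Relaxed)
            .is_ok()
        {
            break;
        }
    }
}
```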
Allocation discipline: All scheduler data structures use pre-allocated, fixed-capacity
storage. No heap allocation occurs on the scheduling fast path. FixedSortedArray,
FixedVec, and FixedRingBuffer are kernel-internal types that allocate their backing
storage once at initialization (when the device is registered or a context is created)
and never resize. This ensures that submit_commands, poll_completion, and context
scheduling are deterministic-latency operations with no allocator contention.
Fixed-capacity container types:
// umka-core/src/collections/fixed.rs (kernel-internal, not part of KABI)
/// Fixed-capacity sorted array with O(log n) lookup and O(n) insert.
/// Backing storage is allocated once at initialization and never resized.
/// Capacity is set at construction; attempting to insert beyond capacity
/// returns an error rather than allocating.
///
/// Generic parameters:
/// - `K`: Key type (must be `Ord + Copy`)
/// - `V`: Value type (stored inline)
///
/// The array is kept sorted by key on every insert. Use when:
/// - Capacity is small (< 1024 entries)
/// - Lookups dominate inserts
/// - Deterministic latency is required
pub struct FixedSortedArray<K: Ord + Copy, V> {
/// Pointer to pre-allocated storage: [(K, V); capacity]
data: NonNull<(K, V)>,
/// Current number of elements (<= capacity).
/// AtomicUsize allows lock-free reads of the current length (e.g., for
/// iteration bounds in pick_next_submission). Writes are externally
/// synchronized.
/// **KABI boundary note**: FixedSortedArray is kernel-internal only (Tier 0).
/// It never crosses a KABI boundary — accelerator drivers interact through
/// the AccelOps vtable, not through scheduler data structures directly.
/// Therefore, AtomicUsize (which varies by pointer width) is acceptable.
len: AtomicUsize,
/// Maximum capacity (fixed at construction).
capacity: usize,
}
impl<K: Ord + Copy, V> FixedSortedArray<K, V> {
/// Construct a new fixed sorted array with given capacity.
/// The backing storage is allocated immediately and zero-initialized.
/// Returns `None` if allocation fails.
pub fn new(capacity: usize) -> Option<Self>;
/// Insert a (key, value) pair, keeping the array sorted by key.
/// Returns `Err(())` if the array is full (no allocation on error).
/// Time complexity: O(n) due to element shifting.
pub fn insert(&self, key: K, value: V) -> Result<(), ()>;
/// Remove an element by key.
/// Returns `Some(value)` if found, `None` otherwise.
/// Time complexity: O(n) due to element shifting.
pub fn remove(&self, key: K) -> Option<V>;
/// Get a reference to the value with the given key.
/// Time complexity: O(log n) via binary search.
pub fn get(&self, key: K) -> Option<&V>;
/// Get the minimum element (first in sorted order).
/// Returns `None` if the array is empty.
/// Time complexity: O(1).
pub fn min(&self) -> Option<&(K, V)>;
/// Get the maximum element (last in sorted order).
/// Returns `None` if the array is empty.
/// Time complexity: O(1).
pub fn max(&self) -> Option<&(K, V)>;
/// Return the current number of elements.
pub fn len(&self) -> usize;
/// Return whether the array is empty.
pub fn is_empty(&self) -> bool {
self.len() == 0
}
/// Return the maximum capacity.
pub fn capacity(&self) -> usize {
self.capacity
}
/// Iterate over all elements in sorted order.
/// The iterator is valid even if elements are removed during iteration
/// (removed elements are skipped).
pub fn iter(&self) -> FixedSortedIter<'_, K, V>;
}
/// Fixed-capacity vector (dynamic array) with O(1) push/pop and O(1) indexing.
/// Backing storage is allocated once at initialization and never resized.
/// Capacity is set at construction; attempting to push beyond capacity
/// returns an error rather than allocating.
///
/// Generic parameters:
/// - `T`: Element type (stored inline)
///
/// Use when:
/// - Random access is required
/// - Push/pop at end dominates
/// - Deterministic latency is required
pub struct FixedVec<T> {
/// Pointer to pre-allocated storage: [T; capacity]
data: NonNull<T>,
/// Current number of elements (<= capacity).
len: AtomicUsize,
/// Maximum capacity (fixed at construction).
capacity: usize,
}
impl<T> FixedVec<T> {
/// Construct a new fixed vector with given capacity.
/// The backing storage is allocated immediately and zero-initialized.
/// Returns `None` if allocation fails.
pub fn new(capacity: usize) -> Option<Self>;
/// Push an element to the end of the vector.
/// Returns `Err(element)` if the vector is full (no mutation on error).
/// Time complexity: O(1).
pub fn push(&self, value: T) -> Result<(), T>;
/// Pop an element from the end of the vector.
/// Returns `None` if the vector is empty.
/// Time complexity: O(1).
pub fn pop(&self) -> Option<T>;
/// Get a reference to the element at the given index.
/// Returns `None` if index is out of bounds.
/// Time complexity: O(1).
pub fn get(&self, index: usize) -> Option<&T>;
/// Get a mutable reference to the element at the given index.
/// Requires exclusive access (`&mut self`) to ensure no aliasing.
/// Returns `None` if index is out of bounds.
/// Time complexity: O(1).
pub fn get_mut(&mut self, index: usize) -> Option<&mut T>;
/// Return the current number of elements.
pub fn len(&self) -> usize;
/// Return whether the vector is empty.
pub fn is_empty(&self) -> bool {
self.len() == 0
}
/// Return the maximum capacity.
pub fn capacity(&self) -> usize {
self.capacity
}
/// Iterate over all elements.
pub fn iter(&self) -> FixedVecIter<'_, T>;
/// Iterate over all elements mutably.
pub fn iter_mut(&mut self) -> FixedVecIterMut<'_, T>;
}
/// Fixed-capacity single-producer single-consumer (SPSC) ring buffer.
/// Backing storage is allocated once at initialization and never resized.
/// Capacity is set at construction; enqueue returns `Err(element)` if full,
/// dequeue returns `None` if empty.
///
/// Generic parameters:
/// - `T`: Element type (stored inline)
///
/// Use when:
/// - FIFO ordering is required
/// - Bounded buffering is acceptable
/// - Lock-free SPSC operation is desired
///
/// **Note**: `FixedRingBuffer` is SPSC by default. For MPSC scenarios,
/// wrap access in a `SpinLock` or use multiple SPSC rings (one per producer).
/// The scheduler's per-context `pending_queue` is SPSC because only the
/// context owner enqueues and only the scheduler dequeues.
pub struct FixedRingBuffer<T> {
/// Pointer to pre-allocated storage: [T; capacity]
data: NonNull<T>,
/// Capacity (fixed at construction).
capacity: usize,
/// Head index (write position, producer-owned).
head: AtomicUsize,
/// Tail index (read position, consumer-owned).
tail: AtomicUsize,
}
impl<T> FixedRingBuffer<T> {
/// Construct a new fixed ring buffer with given capacity.
/// Capacity must be a power of two (enforced at construction).
/// The backing storage is allocated immediately and zero-initialized.
/// Returns `None` if allocation fails or capacity is not a power of two.
pub fn new(capacity: usize) -> Option<Self>;
/// Enqueue an element at the tail (producer operation).
/// Returns `Err(element)` if the buffer is full (no mutation on error).
/// Time complexity: O(1). Safe for concurrent producer/consumer.
pub fn enqueue(&self, value: T) -> Result<(), T>;
/// Dequeue an element from the head (consumer operation).
/// Returns `None` if the buffer is empty.
/// Time complexity: O(1). Safe for concurrent producer/consumer.
pub fn dequeue(&self) -> Option<T>;
/// Peek at the next element without removing it.
/// Returns `None` if the buffer is empty.
/// Time complexity: O(1).
pub fn peek(&self) -> Option<&T>;
/// Return the current number of elements in the buffer.
/// Note: this is a snapshot and may be stale in concurrent usage.
pub fn len(&self) -> usize;
/// Return whether the buffer is empty.
pub fn is_empty(&self) -> bool {
self.len() == 0
}
/// Return whether the buffer is full.
pub fn is_full(&self) -> bool {
self.len() == self.capacity - 1 // One slot wasted to distinguish full/empty
}
/// Return the maximum capacity.
pub fn capacity(&self) -> usize {
self.capacity
}
/// Drain all elements from the buffer, calling `f` on each.
/// This is a consumer operation that empties the buffer.
pub fn drain<F>(&self, mut f: F)
where
F: FnMut(T),
{
while let Some(elem) = self.dequeue() {
f(elem);
}
}
}
Memory safety notes:
- All three types use NonNull to store the data pointer, ensuring non-null pointer optimization (no extra discriminant needed for Option).
- The backing storage is allocated using PageAllocator::alloc() (4KB granularity) and is never freed until the owning object (device, context) is destroyed.
- FixedVec and FixedSortedArray use AtomicUsize for len to allow concurrent reads without locking. Mutations (push, insert, remove) require external synchronization (typically a SpinLock or scheduler-level lock).
- FixedRingBuffer is lock-free for single-producer single-consumer usage via atomic head/tail updates with appropriate memory ordering (Ordering::AcqRel for enqueue/dequeue).
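The SPSC head/tail protocol described in the last note can be sketched concretely. This is a safe-Rust model with `u64` payloads stored in `AtomicU64` slots (the kernel type stores arbitrary `T` behind `NonNull` with the same index arithmetic); the paired Acquire loads and Release stores are the decomposed equivalent of the AcqRel orderings mentioned above.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

// Minimal SPSC ring sketch. Capacity must be a power of two; one slot is
// wasted to distinguish full from empty (full when head - tail == capacity - 1).
struct SpscRing {
    slots: Vec<AtomicU64>,
    mask: usize,
    head: AtomicUsize, // write index, producer-owned
    tail: AtomicUsize, // read index, consumer-owned
}

impl SpscRing {
    fn new(capacity: usize) -> Option<Self> {
        if !capacity.is_power_of_two() {
            return None;
        }
        Some(Self {
            slots: (0..capacity).map(|_| AtomicU64::new(0)).collect(),
            mask: capacity - 1,
            head: AtomicUsize::new(0),
            tail: AtomicUsize::new(0),
        })
    }

    fn enqueue(&self, value: u64) -> Result<(), u64> {
        let head = self.head.load(Ordering::Relaxed); // producer-owned index
        let tail = self.tail.load(Ordering::Acquire); // observe consumer progress
        if head.wrapping_sub(tail) == self.mask {
            return Err(value); // full
        }
        self.slots[head & self.mask].store(value, Ordering::Relaxed);
        self.head.store(head.wrapping_add(1), Ordering::Release); // publish slot
        Ok(())
    }

    fn dequeue(&self) -> Option<u64> {
        let tail = self.tail.load(Ordering::Relaxed); // consumer-owned index
        let head = self.head.load(Ordering::Acquire); // observe producer progress
        if tail == head {
            return None; // empty
        }
        let value = self.slots[tail & self.mask].load(Ordering::Relaxed);
        self.tail.store(tail.wrapping_add(1), Ordering::Release); // release slot
        Some(value)
    }
}

fn main() {
    let r = SpscRing::new(4).unwrap();
    assert!(r.enqueue(7).is_ok());
    assert!(r.enqueue(8).is_ok());
    assert!(r.enqueue(9).is_ok());
    assert_eq!(r.enqueue(10), Err(10)); // full: capacity - 1 usable slots
    assert_eq!(r.dequeue(), Some(7));   // FIFO order
    assert!(SpscRing::new(3).is_none()); // not a power of two
}
```

The Release store on `head` pairs with the Acquire load in `dequeue` to ensure the consumer sees the slot write before it sees the advanced index.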
Scheduling flow:
Userspace submits work (via /dev/umka-accel-N or DRM ioctl compat):
|
v
UmkaOS Core validates capabilities
- Does this process have an AccelContext for this device?
- Does this process's cgroup allow more accelerator time?
- Does this context have memory budget for the command buffer?
|
v
AccelScheduler queues the submission
- Assigns priority based on context priority + cgroup policy
- Checks CBS bandwidth server: is this context within its guarantee?
|
v
AccelScheduler picks next submission to dispatch to hardware
- Priority order: Realtime > High > Normal > Background
- Within same priority: CBS server with earliest deadline first
- If higher-priority work arrives and device supports preemption:
preempt current context via driver's preempt_context()
|
v
Driver's submit_commands() sends work to hardware
|
v
Hardware completion interrupt
|
v
Driver's poll_completion() reports result
|
v
AccelScheduler updates accounting (compute time, memory)
|
v
Userspace receives completion notification
Scheduling algorithm specification:
The AccelScheduler uses a multi-level scheduling algorithm that combines priority-based scheduling with CBS (Constant Bandwidth Server) bandwidth guarantees.
Independence from CPU scheduler: GPU CBS throttling is entirely independent of the CPU EEVDF scheduler (Section 7.1). When a GPU context's CBS budget is exhausted, the AccelScheduler defers its next submission — but the CPU thread that submitted the work is not descheduled or penalized. Conversely, CPU scheduler decisions (load balancing, bandwidth throttling) do not affect GPU submission ordering. The two schedulers share only cgroup accounting data (for enforcing aggregate resource limits) and NUMA topology information (for affinity decisions).
// Scheduling decision pseudocode (Section 22.1.2.4)
/// Pick the next submission to dispatch to hardware.
/// Returns the context ID and submission index, or None if nothing is runnable.
/// Time complexity: O(n) where n = number of active contexts (linear scan
/// per priority class; see the Complexity notes below).
fn pick_next_submission(&self) -> Option<(AccelContextHandle, SubmissionIndex)> {
// Level 1: Priority classes are strictly ordered.
// Realtime > High > Normal > Background
// A higher-priority context is always scheduled before a lower-priority one.
for priority in [Priority::Realtime, Priority::High, Priority::Normal, Priority::Background] {
// Level 2: Within the same priority class, use CBS deadline ordering.
// Each context with a bandwidth guarantee has a CBS server with a deadline.
// The context with the earliest deadline wins (Earliest Deadline First).
        let candidates = self.contexts.iter()
            .filter(|ctx| ctx.priority == priority && ctx.has_pending_work());
        // Key: (CBS deadline, handle). Earliest deadline wins (EDF); on a
        // deadline tie, the lowest handle wins (deterministic, avoids starvation).
        // Note: min_by_key on the deadline alone would return the LAST minimum
        // under Rust's Iterator semantics, violating the lowest-handle rule.
        if let Some(best) = candidates.min_by_key(|ctx| {
            let deadline = ctx.cbs_server_index.map(|i| {
                self.bandwidth_servers[i].deadline.load(Ordering::Relaxed)
            }).unwrap_or(u64::MAX); // No CBS = infinite deadline
            (deadline, ctx.handle)
        }) {
            return Some((best.handle, best.next_submission_index()));
        }
}
None // No runnable contexts
}
/// Check if preemption is warranted when a higher-priority submission arrives.
/// Returns `true` if the scheduler should preempt the currently running context.
fn should_preempt(&self, running_ctx: &AccelContextState, higher_prio_ctx: &AccelContextState) -> bool {
// Preemption is expensive (~50μs-10ms). Only preempt if:
// 1. Running context is lower priority AND device supports preemption.
if running_ctx.priority < higher_prio_ctx.priority
&& self.device_info.preemption_granularity >= AccelPreemptionGranularity::DrawBoundary
{
return true;
}
// 2. Running context has exceeded its CBS budget (fairness violation).
if let Some(server_idx) = running_ctx.cbs_server_index {
let server = &self.bandwidth_servers[server_idx];
if server.budget_remaining_ns.load(Ordering::Relaxed) <= 0 {
return true; // Budget exceeded (possibly negative from non-preemptible overspend), preempt
}
}
// 3. Running context has exceeded max_execution_us (timeout violation).
if running_ctx.current_submission_exceeded_timeout() {
return true;
}
// Otherwise: let the running context continue (avoid thrashing).
false
}
Tie-breaking and determinism:
- When two contexts have identical priority and identical CBS deadline, the scheduler uses the context handle value as a deterministic tie-breaker (lower handle wins). This ensures reproducible scheduling decisions across runs.
- The scheduler does NOT use random tie-breaking (unlike some Linux CFS optimizations) because determinism is valued over fairness fine-tuning in accelerator scheduling.
Complexity:
- pick_next_submission(): O(n) where n = number of active contexts (typically < 64). The FixedSortedArray keeps contexts sorted by handle, so iteration is cache-friendly.
- should_preempt(): O(1) — simple comparisons and atomic loads.
- CBS server update (on completion): O(1) — atomic decrement of budget_remaining_ns, periodic deadline renewal.
Preemption interaction with sorted data structure:
- Preemption does NOT require re-sorting the FixedSortedArray. The array is sorted by handle (static), not by dynamic state.
- The scheduling decision (pick_next_submission) performs a linear scan filtered by priority, then finds the minimum deadline among candidates. This is O(n) but n is small (< 64 contexts typical).
- For systems with many contexts (> 256), the scheduler can maintain a separate priority-indexed skip list for O(log n) lookup, but this is not implemented in the base design.
**CBS budget replenishment and submission charging (Section 22.1.2.4a):**
The pseudocode above omits the two side-effecting steps that make CBS work: replenishment of
exhausted budgets at the start of each scheduling decision, and budget charging after a
command buffer is selected. Both steps are mandatory; omitting either breaks CBS fairness.
```rust
// umka-core/src/accel/scheduler.rs — CBS replenishment and charging detail
/// Replenish CBS budgets for all contexts whose deadline has passed.
/// Called at the top of every `pick_next_submission` invocation, O(n) but n ≤ 64.
///
/// CBS replenishment rule (from Section 7.3):
/// if context.budget_remaining_ns <= 0
/// AND current_time >= context.deadline:
/// context.budget_remaining_ns += context.bandwidth_ns (debt carried forward)
/// context.period_start = current_time
/// context.deadline = current_time + context.period_ns
///
/// Contexts that have exhausted their budget but whose deadline has NOT yet
/// passed remain ineligible until their deadline arrives (no early replenishment).
/// This enforces the bandwidth ceiling: a context cannot borrow future budget.
/// Note: `budget_remaining_ns` may be negative after a non-preemptible batch
/// overspend; the debt is carried forward by adding `bandwidth_ns` (not resetting
/// to `bandwidth_ns`), so overspend reduces the next period's effective budget.
fn replenish_expired_budgets(sched: &mut AccelScheduler, now_ns: u64) {
for ctx in sched.contexts.iter() {
let Some(server_idx) = ctx.cbs_server_index else { continue };
let server = &sched.bandwidth_servers[server_idx];
let remaining = server.budget_remaining_ns.load(Ordering::Relaxed);
let deadline = server.deadline.load(Ordering::Relaxed);
if remaining <= 0 && now_ns >= deadline {
// Budget exhausted AND period has ended: replenish (debt carried forward).
server.budget_remaining_ns.fetch_add(server.bandwidth_ns as i64, Ordering::Relaxed);
server.period_start.store(now_ns, Ordering::Relaxed);
server.deadline.store(now_ns + server.period_ns, Ordering::Relaxed);
}
// If remaining > 0: budget not yet exhausted, no action.
// If remaining <= 0 but now_ns < deadline: still in penalty
// phase, do not replenish early.
}
}
/// Charge a command buffer's estimated cost to its context's CBS budget.
/// Called immediately after pick_next_submission selects a command buffer.
///
/// Cost estimation: the kernel uses cmd_buffer's `estimated_ns` field, which
/// is set by userspace at submit time (via `AccelSubmitParams::estimated_ns`).
/// The estimate is clamped to `period_ns / 4` to prevent a single oversized
/// command from consuming more than one quarter of the CBS period in one shot.
/// (The CBS_MIN_BUDGET floor is enforced separately in the debt carry-forward
/// path — see Section 22.3.4 for the full debt model.)
///
/// After charging: if `budget_remaining_ns <= 0`, the context is over-budget
/// (possibly carrying debt from a non-preemptible overspend) and will not be
/// selected by `pick_next_submission` until its next replenishment period.
fn charge_submission_cost(
sched: &mut AccelScheduler,
ctx: &mut AccelContextState,
cmd: &PendingSubmission,
) {
let Some(server_idx) = ctx.cbs_server_index else { return };
let server = &sched.bandwidth_servers[server_idx];
// Retrieve the context's current period to compute the cost cap.
let period_ns = server.period_ns;
let cost_cap = period_ns / 4;
// Minimum 100µs prevents userspace from gaming the CBS scheduler
// by claiming near-zero estimated execution time.
const MIN_ESTIMATED_NS: u64 = 100_000;
// Clamp estimated cost; floor at MIN_ESTIMATED_NS, ceiling at cost_cap.
let cost_ns = cmd.estimated_ns.max(MIN_ESTIMATED_NS).min(cost_cap);
// Save the clamped estimate so on_submission_complete() can compute
// the actual-vs-estimated delta for CBS budget correction.
ctx.last_estimated_ns = cost_ns;
// Subtract cost from remaining budget. May go negative for non-preemptible
// batches that overspend — the debt carries to the next period.
server.budget_remaining_ns.fetch_sub(cost_ns as i64, Ordering::Relaxed);
}
```
Integrated pick_next_submission with replenishment and charging:
The full CBS-correct scheduling decision is thus:
pick_next_submission():
1. replenish_expired_budgets(now) // O(n): update any newly-eligible contexts
2. For each priority level (Realtime → Background):
a. Find all contexts at this priority with pending work AND
budget_remaining_ns > 0 (i.e., still within budget).
b. Among eligible contexts, select the one with the earliest `deadline`
in its CBS server (Earliest Deadline First within the priority class).
c. Contexts without a CBS server (AccelSchedPolicy::RoundRobin) are treated
as having deadline = u64::MAX and are selected in handle order as tie-breaker.
3. If a context was found:
a. Dequeue its next PendingSubmission (pop_front from pending_queue).
b. charge_submission_cost(ctx, cmd) // O(1): update CBS budget_remaining_ns
c. Return (context_handle, submission).
4. If no eligible context at any priority: return None (hardware queue stays idle).
This replaces the simpler pseudocode above, which did not specify replenishment timing or budget charging. The two are equivalent for contexts that never exceed their budget; the CBS machinery only activates when a context reaches its bandwidth ceiling.
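The replenish/charge cycle above can be exercised with plain integers. The sketch below is a toy single-server simulation (no atomics, illustrative numbers: 2ms bandwidth per 10ms period); it demonstrates the no-early-replenishment rule and the debt carry-forward, not the kernel's concurrent implementation.

```rust
// Toy CBS server modeling replenish_expired_budgets / charge_submission_cost.
struct CbsServer {
    bandwidth_ns: i64,        // budget granted per period
    period_ns: u64,           // replenishment period
    budget_remaining_ns: i64, // may go negative (debt from overspend)
    deadline: u64,            // end of current period
}

impl CbsServer {
    fn replenish(&mut self, now_ns: u64) {
        if self.budget_remaining_ns <= 0 && now_ns >= self.deadline {
            // Debt carried forward: add bandwidth, do not reset.
            self.budget_remaining_ns += self.bandwidth_ns;
            self.deadline = now_ns + self.period_ns;
        }
        // budget <= 0 but deadline not reached: no early replenishment.
    }

    fn eligible(&self) -> bool {
        self.budget_remaining_ns > 0
    }

    fn charge(&mut self, estimated_ns: u64) {
        // Clamp: floor 100µs, ceiling period/4 (as in charge_submission_cost).
        let cost = estimated_ns.max(100_000).min(self.period_ns / 4) as i64;
        self.budget_remaining_ns -= cost;
    }
}

fn main() {
    let mut s = CbsServer {
        bandwidth_ns: 2_000_000,
        period_ns: 10_000_000,
        budget_remaining_ns: 2_000_000,
        deadline: 10_000_000,
    };
    s.charge(1_500_000);     // 1.5ms submission: 0.5ms budget left
    assert!(s.eligible());
    s.charge(1_500_000);     // overspend: budget goes to -1ms (debt)
    assert!(!s.eligible());  // over budget: not schedulable
    s.replenish(5_000_000);  // mid-period: no early replenishment
    assert!(!s.eligible());
    s.replenish(10_000_000); // deadline passed: +2ms, debt carried
    assert_eq!(s.budget_remaining_ns, 1_000_000); // 2ms grant - 1ms debt
    assert!(s.eligible());
}
```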
AccelScheduler vs Firmware Scheduling Boundary:
Modern GPUs have their own internal schedulers (e.g., NVIDIA's GPC/TPC scheduler, AMD's ACE/HWS). The AccelScheduler does NOT replace or duplicate firmware scheduling. The boundary is clear:
AccelScheduler (kernel, software):
Controls WHICH contexts get to submit work and WHEN.
Enforces fairness, cgroup limits, bandwidth guarantees, priorities.
Decides the ORDER of submissions to the hardware queue.
Granularity: per-submission (command buffer level).
Firmware scheduler (device, hardware/firmware):
Controls HOW submitted work is mapped to hardware execution units.
Distributes warps/wavefronts across SMs/CUs.
Manages context switching on the device.
Handles workgroup scheduling and occupancy.
Granularity: per-instruction-group (warp/wavefront level).
Interaction:
AccelScheduler submits command buffer → firmware takes over.
The kernel does not see or control internal GPU scheduling.
The kernel CAN preempt at context boundaries (via preempt_context()),
but CANNOT preempt mid-kernel on hardware that doesn't support it.
Preemption capability is reported by get_info().preemption_granularity.
Devices with AccelPreemptionGranularity::CommandBoundary do not support mid-dispatch preemption;
the scheduler uses cooperative yield for those devices.
When the device has hardware preemption (NVIDIA compute preemption,
AMD MES): AccelScheduler can request preemption, and the firmware
will drain the current workgroup and save context. Modern GPUs
(Ampere+, CDNA2+) complete preemption in ~50μs-10ms including
context save/restore, depending on preemption granularity and
in-flight workload (instruction-level compute preemption is
typically 50-100μs; draw/dispatch-level can be milliseconds).
This is expensive compared to CPU context switches (~1μs) but
bounded. The AccelScheduler therefore avoids preemption
on the fast path (submissions are ordered by priority before dispatch)
and triggers preemption only when necessary: priority inversion
(a Realtime context is blocked behind a Background dispatch),
fairness enforcement (a context has exceeded its CBS bandwidth
budget), or timeout (max_execution_us exceeded).
Older GPUs (pre-Ampere NVIDIA, pre-CDNA2 AMD) may report
AccelPreemptionGranularity::CommandBoundary, where preemption
latency is bounded by the longest in-flight command buffer
(potentially hundreds of milliseconds for large compute dispatches).
The scheduler treats these as cooperative-yield devices (see below).
Non-preemptible GPU limitation and cooperative yield:
Many GPUs (older hardware, NPUs, FPGAs, and some mid-range consumer GPUs) report
AccelPreemptionGranularity::CommandBoundary. On these devices, the AccelScheduler CANNOT interrupt
a running workload mid-dispatch. The scheduler can only enforce time slicing at
submission boundaries — between command buffer submissions — not within a single
dispatch. This is coarse-grained time slicing.
Consequences for non-preemptible devices:
- max_execution_us is enforced at submission granularity, not instruction granularity.
A single long-running dispatch that exceeds max_execution_us cannot be aborted; the
scheduler must wait for it to complete naturally before taking corrective action.
- Time guarantees for other contexts degrade proportionally to the longest single
dispatch in the queue. Workloads with large dispatches should set small
max_cmd_buffer_size limits (enforced by the kernel at submission time) to bound
worst-case dispatch duration.
Cooperative yield mechanism for non-preemptible devices:
- The driver implements preempt_context() as a cooperative yield: after the
current command buffer completes, do not submit the next one, and signal the scheduler.
This is NOT true preemption — the driver cannot interrupt the GPU mid-dispatch. It
stops feeding new work at the next natural command buffer boundary.
- The preempt_context vtable entry serves double duty: on preemptible devices it
triggers true hardware preemption (~50μs-10ms); on non-preemptible devices it
triggers cooperative yield (latency bounded by longest in-flight command buffer).
The scheduler checks preemption_granularity to know which behavior to expect.
- Drivers for non-preemptible devices MUST report AccelPreemptionGranularity::CommandBoundary in
get_info(). Misreporting this field is a driver correctness violation.
- The AccelScheduler logs a warning when a non-preemptible context exceeds
max_execution_us, records the overrun duration in AccelContextState::total_compute_ns,
and applies backpressure (delayed next submission) to amortize the overrun across the
cgroup's next scheduling period.
Timeslice defaults and starvation prevention:
/// Default scheduling timeslice per preemption granularity level.
/// The scheduler grants each context this much device time before considering
/// preemption. Shorter timeslice = better fairness but more preemption overhead.
/// Per-context overrides via AccelContextParams.timeslice_ns.
pub const ACCEL_TIMESLICE_NS: [u64; 4] = [
0, // CommandBoundary: non-preemptible, CBS budget is the only control
10_000_000, // DrawBoundary: 10ms (allows 2-5 draw calls per slice)
5_000_000, // PixelBoundary: 5ms (sub-ms preemption cost, tighter fairness)
2_000_000, // InstructionLevel: 2ms (~50-100μs save/restore, 20:1 ratio)
];
/// Maximum wall-clock time a single submission may occupy a non-preemptible device.
/// If exceeded: FMA HealthReport (severity: degraded), and if the device supports
/// abort (AccelBaseVTable::abort_context), it is called. Otherwise the scheduler
/// waits for completion but demotes the context to Background priority for all
/// subsequent submissions. Default 500ms; per-device via AccelDeviceInfo.max_submission_ns.
pub const ACCEL_MAX_SUBMISSION_NS: u64 = 500_000_000;
/// After this many consecutive scheduling rounds where a context had pending work
/// but was never scheduled (blocked by a higher-priority or non-preemptible context),
/// the starved context's effective priority is temporarily boosted by one level for
/// the next round. Prevents indefinite starvation on CommandBoundary devices.
pub const ACCEL_STARVATION_ROUNDS: u32 = 10;
Starvation prevention interacts with CBS as follows: when a context's
starvation_rounds counter reaches ACCEL_STARVATION_ROUNDS, the scheduler
temporarily treats it as one priority level higher for the next pick_next_submission
call. After one submission is dispatched, the boost expires and the counter resets.
This ensures that even on non-preemptible devices, every context with pending work
eventually gets device time.
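The boost rule can be sketched as a pure function. This is an illustrative model only: the `Priority` enum, the saturating one-level boost, and the `effective_priority` helper name are assumptions of the sketch (the text specifies the threshold and the one-level boost, not the exact encoding).

```rust
// Toy model of the starvation boost: after ACCEL_STARVATION_ROUNDS rounds of
// pending-but-unscheduled work, the context is treated one priority level
// higher for the next pick_next_submission call.
const ACCEL_STARVATION_ROUNDS: u32 = 10;

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Priority { Realtime, High, Normal, Background }

struct Ctx {
    priority: Priority,
    starvation_rounds: u32,
}

impl Ctx {
    fn effective_priority(&self) -> Priority {
        if self.starvation_rounds >= ACCEL_STARVATION_ROUNDS {
            // Boost one level, saturating at Realtime.
            match self.priority {
                Priority::Background => Priority::Normal,
                Priority::Normal => Priority::High,
                Priority::High | Priority::Realtime => Priority::Realtime,
            }
        } else {
            self.priority
        }
    }
}

fn main() {
    let starved = Ctx { priority: Priority::Background, starvation_rounds: 10 };
    assert_eq!(starved.effective_priority(), Priority::Normal); // boosted one level
    let fresh = Ctx { priority: Priority::Background, starvation_rounds: 3 };
    assert_eq!(fresh.effective_priority(), Priority::Background); // no boost yet
}
```

After a boosted context dispatches one submission, the scheduler resets its counter, so the boost is transient by construction.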
Submission completion and timeout tracking:
impl AccelScheduler {
/// Called by the completion interrupt handler when a submission finishes.
fn on_submission_complete(&mut self, ctx: &mut AccelContextState, elapsed_ns: u64) {
let max_ns = self.device_info.max_submission_ns
.unwrap_or(ACCEL_MAX_SUBMISSION_NS);
if elapsed_ns > max_ns {
fma_report_health(self.device_id, HealthEventClass::Performance,
ACCEL_SUBMISSION_TIMEOUT, HealthSeverity::Degraded, &[]);
ctx.timeout_count += 1;
if ctx.timeout_count >= 3 {
ctx.effective_priority = AccelPriority::Background;
}
}
// === CBS budget correction (actual vs estimated) ===
// At submit time, charge_submission_cost() deducted estimated_ns from the
// CBS server's budget_remaining_ns. Now that the actual elapsed time is known,
// correct the budget to reflect reality. This ensures bandwidth guarantees
// hold over time — systematic over-estimation would accumulate unused budget,
// while under-estimation would starve other contexts.
if let Some(server_idx) = ctx.cbs_server_index {
let server = &self.bandwidth_servers[server_idx]; // atomics need only &
let estimated = ctx.last_estimated_ns; // saved by charge_submission_cost()
let delta = elapsed_ns as i64 - estimated as i64;
// Positive delta = actual took longer than estimated → subtract more budget.
// Negative delta = actual was faster → refund excess charge.
server.budget_remaining_ns.fetch_sub(delta, Ordering::Relaxed);
// Clamp: budget cannot exceed the server's period (no over-refund).
let cap = server.bandwidth_ns as i64;
let _ = server.budget_remaining_ns.fetch_update(
Ordering::Relaxed, Ordering::Relaxed, |v| {
if v > cap { Some(cap) } else { None }
},
);
}
// Update starvation counters for all OTHER contexts with pending work.
for other in self.contexts.iter_mut() {
if other.has_pending() && other.handle != ctx.handle {
other.starvation_rounds += 1;
} else {
other.starvation_rounds = 0;
}
}
}
}
CPU scheduler interaction on GPU completion:
When a submission completes, the AccelScheduler wakes the userspace thread waiting on the fence via the standard WaitQueue mechanism (Section 3.6). The CPU scheduler treats this as an I/O completion wakeup and applies the standard EEVDF sleeper bonus (Section 7.1): the thread is placed at the front of its priority's run queue for one scheduling tick, ensuring prompt processing of the GPU result.
No special cross-scheduler protocol is needed. The fence wait is a standard
WaitQueue sleep, and the wakeup path is standard wake_one(). If a thread is
blocked on a GPU fence for longer than 100ms (the CPU scheduler's I/O-wait
threshold), the thread's EEVDF virtual time is NOT advanced — same treatment as
disk-I/O-waiting threads. This prevents GPU-heavy threads from being penalized by
the CPU scheduler for "not using CPU": they are classified as I/O-bound, not idle.
Submission error handling:
/// Errors returned by ACCEL_SUBMIT ioctl.
#[repr(i32)]
pub enum AccelSubmitError {
/// Invalid context handle (destroyed or belongs to another process).
InvalidContext = -libc::EINVAL,
/// Device is offline (thermally throttled, in reset, or removed).
/// Context remains valid. Userspace must resubmit when device recovers.
DeviceOffline = -libc::ENODEV,
/// Device memory exhausted (cannot allocate command buffer on device).
/// Wait for completions to free memory, then retry.
DeviceMemoryFull = -libc::ENOMEM,
/// Submission queue full (too many in-flight submissions for this context).
/// Wait for a fence, then retry. poll() on the fd returns POLLOUT when ready.
QueueFull = -libc::EBUSY,
/// Invalid fence handle in wait_fences array.
InvalidFence = -libc::EBADF,
/// Command buffer size exceeds AccelCapabilities.max_cmd_buffer_size.
CmdBufferTooLarge = -libc::E2BIG,
/// Permission denied (context doesn't have required capabilities).
PermissionDenied = -libc::EPERM,
/// Device was reset while submission was queued (driver crash recovery).
/// All in-flight submissions on this device are lost. All pending fences
/// are signaled with error status. Userspace must recreate contexts and
/// resubmit. Pending submissions are NOT auto-retried.
DeviceReset = -libc::EIO,
}
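The error documentation above implies a userspace handling policy. The sketch below models that mapping with local stand-in enums (`SubmitError`, `Action`, and the grouping of EINVAL/EPERM/E2BIG under `Fatal` are assumptions of this sketch, not part of the ABI):

```rust
// Userspace-side triage implied by the AccelSubmitError docs.
#[derive(Debug, PartialEq)]
enum SubmitError { DeviceOffline, DeviceMemoryFull, QueueFull, DeviceReset, Fatal }

#[derive(Debug, PartialEq)]
enum Action { RetryLater, WaitForFenceThenRetry, RecreateContext, Abort }

fn handle(err: SubmitError) -> Action {
    match err {
        // Transient: wait for recovery / completions to free memory, resubmit.
        SubmitError::DeviceOffline | SubmitError::DeviceMemoryFull => Action::RetryLater,
        // Queue full: poll() returns POLLOUT when a slot frees up.
        SubmitError::QueueFull => Action::WaitForFenceThenRetry,
        // Reset: in-flight work lost, fences signaled with error status;
        // contexts must be recreated and work resubmitted (no auto-retry).
        SubmitError::DeviceReset => Action::RecreateContext,
        // EINVAL / EPERM / E2BIG: programming errors, not retryable.
        SubmitError::Fatal => Action::Abort,
    }
}

fn main() {
    assert_eq!(handle(SubmitError::QueueFull), Action::WaitForFenceThenRetry);
    assert_eq!(handle(SubmitError::DeviceReset), Action::RecreateContext);
    assert_eq!(handle(SubmitError::Fatal), Action::Abort);
}
```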
AccelScheduler evolvability: The AccelScheduler policy is not evolvable in Phase 3.
Accelerator scheduling policy replacement is deferred to Phase 5 (full platform). The
AccelScheduler is Tier 0 code but uses a fixed CBS-based policy without AtomicPtr
swap in Phases 1-4.
22.3 AccelComputeVTable — Compute-Specific Extensions¶
KABI versioning note: All accelerator vtables (AccelComputeVTable,
AccelDisplayVTable, and AccelBaseVTable in Section 22.1)
follow standard KABI vtable versioning (Section 12.1): vtable_size as the
first field serves as the forward-compatibility discriminant, kabi_version as
the second field disambiguates same-size vtables across deprecation cycles. New
methods are appended at the end; the kernel reads only
min(vtable_size, KERNEL_*_VTABLE_SIZE) bytes.
For devices with programmable compute (GPUs, some NPUs):
// KABI vtable — crosses driver-kernel boundary. Layout is stable across kernel versions.
#[repr(C)]
pub struct AccelComputeVTable {
pub vtable_size: usize,
/// Primary version discriminant: `KabiVersion::as_u64()`. See [Section 12.2](12-kabi.md#kabi-abi-rules-and-lifecycle) Rule 6.
pub kabi_version: u64,
/// Query supported compute APIs (Vulkan compute, OpenCL, CUDA compat).
pub get_compute_caps: unsafe extern "C" fn(
ctx: *mut c_void,
out_caps: *mut ComputeCapabilities,
) -> IoResultCode,
/// Set up a shared virtual address space between CPU and device.
/// Enables SVM (Shared Virtual Memory) / unified addressing.
pub enable_svm: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
process_page_table: u64, // CR3 / TTBR0 of the owning process
) -> IoResultCode>,
/// Handle a device-initiated page fault.
/// Called by the driver when the device faults on an unmapped address.
/// The kernel resolves the fault (allocate page, migrate, map) and
/// tells the driver to retry.
pub handle_device_fault: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
fault_addr: u64,
fault_flags: u32,
out_resolution: *mut FaultResolution,
) -> IoResultCode>,
/// Get performance counters for a context.
pub get_perf_counters: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
counters: *mut AccelPerfCounters,
) -> IoResultCode>,
}
/// Flags passed by the driver in the `fault_flags` parameter of `handle_device_fault`.
/// The driver sets these based on the device's fault buffer entry to tell the kernel
/// the nature of the fault. Multiple flags may be OR'd together.
bitflags::bitflags! {
#[repr(transparent)]
pub struct AccelFaultFlags: u32 {
/// The faulting access was a read.
const READ = 1 << 0;
/// The faulting access was a write.
const WRITE = 1 << 1;
/// The faulting access was an instruction fetch (compute shader).
const EXEC = 1 << 2;
/// The faulting access was an atomic operation.
const ATOMIC = 1 << 3;
/// The fault was a prefetch hint (non-faulting — resolution is advisory).
const PREFETCH = 1 << 4;
/// The fault is replayable: the device can retry after resolution.
/// If not set, the fault is non-replayable (context must be killed or
/// the page must have been pre-mapped before the access).
const REPLAYABLE = 1 << 5;
}
}
/// Resolution of a device-initiated page fault, returned by handle_device_fault.
/// Tells the kernel how to proceed after a GPU/NPU fault on a shared virtual
/// memory address.
#[repr(C)]
pub struct FaultResolution {
/// Action the driver should take.
pub action: FaultAction,
/// If action is MapLocal or MapRemote, the physical page to map.
/// Ignored for other actions.
pub page_paddr: u64,
/// If action is MapLocal or MapRemote, the page table entry flags
/// (readable/writable/executable, caching attributes).
pub pte_flags: u64,
}
// FaultResolution: FaultAction(4) + 4pad + u64(8)*2 = 24 bytes. KABI struct.
const_assert!(core::mem::size_of::<FaultResolution>() == 24);
#[repr(u32)]
pub enum FaultAction {
/// Retry the faulting access — kernel resolved it (e.g., allocated page).
Retry = 0,
/// Map the provided physical page locally (device-local memory).
MapLocal = 1,
/// Map the provided physical page as remote (system memory, peer device).
MapRemote = 2,
/// Deliver SIGBUS or SIGSEGV to the faulting context's owner process.
Signal = 3,
/// Kill the context (unrecoverable device error).
KillContext = 4,
}
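Putting the flag and action definitions together: the sketch below shows one plausible kernel-side triage. The flag constants mirror the `AccelFaultFlags` bit positions above; the trimmed three-variant action enum and the specific policy (kill on non-replayable, signal on write-to-read-only) are illustrative assumptions, not the specified algorithm.

```rust
// Fault triage sketch. Bit positions mirror AccelFaultFlags.
const WRITE: u32 = 1 << 1;
const PREFETCH: u32 = 1 << 4;
const REPLAYABLE: u32 = 1 << 5;

#[derive(Debug, PartialEq)]
enum Action { Retry, Signal, KillContext } // trimmed subset of FaultAction

fn resolve_fault(flags: u32, page_is_writable: bool) -> Action {
    if flags & REPLAYABLE == 0 && flags & PREFETCH == 0 {
        // Non-replayable, non-prefetch: the device cannot retry the access,
        // so the context is unrecoverable.
        return Action::KillContext;
    }
    if flags & WRITE != 0 && !page_is_writable {
        // Write to a read-only mapping: deliver SIGBUS/SIGSEGV to the owner.
        return Action::Signal;
    }
    // Otherwise the kernel allocates/migrates/maps the page and the device
    // replays the access (FaultAction::Retry / MapLocal / MapRemote).
    Action::Retry
}

fn main() {
    assert_eq!(resolve_fault(REPLAYABLE, true), Action::Retry);
    assert_eq!(resolve_fault(WRITE | REPLAYABLE, false), Action::Signal);
    assert_eq!(resolve_fault(WRITE, true), Action::KillContext); // non-replayable
}
```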
/// Performance counters for an accelerator context.
/// Returned by get_perf_counters() for monitoring and accounting.
///
/// Layout: 7 u64 fields (56 B) + vendor_counters[u64; 8] (64 B) + _pad (32 B)
/// = 152 bytes. Exposed across the KABI boundary; size must be stable.
#[repr(C)]
pub struct AccelPerfCounters {
/// Total GPU/accelerator cycles executed for this context.
pub cycles_executed: u64,
/// Total bytes read from device memory.
pub bytes_read: u64,
/// Total bytes written to device memory.
pub bytes_written: u64,
/// Number of kernels/shaders executed.
pub kernel_count: u64,
/// Number of page faults (SVM/Unified Memory).
pub page_faults: u64,
/// Time spent stalled on memory (nanoseconds).
pub memory_stall_ns: u64,
/// Time spent executing (nanoseconds).
pub execution_ns: u64,
/// Device-specific counters (vendor-defined).
pub vendor_counters: [u64; 8],
/// Reserved for future fields. Zero-initialized.
pub _pad: [u8; 32],
}
const_assert!(core::mem::size_of::<AccelPerfCounters>() == 152);
const_assert!(core::mem::align_of::<AccelPerfCounters>() == 8);
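Two derived metrics commonly computed from these counters are compute utilization and average memory bandwidth. The sketch below uses a trimmed local mirror of the struct; it assumes `memory_stall_ns` is a subset of `execution_ns` (the field docs above do not state this explicitly), and notes that bytes per nanosecond equals GB/s exactly.

```rust
// Derived metrics from AccelPerfCounters fields (trimmed local mirror).
struct PerfSample {
    bytes_read: u64,
    bytes_written: u64,
    memory_stall_ns: u64,
    execution_ns: u64,
}

/// Fraction of execution time not stalled on memory, in [0.0, 1.0].
/// Assumes memory_stall_ns <= execution_ns (stall counted within execution).
fn compute_utilization(c: &PerfSample) -> f64 {
    if c.execution_ns == 0 {
        return 0.0;
    }
    let busy = c.execution_ns.saturating_sub(c.memory_stall_ns);
    busy as f64 / c.execution_ns as f64
}

/// Average device-memory bandwidth over the execution window.
/// bytes / ns == GB/s (both scale by 1e9), so no unit conversion is needed.
fn memory_bandwidth_gbps(c: &PerfSample) -> f64 {
    if c.execution_ns == 0 {
        return 0.0;
    }
    (c.bytes_read + c.bytes_written) as f64 / c.execution_ns as f64
}

fn main() {
    let s = PerfSample {
        bytes_read: 1_500_000_000,
        bytes_written: 500_000_000,
        memory_stall_ns: 250_000_000,
        execution_ns: 1_000_000_000, // 1 second of execution
    };
    assert!((compute_utilization(&s) - 0.75).abs() < 1e-9); // 25% stalled
    assert!((memory_bandwidth_gbps(&s) - 2.0).abs() < 1e-9); // 2 GB in 1 s
}
```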
#[repr(C)]
pub struct ComputeCapabilities {
/// Shader/kernel ISA generation (vendor-specific version number).
pub isa_version: u32,
/// Maximum work group / thread block size.
pub max_workgroup_size: u32,
/// Maximum shared / local memory per work group (bytes).
pub max_local_memory: u32,
/// Number of shader/compute cores.
pub shader_cores: u32,
/// FLOPS (single precision, billions).
pub fp32_gflops: u32,
/// FLOPS (half precision, billions, 0 if unsupported).
pub fp16_gflops: u32,
/// INT8 TOPS (billions of 8-bit integer ops, for inference).
pub int8_tops: u32,
/// Tensor core / matrix unit count (0 if none).
pub tensor_cores: u32,
pub _pad: [u8; 32],
}
// ComputeCapabilities: 8*4 + 32 = 64 bytes. KABI struct.
const_assert!(size_of::<ComputeCapabilities>() == 64);
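As a hypothetical illustration of how a Tier 1 driver's fault handler might choose a `FaultAction` and fill in `FaultResolution`: the `resolve_fault` function, its inputs, and the `phys_addr` field name below are illustrative assumptions, not part of the KABI excerpted above. The local stand-in struct reproduces the documented 24-byte layout.

```rust
// Sketch only: local stand-ins mirroring the KABI definitions above.
#[repr(u32)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FaultAction { Retry = 0, MapLocal = 1, MapRemote = 2, Signal = 3, KillContext = 4 }

// Simplified 24-byte resolution record: action(4) + 4 pad + u64*2, as in the
// const_assert above. The real struct's full field list is defined earlier;
// `phys_addr` is an assumed name for illustration.
#[repr(C)]
pub struct FaultResolution {
    pub action: FaultAction,
    pub phys_addr: u64,
    pub pte_flags: u64,
}

/// Hypothetical policy: a fault inside a valid allocation maps a device-local
/// page if one is available, retries otherwise (kernel allocates), and a
/// fault outside any allocation is signalled to the owning process.
pub fn resolve_fault(addr_is_valid: bool, local_page: Option<u64>) -> FaultResolution {
    match (addr_is_valid, local_page) {
        (true, Some(pa)) => FaultResolution {
            action: FaultAction::MapLocal,
            phys_addr: pa,
            pte_flags: 0b011, // illustrative encoding: readable | writable
        },
        (true, None) => FaultResolution { action: FaultAction::Retry, phys_addr: 0, pte_flags: 0 },
        (false, _) => FaultResolution { action: FaultAction::Signal, phys_addr: 0, pte_flags: 0 },
    }
}
```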
22.3.1 AccelDisplayVTable — Display-Capable Extensions¶
For devices with display output (GPUs with connectors). This vtable covers KMS mode-setting and framebuffer management; rendering/compositing acceleration is handled through AccelComputeVTable (compute shaders) or vendor-specific extensions. Advanced display features (HDR metadata, content-adaptive sync policies, display stream compression) are reserved for Phase 4 (Section 7.1).
/// Display vtable for accelerators with display output.
/// Covers mode setting, framebuffer management, and hotplug.
// KABI vtable — crosses driver-kernel boundary. Layout is stable across kernel versions.
#[repr(C)]
pub struct AccelDisplayVTable {
pub vtable_size: usize,
/// Primary version discriminant: `KabiVersion::as_u64()`. See [Section 12.2](12-kabi.md#kabi-abi-rules-and-lifecycle) Rule 6.
pub kabi_version: u64,
/// Enumerate display connectors on this device.
pub get_connectors: unsafe extern "C" fn(
ctx: *mut c_void,
out_connectors: *mut AccelConnectorInfo,
max_connectors: u32,
out_count: *mut u32,
) -> IoResultCode,
/// Get supported display modes for a connector.
pub get_modes: unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
out_modes: *mut AccelDisplayMode,
max_modes: u32,
out_count: *mut u32,
) -> IoResultCode,
/// Set the active display mode for a connector.
pub set_mode: unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
mode: *const AccelDisplayMode,
) -> IoResultCode,
/// Create a framebuffer object from device memory.
pub create_framebuffer: unsafe extern "C" fn(
ctx: *mut c_void,
mem: AccelMemHandle,
width: u32,
height: u32,
format: u32,
out_fb: *mut AccelFramebufferHandle,
) -> IoResultCode,
/// Schedule a page flip (swap front/back buffer).
pub page_flip: unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
fb: AccelFramebufferHandle,
out_fence: *mut DmaFence,
) -> IoResultCode,
/// Set display power management state.
pub set_dpms_state: unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
state: AccelDpmsState,
) -> IoResultCode,
/// Get hotplug status for a connector.
// KABI: u8 instead of bool for stable ABI (0=false, nonzero=true)
pub get_hotplug_status: unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
out_connected: *mut u8,
) -> IoResultCode,
/// Read EDID data from a connector.
pub read_edid: unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
out_edid: *mut u8,
max_size: u32,
out_size: *mut u32,
) -> IoResultCode,
/// Set hardware cursor image and position.
pub set_cursor: Option<unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
image: *const u8,
width: u32,
height: u32,
hot_x: u32,
hot_y: u32,
) -> IoResultCode>,
/// Enable/disable variable refresh rate (FreeSync/G-Sync).
pub set_vrr_enabled: Option<unsafe extern "C" fn(
ctx: *mut c_void,
connector_id: u32,
enabled: u8,
) -> IoResultCode>,
}
/// Display connector information.
/// Layout (24 bytes): connector_id(4) + connector_type(4) + max_width(4) +
/// max_height(4) + connected(1) + active(1) + _pad(6) = 24.
#[repr(C)]
pub struct AccelConnectorInfo {
/// Unique connector ID within this device.
pub connector_id: u32,
/// Connector type. Uses the standard DRM connector type values from
/// `ConnectorType` ([Section 13.3](13-device-classes.md#display-subsystem)): Unknown=0, VGA=1, DVII=2,
/// DVID=3, DVIA=4, Composite=5, ..., DisplayPort=10, HDMIA=11, EDP=14,
/// VIRTUAL=15, USB=20. `u32` matches `ConnectorType`'s `#[repr(u32)]`.
pub connector_type: u32,
/// Maximum resolution width supported (0 if unknown).
pub max_width: u32,
/// Maximum resolution height supported (0 if unknown).
pub max_height: u32,
/// Currently connected (1) or disconnected (0).
pub connected: u8,
/// Currently active (has a mode set).
pub active: u8,
pub _pad: [u8; 6],
}
const_assert!(core::mem::size_of::<AccelConnectorInfo>() == 24);
/// Display mode (resolution, refresh rate, timing).
#[repr(C)]
pub struct AccelDisplayMode {
/// Horizontal resolution in pixels.
pub width: u32,
/// Vertical resolution in lines.
pub height: u32,
/// Refresh rate in millihertz (e.g., 59950 for 59.95 Hz).
pub refresh_millihz: u32,
/// Clock frequency in kHz.
pub clock_khz: u32,
/// Horizontal front porch in pixels.
pub hfp: u16,
/// Horizontal sync pulse width in pixels.
pub hsw: u16,
/// Horizontal back porch in pixels.
pub hbp: u16,
/// Vertical front porch in lines.
pub vfp: u16,
/// Vertical sync pulse width in lines.
pub vsw: u16,
/// Vertical back porch in lines.
pub vbp: u16,
/// Preferred mode flag (monitor's preferred resolution).
pub preferred: u8,
pub _pad: [u8; 7],
}
const_assert!(core::mem::size_of::<AccelDisplayMode>() == 36);
/// Framebuffer handle (opaque).
#[repr(transparent)]
pub struct AccelFramebufferHandle(pub u64);
/// Command buffer handle (opaque, driver-allocated).
/// The driver allocates command buffer memory and returns this handle to the kernel
/// for tracking. The handle is used in `PendingSubmission` to identify which command
/// buffer to execute.
/// Created via `AccelBaseVTable::create_cmd_buffer`; destroyed via
/// `AccelBaseVTable::destroy_cmd_buffer`. See command buffer creation path below.
#[repr(transparent)]
pub struct AccelCmdBufferHandle(pub u64);
/// Semaphore handle (opaque, driver-allocated).
/// Used for cross-context synchronization. A submission can optionally signal a
/// semaphore on completion, allowing other contexts or userspace to wait on it.
#[repr(transparent)]
pub struct AccelSemaphoreHandle(pub u64);
/// Display power management state.
#[repr(u32)]
pub enum AccelDpmsState {
/// Display is on.
On = 0,
/// Display is in standby (minimal power, fast resume).
Standby = 1,
/// Display is suspended (low power, slower resume).
Suspend = 2,
/// Display is off.
Off = 3,
}
22.3.2 Command Buffer Creation Path¶
AccelCmdBufferHandle identifies a recorded command buffer. Handles are created and
destroyed via two vtable entries appended to AccelBaseVTable:
// Appended to AccelBaseVTable (after register_completion_callback and vendor_ioctl).
// Callers must check vtable_size before invoking per Section 12.1.4 versioning rules.
/// Create a command buffer for recording driver commands.
///
/// `size_hint`: estimated number of commands to be recorded (drivers may ignore).
/// Returns an `AccelCmdBufferHandle` opaque u64 identifying the buffer.
/// The buffer is initially empty; commands are added via driver-specific
/// `ACCEL_IOCTL_CMD_RECORD` ioctl calls before submission.
pub create_cmd_buffer: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context_id: AccelContextHandle,
size_hint: u32,
out_handle: *mut AccelCmdBufferHandle,
) -> IoResultCode>,
/// Destroy a command buffer and free its resources.
/// Must not be called while the buffer is submitted (in-flight).
pub destroy_cmd_buffer: Option<unsafe extern "C" fn(
ctx: *mut c_void,
buf: AccelCmdBufferHandle,
) -> IoResultCode>,
Userspace ioctl path:
- `ACCEL_IOCTL_CREATE_CMD_BUFFER` (ioctl number 0xA2): takes `{context_id: u64, size_hint: u32}` → returns `{handle: u64}`.
- `ACCEL_IOCTL_DESTROY_CMD_BUFFER` (ioctl number 0xA3): takes `{handle: u64}`.
- Commands are recorded into the buffer via `ACCEL_IOCTL_CMD_RECORD` (driver-specific; layout is opaque to the KABI layer and passed through `vendor_ioctl`).
Lifecycle:
create_cmd_buffer(context, size_hint)
→ record commands via ACCEL_IOCTL_CMD_RECORD (driver-specific)
→ submit_commands(context, cmd_buffer_handle, ...)
→ wait for completion via DmaFenceHandle / poll_completion
→ optionally resubmit (re-execute) without recreating the buffer
→ destroy_cmd_buffer(handle)
Command buffers may be resubmitted (re-executed) by calling submit_commands again
without recreating — the recorded command sequence is preserved. The driver is
responsible for ensuring that the buffer is not modified while an execution is
in-flight.
Complete handle lifecycle specification:
// umka-core/src/accel/handles.rs (kernel-internal)
/// Maximum command buffers per context.
/// Limits memory usage for command buffer tracking structures.
/// With 1024 buffers × 4KB average size = 4MB per context.
pub const MAX_CMD_BUFFERS_PER_CONTEXT: usize = 1024;
/// Maximum semaphores per context.
/// Limits memory usage for semaphore tracking structures.
/// With 2048 semaphores × 64 bytes = 128KB per context.
pub const MAX_SEMAPHORES_PER_CONTEXT: usize = 2048;
AccelCmdBufferHandle lifecycle:
| State | Description | Valid Operations |
|---|---|---|
| `ALLOCATED` | Buffer created via `create_cmd_buffer`, empty | `record_commands`, destroy |
| `RECORDING` | Commands being recorded via ioctl | `record_commands`, submit, destroy |
| `SUBMITTED` | Buffer submitted to hardware via `submit_commands` | `poll_completion` (none until complete) |
| `COMPLETED` | Hardware completed, fence signaled | resubmit, destroy |
| `INVALID` | Destroyed via `destroy_cmd_buffer` | (none) |
Lifecycle rules:
1. create_cmd_buffer returns a new handle in ALLOCATED state.
2. record_commands transitions to RECORDING and appends commands.
3. submit_commands transitions to SUBMITTED and increments the context's in_flight_count.
4. On completion (fence signaled), state transitions to COMPLETED.
5. destroy_cmd_buffer:
- If SUBMITTED: Returns EBUSY — caller must wait for completion first.
- If any other state: Frees driver resources, transitions to INVALID.
6. Context destruction with pending buffers:
- All SUBMITTED buffers are marked as COMPLETED with error status.
- All buffers are implicitly destroyed (no explicit destroy_cmd_buffer needed).
- Process receives ECONTEXTKILLED on any pending operations.
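The destroy rule (5) above reduces to a pure state-transition check. A minimal sketch, where `CmdBufState`, `DestroyError`, and `try_destroy` are illustrative stand-ins for the kernel-internal tracking, not KABI definitions:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum CmdBufState { Allocated, Recording, Submitted, Completed, Invalid }

#[derive(Debug, PartialEq)]
enum DestroyError { Busy, NotFound }

/// Rule 5: destroy is rejected while the buffer is in flight (EBUSY);
/// a handle that is already INVALID does not exist (ENOENT per the error
/// table); any other live state frees the buffer and transitions to INVALID.
fn try_destroy(state: CmdBufState) -> Result<CmdBufState, DestroyError> {
    match state {
        CmdBufState::Submitted => Err(DestroyError::Busy),
        CmdBufState::Invalid => Err(DestroyError::NotFound),
        _ => Ok(CmdBufState::Invalid),
    }
}
```

A real implementation would hold the per-buffer lock across this check and the driver's `destroy_cmd_buffer` call so the state cannot change in between.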
AccelSemaphoreHandle lifecycle:
| State | Description | Valid Operations |
|---|---|---|
| `ALLOCATED` | Semaphore created via `create_semaphore` | signal, wait, destroy |
| `SIGNAL_PENDING` | Signal operation in progress | wait (blocks), destroy (returns EBUSY) |
| `SIGNALED` | Semaphore has been signaled | wait (immediate return), reset, destroy |
| `INVALID` | Destroyed via `destroy_semaphore` | (none) |
Lifecycle rules:
1. create_semaphore returns a new handle in ALLOCATED state (unsignaled).
2. signal_semaphore transitions to SIGNAL_PENDING during async signal, then SIGNALED.
3. wait_semaphore:
- If SIGNALED: Returns immediately.
- If SIGNAL_PENDING or ALLOCATED: Blocks on internal wait queue.
- Timeout option available (returns ETIMEOUT on expiry).
4. reset_semaphore: Transitions SIGNALED → ALLOCATED (unsignaled).
5. destroy_semaphore:
- If SIGNAL_PENDING: Returns EBUSY — wait for signal completion first.
- If any other state: Wakes all waiters with ESEMDESTROYED, frees resources.
6. Context destruction with pending semaphores:
- All semaphores are implicitly destroyed.
- All waiters wake with ECONTEXTKILLED.
Error handling summary:
| Error | Meaning | Recovery |
|---|---|---|
| `EBUSY` | Handle is in use (submitted/signaling) | Wait for completion, retry |
| `EINVAL` | Invalid handle or operation | Programming error — fix caller |
| `ENOENT` | Handle does not exist | Already destroyed — ignore |
| `ECONTEXTKILLED` | Owning context was destroyed | Recreate context, resubmit |
| `ESEMDESTROYED` | Semaphore destroyed while waiting | Retry with new semaphore |
| `ETIMEOUT` | Wait operation timed out | Retry or report error |
22.3.3 In-Flight Limits and Semaphore Dependency Validation¶
The lifecycle tables above describe per-handle state transitions. This section specifies
the bounded limits enforced at submit_commands time and the kernel-side tracking
structures backing semaphore validation.
In-flight command buffer limit per context:
// umka-core/src/accel/limits.rs (kernel-internal)
/// Maximum command buffers in flight (submitted to hardware, not yet completed)
/// per AccelContext. This is distinct from MAX_CMD_BUFFERS_PER_CONTEXT (which
/// limits the total number of allocated, not necessarily submitted, buffers).
///
/// Backpressure behavior:
/// - Blocking submit (default): calling userspace thread sleeps until a slot
/// frees, woken by the completion interrupt handler.
/// - Non-blocking submit (ACCEL_SUBMIT_NONBLOCK flag): returns EAGAIN
/// immediately so the caller can poll or use a completion callback.
///
/// Rationale for 256: matches typical GPU hardware queue depths (e.g., NVIDIA
/// GR engine ring size = 256 entries) and keeps the kernel tracking array
/// (InflightSubmission × 256 ≈ 12KB per context) to a few 4KB pages.
pub const ACCEL_MAX_INFLIGHT_PER_CTX: usize = 256;
/// Maximum semaphore dependencies (wait + signal combined) per command buffer.
///
/// Enforced at submit_commands() time: if
/// wait_semaphores.len() + signal_semaphores.len() > ACCEL_MAX_SEMAPHORE_DEPS
/// the submission is rejected with KabiError::TooManyDeps.
///
/// Rationale: dependency graph resolution at submit time is O(D) where D = dep
/// count. Bounding D at 64 keeps the worst-case submit path deterministic and
/// prevents pathological O(n²) behaviour when many contexts are waiting on the
/// same large semaphore fan-in.
pub const ACCEL_MAX_SEMAPHORE_DEPS: usize = 64;
These limits are checked in the submit_commands fast path:
// Enforced at submit_commands() entry, before any driver vtable call:
fn validate_submission_limits(
ctx: &AccelContextState,
params: &AccelSubmitParams,
) -> Result<(), KabiError> {
// Limit 1: in-flight queue depth.
if ctx.inflight.len() >= ACCEL_MAX_INFLIGHT_PER_CTX {
return Err(KabiError::QueueFull);
}
// Limit 2: semaphore dependency count.
let total_deps = params.wait_semaphores.len()
.saturating_add(params.signal_semaphores.len());
if total_deps > ACCEL_MAX_SEMAPHORE_DEPS {
return Err(KabiError::TooManyDeps);
}
Ok(())
}
Kernel-side semaphore tracking structure:
// umka-core/src/accel/semaphore.rs (kernel-internal)
/// Kernel-side state for a single AccelSemaphoreHandle within one AccelDevice.
/// Stored in AccelDevice::semaphore_table, keyed by the handle's u64 value.
pub struct SemaphoreState {
/// Whether the semaphore has been signaled.
/// Set to true atomically by the completion interrupt handler when the
/// signaling command buffer completes on hardware.
pub signaled: AtomicBool,
/// The command buffer currently designated to signal this semaphore.
/// `None` if no pending command buffer has listed this semaphore in its
/// `signal_semaphores`. Set at submit_commands() time; cleared on completion.
/// Weak reference: the AccelCmdBuffer may be freed (e.g., context destroyed)
/// without invalidating this pointer — the Weak upgrade check handles that.
/// **Weak::upgrade() failure semantics**: If `upgrade()` returns `None`,
/// the signaling command buffer was destroyed (context teardown). The
/// semaphore is treated as "signaled by destruction" — all waiters are
/// unblocked with a `SemaphoreSignalReason::ContextDestroyed` status,
/// and the semaphore transitions to signaled state.
pub signal_cmd: Mutex<Option<Weak<AccelCmdBuffer>>>,
/// Number of command buffers currently waiting on this semaphore.
/// Incremented at submit_commands() when the semaphore appears in
/// `wait_semaphores`; decremented on each waiter's completion or cancellation.
/// destroy_semaphore() is rejected with KabiError::Busy if this is > 0.
pub wait_count: AtomicU32,
/// Whether userspace has called destroy_semaphore() on this handle.
/// Once true, any command buffers still waiting on it will be aborted
/// with KabiError::SemaphoreDestroyed on their next scheduling attempt.
pub destroyed: AtomicBool,
}
Semaphore validation in submit_commands():
After validate_submission_limits() passes, the following checks are applied to
every semaphore handle listed in the submission (O(D) where D ≤ ACCEL_MAX_SEMAPHORE_DEPS):
For each handle h in params.wait_semaphores:
1. Look up h in AccelDevice::semaphore_table.
→ Not found: return Err(KabiError::InvalidHandle) // handle was never created
2. If semaphore_table[h].destroyed == true:
→ return Err(KabiError::InvalidHandle) // already destroyed
3. If semaphore_table[h].signaled == true:
→ treat as satisfied; do NOT add a hardware dependency for h.
(Avoids stalling hardware on an already-completed signal.)
4. Else: increment semaphore_table[h].wait_count by 1.
(The decrement happens in the completion handler when this cmd buffer finishes.)
For each handle h in params.signal_semaphores:
1. Look up h in AccelDevice::semaphore_table.
→ Not found: return Err(KabiError::InvalidHandle)
2. If semaphore_table[h].destroyed == true:
→ return Err(KabiError::InvalidHandle)
3. If semaphore_table[h].signal_cmd (locked) is Some(_):
→ return Err(KabiError::AlreadySignaling) // another cmd already owns this semaphore
(One signaler per semaphore; two concurrent signalers would produce undefined ordering.)
4. Else: set semaphore_table[h].signal_cmd = Weak::from(this cmd buffer).
If any check fails, the entire submission is rejected (no partial side-effects — wait_counts that were incremented in earlier iterations are decremented before returning the error).
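The wait-side checks and the all-or-nothing rollback described above can be sketched as follows. `SemEntry` is a pared-down stand-in for `SemaphoreState` (no `signal_cmd` tracking), and `validate_waits` is an assumed helper name:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};

/// Simplified stand-in for SemaphoreState: only the fields the wait-side
/// validation touches.
#[derive(Default)]
struct SemEntry {
    signaled: AtomicBool,
    destroyed: AtomicBool,
    wait_count: AtomicU32,
}

#[derive(Debug, PartialEq)]
enum KabiError { InvalidHandle }

/// Validate the wait_semaphores list of a submission. On any failure, roll
/// back every wait_count increment made so far (no partial side-effects).
fn validate_waits(table: &HashMap<u64, SemEntry>, waits: &[u64]) -> Result<(), KabiError> {
    let mut incremented: Vec<u64> = Vec::new();
    for &h in waits {
        let entry = match table.get(&h) {
            Some(e) if !e.destroyed.load(Ordering::Acquire) => e,
            _ => {
                // Roll back earlier increments before reporting the error.
                for &p in &incremented {
                    table[&p].wait_count.fetch_sub(1, Ordering::AcqRel);
                }
                return Err(KabiError::InvalidHandle);
            }
        };
        // Already-signaled semaphores are satisfied; no hardware dependency.
        if !entry.signaled.load(Ordering::Acquire) {
            entry.wait_count.fetch_add(1, Ordering::AcqRel);
            incremented.push(h);
        }
    }
    Ok(())
}
```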
Context destruction with outstanding semaphores:
On destroy_context() with semaphores still in the semaphore table for that context:
- For every semaphore owned by the context, set `destroyed = true` atomically.
- Any command buffer in another context that is waiting on such a semaphore will, on its next scheduling check, find `destroyed == true` and be aborted with `KabiError::SemaphoreDestroyed` rather than blocking indefinitely.
- The kernel does NOT automatically call `destroy_semaphore` for each handle — that would require enumerating all pending command buffers across all contexts, which is O(C × D) and not bounded-latency. Instead, the lazy-destroy mechanism (the `destroyed` flag check in the scheduling path) handles cleanup at O(1) per check.
- After context destruction, `wait_count` may transiently remain > 0 on destroyed semaphores until their waiters are aborted. The semaphore entries are freed from the table only after `wait_count` reaches 0 (checked in the waiter-abort path).
22.3.4 Scheduler Integration¶
The AccelScheduler interacts with the CPU scheduler to manage threads that are waiting for GPU work completion:
- GPU-blocked threads: Threads waiting on `poll_completion` or a completion callback are placed in interruptible sleep (same state as I/O wait). They do not consume CPU time while waiting.
- Completion wakeup: When the completion notification callback fires (see `register_completion_callback`), the waiting thread is woken directly from the interrupt handler — no polling required.
- CPU vruntime accounting: Time spent blocked on GPU completion is treated like I/O wait: it does not accrue CPU debt. On wakeup, the thread receives a latency bonus similar to I/O-bound tasks in CFS, ensuring responsive scheduling for interactive GPU workloads.
- Realtime priority contexts: For `AccelPriority::Realtime` contexts, the completion interrupt is handled by a dedicated high-priority IRQ thread to minimize wakeup latency.
22.3.5 Integration with UmkaOS Architecture¶
22.3.6 Device Registry Integration¶
Accelerators are modeled in the device registry (Section 11.4) with rich sub-device structure:
pci0000:00
+-- 0000:41:00.0 (NVIDIA A100 GPU)
+-- Properties:
| device-class: "gpu"
| compute-units: 108
| local-memory-bytes: 85899345920 (80 GB HBM2e)
| tensor-cores: 432
| p2p-capable: true
| preemption: "instruction"
+-- Services published:
| "accel-compute" (AccelComputeVTable)
| "rdma-target" (for GPUDirect RDMA)
+-- Children:
+-- partition0 (MIG 3g.40gb) # A100-80GB supports up to 2x 3g.40gb
+-- partition1 (MIG 3g.40gb)
+-- nvenc0 (video encoder sub-device)
+-- nvdec0 (video decoder sub-device)
Example: NVIDIA RTX 4090 (desktop GPU with display output):
pci0000:00
+-- 0000:01:00.0 (NVIDIA RTX 4090)
+-- Properties:
| device-class: "gpu"
| compute-units: 128
| local-memory-bytes: 25769803776 (24GB GDDR6X)
| tensor-cores: 512
| p2p-capable: false
| preemption: "instruction"
+-- Services published:
| "accel-compute" (AccelComputeVTable)
| "accel-display" (AccelDisplayVTable)
+-- Children:
+-- nvenc0 (video encoder sub-device)
+-- nvdec0 (video decoder sub-device)
Power Management Integration:
On system suspend, the AccelScheduler drains in-flight submissions (with a timeout
matching the device registry's suspend timeout, Section 11.4). Pending queue entries
are preserved across suspend/resume and resubmitted after the device resumes. The
AccelScheduler reports idle/busy state to the device registry for runtime power
management decisions. AccelPowerState maps to the registry's PowerState as follows:
D0Active ↔ active, D1LowPower ↔ idle.
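A minimal sketch of the suspend-time drain described above, under the assumption that in-flight work is polled until complete or until the registry's suspend timeout expires; `drain_for_suspend` and `DrainResult` are illustrative names, not part of the spec:

```rust
use std::time::{Duration, Instant};

/// Outcome of draining a device's queues before system suspend.
#[derive(Debug, PartialEq)]
enum DrainResult {
    /// All in-flight work completed; `preserved` pending entries survive
    /// suspend and are resubmitted after resume.
    Drained { preserved: usize },
    /// In-flight work did not complete within the suspend timeout.
    TimedOut { still_inflight: usize },
}

/// Illustrative drain loop: `poll_inflight` stands in for checking hardware
/// completion state (it returns the number of submissions still executing).
fn drain_for_suspend(
    mut poll_inflight: impl FnMut() -> usize,
    pending_queue_len: usize,
    timeout: Duration,
) -> DrainResult {
    let deadline = Instant::now() + timeout;
    loop {
        let inflight = poll_inflight();
        if inflight == 0 {
            // Pending (unsubmitted) entries are preserved across suspend.
            return DrainResult::Drained { preserved: pending_queue_len };
        }
        if Instant::now() >= deadline {
            return DrainResult::TimedOut { still_inflight: inflight };
        }
        std::thread::yield_now(); // real kernel: sleep on the completion event
    }
}
```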
22.3.7 Crash Recovery¶
GPU driver crashes are currently catastrophic in Linux. In UmkaOS:
1. GPU driver (Tier 1, domain-isolated) faults.
2. UmkaOS Core detects the fault.
3. Device registry transitions GPU node to Recovering.
4. AccelScheduler:
a. Marks all active contexts as "interrupted."
b. Completes all pending submissions with error status.
c. Notifies all processes waiting on completions.
d. Resolves cross-device fence waiters: iterates FenceWaiterEntry list
for all fences owned by the crashed device and invokes each callback
with AccelCompletionStatus::DeviceLost. Waiters receiving DeviceLost
can retry after device recovery or propagate the error to userspace.
This prevents indefinite waiter stalls when a source device dies.
5. Registry orchestrates crash recovery (Section 11.4.10):
a. Revoke driver's isolation domain.
b. GPU device reset (PCIe FLR or driver-specific reset).
c. Reload driver binary.
d. Fresh KABI vtable exchange.
6. AccelScheduler recreates contexts for processes that are still running.
- Process state is on CPU side (command buffers). Only the GPU-side state is lost.
- Processes receive an error on their in-flight submissions and can retry.
7. Total recovery time: ~100ms–5s (dominated by GPU hardware reset and driver reload; varies by hardware).
**Note**: This is *reset* latency, which is distinct from *preemption* latency:
- **Preemption**: ~50μs-10ms — saving context state to allow another workload to run (reversible).
- **Reset**: ~100ms-5s — full hardware reinitialization, firmware reload, PCIe FLR (destructive).
Simple GPU resets (PCIe FLR + driver reload) typically complete in ~100-500ms.
Datacenter GPUs with GSP firmware reload (e.g., NVIDIA H100) may take 2-5s.
Preemption is used for fairness enforcement; reset is used for crash recovery and HROT.
Compare Linux: full system reboot (30-60 seconds), loss of all work.
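Steps 4a and 4b of the recovery sequence can be sketched as a single pass over the scheduler's context table. The `Context` struct and the `on_device_crash` helper below are simplified stand-ins for the AccelScheduler's internal state, not spec definitions:

```rust
/// Simplified stand-ins for the scheduler's per-context tracking.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CtxState { Running, Interrupted }

#[derive(Debug, Clone, Copy, PartialEq)]
enum CompletionStatus { Ok, DeviceLost }

struct Context {
    state: CtxState,
    pending: Vec<Option<CompletionStatus>>, // None = still in flight
}

/// Steps 4a and 4b: mark every active context "interrupted" and complete
/// every in-flight submission with an error status, so waiters can be
/// woken (step 4c) instead of stalling forever. Returns the number of
/// submissions failed.
fn on_device_crash(contexts: &mut [Context]) -> usize {
    let mut failed = 0;
    for ctx in contexts.iter_mut() {
        ctx.state = CtxState::Interrupted;
        for slot in ctx.pending.iter_mut() {
            if slot.is_none() {
                *slot = Some(CompletionStatus::DeviceLost);
                failed += 1;
            }
        }
    }
    failed
}
```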
22.3.8 GPU Firmware as Cluster Member (Future)¶
Modern GPUs run sophisticated firmware that manages scheduling, memory, and P2P transfers. NVIDIA GPUs run proprietary firmware; AMD GPUs run open-source firmware. This firmware is effectively a specialized OS running on the GPU's control processor.
In UmkaOS's distributed kernel model (Section 5.1), GPU firmware can potentially participate as a first-class cluster member:
Current state (Phase 1-3): - Host kernel controls GPU via driver (Tier 1) - GPU firmware is passive — it responds to host commands but does not initiate - All scheduling decisions, memory allocation, and work dispatch are host-driven
Future state (Phase 5+):
- GPU firmware implements UmkaOS cluster membership protocol
- GPU registers as a cluster node with NodeId, exposes VRAM as remote-accessible
memory in the device registry
- GPU firmware participates in DSM (Distributed Shared Memory) and DLM (Distributed
Lock Manager) — allows GPU-initiated RDMA transfers, GPU-to-GPU locking without
CPU involvement
- Work can be scheduled across the cluster transparently: a cgroup's workload could
span CPUs on node0, GPUs on node0 and node1, and a DPU on node2
Benefits: - GPU-initiated transfers: GPU firmware can directly request pages via DSM without waking the CPU driver, reducing latency for fine-grained data access patterns - GPU-to-GPU coordination: Two GPUs on different nodes can synchronize via DLM without routing through their respective host CPUs — critical for distributed training with NCCL/RCCL - Heterogeneous scheduling: The kernel scheduler sees CPUs and GPUs as a unified pool of compute resources, not separate domains
Requirements: - GPU firmware must implement UmkaOS's inter-kernel messaging protocol (see Section 5.2 "Device-local kernels as cluster members" for detailed protocol specification) - Three implementation paths: (A) run full UmkaOS on GPU's control processor, (B) firmware shim translating UmkaOS messages to native GPU operations, or (C) traditional driver + host-proxy service provider (Section 5.7) - Requires firmware modifications by NVIDIA/AMD/Intel or open-source GPU firmware (e.g., Nouveau, AMDGPU firmware) - Security model must prevent malicious GPU firmware from compromising host integrity — cluster membership is opt-in per device and requires signed firmware verification
This is not required for UmkaOS to function. It is an optional future enhancement that treats the "multikernel" model (one kernel per hardware domain) as a first-class design pattern rather than a hack. See Section 5.2 "Device-local kernels as cluster members" for SmartNIC/DPU equivalents.
22.3.9 FMA Integration¶
The FMA engine (Section 20.1) monitors accelerator health:
// Accelerator-specific health events
HealthEventClass::Accelerator // New class
// Health data for accelerators
// Struct is exactly 64 bytes (one cache line) for optimal cache efficiency.
// Field accounting (all offsets in bytes):
// temperature_mc: 4 (u32) offset 0
// power_mw: 4 (u32) offset 4
// ecc_correctable: 8 (u64) offset 8 (already 8-byte aligned, no pad needed)
// ecc_uncorrectable: 8 (u64) offset 16
// throttle_count: 4 (u32) offset 24
// error_code: 4 (u32) offset 28
// pcie_replay_count: 4 (u32) offset 32
// _pad: 28 ([u8;28]) offset 36 → total 64
#[repr(C)]
pub struct AccelHealthData {
/// GPU temperature (millidegrees Celsius).
/// u32 required: GPU operating temperatures of 80–95°C = 80,000–95,000
/// millidegrees, exceeding i16/u16 range (u16 max = 65,535).
pub temperature_mc: u32,
/// Power draw (milliwatts).
pub power_mw: u32,
/// ECC error count (VRAM), correctable.
/// Starts at offset 8, which is already 8-byte aligned, so no explicit
/// padding is needed before it.
pub ecc_correctable: u64,
/// ECC error count (VRAM), uncorrectable.
pub ecc_uncorrectable: u64,
/// Thermal throttling events.
pub throttle_count: u32,
/// XID error code (NVIDIA) or equivalent vendor error code.
pub error_code: u32,
/// PCIe replay count (indicates link instability).
pub pcie_replay_count: u32,
/// Padding to reach exactly 64 bytes (one cache line).
pub _pad: [u8; 28],
}
// AccelHealthData: u32(4)*2 + u64(8)*2 + u32(4)*3 + [u8;28] = 64 bytes.
// KABI struct — crosses driver boundary via health reporting.
const_assert!(core::mem::size_of::<AccelHealthData>() == 64);
const_assert!(core::mem::align_of::<AccelHealthData>() == 8);
FMA diagnosis rules for accelerators:
| Rule | Threshold | Action |
|---|---|---|
| VRAM ECC degradation | 100 correctable / hour | Alert + schedule maintenance |
| VRAM uncorrectable error | 1 event | DisableDevice + Alert |
| Thermal throttling | 10 events / hour | Alert (may indicate cooling issue) |
| PCIe link unstable | 50 replays / minute | DemoteTier + Alert |
| Repeated driver crashes | 3 in 60 seconds | DemoteTier (move to Tier 2) |
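The per-hour and per-minute thresholds above suggest a sliding-window rate check. A minimal sketch, with illustrative names (`RateRule` is not from the FMA spec):

```rust
use std::collections::VecDeque;

/// Sliding-window rate rule: fires when `threshold` or more events occur
/// within `window_ns`. Mirrors rules like "100 correctable ECC / hour".
struct RateRule {
    threshold: usize,
    window_ns: u64,
    events: VecDeque<u64>, // timestamps (ns) of recent events
}

impl RateRule {
    fn new(threshold: usize, window_ns: u64) -> Self {
        Self { threshold, window_ns, events: VecDeque::new() }
    }

    /// Record one event at `now_ns`; returns true if the rule fires.
    fn record(&mut self, now_ns: u64) -> bool {
        // Expire events that fell out of the window.
        while let Some(&t) = self.events.front() {
            if now_ns.saturating_sub(t) > self.window_ns {
                self.events.pop_front();
            } else {
                break;
            }
        }
        self.events.push_back(now_ns);
        self.events.len() >= self.threshold
    }
}
```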
22.3.10 Stable Tracepoints¶
New stable tracepoints for accelerator observability (Section 20.2).
All tracepoints in this table use Category: TracepointCategory::Accelerator.
| Tracepoint | Arguments | Description |
|---|---|---|
| `umka_tp_stable_accel_submit` | device_id, context, cmd_size, priority | Command submitted |
| `umka_tp_stable_accel_complete` | device_id, context, latency_ns, error | Command completed |
| `umka_tp_stable_accel_preempt` | device_id, preempted_ctx, preempting_ctx | Context preempted |
| `umka_tp_stable_accel_migrate` | device_id, direction, pages, bytes | Memory migration |
| `umka_tp_stable_accel_fault` | device_id, context, fault_addr | Device page fault |
| `umka_tp_stable_accel_oom` | device_id, requested, available | Device memory exhaustion |
| `umka_tp_stable_accel_p2p` | src_device, dst_device, bytes, latency_ns | P2P DMA transfer |
22.3.11 ML Policy Observation Types¶
The accelerator scheduler emits observations to the ML policy observation bus
(Section 23.1) using
SubsystemId::Accel (12). The obs_type field encodes the metric:
| `obs_type` | Metric | `features[]` layout |
|---|---|---|
| 0 | Utilization | [0] = utilization (0-1000), [1] = device_id |
| 1 | Queue depth | [0] = queue_depth, [1] = device_id |
| 2 | Power draw | [0] = milliwatts, [1] = device_id |
| 3 | Temperature | [0] = millidegrees_C, [1] = device_id |
| 4 | CBS throttle | [0] = context_id, [1] = device_id, [2] = deficit_ns low32 |
| 5 | Memory pressure | [0] = vram_used_mb, [1] = vram_total_mb, [2] = device_id |
| 6 | Preemption | [0] = preempted_ctx, [1] = preempting_ctx, [2] = device_id |
The accelerator scheduler calls observe_kernel!(SubsystemId::Accel, obs_type, ...)
at these points:
- Utilization: once per scheduler tick (1ms default) per device.
- CBS throttle: on each throttle/unthrottle event.
- Memory pressure: on VRAM allocation failure (before eviction).
- Preemption: on each context preemption.
These observations feed the same ObservationRing as CPU scheduler metrics,
enabling the ML policy service to correlate CPU and accelerator behavior for
holistic resource optimization.
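A minimal sketch of what the emission side might look like, assuming a bounded ring that drops the oldest record rather than blocking the scheduler tick path. The `Observation` struct and field names here are illustrative; the real record format is defined in Section 23.1:

```rust
use std::collections::VecDeque;

/// Illustrative observation record matching the table above: subsystem 12
/// (Accel), an obs_type discriminant, and a small feature vector.
#[derive(Debug, Clone, PartialEq)]
struct Observation {
    subsystem_id: u32, // SubsystemId::Accel = 12
    obs_type: u32,
    features: [u64; 4],
}

/// Bounded ring: the oldest observation is dropped when full, so emitters
/// never block (emission happens on the 1ms scheduler tick path).
struct ObservationRing {
    cap: usize,
    buf: VecDeque<Observation>,
}

impl ObservationRing {
    fn new(cap: usize) -> Self {
        Self { cap, buf: VecDeque::with_capacity(cap) }
    }

    fn observe(&mut self, obs_type: u32, features: [u64; 4]) {
        if self.buf.len() == self.cap {
            self.buf.pop_front(); // drop oldest rather than block the emitter
        }
        self.buf.push_back(Observation { subsystem_id: 12, obs_type, features });
    }
}
```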
22.3.12 Object Namespace¶
Accelerators appear in the unified object namespace (Section 20.5):
\Devices\pci0000:00\0000:41:00.0 # GPU device object
\Accelerators\gpu0 # Symlink to device
\Accelerators\gpu0\Contexts\ # Active execution contexts
\Accelerators\gpu0\Memory\ # Memory allocation tracking
\Accelerators\gpu0\Partitions\ # MIG partitions
Browsable via umkafs:
cat /mnt/umka/Accelerators/gpu0/Memory
# # type: AcceleratorMemory
# # total: 40000000000 (40 GB)
# # used: 27000000000 (27 GB)
# # contexts:
# # ctx[pid=1234]: 16106127360 (15 GB) - training job
# # ctx[pid=5678]: 8589934592 (8 GB) - inference server
# # ctx[pid=9012]: 4294967296 (4 GB) - development
22.3.13 Partial Failure Handling¶
Single-context error: If a single execution context encounters an error (shader
hang, invalid command), the AccelScheduler calls reset_context() on the affected
context only. Other contexts on the same device continue unaffected. The affected
process receives an error status on its next poll_completion call or via the
completion callback.
ECC uncorrectable error: When device memory reports an uncorrectable ECC error,
the affected page is retired (analogous to CPU page retirement, Section 20.1). The
owning context is notified, and if possible, data is migrated to a healthy page. If
the page content is unrecoverable, the owning process receives SIGBUS.
Timeout without crash: The AccelScheduler enforces max_execution_us via a kernel
timer. When a submission exceeds its time limit:
- If the device supports instruction-level preemption: preempt the overdue context,
save its state, and return AccelCompletionStatus::Preempted.
- If preemption is not supported: wait until the current command buffer boundary, then
fail the overdue submission with AccelCompletionStatus::Timeout.
- In either case, the context remains valid and the process can submit new work.
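The timeout decision above reduces to a small branch on the device's preemption capability; `on_execution_timeout` and `TimeoutAction` are illustrative names, not spec identifiers:

```rust
#[derive(Debug, PartialEq)]
enum TimeoutAction {
    /// Device supports instruction-level preemption: save context state and
    /// return AccelCompletionStatus::Preempted.
    PreemptAndSave,
    /// No preemption: fail at the next command buffer boundary with
    /// AccelCompletionStatus::Timeout.
    FailAtBufferBoundary,
}

/// Decide how to handle a submission that exceeded max_execution_us.
/// In either case the context stays valid and may submit new work.
fn on_execution_timeout(supports_instruction_preemption: bool) -> TimeoutAction {
    if supports_instruction_preemption {
        TimeoutAction::PreemptAndSave
    } else {
        TimeoutAction::FailAtBufferBoundary
    }
}
```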
22.3.14 Resolved Design Decisions¶
1. /dev/umka-accel-N ioctl specification.
Each ioctl struct starts with a u32 version header; the kernel checks version
compatibility before dispatching. Ioctl command numbers use the following range
assignments:
| Range | Purpose |
|---|---|
| `0x00–0x1F` | Device query (caps, topology, memory info, clock info) |
| `0x20–0x3F` | Context management (create, destroy, set priority) |
| `0x40–0x5F` | Memory management (alloc, free, map, migrate) |
| `0x60–0x7F` | Command submission (submit, wait, fence create/signal) |
| `0x80–0x9F` | Display/KMS (mode set, framebuffer, hotplug) — routed to AccelDisplayVTable |
| `0xA0–0xBF` | Health/telemetry (temperature, ECC, utilization query) |
| `0xC0–0xFF` | Vendor-private (opaque passthrough to driver) |
The vendor-private range 0xC0–0xFF passes the raw buffer to
AccelBaseVTable::vendor_ioctl() — the kernel does not interpret it, only validates
buffer bounds and capability permissions. This gives vendors (NVIDIA, AMD, Intel) a
stable passthrough channel without polluting the common ioctl space.
Security note (E1 fix): Vendor-private ioctls make the vendor driver part of the
Trusted Computing Base (TCB) for any process with CAP_ACCEL access. A vendor
driver can implement arbitrary privileged operations within the passthrough
range. This is an accepted tradeoff: GPU command submission inherently requires
ring 0 memory mapping (GART/GTT) and DMA that cannot be fully mediated without
unacceptable overhead. The mitigation is the Tier 1 MPK isolation domain — a
vendor driver bug corrupts only its own domain, not umka-core.
Vendor-private ioctl buffer validation:
To prevent buffer overflow attacks via vendor-private ioctls, the kernel enforces:
- Size prefix requirement: The first 4 bytes of any vendor-private ioctl buffer must contain the buffer size (little-endian `u32`), analogous to the size encoded in the `_IOC_SIZE` bits of a Linux ioctl number. The kernel reads the size prefix before passing the buffer to `vendor_ioctl`.
- Maximum buffer size: Vendor-private ioctl buffers are capped at 4KB (`MAX_VENDOR_IOCTL_SIZE = 4096`). Larger buffers are rejected with `-EINVAL`. This bounds the attack surface and prevents drivers from receiving arbitrarily large buffers that could overflow internal structures.
- Capability requirement: Vendor-private ioctls require the `CAP_ACCEL` capability. The `ACCEL_COMPUTE` capability (`0x0100`) is the minimum; certain vendor ioctls may require additional capabilities (e.g., `ACCEL_P2P` for P2P-related vendor extensions).
- Kernel validation before passthrough:
/// Validate a vendor-private ioctl buffer before passing to driver.
/// Returns the validated buffer size, or an error code.
fn validate_vendor_ioctl_buf(user_buf: *const u8, buf_len: u32) -> Result<u32, i32> {
    // 1. Check basic bounds
    if buf_len < 4 {
        return Err(-EINVAL); // Too small to contain size prefix
    }
    // 2. Read size prefix from userspace
    let size_prefix = read_user_u32(user_buf as *const u32)?; // -EFAULT on bad access
    // 3. Verify size prefix matches provided length
    if size_prefix != buf_len {
        return Err(-EINVAL); // Size mismatch = potential attack
    }
    // 4. Enforce maximum size
    if size_prefix > MAX_VENDOR_IOCTL_SIZE {
        return Err(-EINVAL); // Buffer too large
    }
    // 5. Verify buffer is readable/writable by caller
    if !access_ok(user_buf, size_prefix as usize) {
        return Err(-EFAULT);
    }
    Ok(size_prefix)
}
- Driver receives validated buffer: By the time `vendor_ioctl` is called, the kernel has already validated the buffer size and bounds. The driver receives `user_buf: *mut u8` and `buf_len: u32` with the guarantee that `buf_len <= 4096` and the buffer is accessible.
Versioning scheme: ioctl structs are append-only. New fields are added at the end. The
version header tells the kernel which fields are present. The kernel zero-fills any
fields beyond the caller's version (forward compat) and ignores trailing fields beyond
its own version (backward compat). This mirrors the KABI vtable versioning protocol
(Section 12.2).
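The zero-fill/ignore rule above can be sketched as a plain byte-copy helper. This is an illustrative sketch only; the real kernel copy-in also validates the version header and the user pointer:

```rust
/// Copy a versioned, append-only ioctl struct from a caller buffer into the
/// kernel's view of the struct. `caller` is the raw bytes the caller passed;
/// `kernel_buf` has the size of the kernel's (possibly newer) definition.
fn copy_versioned(caller: &[u8], kernel_buf: &mut [u8]) {
    let n = caller.len().min(kernel_buf.len());
    // Fields both sides know about: copied verbatim.
    kernel_buf[..n].copy_from_slice(&caller[..n]);
    // Fields beyond the caller's version: zero-filled (forward compat).
    for b in &mut kernel_buf[n..] {
        *b = 0;
    }
    // Trailing caller bytes beyond kernel_buf.len() are simply ignored
    // (backward compat with newer userspace).
}
```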
Standard ioctl command table for /dev/umka-accel-N:
Standard kernel-defined commands occupy the 0x00–0xBF range (query, context, memory, submission, display, and health per the range table above). Vendor-specific commands (0xC0–0xFF) pass through to the driver after validation (see above).
pub const ACCEL_IOCTL_TYPE: u8 = b'A'; // 0x41
// --- Device query (0x00–0x1F) ---
/// Query device capabilities.
pub const ACCEL_GET_CAPS: u32 = _IOR(b'A', 0x00, AccelCapabilities);
/// Query context status (pending submissions, CBS budget remaining).
pub const ACCEL_GET_CTX_STATUS: u32 = _IOWR(b'A', 0x01, AccelContextStatusParams);
// --- Context management (0x20–0x3F) ---
/// Create a new accelerator context. Returns context handle.
pub const ACCEL_CREATE_CONTEXT: u32 = _IOWR(b'A', 0x20, AccelCreateContextParams);
/// Destroy a context. All pending submissions are cancelled.
pub const ACCEL_DESTROY_CONTEXT: u32 = _IOW(b'A', 0x21, AccelContextHandle);
// --- Memory management (0x40–0x5F) ---
/// Map a buffer object for device access. Returns device virtual address.
pub const ACCEL_MAP_BO: u32 = _IOWR(b'A', 0x40, AccelMapBoParams);
/// Unmap a buffer object.
pub const ACCEL_UNMAP_BO: u32 = _IOW(b'A', 0x41, AccelUnmapBoParams);
/// Hint: prefetch pages to device memory before submission.
pub const ACCEL_PREFETCH: u32 = _IOW(b'A', 0x42, AccelPrefetchParams);
// --- Command submission (0x60–0x7F) ---
/// Submit a command buffer for execution. Non-blocking.
/// Returns a fence handle for completion tracking.
pub const ACCEL_SUBMIT: u32 = _IOWR(b'A', 0x60, AccelSubmitParams);
/// Wait for a fence to signal. Blocks calling thread.
/// _IOWR (not _IOW): the kernel writes `out_status` back to the caller.
pub const ACCEL_WAIT_FENCE: u32 = _IOWR(b'A', 0x61, AccelWaitFenceParams);
/// Parameters for ACCEL_SUBMIT ioctl.
#[repr(C)]
pub struct AccelSubmitParams {
/// Context handle (from ACCEL_CREATE_CONTEXT).
pub context: AccelContextHandle,
/// Pointer to command buffer in userspace (u64 for 32/64-bit compat).
pub cmd_buffer_ptr: u64,
/// Command buffer size in bytes.
pub cmd_buffer_size: u32,
/// Estimated execution time in nanoseconds (for CBS charging).
/// 0 = unknown; scheduler substitutes period_ns / 4.
/// u64 to match PendingSubmission::estimated_ns (nanosecond values can
/// exceed 4.29 s for long-running GPU dispatches).
pub estimated_ns: u64,
/// Fence dependencies: submission waits for these fences before executing.
pub wait_fences_ptr: u64, // *const DmaFenceHandle, as u64
pub wait_fences_count: u32,
/// Priority override for this submission (0 = use context default).
pub priority_override: u32,
/// [OUT] Fence handle signaled on completion.
pub out_fence: DmaFenceHandle,
/// Reserved for future fields (zero-initialized by userspace; kernel
/// ignores non-zero values in current version). Named `_reserved`
/// (not `_pad`) to indicate semantic intent: these bytes may be
/// repurposed in future KABI versions. `_pad` fields are purely
/// for alignment and will never carry data.
pub _reserved: [u8; 12],
}
// Layout (repr(C), 8-byte natural alignment):
// context: AccelContextHandle (u64) offset 0, size 8
// cmd_buffer_ptr: u64 offset 8, size 8
// cmd_buffer_size: u32 offset 16, size 4
// _pad0: [u8; 4] (implicit) offset 20, size 4 (align estimated_ns to 8)
// estimated_ns: u64 offset 24, size 8
// wait_fences_ptr: u64 offset 32, size 8
// wait_fences_count: u32 offset 40, size 4
// priority_override: u32 offset 44, size 4
// out_fence: DmaFenceHandle (u64) offset 48, size 8
// _reserved: [u8; 12] offset 56, size 12
// _pad1: [u8; 4] (implicit) offset 68, size 4 (align struct to 8)
// Total: 72 bytes
const_assert!(core::mem::size_of::<AccelSubmitParams>() == 72);
const_assert!(core::mem::align_of::<AccelSubmitParams>() == 8);
/// Per-command flags set by the driver when recording commands into a command
/// buffer. The scheduler inspects these flags to decide preemption policy for
/// in-flight submissions (see [Section 22.5](#accelerator-isolation-and-scheduling--non-preemptible-budget-overspend-handling)).
bitflags! {
#[repr(transparent)]
pub struct AccelCmdFlags: u32 {
/// This command cannot be interrupted once dispatched to hardware.
/// The scheduler allows it to run to completion even if the CBS
/// budget is exhausted, recording any overspend as debt carried
/// forward to the next period. Typical uses: firmware upload,
/// DMA scatter-gather that cannot be split, fixed-function pipeline
/// flushes.
const NON_PREEMPTIBLE = 1 << 0;
/// Command writes to shared resources visible to other contexts.
/// The scheduler may insert a memory barrier after completion.
const WRITES_SHARED = 1 << 1;
/// Command is a synchronization primitive (fence signal, semaphore
/// release). The scheduler prioritizes these to unblock dependent
/// contexts.
const SYNC_RELEASE = 1 << 2;
}
}
/// Parameters for ACCEL_PREFETCH ioctl.
#[repr(C)]
pub struct AccelPrefetchParams {
/// Virtual address range to prefetch to device memory.
pub vaddr: u64,
pub size: u64,
/// Target device (0 = the device this fd refers to).
pub target_device: u32,
/// Flags: MIGRATE_SYNC = 1 (block until done), MIGRATE_ASYNC = 0.
pub flags: u32,
}
// AccelPrefetchParams: u64(8)*2 + u32(4)*2 = 24 bytes. Userspace ABI (ioctl argument).
const_assert!(core::mem::size_of::<AccelPrefetchParams>() == 24);
/// Parameters for ACCEL_WAIT_FENCE ioctl.
#[repr(C)]
pub struct AccelWaitFenceParams {
/// Fence to wait on.
pub fence: DmaFenceHandle,
/// Timeout in nanoseconds (0 = non-blocking check, u64::MAX = wait forever).
pub timeout_ns: u64,
/// [OUT] Fence status on return: 0 = signaled, -ETIMEDOUT = not yet,
/// -EIO = device reset (fence will never signal).
pub out_status: i32,
pub _pad: u32,
}
// AccelWaitFenceParams: u64(8)*2 + i32(4) + u32(4) = 24 bytes. Userspace ABI (ioctl argument).
const_assert!(core::mem::size_of::<AccelWaitFenceParams>() == 24);
2. Context save/restore for mid-shader preemption: driver-reported capability.
Context save/restore state size and latency are fundamentally hardware-defined. The kernel cannot dictate GPU context size. The driver reports preemption capabilities via a new optional vtable method:
// Appended to AccelBaseVTable per Section 12.1.4 versioning rules (Option<fn> = backward compat).
// Drivers compiled against older KABI versions see this field as None (vtable_size check).
// pub preemption_caps: Option<unsafe extern "C" fn(
// device: DeviceHandle,
// out: *mut AccelPreemptionCaps,
// ) -> IoResultCode>,
/// Preemption capability descriptor reported by the driver.
#[repr(C)]
pub struct AccelPreemptionCaps {
/// Maximum context save state size in bytes (driver-reported).
/// The kernel allocates this per-context from device memory.
pub context_state_size: u64,
/// Worst-case save latency (nanoseconds). The scheduler uses this
/// to decide whether preemption is worth the cost for short timeslices.
pub save_latency_ns: u64,
/// Worst-case restore latency (nanoseconds).
pub restore_latency_ns: u64,
/// Preemption granularity supported by the hardware.
pub granularity: AccelPreemptionGranularity,
/// Whether this device supports mid-shader preemption.
/// u8 (0=false, 1=true) for KABI stability (see Section 22.1.2.3 KABI rules).
/// Placed after u32 granularity to avoid padding holes in #[repr(C)] layout.
pub supports_mid_shader: u8,
/// Explicit padding to ensure deterministic layout and zero-initialized bytes.
/// This 3-byte array fills the gap between supports_mid_shader (u8 at offset 28)
/// and the trailing _pad array, ensuring no compiler-inserted uninitialized padding.
/// The entire struct is zero-initialized by the driver before filling fields.
pub _reserved: [u8; 3],
/// Reserved for future expansion. Must be zero-initialized.
pub _pad: [u8; 32],
}
// Layout verification:
// context_state_size: 0-7 (8 bytes)
// save_latency_ns: 8-15 (8 bytes)
// restore_latency_ns: 16-23 (8 bytes)
// granularity: 24-27 (4 bytes, u32)
// supports_mid_shader: 28 (1 byte)
// _reserved: 29-31 (3 bytes)
// _pad: 32-63 (32 bytes)
// Total: 64 bytes (one cache line). KABI struct.
const_assert!(core::mem::size_of::<AccelPreemptionCaps>() == 64);
/// Preemption granularity levels, from coarsest to finest.
#[repr(u32)]
pub enum AccelPreemptionGranularity {
/// No mid-dispatch preemption — must wait for the entire current command buffer
/// to complete before the scheduler can reclaim the device. Latency is bounded
/// by the longest in-flight command buffer, which may be hundreds of milliseconds
/// for large compute dispatches. The AccelScheduler treats devices reporting this
/// level as cooperative-yield devices.
CommandBoundary = 0,
/// Preemption point checked after each draw call or compute dispatch boundary.
/// Medium granularity: latency is bounded by the longest single draw call or
/// dispatch, typically 1-10ms on current hardware. Supported by most modern
/// discrete GPUs (post-Maxwell NVIDIA, post-GCN3 AMD).
DrawBoundary = 1,
/// Preemption point checked at triangle/pixel boundaries (graphics) or wavefront
/// boundaries (compute). Finer than `DrawBoundary` but coarser than
/// `InstructionLevel`. Latency is typically sub-millisecond.
PixelBoundary = 2,
/// Preemption point checked after every GPU instruction — highest overhead, used
/// only for latency-critical contexts such as AR/VR rendering and real-time
/// inference. The device saves full shader register state and can resume
/// mid-thread. Save+restore latency is typically 50-100μs on compute GPUs
/// (e.g., NVIDIA Ampere+ with compute preemption enabled).
InstructionLevel = 3,
}
Kernel responsibilities:
- Allocate a save-state buffer of context_state_size bytes in device memory per
AccelContext.
- When preempting, call AccelBaseVTable::preempt_context() (Section 22.1).
- The scheduler uses save_latency_ns + restore_latency_ns to compute the break-even
point: do not preempt if the remaining timeslice is shorter than the save+restore cost.
Driver responsibilities:
- Report accurate worst-case values (over-reporting causes conservative scheduling;
under-reporting causes deadline misses).
- Perform the actual hardware state save/restore when preempt_context() is called.
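The scheduler's break-even rule can be sketched in a few lines. Names here are hypothetical; the real scheduler reads the latencies from the driver-reported AccelPreemptionCaps:

```rust
/// Driver-reported preemption cost (subset of AccelPreemptionCaps).
struct PreemptionCost {
    save_latency_ns: u64,
    restore_latency_ns: u64,
}

/// Break-even check: preempting is only worthwhile when the remaining
/// timeslice exceeds the worst-case save + restore cost. Otherwise the
/// scheduler lets the running context finish its slice.
fn should_preempt(remaining_timeslice_ns: u64, cost: &PreemptionCost) -> bool {
    remaining_timeslice_ns > cost.save_latency_ns + cost.restore_latency_ns
}
```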
3. Multi-GPU unified memory coherence granularity: 64KB default, per-device configurable.
The default coherence unit is 64KB. Rationale:
- 4KB is too fine-grained. GPU memory controllers are optimized for 64KB+ transfers. 4KB coherence causes 16× more page-fault interrupts and migration metadata overhead. False sharing is rare at 64KB in real ML workloads (tensors are almost always >64KB).
- 64KB matches GPU hardware. NVIDIA and AMD GPU page tables operate at 64KB granularity. GPU-side page faults (via ATS/PRI or vendor-specific mechanisms) already use this unit. CPU-side overhead is acceptable — a 64KB migration over PCIe Gen4 x16 takes ~3µs (64KB / 25.6 GB/s raw bandwidth + TLP header and DMA setup overhead).
- 2MB is too coarse. Causes excessive data transfer for partial access patterns (e.g., accessing one element of a large tensor page). Acceptable for pre-fetching but harmful for demand-paging.
The coherence granularity is a per-device property reported by the driver in
AccelCapabilities:
pub struct AccelCapabilities {
// ...existing fields...
/// Minimum coherence granularity in bytes (must be power of 2, >= 4096).
/// The unified memory system does not track coherence at finer granularity.
pub coherence_granularity: u64,
// ...
}
The kernel's HMM (heterogeneous memory management) layer rounds up all migration operations to this granularity. For multi-GPU scenarios where devices have different granularities, the kernel uses the LCM (least common multiple) of all participating devices' granularities. In practice this is 64KB since all major GPU vendors use it.
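A minimal sketch of the multi-device granularity computation follows. Since granularities must be powers of two, the LCM degenerates to the maximum of the set, but the general form is shown for clarity; function names are illustrative:

```rust
/// Greatest common divisor (Euclid).
fn gcd(a: u64, b: u64) -> u64 {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Least common multiple, as used for mixed-granularity multi-GPU mappings.
fn lcm(a: u64, b: u64) -> u64 {
    a / gcd(a, b) * b
}

/// Coherence unit for a mapping shared by several devices: the LCM of each
/// device's reported coherence_granularity, floored at the 4KB minimum.
fn coherence_unit(granularities: &[u64]) -> u64 {
    granularities.iter().copied().fold(4096, lcm)
}
```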
4. HealthEventClass::Accelerator formal taxonomy.
Event codes within the Accelerator class (stored in the event_code: u32 field of
HealthEvent, Section 20.1):
| Code | Name | Default Severity | Recommended Action |
|---|---|---|---|
| `0x0001` | `VRAM_ECC_CORRECTABLE` | Info (single), Warning (≥100/hr) | Log. At threshold: schedule maintenance. |
| `0x0002` | `VRAM_ECC_UNCORRECTABLE` | Critical | Quiesce device. Kill affected contexts. Alert admin. |
| `0x0003` | `SRAM_ECC_CORRECTABLE` | Info | Log. L1/L2/register file errors are usually self-healing. |
| `0x0004` | `SRAM_ECC_UNCORRECTABLE` | Critical | Reset compute engine. Fail affected contexts. |
| `0x0010` | `THERMAL_THROTTLE` | Warning | Log. At threshold (≥10/hr): alert admin (cooling issue). |
| `0x0011` | `THERMAL_SHUTDOWN` | Critical | Device self-protected. Mark offline. Alert admin. |
| `0x0012` | `POWER_BRAKE` | Warning | Device hit power limit. Log for trending. |
| `0x0020` | `GPU_HANG_DETECTED` | Degraded | Attempt engine reset. If reset fails → Critical. |
| `0x0021` | `ENGINE_RESET` | Warning | Log which engine was reset and which contexts were lost. |
| `0x0022` | `DEVICE_RESET` | Degraded | Full device reset. All contexts lost. Alert admin. |
| `0x0023` | `DEVICE_LOST` | Critical | Device unresponsive after reset. Mark offline. DisableDevice. |
| `0x0030` | `PCIE_REPLAY_THRESHOLD` | Warning (≥50/min) | Link instability. DemoteTier. Suggest reseat/replace. |
| `0x0031` | `PCIE_LINK_DEGRADED` | Degraded | Link retrained at lower speed/width. Log + alert. |
| `0x0040` | `VRAM_USAGE_HIGH` | Info (>80%), Warning (>95%) | For trending / OOM avoidance. |
| `0x0041` | `CONTEXT_OOM` | Warning | A specific context's allocation failed. |
| `0x0050` | `DRIVER_CRASH` | Degraded | Driver process died. At threshold (≥3/hr): DemoteTier. |
| `0x0051` | `FW_ERROR` | Degraded | Device firmware reported an error (e.g., NVIDIA XID). |
| `0x0060` | `ENCODER_ERROR` | Warning | Video encode/decode engine error. |
| `0x0070` | `NVLINK_CRC_ERROR` | Warning (threshold) | Interconnect link error. Log for trending. |
| `0x0071` | `NVLINK_DOWN` | Degraded | Interconnect lane failed. Multi-GPU perf degraded. |
| `0xFF00–0xFFFF` | `VENDOR_SPECIFIC` | (driver-defined) | Opaque vendor event — logged as-is. |
Each event code maps to a specific subset of AccelHealthData fields (Section 22.3)
that are meaningful for that event (e.g., THERMAL_THROTTLE → temperature_mc and
throttle_count; VRAM_ECC_CORRECTABLE → ecc_correctable). The vendor-specific range
0xFF00–0xFFFF lets drivers report hardware-specific events without requiring a kernel
update for each new error type.
FMA diagnosis rules (Section 20.1) map these codes to automated actions:
| Rule | Trigger | Action |
|---|---|---|
| VRAM ECC degradation | `VRAM_ECC_CORRECTABLE` ≥ 100/hr | Alert + schedule maintenance |
| VRAM uncorrectable | `VRAM_ECC_UNCORRECTABLE` × 1 | DisableDevice + Alert |
| Thermal throttling | `THERMAL_THROTTLE` ≥ 10/hr | Alert (cooling issue) |
| PCIe link unstable | `PCIE_REPLAY_THRESHOLD` ≥ 50/min | DemoteTier + Alert |
| Repeated driver crashes | `DRIVER_CRASH` ≥ 3/hr | DemoteTier (move to Tier 2) |
| Device lost | `DEVICE_LOST` × 1 | DisableDevice + Alert |
22.3.15 Shared Accelerator Helper Services¶
Multiple GPU/accelerator drivers need the same utility code: EDID parsing, display mode validation, bandwidth calculation, colorspace conversion tables. Without a sharing mechanism, every driver duplicates this logic. UmkaOS provides two tiers of sharing:
Tier A: Static driver library (umka-driver-sdk)
The existing umka-driver-sdk crate already provides KABI types and ring buffer helpers.
It is extended with accelerator-specific utilities that drivers statically link:
- Endian conversion, bitfield packing macros, register access patterns
- PCI config space helpers (BAR mapping, capability walking)
- DMA scatter-gather list construction
These are tiny, hot-path-eligible functions. Zero runtime overhead (compiled into each driver binary). Updates require driver recompilation.
Tier B: Loadable KABI helper service (accel-display-helpers)
Substantial cold-path utilities are provided as a loadable Evolvable KABI service, following the same pattern as umka-vfs or umka-net. The service is loaded on demand when the first display-capable accelerator registers, and is live-evolvable independently of drivers.
/// KABI service: accel-display-helpers
/// Loaded on demand when first AccelDisplayVTable is registered.
/// Provides shared display utility functions to accelerator drivers.
///
/// Service ID: "accel-display-helpers"
/// KABI version: 1
pub struct AccelDisplayHelpersVTable {
/// Parse raw EDID blob into a validated list of display modes.
/// Handles EDID 1.4, DisplayID 2.0 extensions, and CTA-861 audio blocks.
/// Returns the number of valid modes written to `out_modes`.
///
/// The parser tolerates common EDID bugs (truncated blocks, bad checksums
/// on extension blocks) by parsing what it can and skipping corrupt sections
/// — same robustness as Linux's drm_edid.c parser.
pub parse_edid: unsafe extern "C" fn(
raw_edid: *const u8,
edid_len: u32,
out_modes: *mut AccelDisplayMode,
max_modes: u32,
out_mode_count: *mut u32,
) -> IoResultCode,
/// Validate display timing parameters against hardware constraints.
/// Checks pixel clock range, blanking minimums, refresh rate bounds,
/// and bandwidth vs link capacity.
pub validate_mode: unsafe extern "C" fn(
mode: *const AccelDisplayMode,
max_pixel_clock_khz: u32,
max_link_bandwidth_mbps: u32,
) -> bool,
/// Compute required display bandwidth in bytes/sec for a given mode.
/// Accounts for bits-per-pixel, refresh rate, and blanking overhead.
pub compute_display_bandwidth: unsafe extern "C" fn(
mode: *const AccelDisplayMode,
bpp: u32,
) -> u64,
/// Select the best mode from a mode list that fits within hardware constraints.
/// Prefers: highest resolution → highest refresh → lowest bandwidth.
/// Returns index into modes array, or u32::MAX if no mode fits.
pub select_preferred_mode: unsafe extern "C" fn(
modes: *const AccelDisplayMode,
mode_count: u32,
max_pixel_clock_khz: u32,
max_link_bandwidth_mbps: u32,
) -> u32,
}
Why a KABI service, not compiled-in kernel code:
- Live-evolvable: Fix an EDID parsing bug for a newly-discovered monitor quirk → evolve only `accel-display-helpers`, no driver rebuild needed.
- Loaded on demand: Headless servers (GPU compute only, no display) never load this service — zero memory footprint.
- Versioned: New helper methods (HDR metadata parsing, DSC bandwidth calculation) are added in KABI v2+ without breaking v1 drivers.
- Single copy: All GPU drivers (NVIDIA, AMD, Intel, virtio-gpu) share one instance in memory.
Service dependency resolution (Section 12.7) ensures that
accel-display-helpers (and all other helper services) are loaded before any driver
that declares them as a dependency reaches its device_init callback. Service binding
is serialized by the KABI service registry lock; concurrent callers (e.g., two
display-capable drivers probing in parallel) receive the same vtable pointer.
Driver usage pattern:
Driver probe:
1. Register AccelBaseVTable (always)
2. If display-capable: call kabi_bind_service("accel-display-helpers", 1)
→ kernel loads the service if not already loaded
→ driver receives AccelDisplayHelpersVTable pointer
3. Use helpers: parse_edid(raw_bytes, ...) → mode list
4. Register AccelDisplayVTable with validated modes
Driver unload:
Service remains loaded (other drivers may use it).
Unloaded by kernel when last display-capable driver unregisters + idle timeout.
Future helper services (same pattern, separate KABI services):
- accel-compute-helpers: Compute capability detection, workgroup size validation
- accel-video-helpers: Codec capability negotiation, bitstream format detection
- accel-power-helpers: Thermal zone mapping, power curve interpolation
22.3.16 Implementation Phasing¶
| Component | Phase | Dependencies | Notes |
|---|---|---|---|
| AccelBaseVTable KABI definition | Phase 3 | Driver SDK, device registry | Define the interface first |
| Accelerator scheduler (basic) | Phase 3-4 | AccelBase, cgroups | Round-robin, then priority |
| DRM/KMS compatibility shim | Phase 3-4 | AccelBase | Required for desktop GPU |
| Heterogeneous memory (basic) | Phase 4 | Memory manager, AccelBase | CPU-device migration |
| P2P DMA | Phase 4 | IOMMU, AccelBase | Requires IOMMU support |
| Cgroup accel controller | Phase 4 | AccelScheduler, cgroups | Compute time + memory limits |
| CBS bandwidth guarantees | Phase 4-5 | AccelScheduler, CBS | Guaranteed compute time |
| Hardware preemption | Phase 5 | AccelScheduler, GPU driver | Driver-dependent |
| In-kernel inference (basic) | Phase 4 | Decision trees only | Start with page prefetching |
| In-kernel inference (neural) | Phase 5 | Quantized INT8 runtime | Requires validation |
| RDMA KABI | Phase 5 | Network drivers | Separate from GPU |
| GPUDirect RDMA | Phase 5+ | RDMA + P2P DMA | Cross-driver |
| NVIDIA KABI driver (basic init) | Phase 3-4 | AccelBase, KABI, ioctl compat | nvidia-smi + simple CUDA (Section 22.7) |
| NVIDIA KABI driver (UVM) | Phase 4-5 | UmkaOS HMM, NVIDIA basic | cudaMallocManaged + multi-GPU (Section 22.7) |
| NVIDIA KABI driver (production) | Phase 5 | All NVIDIA components | Full CUDA stack + crash recovery |
| Device partitioning | Phase 5+ | AccelScheduler, registry | Hardware-dependent |
| Topology-assisted collectives | Phase 5+ | Registry, RDMA, P2P | Optimization layer |
22.3.17 Priority Rationale¶
Phase 3-4 (Real Workloads): Basic accelerator framework + DRM compat + simple scheduling. This is the minimum for "GPU works on UmkaOS."
Phase 4-5 (Production Ready): Cgroup integration, memory management, P2P DMA, in-kernel inference basics. This is when UmkaOS becomes better than Linux for AI/ML workloads.
Phase 5+ (Ecosystem): Advanced scheduling, RDMA, collectives, partitioning. This is the competitive advantage phase — features that Linux fundamentally cannot provide due to architectural constraints.
22.3.18 Licensing Summary¶
| Component | IP Source | Risk |
|---|---|---|
| AccelBase KABI | Original design (vtable pattern from existing KABI) | None |
| Accelerator scheduler | CBS (academic, 1998) + priority scheduling (textbook) | None |
| Heterogeneous memory | Academic (HMM concepts, published research) | None |
| P2P DMA | PCIe spec (public), IOMMU spec (public) | None |
| DRM/KMS compat | Linux DRM (GPLv2 interface, ioctl numbers are facts) | None |
| In-kernel inference | Original design, standard ML algorithms (public) | None |
| RDMA | InfiniBand spec (public), RoCE spec (public) | None |
| Cgroup controller | Linux cgroup v2 interface (filesystem paths are facts) | None |
| NVIDIA KABI driver | New code inspired by MIT/GPLv2 open-source nvidia.ko | None (MIT-compatible, OKLF-clean) |
All vendor-specific GPU features (MIG, NVLink, etc.) are accessed through the AccelBase KABI vtable — the driver implements them, not the kernel. The kernel provides the framework; vendors fill in the hardware-specific logic through KABI.
22.4 Accelerator Memory and P2P DMA¶
22.4.1 Heterogeneous Memory Management¶
22.4.1.1 Problem¶
AI models are growing exponentially. A 70B parameter model at FP16 is ~140GB. GPU VRAM is typically 24-80GB. Models larger than VRAM need transparent memory management — pages migrating between CPU RAM and GPU VRAM based on access patterns.
Linux has HMM (Heterogeneous Memory Management) and mmu_notifiers, but they are
bolted onto the existing MM and poorly integrated. Each vendor (NVIDIA UVM, AMD SVM,
Intel SVM) implements its own page migration logic in its driver.
22.4.1.2 Design: Accelerator Memory as NUMA Nodes¶
The memory manager (Section 4.1) already has NUMA awareness — per-node buddy allocators, NUMA-local page allocation, NUMA-aware page cache placement.
Key insight: Accelerator memory is just another NUMA node. A GPU with 24GB VRAM is NUMA node N+1, with different performance characteristics (higher bandwidth, higher latency from CPU, not directly CPU-accessible).
Existing NUMA topology (Section 4.1):
NUMA Node 0 (CPU 0-15) NUMA Node 1 (CPU 16-31)
64GB DDR5 64GB DDR5
Per-CPU free page pools Per-CPU free page pools
Buddy allocator 0 Buddy allocator 1
Extended with accelerator memory:
NUMA Node 2 (GPU 0 VRAM) NUMA Node 3 (GPU 1 VRAM)
24GB GDDR6X 24GB GDDR6X
Managed by GPU driver Managed by GPU driver
via AccelBase vtable via AccelBase vtable
GPU NUMA affinity propagation: Each accelerator's NUMA node ID is exposed via the
device registry (AccelDeviceInfo.numa_node). The CPU scheduler uses this for
GPU-related kernel thread placement: threads servicing GPU interrupts, DMA completion
handlers, and AccelScheduler work queues are preferentially placed on the CPU NUMA
node closest to the GPU (lowest interconnect latency). This is advisory, not mandatory
— the CPU scheduler may override affinity under load balancing pressure.
22.4.1.3 Memory Node Types¶
// umka-core/src/mem/numa.rs (extend existing)
//
// Canonical definition location: Section 4.9 (04-memory.md#49-numa-topology-and-policy).
// NumaNodeType extends NumaTopology with per-node type classification.
// Defined here in Section 22.4.1.3 because accelerator and CXL memory node
// types are the primary motivation; the base NumaTopology struct in Ch 4
// references this enum for node_type per-node metadata.
/// Kernel-internal enum, never crosses a KABI or wire boundary.
/// `bool` fields are acceptable (no cross-compilation-boundary risk).
pub enum NumaNodeType {
/// Standard CPU-attached DDR memory.
CpuMemory,
/// Accelerator device-local memory (VRAM, HBM).
/// Managed through the AccelBase KABI vtable.
AcceleratorMemory {
device_id: DeviceNodeId,
bandwidth_gbs: u32, // GB/s per stack. e.g., 819 for HBM3 (per-stack at
// 6.4 Gbps/pin × 1024 pins; JEDEC JESD238A).
// 1229 for HBM3E (JEDEC max at 9.6 Gbps/pin × 1024 pins;
// actual products: ~1000-1200 per stack)
latency_ns: u32, // e.g., 500-1000 for PCIe GPU
cpu_visible: bool, // Can CPU access this memory via BAR?
cpu_visible_size: u64, // Size of CPU-visible window (may be < total)
coherent: bool, // CPU-device cache coherent? (CXL = yes, PCIe = no)
},
/// CXL-attached memory (high-capacity, CPU-accessible, higher latency).
CxlMemory {
latency_ns: u32,
/// Bandwidth in GB/s (gigabytes per second), matching CXL memory semantics.
bandwidth_gbs: u32,
},
/// CXL 3.0 shared memory pool visible to multiple nodes.
/// See Section 5.10 for CXL fabric integration details.
CxlSharedPool {
latency_ns: u32,
/// Bandwidth in GB/s (gigabytes per second), matching CXL memory semantics.
bandwidth_gbs: u32,
/// Nodes that share this pool (bitfield).
sharing_nodes: u64,
/// Hardware coherence protocol version.
coherence_version: CxlCoherenceVersion,
},
}
Allocation ownership clarification: Device memory (GPU VRAM, accelerator SRAM) is
managed by the device driver's internal allocator via the AccelBase KABI vtable, not
by the kernel buddy allocator. UmkaOS tracks only resident-page accounting and OOM
scoring via PageLocationTracker. The NumaNodeType::AcceleratorMemory node is used
for NUMA distance modeling and memory pressure accounting, not for kernel-managed
physical allocation.
22.4.1.4 Transparent Page Migration¶
When a device with hw_page_faults = 1 (modern GPUs with ATS/PRI support) faults
on an unmapped address, the following occurs:
1. Device encounters page fault at virtual address VA.
2. Device sends ATS (Address Translation Services) fault to IOMMU.
3. IOMMU routes fault to UmkaOS Core's device fault handler
(runs in kernel workqueue context, NOT in interrupt context).
4. Acquire mmap_read_lock on the owning process's MemoryMap
(Section 4.6.2). This ensures the VMA tree is stable during
fault resolution. The read lock allows concurrent CPU page faults
on other pages in the same address space.
5. UmkaOS Core looks up VA in the owning process's VMA tree
(maple_find under mmap_read_lock):
a. If VA maps to a CPU-resident page:
- Acquire per-page ownership spinlock in PageLocationTracker.
- Transition page to Migrating state.
- Release per-page spinlock.
- Unmap from CPU page tables: clear PTE, flush TLB
(Section 4.6.5 — TLB shootdown to all CPUs in mm_cpumask).
This MUST complete before the DMA copy reads the source page.
- DMA copy: CPU NUMA node → device memory
(via driver's migrate_pages callback).
- Install device page table mapping.
- Re-acquire per-page spinlock, transition to DeviceLocal.
- Release per-page spinlock, wake waiters.
- Resume device execution.
b. If VA is unmapped (anonymous zero-fill):
- Allocate a fresh zero page on the device's NUMA node.
- Map in device page tables.
- Resume device execution.
c. If VA maps to another device's memory:
- P2P migration (see Section 22.4.1.9, explicit ownership
transfer protocol).
6. Release mmap_read_lock.
7. Track this page's location in PageLocationTracker (Section 22.4.1.5).
VMM locking protocol for device page migration:
The mmap_read_lock is held for the duration of steps 4-6 to prevent VMA
changes (munmap, mprotect, mremap) from racing with migration. The lock is
a read lock — concurrent CPU page faults and other device faults on different
pages proceed unblocked.
Lock ordering: HMM migration acquires mmap_read_lock (shared) then per-page PageLocationTracker spinlock. This is safe: (1) mmap_read_lock is shared (no conflict with other readers), (2) munmap acquires mmap_write_lock (exclusive) which waits for all read locks to drain, ensuring HMM migration completes before address space teardown. The PageLocationTracker spinlock is at lock hierarchy level 12.5 (between PAGE_LOCK and FS_SB_LOCK per Section 3.5).
CPU → device migration requires both CPU PTE modification and device page
table modification. The CPU PTE is cleared under the per-PTE spinlock (PTL,
Section 4.8) with a TLB shootdown
to all CPUs that may have cached the mapping. The cmpxchg pattern used for
PTE installation (Section 4.6.3) ensures that a concurrent CPU fault on the
same page either sees the old PTE (CPU page still valid, fault resolves normally)
or an empty PTE (page is being migrated, fault blocks on the per-page wait queue
via PageLocationTracker).
Device → CPU migration (eviction under memory pressure or explicit request):
- Revoke device page table mapping (driver callback).
- DMA copy: device memory → CPU NUMA node.
- Install CPU PTE under PTL via cmpxchg.
- mmap_read_lock is held to ensure the VMA still exists and the target
virtual address is still valid.
Cross-reference: CPU-side page fault handling and PTE locking are specified
in Section 4.8. The Maple Tree VMA management and mmap_read_lock are
specified in Section 4.6.2.
For devices WITHOUT hardware page faults (older GPUs, most NPUs): the kernel uses software-assisted migration. Before submitting a command buffer, the scheduler identifies which pages the command buffer references and pre-migrates them. This is less efficient but functionally equivalent.
Device page tables (GPU GART/GMMU, NPU MMU) are driver-opaque — the kernel does not manage device-internal address translation. The kernel manages only host-side IOMMU mappings; device-side page table operations are performed by the driver through hardware-specific register writes.
22.2.1.5 Page Location Tracking¶
// umka-core/src/mem/hmm.rs (kernel-internal)
/// Tracks where each page of a shared address space currently resides.
pub struct PageLocationTracker {
/// Per-page location (indexed by virtual page number).
/// Each entry records which NUMA node currently holds this page.
locations: RadixTree<PageLocation>,
/// Per-page migration epoch counter (indexed by virtual page number).
/// Incremented each time the page transitions to `Migrating` state.
/// Used to detect stale migration completions after concurrent recovery.
/// Each entry is a single `AtomicU64`, stored in a separate radix tree
/// to avoid bloating the `PageLocation` enum (which is 32 bytes aligned).
migration_epochs: RadixTree<AtomicU64>,
/// Side table for in-flight migrations. Slab-allocated, bounded by
/// `MAX_CONCURRENT_MIGRATIONS` (default 4096). The `Migrating` variant
/// stores only the slab index (`migration_id: u32`), keeping the
/// per-page `PageLocation` entry small.
active_migrations: Slab<MigrationRecord>,
}
/// Full source/target metadata for an in-flight page migration.
/// Stored in `PageLocationTracker::active_migrations`, referenced by
/// `PageLocation::Migrating { migration_id }`.
pub struct MigrationRecord {
pub source_kind: PageLocationKind,
pub source_node: u8,
pub source_device: DeviceNodeId,
pub source_addr: u64,
pub target_kind: PageLocationKind,
pub target_node: u8,
pub target_device: DeviceNodeId,
pub target_addr: u64,
}
/// Maximum concurrent page migrations system-wide.
///
/// This limit bounds memory usage for the `active_migrations` slab and prevents
/// excessive contention on the per-page spinlocks during migration-heavy workloads.
/// The value 4096 is chosen against a worst-case workload:
/// - A 70B-parameter FP16 model spans ~35M 4KB pages.
/// - A 1% active migration rate means ~350K pages queued for migration at
///   any moment, far more than the slab holds in flight at once.
/// - With 4096 slots and ~100μs per migration, sustained throughput is
///   ~40M migrations/sec (4096 / 100μs), so a 350K-page backlog drains in
///   under 10ms. This is sufficient for most workloads.
///
/// If the limit is exceeded, new migration requests block on a global wait queue
/// until a slot becomes available. This is a backpressure mechanism that prevents
/// memory exhaustion and ensures forward progress.
pub const MAX_CONCURRENT_MIGRATIONS: usize = 4096;
/// Backoff strategy for per-page spinlock during migration metadata updates.
///
/// The per-page spinlock in `PageLocationTracker` is held only during metadata
/// updates (state transitions, epoch increments), which complete in tens of
/// nanoseconds. However, under high contention (e.g., 8 GPUs faulting on the same
/// page simultaneously in tensor parallelism), exponential backoff reduces
/// contention and power consumption.
///
/// The backoff sequence is:
/// 1. Try to acquire spinlock immediately.
/// 2. If locked, pause for 10 cycles (via `PAUSE` on x86, `YIELD` on ARM).
/// 3. Retry up to 10 times with exponential backoff (10, 20, 40, ..., 5120 cycles).
/// 4. After 10 failed retries, transition to sleep-based wait:
/// - Set a flag in the page metadata indicating "waiters pending".
/// - Block on a per-page wait queue (not spinning).
/// - The current migrator will wake all waiters after releasing the lock.
///
/// This hybrid approach (spin-then-sleep) optimizes for the common case where
/// the lock is held briefly, while avoiding wasted CPU cycles under sustained
/// contention.
pub const MIGRATION_LOCK_BACKOFF_BASE_CYCLES: u32 = 10;
pub const MIGRATION_LOCK_BACKOFF_MAX_RETRIES: u32 = 10;
/// Stored per-page in `PageLocationTracker::locations` (one entry per virtual page).
/// `#[repr(u8)]` keeps the discriminant to one byte; the full variant data is in the
/// enum payload. The radix tree stores the enum inline, so every extra byte of
/// discriminant wastes space proportional to the mapped address-space size.
#[repr(u8)]
pub enum PageLocation {
/// Page is in CPU memory on this NUMA node.
CpuNode(u8),
/// Page is in accelerator device-local memory.
DeviceLocal {
device_id: DeviceNodeId,
device_addr: u64, // Device-side physical/virtual address
},
/// Page is in transit (being migrated).
/// Stores only a side-table index to avoid bloating every per-page entry.
/// The `MigrationRecord` (source, target, addresses) is stored in
/// `PageLocationTracker::active_migrations` — a slab of fixed capacity
/// bounded by `MAX_CONCURRENT_MIGRATIONS`. Migrations are transient
/// (milliseconds), so the number of in-flight entries is small relative
/// to total pages. This keeps `PageLocation`'s largest variant payload at
/// 24 bytes (driven by `RemoteNode` / `RemoteDevice`). With the `repr(u8)`
/// discriminant and u64 alignment, the aligned enum size is 32 bytes.
Migrating {
migration_id: u32,
},
/// Page is not yet allocated (demand-paged).
NotPresent,
/// Page is in compressed pool (Section 4.2).
Compressed,
/// Page is in swap.
Swapped,
// === Distributed memory locations (see Section 5.6 for the DSM protocol) ===
/// Page is on a remote node's CPU memory, accessible via RDMA.
RemoteNode {
node_id: NodeId,
remote_phys_addr: u64,
dsm_state: DsmPageState,
},
/// Page is on a remote node's accelerator memory (GPUDirect RDMA).
RemoteDevice {
node_id: NodeId,
device_id: DeviceNodeId,
device_addr: u64,
},
/// Page is in CXL-attached memory pool (hardware-coherent).
/// See Section 5.10 for CXL fabric integration details.
CxlPool {
pool_id: u32,
pool_offset: u64,
},
}
/// Discriminant for `PageLocation` variants, used in `MigrationRecord`
/// to record the source/target kind of an in-flight migration.
#[repr(u8)]
pub enum PageLocationKind {
CpuNode = 0,
DeviceLocal = 1,
NotPresent = 2,
Compressed = 3,
Swapped = 4,
RemoteNode = 5,
RemoteDevice = 6,
CxlPool = 7,
/// Page is currently being migrated between locations. The migration
/// subsystem sets this transiently while a DMA copy is in flight.
/// Any fault on a `Migrating` page blocks until migration completes
/// (signaled via the per-page `migration_waitq`). The `MigrationRecord`
/// holds the source and target `PageLocationKind` for the in-flight move.
Migrating = 8,
}
22.2.1.6 Page Descriptor ↔ PageLocation Bridge¶
The kernel's `Page` descriptor (Section 4.2) represents physical pages in
CPU RAM. When a page migrates to VRAM, the `Page` descriptor's role changes:

- CPU → VRAM: The `Page` descriptor is retained for the duration of the migration (it describes the physical frame the DMA copy reads from). The page table PTE is replaced with a device migration entry (a non-present PTE encoding the `DeviceNodeId` + device address). The `PageLocationTracker` records `PageLocation::DeviceLocal { device_id, device_addr }`. Once the copy completes, the CPU `Page` is freed back to the buddy allocator (the physical frame is reusable).
- VRAM → CPU (eviction or explicit migrate): A new CPU `Page` is allocated via `alloc_pages()`. The device driver copies data from VRAM to the new page via DMA. The PTE is updated to point to the new `Page`. The `PageLocationTracker` transitions to `PageLocation::CpuNode(nid)`.
- Fault on device migration entry: When userspace accesses a VA whose PTE contains a device migration entry (page is in VRAM), the fault handler invokes `hmm_fault_handler()`:
  1. Look up `PageLocationTracker` → `DeviceLocal { device_id, device_addr }`.
  2. Allocate a CPU `Page` via `alloc_pages(GFP_KERNEL, nid)`.
  3. Issue DMA read from device → CPU page.
  4. Install PTE, update `PageLocationTracker` → `CpuNode(nid)`.
  5. Return `Ok(())` — fault resolved, userspace resumes.

This model means no kernel `Page` exists for VRAM-resident pages — the
`Page` descriptor tracks CPU physical frames only. VRAM pages are tracked
exclusively by `PageLocationTracker` entries and device-side address
metadata. This avoids phantom `Page` descriptors for memory that doesn't
exist in CPU address space.
22.2.1.7 Migration Policy¶
// umka-core/src/mem/hmm.rs
/// When to trigger page migration from CPU to device memory (or vice versa).
/// This is deliberately an enum rather than a bool because the trigger logic
/// has three distinct modes with different parameters and performance trade-offs.
pub enum MigrationTrigger {
/// Migrate on first device fault (aggressive: more migration traffic
/// but lower steady-state fault rate). Default for latency-sensitive workloads.
OnFirstFault,
/// Migrate only after `threshold` faults on the same page within `window_ms`
/// milliseconds (conservative: less migration traffic, higher fault rate).
/// Default for throughput-oriented workloads with scattered access patterns.
OnThreshold {
threshold: u32,
window_ms: u32,
},
/// Use learned policy (in-kernel inference, Section 22.4) to decide
/// per-page whether to migrate. The model observes fault patterns and
/// predicts which pages will be re-accessed, avoiding unnecessary migrations.
Learned,
}
pub struct MigrationPolicy {
/// When to trigger migration from CPU RAM to device memory.
pub trigger: MigrationTrigger,
/// Batch migration: when migrating, also prefetch nearby pages
/// (spatial locality heuristic).
pub prefetch_pages: u32, // 0 = no prefetch, 16 = prefetch 16 neighbors
/// Maximum pages in-flight (being migrated) at once.
pub max_inflight_migrations: u32,
/// Eviction policy when device memory is full:
/// LRU (least recently accessed on device) is the default.
pub eviction_policy: EvictionPolicy,
}
pub enum EvictionPolicy {
/// Evict least recently accessed pages back to CPU memory.
Lru,
/// Evict pages with lowest access frequency.
Lfu,
/// Use learned policy (in-kernel inference, Section 22.4).
Learned,
}
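A minimal sketch of how an `OnThreshold` trigger might be evaluated. The `FaultHistory` record and `should_migrate` helper are illustrative assumptions, not the kernel implementation, which tracks fault counts per-page rather than keeping full timestamp lists:

```rust
// Hypothetical evaluation of MigrationTrigger::OnThreshold: migrate once
// a page has faulted `threshold` times within the trailing `window_ms`.
pub struct FaultHistory {
    pub fault_times_ms: Vec<u64>, // timestamps of recent faults on this page
}

pub fn should_migrate(h: &FaultHistory, threshold: u32, window_ms: u32, now_ms: u64) -> bool {
    let recent = h
        .fault_times_ms
        .iter()
        .filter(|&&t| now_ms.saturating_sub(t) <= window_ms as u64)
        .count();
    recent as u32 >= threshold
}

fn main() {
    let h = FaultHistory { fault_times_ms: vec![100, 150, 180] };
    // Three faults within a 100ms window at t=200: migrate.
    assert!(should_migrate(&h, 3, 100, 200));
    // Only two of them fall inside a 60ms window: stay put.
    assert!(!should_migrate(&h, 3, 60, 200));
}
```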
Three-tier migration hierarchy:
| Tier | Mechanism | Trigger | Latency | Use case |
|---|---|---|---|---|
| 1: Demand-paged | Device page fault → ATS → IOMMU → kernel → migrate | First access | ~10-50μs/page | Default for all allocations |
| 2: Prefetch hints | `ACCEL_PREFETCH` ioctl → background migration | Userspace hint before submit | Amortized (async) | ML framework pre-stages weight tensors |
| 3: Pinned residency | `ACCEL_MAP_BO` with `MAP_DEVICE_PINNED` flag | Explicit userspace request | One-time migration | Persistent buffers (frame buffers, KV cache) |
Tier 1 is the baseline — no hints required. Tier 2 (ACCEL_PREFETCH, defined in
Section 22.3) is best-effort: the kernel
pre-migrates pages if device memory is available; otherwise falls back to Tier 1.
Tier 3 guarantees residency for the lifetime of the mapping but returns -ENOMEM
if device memory is insufficient.
LRU eviction detail:
/// Intrusive doubly-linked list link for per-page LRU tracking.
/// Stored alongside each `PageLocation::DeviceLocal` entry in a parallel
/// radix tree (`PageLocationTracker::lru_links`), keyed by VPN.
/// Only pages currently resident in device memory have a valid `AccelLruLink`;
/// pages in CPU memory, swap, or migration have no LRU link allocated.
///
/// O(1) insertion, removal, and move-to-head:
/// - **Page fault (access)**: unlink from current position, relink at head.
/// - **Eviction**: unlink from tail.
/// - **Migration out**: unlink from wherever it sits.
/// No heap allocation per LRU operation — links are pre-allocated in the
/// radix tree when the page is first placed in device memory.
pub struct AccelLruLink {
    /// VPN of the previous (hotter) page in the LRU list, or `None` if this is the head.
    pub lru_prev: Option<u64>,
    /// VPN of the next (colder) page in the LRU list, or `None` if this is the tail.
    pub lru_next: Option<u64>,
}
/// Per-device LRU tracking for memory eviction.
pub struct AccelEvictionState {
/// LRU list: tail = coldest. Updated on every page fault (accessed page
/// moves to head). Implemented as an intrusive doubly-linked list through
/// `AccelLruLink` entries stored per-page in `PageLocationTracker::lru_links`.
/// No separate allocation per LRU operation.
pub lru_head: Option<u64>, // VPN of most-recently-accessed page
pub lru_tail: Option<u64>, // VPN of least-recently-accessed page
/// Total number of pages on the LRU list (for diagnostic/threshold checks).
pub lru_count: AtomicU64,
/// Number of pinned pages (excluded from eviction).
pub pinned_count: AtomicU64,
/// Total device memory capacity (bytes).
pub device_memory_bytes: u64,
/// Currently allocated device memory (bytes).
pub allocated_bytes: AtomicU64,
/// Spinlock protecting LRU list mutations (head/tail/link updates).
/// Held only during pointer updates (tens of nanoseconds).
/// Page fault hot path acquires this once to move the page to LRU head.
pub lru_lock: SpinLock<()>,
}
Eviction procedure (when device memory is full and a migration or pinned allocation needs space):

- Scan LRU from tail (coldest pages).
- Skip: pinned pages (`MAP_DEVICE_PINNED`), pages with active in-flight submissions (checked via `PageLocationTracker.migration_epochs`).
- Evict: DMA copy page content to CPU DRAM, unmap from device page tables, update `PageLocationTracker` to `CpuNode(nid)`, move entry off LRU.
- Batch: evict up to 64 pages per scan to amortize overhead (~5-15μs per page for device→CPU DMA + TLB invalidation).
- If LRU exhausted (all pages pinned or in-flight): return `-ENOMEM`.
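The intrusive LRU above can be modeled in isolation. This sketch uses a `HashMap` in place of the per-page radix tree of `AccelLruLink`s, and the method names (`touch`, `evict`) are hypothetical:

```rust
use std::collections::HashMap;

// Self-contained model of the per-device intrusive LRU list.
#[derive(Clone, Copy)]
struct Link {
    prev: Option<u64>, // toward the head (hotter)
    next: Option<u64>, // toward the tail (colder)
}

pub struct Lru {
    links: HashMap<u64, Link>,
    head: Option<u64>, // most-recently-accessed VPN
    tail: Option<u64>, // least-recently-accessed VPN
}

impl Lru {
    pub fn new() -> Self {
        Lru { links: HashMap::new(), head: None, tail: None }
    }

    // O(1) unlink from the current position.
    fn unlink(&mut self, vpn: u64) {
        let l = self.links[&vpn];
        match l.prev {
            Some(p) => self.links.get_mut(&p).unwrap().next = l.next,
            None => self.head = l.next,
        }
        match l.next {
            Some(n) => self.links.get_mut(&n).unwrap().prev = l.prev,
            None => self.tail = l.prev,
        }
    }

    // Page fault: move (or insert) vpn at the head.
    pub fn touch(&mut self, vpn: u64) {
        if self.links.contains_key(&vpn) {
            self.unlink(vpn);
        }
        let old_head = self.head;
        self.links.insert(vpn, Link { prev: None, next: old_head });
        if let Some(h) = old_head {
            self.links.get_mut(&h).unwrap().prev = Some(vpn);
        }
        self.head = Some(vpn);
        if self.tail.is_none() {
            self.tail = Some(vpn);
        }
    }

    // Eviction: remove and return the coldest VPN (tail), if any.
    pub fn evict(&mut self) -> Option<u64> {
        let victim = self.tail?;
        self.unlink(victim);
        self.links.remove(&victim);
        Some(victim)
    }
}

fn main() {
    let mut lru = Lru::new();
    lru.touch(1);
    lru.touch(2);
    lru.touch(3);
    lru.touch(1); // re-fault: page 1 moves back to the head
    assert_eq!(lru.evict(), Some(2)); // 2 is now the coldest
    assert_eq!(lru.evict(), Some(3));
    assert_eq!(lru.evict(), Some(1));
    assert_eq!(lru.evict(), None);
}
```

The kernel version differs in that the links live in the radix tree alongside the page entries and mutations happen under `lru_lock`, but the pointer manipulation is the same.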
CPU-GPU memory consistency model:
UmkaOS does NOT provide hardware cache coherency between CPU and discrete GPU memory (PCIe is non-coherent; CXL Type 2 coherent devices are the exception). The consistency model is explicit, matching CUDA/OpenCL/Vulkan semantics:
- CPU writes to GPU-resident pages: NOT visible to the GPU until the page is migrated back to CPU (invalidating the GPU mapping) and then re-migrated to the GPU. There is no snooping protocol over PCIe.
- GPU writes to pages: NOT visible to the CPU until a fence signals completion AND the page is migrated back to CPU (triggered by CPU access fault, or explicit `ACCEL_PREFETCH` to CPU).
- Fences are the synchronization primitive: GPU fence signal → CPU may safely read GPU-written data (after reverse migration). This is the standard model for all discrete accelerators.
- CXL Type 2 devices (`PeerCoordinationMode::HardwareCoherent`): CPU and device share a cache-coherent view via the CXL.cache back-invalidation protocol. No explicit migration or fencing for coherency. The migration policy still applies for placement optimization (moving pages closer to the accessor) but not for correctness.
Unified accelerator coherence model: The following table summarizes the coherence mechanism, ordering, and latency for every CPU/device data path:
| Path | Granularity | Mechanism | Ordering | Latency |
|---|---|---|---|---|
| CPU to GPU | Page (4KB) | IOMMU snoop / explicit cache flush (`dma_sync_single_for_device`) | Release-Acquire via DMA fence | ~1-5 μs |
| GPU to CPU | Page (4KB) | Device-initiated invalidate + DMA fence signal | Acquire on fence wait (`dma_fence_wait()`) | ~1-5 μs |
| GPU to GPU (same `peer_group`) | 64KB | P2P BAR aperture, no CPU involvement | Device-specific (NVLink: total store order; PCIe: relaxed, requires explicit fence) | ~0.3-2 μs |
| GPU to GPU (cross `peer_group`) | Page (4KB) | CPU bounce via staged DMA (device A to CPU, CPU to device B) | Release-Acquire via two DMA fences (one per leg) | ~5-20 μs |
| CPU to DMA ring | Cache line (64B) | Write-combine or coherent mapping | Release store + MMIO doorbell write | ~0.1 μs |
| CXL Type 2 device to CPU | Cache line (64B) | CXL.cache back-invalidation (hardware-coherent) | Sequential consistency (CXL.cache protocol) | ~0.1-0.5 μs |
For discrete GPUs (PCIe-attached, non-coherent), all CPU-GPU data transfer requires
explicit synchronization via DMA fences. The kernel's AccelMigrate subsystem
handles page migration and fence management transparently; userspace (CUDA/Vulkan)
sees a unified address space and synchronizes via standard fence/event APIs.
22.2.1.8 Memory Oversubscription¶
When a model is larger than device VRAM:
Total model: 140GB (70B params, FP16)
GPU VRAM: 24GB
CPU RAM available: 256GB
Strategy:
1. Most-accessed layers (attention heads, active KV cache) → GPU VRAM
2. Less-accessed layers (early embedding, output projection) → CPU RAM
3. Migration on demand: when GPU needs a page in CPU RAM, migrate it
and evict a cold page from VRAM back to CPU RAM
The kernel manages this transparently. The ML framework sees a unified
140GB address space. Page migration is handled by the kernel's device
fault handler, exactly as CPU demand paging handles having more virtual
memory than physical RAM.
This is the accelerator equivalent of virtual memory. The kernel has been managing memory oversubscription since the 1960s. Now it does it for GPU VRAM too.
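As a sanity check on the figures above, the arithmetic can be run directly (decimal GB assumed):

```rust
// Back-of-envelope check of the oversubscription example.
fn main() {
    let params: u64 = 70_000_000_000; // 70B parameters
    let model_bytes = params * 2;     // FP16 = 2 bytes per parameter
    let model_gb = model_bytes / 1_000_000_000;
    assert_eq!(model_gb, 140);        // matches the 140GB figure

    let vram_gb: u64 = 24;
    let resident_pct = vram_gb * 100 / model_gb;
    assert_eq!(resident_pct, 17);     // ~17% of the model fits in VRAM
    // The remaining ~83% lives in CPU RAM and migrates on demand.
}
```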
22.2.1.9 Huge Page Support for Tensors¶
ML tensors benefit from huge pages (fewer TLB misses, better DMA performance):
/// Page size selection for accelerator memory allocations.
/// This is a separate enum (not bitflags) because page sizes are mutually
/// exclusive — a single allocation uses exactly one page size.
#[repr(u32)]
pub enum AccelPageSize {
/// Standard 4KB pages.
Standard = 0,
/// 2MB huge pages.
HugePage2M = 1,
/// 1GB huge pages.
HugePage1G = 2,
/// Device-optimal page size (device driver selects the best size).
DeviceOptimal = 3,
}
/// Accelerator memory allocation modifier flags (OR-combinable).
/// Combined with a separate AccelPageSize field to fully describe
/// an allocation request.
bitflags::bitflags! {
#[repr(transparent)]
pub struct AccelMemFlags: u32 {
/// Non-migratable: pin in device memory, never evict to CPU.
const PINNED = 0x100;
/// CPU-visible: map into CPU address space via BAR.
const CPU_VISIBLE = 0x200;
/// Coherent: CPU-device coherent (requires CXL or similar).
const COHERENT = 0x400;
/// Allocation may sleep (normal process context). Default for
/// ioctl-driven allocations from user threads. Allows the
/// allocator to block on VRAM eviction, compaction, or
/// migration. Must NOT be set in interrupt or softirq context.
const CAN_SLEEP = 0x800;
/// Allocation must not sleep (interrupt context or under
/// spinlock). Uses pre-reserved emergency VRAM pool. Returns
/// `Err(AccelAllocError::NoMem)` immediately if no memory is
/// available. Analogous to `GfpFlags::ATOMIC_ALLOC` in the
/// CPU page allocator ([Section 4.2](04-memory.md#physical-memory-allocator)).
const ATOMIC = 0x1000;
}
}
// Usage: page_size = AccelPageSize::HugePage2M,
// flags = AccelMemFlags::PINNED | AccelMemFlags::CPU_VISIBLE
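One plausible invariant over these flags is resolving the calling context. This is a hedged sketch: `effective_can_sleep` is a hypothetical helper, flags are modeled as plain `u32` bits to stay dependency-free, and treating "neither flag set" as a sleeping context (per the CAN_SLEEP default noted above) is an assumption:

```rust
// Hypothetical context-flag resolution for an allocation request.
const CAN_SLEEP: u32 = 0x800;
const ATOMIC: u32 = 0x1000;

pub fn effective_can_sleep(flags: u32) -> Result<bool, &'static str> {
    match (flags & CAN_SLEEP != 0, flags & ATOMIC != 0) {
        (true, true) => Err("CAN_SLEEP and ATOMIC are mutually exclusive"),
        (_, true) => Ok(false), // atomic: emergency VRAM pool, never blocks
        _ => Ok(true),          // explicit CAN_SLEEP, or default process context
    }
}

fn main() {
    const PINNED: u32 = 0x100;
    assert_eq!(effective_can_sleep(PINNED | CAN_SLEEP), Ok(true));
    assert_eq!(effective_can_sleep(ATOMIC), Ok(false));
    assert!(effective_can_sleep(CAN_SLEEP | ATOMIC).is_err());
    assert_eq!(effective_can_sleep(PINNED), Ok(true)); // defaults to sleeping
}
```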
22.2.1.10 Multi-Device Memory Coherence¶
When multiple GPUs access the same virtual address range (e.g., tensor parallelism across GPUs on the same machine), the kernel must manage coherence between device memories.
The PageLocation enum already includes DeviceLocal for single-device pages. For
read-only copies replicated across multiple GPUs, the kernel uses a single-owner
coherence protocol with explicit synchronization:
- Single-owner with read sharing: At any time, one device owns the writable copy. Other devices may hold read-only copies. This mirrors the DSM protocol in Section 6.2.
- Per-page ownership lock and wait queue: Each shared page has a per-page spinlock in the `PageLocationTracker` that serializes ownership transitions, plus a per-page wait queue for faulters that arrive during a migration. The spinlock protects only the page metadata (state, owner, reader list) — it is held for tens of nanoseconds, never across I/O. When a device faults on a page, the fault handler acquires this lock to inspect or modify the page's `PageLocation` state. If the page is in `Migrating` state, the faulter releases the spinlock and blocks on the per-page wait queue (not spinning) until the migration completes and wakes all waiters. The lock is per-page (not per-device or global), so two threads faulting on different pages never contend. For hot shared tensors where multiple threads fault on the same page simultaneously, only the first faulter performs the migration; subsequent faulters block on the wait queue.
- Explicit ownership transfer protocol: When a device writes to a page owned by another device:
  1. Fault handler acquires the per-page ownership spinlock.
  2. A `MigrationRecord` (source and target) is allocated in the side table and the page state transitions to `PageLocation::Migrating { migration_id }`. The `PageLocationEntry`'s `migration_epoch` counter is incremented. The migrator captures this epoch value before releasing the spinlock.
  3. Per-page ownership spinlock is released. The `Migrating` state prevents other faulters from modifying the page — they will see `Migrating` and block on the per-page wait queue until the migration completes.
  3a. Pre-revocation fence: Before revoking the source mapping (step 4), the kernel issues a device-specific quiesce command via `preempt_context()` on the source device's active context. This ensures the source device has completed all in-flight writes to the page before the DMA copy reads the data. The fence is verified via `DmaFence` completion before proceeding to step 5.
  4. The owning device's mapping is revoked (device page table unmap via driver callback). This is a slow operation (microseconds) and runs outside the spinlock.
  5. P2P DMA transfers the page data to the new owner (Section 22.4). This may take tens to hundreds of microseconds and also runs outside the spinlock.
  6. All devices holding read-only copies are invalidated via the AccelScheduler (which tracks all active contexts for a given address space). Each invalidation unmaps the page from the device's page table via the driver's callback.
  7. Fault handler re-acquires the per-page ownership spinlock. Stale migration detection: the migrator checks that `page.migration_epoch` still matches the epoch value captured at step 2. If the epoch has changed (because a crash recovery or concurrent migration reset the page state), the migration is stale: the migrator discards the DMA result and returns `EAGAIN` to trigger a fresh fault. This prevents stale migration completions from corrupting page state after concurrent recovery paths have already resolved the page.
  8. The new owner's mapping is installed and page state transitions to `DeviceLocal`.
  9. Per-page ownership spinlock is released.
  10. All waiters on the per-page wait queue are woken. They re-fault and see the new `DeviceLocal` state.
Steps 4-6 (revocation, DMA transfer, and reader invalidation) run outside the
spinlock, so they do not block other faulters on unrelated pages or cause
spin-time proportional to I/O latency. Faulters that arrive during steps 4-6
see Migrating state and sleep on the wait queue instead of spinning.
Steps 4-6 must complete before step 8 (new mapping). This is a write-invalidate
protocol — the same model used by CPU cache coherence (MOESI) and the DSM
protocol in Section 6.2, adapted for device memory.
migration_epoch field: Each page tracking entry in PageLocationTracker
includes a migration_epoch: u64 counter. This counter is incremented at step 2
each time a page transitions to Migrating. The migrator captures the epoch
value after the increment and before releasing the spinlock. At step 7, after
re-acquiring the spinlock, the migrator compares the current migration_epoch
against the captured value. A mismatch indicates that a device crash + recovery
or a concurrent migration has intervened during the lock-free DMA window and has
already resolved (or restarted) the page migration independently. In that case
the DMA result is discarded and EAGAIN is returned; the faulting device will
re-fault and enter a fresh migration sequence.
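The epoch capture-and-compare logic can be modeled in a few lines. `begin_migration` and `complete_migration` are hypothetical names; in the kernel both steps run under the per-page spinlock:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Minimal model of the migration_epoch stale-completion check.
pub fn begin_migration(epoch: &AtomicU64) -> u64 {
    // Step 2: increment on the transition to Migrating, and capture the
    // new value before the spinlock is released.
    epoch.fetch_add(1, Ordering::SeqCst) + 1
}

pub fn complete_migration(epoch: &AtomicU64, captured: u64) -> Result<(), &'static str> {
    // Step 7: a mismatch means crash recovery or a concurrent migration
    // intervened during the lock-free DMA window; discard and re-fault.
    if epoch.load(Ordering::SeqCst) == captured {
        Ok(())
    } else {
        Err("EAGAIN")
    }
}

fn main() {
    let epoch = AtomicU64::new(0);

    // Normal path: no intervening recovery, completion is accepted.
    let e = begin_migration(&epoch);
    assert!(complete_migration(&epoch, e).is_ok());

    // A concurrent recovery restarts the migration: the original
    // completion is now stale and must be discarded.
    let e = begin_migration(&epoch);
    begin_migration(&epoch);
    assert_eq!(complete_migration(&epoch, e), Err("EAGAIN"));
}
```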
- Migration error and timeout handling: Migration is a multi-step I/O operation (device page table manipulation + DMA transfer) that can fail at any point. Without explicit recovery, a page can be stuck permanently in `Migrating` state, blocking all faulters on its wait queue indefinitely.
Timeouts: Every migration has a deadline: 100ms for intra-node transfers
(devices on the same PCIe root complex or NVLink fabric), 1s for cross-node
transfers (remote devices via RDMA or CXL). The fault handler starts a kernel
timer when it transitions the page to Migrating (step 2 above). If the timer
fires before the migration completes (step 8), the timeout handler initiates
rollback.
Migration result:
/// Outcome of a page migration attempt.
#[repr(u8)]
pub enum MigrationResult {
/// Migration completed successfully. Page is now owned by the target device.
Success,
/// Migration failed (DMA timeout, device error, IOMMU fault). Page has been
/// rolled back to its original location on the source device. Waiters are
/// woken to re-fault and will find the page at its original location.
RollbackToSource,
/// Source device crashed or was reset during migration. The page data in
/// device memory is lost. Page is marked `NotPresent` so the next fault
/// reloads from backing store (CPU memory copy or disk).
SourceLost,
}
Failure recovery by case:

- DMA timeout or destination device error (most common — e.g., P2P DMA times out, destination device returns an error during page write):
  1. Cancel or abandon the in-flight DMA transfer.
  2. The source device still holds the original page data (it was not unmapped until the transfer completes).
  3. Re-acquire the per-page ownership spinlock.
  4. Transition page state back from `Migrating` to its original `DeviceLocal` or `CpuNode` state (the `MigrationRecord` in the side table, looked up via the `Migrating` variant's `migration_id`, records exactly where to roll back to).
  5. Release the per-page ownership spinlock.
  6. Wake all waiters on the per-page wait queue. They re-fault and find the page at its original location.
  7. Increment `migration_failure_count` in the `PageLocationTracker` stats.
  8. Result: `MigrationResult::RollbackToSource`.

- Source device crash during migration (rare — the source device is reset or its driver crashes while the page is in `Migrating` state):
  1. The page data in device memory is physically lost (device reset destroys VRAM contents — see Section 22.3).
  2. The DMA transfer (if in progress) is aborted by the device reset.
  3. Re-acquire the per-page ownership spinlock.
  4. Transition page state to `PageLocation::NotPresent`.
  5. Release the per-page ownership spinlock.
  6. Wake all waiters on the per-page wait queue with an error indication. Waiters re-fault and trigger a fresh page-in from backing store: if the page has a CPU memory copy (e.g., it was migrated to the device from CPU RAM and the CPU copy was preserved, or the page is file-backed), the fault handler loads from the backing store. If no backing store exists (anonymous page that was only in device memory), the owning process receives `SIGBUS`.
  7. Increment `migration_failure_count` and `source_lost_count` in stats.
  8. Result: `MigrationResult::SourceLost`.

- Waiter notification on failure: Waiters on the per-page wait queue are woken with an error code (`MigrationFailed` or `SourceDeviceLost`) embedded in the wait queue wake event. On wakeup, each waiter re-enters the fault handler, re-acquires the per-page spinlock, and inspects the current page state. For `RollbackToSource`, the page is back at its original location and the waiter proceeds normally (read-share or initiate a fresh migration attempt). For `SourceLost`, the page is `NotPresent` and the waiter triggers a fresh fault resolution (page-in from backing store or `SIGBUS`).

- Repeated migration failures: If a page accumulates more than 3 migration failures within a 60-second window (tracked per-page in `PageLocationTracker`), the page is marked as non-migratable (pinned at its current location) for a cooldown period (5 minutes). This prevents pathological retry loops on pages that consistently fail to migrate (e.g., due to a flaky P2P DMA path).
Fault handler behavior by `MigrationResult`:

| MigrationResult | Fault Handler Action | Waiter Wakeup | Retry Policy |
|---|---|---|---|
| `Success` | Install new mapping, update TLB, return to user | Wake with `MIGRATION_SUCCESS` | N/A — migration complete |
| `RollbackToSource` | Discard partial migration, restore source mapping, retry from backing store if needed | Wake with `MIGRATION_ROLLBACK` — waiter re-faults and finds page at original location | Automatic retry on next fault (up to 3 times within 60s) |
| `SourceLost` | Mark page `NotPresent`, signal `SIGBUS` if no backing store exists | Wake with `MIGRATION_SOURCE_LOST` — waiter re-faults from backing store or receives `SIGBUS` | No automatic retry — page must be replenished from backing store first |
Retry details:
- RollbackToSource triggers an automatic retry on the next page fault from the same device. This is transparent to userspace.
- After 3 RollbackToSource failures within 60 seconds, the page is marked NON_MIGRATABLE and pinned at its current location.
- SourceLost does NOT trigger automatic retry — the page must be replenished from backing store (CPU RAM or disk) before migration can be attempted again.
- Userspace receives SIGBUS only if: (1) SourceLost occurs, AND (2) no backing store exists (anonymous page that existed only in device memory).
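The retry rules above can be distilled into a single predicate. `may_auto_retry` and `rollbacks_in_window` are hypothetical names for illustration; the kernel tracks the failure window per-page:

```rust
// Distillation of the per-result retry policy.
#[derive(Debug, PartialEq)]
pub enum MigrationResult {
    Success,
    RollbackToSource,
    SourceLost,
}

pub fn may_auto_retry(result: &MigrationResult, rollbacks_in_window: u32) -> bool {
    match result {
        // Nothing to retry.
        MigrationResult::Success => false,
        // Transparent retry on the next fault, up to 3 failures per 60s.
        MigrationResult::RollbackToSource => rollbacks_in_window < 3,
        // Page data is gone: must be replenished from backing store first.
        MigrationResult::SourceLost => false,
    }
}

fn main() {
    assert!(may_auto_retry(&MigrationResult::RollbackToSource, 2));
    assert!(!may_auto_retry(&MigrationResult::RollbackToSource, 3)); // pinned non-migratable
    assert!(!may_auto_retry(&MigrationResult::SourceLost, 0));
}
```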
/// Per-page migration failure tracking.
pub struct PageLocationTracker {
    // ...existing fields...
}

impl PageLocationTracker {
    /// Per-page failure count (upper 2 bits of metadata field).
    /// Reset after 60 seconds of successful migrations.
    fn migration_failure_count(&self) -> u8 { ... }

    /// Increment failure count and check if threshold exceeded.
    fn record_failure(&self) -> bool { // returns true if threshold exceeded
        let count = self.migration_failure_count();
        if count >= 3 {
            self.mark_non_migratable();
            return true;
        }
        self.set_migration_failure_count(count + 1);
        false
    }

    /// Reset failure count after successful migration.
    fn record_success(&self) {
        self.set_migration_failure_count(0);
        self.last_success_time.store(now(), Relaxed);
    }

    /// Check if failure window has expired (60 seconds).
    fn failure_window_expired(&self) -> bool {
        let last = self.last_success_time.load(Relaxed);
        now() - last > 60_000_000_000 // 60 seconds in ns
    }
}
- Read-sharing acquisition: When a device reads a page owned by another device, the fault handler acquires the per-page spinlock and checks the page state. If the page is in `Migrating` state, the faulter releases the spinlock and blocks on the per-page wait queue (same as for write faults). Otherwise, the faulter records the additional reader in the page metadata, releases the spinlock, then performs the P2P DMA copy to create a read-only replica on the requesting device (outside the spinlock). No ownership change occurs.
- Distributed cluster coherence: For multi-device coherence across machines (not just local GPUs), the per-page spinlock is replaced by the DSM directory protocol (Section 6.2, 05-distributed.md). The DSM protocol already provides single-owner / multi-reader semantics with a home-node directory that serializes ownership transitions across network boundaries — the same semantics as the local per-page spinlock, but designed for cross-node operation. Local multi-GPU coherence uses the lightweight per-page spinlock and wait queue; cross-machine coherence uses the DSM directory. The DLM (Section 5.1) is for filesystem-level distributed locking, not page-level coherence. The write-invalidate coherence protocol is identical in both cases — only the ownership serialization mechanism differs (local spinlock vs. DSM directory).
/// MSI-like coherence state for a single page in the multi-GPU coherence
/// protocol. Transitions are serialized by `GpuCoherenceEntry::lock`.
///
/// State machine:
/// Invalid → Shared (GPU reads page: fetch from owner, add to sharers bitmask)
/// Invalid → Modified (GPU writes page: acquire exclusive ownership)
/// Shared → Modified (GPU writes page: invalidate all other sharers first)
/// Shared → Invalid (eviction or explicit invalidation from owner)
/// Modified → Shared (another GPU reads: owner downgrades, supplies data)
/// Modified → Invalid (eviction or owner releases page)
#[repr(u8)]
pub enum GpuPageState {
/// Page is not present in this GPU's address space. Any access faults.
Invalid = 0,
/// Page is read-shared across one or more GPUs. Multiple GPUs may hold
/// `Shared` state simultaneously. Write access requires promotion to
/// `Modified` (which invalidates all other sharers).
Shared = 1,
/// Page is exclusively owned by one GPU for read/write access.
/// No other GPU has a valid mapping. The owner's `GpuCoherenceEntry::owner_gpu`
/// identifies the exclusive holder.
Modified = 2,
}
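The transition table above can be expressed as a small validity check. This is an illustrative sketch, not part of the spec's KABI — the `transition_allowed` helper is hypothetical, and the enum is re-declared locally so the example stands alone:

```rust
/// Local copy of the page-state enum for this standalone sketch.
#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(u8)]
pub enum GpuPageState {
    Invalid = 0,
    Shared = 1,
    Modified = 2,
}

/// Hypothetical helper: true iff `from → to` is one of the six legal
/// transitions in the MSI state machine documented above.
pub fn transition_allowed(from: GpuPageState, to: GpuPageState) -> bool {
    use GpuPageState::*;
    matches!(
        (from, to),
        (Invalid, Shared)     // read fault: fetch from owner
        | (Invalid, Modified) // write fault: acquire exclusive ownership
        | (Shared, Modified)  // write fault: invalidate other sharers first
        | (Shared, Invalid)   // eviction / explicit invalidation
        | (Modified, Shared)  // remote read: owner downgrades, supplies data
        | (Modified, Invalid) // eviction / owner releases page
    )
}
```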
/// Per-page coherence tracking entry for multi-GPU systems.
/// Tracks ownership and sharing state for a single page across GPUs.
// Field accounting (all offsets in bytes):
// owner_gpu: 1 (u8) running: 1
// state: 1 (GpuPageState = u8) running: 2
// _pad: 2 ([u8; 2], aligns u32 to offset 4) running: 4
// sharers: 4 (u32) running: 8
// version: 8 (u64) running: 16
// phys_addr: 8 (u64) running: 24
// lock: 4 (AtomicU32) running: 28
// _pad2: 36 ([u8; 36]) running: 64
// Total: 64 bytes, one cache line
#[repr(C)]
pub struct GpuCoherenceEntry {
/// GPU that currently owns this page (exclusive write access).
/// 0xFF = no owner (shared read state).
pub owner_gpu: u8,
/// State: Invalid, Shared, Modified (MSI-like protocol).
pub state: GpuPageState,
/// Explicit padding to align `sharers` to a 4-byte boundary.
pub _pad: [u8; 2],
/// Bitmask of GPU indices that have this page in their TLBs.
/// Bit i set = GPU i has a valid mapping. Supports up to 32 GPUs.
/// For systems with >32 GPUs, use GpuCoherenceEntryLarge.
pub sharers: u32,
/// Version counter (incremented on each ownership transfer).
pub version: u64,
/// Physical address of the page in the owning GPU's VRAM.
pub phys_addr: u64,
/// Lock for ownership transitions (spinlock, held only during transfer).
pub lock: AtomicU32,
/// Padding to reach exactly 64 bytes (one cache line).
pub _pad2: [u8; 36],
}
// kernel-internal, not KABI — GPU coherence tracking, never crosses driver boundary.
const_assert!(core::mem::size_of::<GpuCoherenceEntry>() == 64);
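The `sharers` bitmask is what drives invalidation on a Shared → Modified promotion: every bit except the writer's must be cleared after TLB invalidations are sent to those GPUs. A minimal standalone sketch of the bit arithmetic, assuming GPU indices 0–31 (helper names are illustrative, not spec symbols):

```rust
/// Record that `gpu` holds a valid mapping (Shared state).
fn add_sharer(sharers: u32, gpu: u8) -> u32 {
    debug_assert!(gpu < 32);
    sharers | (1u32 << gpu)
}

/// On write promotion, compute the set of GPUs whose TLB entries must
/// be invalidated: every current sharer except the writer itself.
fn invalidation_targets(sharers: u32, writer: u8) -> u32 {
    sharers & !(1u32 << writer)
}

/// After invalidations complete, only the writer remains mapped.
fn promote_to_modified(writer: u8) -> u32 {
    1u32 << writer
}
```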
/// Large variant of `GpuCoherenceEntry` for systems with >32 GPUs.
/// Uses a wider `sharers` bitmask and adds a page-range representation
/// for coalesced huge-page coherence tracking.
///
/// When `gpu_count > 32`, the coherence subsystem allocates a
/// `GpuCoherenceTableLarge` backed by `GpuCoherenceEntryLarge` entries
/// instead of `GpuCoherenceEntry`. The two types are not mixed within
/// a single table — the table format is chosen at device-group creation
/// time based on the number of registered GPUs.
// Field accounting (all offsets in bytes):
// owner_gpu: 1 (u8) running: 1
// state: 1 (GpuPageState) running: 2
// flags: 1 (u8 bitflags) running: 3
// _pad: 5 ([u8; 5], aligns u64 array to offset 8) running: 8
// sharers: 32 ([u64; 4]) running: 40
// page_range_start: 8 (u64) running: 48
// page_range_end: 8 (u64) running: 56
// version: 8 (u64) running: 64
// phys_addr: 8 (u64) running: 72
// lock: 4 (AtomicU32) running: 76
// _pad3: 52 ([u8; 52]) running: 128
// Total: 128 bytes, two cache lines
#[repr(C)]
pub struct GpuCoherenceEntryLarge {
    /// GPU that currently owns this page range (exclusive write access).
    /// 0xFF = no owner (shared read state).
    pub owner_gpu: u8,
    /// State: Invalid, Shared, Modified (MSI-like protocol).
    pub state: GpuPageState,
    /// Entry flags.
    /// Bit 0: `COALESCED` — this entry covers a contiguous range of pages
    /// (`page_range_start..page_range_end`) rather than a single page.
    /// When clear, `page_range_start == page_range_end` (single page).
    /// Bits 1-7: reserved, must be zero.
    pub flags: u8,
    /// Explicit padding to align `sharers` to an 8-byte boundary
    /// (required by the `u64` array under `repr(C)`).
    pub _pad: [u8; 5],
    /// Bitmask of GPU indices that have this page/range in their TLBs.
    /// 4 × u64 = 256 bits, supporting up to 256 GPUs. Bit i set = GPU i
    /// has a valid mapping. Indexed as `sharers[i / 64] & (1 << (i % 64))`.
    pub sharers: [u64; 4],
    /// First virtual page number in the tracked range (inclusive).
    /// For single-page entries, `page_range_start == page_range_end`.
    pub page_range_start: u64,
    /// Last virtual page number in the tracked range (inclusive).
    pub page_range_end: u64,
    /// Version counter (incremented on each ownership transfer).
    pub version: u64,
    /// Physical address of the first page in the owning GPU's VRAM.
    /// Subsequent pages in the range are at contiguous physical addresses.
    pub phys_addr: u64,
    /// Lock for ownership transitions (spinlock, held only during transfer).
    pub lock: AtomicU32,
    /// Padding to reach exactly 128 bytes (two cache lines).
    pub _pad3: [u8; 52],
}
// kernel-internal, not KABI — GPU coherence tracking for >32 GPU systems.
const_assert!(core::mem::size_of::<GpuCoherenceEntryLarge>() == 128);
Huge page requirement for multi-GPU coherence: The per-page coherence tracking
structures (GpuCoherenceEntry, ~64 bytes per page) mandate huge pages (2 MB) when
tracking >4 GPUs simultaneously. For a 4-GPU system with 256 GB VRAM total (64 GB per GPU,
16M pages per GPU), the coherence table requires ~1 GB. Using 4 KB pages for this
allocation would consume 256K page table entries and cause TLB thrashing during
coherence lookups. The allocator automatically promotes coherence table allocations
to 2 MB huge pages when gpu_count >= 4. On systems without huge page support, the
coherence table is capped at 4 GPUs; attempting to register a 5th GPU returns -ENOMEM.
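The sizing in the paragraph above can be checked with a few lines of arithmetic. This sketch mirrors the worked example (64 GB of VRAM tracked per GPU, 4 KB base pages, 64-byte coherence entries); the helper names are illustrative:

```rust
/// Bytes of coherence table needed to track `vram_bytes` of VRAM at
/// 4 KB page granularity with 64-byte `GpuCoherenceEntry` entries.
pub fn coherence_table_bytes(vram_bytes: u64) -> u64 {
    const PAGE_SIZE: u64 = 4 << 10; // 4 KB
    const ENTRY_SIZE: u64 = 64;     // one cache line per entry
    (vram_bytes / PAGE_SIZE) * ENTRY_SIZE
}

/// Page-table entries needed to map the coherence table itself when
/// the table is backed by pages of size `map_page` (4 KB or 2 MB).
pub fn mapping_entries(table_bytes: u64, map_page: u64) -> u64 {
    (table_bytes + map_page - 1) / map_page // round up
}
```

With 2 MB huge pages the same 1 GB table needs only 512 mapping entries instead of 256K, which is the TLB-pressure argument behind the automatic promotion.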
For the common case of gradient all-reduce (all GPUs read the same weights but write different gradients), the kernel keeps weights as shared-read on all devices and only migrates gradient pages on write. Since each GPU writes to its own gradient partition (non-overlapping address ranges), per-page locks are uncontended in the common case.
22.4.1.11 VRAM Allocation Model¶
Accelerator device-local memory (VRAM) is not managed by the kernel's buddy
allocator. VRAM is physically separate from system RAM and is accessed through the
GPU's memory controller; the kernel's page allocator has no physical address mapping
for it. Instead, VRAM allocation is delegated to the device driver via the
AccelBaseVTable::alloc_device_memory() callback.
Driver-side allocator: Each accelerator driver implements its own VRAM allocator internally (typically a buddy allocator or slab allocator over the VRAM address range). The kernel does not prescribe the internal algorithm — only the KABI interface contract:
/// VRAM allocation contract (enforced by AccelBase KABI):
///
/// 1. alloc_device_memory() returns an opaque AccelMemHandle.
/// The handle is per-context — it cannot be used from another context
/// unless explicitly shared via export_dma_buf() (Section 22.2.1.11).
///
/// 2. The driver tracks per-context VRAM usage. The kernel queries it
/// via get_device_stats() → AccelDeviceStats::memory_used_bytes.
///
/// 3. On context destruction, the driver MUST free all VRAM allocated
/// to that context. Leaked VRAM is a driver bug; the kernel logs a
/// warning and forces deallocation via reset_device().
///
/// 4. GFP flags are NOT used for VRAM allocation. VRAM is a fixed-size
/// resource without the kernel's zone/watermark model. The driver
/// returns IO_ERR_NOMEM if VRAM is exhausted.
OOM interaction: The kernel OOM killer cannot directly reclaim VRAM — it has no visibility into device-internal memory. However, UmkaOS provides a two-level VRAM pressure response:
- Device-level eviction (transparent, managed by `PageLocationTracker`): When `alloc_device_memory()` returns `IO_ERR_NOMEM`, the HMM migration layer evicts cold pages from VRAM to CPU RAM using the `eviction_policy` in `MigrationPolicy` (§22.2.1.6). The eviction target is chosen by LRU (default) or learned policy. After eviction frees sufficient VRAM, the allocation is retried.
- Cgroup charge on eviction: When VRAM pages are evicted to CPU RAM, the evicted pages are charged to the owning process's memory cgroup (Section 17.2). If the cgroup memory limit would be exceeded by the charge, the eviction blocks until the cgroup's reclaim path frees sufficient CPU memory (page cache eviction, swap-out). If reclaim fails and the cgroup is at its hard limit, eviction returns `IO_ERR_NOMEM` — the VRAM allocation that triggered eviction also fails, and the allocating process receives an error. This prevents GPU workloads from silently consuming unbounded CPU RAM during VRAM pressure.
- System-level OOM (last resort): If CPU RAM is also exhausted (the evicted VRAM pages cannot be placed in system RAM), the standard OOM killer is invoked. The OOM killer sees the GPU-using process's RSS (which includes evicted VRAM pages now in CPU RAM) and may kill it. VRAM usage is reported in `/proc/PID/status` as `VmAccel` (new field) for OOM scoring visibility.
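The two-level pressure response amounts to a retry loop around the driver's allocator with an eviction fallback. A simplified standalone model — names such as `VramPool` and `alloc_with_eviction` are illustrative, not the spec's KABI symbols:

```rust
#[derive(Debug, PartialEq)]
enum AllocError {
    VramExhausted, // nothing left to evict
    CgroupLimit,   // eviction would overflow the cgroup's CPU-memory limit
}

/// Minimal model of a VRAM pool with cold, evictable pages.
struct VramPool {
    free: u64,      // bytes of free VRAM
    evictable: u64, // bytes of cold pages that can move to CPU RAM
}

impl VramPool {
    /// Model of the device-level response: on exhaustion, evict cold
    /// pages to CPU RAM (charging the cgroup), then satisfy the request.
    fn alloc_with_eviction(
        &mut self,
        size: u64,
        cgroup_headroom: u64, // CPU RAM the cgroup may still be charged
    ) -> Result<(), AllocError> {
        if self.free >= size {
            self.free -= size;
            return Ok(());
        }
        let needed = size - self.free;
        if needed > self.evictable {
            return Err(AllocError::VramExhausted);
        }
        if needed > cgroup_headroom {
            // Mirrors the spec: the triggering allocation fails with
            // IO_ERR_NOMEM instead of silently consuming CPU RAM.
            return Err(AllocError::CgroupLimit);
        }
        self.evictable -= needed; // cold pages moved to CPU RAM
        self.free += needed;
        self.free -= size;
        Ok(())
    }
}
```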
Accounting: Per-context VRAM usage is tracked by the kernel via the cgroup
memory controller (Section 19.1). The
memory.accel_usage cgroup file reports aggregate VRAM bytes across all
accelerator contexts in the cgroup. The memory.accel_max limit is enforced:
alloc_device_memory() fails with IO_ERR_NOMEM if the cgroup limit would
be exceeded, even if physical VRAM is available.
22.4.1.12 Buffer Sharing (dma-buf Export/Import)¶
Accelerator memory handles (AccelMemHandle) are context-local by default. To share
a buffer between devices (GPU → display, GPU → NVMe, GPU → GPU), the handle must be
exported as a dma-buf — a kernel-internal buffer-sharing primitive with reference
counting, device-independent scatter-gather list, and implicit synchronization.
// Appended to AccelBaseVTable
/// Export device memory as a dma-buf file descriptor.
/// The returned fd can be passed to another device's import function
/// (or to userspace via DRM PRIME / Vulkan external memory).
///
/// The driver must:
/// 1. Pin the VRAM pages (prevent eviction while exported).
/// 2. Provide a dma_buf_ops implementation with map/unmap callbacks.
/// 3. Maintain a refcount — the VRAM is freed only when ALL importers
/// have released their attachments AND the exporter calls free.
pub export_dma_buf: Option<unsafe extern "C" fn(
ctx: *mut c_void,
handle: AccelMemHandle,
out_dma_buf_fd: *mut i32,
) -> IoResultCode>,
/// Import a dma-buf and map it into device-accessible memory.
/// Returns a device-local handle that can be used in command submissions.
///
/// The driver must:
/// 1. Attach to the dma-buf (get the scatter-gather list).
/// 2. Map the SG list through the device's IOMMU domain.
/// 3. Return a handle usable in subsequent submit_command() calls.
pub import_dma_buf: Option<unsafe extern "C" fn(
ctx: *mut c_void,
dma_buf_fd: i32,
out_handle: *mut AccelMemHandle,
) -> IoResultCode>,
dma-buf lifecycle and reference counting:
Exporter (GPU 0)                         Importer (GPU 1 or display)
alloc_device_memory() → handle
export_dma_buf(handle) → fd              import_dma_buf(fd) → imported_handle
[VRAM pinned, refcount=2]                [SG mapped through device IOMMU]
...                                      submit_command(imported_handle)
free_device_memory(handle)
[refcount=1, VRAM kept alive]            close(fd) or free_device_memory(imported_handle)
                                         [refcount=0, VRAM freed]
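The lifecycle above is plain reference counting: export pins the VRAM and hands out the first reference, each importer holds one more, and the backing memory is released only at refcount zero. A toy userspace model of that invariant using `Rc` (illustrative names, not the KABI):

```rust
use std::rc::Rc;

/// Toy stand-in for pinned VRAM backing an exported dma-buf.
pub struct PinnedVram {
    pub bytes: u64,
}

/// Export: the driver pins the VRAM and hands out the first reference.
pub fn export(bytes: u64) -> Rc<PinnedVram> {
    Rc::new(PinnedVram { bytes })
}

/// Import: each importer takes an additional reference; the backing
/// VRAM cannot be freed while any importer still holds one.
pub fn import(buf: &Rc<PinnedVram>) -> Rc<PinnedVram> {
    Rc::clone(buf)
}
```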
Synchronization: dma-buf carries implicit fences (DmaFence). When GPU 0
writes to a buffer and GPU 1 reads it, the dma-buf framework ensures GPU 1's
read waits for GPU 0's write fence to signal. This is equivalent to Linux's
dma_fence / dma_resv (reservation object) mechanism:
The canonical DmaBuf, DmaBufOps, and DmaBufAttachment types are defined in
Section 4.14. Accelerator drivers use these types directly
— the DMA-BUF framework is subsystem-agnostic and shared across GPU, NPU, camera,
display, and storage drivers.
Revocation on device failure: If the exporting device crashes or is reset
(§22.1.3.2), all exported dma-bufs from that device are invalidated. The kernel:
1. Iterates the device's exported dma-buf list.
2. Sets a REVOKED flag on each dma-buf.
3. Signals all pending fences as errors.
4. Subsequent map_dma_buf() calls from importers return -ENODEV.
5. Importers are notified via the AccelBaseVTable::notify_event() callback
with event type DMA_BUF_REVOKED.
Use cases:
- GPU → display: GPU renders frame to AccelMemHandle, exports as dma-buf, display controller imports it for scanout.
- GPU → NVMe: GPUDirect Storage — GPU allocates buffer, exports as dma-buf, NVMe controller imports and DMAs training data directly to GPU VRAM.
- GPU → GPU: Multi-GPU rendering — one GPU exports, another imports for P2P access (alternative to the P2P DMA path in §22.2.2 when the buffer needs to persist across multiple command submissions).
Cross-references:
- DmaDevice and IOMMU domain: Section 4.14
- P2P DMA: §22.2.2
- AccelBase KABI vtable: §22.1.1
- Linux DRM PRIME compatibility: Section 19.1 (fd passing via
DRM_IOCTL_PRIME_HANDLE_TO_FD / DRM_IOCTL_PRIME_FD_TO_HANDLE)
22.4.2 Peer-to-Peer DMA¶
22.4.2.1 Problem¶
AI workloads need data to flow between devices without CPU involvement:
- Storage → GPU: Load training data directly from NVMe to GPU VRAM (GPUDirect Storage)
- Network → GPU: Receive model weights from remote node directly to GPU (RDMA + GPUDirect)
- GPU → GPU: Gradient exchange between GPUs in multi-GPU training (NVLink, xGMI, PCIe P2P)
- GPU → Network: Send inference results directly from GPU to network (bypassing CPU copy)
Linux supports these as vendor-specific driver features. There is no general kernel mechanism.
22.4.2.2 Design: Generalized P2P DMA¶
P2P DMA transfers use the DMA subsystem's DmaSgl and DmaScatterEntry types (Section 4.14). The accelerator framework extends these with cross-device semantics but does not define parallel scatter-gather types.
Scope: P2P DMA is currently specified for accelerator-class devices only: GPU-to-GPU transfers (NVLink, PCIe), NVMe-to-GPU (GPUDirect Storage), and GPU-to-RDMA NIC (GPUDirect RDMA). General NIC-to-NIC P2P DMA is deferred — standard NIC receive/transmit paths use the regular DMA subsystem (Section 4.14). NIC P2P may be added in a future phase if hardware support (e.g., CXL-attached NICs) makes it practical.
Extend the KABI with a general mechanism for device-to-device DMA transfers:
AccelMemHandle resolution: The AccelMemHandle passed to p2p_dma_map is an
opaque driver-allocated handle. The kernel must resolve it to a physical address
before programming the IOMMU. This is done via the resolve_mem_handle callback
in AccelBaseVTable:
// Appended to AccelBaseVTable (§22.1.1 AccelBaseVTable)
/// Resolve an opaque AccelMemHandle to a physical address and size.
///
/// Called by the kernel during `p2p_dma_map()` to obtain the target device's
/// physical BAR address (or device-local physical address) for IOMMU programming.
/// The driver translates its internal handle representation to the physical
/// backing address.
///
/// # Safety
///
/// * `handle` must be a valid AccelMemHandle allocated by this driver via
/// `alloc_device_memory()` and not yet freed.
/// * `out_phys` and `out_size` must be valid, aligned pointers.
///
/// # Returns
///
/// * `IO_OK` on success. `*out_phys` is set to the physical base address of the
/// allocation (PCIe BAR offset for discrete GPUs, physical DRAM address for
/// CXL-attached accelerators). `*out_size` is set to the total allocation size.
/// * `IO_ERR_INVAL` if the handle is not recognized by this driver instance.
/// * `IO_ERR_FAULT` if the backing memory has been migrated or evicted and is not
/// currently resident on-device (caller should retry after migration completes).
pub resolve_mem_handle: unsafe extern "C" fn(
ctx: *mut c_void,
handle: AccelMemHandle,
out_phys: *mut PhysAddr,
out_size: *mut u64,
) -> IoResultCode,
p2p_dma_map resolution flow: When the kernel processes a p2p_dma_map call:
1. Call `dst_device.accel_vtable.resolve_mem_handle(dst_mem)` to obtain `(phys_addr, total_size)`.
2. Validate `dst_offset + size <= total_size`. Return `IO_ERR_INVAL` if out of bounds.
3. Compute the target physical range `[phys_addr + dst_offset, phys_addr + dst_offset + size)`.
4. Program the IOMMU to map this range into `src_device`'s IOMMU domain, producing an IOVA.
5. Return the IOVA via `out_iova` and a tracking handle via `out_mapping`.
For non-accelerator devices (e.g., NVMe in GPUDirect Storage), the kernel uses the
device registry's BAR information directly — resolve_mem_handle is only called when
dst_device is an accelerator with opaque memory handles.
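Step 2's bounds check deserves care: `dst_offset + size` can wrap for attacker-controlled 64-bit values, so the validation should use checked arithmetic. A hedged sketch — the error names mirror the spec's `IO_ERR_INVAL`, but the helper itself is illustrative:

```rust
#[derive(Debug, PartialEq)]
pub enum IoResultCode {
    IoOk,
    IoErrInval,
}

/// Validate that `[dst_offset, dst_offset + size)` lies within an
/// allocation of `total_size` bytes, rejecting u64 overflow.
pub fn validate_p2p_range(dst_offset: u64, size: u64, total_size: u64) -> IoResultCode {
    match dst_offset.checked_add(size) {
        Some(end) if end <= total_size => IoResultCode::IoOk,
        _ => IoResultCode::IoErrInval, // out of bounds, or offset+size wrapped
    }
}
```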
// Appended to KernelServicesVTable
/// Set up a peer-to-peer DMA mapping between two devices.
/// The kernel programs the IOMMU to allow device A to DMA to device B's
/// memory region. For accelerator targets, the kernel calls
/// `AccelBaseVTable::resolve_mem_handle()` to translate the opaque
/// `AccelMemHandle` to a physical address before IOMMU programming.
pub p2p_dma_map: Option<unsafe extern "C" fn(
src_device: DeviceHandle,
dst_device: DeviceHandle,
dst_mem: AccelMemHandle, // Or DmaBufferHandle for non-accel devices
dst_offset: u64,
size: u64,
out_iova: *mut u64, // IOVA that src_device can use to reach dst memory
out_mapping: *mut P2pMappingHandle,
) -> IoResultCode>,
/// Tear down a P2P DMA mapping.
pub p2p_dma_unmap: Option<unsafe extern "C" fn(
mapping: P2pMappingHandle,
) -> IoResultCode>,
/// Initiate a DMA transfer between two devices.
/// The kernel coordinates between the two drivers.
pub p2p_dma_transfer: Option<unsafe extern "C" fn(
mapping: P2pMappingHandle,
src_offset: u64,
dst_offset: u64,
size: u64,
out_fence: *mut DmaFence,
) -> IoResultCode>,
22.4.2.3 IOMMU Integration¶
P2P DMA requires careful IOMMU programming:
Without P2P:
Device A → IOMMU → CPU RAM ← IOMMU ← Device B
(two DMA transfers, CPU RAM as bounce buffer)
With P2P:
Device A → IOMMU → Device B's BAR / memory
(one DMA transfer, no CPU involvement)
IOMMU domain type: P2P DMA uses IommuDomainType::Kernel domains
(Section 11.5). No separate P2P-specific domain type
exists. When both devices share an IOMMU group, the kernel maps the target device's
BAR region into the source device's existing kernel domain. When the devices are in
different IOMMU groups, the kernel creates a cross-device mapping in the source
device's domain (requires IOMMU hardware support for inter-group translation).
The kernel validates:
1. Both devices support P2P (PCIe ACS, correct topology)
2. IOMMU can map across devices (same IOMMU domain or IOMMU supports cross-device mapping)
3. The requesting driver has the P2P_DMA capability
4. The target memory region is within the bounds of the P2P mapping
If hardware P2P is not possible (devices behind different root complexes without P2P bridge support), the kernel falls back to CPU-mediated copy transparently. The driver API is the same either way.
22.4.2.4 Topology-Aware Placement¶
The device registry (Section 11.4) provides topology information. The kernel uses this for P2P path selection:
PCIe Topology:
Root Complex 0
+-- Root Port 0
| +-- GPU 0 (VRAM: 24GB)
| +-- NVMe 0
+-- Root Port 1
| +-- GPU 1 (VRAM: 24GB)
| +-- NIC (100GbE)
P2P capability matrix:
GPU 0 ↔ NVMe 0: Direct P2P (same root port) — optimal
GPU 0 ↔ GPU 1: P2P via root complex — good
GPU 0 ↔ NIC: P2P via root complex — good
GPU 1 ↔ NVMe 0: P2P via root complex — good
The scheduler uses this topology when placing workloads: a training job that loads data from NVMe 0 is preferentially scheduled on GPU 0 (same root port = best P2P path).
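The capability matrix above reduces to ranking candidate device pairs by their lowest common PCIe ancestor. A simplified scorer, assuming a registry-style `(root_complex, root_port)` tuple per device — the types and helpers here are illustrative, not the actual device-registry API:

```rust
/// Minimal topology coordinates for a PCIe endpoint.
#[derive(Clone, Copy)]
pub struct PcieLoc {
    pub root_complex: u8,
    pub root_port: u8,
}

/// Relative P2P path quality: lower is better.
pub fn p2p_cost(a: PcieLoc, b: PcieLoc) -> u32 {
    if a.root_complex != b.root_complex {
        2 // cross-root-complex: may require CPU-mediated fallback
    } else if a.root_port != b.root_port {
        1 // same root complex, different root port: P2P via RC — good
    } else {
        0 // same root port: direct P2P — optimal
    }
}

/// Pick the GPU with the best P2P path to a given device (e.g., NVMe 0).
pub fn best_gpu(gpus: &[(usize, PcieLoc)], target: PcieLoc) -> usize {
    gpus.iter()
        .min_by_key(|(_, loc)| p2p_cost(*loc, target))
        .map(|(id, _)| *id)
        .expect("at least one GPU registered")
}
```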
22.4.2.5 P2P Access Control¶
Peer-to-peer DMA allows devices to directly access each other's memory without CPU involvement. Without proper access control, a malicious or compromised device could read or corrupt another device's memory. This section specifies the mandatory access control mechanisms for all P2P DMA operations.
Authorization Model
Every P2P DMA mapping requires explicit authorization from both the source and target device owners:
P2P DMA Authorization Flow:
1. Client (process or driver) requests P2P mapping: src_device → dst_device memory
2. Kernel checks:
a. Client holds ACCEL_P2P capability (0x0102) for BOTH devices
b. Both devices are in the client's allowed device set (cgroup accel.devices)
c. Target memory region is within the client's allocated AccelMemHandle
d. Source device's cgroup has not exceeded accel.p2p.max limit (if set)
3. Kernel programs IOMMU with P2P mapping (never direct device-to-device without IOMMU)
4. P2pMappingHandle is bound to the requesting context — not transferable
5. Mapping is recorded in both devices' P2P ACL tables
Capability-Based Device ACLs
Each device maintains a P2P Access Control List (ACL) that specifies which other devices are authorized to initiate P2P DMA to its memory:
/// P2P ACL entry — one per authorized (src_device, dst_device) pair.
/// Stored in the device registry node for the target device.
// kernel-internal, not KABI — device ACL state, not exposed to userspace.
#[repr(C)]
pub struct P2pAclEntry {
/// Source device that is authorized to initiate P2P DMA.
pub src_device_id: DeviceNodeId,
/// Capability token that authorized this ACL entry.
/// Must be valid for the ACL entry to be valid.
pub authorizing_cap: CapabilityToken,
/// Maximum bytes that can be transferred via this ACL entry.
/// Optional: 0 means unlimited (subject to cgroup limits).
pub max_bytes: u64,
/// Bytes transferred so far (reset on ACL refresh or revocation).
pub bytes_transferred: AtomicU64,
/// Creation timestamp (for auditing and expiration).
pub created_ns: u64,
/// Expiration timestamp (0 = no expiration).
pub expires_ns: u64,
}
/// Maximum P2P ACL entries per device.
///
/// This limit bounds memory usage for ACL tables and ensures O(1) or O(log n)
/// lookup latency. The value 1024 is chosen to support:
/// - 32 GPUs in a single server × 31 peer devices = 992 entries (full mesh)
/// - Headroom for future expansion or per-cgroup ACLs
///
/// For systems requiring more than 1024 entries per device, use cgroup-based
/// ACL partitioning (each cgroup has its own ACL table).
pub const MAX_P2P_ACL_ENTRIES_PER_DEVICE: usize = 1024;
/// Maximum total P2P ACL entries system-wide.
///
/// This limit prevents unbounded memory growth in systems with many devices.
/// With 100K entries and 64 bytes per entry, total ACL memory is ~6.4MB.
pub const MAX_P2P_ACL_ENTRIES_SYSTEM: usize = 100_000;
P2P ACL lookup algorithm:
The P2P ACL is implemented as a fixed-capacity hash table with open addressing:
// umka-core/src/accel/p2p_acl.rs (kernel-internal)
/// P2P ACL table for a single device.
///
/// **Concurrency**: Both reads and writes acquire the `entries` SpinLock.
/// P2P ACL lookups occur on DMA mapping setup (warm path), so lock
/// contention is acceptable. Writes (add/remove) are serialized by the
/// same lock.
///
/// **Deletion**: Removed entries become tombstones (already-expired
/// entries) rather than empty slots — clearing a slot would break the
/// linear-probe chain for colliding keys inserted after it. The periodic
/// ACL sweep (every 10 seconds) compacts tombstones back into empty slots.
pub struct P2pAclTable {
    /// Hash table entries (CAPACITY = power of 2, e.g., 1024).
    /// Uses linear probing for collision resolution.
    /// Protected by the SpinLock for both reads and writes.
    entries: SpinLock<[Option<P2pAclEntry>; CAPACITY]>,
    /// Number of live entries (for monitoring).
    len: AtomicUsize,
}
/// An entry is dead if it has expired or been tombstoned by `remove()`.
fn is_dead(entry: &P2pAclEntry) -> bool {
    entry.expires_ns != 0 && entry.expires_ns < get_time_ns()
}
impl P2pAclTable {
    /// Look up a live ACL entry by (src_device, authorizing_cap) and apply
    /// `f` to it while the table lock is held. The closure-based API avoids
    /// returning a reference that would outlive the lock guard.
    /// Time complexity: O(1) average case with a good hash function.
    pub fn lookup<R>(
        &self,
        src_device: DeviceNodeId,
        cap: &CapabilityToken,
        f: impl FnOnce(&P2pAclEntry) -> R,
    ) -> Option<R> {
        let entries = self.entries.lock(); // lock once for the whole probe
        let mask = CAPACITY - 1; // CAPACITY is a power of 2
        let mut idx = p2p_acl_hash(src_device, cap) as usize & mask;
        for _ in 0..CAPACITY {
            match entries[idx] {
                Some(ref entry)
                    if entry.src_device_id == src_device
                        && entry.authorizing_cap == *cap
                        && !is_dead(entry) =>
                {
                    return Some(f(entry));
                }
                None => return None, // Empty slot = not found
                _ => {}              // Collision, tombstone, or expired: keep probing
            }
            idx = (idx + 1) & mask; // Linear probing
        }
        None // Table full and entry not present
    }
    /// Insert a new ACL entry, reusing tombstoned/expired slots.
    /// Returns `Err(())` if the table is full (no allocation on error).
    /// Time complexity: O(1) average case.
    pub fn insert(&self, entry: P2pAclEntry) -> Result<(), ()> {
        if self.len.load(Ordering::Relaxed) >= MAX_P2P_ACL_ENTRIES_PER_DEVICE {
            return Err(()); // Table full
        }
        let mask = CAPACITY - 1;
        let mut idx =
            p2p_acl_hash(entry.src_device_id, &entry.authorizing_cap) as usize & mask;
        let mut entries = self.entries.lock();
        for _ in 0..CAPACITY {
            let reusable = match entries[idx] {
                None => true,
                Some(ref e) => is_dead(e), // reclaim expired/tombstoned slot
            };
            if reusable {
                entries[idx] = Some(entry);
                self.len.fetch_add(1, Ordering::Relaxed);
                return Ok(());
            }
            idx = (idx + 1) & mask;
        }
        Err(()) // Table full (no reusable slot found)
    }
    /// Remove an ACL entry by (src_device, authorizing_cap).
    /// Returns `true` if a live entry was found and tombstoned.
    /// Time complexity: O(1) average case.
    pub fn remove(&self, src_device: DeviceNodeId, cap: &CapabilityToken) -> bool {
        let mask = CAPACITY - 1;
        let mut idx = p2p_acl_hash(src_device, cap) as usize & mask;
        let mut entries = self.entries.lock();
        for _ in 0..CAPACITY {
            match entries[idx] {
                Some(ref mut entry)
                    if entry.src_device_id == src_device
                        && entry.authorizing_cap == *cap =>
                {
                    if is_dead(entry) {
                        return false; // Already removed or expired
                    }
                    // Tombstone instead of emptying the slot, preserving
                    // the probe chain for keys hashed past this index.
                    entry.expires_ns = 1;
                    self.len.fetch_sub(1, Ordering::Relaxed);
                    return true;
                }
                None => return false, // Empty slot = not found
                _ => {}               // Collision, continue probing
            }
            idx = (idx + 1) & mask;
        }
        false // Not found
    }
}
/// Hash function for P2P ACL lookups.
/// Combines src_device and capability token into a 64-bit hash.
fn p2p_acl_hash(src_device: DeviceNodeId, cap: &CapabilityToken) -> u64 {
use umka_core::util::hash::fast_hash64;
fast_hash64(&[
src_device.0 as u64,
cap.low as u64,
cap.high as u64,
])
}
Memory layout and scaling:
- Each `P2pAclEntry` is 64 bytes (1 cache line).
- Per-device ACL table: 1024 entries × 64 bytes = 64 KB.
- System-wide ACL memory (32 GPUs): 32 × 64 KB = 2 MB.
- For systems with 100 cgroups, each cgroup can have its own ACL table (64 KB each), totaling ~6.4 MB.
Lookup complexity:
- Average case: O(1) with linear probing and load factor < 0.7.
- Worst case (pathological hash collisions): O(n). This is mitigated by the hash function and by sizing the table with enough headroom to keep the load factor low, since the table is never resized.
- The hash table is NOT dynamically resized (to avoid allocation during lookup). Instead, the fixed capacity is chosen to be large enough for all expected use cases.
ACL entry expiration:
- Entries with `expires_ns != 0` are automatically considered expired after the deadline.
- Expired entries are lazily removed on next lookup or insert (tombstone behavior).
- A periodic ACL sweep (every 10 seconds) removes expired entries to prevent table pollution.
The ACL is consulted on every `p2p_dma_map` call. The mapping is denied if:
- No ACL entry exists for (src_device, dst_device)
- The authorizing capability has been revoked
- The byte limit has been exceeded
- The ACL entry has expired

**Anti-replay property**: The `ACCEL_P2P` capability token (0x0102) in `authorizing_cap` is bound to the specific (src_device, dst_device) pair at ACL creation time. A process holding `ACCEL_P2P` for devices A and B cannot replay the same capability token to authorize P2P between devices A and C — step 2a of the authorization flow (above) requires the client to hold `ACCEL_P2P` for **both** the source and target device, and the resulting ACL entry records the specific pair. If the `authorizing_cap` is revoked (e.g., the process loses access to one device via cgroup migration), all ACL entries referencing that token are invalidated at lookup time (lazy revocation), and the corresponding IOMMU P2P mappings are torn down on the next `p2p_dma_map` denial or periodic ACL sweep (every 10 seconds).

**IOMMU Mediation**

All P2P DMA transactions are mediated by the IOMMU. Direct device-to-device DMA without IOMMU translation is **never permitted**:

Security-critical design decision:
    P2P DMA path:  Device A → IOMMU → Device B's memory
    NOT:           Device A → Device B (direct PCIe P2P without IOMMU)
Rationale:
1. IOMMU provides address translation and access validation on every transaction
2. IOMMU can revoke access instantly by invalidating the IOVA mapping
3. IOMMU provides fault isolation if a device goes rogue
4. IOMMU enables per-transaction logging for security auditing
The IOMMU is programmed with per-device IOVA page tables. A P2P mapping creates an
IOVA entry in the source device's page table that translates to the target device's
physical BAR address. When the mapping is revoked, the IOVA entry is invalidated.
**Revocation Protocol**
P2P access must be revoked when:
- The authorizing capability is revoked (process exit, cgroup removal)
- A device is removed from the system (hot-unplug)
- A device is quarantined due to error or security event ([Section 22.3](#accelerator-vtables-and-integration--crash-recovery))
- The cgroup limit is exceeded
- An explicit unmap request is made
Revocation is synchronous and blocking.
**In-Flight Transaction Handling**
The critical challenge is handling P2P DMA transactions that are in-flight when
revocation is requested. The kernel uses a quiesce-then-revoke protocol:
/// P2P mapping state machine.
#[repr(u8)]
pub enum P2pMappingState {
/// Mapping is active and can be used for new transfers.
Active = 0,
/// Mapping is being revoked. No new transfers allowed.
/// In-flight transactions are completing.
Revoking = 1,
/// Mapping is fully revoked. IOVA is invalid.
Revoked = 2,
}
/// In-flight transaction tracking.
/// Each P2P mapping tracks pending transfers.
pub struct P2pMapping {
pub handle: P2pMappingHandle,
pub src_device: DeviceNodeId,
pub dst_device: DeviceNodeId,
pub state: AtomicU8, // P2pMappingState
pub iova_base: u64,
pub iova_size: u64,
/// Number of in-flight transfers using this mapping.
/// Incremented before p2p_dma_transfer, decremented after fence completion.
pub in_flight_count: AtomicU32,
/// Wait queue for revocation to wait on in-flight completion.
pub revoke_wait_queue: WaitQueue,
}
Revocation waits for in-flight transactions:
In-Flight Handling During Revocation:
1. Set state to REVOKING (atomic store)
2. New p2p_dma_transfer calls fail with EACCES (capability revoked)
3. Wait for in_flight_count to reach zero:
- Each transfer increments in_flight_count before submission
- Each transfer decrements after fence signals completion
- Timeout: 100ms for local P2P, 1s for cross-node RDMA
4. If timeout expires (in_flight_count still > 0):
a. **Escalation Level 1: Device quiesce via preempt_context**
- Call `driver.preempt_context(context, PreemptReason::P2pRevocation)`
- Wait up to 10ms for preemption to complete (via fence poll)
- Re-check in_flight_count
b. **Escalation Level 2: Force IOMMU TLB invalidation**
- Issue IOMMU TLB invalidation for the IOVA range
- Wait for IOMMU invalidation completion (typically <100μs)
- The IOMMU blocks new translations; in-flight DMA may complete or fault
c. **Escalation Level 3: Device reset via PCIe FLR**
- If in_flight_count remains > 0 after 100ms of forced invalidation:
- Call `driver.device_reset(context)` to perform a context-level reset
- If device_reset returns error or times out: escalate to full device reset
- Perform PCIe Function Level Reset (FLR) on the source device
- Wait for FLR completion (typically 100-500ms)
d. Log the escalation level reached:
- Level 1 success: "P2P revocation completed via cooperative quiesce"
- Level 2 success: "P2P revocation completed via forced TLB invalidation"
- Level 3 success: "P2P revocation completed via device reset"
- Level 3 failure: "P2P revocation FAILED — device unresponsive, marking faulted"
5. If escalation succeeded (in_flight_count == 0 or device reset completed):
- Set state to REVOKED
- Remove mapping from ACL tables
- Free P2pMappingHandle
- Wake any waiters on revoke_wait_queue
6. If escalation failed (device unresponsive after FLR):
- Set state to REVOKED (forcefully, even if in_flight_count > 0)
- Mark device as Faulted in FMA ([Section 22.3](#accelerator-vtables-and-integration--fma-integration))
- Trigger crash recovery path ([Section 22.3](#accelerator-vtables-and-integration--crash-recovery))
- Remove mapping from ACL tables (best effort)
- Free P2pMappingHandle
Rationale for escalation ladder:
- Level 1 (preempt_context): Cooperative quiesce is fastest (~1-10ms) and least disruptive. Works on well-behaved devices that respect preemption requests.
- Level 2 (TLB invalidation): Forces IOMMU to block new translations. In-flight DMAs may complete with data or trigger device errors. Fast (~100μs) but may cause data corruption if DMA was mid-transfer.
- Level 3 (device reset): Nuclear option. Guarantees device quiesce by resetting all internal state. Expensive (100ms-5s) but necessary for unresponsive devices. May affect other contexts on the same device.
Data corruption handling:
If Level 2 or Level 3 escalation is used, the target memory may contain partial or corrupted data. The kernel does NOT attempt to repair the data — that is the responsibility of the application (which should use checksums or other integrity verification for critical data). The kernel's job is to ensure system stability and prevent the hung mapping from blocking revocation indefinitely.
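The core of the quiesce-then-revoke protocol — the `Active → Revoking → Revoked` gate over `in_flight_count` — can be sketched as a small userspace model. This is an illustrative sketch only (the real kernel code also sleeps on `revoke_wait_queue` and runs the escalation ladder); the `MappingGate` type and its methods are invented here:

```rust
use std::sync::atomic::{AtomicU32, AtomicU8, Ordering};

const ACTIVE: u8 = 0;
const REVOKING: u8 = 1;
const REVOKED: u8 = 2;

/// Minimal model of the quiesce-then-revoke gate (hypothetical type;
/// mirrors P2pMapping's state/in_flight_count fields).
pub struct MappingGate {
    state: AtomicU8,
    in_flight: AtomicU32,
}

impl MappingGate {
    pub fn new() -> Self {
        MappingGate {
            state: AtomicU8::new(ACTIVE),
            in_flight: AtomicU32::new(0),
        }
    }

    /// Protocol step 2: new transfers fail once state != ACTIVE.
    pub fn begin_transfer(&self) -> Result<(), &'static str> {
        if self.state.load(Ordering::Acquire) != ACTIVE {
            return Err("EACCES: mapping is being revoked");
        }
        self.in_flight.fetch_add(1, Ordering::AcqRel);
        Ok(())
    }

    /// Called when the completion fence signals (protocol step 3).
    pub fn end_transfer(&self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }

    /// Steps 1, 3, 5: flip to REVOKING, then REVOKED once drained.
    /// Returns false while in-flight work remains (the caller would
    /// then wait with timeout and escalate per the ladder above).
    pub fn try_revoke(&self) -> bool {
        self.state.store(REVOKING, Ordering::Release);
        if self.in_flight.load(Ordering::Acquire) == 0 {
            self.state.store(REVOKED, Ordering::Release);
            true
        } else {
            false
        }
    }
}
```

The key property the sketch demonstrates: once revocation begins, no new transfer can sneak in, so `in_flight_count` monotonically drains to zero (or the timeout fires and escalation takes over).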
**P2P Memory Accounting**
All P2P DMA memory usage is accounted to the originating cgroup:
```
/sys/fs/cgroup/<group>/accel.p2p.current
#   Current bytes transferred via P2P DMA (read-only)
#   Sum across all devices this cgroup can access

/sys/fs/cgroup/<group>/accel.p2p.max
#   Maximum bytes that can be transferred via P2P in the current period
#   Format: "bytes period_us"
#   Example: "10737418240 1000000" (10GB per second)
#   Default: "max" (unlimited)

/sys/fs/cgroup/<group>/accel.p2p.stat
#   P2P statistics (read-only):
#   total_bytes <cumulative bytes transferred>
#   mappings <current active P2P mappings>
#   revocations <times P2P was revoked>
#   timeout_revocations <revocations that timed out waiting for in-flight>
```
The byte counter is incremented on each p2p_dma_transfer completion (fence signaled).
If a transfer fails or is aborted, the bytes are not counted.
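The per-period ceiling encoded by `accel.p2p.max` ("bytes period_us") amounts to a windowed byte quota charged on fence completion. A minimal sketch of that accounting — the `P2pQuota` type and its `charge` method are invented names, not the kernel's:

```rust
/// Windowed byte quota modeling accel.p2p.max ("bytes period_us").
/// Hypothetical type for illustration.
pub struct P2pQuota {
    max_bytes: u64,       // bytes allowed per period (u64::MAX == "max")
    period_us: u64,       // accounting window length
    window_start_us: u64, // start of the current window
    used_bytes: u64,      // bytes charged in the current window
}

impl P2pQuota {
    pub fn new(max_bytes: u64, period_us: u64) -> Self {
        P2pQuota { max_bytes, period_us, window_start_us: 0, used_bytes: 0 }
    }

    /// Charge `bytes` at time `now_us`. Err means the transfer must wait
    /// for the next period. Per the text, bytes are counted only on
    /// fence-signaled completion; failed transfers are never charged.
    pub fn charge(&mut self, now_us: u64, bytes: u64) -> Result<(), ()> {
        if now_us - self.window_start_us >= self.period_us {
            // New period: reset the counter.
            self.window_start_us = now_us;
            self.used_bytes = 0;
        }
        if self.used_bytes.saturating_add(bytes) > self.max_bytes {
            return Err(());
        }
        self.used_bytes += bytes;
        Ok(())
    }
}
```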
**Device Quarantine and P2P**
When a device is quarantined (Section 22.3), all P2P mappings involving that device are immediately revoked using the protocol above. This prevents a quarantined device from:

- Initiating new P2P DMA to other devices
- Receiving P2P DMA from other devices
- Corrupting or exfiltrating data via P2P paths
The quarantine process blocks until all P2P revocations complete. If any revocation times out (in-flight transactions do not complete), the device is force-reset via PCIe FLR or driver-specific reset, which guarantees termination of all DMA activity.
**Security Audit Trail**
All P2P authorization and revocation events are logged to the kernel audit subsystem:
Audit Events (logged to /sys/kernel/debug/umka/audit.log):

```
P2P_MAP:        src_device=<id> dst_device=<id> size=<bytes> cap=<token> cgroup=<path>
P2P_UNMAP:      mapping=<handle> reason=<explicit|cap_revoke|cgroup_exit|quarantine>
P2P_REVOKE:     mapping=<handle> src_device=<id> dst_device=<id> in_flight=<count> timed_out=<bool>
P2P_ACL_ADD:    src_device=<id> dst_device=<id> cap=<token> max_bytes=<bytes>
P2P_ACL_REMOVE: src_device=<id> dst_device=<id> reason=<cap_revoke|expired|quarantine>
```
These audit events are available for security monitoring and forensics.
22.4.3 GPUDirect Storage (GDS)¶
22.4.3.1 Overview¶
GPUDirect Storage enables direct DMA between NVMe storage and GPU memory, bypassing the CPU bounce buffer and system memory entirely. This eliminates one full memory copy and frees CPU cycles for application work.
Traditional I/O path for GPU workloads:

```
NVMe → System RAM (page cache) → GPU memory
       ^^^^^^^^^^^^^^^^^^^^^^^
Two copies, CPU orchestrates both DMA transfers.
Bandwidth limited by system RAM and PCIe round-trips.
```

GDS path:

```
NVMe → GPU memory
^^^^^^^^^^^^^^^^^
One DMA, NVMe controller writes directly to GPU BAR address.
CPU only programs the NVMe submission queue entry.
```
Use cases requiring GDS:

- **AI/ML training**: Loading multi-terabyte datasets from NVMe arrays into GPU HBM. Without GDS, the CPU becomes the bottleneck copying data through system RAM.
- **HPC I/O**: Checkpoint/restart where GPU state is written directly to storage.
- **Real-time video processing**: Ingesting raw frames from NVMe into GPU decode pipelines.
NVIDIA's cuFile API (libcufile.so) is the primary userspace interface. UmkaOS provides
the kernel-side GDS driver that cuFile calls into via ioctl on /dev/nvidia-fs.
22.4.3.2 GDS Architecture¶
```rust
/// GDS file handle — represents a file opened for GPU-direct I/O.
///
/// Created when userspace calls cuFileHandleRegister(). Associates a VFS file
/// with a specific GPU device so that subsequent read/write operations can
/// program NVMe to DMA directly to GPU BAR memory.
pub struct GdsFileHandle {
    /// Underlying VFS file. Must be opened with O_DIRECT to bypass page cache.
    pub file: Arc<File>,
    /// Associated GPU device for DMA. The device must expose a BAR region
    /// that is reachable from the NVMe controller via PCIe P2P.
    pub gpu_dev: Arc<dyn AccelDevice>,
    /// GPU memory mapping for this file handle. Set after cuFileBufRegister().
    pub gpu_mapping: Option<GdsGpuMapping>,
    /// Per-handle statistics for observability and cgroup accounting.
    pub stats: GdsStats,
}

/// Describes a GPU memory region registered for direct storage I/O.
pub struct GdsGpuMapping {
    /// GPU virtual address range registered for DMA.
    pub gpu_va: u64,
    /// Size of the registered region in bytes.
    pub size: u64,
    /// GPU BAR physical address for P2P DMA. This is the PCIe BAR1 aperture
    /// address that NVMe will use as the DMA target in PRP/SGL entries.
    pub bar_addr: u64,
    /// IOMMU mapping, present when P2P DMA routes through an IOMMU.
    /// The IOMMU domain must grant the NVMe device read/write access
    /// to the GPU BAR address range.
    pub iommu_domain: Option<Arc<IommuDomain>>,
}

/// Per-handle GDS statistics. All counters are u64 to satisfy the 50-year
/// uptime requirement — at 7 GB/s sustained, bytes_read wraps only after
/// ~83 years, comfortably beyond the requirement.
pub struct GdsStats {
    /// Total read operations completed.
    pub reads: AtomicU64,
    /// Total write operations completed.
    pub writes: AtomicU64,
    /// Cumulative bytes read through GDS (P2P + bounce combined).
    pub bytes_read: AtomicU64,
    /// Cumulative bytes written through GDS (P2P + bounce combined).
    pub bytes_written: AtomicU64,
    /// Reads completed via direct P2P DMA (fast path).
    pub p2p_reads: AtomicU64,
    /// Reads that fell back through system RAM (bounce path).
    pub bounce_reads: AtomicU64,
}
```
22.4.3.3 GDS Operations¶
```rust
/// Trait for GPUDirect Storage operations. Implemented by the GDS driver
/// module (Tier 1, loaded as part of the accelerator framework).
pub trait GdsOps: Send + Sync {
    /// Register a GPU memory region for direct storage I/O.
    ///
    /// Maps the GPU BAR address into the IOMMU domain so that NVMe
    /// controllers can target it. Fails if the GPU device does not
    /// expose a suitable BAR region or the IOMMU cannot map it.
    fn register_gpu_buf(
        &self,
        gpu_va: u64,
        size: u64,
        gpu_dev: &dyn AccelDevice,
    ) -> Result<GdsGpuMapping, AccelError>;

    /// Deregister a GPU memory region.
    ///
    /// Tears down the IOMMU mapping and waits for any in-flight DMA
    /// to complete before returning. Uses the same escalation ladder
    /// as P2P revocation (cooperative → TLB invalidation → device reset).
    fn deregister_gpu_buf(&self, mapping: &GdsGpuMapping) -> Result<(), AccelError>;

    /// Read from file directly into GPU memory.
    ///
    /// Uses P2P DMA when available (NVMe and GPU in same PCIe root complex,
    /// IOMMU configured). Falls back to bounce buffer (NVMe → system RAM →
    /// GPU) when P2P is not possible. Returns bytes read.
    fn pread_gpu(
        &self,
        file: &GdsFileHandle,
        gpu_offset: u64,
        size: u64,
        file_offset: u64,
    ) -> Result<u64, AccelError>;

    /// Write from GPU memory directly to file.
    ///
    /// Symmetric to pread_gpu: P2P DMA when available, bounce fallback otherwise.
    fn pwrite_gpu(
        &self,
        file: &GdsFileHandle,
        gpu_offset: u64,
        size: u64,
        file_offset: u64,
    ) -> Result<u64, AccelError>;

    /// Batch read: scatter-gather list of file regions into GPU memory.
    ///
    /// Each GdsIov entry specifies a (gpu_offset, size) pair. The file is
    /// read sequentially from file_offset, distributing data across the
    /// scatter-gather entries. This avoids per-I/O ioctl overhead for
    /// workloads that load many small tensors from a single file.
    fn preadv_gpu(
        &self,
        file: &GdsFileHandle,
        iov: &[GdsIov],
        file_offset: u64,
    ) -> Result<u64, AccelError>;
}

/// Scatter-gather entry for batch GDS operations.
pub struct GdsIov {
    /// Offset within the registered GPU buffer.
    pub gpu_offset: u64,
    /// Number of bytes to transfer for this entry.
    pub size: u64,
}
```
22.4.3.4 P2P DMA Path¶
The direct DMA path when P2P is available:
1. **Application call**: `cuFileRead()` → ioctl to the GDS kernel driver (`/dev/nvidia-fs`).
2. **Block mapping**: The GDS driver calls the filesystem (ext4, XFS, Btrfs) via `fiemap`/`iomap` to translate the file offset+length into physical block addresses on the NVMe device.
3. **NVMe command programming**: GDS constructs an NVMe Read command where the PRP (Physical Region Page) or SGL (Scatter-Gather List) entries point to the GPU BAR physical address (translated through the IOMMU if present).
4. **DMA execution**: The NVMe controller fetches data from flash and writes it directly to the GPU BAR address via PCIe. Data flows: NVMe controller → PCIe fabric → (optional IOMMU) → GPU BAR → GPU HBM.
5. **Completion**: NVMe posts a completion queue entry. The GDS driver signals the userspace completion event (cuFile uses polling or callback).
Requirements for the P2P path to be used:
- NVMe and GPU must be in the same PCIe root complex or connected via a PCIe switch. Cross-root-complex P2P is not guaranteed by the PCIe spec and fails on many platforms.
- The IOMMU must be configured to allow the NVMe device to write to the GPU BAR address range. This requires an IOMMU domain that maps the GPU BAR as DMA-writable for the NVMe device (Section 4.14).
- The GPU must expose BAR memory for P2P access. On NVIDIA GPUs, this is the BAR1 aperture. The driver must configure the BAR1 window to cover the target GPU memory region.
- The file must be opened with `O_DIRECT` to bypass the page cache. Buffered I/O requires data to pass through the page cache in system RAM, which defeats the purpose.
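The four preconditions above reduce to a single eligibility test on the submission path; when any of them fails, the driver silently takes the bounce path and bumps `bounce_reads` instead of `p2p_reads`. A hedged sketch — the `P2pTopology` struct and `can_use_p2p` function are illustrative names, not the kernel's API:

```rust
/// Preconditions for the direct NVMe→GPU path, as listed in the text.
/// Hypothetical type: real code derives these from PCIe topology,
/// the IOMMU domain, the BAR1 window, and the file's open flags.
pub struct P2pTopology {
    /// NVMe and GPU share a root complex (or a common PCIe switch).
    pub same_root_complex: bool,
    /// IOMMU domain maps the GPU BAR as DMA-writable for the NVMe device.
    pub iommu_grants_nvme_write: bool,
    /// GPU BAR1 window covers the target GPU memory region.
    pub gpu_bar_covers_target: bool,
    /// File was opened with O_DIRECT (no page cache involvement).
    pub file_opened_o_direct: bool,
}

/// True only when every precondition holds; otherwise GDS falls back
/// to the two-copy bounce path described in the next subsection.
pub fn can_use_p2p(t: &P2pTopology) -> bool {
    t.same_root_complex
        && t.iommu_grants_nvme_write
        && t.gpu_bar_covers_target
        && t.file_opened_o_direct
}
```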
22.4.3.5 Fallback Path¶
When P2P DMA is not possible (different PCIe root complex, no IOMMU support, BAR aperture too small, non-O_DIRECT file, etc.), GDS falls back to a two-copy path:
- **NVMe → System RAM**: The standard block I/O path reads data into a kernel bounce buffer allocated from the DMA zone.
- **System RAM → GPU memory**: The GDS driver initiates a DMA copy from the bounce buffer to GPU memory using the GPU's DMA engine (equivalent to a host-to-device `cudaMemcpy`, but kernel-initiated).
The fallback path is semantically identical to the P2P path — the application receives
the same completion notification and the GPU memory contains the same data. The only
difference is performance: two DMA transfers instead of one, with CPU involvement for
orchestration. The GdsStats counters (p2p_reads vs. bounce_reads) let applications
and operators detect when the fallback path is being used.
22.4.3.6 cuFile Compatibility¶
UmkaOS implements the kernel interface expected by NVIDIA's libcufile.so userspace
library. The following cuFile API calls map to kernel operations:
| cuFile API | Kernel Operation |
|---|---|
| `cuFileDriverOpen()` | `open("/dev/nvidia-fs", ...)` — opens the GDS control device |
| `cuFileHandleRegister()` | `ioctl(GDS_REGISTER_HANDLE)` — creates `GdsFileHandle` |
| `cuFileBufRegister()` | `ioctl(GDS_REGISTER_BUF)` — creates `GdsGpuMapping`, sets up IOMMU |
| `cuFileRead()` / `cuFileWrite()` | `ioctl(GDS_PREAD)` / `ioctl(GDS_PWRITE)` — P2P or bounce I/O |
| `cuFileBufDeregister()` | `ioctl(GDS_DEREGISTER_BUF)` — tears down IOMMU mapping |
| `cuFileHandleDeregister()` | `ioctl(GDS_DEREGISTER_HANDLE)` — destroys `GdsFileHandle` |
| `cuFileDriverClose()` | `close(fd)` — closes the GDS control device |
The ioctl numbers and struct layouts match NVIDIA's GDS kernel driver ABI so that
unmodified libcufile.so binaries work without recompilation.
22.4.3.7 Per-Architecture P2P Support¶
| Arch | P2P DMA Support | Notes |
|---|---|---|
| x86-64 | Full | PCIe P2P with IOMMU (Intel VT-d, AMD-Vi). Most server platforms support NVMe→GPU P2P. |
| AArch64 | Full | PCIe P2P with SMMU. NVIDIA Grace Hopper (GH200) has native NVMe→GPU P2P via NVLink-C2C. |
| PPC64LE | Full | NVLink enables direct GPU↔NVMe on IBM POWER9+ (OpenCAPI/NVLink bridge). |
| RISC-V | Limited | Depends on SoC PCIe topology and IOMMU availability. No server-class GDS hardware exists yet. |
| ARMv7 | No | No server-class PCIe topology for GDS workloads. |
| PPC32 | No | No relevant hardware. |
22.4.3.8 Performance Budget¶
| Path | Latency | Throughput | CPU Usage |
|---|---|---|---|
| P2P (NVMe→GPU direct) | ~5-10 µs | NVMe line rate (7 GB/s per NVMe Gen4 x4) | Near zero (only SQ doorbell write) |
| Bounce (NVMe→RAM→GPU) | ~20-50 µs | Limited by system RAM bandwidth | Moderate (DMA engine programming) |
| Traditional (read + cudaMemcpy) | ~50-100 µs | Limited by cudaMemcpy serialization | High (CPU copies or orchestrates both) |
With 8 NVMe Gen4 drives in P2P mode, aggregate read throughput at the drives reaches ~56 GB/s — more than a single GPU's PCIe Gen4 x16 link (~32 GB/s raw, ~26 GB/s effective) can absorb, so the GPU's slot, not the drives, becomes the bottleneck. Gen5 doubles the link budget.
22.4.3.9 GDS Cross-References¶
- Section 22.4 — P2P DMA base infrastructure (this section)
- Section 22.1 — `AccelDevice` trait that `GdsFileHandle.gpu_dev` implements
- Section 4.14 — DMA and IOMMU mapping used by GDS for NVMe→GPU IOMMU domains
- Section 15.19 — NVMe command submission (SQ/CQ) and PRP/SGL programming
- Section 15.2 — Block I/O path used by the bounce fallback
- Section 14.1 — `fiemap`/`iomap` interfaces for file→block mapping
22.5 Accelerator Isolation and Scheduling¶
22.5.1 Capability-Based Access Control¶
Every accelerator context is gated by the UmkaOS capability system:
```rust
// Extend existing cap_id constants in umka-driver-sdk/src/capability.rs
pub const ACCEL_COMPUTE: u32 = 0x0100;    // Submit compute work
pub const ACCEL_MEMORY: u32 = 0x0101;     // Allocate device memory
pub const ACCEL_P2P: u32 = 0x0102;        // P2P DMA transfers
pub const ACCEL_PREEMPT: u32 = 0x0103;    // Preempt other contexts (admin)
pub const ACCEL_PERF: u32 = 0x0104;       // Read performance counters
pub const ACCEL_POWER: u32 = 0x0105;      // Change power/clock state (admin)
pub const ACCEL_CONTEXT_RT: u32 = 0x0106; // Create realtime-priority contexts
```
A container gets a capability token that says "50% compute time on GPU 0, 8GB VRAM limit." The kernel enforces this through the AccelScheduler and memory accounting.
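The submission-path check implied here is small: the token must carry `ACCEL_COMPUTE` before any work reaches the scheduler, and its resource ceilings feed the accounting described below. A sketch under assumed types — `AccelCapToken` and `check_submit` are invented for illustration; only the cap_id constants come from the text:

```rust
/// Capability constants from the table above.
pub const ACCEL_COMPUTE: u32 = 0x0100;
pub const ACCEL_P2P: u32 = 0x0102;

/// Hypothetical token shape: granted cap_ids plus resource ceilings
/// (e.g. "50% compute time on GPU 0, 8GB VRAM limit").
pub struct AccelCapToken {
    caps: Vec<u32>,
    pub vram_limit_bytes: u64,
}

impl AccelCapToken {
    pub fn new(caps: Vec<u32>, vram_limit_bytes: u64) -> Self {
        AccelCapToken { caps, vram_limit_bytes }
    }

    pub fn allows(&self, cap: u32) -> bool {
        self.caps.contains(&cap)
    }
}

/// Gate on the submission path: without ACCEL_COMPUTE, no work is accepted.
pub fn check_submit(token: &AccelCapToken) -> Result<(), &'static str> {
    if token.allows(ACCEL_COMPUTE) {
        Ok(())
    } else {
        Err("EPERM: token lacks ACCEL_COMPUTE")
    }
}
```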
22.5.2 Cgroup Integration¶
New cgroup controller: accel.
```
/sys/fs/cgroup/<group>/accel.devices
#   Which accelerators this cgroup can access (by device index)
#   Format: "0 1 3" (devices 0, 1, 3 allowed)

/sys/fs/cgroup/<group>/accel.memory.max
#   Maximum total device memory across all accelerators (bytes)
#   Format: "8589934592" (8GB)

/sys/fs/cgroup/<group>/accel.memory.current
#   Current device memory usage (read-only)

/sys/fs/cgroup/<group>/accel.compute.guarantee
#   Guaranteed compute bandwidth (microseconds per second, per device)
#   Format: "device_idx quota period"
#   Example: "0 500000 1000000" (50% of GPU 0 guaranteed)

/sys/fs/cgroup/<group>/accel.compute.max
#   Maximum compute bandwidth ceiling
#   Format: same as guarantee

/sys/fs/cgroup/<group>/accel.compute.weight
#   Relative share of excess compute time (like cpu.weight)
#   Default: 100

/sys/fs/cgroup/<group>/accel.priority
#   Default priority for contexts created by this cgroup
#   "background", "normal", "high"
#   ("realtime" requires ACCEL_CONTEXT_RT capability)

/sys/fs/cgroup/<group>/accel.stat
#   Usage statistics (read-only):
#   compute_time_us <total compute microseconds>
#   memory_current <current bytes>
#   memory_peak <peak bytes>
#   submissions <total command submissions>
#   preemptions <times preempted by higher priority>
#   faults <device page faults>
#   migrations <pages migrated>
```
22.5.3 Memory Isolation¶
Each AccelContext has its own device-side page table (or partition of the device's address space). The kernel ensures:
- Context A cannot access Context B's device memory (separate page tables / address spaces).
- Context A cannot exceed its `accel.memory.max` cgroup limit.
- When Context A is destroyed, all its device memory is freed immediately.
- The OOM killer is aware of device memory. If a process is consuming excessive device memory and the device is under pressure, the OOM killer can target it.
Device Memory and the OOM Killer:
Device memory is additive to a process's OOM score — device_bytes is counted as an
RSS equivalent. Before the OOM killer selects a victim, the kernel attempts soft
reclaim: evicting device pages back to CPU RAM (using the migrate_pages vtable call).
Per-device memory pressure is tracked with high/low watermarks; when the high watermark
is breached, the AccelScheduler proactively triggers eviction of cold pages from
device memory to CPU RAM. If a process exceeds its cgroup accel.memory.max limit,
the allocation returns ENOMEM rather than triggering OOM kill — the process must
handle the allocation failure gracefully.
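The "device memory is additive to the OOM score" rule, plus soft reclaim, can be illustrated with a small model. The `ProcMem` struct and function names are invented here; the real kernel computes the score from the full OOM heuristic, with `device_bytes` folded in as an RSS equivalent:

```rust
/// Per-process memory footprint as the OOM killer sees it (sketch).
pub struct ProcMem {
    pub rss_bytes: u64,
    /// Device memory charged to this process, counted as RSS-equivalent.
    pub device_bytes: u64,
}

/// Higher score == more likely to be selected as the OOM victim.
pub fn oom_score(m: &ProcMem) -> u64 {
    m.rss_bytes + m.device_bytes
}

/// Soft reclaim tried before victim selection: migrate device pages back
/// to CPU RAM (via the migrate_pages vtable call in the real kernel).
/// Note this moves bytes between buckets but leaves the score unchanged —
/// it relieves *device* pressure, not total memory pressure.
pub fn soft_reclaim(m: &mut ProcMem, bytes: u64) {
    let moved = bytes.min(m.device_bytes);
    m.device_bytes -= moved;
    m.rss_bytes += moved;
}
```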
22.5.4 Compute Time Isolation¶
The AccelScheduler enforces compute time limits per context:
- `accel.compute.max`: Hard ceiling. If a context has used its quota for this period, its submissions are queued until the next period. (Same semantics as `cpu.max`.)
- `accel.compute.guarantee`: Minimum bandwidth via CBS server. (Same algorithm as `cpu.guarantee` from Section 7.6.)
- `accel.compute.weight`: Proportional sharing of compute time not covered by guarantees. (Same semantics as `cpu.weight`.)
- **Preemption**: If a high-priority context's CBS server becomes runnable, the scheduler preempts the current context (if hardware supports it). Otherwise, the current submission runs to completion and the high-priority context gets the next slot.
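The `accel.compute.weight` rule is the only one of the three with non-obvious arithmetic: compute time left over after all guarantees is split in proportion to weights. A sketch, assuming integer-microsecond accounting (`split_excess` is an illustrative name, not a kernel function):

```rust
/// Proportional split of leftover compute time by accel.compute.weight,
/// applied after all accel.compute.guarantee budgets are satisfied.
/// Integer division truncates; the real scheduler carries the remainder
/// forward rather than discarding it.
pub fn split_excess(excess_us: u64, weights: &[u64]) -> Vec<u64> {
    let total: u64 = weights.iter().sum();
    if total == 0 {
        return vec![0; weights.len()];
    }
    weights.iter().map(|w| excess_us * w / total).collect()
}
```

For example, two cgroups with weights 100 and 300 splitting 1000μs of excess receive 250μs and 750μs respectively.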
22.5.4.1 Context Preemption Memory Policy¶
When a context is preempted, its VRAM allocations cannot simply be abandoned — the device memory contains live computation state and data that may be needed when the context is rescheduled. This subsection specifies what happens to a context's VRAM allocations when it is preempted.
Memory residency state: Each AccelContextState carries a memory residency field:
```rust
/// Tracks where a preempted context's device-local memory currently resides.
/// Only meaningful when the context is not actively running on hardware.
#[repr(u32)]
pub enum AccelMemResidency {
    /// Memory is in device VRAM. Context can resume immediately upon
    /// rescheduling (subject to command-queue setup overhead only).
    InVram = 0,
    /// Memory has been migrated to CPU RAM via HMM (Section 22.2.1.4).
    /// Context must migrate pages back to VRAM before resuming.
    EvictedToCpuRam = 1,
    /// Memory is in CPU RAM and has been compressed by the memory subsystem
    /// (Section 4.10). Context must decompress and migrate back before resuming.
    /// Transparent to the AccelScheduler — HMM handles decompression on demand.
    EvictedAndCompressed = 2,
}
```
Three-tier policy: When the AccelScheduler preempts a context, it applies one of
three memory handling strategies depending on system pressure at the time of preemption:
Tier 1 — Leave in place (default, zero cost)
If VRAM utilization across the device is below vram_pressure_threshold (a tunable
parameter, default 85%), the preempted context's VRAM allocations are left untouched.
The context is descheduled from the hardware queue but its memory remains resident.
The AccelMemResidency state is InVram. On rescheduling, the context resumes
with no migration cost (<1μs overhead for command-queue re-setup only).
Tier 2 — Migrate to CPU RAM (VRAM pressure)
If vram_utilization > vram_pressure_threshold at the time of preemption, the kernel
invokes HMM migration to move the preempted context's VRAM pages to CPU RAM
(Section 22.4). The driver's migrate_pages()
vtable entry (Section 22.1) is
called with MigrationFlags::TO_CPU. Migration cost is approximately 10–100μs per 4MB
depending on PCIe generation and link utilization. The context's AccelMemResidency
transitions to EvictedToCpuRam.
The vram_pressure_threshold parameter is registered with the tunable parameter store
(Section 23.1) under the name
"accel.vram_pressure_threshold_pct" (default 85, range 50–99).
Tier 3 — Compress in CPU RAM (system memory pressure)
If CPU RAM is also under memory pressure after Tier 2 migration, the memory subsystem's
LZ4 compression path (Section 4.12) transparently compresses the migrated
pages and places them in the compressed pool. This is not triggered by the
AccelScheduler directly — it occurs automatically when the memory subsystem reclaims
pages from contexts in EvictedToCpuRam state. The AccelMemResidency transitions to
EvictedAndCompressed. The HMM layer handles decompression transparently when pages
are needed for migration back to VRAM.
Priority-driven eviction: A new high-priority context that cannot fit its VRAM allocation in available device memory MAY trigger eviction of an already-preempted lower-priority context's CPU RAM pages (Tier 3 compression). Selection criteria:
- Only contexts with `AccelMemResidency::EvictedToCpuRam` are eligible for forced Tier 3 compression (contexts that are `InVram` are preempted first per normal scheduling policy before their memory is migrated).
- Among eligible contexts, select the one with the lowest CBS priority (highest deadline value).
- Among ties in priority: select the largest VRAM footprint, maximizing the space freed per eviction operation.
Higher-priority contexts (lower CBS deadline) never have their memory evicted to satisfy
lower-priority allocations. The invariant is: a context's AccelMemResidency can only
be advanced toward EvictedAndCompressed by a request from an equal-or-higher-priority
context that cannot otherwise make progress.
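The victim-selection criteria above (eligibility filter, lowest priority first, footprint tiebreak) can be expressed directly as an ordering. A sketch with invented names — `EvictionCandidate` and `select_victim` are illustrative, not kernel API:

```rust
/// Candidate for forced Tier-3 compression (illustrative type).
/// A larger cbs_deadline means a lower CBS priority.
#[derive(Clone)]
pub struct EvictionCandidate {
    pub ctx_id: u64,
    pub evicted_to_cpu_ram: bool, // AccelMemResidency::EvictedToCpuRam
    pub cbs_deadline: u64,
    pub vram_footprint: u64,
}

/// Apply the selection criteria from the text:
/// 1. only EvictedToCpuRam contexts are eligible;
/// 2. pick the lowest CBS priority (largest deadline);
/// 3. break ties by largest VRAM footprint (max space freed per eviction).
pub fn select_victim(cands: &[EvictionCandidate]) -> Option<u64> {
    cands
        .iter()
        .filter(|c| c.evicted_to_cpu_ram)
        .max_by(|a, b| {
            a.cbs_deadline
                .cmp(&b.cbs_deadline)
                .then(a.vram_footprint.cmp(&b.vram_footprint))
        })
        .map(|c| c.ctx_id)
}
```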
Resume latency summary:
| Residency state | Resume cost | Trigger condition |
|---|---|---|
| `InVram` | <1μs | VRAM utilization < threshold |
| `EvictedToCpuRam` | 10–100μs per 4MB (PCIe transfer) | VRAM utilization ≥ threshold |
| `EvictedAndCompressed` | 50–500μs (decompress + PCIe) | CPU RAM pressure after eviction |
Resume latency is added to the context's AccelContextState::total_compute_ns
accounting as wait time, not compute time, so CBS bandwidth budgets are not penalized
for memory residency delays caused by other contexts' pressure.
22.5.4.2 Non-Preemptible Budget Overspend Handling¶
Unlike CPU CBS where a task can be preempted mid-quantum, accelerator operations often cannot be interrupted once submitted to hardware. When a CBS context's budget is exhausted mid-operation:
Detection: The CBS accounting thread (runs on each scheduler tick, 1ms resolution)
detects overspend when runtime_consumed > bandwidth_ns for the current period (see
AccelCbsServer fields above). It checks whether the context is currently executing a
non-preemptible command (AccelCmdFlags::NON_PREEMPTIBLE set by the driver at
submission, or preemption_granularity == PreemptionGranularity::None for the device).
Response to overspend:
- **Non-preemptible operation in progress**: The current command is allowed to complete. The overspend amount (`runtime_consumed - bandwidth_ns`) is recorded as `debt_us` in the CBS state.
- **Debt carry-forward**: In the next CBS period, the context's effective budget is reduced by `debt_us`.
- **Debt cap**: Accumulated debt cannot exceed 3× the period budget. Excess debt is forgiven (prevents a single oversized command from penalizing a context across many future periods).
- **Suspension**: If a context accumulates debt equal to 5× its period budget within a 10-second window (indicating systematic abuse), it is suspended for one full CBS period and the process is sent `SIGXCPU` (compatible with Linux's CPU bandwidth enforcement signal for user-facing tooling).
- **Accounting precision**: Overspend is measured using hardware completion timestamps (from the device's completion event ring) for preemptible commands, and estimated from `submitted_timestamp_ns + avg_cmd_duration_us` for non-preemptible commands (hardware does not always report sub-command timing).
FMA reporting: Systematic overspend (debt > period budget for 10+ consecutive
periods) triggers a cbs_budget_violation FMA event (Section 20.1)
with the context ID, device, and debt statistics. Operators can use this to identify
and cap runaway accelerator workloads.
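The carry-forward, 3×-cap, and 5×-suspension rules compose into a small piece of bookkeeping. A sketch under assumed names (`CbsDebt` is invented; the sliding-window tracking for the suspension rule is elided and passed in as a precomputed sum):

```rust
/// CBS debt bookkeeping per the overspend rules above (illustrative type).
pub struct CbsDebt {
    pub period_budget_us: u64,
    pub debt_us: u64,
}

impl CbsDebt {
    /// Record overspend at period end. Debt is capped at 3x the period
    /// budget; anything beyond the cap is forgiven.
    pub fn record_overspend(&mut self, overspend_us: u64) {
        self.debt_us = (self.debt_us + overspend_us).min(3 * self.period_budget_us);
    }

    /// Effective budget for the next period after debt repayment.
    pub fn effective_budget(&self) -> u64 {
        self.period_budget_us.saturating_sub(self.debt_us)
    }

    /// Suspension test: 5x the period budget accumulated within the
    /// 10-second abuse window (window accounting not modeled here).
    pub fn should_suspend(&self, window_debt_us: u64) -> bool {
        window_debt_us >= 5 * self.period_budget_us
    }
}
```

Note the interaction of the two caps: with debt capped at 3× per period, suspension requires sustained overspend across multiple periods within the window, which is exactly the "systematic abuse" signal the text describes.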
22.5.5 Device Partitioning¶
For hardware that supports partitioning (NVIDIA MIG, AMD spatial partitioning):
```rust
/// Device partition descriptor.
#[repr(C)]
pub struct AccelPartition {
    /// Partition index.
    pub index: u32,
    /// Compute units assigned to this partition (same semantics as
    /// `AccelDeviceInfo::compute_units` — see Section 22.1.2.3).
    pub compute_units: u32,
    /// Memory assigned to this partition (bytes).
    pub memory_bytes: u64,
    /// Unique partition ID (for cgroup binding).
    pub partition_id: u64,
}

// AccelPartition: u32(4)*2 + u64(8)*2 = 24 bytes. KABI struct.
const_assert!(core::mem::size_of::<AccelPartition>() == 24);
The device registry models partitions as child nodes of the GPU device node:
```
pci0000:00
 +-- 0000:41:00.0 (GPU, NVIDIA A100)
      +-- partition0 (MIG 2g.20gb: 28 SMs, 20GB)
      +-- partition1 (MIG 2g.20gb: 28 SMs, 20GB)
      +-- partition2 (MIG 3g.40gb: 42 SMs, 40GB)
      (108 total SMs; 98 usable for MIG as 7 slices of 14 SMs each, 10 reserved for system use)
```
Each partition is an independently schedulable and isolatable unit. Cgroups can be bound
to specific partitions: echo "0000:41:00.0/partition0" > accel.devices.
22.5.6 GPU Virtualization Modes¶
GPUs support multiple virtualization approaches. The kernel accommodates all four without imposing one model.
Note: VFIO passthrough (mode 1) and SR-IOV (mode 2) are general-purpose PCIe virtualization technologies — widely used for NICs, NVMe controllers, FPGAs, and other devices. The general IOMMU/VFIO mechanism is described in Section 11.5. This section describes their specific application to GPUs and accelerators. Modes 3 (mdev) and 4 (MIG) are GPU/accelerator-specific.
1. PCIe Passthrough (VFIO):
Entire GPU assigned to a single VM via IOMMU.
Kernel role: IOMMU group management, VFIO device file (/dev/vfio/N).
Performance: near-native. No sharing between VMs.
Use case: dedicated GPU per VM (HPC, gaming).
(This is the same VFIO mechanism used for NIC passthrough, NVMe
passthrough, etc. — the interface is device-agnostic.)
2. SR-IOV (Single Root I/O Virtualization):
Hardware creates Virtual Functions (VFs), each a separate PCIe
device with its own BARs and IOMMU mapping.
Kernel role: enumerate VFs, create device registry nodes for each,
assign VFs to VMs via VFIO. AccelScheduler is NOT involved — each VF
is an independent device scheduled by its own firmware.
GPU support: Intel Data Center GPU Max, some AMD MI-series.
Not supported by: NVIDIA consumer GPUs, most AMD consumer GPUs.
(SR-IOV is more common for NICs — e.g., Intel E810, Mellanox
ConnectX — where each VF provides an independent network interface.
The kernel handles GPU VFs and NIC VFs identically at the PCIe level.)
3. Mediated Passthrough (mdev / vGPU):
Software-defined GPU partitions, managed by the GPU driver.
NVIDIA vGPU (GRID) and Intel GVT-g use this model.
(AMD MxGPU uses SR-IOV, not mdev — see #2 above.)
Kernel role:
- mdev framework: /sys/bus/mdev/ device lifecycle (same as Linux).
- Each mdev appears as a VFIO device to the VM (same API as #1).
- The GPU driver (Tier 1) handles the actual time-slicing and
memory partitioning internally.
- AccelScheduler can enforce per-mdev cgroup limits if the driver
exposes per-mdev utilization via get_utilization().
Trade-off: more flexible than SR-IOV, but relies on driver quality.
4. MIG (Multi-Instance GPU) — NVIDIA A100/H100:
Hardware partitioning into isolated GPU instances, each with
dedicated SMs, memory controllers, and L2 cache.
Modeled as child nodes in device registry (Section 22.5.5 above).
Can be combined with VFIO: each MIG instance can be passed through
to a separate VM.
The KABI AccelBase vtable supports all four modes — the kernel sees devices (physical, SR-IOV VF, MIG partition, or mdev) through the same interface. The virtualization mode is a configuration choice, not an architectural difference.
22.5.7 Hardware Reset-on-Timeout (HROT)¶
Problem: Many GPU and AI accelerator architectures (older CUDA generations, most NPUs, custom ASICs) do not support fine-grained preemption of long-running compute kernels. Once a kernel is submitted to hardware, it runs to completion — or hangs forever. A single misbehaving or hostile workload can render the accelerator unusable for all other processes.
Section 22.5.4 describes compute time isolation via CBS bandwidth servers and cooperative
preemption, but these mechanisms depend on the device eventually yielding control.
preempt_context() (Section 22.1)
can only stop feeding new work at the next command buffer boundary — it cannot interrupt a
running dispatch. If the dispatch itself hangs (infinite loop shader, firmware deadlock,
hardware fault), no amount of cooperative scheduling recovers the device. HROT is the
escalation path for this failure mode.
22.5.7.1 HROT State Machine¶
Every AccelContext managed by the AccelScheduler has a watchdog timer. The state
machine governs the lifecycle of a submission from the kernel's perspective:
```
IDLE ──submit──> RUNNING ──complete──> IDLE
                    │
                    ├── timeout ──> WATCHDOG_EXPIRED ──soft_reset──> RESETTING
                    │                                                    │
                    │                                              reset_complete
                    │                                                    │
                    └────────────────────────────────────────────────> IDLE
                          (context destroyed if owner still hung)
```
State transitions:
- IDLE -> RUNNING: The `AccelScheduler` calls `submit_commands()` on the driver. The watchdog timer is armed with `hrot.soft_timeout_ms`.
- RUNNING -> IDLE (normal): The driver signals completion via the fence mechanism (Section 22.1, `DmaFence`). The watchdog is disarmed.
- RUNNING -> WATCHDOG_EXPIRED: The soft timeout fires. The kernel attempts driver-level preemption via `preempt_context()`. If the device supports preemption (`AccelPreemptionGranularity` is not `CommandBoundary`), this may succeed within ~1ms. A hard timeout timer is armed with `hrot.hard_timeout_ms`.
- WATCHDOG_EXPIRED -> RESETTING: The hard timeout fires without the device completing or yielding. The kernel escalates to a hardware reset according to `hrot.hard_action`.
- RESETTING -> IDLE: The device reset completes, the driver re-initializes (same flow as crash recovery in Section 22.3), and the context is destroyed or recreated depending on whether the owning process is still alive.
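The transitions above can be captured as a pure state function. This is a simplified model for illustration; the real `AccelScheduler` state additionally carries timers and atomics:

```rust
/// Simplified model of the HROT per-context state machine.
/// Captures only the legal transitions described above.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum HrotState {
    Idle,
    Running,
    WatchdogExpired,
    Resetting,
}

#[derive(Clone, Copy)]
pub enum HrotEvent {
    Submit,
    Complete,
    SoftTimeout,
    HardTimeout,
    ResetComplete,
}

/// Returns the next state, or None for an illegal transition.
pub fn hrot_next(state: HrotState, ev: HrotEvent) -> Option<HrotState> {
    use HrotEvent::*;
    use HrotState::*;
    match (state, ev) {
        (Idle, Submit) => Some(Running),
        (Running, Complete) => Some(Idle),
        (Running, SoftTimeout) => Some(WatchdogExpired),
        // Preemption succeeded after the soft timeout: back to Idle.
        (WatchdogExpired, Complete) => Some(Idle),
        (WatchdogExpired, HardTimeout) => Some(Resetting),
        (Resetting, ResetComplete) => Some(Idle),
        _ => None,
    }
}
```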
22.5.7.2 HROT Configuration¶
System-level tunable defaults: The default HROT timeout values are registered as
KernelTunableParam entries (Section 23.1) rather
than compile-time constants. This allows operators to adjust them at runtime for
workloads with legitimately long execution times (e.g., large model compilation,
offline batch inference) without recompiling the kernel.
// umka-core/src/accel/hrot.rs
/// Soft watchdog timeout default (milliseconds).
/// On soft timeout expiry, the kernel sends PreemptReason::WatchdogSoftTimeout
/// to the driver. The hard timeout timer is then armed.
/// Registered in KernelParamStore at boot via register_param!().
pub const ACCEL_SOFT_TIMEOUT_DEFAULT_MS: u64 = 5_000;
/// Hard watchdog timeout default (milliseconds).
/// On hard timeout expiry, the kernel escalates to accel_hard_reset().
/// Registered in KernelParamStore at boot via register_param!().
pub const ACCEL_HARD_TIMEOUT_DEFAULT_MS: u64 = 30_000;
// The params are registered at boot as:
//
// register_param!("accel.watchdog.soft_timeout_ms",
// default: ACCEL_SOFT_TIMEOUT_DEFAULT_MS,
// min: 1_000, // 1 second minimum — prevents spurious resets
// max: 300_000, // 5 minutes maximum
// decay_period_ms: 0, // manual only, no auto-decay
// subsystem: SubsystemId::Accel,
// );
//
// register_param!("accel.watchdog.hard_timeout_ms",
// default: ACCEL_HARD_TIMEOUT_DEFAULT_MS,
// min: 2_000,
// max: 600_000, // 10 minutes maximum
// decay_period_ms: 0,
// subsystem: SubsystemId::Accel,
// );
Invariant: hard_timeout_ms > soft_timeout_ms. The kernel MUST enforce this on
every write to either parameter. Attempts to set hard_timeout_ms <= soft_timeout_ms
are rejected with EINVAL. Attempts to lower soft_timeout_ms to a value that would
violate the invariant against the current hard_timeout_ms are similarly rejected.
Operators updating both parameters must order their writes so the invariant holds at every intermediate step: when increasing, write hard_timeout_ms first; when decreasing, write soft_timeout_ms first.
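A sketch of the ordering rule as a helper that plans the two writes (hypothetical function; the real enforcement lives in the KernelParamStore write handler):

```rust
/// Plans the order of the two tunable writes so that the
/// hard > soft invariant holds at every intermediate step.
/// Hypothetical sketch; EINVAL value matches the POSIX errno.
pub fn ordered_watchdog_update(
    cur_hard_ms: u64,
    new_soft_ms: u64,
    new_hard_ms: u64,
) -> Result<[(&'static str, u64); 2], i32> {
    const EINVAL: i32 = 22;
    if new_hard_ms <= new_soft_ms {
        return Err(EINVAL);
    }
    let soft = ("accel.watchdog.soft_timeout_ms", new_soft_ms);
    let hard = ("accel.watchdog.hard_timeout_ms", new_hard_ms);
    // Raising the window: write hard first so soft never crosses it;
    // shrinking the window: write soft first, for the same reason.
    Ok(if new_hard_ms >= cur_hard_ms { [hard, soft] } else { [soft, hard] })
}
```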
Per-context override: Individual contexts may specify shorter timeouts via
HrotConfig (embedded in AccelContextLimits, described below). The effective timeout
applied to any submission is:
effective_soft_timeout_ms = min(HrotConfig::soft_timeout_ms,
param_store["accel.watchdog.soft_timeout_ms"])
effective_hard_timeout_ms = min(HrotConfig::hard_timeout_ms,
param_store["accel.watchdog.hard_timeout_ms"])
A per-context timeout of 0 means "use the system tunable." A per-context timeout
greater than the system tunable is silently clamped to the system tunable — individual
contexts cannot exceed the system-wide ceiling.
HrotConfig is embedded in AccelContextLimits and configurable per-context at creation
time. The kernel enforces minimum values to prevent userspace from disabling the watchdog
entirely (a process requesting hard_timeout_ms = 0 gets the system tunable default).
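The override resolution reduces to a small pure function, sketched here under the assumption that the minimum-floor enforcement happens separately at context creation:

```rust
/// Resolves a per-context timeout override against the system tunable.
/// 0 means "use the system tunable"; values above the system ceiling
/// are silently clamped. The minimum floor (1s soft / 2s hard) is
/// enforced separately at context creation and omitted here.
pub fn effective_timeout_ms(ctx_override_ms: u32, system_tunable_ms: u64) -> u64 {
    if ctx_override_ms == 0 {
        system_tunable_ms
    } else {
        (ctx_override_ms as u64).min(system_tunable_ms)
    }
}
```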
/// Hardware Reset-on-Timeout configuration for an accelerator context.
/// Embedded in AccelContextLimits. Governs how the kernel responds when
/// a submitted command does not complete within the expected time.
/// Timeout values here are per-context overrides; the system-level defaults
/// are governed by the "accel.watchdog.soft_timeout_ms" and
/// "accel.watchdog.hard_timeout_ms" KernelTunableParam entries
/// (see KernelParamStore, Section 23.1.3).
#[repr(C)]
pub struct HrotConfig {
/// Soft timeout override (milliseconds). 0 = use system tunable default.
/// Clamped to [1_000, system_tunable_soft_timeout_ms].
/// Warn + attempt driver-level preemption via preempt_context() on expiry.
pub soft_timeout_ms: u32,
/// Hard timeout override (milliseconds). 0 = use system tunable default.
/// Must be > soft_timeout_ms (kernel rejects with EINVAL otherwise).
/// Clamped to [soft_timeout_ms + 1_000, system_tunable_hard_timeout_ms].
pub hard_timeout_ms: u32,
/// Action to take when the hard timeout fires.
pub hard_action: HrotAction,
/// Explicit padding for repr(C) stability.
pub _pad: [u8; 28],
}
// HrotConfig: u32(4)*2 + HrotAction(1) + [u8;28] = 37, padded to 40 (align 4).
// KABI struct — embedded in AccelContextLimits.
const_assert!(core::mem::size_of::<HrotConfig>() == 40);
/// Action taken when HROT hard timeout fires.
#[repr(u8)]
pub enum HrotAction {
/// Destroy the offending context, deliver SIGKILL to the submitting
/// process, reset the accelerator engine, and allow other contexts
/// to continue. This is the default and preferred action.
/// Requires AccelDeviceHrotCaps::supports_context_reset == true.
/// If the device does not support per-context reset, the kernel
/// automatically falls back to ResetDevice.
KillContextAndReset = 0,
/// Destroy ALL contexts on the device and perform a full device reset
/// (PCIe FLR or vendor-specific reset sequence). Used when per-context
/// reset is not supported by the hardware, or when a per-context reset
/// has already failed.
ResetDevice = 1,
/// Do not reset — just log a warning and deliver SIGKILL. Use when
/// the hardware does not support reset (e.g., some FPGAs without
/// partial reconfiguration). The device remains unavailable until
/// the process exits and the driver reinitializes.
LogAndKill = 2,
}
22.5.7.3 Watchdog Implementation¶
The watchdog runs on the kernel's timer subsystem (Section 7.8),
not on the accelerator. One timer per active submission, managed by the AccelScheduler.
/// Called by the AccelScheduler's periodic tick (typically 1ms resolution).
/// Checks whether the current submission on the given context has exceeded
/// its timeout thresholds.
fn accel_watchdog_tick(ctx: &AccelContext) {
let elapsed = now() - ctx.last_submit_time.load(Acquire);
if elapsed > ctx.hrot.soft_timeout_ms as u64 * 1_000_000 {
if ctx.soft_reset_attempted.compare_exchange(
false, true, AcqRel, Acquire
).is_ok() {
// Attempt driver-level preemption (send stop command to
// hardware queue). If the hardware supports it, this
// completes within ~1ms. On non-preemptible hardware,
// this is a cooperative yield request — it has no effect
// if the dispatch is truly hung.
ctx.driver.vtable.preempt_context(
ctx.handle,
PreemptReason::WatchdogSoftTimeout,
);
// Log to FMA telemetry ring (Section 22.1.3.4).
fma_report(FmaEvent::AccelSoftTimeout {
device: ctx.device_id,
context: ctx.id,
elapsed_ms: (elapsed / 1_000_000) as u32,
});
}
}
if elapsed > ctx.hrot.hard_timeout_ms as u64 * 1_000_000 {
// Hard timeout — escalate to hardware reset.
accel_hard_reset(ctx);
}
}
/// Escalation path when soft preemption has failed and the hard timeout
/// fires. This is the last resort — the device is assumed hung.
fn accel_hard_reset(ctx: &AccelContext) {
match ctx.hrot.hard_action {
HrotAction::KillContextAndReset => {
// 1. Mark context as zombie (reject new submissions).
ctx.state.store(ContextState::Zombie, Release);
// 2. Send SIGKILL to the owning process.
signal_send(ctx.owner_pid, SIGKILL);
// 3. Call driver reset. The driver calls PCIe FLR or a
// vendor-specific reset sequence (e.g., NVIDIA GSP
// reset, AMD SDMA drain). The reset is synchronous
// from the kernel's perspective — the driver returns
// when the device is ready for re-initialization.
ctx.driver.vtable.device_reset(ctx.handle);
// 4. Re-initialize the device for other contexts.
// Same flow as crash recovery (Section 22.1.3.2 step 5).
ctx.driver.vtable.device_init(ctx.handle);
// 5. Log the event via FMA.
fma_report(FmaEvent::AccelHardReset {
device: ctx.device_id,
context: ctx.id,
});
}
HrotAction::ResetDevice => {
// Reset affects ALL contexts on the device. Each context's
// owning process receives an error on pending submissions.
// This reuses the crash recovery path (Section 22.1.3.2).
accel_reset_all_contexts(ctx.device);
}
HrotAction::LogAndKill => {
// Cannot reset the device. Kill the process and hope the
// driver can reinitialize when the process's context is
// dropped. This is the worst case — the device may remain
// hung until the driver is reloaded or the system reboots.
signal_send(ctx.owner_pid, SIGKILL);
fma_report(FmaEvent::AccelHung {
device: ctx.device_id,
});
}
}
}
device_reset and device_init vtable entries: HROT requires two new entries added to AccelBaseVTable (these are additions to the base vtable defined in Section 22.1):
// Additions to AccelBaseVTable:
/// Reset a single context's hardware state. Called when HROT fires with
/// KillContextAndReset. The driver must drain the device's command queue
/// for this context, discard any in-flight work, and leave the device in
/// a state where other contexts can continue submitting.
/// Returns IO_NOT_SUPPORTED if per-context reset is not possible (kernel
/// falls back to ResetDevice).
pub device_reset: Option<unsafe extern "C" fn(
ctx: *mut c_void,
context: AccelContextHandle,
) -> IoResultCode>,
/// Full device re-initialization after a reset. Called after device_reset
/// when per-context reset was used, or after PCIe FLR when full device
/// reset was used. The driver must restore the device to a clean state
/// equivalent to a fresh driver load.
pub device_init: Option<unsafe extern "C" fn(
ctx: *mut c_void,
) -> IoResultCode>,
Error handling and retry policy:
| Operation | Error Return | Kernel Action | Retry Count |
|---|---|---|---|
| `device_reset` returns error | `EIO`, `ETIMEDOUT`, etc. | Escalate to full device reset (PCIe FLR) immediately | 0 retries — escalate immediately |
| `device_init` returns error | Any error | Mark device as `Faulted`, trigger crash recovery path | Up to 3 attempts with exponential backoff (100ms, 200ms) |
| `device_init` fails 3× | — | Give up, mark device permanently faulted, alert administrator | N/A — maximum retries exceeded |
Retry details for device_init:
/// Attempt to initialize a device after reset, with retry logic.
/// Returns IO_OK on success, or the last error code after all
/// retries are exhausted.
fn hrot_device_init_with_retry(
    driver: &AccelDriver,
    ctx: *mut c_void,
) -> IoResultCode {
    const MAX_INIT_RETRIES: u32 = 3;
    const INIT_RETRY_DELAY_MS: u64 = 100;
    let mut last_err = IO_OK;
    for attempt in 0..MAX_INIT_RETRIES {
        match unsafe { driver.vtable.device_init(ctx) } {
            IO_OK => return IO_OK,
            err => {
                last_err = err;
                fma_report(FmaEvent::AccelDeviceInitFailed {
                    device: driver.device_id,
                    attempt,
                    error_code: err,
                });
                if attempt < MAX_INIT_RETRIES - 1 {
                    // Exponential backoff before retry: 100ms, then 200ms.
                    sleep_ms(INIT_RETRY_DELAY_MS << attempt);
                }
            }
        }
    }
    // All retries exhausted; return the last driver error.
    last_err
}
/// HROT escalation path when device_reset or device_init fails.
fn hrot_reset_failure_escalation(ctx: &AccelContext) {
    // Step 1: Attempt per-context reset first (least disruptive).
    let reset_rc = ctx.driver.vtable.device_reset(ctx.handle);
    if reset_rc != IO_OK {
        // Per-context reset failed or is not supported.
        fma_report(FmaEvent::AccelContextResetFailed {
            device: ctx.device_id,
            context: ctx.id,
            error_code: reset_rc,
        });
        // Step 2: Full device reset via PCIe FLR (affects ALL
        // contexts), then re-initialize the driver state with up to
        // 3 attempts and backoff (see hrot_device_init_with_retry).
        let init_rc = hrot_device_init_with_retry(&ctx.driver, ctx.handle);
        if init_rc != IO_OK {
            // All retries exhausted; device is permanently faulted.
            fma_report(FmaEvent::AccelDevicePermanentlyFaulted {
                device: ctx.device_id,
                reason: "device_init failed after 3 retries",
                error_code: init_rc,
            });
            // Transition device to Faulted state.
            ctx.device.mark_faulted();
            // Trigger crash recovery path (driver unload/reload).
            trigger_crash_recovery(ctx.device);
        }
    }
}
State transitions on error:
device_reset error
  → device_init (full device re-init)
      ├─ success           → device state: Active, contexts re-established
      ├─ error (attempt 1) → retry after 100ms
      ├─ error (attempt 2) → retry after 200ms
      └─ error (attempt 3) → device state: Faulted;
                             trigger crash recovery; alert administrator
Crash recovery trigger:
When device_init fails after all retries, the kernel triggers the crash recovery path
(Section 22.3):
1. Mark device as Faulted in FMA.
2. Unload the driver module (if possible).
3. Perform PCIe FLR to reset the device to a known state.
4. Reload the driver module.
5. Attempt device_init one more time (fresh driver load).
6. If still failing, keep device offline and alert administrator.
22.5.7.4 Per-Submission Timeout¶
In addition to the context-level watchdog, individual command submissions can carry a deadline. This is useful for ML inference workloads with SLO (Service Level Objective) requirements — a single inference that exceeds its deadline should be killed without waiting for the full HROT hard timeout.
/// Per-submission deadline. Carried in the submission descriptor passed
/// to submit_commands() (Section 22.1.2.2). If the submission does not
/// complete by the deadline, the context is treated as hung and HROT
/// triggers immediately (bypassing the normal soft/hard timeout sequence).
pub struct AccelSubmissionDeadline {
/// Absolute time (monotonic clock, nanoseconds since boot) by which
/// this submission must complete. None = use context-level HROT config.
pub deadline_ns: Option<u64>,
}
The per-submission deadline interacts with the context-level watchdog as follows:
- If `deadline_ns` is set and earlier than the context-level soft timeout, the watchdog fires at `deadline_ns` instead.
- If `deadline_ns` is set but later than the context-level hard timeout, the context-level hard timeout takes precedence (the kernel never allows a single submission to extend the overall HROT window).
- Per-submission deadlines do not affect other submissions on the same context. Each submission's watchdog is independent.
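These rules collapse into a single expression for the next watchdog fire time; a sketch (hypothetical helper, times in nanoseconds since boot):

```rust
/// Computes when the watchdog should next fire for a submission.
/// A per-submission deadline may only tighten the window: it can pull
/// the fire time earlier, but never extend it past the context-level
/// timeouts.
pub fn watchdog_fire_ns(
    submit_time_ns: u64,
    soft_timeout_ms: u64,
    hard_timeout_ms: u64,
    deadline_ns: Option<u64>,
) -> u64 {
    let soft_expiry = submit_time_ns + soft_timeout_ms * 1_000_000;
    let hard_expiry = submit_time_ns + hard_timeout_ms * 1_000_000;
    match deadline_ns {
        // An earlier deadline replaces the soft expiry; a later one is
        // ignored because the context-level timeouts take precedence.
        Some(d) => d.min(soft_expiry).min(hard_expiry),
        None => soft_expiry,
    }
}
```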
22.5.7.5 Hardware Capability Detection¶
The AccelBase KABI requires drivers to advertise their reset and preemption capabilities
so the kernel can select the appropriate HROT action automatically. This is reported via
a new capabilities structure returned by get_info():
/// HROT-related device capabilities, reported by the driver at
/// registration time via get_info().
#[repr(C)]
pub struct AccelDeviceHrotCaps {
/// Device supports per-context reset without affecting other contexts.
/// If false, KillContextAndReset automatically falls back to ResetDevice.
/// u8 instead of bool for stable repr(C) ABI (0 = false, 1 = true).
pub supports_context_reset: u8,
/// Device supports driver-level soft preemption of running kernels
/// (via preempt_context). If false, the soft timeout phase is skipped
/// and HROT proceeds directly to the hard timeout.
/// u8 instead of bool for stable repr(C) ABI (0 = false, 1 = true).
pub supports_preemption: u8,
/// Minimum reset latency in microseconds. The device is unavailable
/// for new submissions during this time. The AccelScheduler uses this
/// to estimate recovery time when deciding whether to migrate contexts
/// to another device vs. waiting for the reset to complete.
/// 0 = unknown (kernel assumes 500ms as conservative default).
pub reset_latency_us: u32,
/// Maximum number of consecutive resets before the kernel marks the
/// device as permanently faulted (via FMA). Prevents reset storms
/// caused by hardware defects.
/// 0 = no limit (not recommended; default: 5).
pub max_consecutive_resets: u32,
/// Explicit padding for repr(C) stability.
pub _pad: [u8; 20],
}
// AccelDeviceHrotCaps: u8(1)*2 + 2pad + u32(4)*2 + [u8;20] = 32 bytes. KABI struct.
const_assert!(core::mem::size_of::<AccelDeviceHrotCaps>() == 32);
The kernel uses these capabilities to make automatic decisions:
| supports_context_reset | supports_preemption | Kernel behavior |
|---|---|---|
| true | true | Full HROT: soft preempt -> context reset -> device reset (escalation) |
| true | false | Skip soft timeout, go directly to hard timeout -> context reset |
| false | true | Soft preempt -> full device reset on hard timeout |
| false | false | Hard timeout -> full device reset (worst case: affects all contexts) |
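The four rows of the table can be encoded directly; a sketch with a hypothetical plan enum:

```rust
#[derive(Debug, PartialEq)]
pub enum HrotPlan {
    /// Soft preempt -> context reset -> device reset escalation.
    FullEscalation,
    /// No soft phase; hard timeout -> context reset.
    ContextResetOnly,
    /// Soft preempt, then full device reset on hard timeout.
    PreemptThenDeviceReset,
    /// Hard timeout -> full device reset, affecting all contexts.
    DeviceResetOnly,
}

/// Selects the HROT escalation plan from the driver-reported caps.
/// Encodes the capability table above; 0 = false, 1 = true per the
/// repr(C) convention of AccelDeviceHrotCaps.
pub fn select_hrot_plan(supports_context_reset: u8, supports_preemption: u8) -> HrotPlan {
    match (supports_context_reset != 0, supports_preemption != 0) {
        (true, true) => HrotPlan::FullEscalation,
        (true, false) => HrotPlan::ContextResetOnly,
        (false, true) => HrotPlan::PreemptThenDeviceReset,
        (false, false) => HrotPlan::DeviceResetOnly,
    }
}
```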
22.5.7.6 Interaction with Live Migration¶
If a VM with a GPU context is being live-migrated (Section 18.1),
the HROT watchdog must account for the checkpoint/restore window during which the device
is quiesced but not hung. Without adjustment, a migration that takes longer than
hrot.hard_timeout_ms would trigger a false reset.
Rules:
- When a live migration begins for a VM that owns accelerator contexts, the `AccelScheduler` enters migration hold for those contexts. The HROT watchdog is suspended (timers are paused, not cancelled).
- The migration hold extends the effective hard timeout by the migration duration, up to a maximum extension of `hrot.hard_timeout_ms` (i.e., the total allowed time is at most `2 * hrot.hard_timeout_ms`). This prevents indefinite watchdog suspension in the case of a migration that itself hangs.
- If the migration completes successfully, the watchdog resumes with the remaining time from before the hold.
- If the migration fails or is cancelled, the watchdog resumes immediately. Any time already elapsed before the hold still counts toward the timeout.
- The migration hold is logged via FMA so administrators can correlate HROT events with migration activity.
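The extension cap and resume rules reduce to arithmetic on the remaining hard-timeout budget; a sketch (hypothetical helper, interpreting the cap as charging only migration time beyond one hard_timeout_ms against the budget):

```rust
/// Remaining hard-timeout budget (ms) after a migration hold.
/// The hold pauses the watchdog, but the total extension is capped at
/// one full hard_timeout_ms, so a hung migration cannot suspend the
/// watchdog indefinitely.
pub fn budget_after_hold_ms(
    hard_timeout_ms: u64,
    elapsed_before_hold_ms: u64,
    migration_duration_ms: u64,
) -> u64 {
    let remaining = hard_timeout_ms.saturating_sub(elapsed_before_hold_ms);
    // The portion of the migration that exceeded the allowed extension
    // counts against the remaining budget.
    let overrun = migration_duration_ms.saturating_sub(hard_timeout_ms);
    remaining.saturating_sub(overrun)
}
```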
22.5.7.7 Interaction with Crash Recovery¶
HROT and the crash recovery path (Section 22.3) share the same device reset mechanism but are triggered by different conditions:
- Crash recovery: triggered by a driver fault (segfault, panic, domain violation in Tier 1 isolation). The driver itself is broken.
- HROT: triggered by a device hang (the driver is healthy but the hardware is not responding). The driver is still running and can perform the reset.
When HROT triggers a ResetDevice, it reuses the crash recovery path from step 5 onward
(PCIe FLR, driver reload, vtable re-exchange). The difference is that in the HROT case,
the driver is asked to perform the reset itself first (device_reset vtable call). Only
if the driver's reset fails (returns an error or times out) does the kernel fall back to
the full crash recovery path with driver unload/reload.
Reset storm protection: If a device triggers HROT more than
AccelDeviceHrotCaps::max_consecutive_resets times within a 60-second window, the kernel
marks the device as permanently faulted via FMA (Section 22.3),
transitions its device registry node to Faulted, and refuses new context creation. This
prevents a hardware defect from causing an infinite reset loop that degrades system
performance. An administrator can manually clear the fault via umkafs
(Section 20.5).
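The storm check is a sliding-window counter; a userspace-testable sketch (the kernel version uses monotonic time and the FMA device state, and would avoid the heap allocation shown here):

```rust
/// Sliding-window reset storm detector. Records reset timestamps and
/// reports whether the device should be marked permanently faulted.
pub struct ResetStormDetector {
    window_ns: u64,
    max_resets: usize,        // AccelDeviceHrotCaps::max_consecutive_resets
    timestamps_ns: Vec<u64>,  // resets inside the current window
}

impl ResetStormDetector {
    pub fn new(max_resets: usize) -> Self {
        Self {
            window_ns: 60_000_000_000, // 60-second window
            max_resets,
            timestamps_ns: Vec::new(),
        }
    }

    /// Records a reset at `now_ns`; returns true if the device has
    /// exceeded the limit and must transition to Faulted.
    pub fn record_reset(&mut self, now_ns: u64) -> bool {
        let cutoff = now_ns.saturating_sub(self.window_ns);
        // Drop resets that have aged out of the window.
        self.timestamps_ns.retain(|&t| t >= cutoff);
        self.timestamps_ns.push(now_ns);
        self.timestamps_ns.len() > self.max_resets
    }
}
```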
22.5.8 Multi-Instance GPU (MIG) Partitioning¶
MIG is an NVIDIA technology (A100, H100, B200) that divides a single physical GPU into multiple isolated instances, each with dedicated compute units, memory controllers, and memory bandwidth. Unlike time-slicing (below), MIG provides hardware-level isolation: a noisy neighbor in one partition cannot affect latency or throughput in another.
UmkaOS models MIG partitions as first-class device registry nodes. Each partition appears
as an independent accelerator device to the rest of the kernel — the AccelScheduler,
cgroup controller, VFIO passthrough, and Kubernetes device plugin all operate on MIG
partitions without special-casing.
/// MIG partition descriptor — represents one isolated GPU instance.
/// Each MIG partition has dedicated compute units (SMs), a dedicated
/// memory controller slice, and dedicated L2 cache. Cross-partition
/// interference is eliminated at the hardware level.
pub struct MigPartition {
/// Partition ID (0-based within the GPU).
pub partition_id: u32,
/// GPU instance ID assigned by the GPU firmware.
pub gi_id: u32,
/// Compute instance ID within the GPU instance.
pub ci_id: u32,
/// MIG profile used to create this partition.
pub profile: MigProfile,
/// Number of compute units (SMs) in this partition.
pub compute_units: u32,
/// Memory size in bytes.
pub memory_bytes: u64,
/// Memory bandwidth fraction in per-mille (0-1000 = 0.0%-100.0%).
pub memory_bw_permille: u16,
/// Associated DRM minor device node.
pub drm_minor: u32,
/// cgroup association (if any).
pub cgroup: Option<Arc<CgroupCss>>,
/// Current state.
pub state: MigPartitionState,
}
/// MIG profile — predefined partition configurations.
/// Profile names follow the NVIDIA convention `<N>g.<M>gb`:
/// N compute slices, M GB of memory.
pub enum MigProfile {
/// 1g.10gb — 1/7 of GPU (1 SM group, 10 GB)
Profile1g10gb,
/// 1g.20gb — 1/7 compute, 2/7 memory
Profile1g20gb,
/// 2g.20gb — 2/7 of GPU (2 SM groups, 20 GB)
Profile2g20gb,
/// 3g.40gb — 3/7 of GPU
Profile3g40gb,
/// 4g.40gb — 4/7 of GPU
Profile4g40gb,
/// 7g.80gb — full GPU (all 7 SM groups)
Profile7g80gb,
/// Custom profile (for future GPUs with different granularity)
Custom { compute_units: u32, memory_gb: u32 },
}
pub enum MigPartitionState {
/// Partition created but not yet activated.
Created,
/// Partition active and available for workloads.
Active,
/// Partition being destroyed (draining workloads).
Draining,
/// Partition destroyed, resources reclaimed.
Destroyed,
}
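The `<N>g.<M>gb` naming convention can be parsed mechanically; a small sketch (hypothetical helper, not part of the KABI):

```rust
/// Parses an NVIDIA-convention MIG profile name "<N>g.<M>gb" into
/// (compute slices, memory GB). Returns None on malformed input.
pub fn parse_mig_profile(name: &str) -> Option<(u32, u32)> {
    let (compute, mem) = name.split_once('.')?;
    let slices: u32 = compute.strip_suffix('g')?.parse().ok()?;
    let gb: u32 = mem.strip_suffix("gb")?.parse().ok()?;
    Some((slices, gb))
}
```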
22.5.8.1 MIG Operations Trait¶
GPU drivers that support MIG partitioning implement the AccelMigOps trait. This trait
extends AccelDevice and is discovered by the kernel at driver registration time. Drivers
that do not support MIG simply do not implement this trait — the kernel never attempts MIG
operations on non-MIG-capable hardware.
/// Trait for GPU drivers that support MIG partitioning.
pub trait AccelMigOps: AccelDevice {
/// Query whether MIG is supported and currently enabled on this device.
fn mig_mode(&self) -> MigMode;
/// Enable or disable MIG mode. Requires no active compute contexts.
/// Enabling MIG resets the GPU and reconfigures the memory controller.
fn set_mig_mode(&self, mode: MigMode) -> Result<(), AccelError>;
/// List available MIG profiles for this GPU.
fn available_profiles(&self) -> ArrayVec<MigProfile, 16>;
/// Create a GPU instance with the given profile.
fn create_gpu_instance(&self, profile: MigProfile) -> Result<u32, AccelError>;
/// Create a compute instance within a GPU instance.
fn create_compute_instance(&self, gi_id: u32, profile: MigProfile) -> Result<u32, AccelError>;
/// Destroy a compute instance.
fn destroy_compute_instance(&self, gi_id: u32, ci_id: u32) -> Result<(), AccelError>;
/// Destroy a GPU instance (all compute instances must be destroyed first).
fn destroy_gpu_instance(&self, gi_id: u32) -> Result<(), AccelError>;
/// List current partitions.
fn list_partitions(&self) -> ArrayVec<MigPartition, 8>;
}
pub enum MigMode {
Disabled,
Enabled,
}
22.5.8.2 MIG and Kubernetes Integration¶
MIG partitions integrate with the container and orchestration stack through existing UmkaOS primitives — no MIG-specific kernel subsystems are needed:
- Each MIG partition appears as a separate DRM device node (`/dev/dri/renderD128`, etc.) and a separate device registry entry. The Kubernetes device plugin discovers these entries and advertises them as `nvidia.com/mig-1g.10gb` (or similar) resources.
- Container runtime binds the specific DRM device node and the associated `/dev/nvidia*` character devices into the container's mount namespace.
- UmkaOS's device cgroup controller (Section 17.2) restricts container access to its assigned MIG partition. A container granted `mig-1g.10gb` cannot access any other partition's device node — the cgroup enforcement is independent of the container runtime's bind mount configuration.
- MIG partitions can be combined with VFIO passthrough: each partition is assignable to a separate VM, with full IOMMU-enforced isolation between VMs sharing the same physical GPU.
22.5.9 GPU Time-Slicing¶
For GPUs without MIG support (consumer GPUs, older data center GPUs, non-NVIDIA hardware), time-slicing provides multi-tenant GPU sharing. Time-slicing multiplexes compute contexts on a single GPU by rapidly switching between them, analogous to CPU time-sharing. Unlike MIG, time-slicing does not provide memory isolation — all contexts share the same memory controller and can potentially interfere with each other's cache and bandwidth.
/// GPU time-slice scheduler — multiplexes compute contexts on a single GPU.
///
/// **Relationship to `AccelScheduler`**: `AccelScheduler`
/// ([Section 22.2](#accelerator-scheduler)) is the generic, device-agnostic scheduler
/// framework that manages submission queuing, CBS bandwidth servers, cgroup
/// accounting, and watchdog timers for ALL accelerator types (GPU, NPU, DSP,
/// FPGA). `GpuTimeSliceScheduler` is the GPU-specific scheduling **policy**
/// that plugs into `AccelScheduler` — it implements the time-slicing logic
/// (quantum management, context switching, preemption) that is specific to
/// GPU hardware. The `AccelScheduler` calls into `GpuTimeSliceScheduler`
/// via the `pick_next` / `preempt_current` dispatch points when the device
/// type is `ComputeResourceType::GpuCu`. For non-GPU accelerators, the
/// `AccelScheduler` uses the default FIFO or CBS policy directly.
pub struct GpuTimeSliceScheduler {
/// GPU device this scheduler manages.
pub device: Arc<dyn AccelDevice>,
/// Scheduling quantum in microseconds (default: 1000 = 1ms).
pub quantum_us: u32,
/// Runqueue of compute contexts waiting for GPU time.
/// Bounded: MAX_GPU_CONTEXTS = 256 (one per open DRM fd with active submissions).
/// ArrayVec avoids heap allocation under the SpinLock on the scheduler tick path.
pub runqueue: SpinLock<ArrayVec<Arc<GpuContext>, MAX_GPU_CONTEXTS>>,
/// Currently executing context (None if GPU idle).
pub current: RcuCell<Option<Arc<GpuContext>>>,
/// Preemption support level (reuses the unified `AccelPreemptionGranularity`
/// enum defined in [Section 22.3](#accelerator-vtables-and-integration--preemption-granularity-levels)).
pub preemption: AccelPreemptionGranularity,
/// Scheduling policy.
pub policy: GpuSchedulingPolicy,
}
pub enum GpuSchedulingPolicy {
/// Round-robin: equal quantum for all contexts.
RoundRobin,
/// Weighted: quantum proportional to cgroup GPU shares.
Weighted,
/// Priority: highest priority context runs until completion or preemption.
Priority,
}
22.5.9.1 GPU Time-Slicing and cgroup Integration¶
The GpuTimeSliceScheduler reads cgroup parameters from the accel controller
(defined earlier in this section) to enforce per-container GPU time budgets:
- `accel.compute.weight` determines weighted time allocation. A container with `weight=200` gets twice the GPU time of a container with `weight=100` when both are contending for the same GPU.
- `accel.compute.max` limits maximum GPU time percentage (0-100%). A container with `max=50%` cannot exceed 50% of GPU time even when no other containers are active.
- Oversubscription: when total demand exceeds GPU capacity, contexts are time-sliced proportional to their weights. The scheduler enforces fairness over a sliding window of 10 quanta (default: 10ms).
- Undersubscription: idle GPU time is redistributed to active contexts (work-conserving). A container with `weight=100` gets 100% of the GPU if no other container is using it.
- Context switch overhead varies by hardware:
| Vendor / Generation | Preemption Level | Context Switch Latency |
|---|---|---|
| NVIDIA Pascal (GP100+) | Thread-level | ~50-100 us |
| NVIDIA Volta+ (V100, A100, H100) | Compute preemption | ~10 us |
| AMD RDNA 3+ (RX 7000) | Wave-level preemption | ~20 us |
| Intel Arc (Alchemist+) | EU-level preemption | ~30 us |
These overheads are measured by the driver at initialization and reported to the
`GpuTimeSliceScheduler` in `AccelDeviceInfo`, alongside the `AccelPreemptionGranularity`
field. The scheduler uses them to set the minimum scheduling quantum — a quantum shorter
than the context switch latency would waste more time switching than computing.
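Putting the weighted policy and the latency floor together (sketch; the 20x floor factor, which caps switch overhead at roughly 5%, is an assumed illustration, not a spec constant):

```rust
/// Per-context quantum (microseconds) under the Weighted policy:
/// proportional to cgroup weight, but never below a floor derived
/// from the measured context switch latency. The 20x floor factor
/// is an assumption for illustration.
pub fn weighted_quantum_us(
    base_quantum_us: u32,   // e.g. 1000 (1ms default)
    weight: u32,
    total_weight: u32,
    switch_latency_us: u32, // measured by the driver at init
) -> u32 {
    let proportional =
        (base_quantum_us as u64 * weight as u64 / total_weight as u64) as u32;
    proportional.max(switch_latency_us * 20)
}
```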
22.5.9.2 MIG vs Time-Slicing Trade-offs¶
| Feature | MIG | Time-Slicing |
|---|---|---|
| Isolation | Hardware (memory + compute) | Software (temporal only) |
| Memory isolation | Yes (dedicated memory controller) | No (shared memory, relies on page tables) |
| Performance interference | None | Possible (cache thrashing, memory BW contention) |
| Minimum GPU | NVIDIA A100/H100/B200 | Any GPU with preemption |
| Granularity | Fixed profiles (1/7, 2/7, ...) | Arbitrary time percentage |
| Reconfiguration | Requires GPU reset (`set_mig_mode`) | Dynamic (add/remove contexts at runtime) |
| Kubernetes support | nvidia-device-plugin (per-partition resources) | time-slicing-config ConfigMap |
| VFIO passthrough | Yes (per-partition IOMMU context) | No (single device, cannot split for VMs) |
| Fault isolation | Partition fault does not affect others | Hung context blocks all until HROT fires |
For production multi-tenant deployments, MIG is preferred when available. Time-slicing is the fallback for hardware that does not support MIG, or for development environments where the fixed MIG granularity is too coarse.
22.5.9.3 Per-Architecture GPU Ecosystem¶
| Arch | GPU Ecosystem | MIG Support | Time-Slicing |
|---|---|---|---|
| x86-64 | NVIDIA, AMD, Intel | Full (NVIDIA A100+) | Full |
| AArch64 | NVIDIA (Jetson, Grace Hopper), Mali | NVIDIA server GPUs only | Full where supported |
| ARMv7 | Mali, PowerVR | No MIG | Limited (no preemption on most SoC GPUs) |
| RISC-V | Emerging (no major GPU vendor support yet) | No MIG | Future |
| PPC32 | None typical | N/A | N/A |
| PPC64LE | NVIDIA (NVLink) | Full (NVIDIA A100+) | Full |
22.5.9.4 Cross-references¶
- Section 17.2 — GPU cgroup controller for MIG partition and time-slice enforcement
- Section 18.5 — IOMMU isolation per MIG partition
- Section 22.4 — GPU memory management within MIG partitions
- Section 22.1 — AccelDevice trait that MIG partitions implement
- Section 22.8 — compute dispatch to MIG partitions and time-sliced contexts
- Section 11.3 — MIG partitions as Tier 2 isolation boundaries
22.6 In-Kernel Inference Engine¶
22.6.1 Rationale¶
The kernel makes millions of decisions per second: which page to evict, which I/O to schedule next, which task to migrate, whether a health metric is anomalous. Today these decisions use hand-tuned heuristics. Machine learning can do better for pattern-dependent decisions.
22.6.2 Constraints¶
In-kernel inference is not general-purpose ML. It must satisfy:
- Deterministic execution time: Every inference call must complete within a bounded number of cycles. No data-dependent loops, no dynamic allocation. For multi-layer models (`TinyNeuralNet`), inference is preemptible at layer boundaries: the inference loop checks `need_resched()` between layers and yields if a higher-priority task or interrupt is pending. This ensures CPU inference (which may run for ~1-20μs) does not cause scheduling latency spikes on latency-sensitive systems.
- No FPU or SIMD registers: Kernel inference must not use FPU or SIMD registers. The kernel does not save/restore FPU state on entry (doing so requires kernel_fpu_begin/end, which holds a mutex and cannot be used in all interrupt contexts). All arithmetic uses scalar INT8/INT32 only.
  Rationale: The `TinyNeuralNet` architecture is bounded to 4 layers × 64 neurons × 64 neurons = 16K MACs. At ~4 scalar INT8 MACs/cycle on a 4 GHz core, this is ~4K cycles ≈ 1μs — feasible without SIMD. Weights fit in L1 cache (4 × 64 × 64 × 1 byte = 16KB). Architectures without SIMD (RISC-V, ARMv7 without NEON) execute at the same speed.
- Safe fallback: If the model produces nonsensical output, the system falls back to a traditional heuristic. The model is advisory, never authoritative for safety-critical decisions.
- Offline training: Models are trained in userspace (with full floating-point, GPU acceleration, unlimited time). Only the trained, quantized model is loaded into the kernel.
22.6.3 Supported Model Types¶
// umka-core/src/inference/mod.rs (kernel-internal)
pub enum KernelModelType {
/// Decision tree / random forest.
/// Bounded depth (max 32), bounded node count (max 64K).
/// Inference = walk tree from root to leaf, O(depth) comparisons.
/// Best for: classification decisions with <50 features.
DecisionTree,
/// Quantized lookup table.
/// Input is quantized to N bits, output is a table lookup.
/// O(1) inference. Best for: 1-2 dimensional functions.
LookupTable,
/// Quantized linear model.
/// weights: [i16; N], bias: i32, threshold: i32.
/// Inference = dot product + compare. O(N) multiplications.
/// Best for: simple binary classification / regression.
LinearModel,
/// Quantized tiny neural network.
/// Fixed architecture: input → hidden(s) → output.
/// All weights INT8, activations INT8, accumulation INT32.
/// Maximum: 4 layers, 64 neurons per layer.
/// Weights fit in L1 cache (4 × 64 × 64 × 1 byte = 16 KB).
/// Scalar-only arithmetic — no SIMD required or permitted in kernel context.
/// Inference time bounded by architecture constants (~1-20μs).
TinyNeuralNet,
}
22.6.4 Model Loading and Lifecycle¶
Models are trained in userspace and loaded into the kernel via a sysfs interface:
/sys/kernel/umka/inference/models/
page_prefetch/
model.bin # Write: load model binary. Read: model metadata.
active # "1" to enable, "0" to disable, "heuristic" to fallback
accuracy # Read: online accuracy estimate (correct predictions / total)
latency_ns # Read: average inference latency
invocations # Read: total inference calls
fallbacks # Read: times fell back to heuristic
io_scheduler/
model.bin
active
accuracy
latency_ns
...
fma_anomaly/
model.bin
active
...
// umka-core/src/inference/model.rs (kernel-internal)
pub struct KernelModel {
    /// Model name (matches the sysfs directory, e.g. "page_prefetch").
    pub name: &'static str,
    /// Model type determines the inference algorithm.
    pub model_type: KernelModelType,
    /// Model parameters (weights, tree nodes, lookup table).
    /// Allocated as a single contiguous block, fixed at load time.
    pub params: &'static [u8],
    /// Input feature count.
    pub input_features: u32,
    /// Output count (1 for regression, N for N-class classification).
    pub outputs: u32,
    /// Maximum inference latency in nanoseconds (pre-computed from model size).
    pub max_latency_ns: u64,
    /// Whether this model is currently active.
    pub active: AtomicBool,
    /// Whether hardware (NPU/DSP) offload is enabled for this model.
    pub hw_offload_enabled: AtomicBool,
    /// Consecutive hardware-offload timeouts (see infer_hw below).
    pub hw_timeout_count: AtomicU32,
    /// Online accuracy tracking.
    pub stats: ModelStats,
}
pub struct ModelStats {
pub total_invocations: AtomicU64,
pub correct_predictions: AtomicU64, // When ground truth is available
pub total_latency_ns: AtomicU64,
pub fallback_count: AtomicU64,
}
The run_inference method specification:
// umka-core/src/inference/model.rs (kernel-internal)
/// Result of an inference operation.
#[derive(PartialEq, Eq)]
#[repr(u8)]
pub enum InferenceResult {
/// Inference completed successfully.
Success = 0,
/// Output validation failed (saturated sentinel or out-of-range values).
/// Caller should use fallback heuristic.
OutputInvalid = 1,
/// Execution exceeded max_latency_ns (should not happen with
/// load-time validation, but possible on preempted execution).
Timeout = 2,
}
impl KernelModel {
/// Run inference on the given input, writing output to the provided buffer.
///
/// **Parameters:**
/// - `input`: Slice of input features (length must equal `self.input_features`).
/// - `output`: Mutable slice for output values (length must equal `self.outputs`).
/// - `yielded`: Out-parameter; set to `true` if the scheduler yielded
///   during execution (TinyNeuralNet only), `false` otherwise.
///
/// **Returns:**
/// - `InferenceResult::Success` on successful completion.
/// - `InferenceResult::OutputInvalid` if output contains invalid values.
/// - `InferenceResult::Timeout` if execution exceeded `max_latency_ns`.
///
/// **Behavior by model type:**
///
/// - `DecisionTree`: Walk tree from root to leaf using input features as
/// comparison values. O(depth) comparisons. Maximum depth is 32.
/// Does not check `need_resched()` — completes in <200ns.
///
/// - `LookupTable`: Quantize input to N bits, use as table index.
/// O(1) table lookup. Does not check `need_resched()`.
///
/// - `LinearModel`: Compute dot product of input and weights, add bias,
/// compare to threshold. O(N) multiplications where N = input_features.
/// Uses INT8 arithmetic with INT32 accumulator. Does not check
/// `need_resched()` — completes in <500ns for typical N < 256.
///
/// - `TinyNeuralNet`: Forward propagation through layers.
/// - All weights and activations are INT8; accumulation is INT32.
/// - Maximum 4 layers, 64 neurons per layer.
/// - Between each layer, checks `need_resched()` and yields if
/// a higher-priority task is pending. Sets `*yielded = true` if yielded.
/// - Total compute: ~16K MACs for max architecture (4×64×64).
/// - Scalar INT8 arithmetic only (no SIMD): ~1-20μs typical.
/// - Weights (16 KB max) fit in L1 cache on all supported architectures.
///
/// **Output validation:**
/// After inference completes, the output is validated:
/// - For classification: all outputs must be in range [0, num_classes).
/// - For regression: outputs must not be the INT32 saturation sentinels
///   (i32::MIN / i32::MAX, the integer analogue of NaN/Inf).
/// - If validation fails, returns `OutputInvalid` and zeroes the output.
///
/// **Security note:**
/// The model parameters are read-only and validated at load time.
/// Inference cannot modify kernel state beyond updating `stats`.
pub fn run_inference(
&self,
input: &[i32],
output: &mut [i32],
yielded: &mut bool,
) -> InferenceResult {
// Validate input/output lengths
if input.len() != self.input_features as usize
|| output.len() != self.outputs as usize
{
return InferenceResult::OutputInvalid;
}
*yielded = false;
let result = match self.model_type {
KernelModelType::DecisionTree => {
self.run_decision_tree(input, output)
}
KernelModelType::LookupTable => {
self.run_lookup_table(input, output)
}
KernelModelType::LinearModel => {
self.run_linear_model(input, output)
}
KernelModelType::TinyNeuralNet => {
self.run_neural_net(input, output, yielded)
}
};
// Validate output (defense-in-depth)
if result == InferenceResult::Success {
if !self.validate_output(output) {
output.fill(0);
self.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
return InferenceResult::OutputInvalid;
}
}
result
}
/// Validate output values are within expected range.
fn validate_output(&self, output: &[i32]) -> bool {
for &val in output {
// Check for NaN/Inf equivalents in integer representation
if val == i32::MIN || val == i32::MAX {
return false;
}
}
true
}
}
22.6.5 Use Cases¶
Use case 1: Learned Page Prefetching
Replace Linux's simple sequential readahead with a learned prefetcher:
Input features (per page fault):
- Faulting virtual address (quantized to page range)
- Previous N fault addresses (pattern detector)
- Process ID (different processes have different patterns)
- Time since last fault
- Memory region type (heap, stack, mmap, file-backed)
Output:
- Next K pages to prefetch (page offsets relative to current fault)
Model type: TinyNeuralNet (2 hidden layers, 64 neurons each, INT8)
Inference time: ~1-5μs
Benefit: 20-40% reduction in page fault rate for workloads with learnable patterns
Fallback: Standard sequential readahead (Linux default)
Ground truth for online accuracy: track whether prefetched pages are actually accessed within a time window. If accuracy drops below threshold, fall back to heuristic.
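As an illustration of how such features might be quantized for the INT8 model, here is a hedged sketch (names are hypothetical, not the actual UmkaOS feature encoder): the delta between consecutive fault addresses, in pages, clamped to the INT8 range so the sign encodes direction and the magnitude saturates.

```rust
/// Hypothetical feature encoder: quantize the page-offset delta between
/// consecutive faults into an INT8 feature (sign = direction,
/// magnitude saturates at ±127 pages).
fn quantize_fault_delta(prev_fault_page: u64, fault_page: u64) -> i8 {
    let delta = fault_page as i64 - prev_fault_page as i64;
    delta.clamp(i8::MIN as i64, i8::MAX as i64) as i8
}
```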
Use case 2: I/O Scheduling
Optimize I/O queue ordering based on learned workload patterns:
Input features (per I/O request):
- LBA (sector address, quantized)
- Request size
- Read vs write
- Queue depth
- Device utilization
- Recent I/O pattern (last N requests, summarized)
Output:
- Priority score (determines queue ordering)
Model type: DecisionTree (depth 16, ~4K nodes)
Inference time: ~200ns
Benefit: 5-15% IOPS improvement for mixed workloads
Fallback: mq-deadline heuristic
Use case 3: FMA Anomaly Detection
Detect correlated hardware degradation that threshold rules miss:
Input features (per health telemetry window):
- ECC error rate (current window)
- ECC error rate (historical baseline)
- Temperature delta from baseline
- PCIe correctable error rate
- SMART attribute trends (multiple attributes)
- Device age / power-on hours
Output:
- Anomaly score (0.0 - 1.0 in fixed-point)
- Predicted failure class (memory, storage, bus, thermal)
Model type: DecisionTree (depth 20, ~8K nodes)
Inference time: ~300ns
Benefit: Detect multi-signal degradation patterns that simple thresholds miss
Fallback: Threshold rules (Section 20.1.5)
Use case 4: Accelerator Memory Migration Policy
Decide which pages to migrate between CPU and GPU memory:
Input features (per page):
- Access count on device (recent window)
- Access count on CPU (recent window)
- Time since last access
- Page size
- Current device memory pressure
Output:
- Migrate to device / keep on CPU / evict from device
Model type: LinearModel (fast, simple)
Inference time: ~50ns per page
Benefit: Better migration decisions than fixed LRU
Fallback: LRU eviction
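The LinearModel path for use case 4 amounts to a single dot product against a threshold. A hedged sketch (illustrative names; weight layout follows the `LinearModel` variant: `i16` weights, `i32` bias and threshold):

```rust
/// Hypothetical sketch of a LinearModel migration decision:
/// dot(features, weights) + bias >= threshold → migrate to device.
fn should_migrate(features: &[i32], weights: &[i16], bias: i32, threshold: i32) -> bool {
    // Widen to i64 in this sketch so the accumulator cannot overflow.
    let mut acc: i64 = bias as i64;
    for (f, w) in features.iter().zip(weights.iter()) {
        acc += *f as i64 * *w as i64;
    }
    acc >= threshold as i64
}
```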
22.6.6 Safety Guarantees¶
/// Maximum number of operations per inference batch submission.
/// Batches exceeding this limit are rejected with EINVAL.
/// Rationale: prevents a single tenant from monopolizing the inference
/// queue with an unbounded batch, ensuring fair scheduling across
/// multiple inference contexts (Section 22.3 CBS bandwidth servers).
pub const INFERENCE_MAX_OPS_PER_BATCH: u32 = 65536;
The primary safety mechanism is mandatory structural validation at model load time (Section 22.4.9). The load-time validator statically proves that every accepted model terminates in bounded time before it is ever invoked:
- Decision trees: Verify tree depth <= `max_depth` and that the tree is acyclic (DAG check). A tree of bounded depth with no cycles has a statically known maximum number of comparisons per inference.
- Linear models: Verify the number of input features and outputs matches the declared dimensions. Inference is a single matrix-vector multiply with bounded operations.
- Neural networks: Verify the layer count is fixed (no recurrent connections that could loop), and that the total multiply-accumulate (MAC) operations <= `INFERENCE_MAX_OPS_PER_BATCH`. A feedforward network with bounded layers and bounded dimensions has a statically known operation count.
Any model that cannot be statically proven to terminate in bounded time is rejected at load time and never reaches the inference path. This is the preemptive guarantee.
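A hedged sketch of what the decision-tree structural check could look like (illustrative types, not the UmkaOS validator): requiring child indices to point strictly forward makes cycles impossible by construction, and a bounded-depth walk then gives a static comparison bound.

```rust
const MAX_DEPTH: u32 = 32;
const LEAF: u32 = u32::MAX; // sentinel: no child

/// Illustrative tree-node layout (not the UmkaOS binary format).
struct TreeNode {
    left: u32,  // child node index, or LEAF
    right: u32, // child node index, or LEAF
}

/// Accept the tree only if every child index points strictly forward
/// (which rules out cycles) and the deepest root-to-leaf path is bounded.
fn validate_tree(nodes: &[TreeNode]) -> bool {
    if nodes.is_empty() {
        return false;
    }
    for (i, n) in nodes.iter().enumerate() {
        for &c in &[n.left, n.right] {
            if c != LEAF && (c as usize <= i || c as usize >= nodes.len()) {
                return false; // back-edge or out-of-bounds child
            }
        }
    }
    depth(nodes, 0, 0) <= MAX_DEPTH
}

/// Number of nodes on the deepest path starting at `idx`.
fn depth(nodes: &[TreeNode], idx: u32, d: u32) -> u32 {
    if idx == LEAF {
        return d;
    }
    if d >= MAX_DEPTH {
        return d + 1; // exceeds the bound; caller rejects
    }
    let n = &nodes[idx as usize];
    depth(nodes, n.left, d + 1).max(depth(nodes, n.right, d + 1))
}
```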
The post-hoc cycle check in infer_safe below remains as a defense-in-depth
assertion — if the load-time validator has a bug and admits a model that runs longer
than expected, the cycle check catches it and disables the model. But the cycle check
is not the primary safety mechanism, because it measures cycles after run_inference
completes; it cannot interrupt an infinite loop.
// umka-core/src/inference/safety.rs
/// Every model invocation goes through this wrapper.
/// The primary termination guarantee comes from load-time structural
/// validation (Section 22.4.9). This wrapper provides defense-in-depth:
/// post-hoc cycle checking, output validation, and fallback.
pub fn infer_safe<const MAX_NS: u64>(
model: &KernelModel,
input: &[i32],
output: &mut [i32],
fallback: impl FnOnce(&[i32], &mut [i32]),
) {
// 1. Check model is active
if !model.active.load(Ordering::Acquire) {
fallback(input, output);
model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
return;
}
// 2. Run inference with post-hoc cycle measurement.
// Termination is guaranteed by load-time structural validation
// (bounded tree depth / bounded layer count / bounded MAC ops).
// The cycle check is defense-in-depth only.
//
// For TinyNeuralNet models, run_inference() checks need_resched()
// between layers. If a higher-priority task is pending, it yields
// and resumes after rescheduling. This adds ~5-20ns per layer
// boundary but prevents CPU inference from causing scheduling
// latency spikes. DecisionTree and LinearModel complete in <200ns
// and do not need preemption checks.
//
// Because run_inference() may yield to the scheduler (via
// need_resched()), wall-clock TSC cycles include time spent
// sleeping/waiting — not actual model execution time. On a busy
// system, scheduler latency during a yield could push the
// wall-clock measurement far beyond MAX_NS even though the model
// executed correctly within its budget. To avoid false positives,
// we track whether a yield occurred during inference and skip the
// post-hoc cycle check in that case. The load-time structural
// validator (Section 22.4.9) is the primary termination guarantee;
// the cycle check is defense-in-depth that is only meaningful
// when the model ran without interruption.
let mut yielded = false;
let start = arch::current::cpu::read_cycle_counter();
let result = model.run_inference(input, output, &mut yielded);
let elapsed = arch::current::cpu::read_cycle_counter() - start;
// 2a. Check inference result. If run_inference() returned an error
// (e.g., OutputInvalid), it has already zeroed the output buffer.
// We must invoke fallback BEFORE validate_output(), because the
// zeroed buffer would pass validation (all zeros != sentinel) but
// produce silently wrong results.
if result != InferenceResult::Success {
fallback(input, output);
model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
return;
}
// 3. Defense-in-depth: check execution time was within bounds.
// This should never fire if the load-time validator is correct.
// If it does fire, it means the validator has a bug — disable
// the model and fall back to the heuristic.
// Skip the check if a yield occurred during inference, because
// the elapsed TSC cycles include scheduler/sleep time that is
// not attributable to the model. The load-time validator remains
// the primary safety mechanism regardless.
if !yielded && elapsed > tsc_cycles_from_ns(MAX_NS) {
model.active.store(false, Ordering::Release);
fallback(input, output);
log_warning!("inference model {} exceeded time bound (validator bug?), disabled", model.name);
return;
}
// 4. Sanity-check output (model-specific validation)
if !model.validate_output(output) {
fallback(input, output);
model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
return;
}
// 5. Update statistics
model.stats.total_invocations.fetch_add(1, Ordering::Relaxed);
model.stats.total_latency_ns.fetch_add(
tsc_ns_from_cycles(elapsed), Ordering::Relaxed
);
}
Execution time bounding for arbitrary accelerator kernels: Static analysis cannot prove termination for arbitrary GPU kernels (this is equivalent to the halting problem), so the load-time guarantee that covers in-kernel models does not extend to them. Instead, UmkaOS uses a hardware watchdog approach:
- Every accelerator dispatch includes a max_execution_us: u64 timeout (default: 5,000,000 μs = 5 seconds for compute kernels, 16,667 μs = 1/60th second for graphics).
- The driver programs the GPU's hardware watchdog timer (GFX_TIMEOUT on AMD, TDR on NVIDIA via VGPU, or a software timer for accelerators without a hardware watchdog) before submitting the kernel.
- If the kernel exceeds max_execution_us, the GPU engine is reset (per-engine reset, not full GPU reset, where hardware supports it) and the dispatch returns ETIMEDOUT to the submitter. Other engines and other contexts are not affected.
- The dispatch-time analyzer provides a best-effort execution time estimate when possible (e.g., bounded loop counts × estimated per-iteration cost), which is used to set a tighter watchdog timeout. Kernels that cannot be analyzed use the default timeout.
Hardware-accelerated inference timeout (NPU/DSP offload):
The infer_safe wrapper above covers CPU-side inference only. When a TinyNeuralNet
model is offloaded to a hardware accelerator (e.g., an NPU or DSP via the AccelBase
KABI), the post-hoc cycle counter cannot bound execution: once a command buffer is
submitted to the device, the CPU cannot observe or interrupt mid-execution (for
non-preemptible devices, see Section 22.2).
For hardware-offloaded inference, the scheduler enforces a hardware command timeout
using the max_execution_us field in AccelContextLimits (Section 22.1). The
inference subsystem creates a dedicated AccelContext for each model with a tight
timeout derived from KernelModel::max_latency_ns:
// umka-core/src/inference/hw_offload.rs (kernel-internal)
/// Number of consecutive hardware timeouts before the accelerator
/// driver disables the device and reports a permanent fault to FMA.
pub const HW_TIMEOUT_DISABLE_THRESHOLD: u32 = 16;
/// Multiplier applied to `model.max_latency_ns` to derive the hardware
/// command timeout. Allows for device load variance while still bounding
/// worst-case lock-up time to 4x the declared model latency.
pub const HW_TIMEOUT_MULTIPLIER: u64 = 4;
/// Submit an inference request to a hardware accelerator with a hard timeout.
/// If the device does not complete within `model.max_latency_ns * HW_TIMEOUT_MULTIPLIER`,
/// the AccelScheduler cancels the submission and `infer_hw` falls back to the CPU path.
pub fn infer_hw(
model: &KernelModel,
ctx: &AccelContextHandle,
input: &[i32],
output: &mut [i32],
fallback: impl FnOnce(&[i32], &mut [i32]),
) {
    // 0. Skip the device entirely if hardware offload has been disabled
    //    after repeated timeouts; use the CPU fallback path instead.
    if !model.hw_offload_enabled.load(Ordering::Acquire) {
        fallback(input, output);
        model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
        return;
    }
// 1. Submit inference command buffer to hardware.
// AccelContextLimits::max_execution_us is set to
// (model.max_latency_ns * HW_TIMEOUT_MULTIPLIER) / 1000 at context creation.
// The AccelScheduler arms a kernel timer on submission; if the device does not
// complete before the timer fires, it calls preempt_context() or, for
// non-preemptible devices, marks the submission Timeout at the next boundary.
let submit_result = accel_submit_inference(ctx, model, input);
if submit_result.is_err() {
fallback(input, output);
model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
return;
}
// 2. Wait for completion with the same timeout bound.
// poll_completion() returns Timeout if max_execution_us elapsed.
match accel_poll_completion(ctx, model.max_latency_ns * HW_TIMEOUT_MULTIPLIER) {
AccelCompletionStatus::Success => { /* copy results from device buffer to output */ }
AccelCompletionStatus::Timeout | AccelCompletionStatus::Error => {
// Device did not complete in time — fall back to CPU heuristic.
// Disable hardware offload for this model after repeated timeouts.
// Use fetch_add's return value so two racing callers cannot both
// observe a stale pre-increment count.
let timeouts = model.hw_timeout_count.fetch_add(1, Ordering::Relaxed) + 1;
if timeouts > HW_TIMEOUT_DISABLE_THRESHOLD {
model.hw_offload_enabled.store(false, Ordering::Release);
log_warning!("inference model {} exceeded hw timeout {} times, disabling hw offload",
model.name, HW_TIMEOUT_DISABLE_THRESHOLD);
}
fallback(input, output);
model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
}
AccelCompletionStatus::Preempted => {
// Higher-priority work displaced the inference — fall back to CPU.
fallback(input, output);
model.stats.fallback_count.fetch_add(1, Ordering::Relaxed);
}
}
}
The AccelContext for in-kernel inference is created at model load time with:
- max_execution_us = (model.max_latency_ns * HW_TIMEOUT_MULTIPLIER) / 1000
- priority = AccelPriority::Background (inference is advisory; never starves user workloads)
- max_memory_bytes = model parameter size + fixed scratch buffer (no dynamic allocation)
This ensures that a misbehaving model or faulty NPU firmware cannot lock the accelerator indefinitely. The kernel timer in the AccelScheduler provides the hard bound; the CPU fallback path ensures the kernel continues operating correctly even if the hardware accelerator is unresponsive.
22.6.7 Adversarial Robustness¶
Section 22.6 addresses runtime safety (cycle budgets, fallback, output clamping). This section addresses a different threat: adversarial inputs — workload patterns deliberately crafted to exploit learned kernel models.
Threat model — an unprivileged attacker crafts memory access patterns, I/O sequences, or scheduling behavior designed to:

1. Degrade performance: trick the page prefetcher into evicting hot pages, or trick the I/O scheduler into making sub-optimal ordering decisions.
2. Denial of service: cause the model to consistently produce worst-case outputs, degrading system throughput for co-tenants on the same machine.
3. Information leakage: infer information about other processes' behavior by observing how the model's decisions change in response to probing inputs.
Why kernel models are partially resistant — unlike ML models in adversarial ML research (image classifiers, NLP), kernel models operate on aggregate statistics (page fault rate over the last 1000 faults, I/O queue depth histogram), not on raw inputs from a single source. An attacker controls only their own process's behavior, which is one signal among many feeding into the model.
Mitigations:
- Per-process model state isolation: page prefetch and I/O scheduling models maintain per-process (per-cgroup) input features. Attacker process A's access pattern cannot directly influence model decisions for victim process B. The model sees A's features and B's features independently.
- Output clamping (already in Section 22.6): model output is clamped to a safe range. Even the worst possible model output (prefetch the maximally wrong pages, schedule I/O in the worst possible order) produces bounded degradation — the system falls back to LRU/FIFO behavior, which is the same as having no model at all. The adversary's best attack reduces performance to the no-model baseline, not below it.
- Anomaly detection on model inputs: the `infer_safe` wrapper tracks input feature distributions. If input features drift significantly from the training distribution (measured by simple range checks and mean/variance tracking), the model is automatically disabled for that process/cgroup and the heuristic fallback activates. This prevents an attacker from driving the model into an untrained region of the input space.
- Model-decision audit tracing: all model decisions are available via stable tracepoints (Section 20.2). Security-sensitive deployments can monitor for anomalous model behavior (e.g., prefetch hit rate dropping below a threshold for a specific cgroup) and trigger investigation.
- Side-channel hardening: the model's decision is not directly observable by unprivileged processes. An attacker cannot ask "what did the prefetcher decide?" — they can only observe timing (whether a page fault occurred or not). This is equivalent to the existing cache side-channel problem, not a new attack surface. Standard cache partitioning mitigations (CAT, MBA) apply equally to model-influenced cache behavior.
Accepted risk — a sophisticated attacker with co-tenant access can degrade their own performance or (with difficulty) slightly degrade co-tenant prefetch accuracy, but cannot cause worse-than-baseline behavior (output clamping), cannot crash the kernel (cycle watchdog), and cannot corrupt other processes' data (model outputs are advisory — they influence which pages to prefetch, not which pages are accessible). This risk profile is comparable to existing cache pollution attacks, which are accepted in shared environments.
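The input-anomaly mitigation above can be as simple as a per-feature range gate recorded at training time. A minimal sketch (hypothetical names; the mean/variance tracking the text also mentions is omitted for brevity):

```rust
/// Per-feature training-distribution bounds, recorded at model training
/// time and shipped alongside the model (illustrative layout).
struct FeatureBounds {
    lo: i32,
    hi: i32,
}

/// Gate: true iff every input feature lies within its training range.
/// On `false`, the caller disables the model for this cgroup and
/// activates the heuristic fallback.
fn input_in_training_range(input: &[i32], bounds: &[FeatureBounds]) -> bool {
    input
        .iter()
        .zip(bounds.iter())
        .all(|(&v, b)| v >= b.lo && v <= b.hi)
}
```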
22.6.8 Fallback Mode Safety Specification¶
When in-kernel inference encounters errors, timeouts, or hardware failures, the system must transition to a safe fallback mode without compromising system integrity. This section specifies the safety guarantees that MUST be maintained even when fallback is active.
22.6.8.1 Fallback Trigger Conditions¶
Fallback mode is triggered by any of the following conditions:
| Condition | Scope | Automatic Recovery |
|---|---|---|
| Model disabled (active=false) | Per-model | Yes, when admin re-enables |
| Cycle budget exceeded | Per-model | No, requires admin review |
| Hardware offload timeout (NPU/DSP) | Per-model + per-device | Yes, after cooldown period |
| Output validation failure | Per-model, per-invocation | Yes, model auto-disables after threshold |
| Input distribution anomaly | Per-cgroup, per-model | Yes, after distribution normalizes |
| Accelerator device failure | Per-device | No, requires device reset |
| Model binary signature failure | Per-model | No, requires valid signed model |
Critical invariant: Fallback mode NEVER disables safety-critical kernel checks. The inference engine provides advisory decisions only (page prefetch hints, I/O ordering hints). The following safety guarantees remain active unconditionally:
- Memory access validation (page tables, permission checks)
- Capability checks for all resource accesses
- I/O command validation (DMA bounds, register ranges)
- Interrupt masking and priority enforcement
- Scheduler fairness and preemption guarantees
- Watchdog timers for all hardware operations
22.6.8.2 Fallback Scope and Isolation¶
Fallback is NEVER system-wide. The isolation granularity is:
-
Per-model: Each inference model (page_prefetch, io_scheduler, etc.) operates independently. A failure in one model does not affect others.
-
Per-cgroup for input anomaly: If a specific cgroup's input features drift outside the training distribution, fallback activates for that cgroup only. Other cgroups continue using the model normally.
-
Per-device for hardware offload: If NPU A times out, models offloaded to NPU A fall back to CPU inference. NPU B continues operating normally.
// umka-core/src/inference/fallback.rs (kernel-internal)
/// Fallback state for a single model instance.
/// This is tracked independently per (model, cgroup) pair.
pub struct FallbackState {
/// Why fallback is active (None = not in fallback)
pub reason: Option<FallbackReason>,
/// Timestamp when fallback mode was entered (monotonic clock)
pub entered_at: Option<u64>,
/// Number of consecutive fallback invocations
pub consecutive_count: u64,
/// Maximum consecutive fallbacks before escalation
pub escalation_threshold: u64,
}
/// Reasons for entering fallback mode
pub enum FallbackReason {
/// Model explicitly disabled by admin
ModelDisabled,
/// Cycle budget exceeded (validator bug suspected)
CycleBudgetExceeded,
/// Hardware accelerator timeout
HwTimeout { device_id: u32, timeout_us: u64 },
/// Output failed model-specific validation
OutputValidationFailed,
/// Input features outside training distribution
InputDistributionAnomaly {
feature_index: u32,
observed_value: i32,
expected_range: (i32, i32),
},
/// Accelerator device reported error
DeviceError { device_id: u32, error_code: u32 },
}
22.6.8.3 Fallback Duration and Escalation¶
Fallback mode has bounded duration with automatic escalation:
| Duration | Action |
|---|---|
| 0-60 seconds | Automatic recovery: continue using heuristic fallback, periodically retry model |
| 60-300 seconds | Escalation: emit kernel warning event, log to audit trail |
| 300+ seconds | Critical escalation: emit high-priority alert to system management daemon |
| 3600+ seconds (configurable) | Model auto-disables: set active=false, require admin to re-enable |
/// Fallback duration thresholds (configurable via sysfs)
pub const FALLBACK_RETRY_INTERVAL_SECS: u64 = 10; // Retry model every 10s
pub const FALLBACK_WARNING_SECS: u64 = 60; // Log warning after 60s
pub const FALLBACK_ESCALATION_SECS: u64 = 300; // Alert after 5 minutes
pub const FALLBACK_AUTO_DISABLE_SECS: u64 = 3600; // Disable model after 1 hour
Recovery attempts: While in fallback mode, the kernel periodically attempts to
use the model again (every FALLBACK_RETRY_INTERVAL_SECS). If the model succeeds
three consecutive times, fallback mode exits and normal operation resumes. This
handles transient hardware glitches without operator intervention.
Escalation path:
Fallback entered
↓
[Retry every 10s, up to 60s]
↓ (still failing)
Log warning: "Model X in fallback for 60s, reason=Y"
↓
[Continue retries, up to 300s]
↓ (still failing)
Emit event: umka_inference_fallback(model="X", duration_s=300, reason="Y")
↓
[Continue retries, up to 3600s]
↓ (still failing)
Auto-disable model, emit critical event
↓
Require admin to re-enable via sysfs write
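The ladder above maps directly onto the duration constants; a hedged sketch of that mapping (enum and function names are illustrative, constants from the table):

```rust
const FALLBACK_WARNING_SECS: u64 = 60;
const FALLBACK_ESCALATION_SECS: u64 = 300;
const FALLBACK_AUTO_DISABLE_SECS: u64 = 3600;

#[derive(Debug, PartialEq)]
enum EscalationAction {
    RetryQuietly, // < 60s: keep retrying, no noise
    LogWarning,   // 60–300s: kernel warning + audit trail
    EmitCritical, // 300–3600s: alert system management daemon
    AutoDisable,  // >= 3600s: require admin re-enable
}

/// Map time-in-fallback to the escalation action from the table above.
fn escalation_action(fallback_secs: u64) -> EscalationAction {
    if fallback_secs >= FALLBACK_AUTO_DISABLE_SECS {
        EscalationAction::AutoDisable
    } else if fallback_secs >= FALLBACK_ESCALATION_SECS {
        EscalationAction::EmitCritical
    } else if fallback_secs >= FALLBACK_WARNING_SECS {
        EscalationAction::LogWarning
    } else {
        EscalationAction::RetryQuietly
    }
}
```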
22.6.8.4 Admin Intervention Requirements¶
Certain fallback conditions require explicit admin action to exit fallback mode:
Requires admin intervention:
- Model disabled due to cycle budget exceeded (indicates validator bug)
- Accelerator device failure (requires device reset or replacement)
- Model binary signature verification failure

Automatic recovery allowed:
- Hardware timeout (transient NPU load)
- Input distribution anomaly (workload shifted temporarily)
- Output validation failure (if under threshold)
The sysfs interface exposes recovery control:
/sys/kernel/umka/inference/models/page_prefetch/
fallback_reason # Read: current fallback reason, empty if not in fallback
fallback_duration_ms # Read: milliseconds since fallback entered
fallback_auto_recover # Write: "1" to attempt immediate recovery, "0" to stay in fallback
require_admin_reset # Read: "1" if admin action required to exit fallback
To exit a fallback state requiring admin intervention:
# Check current state
cat /sys/kernel/umka/inference/models/page_prefetch/fallback_reason
# Output: "cycle_budget_exceeded"
# This condition requires admin review
cat /sys/kernel/umka/inference/models/page_prefetch/require_admin_reset
# Output: "1"
# After reviewing logs and validating the model is correct, admin resets:
echo 1 > /sys/kernel/umka/inference/models/page_prefetch/active
22.6.8.5 Audit Logging¶
All fallback state transitions are logged to the kernel audit subsystem:
/// Audit event emitted on fallback state changes
pub struct FallbackAuditEvent {
/// Timestamp (monotonic nanoseconds since boot)
pub timestamp_ns: u64,
/// Model identifier (e.g., "page_prefetch")
pub model_name: &'static str,
/// Cgroup ID (0 if fallback is model-wide)
pub cgroup_id: u64,
/// Device ID (0 if not device-related)
pub device_id: u32,
/// Event type
pub event_type: FallbackEventType,
/// Reason for the event
pub reason: FallbackReason,
}
pub enum FallbackEventType {
/// Entered fallback mode
Entered,
/// Automatic recovery attempt (retrying model)
RecoveryAttempt,
/// Successfully recovered, exiting fallback
Recovered,
/// Escalation: warning logged
EscalationWarning,
/// Escalation: critical alert emitted
EscalationCritical,
/// Model auto-disabled due to prolonged fallback
AutoDisabled,
/// Admin re-enabled model after manual review
AdminReset,
}
Audit log entries (visible via tracefs and forwarded to system audit daemon):
[12345.678901] UmkaOS-INFERENCE: model=page_prefetch event=entered reason=hw_timeout device=4 timeout_us=2000
[12355.678901] UmkaOS-INFERENCE: model=page_prefetch event=recovery_attempt
[12355.679001] UmkaOS-INFERENCE: model=page_prefetch event=recovered
[12405.678901] UmkaOS-INFERENCE: model=io_scheduler event=entered reason=input_anomaly cgroup=1234 feature=2
[12465.678901] UmkaOS-INFERENCE: model=io_scheduler event=escalation_warning duration_s=60
Audit retention: Fallback audit events are retained for a minimum of 30 days (configurable) to support post-incident analysis. Security-sensitive deployments MAY configure longer retention or external log forwarding.
22.6.8.6 Fallback Heuristic Specification¶
When fallback is active, the system MUST use a well-defined heuristic that provides correct (if suboptimal) behavior:
| Model | Fallback Heuristic |
|---|---|
| Page prefetch | Sequential readahead: next 4 pages from current fault address |
| I/O scheduling | mq-deadline: order by LBA, 50ms deadline for reads, 500ms for writes |
| FMA anomaly | Static thresholds: trigger alert if any single metric exceeds hard limit |
| Memory migration | Access-count based: migrate if device_access_count > cpu_access_count × 2 |
| Power budget | Equal distribution: allocate total_power / num_domains to each domain |
Critical safety property: Fallback heuristics MUST NOT bypass any safety checks. For example, the sequential readahead fallback still validates that:
- The target pages fall within the process's virtual address space
- The underlying physical pages are accessible (not protected, not corrupt)
- The readahead does not exceed the process's memory limits
The fallback heuristic is simply a decision algorithm for which pages to prefetch or which I/O to prioritize — it does NOT bypass the memory management or I/O subsystem's safety validation layers.
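The sequential-readahead fallback from the table is trivially specifiable; a sketch (illustrative, deliberately excluding the address-space and limit checks noted above, which live in the memory-management layer):

```rust
/// Fallback prefetch decision: the next 4 page offsets after the fault.
/// The caller still runs these through normal VMA/limit validation —
/// this function only picks candidates, it bypasses no safety checks.
fn sequential_readahead(fault_page: u64) -> [u64; 4] {
    [fault_page + 1, fault_page + 2, fault_page + 3, fault_page + 4]
}
```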
22.6.9 Model Binary Format¶
In-kernel models use a simple binary format for loading:
Header (4790 bytes — larger than original 256 bytes due to ML-DSA signature):
magic: u32 = 0x49534C45 (ASCII "ISLE")    // 4 bytes
version: u32 = 1 // 4 bytes
model_type: u32 (KernelModelType discrim.) // 4 bytes
input_features: u32 // 4 bytes
outputs: u32 // 4 bytes
param_size: u64 (bytes of parameter data) // 8 bytes
max_latency_ns: u64 (worst-case inference ns) // 8 bytes
sha256_hash: [u8; 32] (SHA-256 over entire model file: header fields above
[magic through max_latency_ns] concatenated with parameter
data — integrity check covering both structure and weights)
// 32 bytes
ed25519_sig: [u8; 64] (Ed25519 signature over entire model — authentication)
// 64 bytes
mldsa_sig_len: u16 (actual ML-DSA sig length; ML-DSA-65 = 3309, ML-DSA-87 = 4627)
// 2 bytes
mldsa_sig: [u8; 4627] (ML-DSA signature — post-quantum authentication;
sized for largest variant ML-DSA-87; actual length in mldsa_sig_len)
// 4627 bytes
_reserved: [u8; 29] (must be zero) // 29 bytes
// ────────────
// Total: 4790
Parameter data: [u8; param_size]
Model binaries are verified using the same hybrid signature scheme as kernel modules
(Section 9.3). The SHA-256 hash covers the entire model file (header + parameters),
providing integrity over both the model structure and weights — an attacker cannot
modify the header to reinterpret parameter data without invalidating the hash. The
Ed25519 + ML-DSA signatures over the entire model provide authentication. Both
signatures must verify against the kernel's
built-in model signing public keys. Unsigned or incorrectly signed models are rejected
unless the system is booted with accel.allow_unsigned=1 (disabled by default, requires
CAP_ACCEL_ADMIN). This prevents an attacker from loading arbitrary model binaries
even if they can construct one with a matching SHA-256 hash.
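For concreteness, the layout above can be expressed as a packed `#[repr(C)]` Rust struct (an illustrative sketch following the byte layout, not the canonical kernel definition; `packed` is required because `param_size` sits at the unaligned offset 20):

```rust
/// Sketch of the 4790-byte model header. Field order and sizes follow the
/// layout documented above.
#[repr(C, packed)]
pub struct ModelHeader {
    pub magic: u32,              // 0x49534C45
    pub version: u32,            // 1
    pub model_type: u32,         // KernelModelType discriminant
    pub input_features: u32,
    pub outputs: u32,
    pub param_size: u64,         // bytes of parameter data
    pub max_latency_ns: u64,     // worst-case inference ns
    pub sha256_hash: [u8; 32],   // over header fields + parameter data
    pub ed25519_sig: [u8; 64],
    pub mldsa_sig_len: u16,      // 3309 (ML-DSA-65) or 4627 (ML-DSA-87)
    pub mldsa_sig: [u8; 4627],   // sized for the largest variant
    pub _reserved: [u8; 29],     // must be zero
}

// 5*4 + 2*8 + 32 + 64 + 2 + 4627 + 29 = 4790 bytes, exactly as documented.
const _: () = assert!(core::mem::size_of::<ModelHeader>() == 4790);
```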
Load-time validation:
- Verify magic bytes (`0x49534C45`).
- Check `param_size` against maximum allowed model size (configurable, default 1 MB).
- Verify SHA-256 hash of entire model file (header fields `magic` through `max_latency_ns` concatenated with parameter data) matches `sha256_hash`.
- Signature verification — verify both Ed25519 and ML-DSA signatures over the entire model file (header + parameters) against the kernel's built-in model signing public keys. The ML-DSA signature uses the length indicated by `mldsa_sig_len`. Reject if either signature fails (unless `accel.allow_unsigned=1`).
- Structural termination proof — the validator statically proves bounded execution based on model type:
  - Decision trees: Verify tree depth <= `max_depth` and that the tree is acyclic (DAG check via topological sort). Reject if any cycle is found or depth exceeds the configured maximum.
  - Linear models: Verify input/output dimensions match the declared `input_features` and `outputs` fields. A single matrix-vector multiply is inherently bounded.
  - Neural networks: Verify the layer count is fixed (no recurrent connections) and that total multiply-accumulate operations <= `INFERENCE_MAX_OPS_PER_BATCH` (computed from layer dimensions). Reject any model with recurrent or self-referencing layers.

  Any model that cannot be statically proven to terminate in bounded time is rejected. This is the primary safety mechanism — see Section 22.6.
- Compute worst-case latency from the proven model structure (tree depth, layer count, MAC operations) and reject if it exceeds the target subsystem's latency budget.
Atomic replacement: Models are replaced via double-buffer. The new model is loaded into a shadow slot while the current model continues serving. Once the new model is fully loaded and validated, an atomic pointer swap activates it. The old model is freed after a grace period (RCU-style) to ensure no in-flight inference uses stale data.
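A minimal userspace sketch of the double-buffer swap, using `AtomicPtr` to stand in for the kernel's RCU-protected model pointer (names illustrative; the grace-period reclamation is only indicated by a comment here):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

pub struct Model {
    pub params: Vec<i16>,
}

/// One model slot: readers load the pointer, writers swap it atomically.
pub struct ModelSlot {
    active: AtomicPtr<Model>,
}

impl ModelSlot {
    pub fn new(initial: Model) -> Self {
        Self { active: AtomicPtr::new(Box::into_raw(Box::new(initial))) }
    }

    /// Validate-then-swap: the fully loaded shadow model becomes visible in
    /// one atomic store. Returns the old pointer, which in the kernel may
    /// only be freed after an RCU grace period (no in-flight inference can
    /// still hold it).
    pub fn replace(&self, validated: Model) -> *mut Model {
        let fresh = Box::into_raw(Box::new(validated));
        self.active.swap(fresh, Ordering::AcqRel)
    }

    /// Reader path: load the current model and use it.
    pub fn infer_first_param(&self) -> i16 {
        let m = unsafe { &*self.active.load(Ordering::Acquire) };
        m.params[0]
    }
}
```

Readers never block on the swap; at worst an in-flight inference completes against the old weights, which is exactly the consistency model the RCU-style grace period provides.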
22.6.10 Model Drift Detection and Retraining Pipeline¶
Models trained on one workload profile may degrade when hardware changes (new storage devices with different latency characteristics), workload shifts (database server repurposed as a build server), or system configuration changes (memory added, CPUs hotplugged). This section addresses how UmkaOS detects and responds to model drift.
Online accuracy tracking — every model maintains a correct_predictions counter
(Section 22.6 ModelStats). For the page prefetcher, a "correct prediction" means
the prefetched page was actually accessed within a configurable window (~100ms). For the
I/O scheduler, it means the predicted optimal ordering actually reduced tail latency.
The kernel continuously computes a rolling accuracy rate (correct predictions divided by total predictions over a sliding window).
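One way to track this with integer-only arithmetic is an exponentially weighted estimate in per-mille (a sketch; `WINDOW` and the struct are illustrative, not the actual `ModelStats` implementation):

```rust
/// Illustrative integer-only rolling accuracy estimator.
/// acc' = acc + (sample - acc) / WINDOW, with sample in {0, 1000},
/// so no floating point and no history buffer are needed.
const WINDOW: i32 = 64;

pub struct RollingAccuracy {
    /// Accuracy estimate in per-mille (0..=1000).
    pub permille: i32,
}

impl RollingAccuracy {
    pub fn new() -> Self {
        Self { permille: 1000 } // start optimistic; decays on misses
    }

    pub fn record(&mut self, correct: bool) {
        let sample = if correct { 1000 } else { 0 };
        self.permille += (sample - self.permille) / WINDOW;
    }
}
```

With integer division the estimate settles within ~6% of the true rate, which is ample precision for a 0.60 disable threshold.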
Drift detection thresholds — the model is automatically disabled (falling back to heuristic) when accuracy drops below a configurable threshold:
/sys/kernel/umka/inference/models/page_prefetch/
accuracy # Current rolling accuracy (read-only)
accuracy_threshold # Minimum acceptable accuracy (read/write, default: 0.60)
drift_detected # "1" if accuracy < threshold (read-only)
auto_disable # "1" to auto-disable on drift (read/write, default: "1")
When drift_detected transitions to 1:
1. Model is disabled, heuristic fallback activates immediately
2. A kernel event is emitted: umka_inference_drift(model="page_prefetch", accuracy=0.52)
3. The event is visible via stable tracepoints (Section 20.2) and sysfs
Retraining pipeline — model retraining is deliberately out-of-kernel. The kernel collects training data; userspace trains models; the kernel loads the result:
┌───────────────┐   trace data   ┌───────────────┐   model.bin  ┌───────────────┐
│ UmkaOS Kernel │ ─────────────→ │   Userspace   │ ───────────→ │ UmkaOS Kernel │
│  (inference)  │ sysfs/tracefs  │    Trainer    │  sysfs write │  (inference)  │
│               │                │ (umka-mltool) │              │               │
│ Emits:        │                │ - Reads trace │              │ Atomic swap:  │
│ - features    │                │ - Trains model│              │  new model    │
│ - outcomes    │                │ - Quantizes   │              │  replaces old │
│ - accuracy    │                │ - Validates   │              │               │
└───────────────┘                └───────────────┘              └───────────────┘
The umka-mltool userspace utility (shipped with UmkaOS) automates the pipeline:
1. Reads training features from tracefs ring buffer
2. Trains a decision tree or lookup table using the collected features + outcomes
3. Quantizes to int8/int16
4. Validates against a holdout set
5. Writes the new model to sysfs (atomic replacement)
This can run as a cron job, a systemd timer, or triggered by the drift detection event. The kernel never trains a model itself — training requires floating-point, unbounded memory, and access to training libraries, all of which belong in userspace.
Why not online learning in-kernel? — Online learning (updating model weights incrementally as new data arrives) would eliminate the retraining round-trip. However:
- Online learning requires floating-point arithmetic (kernel uses integer-only inference)
- Convergence guarantees are weaker (adversarial inputs could steer the model — see Section 22.6)
- Deterministic execution is impossible to guarantee with continuous weight updates
- The kernel's cycle budget (~500-5000 cycles per inference) leaves no room for gradient computation
The offline-train / online-infer split is deliberate: the kernel is a fast, dumb inference engine; intelligence lives in userspace where it can be debugged, validated, and rolled back.
22.6.11 Tier 2 Inference Services¶
Section 22.6 defines in-kernel inference: tiny integer-only models running in the kernel's cycle budget (~500-5000 cycles) for per-decision hot paths. But many kernel optimization decisions don't need per-I/O latency — they're slow-path, strategic decisions made every few seconds or minutes. These can benefit from much more powerful models running in Tier 2 userspace drivers.
The architectural fit — Tier 2 drivers (Section 11.2) run as isolated userspace processes communicating with umka-core via ring buffers. They can:
- Use floating-point and SIMD (no kernel FP restrictions)
- Allocate unlimited heap memory
- Link against full ML frameworks (ONNX Runtime, XGBoost, scikit-learn, PyTorch)
- Access GPU/NPU accelerators via umka-accel (Section 22.1)
- Crash without affecting the kernel (full process isolation)
This makes Tier 2 the natural home for advanced AI capabilities that exceed what the in-kernel inference engine can provide.
Tier 2 inference service model:
┌─────────────────────────────────────────────────────────┐
│ UmkaOS Core (Ring 0) │
│ │
│  In-kernel inference (22.6):    Tier 2 IPC client:      │
│ - Decision trees (fast) - Ring buffer query → │
│ - Lookup tables (fast) - ← Ring buffer reply │
│ - ~500-5000 cycles - ~50-200 μs round-trip│
│ - Hot path (per-I/O) - Warm path (periodic) │
│ │
└──────────────────────┬──────────────────────────────────┘
│ Ring buffer IPC
│ (shared memory, zero-copy)
┌──────────────────────▼──────────────────────────────────┐
│ Tier 2 Inference Service (Userspace process) │
│ │
│ - Full FP, GPU access, unlimited memory │
│ - GBM / Random Forest / small transformer models │
│ - Online learning (incremental weight updates) │
│ - Training data ingest from kernel ring buffer │
│ - Model export back to in-kernel engine │
│ │
└─────────────────────────────────────────────────────────┘
Use cases for Tier 2 inference services:
| Decision | Frequency | In-kernel (Section 22.6) | Tier 2 service |
|---|---|---|---|
| Page prefetch (which pages next?) | Per fault (~μs) | Decision tree | N/A — too latency-sensitive |
| I/O scheduling (reorder queue) | Per I/O (~μs) | Lookup table | N/A — too latency-sensitive |
| NUMA rebalancing (move pages?) | Every ~10s | Too complex | GBM model: predict migration benefit from memory access histograms |
| Compression algorithm selection | Per cgroup, every ~30s | N/A | Random forest: select LZ4/Zstd/none based on cgroup's page entropy distribution |
| Swap tier selection (Section 5.1) | Every ~5s | N/A | Regression model: predict optimal local-vs-remote swap ratio from RDMA latency measurements |
| Anomaly detection (FMA, Section 20.1) | Every ~1s | N/A | Autoencoder or isolation forest: detect anomalous device behavior patterns |
| Power budget optimization (Section 7.7) | Every ~1s | N/A | RL agent: learn optimal RAPL power limits given current workload mix |
| Driver crash prediction | Every ~5s | N/A | Time-series model: predict imminent driver failure from error rate trends |
Online learning — safely in Tier 2:
The in-kernel inference engine cannot do online learning (Section 22.6 explains why: no floating-point, no convergence guarantees, adversarial risk). But a Tier 2 service can, because:
- Full FP available: Tier 2 runs in userspace with standard `libm`, SSE/AVX, GPU
- Crash-safe: if the online learning algorithm diverges or crashes, the Tier 2 process restarts. The kernel falls back to its in-kernel heuristic. No kernel impact.
- Adversarial resistance: the Tier 2 service can implement proper input validation, outlier detection, and gradient clipping — techniques too expensive for in-kernel use but standard practice in ML engineering
- Auditable: the Tier 2 service's model weights, training data, and decisions can be logged, inspected, and rolled back using standard userspace debugging tools
The online learning loop:
1. Kernel emits training features to Tier 2 service via ring buffer
(page fault patterns, I/O latencies, scheduling decisions + outcomes)
2. Tier 2 service updates model incrementally (mini-batch SGD, online RF, etc.)
3. Periodically, Tier 2 service quantizes a snapshot of its model to int8/int16
4. Tier 2 service writes the quantized model to the kernel via sysfs
   (atomic model replacement — Section 22.6.9)
5. In-kernel inference engine picks up the new model instantly
6. Kernel continues fast, deterministic, integer-only inference with fresh weights
This gives the system continuous adaptation — the in-kernel model is refreshed every few minutes with weights learned from the actual running workload — without any of the risks of in-kernel online learning. The Tier 2 service absorbs all the complexity (floating-point, convergence, validation), and the kernel sees only a stream of pre-validated, quantized model snapshots.
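Step 3 of the loop (quantizing a snapshot) can be sketched as symmetric per-tensor int8 quantization (an illustrative sketch; real services may use per-channel scales or int16):

```rust
/// Symmetric per-tensor int8 quantization: map the float weights onto
/// [-127, 127] with a single scale factor that is shipped alongside the
/// quantized weights so the in-kernel engine can interpret them.
pub fn quantize_i8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    if max_abs == 0.0 {
        // All-zero tensor: any scale works; pick 1.0 by convention.
        return (vec![0; weights.len()], 1.0);
    }
    let scale = max_abs / 127.0;
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}
```

Dequantization is `w ≈ q as f32 * scale`; the in-kernel engine never needs the float form, since integer dot products against int8 weights plus one final fixed-point rescale suffice.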
Tier 2 inference service KABI:
/// KABI interface for Tier 2 inference services.
/// Registered via the standard Tier 2 driver mechanism.
#[repr(C)]
pub struct InferenceServiceVTable {
pub vtable_size: usize,
/// Primary version discriminant: `KabiVersion::as_u64()`. See [Section 12.2](12-kabi.md#kabi-abi-rules-and-lifecycle) Rule 6.
pub kabi_version: u64,
/// Kernel sends a query to the inference service.
/// Returns immediately — result arrives asynchronously via ring buffer.
///
/// # Safety
/// Caller must ensure `service` is a valid context pointer and
/// `features` points to at least `feature_count` valid i32 values.
pub submit_query: unsafe extern "C" fn(
service: *mut ServiceContext,
query_type: InferenceQueryType,
features: *const i32,
feature_count: u32,
) -> KabiResult,
/// Kernel retrieves the latest model snapshot (int8/int16 quantized).
/// Called periodically to update the in-kernel inference engine.
///
/// # Safety
/// Caller must ensure `service` is a valid context pointer and
/// `buffer` points to at least `buffer_len` bytes of writable memory.
pub get_model_snapshot: unsafe extern "C" fn(
service: *mut ServiceContext,
model_type: KernelModelType,
buffer: *mut u8,
buffer_len: usize,
) -> KabiResult,
/// Kernel reports ground truth (outcome of a previous prediction).
/// Enables online accuracy tracking in the Tier 2 service.
///
/// # Safety
/// Caller must ensure `service` is a valid context pointer.
pub report_outcome: unsafe extern "C" fn(
service: *mut ServiceContext,
query_id: u64,
outcome: i32,
) -> KabiResult,
}
// InferenceServiceVTable: usize + u64 + 3 fn pointers = 40 bytes on 64-bit
// (28 on 32-bit, with 4 bytes padding before `kabi_version`). KABI vtable.
#[cfg(target_pointer_width = "64")]
const _: () = assert!(core::mem::size_of::<InferenceServiceVTable>() == 40);
#[cfg(target_pointer_width = "32")]
const _: () = assert!(core::mem::size_of::<InferenceServiceVTable>() == 28);
#[repr(u32)]
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum InferenceQueryType {
/// Should we rebalance NUMA memory for this cgroup?
NumaRebalance = 0,
/// Which compression algorithm for this cgroup?
CompressionSelect = 1,
/// Optimal swap tier ratio (local vs remote)?
SwapTierSelect = 2,
/// Anomaly score for this device's health metrics?
DeviceAnomaly = 3,
/// Optimal power budget for current workload mix?
PowerBudget = 4,
}
Latency budget — Tier 2 IPC round-trip is ~50-200μs (ring buffer submission + context switch + userspace inference + ring buffer reply). This is acceptable for decisions made every 1-30 seconds. For any decision needed per-I/O or per-fault, the in-kernel inference engine (Section 22.6) remains the only option.
Fallback hierarchy:
Tier 2 inference service available?
├── Yes → Use Tier 2 for slow-path decisions, in-kernel for hot-path
│ Tier 2 periodically refreshes in-kernel model weights
│
└── No → In-kernel inference engine with static model
├── Model loaded? → Use model
└── No model? → Heuristic fallback (LRU, FIFO, static thresholds)
The system is fully functional at every level of the fallback hierarchy. Tier 2 inference services are a performance optimization, not a requirement. A system with no ML models at all behaves identically to a traditional kernel using standard heuristics.
Full framework: The Tier 2 inference services described here are the inference execution layer. The complete closed-loop framework — kernel observation bus, tunable parameter store, per-subsystem parameter catalogs, big-model integration pattern, and security model — is specified in Section 23.1.
Shipped Tier 2 services — UmkaOS ships the following reference Tier 2 inference services (optional, loaded on demand):
| Service | Model type | Purpose |
|---|---|---|
| `umka-ml-numa` | Gradient-boosted trees | NUMA page placement and migration decisions |
| `umka-ml-compress` | Random forest | Per-cgroup compression algorithm selection |
| `umka-ml-anomaly` | Isolation forest | FMA device anomaly detection |
| `umka-ml-power` | Online RL (contextual bandit) | RAPL power budget optimization |
Each is a standalone Tier 2 driver (~500-2000 LOC) that implements the
InferenceServiceVTable. They are independently upgradable, independently crashable
(Tier 2 process restart in ~10ms), and independently optional.
22.7 Accelerator Networking, RDMA, and Linux GPU Compatibility¶
22.7.1 RDMA and Collective Operations¶
22.7.1.1 Problem¶
Distributed ML training on N GPUs across M machines requires:
- All-Reduce: After computing gradients on each GPU, average them across all GPUs. This is the single most important collective operation in distributed training.
- All-Gather: Assemble distributed tensor shards.
- Reduce-Scatter: Reduce and distribute results.
- Point-to-point: Direct GPU-to-GPU transfer across network.
These operations need RDMA (Remote Direct Memory Access) for low latency and high
throughput. Linux provides RDMA through the rdma-core / libibverbs userspace stack,
but the kernel's role is minimal (IB/verbs kernel interface is thin).
22.7.1.2 Design: Kernel-Assisted Collectives¶
The kernel can optimize collectives by understanding the topology:
8-GPU training cluster:
Machine A: GPU 0, GPU 1 (NVLink connected)
Machine B: GPU 2, GPU 3 (NVLink connected)
Machine C: GPU 4, GPU 5 (NVLink connected)
Machine D: GPU 6, GPU 7 (NVLink connected)
All machines connected via 100GbE RDMA
Optimal all-reduce strategy:
1. Intra-machine: GPU 0 ↔ GPU 1 via NVLink (200 GB/s)
2. Inter-machine: GPU 0 ↔ GPU 2 ↔ GPU 4 ↔ GPU 6 via RDMA (12.5 GB/s)
3. Intra-machine: Broadcast result GPU 0 → GPU 1, etc.
The kernel knows the topology (device registry + network fabric).
Userspace NCCL/RCCL libraries currently discover this themselves.
The kernel can provide it as a service.
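A back-of-envelope cost model shows why the hierarchical strategy wins. This sketch uses the standard ring all-reduce estimate (each rank moves 2·(n−1)/n of the data through the slowest link) with the bandwidths assumed in the figure; function names and numbers are illustrative, not a kernel API:

```rust
/// Ring all-reduce time estimate: n ranks, payload in GB, slowest link in
/// GB/s. Each rank sends/receives 2*(n-1)/n of the payload.
pub fn ring_allreduce_secs(n: f64, size_gb: f64, slowest_link_gbs: f64) -> f64 {
    2.0 * (n - 1.0) / n * size_gb / slowest_link_gbs
}

/// Hierarchical strategy from the figure: intra-node ring over NVLink
/// (2 GPUs @ 200 GB/s), inter-node ring over RDMA between one leader GPU
/// per machine (4 leaders @ 12.5 GB/s), then an intra-node broadcast.
pub fn hierarchical_allreduce_secs(size_gb: f64) -> f64 {
    let intra = ring_allreduce_secs(2.0, size_gb, 200.0);
    let inter = ring_allreduce_secs(4.0, size_gb, 12.5);
    let bcast = size_gb / 200.0;
    intra + inter + bcast
}
```

For a 1 GB gradient buffer, a flat 8-GPU ring bottlenecked on RDMA costs ~0.14 s, while the hierarchical schedule costs ~0.13 s — and the gap widens as the NVLink:RDMA bandwidth ratio or the per-node GPU count grows, which is exactly the topology information the kernel can supply.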
22.7.1.3 RDMA KABI Interface¶
/// RDMA device vtable (extends AccelBase for RDMA-capable NICs).
/// KABI vtable — crosses the driver-kernel boundary. RDMA NIC drivers (ConnectX,
/// EFA) run as Tier 1 drivers in separate isolation domains, so calls to this
/// vtable cross the KABI boundary. Version checks via `kabi_version` and bounds
/// safety via `vtable_size` are mandatory per [Section 12.2](12-kabi.md#kabi-abi-rules-and-lifecycle) Rule 6.
#[repr(C)]
pub struct RdmaDeviceVTable {
pub vtable_size: usize,
/// Primary version discriminant: `KabiVersion::as_u64()`. See [Section 12.2](12-kabi.md#kabi-abi-rules-and-lifecycle) Rule 6.
pub kabi_version: u64,
/// Create a protection domain (PD).
pub create_pd: unsafe extern "C" fn(
ctx: *mut c_void,
out_pd: *mut RdmaPdHandle,
) -> IoResultCode,
/// Register a memory region for RDMA.
/// The registered region can be targeted by remote DMA.
pub register_mr: unsafe extern "C" fn(
ctx: *mut c_void,
pd: RdmaPdHandle,
addr: u64, // Virtual or physical address
size: u64,
access: RdmaAccessFlags,
out_mr: *mut RdmaMrHandle,
out_rkey: *mut u32,
) -> IoResultCode,
/// Register device memory (GPU VRAM) for RDMA.
/// Enables GPUDirect RDMA: remote machine can DMA directly to/from GPU.
pub register_device_mr: Option<unsafe extern "C" fn(
ctx: *mut c_void,
pd: RdmaPdHandle,
accel_mem: AccelMemHandle,
offset: u64,
size: u64,
access: RdmaAccessFlags,
out_mr: *mut RdmaMrHandle,
out_rkey: *mut u32,
) -> IoResultCode>,
/// Post a send work request. See `RdmaSendWr` below.
pub post_send: unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
wr: *const RdmaSendWr,
) -> IoResultCode,
/// Post a receive work request. See `RdmaRecvWr` below.
pub post_recv: unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
wr: *const RdmaRecvWr,
) -> IoResultCode,
/// Poll completion queue.
pub poll_cq: unsafe extern "C" fn(
ctx: *mut c_void,
cq: RdmaCqHandle,
max_entries: u32,
out_wc: *mut RdmaWorkCompletion,
out_count: *mut u32,
) -> IoResultCode,
/// Create a queue pair (QP) for RDMA communication.
pub create_qp: unsafe extern "C" fn(
ctx: *mut c_void,
pd: RdmaPdHandle,
qp_type: RdmaQpType,
send_cq: RdmaCqHandle,
recv_cq: RdmaCqHandle,
out_qp: *mut RdmaQpHandle,
) -> IoResultCode,
/// Destroy a queue pair.
pub destroy_qp: unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
) -> IoResultCode,
/// Create a completion queue (CQ).
pub create_cq: unsafe extern "C" fn(
ctx: *mut c_void,
min_entries: u32,
out_cq: *mut RdmaCqHandle,
) -> IoResultCode,
/// Destroy a completion queue.
pub destroy_cq: unsafe extern "C" fn(
ctx: *mut c_void,
cq: RdmaCqHandle,
) -> IoResultCode,
/// Deregister a memory region.
pub deregister_mr: unsafe extern "C" fn(
ctx: *mut c_void,
mr: RdmaMrHandle,
) -> IoResultCode,
/// Destroy a protection domain.
pub destroy_pd: unsafe extern "C" fn(
ctx: *mut c_void,
pd: RdmaPdHandle,
) -> IoResultCode,
// === Connection Management (RC queue pairs) ===
/// Resolve a route to a remote RDMA destination and initiate connection.
/// Transitions the QP from INIT → RTR → RTS (for RC transport).
/// `dest` contains the remote node's LID/GID, QP number, and path info.
pub connect_qp: unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
dest: *const RdmaConnParams,
) -> IoResultCode,
/// Bind a QP to a local port and listen for incoming RC connections.
/// The QP must be in the INIT state. When a remote peer calls
/// `connect_qp` targeting this endpoint, the kernel invokes the
/// `conn_request_callback` registered via `set_conn_callback`.
pub listen_qp: unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
port: u8,
) -> IoResultCode,
/// Accept an incoming RC connection request on a listening QP.
/// Transitions the QP to RTS. `request_id` is the identifier
/// provided by the `conn_request_callback`.
pub accept_conn: unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
request_id: u64,
resp_params: *const RdmaConnParams,
) -> IoResultCode,
/// Gracefully disconnect an RC queue pair. Sends a DREQ to the
/// remote peer and transitions the QP to the SQD (Send Queue Drained)
/// and then ERROR state. Outstanding work requests are flushed with
/// error completions. After disconnect, the QP can be destroyed.
pub disconnect_qp: unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
) -> IoResultCode,
/// Register a callback for incoming connection requests on a listening QP.
/// The callback receives a `request_id` that can be passed to `accept_conn`
/// or ignored (to reject the connection).
pub set_conn_callback: Option<unsafe extern "C" fn(
ctx: *mut c_void,
qp: RdmaQpHandle,
callback: unsafe extern "C" fn(
qp: RdmaQpHandle,
request_id: u64,
remote_params: *const RdmaConnParams,
),
) -> IoResultCode>,
}
/// RDMA send work request. Describes a send, RDMA write, or RDMA read operation.
#[repr(C)]
pub struct RdmaSendWr {
/// Work request ID (returned in completion).
pub wr_id: u64,
    /// Opcode: Send, SendWithImm, RdmaWrite, RdmaWriteImm, RdmaRead.
pub opcode: RdmaSendOpcode,
/// Send flags (signaled, solicited, inline, fence).
pub flags: u32,
/// Scatter-gather list (pointer to array of RdmaSge entries).
pub sg_list: *const RdmaSge,
/// Number of scatter-gather entries.
pub num_sge: u32,
/// Remote key (for RDMA Read/Write operations).
pub rkey: u32,
/// Remote virtual address (for RDMA Read/Write operations).
pub remote_addr: u64,
    /// Immediate data (for SendWithImm / RdmaWriteImm).
pub imm_data: u32,
pub _pad: [u8; 4],
}
// RdmaSendWr: u64 + u32*2 + ptr + u32*2 + u64 + u32 + [u8;4] = 48 bytes. KABI struct.
const_assert!(core::mem::size_of::<RdmaSendWr>() == 48);
/// RDMA receive work request. Posts a receive buffer for incoming messages.
#[repr(C)]
pub struct RdmaRecvWr {
/// Work request ID (returned in completion).
pub wr_id: u64,
/// Scatter-gather list (receive buffer descriptors).
pub sg_list: *const RdmaSge,
/// Number of scatter-gather entries.
pub num_sge: u32,
pub _pad: [u8; 4],
}
// RdmaRecvWr: u64 + ptr + u32 + [u8;4] = 24 bytes. KABI struct.
const_assert!(core::mem::size_of::<RdmaRecvWr>() == 24);
/// Scatter-gather entry for RDMA operations.
#[repr(C)]
pub struct RdmaSge {
/// Virtual address of the buffer.
pub addr: u64,
/// Length of the buffer in bytes.
pub length: u32,
/// Local key for the memory region containing this buffer.
pub lkey: u32,
}
// RdmaSge: u64(8) + u32(4)*2 = 16 bytes. KABI struct.
const_assert!(core::mem::size_of::<RdmaSge>() == 16);
/// RDMA send opcode (modeled on InfiniBand Verbs `ibv_wr_opcode`; numeric
/// values are UmkaOS-specific).
#[repr(u32)]
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum RdmaSendOpcode {
/// Two-sided send: data delivered to remote receive buffer.
Send = 0,
/// Send with immediate 32-bit value (delivered in CQE).
SendWithImm = 1,
/// One-sided RDMA write to remote memory (no receive buffer consumed).
RdmaWrite = 2,
/// RDMA write + immediate value (consumes remote receive buffer for CQE).
RdmaWriteImm = 3,
/// One-sided RDMA read from remote memory.
RdmaRead = 4,
/// Atomic compare-and-swap on remote memory.
AtomicCas = 5,
/// Atomic fetch-and-add on remote memory.
AtomicFetchAdd = 6,
}
/// RDMA access flags for memory region registration.
/// Bitflags matching InfiniBand Verbs `ibv_access_flags`.
#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct RdmaAccessFlags(u32);
impl RdmaAccessFlags {
    /// Allow local write access (local read is always implicitly allowed).
    pub const LOCAL_WRITE: Self = Self(1 << 0);
/// Allow remote write access via RDMA Write.
pub const REMOTE_WRITE: Self = Self(1 << 1);
/// Allow remote read access via RDMA Read.
pub const REMOTE_READ: Self = Self(1 << 2);
/// Allow remote atomic operations.
pub const REMOTE_ATOMIC: Self = Self(1 << 3);
/// Memory window binding allowed.
pub const MW_BIND: Self = Self(1 << 4);
}
/// RDMA queue pair transport type.
#[repr(u32)]
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum RdmaQpType {
/// Reliable Connection (most common — TCP-like reliability).
Rc = 2,
/// Unreliable Connection (connectionless, no retransmission).
Uc = 3,
/// Unreliable Datagram (no connection setup, no reliability).
Ud = 4,
    /// Extended Reliable Connection (XRC — shares receive resources across
    /// connections for scalable many-to-many communication).
Xrc = 6,
}
/// RDMA work completion entry (returned by `poll_cq`).
/// Modeled on the InfiniBand Verbs `ibv_wc` structure (condensed layout).
#[repr(C)]
pub struct RdmaWorkCompletion {
/// Work request ID from the original send/recv WR.
pub wr_id: u64,
/// Completion status: 0 = success, non-zero = error code.
pub status: u32,
/// Opcode of the completed operation (RdmaSendOpcode for sends,
/// or RECV/RECV_WITH_IMM for receives).
pub opcode: u32,
/// Number of bytes transferred (for receive completions).
pub byte_len: u32,
/// Immediate data (valid for SendWithImm / RdmaWriteImm).
pub imm_data: u32,
/// Source QP number (for UD completions).
pub src_qp: u32,
/// Flags: GRH present, etc.
pub wc_flags: u32,
}
// RdmaWorkCompletion: u64(8) + u32(4)*6 = 32 bytes. KABI struct.
const_assert!(core::mem::size_of::<RdmaWorkCompletion>() == 32);
/// Connection parameters for RC queue pair connection management.
#[repr(C)]
pub struct RdmaConnParams {
/// Remote QP number (24 bits, upper 8 bits reserved).
pub remote_qpn: u32,
/// Remote LID (local identifier, for InfiniBand subnets).
pub remote_lid: u16,
/// Service level (QoS, 0-15).
pub service_level: u8,
/// Global routing: remote GID (for RoCE and cross-subnet IB).
pub remote_gid: [u8; 16],
/// Explicit padding: `remote_gid` ends at offset 22; `path_mtu` (u32)
/// needs 4-byte alignment, so 1 byte of padding aligns it to offset 24.
pub _pad1: u8,
/// Path MTU (256, 512, 1024, 2048, 4096 bytes).
pub path_mtu: u32,
/// Retry count for RC transport (0-7).
pub retry_count: u8,
/// RNR (Receiver Not Ready) retry count (0-7).
pub rnr_retry: u8,
/// Private data for application-level connection negotiation (up to 56 bytes).
pub private_data: [u8; 56],
/// Length of valid private data bytes.
pub private_data_len: u8,
pub _pad: [u8; 3],
}
// RdmaConnParams: u32(4) + u16(2) + u8(1) + [u8;16] + u8(1) + u32(4) + u8(1)*2
// + [u8;56] + u8(1) + [u8;3] = 90, padded to 92 (align 4). KABI struct.
const_assert!(core::mem::size_of::<RdmaConnParams>() == 92);
22.7.1.4 Topology Export¶
The kernel exports accelerator and network topology so that collective libraries (NCCL, RCCL, Gloo, etc.) can make optimal routing decisions:
/sys/kernel/umka/topology/
accelerators # List of all accelerators with NUMA/PCIe location
p2p_matrix # NxN matrix of P2P bandwidth between accelerators
rdma_links # RDMA network links with bandwidth/latency
collective_groups # Pre-computed optimal collective groups
$ cat /sys/kernel/umka/topology/p2p_matrix
# Bandwidth in GB/s (0 = no direct path)
#        GPU0   GPU1   GPU2   GPU3
GPU0       -    200    12.5   12.5
GPU1      200    -     12.5   12.5
GPU2     12.5   12.5    -     200
GPU3     12.5   12.5   200     -
This is additive — NCCL currently discovers topology through a combination of `nvidia-smi topo`, sysfs traversal, and trial-and-error. A kernel-provided topology file is faster and more reliable.
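A consumer (e.g., a collective library) can parse the file in a few lines. This sketch assumes only the format shown above: `#` comment lines, a row label per line, and `-` on the diagonal for "no self-link":

```rust
/// Parse the p2p_matrix text into (row labels, bandwidth matrix in GB/s).
/// "-" and any non-numeric cell become 0.0 (no direct path).
pub fn parse_p2p_matrix(text: &str) -> (Vec<String>, Vec<Vec<f64>>) {
    let mut labels = Vec::new();
    let mut rows = Vec::new();
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue; // skip comments and blank lines
        }
        let mut cols = line.split_whitespace();
        labels.push(cols.next().unwrap_or_default().to_string());
        rows.push(cols.map(|c| c.parse().unwrap_or(0.0)).collect());
    }
    (labels, rows)
}
```

From the parsed matrix, a library can pick ring orderings that keep high-bandwidth (NVLink) hops adjacent — the same decision it currently makes from ad-hoc discovery.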
22.7.2 Linux Compatibility Layer¶
22.7.2.1 DRM/KMS Compatibility¶
Linux's Direct Rendering Manager (DRM) is the standard GPU kernel interface. Userspace
graphics stacks (Mesa, Vulkan, Wayland compositors) use it via /dev/dri/card* and
/dev/dri/renderD*.
UmkaOS provides DRM compatibility through umka-sysapi:
Userspace (Vulkan, OpenGL, CUDA, etc.)
|
| Standard DRM/KMS ioctls
v
umka-sysapi/src/drm/
|
| Translates DRM ioctls to umka-accel KABI calls
v
umka-accel scheduler + AccelBase/AccelCompute vtable
|
| KABI vtable calls
v
GPU driver (Tier 1, domain-isolated)
DRM ioctls translated to umka-accel operations:
| DRM ioctl | umka-accel equivalent |
|---|---|
| `DRM_IOCTL_MODE_*` | Display subsystem (AccelDisplayVTable, not covered here) |
| `DRM_IOCTL_*_CTX_CREATE/DESTROY` | `create_context` / `destroy_context` |
| `DRM_IOCTL_*_GEM_CREATE` | `alloc_device_memory` |
| `DRM_IOCTL_*_GEM_MMAP` | `map_device_memory` |
| `DRM_IOCTL_*_EXECBUFFER` | `submit_commands` (via AccelScheduler) |
| `DRM_IOCTL_*_WAIT` | `poll_completion` |
| `DRM_IOCTL_PRIME_*` | P2P DMA handle export/import |
22.7.2.2 NVIDIA Compatibility¶
NVIDIA's userspace stack (CUDA runtime, cuDNN, TensorRT) communicates with the kernel
driver through proprietary ioctls on /dev/nvidia*. The compatibility approach:
Option A: NVIDIA ships a KABI-native kernel interface layer
- NVIDIA already has a clean internal split between their proprietary
compute core and the "kernel interface layer" (nvidia.ko)
- UmkaOS provides a KABI implementation of this interface layer
- NVIDIA's compute core links against our KABI implementation
- Same approach described in Section 24.1.4
Option B: ioctl compatibility shim
- Translate NVIDIA's /dev/nvidia* ioctls to umka-accel KABI calls
- More fragile (NVIDIA changes their ioctl interface between releases)
- Stopgap until Option A is available
Option A is strongly preferred and is the medium-term plan.
22.7.2.3 Feasibility Analysis: Porting Open-Source NVIDIA Drivers¶
NVIDIA open-sourced their kernel modules (nvidia.ko, nvidia-uvm.ko, nvidia-modeset.ko) in 2022 under dual MIT/GPLv2 license. This section analyzes the feasibility of writing a new UmkaOS-native driver based on this open-source code, preserving binary compatibility with NVIDIA's proprietary userspace stack (CUDA, cuDNN, TensorRT, etc.).
Why this matters: If unmodified libcuda.so, libnvidia-ml.so, cuDNN, TensorRT,
and the entire CUDA toolkit work on UmkaOS without recompilation, the kernel is
immediately viable for the entire GPU computing ecosystem.
22.7.2.3.1 NVIDIA's Architecture (Post-2022 Open-Source Release)¶
The modern NVIDIA driver stack has a clean three-layer architecture:
Layer 3: Proprietary Userspace (binary-only, NOT recompiled)
┌─────────────────────────────────────────────────────┐
│ libcuda.so (CUDA runtime) │
│ libnvidia-ml.so (NVML monitoring) │
│ cuDNN, TensorRT, NCCL │
│ Vulkan/OpenGL ICD (libnvidia-glcore.so) │
│ NVENC/NVDEC (video encode/decode) │
└──────────────────────┬──────────────────────────────┘
│ ioctl() on /dev/nvidia*
│ (stable-ish interface, versioned by driver release)
v
Layer 2: Open-Source Kernel Module (MIT/GPLv2, THIS IS WHAT WE PORT)
┌─────────────────────────────────────────────────────┐
│ nvidia.ko │
│ ├── OS interface layer (nv-linux.c, nv-pat.c, │
│ │ nv-mmap.c, nv-i2c.c, nv-acpi.c, etc.) │
│ │ → Linux kernel API calls (PCI, DMA, IRQ, MM) │
│ │ → THIS is what we rewrite for UmkaOS │
│ │ │
│ ├── RM (Resource Manager) core │
│ │ → Hardware-agnostic resource management │
│ │ → Talks DOWN to GSP firmware via RM RPC │
│ │ → Talks UP to userspace via control ioctls │
│ │ → We can reuse this largely unchanged │
│ │ │
│ └── Entry points: module_init/exit, ioctl dispatch │
│ → Rewrite for KABI driver lifecycle │
│ │
│ nvidia-uvm.ko (Unified Virtual Memory) │
│ ├── Page fault handler, migration engine │
│ ├── Uses Linux HMM (mmu_notifiers) │
│ └── Integrate with umka-core HMM (Section 22.2) │
│ │
│ nvidia-modeset.ko (Display/KMS) │
│ ├── Mode setting, display management │
│ └── Map to AccelDisplayVTable │
└──────────────────────┬──────────────────────────────┘
│ RM RPC (register read/write commands)
v
Layer 1: GSP Firmware (runs ON the GPU, opaque, NOT our concern)
┌─────────────────────────────────────────────────────┐
│ GPU System Processor firmware │
│ - Loaded by kernel module during initialization │
│ - Handles actual hardware programming │
│ - Context switching, power management, ECC, etc. │
│ - Communicates with kernel via shared memory + IRQ │
│ - Binary blob, but runs on GPU, not on CPU │
│ - We just need to load it and talk RM RPC to it │
└─────────────────────────────────────────────────────┘
Key insight: nvidia.ko is NOT a traditional monolithic driver. On modern GPUs (Turing and newer, i.e., everything from 2018+), the kernel module is primarily a thin translation layer between Linux kernel APIs and the GSP firmware running on the GPU itself. The "intelligence" is in the GSP firmware, not in the kernel module.
22.7.2.3.2 Component-by-Component Porting Analysis¶
Component 1: OS Interface Layer → KABI Translation (Mechanical, ~60% of work)
| Linux API Used | UmkaOS Equivalent | Difficulty |
|---|---|---|
| `pci_register_driver()` | KABI device registry match + `register_driver()` | Trivial |
| `request_firmware()` / `release_firmware()` | `kabi_request_firmware()` / handle release (Section 11.6) | Trivial |
| `request_irq()` / `free_irq()` | KABI `request_irq()` / `free_irq()` | Trivial |
| `dma_alloc_coherent()` / `dma_map_*()` | KABI `dma_alloc()` / `dma_map()` | Trivial |
| `ioremap()` / `iounmap()` | KABI `ioremap()` / `iounmap()` | Trivial |
| `alloc_pages()` / `__free_pages()` | KABI `alloc_pages()` | Trivial |
| `mutex_lock()` / `spin_lock()` | KABI mutex / spinlock (or Rust equivalents) | Trivial |
| `timer_setup()` / `schedule_work()` | KABI timer / workqueue equivalents | Easy |
| `pci_enable_msi()` / MSI-X | KABI MSI/MSI-X support | Easy |
| `sysfs_create_group()` | KABI property export (device registry) | Easy |
| ACPI methods (power management) | KABI power state callbacks | Easy |
| `vm_area_struct` / `vm_operations` | KABI memory mapping interface | Medium |
| `mmu_notifier_*()` (for UVM) | UmkaOS HMM `PageLocationTracker` (Section 22.4) | Medium |
| `/proc/driver/nvidia/` | `/sys/kernel/umka/accel/` (Section 22.7) | Easy |
This layer spans roughly 90 files in the open-source nvidia.ko that implement Linux kernel API calls. The work is mechanical: each Linux API call maps to its KABI equivalent. There are no algorithmic decisions — it's pure API translation.
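The mechanical character of this translation can be sketched in a few lines. Everything below is hypothetical: the `kabi` module is a mock standing in for the real UmkaOS KABI, and `nv_dma_alloc_coherent` only mimics the *shape* of an nvidia.ko OS-layer entry point.

```rust
/// Hypothetical stand-in for the UmkaOS KABI DMA interface. Names and
/// signatures are illustrative, not the real KABI.
mod kabi {
    pub struct DmaAllocation {
        pub cpu_vaddr: usize, // kernel virtual address of the buffer
        pub dma_addr: u64,    // device-visible (IOMMU-mapped) address
        pub len: usize,
    }

    /// Mock of a KABI `dma_alloc()`: the real call returns coherent,
    /// IOMMU-mapped memory. Here we return fixed placeholder addresses.
    pub fn dma_alloc(len: usize) -> Option<DmaAllocation> {
        if len == 0 {
            return None;
        }
        Some(DmaAllocation {
            cpu_vaddr: 0xffff_8000_0000,
            dma_addr: 0x8000_0000,
            len,
        })
    }
}

/// A shim with the shape of a `dma_alloc_coherent()` call site in
/// nvidia.ko's OS interface layer: same inputs and outputs, but the body
/// is a one-line KABI call — no algorithmic decisions, pure translation.
fn nv_dma_alloc_coherent(len: usize) -> Option<(usize, u64)> {
    let a = kabi::dma_alloc(len)?;
    Some((a.cpu_vaddr, a.dma_addr))
}
```

Each of the ~90 files reduces to shims of this form, which is why the difficulty rating is "Trivial" for most rows of the table.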
Component 2: ioctl Interface → /dev/nvidia* Compatibility (Critical)
The binary userspace communicates exclusively through ioctls on:
- /dev/nvidia0, /dev/nvidia1, ... (per-GPU control)
- /dev/nvidiactl (system-level control)
- /dev/nvidia-uvm (unified virtual memory)
- /dev/nvidia-uvm-tools (profiling)
- /dev/nvidia-modeset (display)
Approach:
- umka-sysapi provides the /dev/nvidia* character devices.
- Each ioctl is dispatched to the UmkaOS NVIDIA driver (Tier 1).
- The driver's RM core processes the ioctl exactly as it does on Linux.
- The critical point: we do NOT reinterpret ioctls; we pass them through to the same open-source RM core code.
- The RM core talks to GSP firmware, which does the actual work.
- The ioctl ABI is preserved byte-for-byte.
The ioctl numbers and structures are defined in NVIDIA's open-source headers
(nvidia-uvm/uvm_linux_ioctl.h, nvidia/nv-ioctl-numbers.h). Since the RM core is
open-source and handles ioctl dispatch internally, we preserve the exact same ioctl
interface. Binary libcuda.so cannot distinguish UmkaOS from Linux.
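A minimal sketch of the passthrough discipline, with a mock RM core (the ioctl number and echo behavior here are placeholders; the real numbers come from NVIDIA's open-source headers):

```rust
/// Mock of the open-source RM core's ioctl entry point. The real RM
/// core interprets the payload itself; the passthrough layer never does.
struct MockRmCore;

impl MockRmCore {
    /// Receives the ioctl payload exactly as userspace sent it.
    /// Mock behavior: echo the payload back unchanged.
    fn handle_ioctl(&self, _nr: u32, payload: &[u8]) -> Vec<u8> {
        payload.to_vec()
    }
}

/// Character-device dispatch for /dev/nvidia*: the only job is routing.
/// No byte swizzling, no struct rewriting — the ABI is preserved
/// byte-for-byte between userspace and the RM core.
fn nvidia_dev_ioctl(rm: &MockRmCore, nr: u32, payload: &[u8]) -> Vec<u8> {
    rm.handle_ioctl(nr, payload)
}
```

The point of the sketch is what is *absent*: there is no per-ioctl translation table on this path, which is why binary `libcuda.so` cannot tell the difference.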
Component 3: UVM (Unified Virtual Memory) → UmkaOS HMM Integration (Hardest Part)
nvidia-uvm.ko implements CUDA Unified Memory. On Linux, it hooks deeply into the memory management subsystem:
Linux UVM hooks:
- mmu_notifier (track CPU page table changes)
- hmm_range_fault() (resolve CPU page faults for GPU access)
- migrate_vma() (migrate pages between CPU and GPU)
- Fault handler for GPU page faults (ATS/PRI)
UmkaOS's advantage: Section 22.4 (Heterogeneous Memory Management) provides exactly these primitives as first-class kernel features. The mapping is:
| Linux UVM mechanism | UmkaOS equivalent |
|---|---|
| `mmu_notifier_register()` | `PageLocationTracker` subscription |
| `hmm_range_fault()` | UmkaOS HMM `handle_device_fault()` callback |
| `migrate_vma_setup/pages/finalize()` | `AccelBase` `migrate_pages()` |
| `fault_handler()` for ATS | UmkaOS device fault handler (Section 22.4) |
| Per-process GPU page tables | `AccelContext` memory management |
| UVM counters / access tracking | UmkaOS memory accounting (cgroup `accel.memory.*`) |
This is the most complex component because UVM deeply interweaves with memory management internals. However, UmkaOS's HMM design (Section 22.4) was designed with exactly this use case in mind, so the mapping is cleaner than on Linux.
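The core of the UVM fault path can be modeled in miniature. This is a toy sketch, not the Section 22.4 design: `PageLocationTracker` and `handle_device_fault` are simplified stand-ins, and "migration" is reduced to flipping a record (the real path runs `migrate_pages()` on `AccelBase`).

```rust
use std::collections::HashMap;

/// Where a managed page currently resides.
#[derive(Clone, Copy, PartialEq, Debug)]
enum PageLocation {
    Cpu,
    Gpu,
}

/// Toy model of the HMM page-location tracker. The real tracker lives
/// in umka-core and coordinates with CPU page tables; this only shows
/// the shape of the GPU fault path.
struct PageLocationTracker {
    pages: HashMap<u64, PageLocation>, // page frame number -> location
}

impl PageLocationTracker {
    /// The GPU faulted on `pfn`. If the page is CPU-resident, migrate
    /// it to the GPU (mock: just update the record) and return where
    /// the page now lives.
    fn handle_device_fault(&mut self, pfn: u64) -> PageLocation {
        let loc = self.pages.entry(pfn).or_insert(PageLocation::Cpu);
        if *loc == PageLocation::Cpu {
            *loc = PageLocation::Gpu; // migration completed
        }
        *loc
    }
}
```

The real complexity — concurrent CPU access, partial ranges, write protection, eviction under memory pressure — is exactly what makes this component an 8/10 in the difficulty table.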
Component 4: GSP Firmware Loading and Communication (Straightforward)
The kernel module loads GSP firmware from the filesystem (/lib/firmware/nvidia/)
via kabi_request_firmware() (Section 11.6),
then uploads it into GPU VRAM and communicates via shared memory regions and interrupts.
The GSP communication layer (RM RPC) is self-contained within the RM core and does not
depend on kernel APIs beyond basic DMA and interrupt handling — both of which map
trivially to KABI.
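The load-and-upload flow is structurally simple, as a mock sketch shows. Assumptions: `kabi_request_firmware` here returns a fake blob rather than reading `/lib/firmware/nvidia/`, and the VRAM write is a plain buffer append standing in for chunked DMA.

```rust
/// Mock of kabi_request_firmware() (Section 11.6): the real call reads
/// the GSP blob from the filesystem; here we fabricate one.
fn kabi_request_firmware(_name: &str) -> Vec<u8> {
    vec![0xAA; 10_000]
}

/// Upload the firmware blob into (mock) VRAM in fixed-size DMA chunks
/// and return the chunk count. The loop structure — request, chunk,
/// DMA each chunk — is the point, not the placeholder numbers.
fn upload_gsp_firmware(vram: &mut Vec<u8>, chunk: usize) -> usize {
    let blob = kabi_request_firmware("nvidia/gsp.bin");
    let mut chunks = 0;
    for part in blob.chunks(chunk) {
        vram.extend_from_slice(part); // mock of a DMA write into VRAM
        chunks += 1;
    }
    chunks
}
```

After upload, the kernel rings the GSP boot mechanism and switches to RM RPC over shared memory — all of which stays inside the RM core.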
Component 5: Display / KMS (nvidia-modeset.ko) (Standard)
Maps to AccelDisplayVTable. Standard DRM/KMS translation (Section 22.7) handles most of this. NVIDIA's modeset module is relatively thin compared to the compute path.
22.7.2.3.3 Difficulty Rating Summary¶
| Component | Difficulty | Notes |
|---|---|---|
| OS interface layer rewrite | 3/10 | Mechanical API translation |
| ioctl passthrough (`/dev/nvidia*`) | 2/10 | Character device + dispatch |
| GSP firmware loading | 2/10 | Self-contained in RM core |
| PCI/DMA/IRQ setup | 2/10 | Trivial KABI mapping |
| UVM → UmkaOS HMM integration | 8/10 | Deepest integration point (~40K lines in upstream UVM; essentially a rewrite of the MM integration layer) |
| Display / modeset | 4/10 | Standard DRM/KMS path |
| Power management (ACPI/PCIe) | 3/10 | Device registry PM callbacks |
| Testing + binary userspace validation | 5/10 | Must test full CUDA stack |
22.7.2.3.4 Binary Userspace Compatibility Verification¶
The following must work without recompilation on UmkaOS:
| Component | Interface Used | Compatibility Path |
|---|---|---|
| `libcuda.so` (CUDA runtime) | `/dev/nvidia*` ioctls | ioctl passthrough to RM core |
| `libnvidia-ml.so` (NVML) | `/dev/nvidiactl` ioctls + sysfs | ioctl passthrough + sysfs compat |
| cuDNN / TensorRT | Links against `libcuda.so` | Transitive (libcuda works → these work) |
| `nvidia-smi` | NVML library | Transitive |
| NCCL (multi-GPU) | libcuda + `/dev/nvidia-uvm` | UVM compat + P2P DMA support |
| Vulkan ICD | DRM + `/dev/nvidia*` | DRM compat (8.1) + ioctl passthrough |
| NVENC/NVDEC | `/dev/nvidia*` ioctls | ioctl passthrough |
| Container runtime (nvidia-container-toolkit) | cgroup + device files | cgroup compat + device file compat |
22.7.2.3.5 What UmkaOS Does Better Than Linux for NVIDIA GPUs¶
| Capability | Linux Behavior | UmkaOS Behavior |
|---|---|---|
| GPU crash recovery | System reboot required | Driver reload in ~100ms–5s (Section 22.3) |
| GPU scheduling | Driver-internal, invisible | Kernel-managed, cgroup-integrated (Section 22.2) |
| GPU memory limits | None (driver tracks, no enforcement) | cgroup accel.memory.max (Section 22.5) |
| GPU compute QoS | None | cgroup accel.compute.guarantee (Section 22.5) |
| GPU memory in OOM killer | Invisible | Full visibility, OOM-killable (Section 22.5) |
| Multi-tenant isolation | MIG only (hardware-dependent) | Software scheduling + MIG (Section 22.5) |
| GPU observability | nvidia-smi (polling) | Stable tracepoints + eBPF (Section 22.3) |
| UVM performance | Bolted-on HMM, driver-specific | First-class HMM, kernel-managed (Section 22.4) |
| P2P DMA (GPUDirect) | NVIDIA-specific API | Generalized KABI P2P (Section 22.4) |
| Power management | Driver-internal | Topology-driven, device-registry-integrated |
22.7.2.3.6 Risk Assessment¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| NVIDIA changes ioctl ABI between releases | Medium | High | Pin to specific driver release initially; NVIDIA's ioctl ABI is versioned and backwards-compatible in practice |
| UVM integration bugs cause data corruption | Low | Critical | Extensive testing with CUDA memory stress tests; UVM has well-defined semantics |
| GSP firmware version incompatibility | Low | High | Support specific firmware versions matching the open-source driver release |
| Performance regression vs Linux | Medium | Medium | Profile early; most performance is in GSP firmware and userspace, not kernel module |
| NVIDIA open-source license compliance | None | N/A | MIT/GPLv2 dual license; our OKLF is compatible with both |
22.7.2.3.7 Implementation Strategy¶
Phase 1: Basic GPU Initialization
- Port OS interface layer to KABI
- GSP firmware loading
- /dev/nvidia* character devices with ioctl passthrough
- PCI, DMA, IRQ setup via KABI
- Goal: nvidia-smi shows GPU info
Phase 2: Compute Workloads
- Full RM core integration
- CUDA simple programs work (vector add, matmul)
- Basic context management through AccelBase
- Goal: CUDA samples compile and run
Phase 3: UVM and Multi-GPU
- UVM integration with UmkaOS HMM
- Unified memory (cudaMallocManaged) works
- P2P DMA for multi-GPU (NVLink, PCIe)
- NCCL multi-GPU collectives
- Goal: PyTorch distributed training works
Phase 4: Production Hardening
- Display / modeset (AccelDisplayVTable)
- Full cgroup integration
- Crash recovery testing
- Performance parity validation
- Goal: Production-ready for datacenter + desktop
22.7.2.3.8 Licensing Note¶
NVIDIA's open-source kernel modules are dual-licensed MIT/GPLv2. Since we are writing a new driver inspired by the open-source code (not copy-pasting it), and our KABI interface layer is original work, there are no licensing concerns. The ioctl numbers and structures are functional interfaces (facts, not copyrightable expression). The GSP firmware is a binary blob loaded onto the GPU (not linked into our kernel). This is fully compatible with OKLF v1.3.
22.7.2.4 VFIO Passthrough¶
For VMs that need direct device access. VFIO is a general-purpose mechanism (see Section 11.5) — it works identically for GPUs, NICs, NVMe controllers, and any other PCIe device. The GPU-specific example:
/dev/vfio/ interface (Section 11.4, Tier 2 driver path)
|
v
umka-kvm (KABI Tier 1 driver)
|
| Programs IOMMU for VM isolation
v
PCIe device (entire device assigned to VM, VM's guest driver manages it)
VFIO passthrough works unchanged from Linux. The device is assigned to the VM at the IOMMU group level (Section 11.5).
22.7.2.5 UmkaOS-Specific Interfaces (Superset)¶
Beyond Linux compatibility, new interfaces for UmkaOS-aware software:
/dev/umka-accel-0 # UmkaOS-native accelerator access
/dev/umka-accel-1 # (one per accelerator device)
/sys/kernel/umka/accel/
devices/
0/
info # AccelDeviceInfo (JSON-formatted)
utilization # Current utilization stats
contexts # Active context list
memory # Memory usage breakdown
power # Power/thermal/clock state
topology # PCIe/NVLink/xGMI connections
partitions/ # MIG/SR-IOV partitions (if available)
0/
info
bind # Bind to cgroup
scheduler/
policy # Global scheduling policy
stats # Global scheduling statistics
topology/
p2p_matrix # Bandwidth matrix
rdma_links # Network links
numa_map # Accelerator-to-NUMA mapping
inference/
models/ # In-kernel model management (Section 22.4)
Existing Linux tools (nvidia-smi, rocm-smi, intel_gpu_top) continue to work
through the DRM/sysfs compatibility layer. UmkaOS-specific tools (umka-accel-top,
umka-gpu-smi) can use the richer /sys/kernel/umka/accel/ interface for more
detailed information and control.
22.7.2.6 Display Stack: Wayland and Buffer Sharing¶
The modern Linux display stack centers on Wayland compositors consuming DRM/KMS (Section 22.7). UmkaOS provides the full kernel infrastructure these compositors require, with crash-recovery advantages that Linux cannot offer.
DMA-BUF (Cross-Device Buffer Sharing)
DMA-BUF is the kernel primitive for sharing memory buffers between devices — GPU to display controller, GPU to video encoder, camera to GPU, GPU to network (RDMA). In Linux, DMA-BUF is a global namespace with weak access control. In UmkaOS, each DMA-BUF is a capability-protected kernel object:
DMA-BUF lifecycle:
1. Exporter (e.g., GPU driver) creates a DMA-BUF from device memory
2. Returns a capability token (not a raw file descriptor)
3. Importer (e.g., display driver) receives the capability via IPC
4. Importer maps the DMA-BUF into its device's address space
5. Zero-copy path: both devices access the same physical memory
For Linux compatibility, DMA-BUF capabilities are presented as file descriptors through the SysAPI layer, so existing userspace (Mesa, Wayland compositors, GStreamer) works unmodified.
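The capability-to-fd translation at the SysAPI boundary can be sketched as a small table. All names here are illustrative — `DmaBufCap` and `SysApiFdTable` are invented for the sketch, not UmkaOS types.

```rust
use std::collections::HashMap;

/// Capability token protecting a DMA-BUF kernel object (mock).
#[derive(Clone, Copy, PartialEq, Debug)]
struct DmaBufCap(u64);

/// SysAPI-layer fd table: presents capabilities to Linux userspace as
/// ordinary file descriptors.
struct SysApiFdTable {
    next_fd: i32,
    fds: HashMap<i32, DmaBufCap>,
}

impl SysApiFdTable {
    fn new() -> Self {
        Self { next_fd: 3, fds: HashMap::new() }
    }

    /// Export: wrap the capability in an fd so Mesa / Wayland
    /// compositors see a normal DMA-BUF file descriptor.
    fn export_as_fd(&mut self, cap: DmaBufCap) -> i32 {
        let fd = self.next_fd;
        self.next_fd += 1;
        self.fds.insert(fd, cap);
        fd
    }

    /// Import: an fd received over IPC resolves back to the same
    /// capability, so exporter and importer share one physical buffer.
    fn import_from_fd(&self, fd: i32) -> Option<DmaBufCap> {
        self.fds.get(&fd).copied()
    }
}
```

The design choice this illustrates: access control lives with the capability token, while the fd is merely a compatibility view of it — unknown fds resolve to nothing rather than to someone else's buffer.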
Explicit Synchronization (dma_fence / sync_file)
GPU and display operations are asynchronous. Explicit sync primitives coordinate them:
- dma_fence: kernel-internal synchronization point representing GPU work completion. Created by the GPU driver when work is submitted, signaled when the GPU finishes.
- sync_file: userspace-visible wrapper around one or more dma_fences. Exported as a file descriptor. Wayland compositors use these to know when a client's rendering is complete before presenting.
- Timeline semaphores: Vulkan-style monotonic sync objects. More efficient than binary fences for pipelined workloads (render frame N while displaying frame N-1).
UmkaOS implements the Linux-compatible sync_file ioctl interface
(SYNC_IOC_MERGE, SYNC_IOC_FILE_INFO) so existing Wayland compositors and Vulkan
drivers work without modification.
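The `SYNC_IOC_MERGE` semantics are easy to state precisely: a merged sync_file signals only when every constituent fence has signaled. A toy model (types are mocks, not the kernel's `dma_fence`):

```rust
/// Mock of a dma_fence: one GPU-work completion point.
#[derive(Clone)]
struct Fence {
    signaled: bool,
}

/// Mock of a sync_file: a userspace-visible bundle of fences.
struct SyncFile {
    fences: Vec<Fence>,
}

impl SyncFile {
    /// SYNC_IOC_MERGE semantics: the result carries all fences from
    /// both inputs.
    fn merge(a: &SyncFile, b: &SyncFile) -> SyncFile {
        let mut fences = a.fences.clone();
        fences.extend(b.fences.clone());
        SyncFile { fences }
    }

    /// Signaled only when every constituent fence has signaled —
    /// this is what lets a compositor wait on "all rendering done".
    fn is_signaled(&self) -> bool {
        self.fences.iter().all(|f| f.signaled)
    }
}
```
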
GBM (Generic Buffer Manager)
GBM is the userspace buffer allocation library that Wayland compositors use to allocate
scanout-capable buffers. It talks to DRM via ioctl. UmkaOS's DRM compatibility layer
(Section 22.7) ensures GBM works unmodified — gbm_create_device(), gbm_bo_create(),
and related functions issue standard DRM ioctls that UmkaOS handles.
Render Nodes (/dev/dri/renderD*)
Render nodes provide unprivileged GPU compute and render access without modesetting capability. Any user can open a render node to submit GPU work (3D rendering, compute shaders, video encode/decode) without needing root or DRM master status. UmkaOS exposes render nodes via the DRM compat layer, mapping each to an umka-accel context creation with appropriate capability restrictions (no modesetting, no display control).
DRM Leases
DRM leases allow a client to take exclusive control of a display connector — the
primary consumer is VR headsets (SteamVR uses DRM leases to drive the HMD display
independently from the desktop compositor). UmkaOS supports the lease ioctl family
(DRM_IOCTL_MODE_CREATE_LEASE, DRM_IOCTL_MODE_LIST_LESSEES, etc.) through the DRM
compatibility layer.
KMS Atomic Modesetting
The modern display pipeline API. Replaces legacy drmModeSetCrtc with atomic commits
that update multiple display properties (resolution, refresh, gamma, position) as a
single transaction — either the entire commit succeeds or nothing changes. UmkaOS's DRM
layer implements the full atomic commit interface (DRM_IOCTL_MODE_ATOMIC) including:
- TEST_ONLY mode (compositor can test configurations without applying them)
- Non-blocking commits (compositor doesn't stall waiting for vblank)
- Property-based interface (all display state expressed as properties)
Multi-GPU Display (PRIME)
Render on GPU A, scanout on GPU B via DMA-BUF sharing (known as PRIME in the DRM ecosystem). The P2P DMA infrastructure (Section 22.4) handles the underlying data transfer. For the display case specifically:
- DMA-BUF export from render GPU → DMA-BUF import on display GPU
- If GPUs share a PCIe switch, the transfer is direct P2P (no CPU copy)
- If GPUs are on different NUMA nodes, the transfer goes through host memory
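The path-selection rule above can be written as a small decision function. This is a sketch under stated assumptions: `GpuTopo` is an invented representation, and the real decision consults the P2P bandwidth matrix (Section 22.4); the fallback for same-NUMA, different-switch pairs is a conservative choice made for the sketch.

```rust
/// How the PRIME buffer transfer is routed.
#[derive(PartialEq, Debug)]
enum TransferPath {
    DirectP2p,  // same PCIe switch: device-to-device, no CPU copy
    HostBounce, // staged through host memory
}

/// Invented topology descriptor for the sketch.
struct GpuTopo {
    pcie_switch: u32,
    numa_node: u32,
}

fn prime_transfer_path(render: &GpuTopo, display: &GpuTopo) -> TransferPath {
    if render.pcie_switch == display.pcie_switch {
        TransferPath::DirectP2p
    } else if render.numa_node != display.numa_node {
        TransferPath::HostBounce // cross-NUMA: stage through host memory
    } else {
        // Same NUMA node, different switch: conservative default in
        // this sketch (the real kernel consults measured bandwidth).
        TransferPath::HostBounce
    }
}
```
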
HDR and Wide Color Gamut
Modern displays support HDR10, Dolby Vision, and wide color gamuts (DCI-P3, Rec. 2020).
Kernel support is exposed via KMS atomic properties:
- HDR_OUTPUT_METADATA: HDR static/dynamic metadata per connector
- COLORSPACE: signal color space to the display (BT.709, DCI-P3, BT.2020)
- MAX_BPC: maximum bits-per-channel for the connector (8, 10, 12, 16)
- COLOR_ENCODING / COLOR_RANGE: YCbCr encoding and quantization range
These are standard KMS properties exposed through atomic modesetting. Wayland compositors (wlroots, Mutter, KWin) use these to negotiate HDR output with the display.
Crash Recovery Advantage
If the DRM/GPU driver crashes (Tier 1 reload, Section 11.9):
- The driver binary is reloaded in ~50-150ms
- DMA-BUF metadata (size, format, sharing graph, capability tokens) survives the reload — this metadata is managed by umka-core's memory subsystem, not the driver. Buffer handles remain valid. However, VRAM contents are lost on device reset — a GPU reset physically reinitializes the device memory, destroying all framebuffer data, texture contents, and compute buffers stored in VRAM. Applications must re-upload data to the GPU after recovery.
- Sync objects are re-created in signaled state (conservative — forces resubmission of any pending work, but doesn't deadlock the compositor)
- The Wayland compositor sees a brief stall (~100ms–5s, the full recovery window) and must re-render its buffers (VRAM contents are lost), but does not lose its buffer handles or capability tokens, does not crash, and does not need to re-negotiate with clients. The compositor's CPU-side state (window tree, client list, layout) is fully preserved.
- Running applications still hold valid DMA-BUF capabilities. They receive an `ENODATA` error on the next access to a VRAM-backed buffer, indicating that the buffer contents are stale and must be re-uploaded. Applications that maintain CPU-side copies of their rendering data (the common case for double-buffered Wayland clients) can re-upload and resume rendering immediately. Applications that relied solely on VRAM-resident data with no CPU-side copy must regenerate or reload from disk.
In Linux, a GPU driver crash typically kills the entire X/Wayland session, requiring all graphical applications to restart. UmkaOS's display stack crash recovery preserves the sharing graph and buffer metadata, allowing rapid re-rendering rather than full session teardown. This is a direct consequence of the capability-managed DMA-BUF design and Tier 1 driver reload.
22.7.3 Accelerator Service Provider¶
Provider model: A GPU or FPGA with Tier M firmware IS the accelerator service
provider directly — no host-side KABI GPU driver needed. The device advertises
ACCELERATOR via CapAdvertise, the host creates AccelServiceClient via
PeerServiceProxy (Section 5.11).
Command buffers are opaque to the host (vendor-specific format generated by
userspace CUDA/OpenCL/Vulkan runtimes) — the kernel just transports them.
Sharing model: shared per-context (multiple local and remote peers create
independent contexts on the same device).
A host can provide its accelerator devices (GPU, FPGA, inference engines) as cluster services via the peer protocol. Remote peers create contexts, submit command buffers, and receive completions as if the accelerator were local.
This is the accelerator-framework instantiation of the capability service provider model described in Section 5.7.
// umka-accel/src/service_provider.rs
/// Exports a local accelerator device to cluster peers.
pub struct AccelServiceProvider {
/// The local accelerator being exported.
device: AccelDeviceHandle,
/// Transport endpoint for receiving remote submissions.
endpoint: PeerEndpoint,
/// Maximum concurrent remote contexts.
max_contexts: u32,
/// Maximum in-flight command buffers across all remote contexts.
max_inflight_cmds: u32,
}
/// Type of accelerator context. Determines what command buffer
/// formats are accepted and what device resources are allocated.
#[repr(u8)]
pub enum AccelContextType {
/// General-purpose compute context (CUDA kernels, OpenCL kernels,
/// Vulkan compute shaders). Allocates compute pipelines on device.
Compute = 0,
/// ML inference context. Optimized for model loading + batch
/// inference. May use dedicated inference engines (tensor cores,
/// NPU cores) if available.
Inference = 1,
// Note: Render/display contexts are intentionally excluded.
// Display contexts require sub-millisecond latency to the display
// pipeline, which is incompatible with network round-trip times.
// See "Scope limitations" below.
}
bitflags! {
/// Flags for device memory allocation via `AccelServiceOp::AllocDeviceMem`.
pub struct DmaAllocFlags: u32 {
/// Memory is cached (write-back) on the device. Best for
/// repeated access patterns (model weights). Default.
const CACHED = 0;
/// Memory is uncached (write-combining). Best for streaming
/// writes (input data uploaded once, consumed once).
const WRITE_COMBINING = 1 << 0;
/// Memory is page-aligned (minimum 4 KiB alignment).
/// Required for mmap() access from userspace.
const PAGE_ALIGNED = 1 << 1;
/// Memory is accessible for peer-to-peer DMA (future, Phase 4+).
const P2P_ACCESSIBLE = 1 << 2;
}
}
/// Accelerator operations forwarded from a remote peer.
#[repr(u8)]
pub enum AccelServiceOp {
/// Create a new compute/inference context on the device.
/// Returns a context handle used for subsequent submissions.
CreateContext { context_type: AccelContextType },
/// Destroy a previously created context.
DestroyContext { context: AccelContextHandle },
/// Submit a command buffer for execution. The command buffer data
/// is transferred via push_page() into a pre-registered server-side
/// data region before this message is sent.
SubmitCmdBuf {
context: AccelContextHandle,
/// Offset within the server's input data region where the command
/// buffer data was written (already pushed by the client via the
/// peer transport before sending this op).
cmd_region_offset: u64,
cmd_len: u32,
},
/// Query device capabilities (compute units, memory, supported formats).
GetDeviceInfo,
/// Allocate device memory for DMA. Returns a transport-registered
/// data region that the remote peer can write into via the peer
/// transport.
/// See `DmaAllocFlags` for valid flag combinations.
AllocDeviceMem { size: u64, flags: DmaAllocFlags },
/// Free previously allocated device memory.
FreeDeviceMem { mem: DeviceMemHandle },
}
/// Completion sent back to the requesting peer.
pub struct AccelServiceCompletion {
/// Matches the request.
pub request_id: u64,
/// 0 on success, negative errno on failure.
pub status: i32,
/// For CreateContext: the new context handle.
/// For SubmitCmdBuf: execution result metadata.
/// For AllocDeviceMem: region descriptor for the allocated memory.
pub result: AccelCompletionResult,
}
/// Result payload in an `AccelServiceCompletion`. Tagged union discriminated
/// by the original request type.
pub enum AccelCompletionResult {
/// CreateContext succeeded: returns the new context handle.
Context { handle: AccelContextHandle },
/// SubmitCmdBuf succeeded: returns execution metadata.
Submit {
/// GPU/NPU timestamp at command completion (device clock ticks).
device_timestamp: u64,
/// Elapsed execution time on device (nanoseconds).
elapsed_ns: u64,
},
/// AllocDeviceMem succeeded: returns region offset and size.
Memory { region_offset: u64, size: u64 },
/// GetDeviceInfo: returns serialized device capability descriptor.
DeviceInfo { caps_len: u32 },
/// FreeDeviceMem: no payload (status field carries success/failure).
None,
}
AccelCompletionResult wire encoding (tagged union): the result is serialized as a 1-byte discriminant followed by variant-specific payload. Total fixed size per result: 24 bytes (padded).
discriminant 0 (Context): handle: u64, _pad: [u8; 15]
discriminant 1 (Submit): device_timestamp: u64, elapsed_ns: u64, _pad: [u8; 7]
discriminant 2 (Memory): region_offset: u64, size: u64, _pad: [u8; 7]
discriminant 3 (DeviceInfo): caps_len: u32, _pad: [u8; 19]
(caps data follows via bulk push if caps_len > 0)
discriminant 4 (None): _pad: [u8; 23]
Wire layout of full AccelServiceCompletion:
request_id: u64 (8 bytes, offset 0)
status: i32 (4 bytes, offset 8)
discriminant: u8 (1 byte, offset 12)
_pad: [u8; 3] (3 bytes, offset 13)
result_payload: [u8; 24] (variant data + padding, offset 16)
Total: 40 bytes, naturally aligned.
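The stated layout can be checked with a plain userspace encoder. This sketch follows the full-message layout above (discriminant hoisted to offset 12, a 24-byte result payload at offset 16); the encoder and `submit_payload` helper are illustrative, not the kernel's serializer.

```rust
/// Encode an AccelServiceCompletion per the wire layout above:
/// request_id @0 (u64 LE), status @8 (i32 LE), discriminant @12 (u8),
/// 3 pad bytes, 24-byte result payload @16. Total: 40 bytes.
fn encode_completion(request_id: u64, status: i32, disc: u8, payload: &[u8]) -> [u8; 40] {
    assert!(payload.len() <= 24, "result payload is at most 24 bytes");
    let mut buf = [0u8; 40];
    buf[0..8].copy_from_slice(&request_id.to_le_bytes());
    buf[8..12].copy_from_slice(&status.to_le_bytes());
    buf[12] = disc;
    // bytes 13..16 are padding (already zero)
    buf[16..16 + payload.len()].copy_from_slice(payload);
    buf
}

/// Build the discriminant-1 (Submit) variant data: two u64s,
/// device_timestamp then elapsed_ns, trailing bytes padded with zero.
fn submit_payload(device_timestamp: u64, elapsed_ns: u64) -> Vec<u8> {
    let mut p = Vec::with_capacity(16);
    p.extend_from_slice(&device_timestamp.to_le_bytes());
    p.extend_from_slice(&elapsed_ns.to_le_bytes());
    p
}
```

Because every variant pads to the same 24 bytes, completions can be consumed from a fixed-stride ring without per-variant length handling.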
Wire protocol: command buffer submission over the peer protocol. Bulk data
(command buffers, input tensors) is transferred via push_page() — the client
writes data directly into a pre-registered server-side data region, then sends
a lightweight SubmitCmdBuf control message. Completions (including output
region offsets) are returned via ring pair send.
DMA buffers: both client and server pre-register data regions with the peer transport at context creation time. The client pushes input data into the server's registered region; the server pushes output data into the client's registered region. Zero CPU copies on both sides.
Exportable context types: compute contexts (CUDA/OpenCL equivalent) and inference contexts (ML model execution) are exportable. Display/render contexts are NOT exportable — rendering requires sub-millisecond latency to the display pipeline, which is incompatible with network round-trip times. Remote rendering remains a userspace concern (VNC, RDP, GPU remoting protocols).
Capability gating: remote accelerator access requires CAP_ACCEL_REMOTE
(Section 9.1). Discovery via
PeerRegistry::peers_with_cap(ACCELERATOR).
For peer accelerator devices (firmware shim implementing the umka-protocol), an accelerator service provider is unnecessary: the device is directly addressable as a peer, and remote hosts submit work via the peer protocol without any export layer on the host.
Kubernetes integration: when a Kubelet schedules a pod requiring GPU resources, the cluster scheduler can select a remote GPU. The pod's userspace runtime (CUDA, OpenCL) communicates through the standard device interface; the kernel transparently routes to the remote accelerator via the service provider. The pod does not need modification.
22.7.3.1 Accelerator Service Client (Consuming Peer)¶
On the consuming peer, the accelerator service client creates a device node, manages remote context lifecycle, and enforces local cgroup quotas for remote accelerator usage.
AccelServiceProperties (32-byte discovery descriptor, carried in
PeerServiceDescriptor.properties):
/// Accelerator capabilities advertised during service discovery.
/// Returned in PeerServiceDescriptor.properties for ACCELERATOR services.
/// Enables clients to select an appropriate accelerator before binding.
#[repr(C)]
pub struct AccelServiceProperties {
/// Device memory in bytes. Zero for devices without dedicated memory
/// (e.g., integrated GPUs sharing system RAM).
pub device_memory_bytes: u64, // 8 bytes (offset 0)
/// Number of compute units. This is device-specific: NVIDIA SMs,
/// AMD CUs, Intel EUs, generic NPU cores. No normalization —
/// the client uses compute_units + supported_apis + device_name
/// to identify the device type and interpret the count. Cross-vendor
/// performance comparison requires benchmarking, not unit counting.
pub compute_units: u32, // 4 bytes (offset 8)
/// Bitmask of supported APIs.
/// bit 0: CUDA (Nvidia compute)
/// bit 1: OpenCL
/// bit 2: Vulkan Compute (SPIR-V compute shaders)
/// bit 3: UmkaOS native inference API
/// bit 4: RDMA verbs (for RDMA-capable NICs exported as accelerators)
pub supported_apis: u32, // 4 bytes (offset 12)
/// Maximum concurrent contexts the provider allows per client.
pub max_contexts: u16, // 2 bytes (offset 16)
/// Maximum in-flight command buffers per context.
pub max_inflight_per_ctx: u16,// 2 bytes (offset 18)
/// Short device name for diagnostics/sysfs (null-terminated).
pub device_name: [u8; 12], // 12 bytes (offset 20)
// Total: 32 bytes. No padding holes (u64 at offset 0, u32s at 8/12,
// u16s at 16/18, byte array at 20).
}
// AccelServiceProperties: u64(8) + u32(4)*2 + u16(2)*2 + [u8;12] = 32 bytes.
// Wire struct (PeerServiceDescriptor.properties payload).
const_assert!(core::mem::size_of::<AccelServiceProperties>() == 32);
/// Client-side state for a bound remote accelerator. One instance per
/// active ServiceBind to an accelerator service provider.
///
/// Tier assignment: Tier 1 (runs in the accelerator framework isolation domain).
pub struct AccelServiceClient {
/// ServiceBind connection to the remote accelerator provider.
connection: ServiceBindHandle,
/// Peer providing the accelerator.
peer_id: PeerId,
/// Cached device properties from discovery.
properties: AccelServiceProperties,
/// Device node path: /dev/umka_accel/peer{N}_dev{M}.
/// Created at bind time, removed on unbind.
device_node: DeviceNodeHandle,
/// Active contexts on this device. Maps context handle (u64) to
/// client-side tracking state. XArray for O(1) lookup on the
/// submission hot path (integer-keyed, per collection policy in
/// [Section 3.6](03-concurrency.md#lock-free-data-structures)).
contexts: XArray<AccelClientContext>,
/// Pre-registered data region for input data. Client pushes command
/// buffers and input tensors here for the provider to read.
input_region: ServiceDataRegion,
/// Pre-registered data region for output data. Provider pushes
/// results here for the client to read.
output_region: ServiceDataRegion,
/// In-flight command buffer tracking. Maps request_id (u64) to
/// submission metadata. XArray, integer-keyed.
inflight: XArray<AccelInflight>,
/// Next request ID. Monotonically increasing u64.
next_request_id: AtomicU64,
/// cgroup controller reference for quota enforcement.
cgroup_ctrl: Option<AccelCgroupRef>,
}
/// Per-context client state for a remote accelerator context.
struct AccelClientContext {
/// Remote context handle returned by CreateContext.
remote_handle: AccelContextHandle,
/// Context type (compute, inference).
context_type: AccelContextType,
/// Number of in-flight command buffers for this context.
inflight_count: AtomicU32,
/// Cumulative GPU-time charged to this context (nanoseconds).
/// Used for cgroup accounting.
charged_gpu_ns: AtomicU64,
/// Owning userspace file descriptor. Used to destroy context
/// when the fd is closed.
owner_fd: FileDescriptor,
}
/// Tracks an in-flight command buffer submission.
struct AccelInflight {
/// Context this submission belongs to.
context_handle: AccelContextHandle,
/// Submission timestamp (for timeout detection).
submit_time: Instant,
/// Userspace completion cookie (returned to userspace on completion).
user_cookie: u64,
}
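The 32-byte layout claim for `AccelServiceProperties` can be verified from plain userspace Rust by restating the `#[repr(C)]` struct (field names and types copied from the definition above) and checking its size and alignment:

```rust
/// Userspace restatement of the wire struct above, solely to verify
/// the claimed layout: u64 @0, two u32s @8/@12, two u16s @16/@18,
/// 12-byte name @20 — 32 bytes, no padding holes.
#[repr(C)]
pub struct AccelServiceProperties {
    pub device_memory_bytes: u64,
    pub compute_units: u32,
    pub supported_apis: u32,
    pub max_contexts: u16,
    pub max_inflight_per_ctx: u16,
    pub device_name: [u8; 12],
}
```
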
Device node creation: At ServiceBind time, after receiving
AccelServiceProperties, the client registers the remote accelerator with the
accelerator framework (Section 22.3) and creates a
device node at /dev/umka_accel/peer{N}_dev{M} where N is the provider's peer
index and M is the device index within that peer. A sysfs entry is created under
/sys/class/accel/peer{N}_dev{M}/ with attributes:
- `compute_units`: number of compute units
- `memory_bytes`: device memory size
- `supported_apis`: API bitmask (human-readable in sysfs)
- `peer`: provider peer ID
- `device_name`: from `AccelServiceProperties.device_name`
Userspace libraries (CUDA runtime, OpenCL ICD loader, Vulkan loader) discover
the device through the standard /dev/ enumeration or /sys/class/accel/ scan.
Connection handshake: At ServiceBind time, the client sends GetDeviceInfo
as the first operation. The provider returns a full capability descriptor
(compute units, memory layout, supported command buffer formats). The client
validates that supported_apis matches what was advertised in discovery, registers
data regions with the peer transport for bulk data transfer, and creates the device
node. If
validation fails, the client sends ServiceUnbind and returns ENODEV to the
discovery caller.
Context lifecycle:
1. Userspace opens `/dev/umka_accel/peer{N}_dev{M}` and issues a `CREATE_CONTEXT` ioctl specifying context type (compute or inference).
2. The client kernel sends `AccelServiceOp::CreateContext` to the provider.
3. The provider creates the context on the physical device and returns the handle in `AccelCompletionResult::Context`.
4. The client stores the handle in the `contexts` XArray and returns a local context ID to userspace.
5. Userspace submits command buffers via the `SUBMIT_CMD_BUF` ioctl, referencing the context. The client pushes the command buffer data into the provider's registered input region, then sends `SubmitCmdBuf` with the region offset.
6. On completion, the provider sends `AccelServiceCompletion`. The client removes the inflight entry, charges GPU time to the cgroup, and delivers the completion to userspace (via eventfd or poll).
7. On fd close, the client sends `DestroyContext` for all contexts owned by that fd. The provider releases device resources.
Userspace ioctl interface:
/// Accelerator device ioctls. Issued on an open fd to
/// `/dev/umka_accel/peer{N}_dev{M}`.
pub const ACCEL_IOC_CREATE_CONTEXT: u32 = 0xAC01;
pub const ACCEL_IOC_DESTROY_CONTEXT: u32 = 0xAC02;
pub const ACCEL_IOC_SUBMIT_CMD_BUF: u32 = 0xAC03;
pub const ACCEL_IOC_WAIT_COMPLETION: u32 = 0xAC04;
pub const ACCEL_IOC_ALLOC_MEM: u32 = 0xAC05;
pub const ACCEL_IOC_FREE_MEM: u32 = 0xAC06;
pub const ACCEL_IOC_GET_DEVICE_INFO: u32 = 0xAC07;
/// Submission descriptor passed to ACCEL_IOC_SUBMIT_CMD_BUF.
#[repr(C)]
pub struct AccelSubmitDesc {
/// Context ID (from ACCEL_IOC_CREATE_CONTEXT return value).
pub context_id: u64,
/// Userspace pointer to command buffer data.
pub cmd_buf_ptr: u64,
/// Length of command buffer in bytes.
pub cmd_buf_len: u32,
/// User-defined cookie returned on completion.
pub user_cookie: u64,
pub _pad: u32,
}
// AccelSubmitDesc: u64(8)*2 + u32(4) + 4pad + u64(8) + u32(4) + 4pad = 40 bytes.
// Userspace ABI struct (ioctl argument).
const_assert!(core::mem::size_of::<AccelSubmitDesc>() == 40);
/// Completion event delivered to userspace.
#[repr(C)]
pub struct AccelCompletionEvent {
/// User cookie from the original submission.
pub user_cookie: u64,
/// 0 on success, negative errno on failure.
pub status: i32,
/// GPU execution time in nanoseconds.
pub elapsed_ns: u64,
pub _pad: u32,
}
// AccelCompletionEvent: u64(8) + i32(4) + 4pad + u64(8) + u32(4) + 4pad = 32 bytes.
// Userspace ABI struct (delivered via read on accelerator fd).
const_assert!(core::mem::size_of::<AccelCompletionEvent>() == 32);
Completion delivery: each open accelerator fd has an associated eventfd.
The kernel signals the eventfd when a completion is available. Userspace uses
poll()/epoll() on the eventfd, then calls ACCEL_IOC_WAIT_COMPLETION to
read completion events (up to 32 per call, batched). This is the same pattern
as io_uring's eventfd notification — low overhead, epoll-compatible.
Submission timeout: command buffer submissions time out after 120 seconds
(configurable via sysfs at /sys/class/accel/peer{N}_dev{M}/cmd_timeout_sec).
On timeout, the client sends AccelServiceOp::DestroyContext for the timed-out
context (GPU hang recovery requires context destruction), completes all pending
submissions for that context with -ETIME, and notifies userspace. The context
must be recreated by the application. This matches GPU reset semantics --
in-flight work is lost.
Context isolation on timeout: GPU hang recovery is per-context, not per-device, when the hardware supports it (NVIDIA: per-context channel reset; AMD: per-queue reset; Intel: per-context GuC reset). The provider:
- Attempts per-context reset first (kills the hung context only).
- If per-context reset fails (hardware does not support it, or the hang is device-wide): falls back to full device reset.
- On full device reset: ALL contexts on the device are destroyed. All clients with contexts on that device receive `-ETIME` completions for all in-flight submissions. Each client must recreate contexts.
- The provider reports the reset scope in the completion:
  - `status = -ETIME`: only this context was destroyed (isolated reset).
  - `status = -ENODEV`: full device reset, all contexts destroyed.
Clients should check status and react accordingly (recreate one context vs. full rediscovery of the device).
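The client-side reaction reduces to a dispatch on the completion status. A minimal sketch; the `RecoveryAction` names are illustrative, not part of the ABI (errno values are the standard Linux ones):

```rust
/// Illustrative recovery actions; not part of the kernel ABI.
#[derive(Debug, PartialEq)]
enum RecoveryAction {
    /// Isolated reset: recreate only the affected context.
    RecreateContext,
    /// Full device reset: all contexts are gone; re-run device discovery.
    RediscoverDevice,
    /// Normal completion or unrelated error: no reset handling needed.
    None,
}

// Standard Linux errno values.
const ETIME: i32 = 62;
const ENODEV: i32 = 19;

/// Map a completion status to the recovery action described in the text.
fn recovery_action(status: i32) -> RecoveryAction {
    match status {
        s if s == -ETIME => RecoveryAction::RecreateContext,
        s if s == -ENODEV => RecoveryAction::RediscoverDevice,
        _ => RecoveryAction::None,
    }
}
```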
Memory management: AccelServiceOp::AllocDeviceMem allocates memory on the
remote device and returns a region offset for direct access via the peer transport.
AllocDeviceMem error codes:
| Status | Meaning |
|---|---|
| `-ENOMEM` | Device memory exhausted (no free memory on device) |
| `-EDQUOT` | cgroup memory quota exceeded (client's `accel.mem_limit`) |
| `-EINVAL` | Invalid flags combination or size = 0 |
| `-ENOSYS` | Device does not support DMA allocation (no dedicated memory) |

The allocation is exposed as a "device memory" region visible to userspace via `mmap()` on the device fd. Userspace can then push input data (tensors, textures) into device memory and fetch output data back via the peer transport, achieving zero-copy data transfer between client userspace and remote device memory. The transport operations are initiated by the kernel on behalf of userspace (userspace does `read()`/`write()` or `mmap()` + memcpy; the kernel translates to peer transport operations).
Failure handling: If the provider disconnects (ServiceDrainNotify, link failure, or timeout), the client:
1. Sets a `disconnected` flag on the `AccelServiceClient`.
2. Completes all in-flight submissions with status `-EIO`.
3. Destroys all `AccelClientContext` entries. Userspace reads/writes/ioctls on open fds return `-ENODEV`.
4. Removes the device node from `/dev/umka_accel/`.

Userspace must re-discover and re-bind to recreate contexts. There is no transparent reconnection -- accelerator state (loaded models, allocated memory, compiled kernels) is lost on disconnect.
cgroup enforcement: Remote accelerator usage counts against the client's
local cgroup GPU time quota, not the provider's. The client kernel tracks
cumulative elapsed_ns from AccelServiceCompletion and charges it to the
submitting process's AccelCgroupController
(Section 17.2). This is enforced locally:
- Before submitting a command buffer, the client checks the cgroup's remaining GPU time budget. If exhausted, the submission blocks (or returns `EAGAIN` for non-blocking fds) until the next accounting period.
- The provider is not trusted for quota enforcement. It may serve multiple clients from different clusters, each with different policies. The client kernel is the sole enforcer of its own cgroup policy.
- GPU memory allocation (`AllocDeviceMem`) is charged against the cgroup's memory limit for accelerator devices.
Charging granularity: GPU time is charged on completion, not submission.
The elapsed_ns from AccelServiceCompletion is added to the context's
charged_gpu_ns counter. The cgroup controller aggregates across all contexts
for the process group.
Accounting period: 100ms (configurable via cgroup parameter
accel.period_ms). At each period boundary, the charged GPU time is compared
against the quota (accel.quota_us). If over quota, new submissions block
until the next period.
Rounding: sub-microsecond elapsed_ns values are rounded UP to 1 us
to prevent zero-cost submissions that bypass quota.
Scope limitations (Phase 2):
- No peer-to-peer GPU memory access (GPUDirect between remote GPUs). Requires multi-hop RDMA path setup -- deferred to Phase 4+.
- No display/render context forwarding. Rendering requires sub-millisecond latency to the display pipeline, incompatible with network RTT. Remote rendering remains a userspace concern (VNC, SPICE, GPU remoting protocols).
- No shader compilation forwarding. The client compiles shaders/kernels locally and sends binary command buffers. This avoids trusting the provider with proprietary shader source and eliminates compiler version skew.
22.8 Unified Compute Model¶
22.8.1 The Convergence Problem¶
The architecture currently treats CPU scheduling (Section 7.1) and accelerator scheduling (Section 22.2) as separate worlds:
World 1 — CPU Scheduler (Section 7.1):
Input: threads (instruction streams)
Resources: CPU cores (P-core, E-core, RISC-V harts)
Decision: which core runs this thread?
Cgroup: cpu.max, cpu.weight
World 2 — Accelerator Scheduler (Section 22.1.2.4):
Input: command buffers (GPU kernels, inference requests)
Resources: accelerator contexts (GPU CUs, NPU engines)
Decision: which context gets device time?
Cgroup: accel.compute.max, accel.compute.weight
These worlds share no abstraction. The kernel cannot answer: "given a fixed power budget and a matrix workload, is it more efficient to run on CPU-AMX, GPU, or NPU?" It cannot balance compute load across device types. It cannot make holistic energy decisions.
Meanwhile, the hardware is converging:
| Trend | Example | Implication |
|---|---|---|
| CPU gains matrix ops | Intel AMX (P-cores), ARM SME | CPU can do what GPUs used to do |
| CPU+GPU share memory | APU (AMD), Apple M-series, Grace Hopper NVLink-C2C | No DMA copy between CPU↔GPU |
| Heterogeneous ISA within CPUs | RISC-V: some harts have Vector, some don't (Section 7.2) | "CPU with Vector" vs "GPU CU" is a matter of degree |
| CXL 3.0 shared memory | Samsung CMM-H, Intel Ponte Vecchio + CXL | Hardware-coherent memory shared by CPU and accelerator |
| On-die NPU | Intel Meteor Lake NPU, Qualcomm Hexagon | NPU is as close to CPU as an E-core |
The conceptual leap from Section 7.2 (RISC-V harts with different ISA extensions) to "GPU CU as another compute unit type" is small. Both are compute resources with different capability profiles and power characteristics.
22.8.2 Design Principle: Overlay, Not Replacement¶
Critical constraint: This must work Day 1 as a Linux drop-in replacement. NVIDIA's proprietary userspace (libcuda, libnvidia-ml, cuDNN) runs unmodified. CUDA applications explicitly target the GPU — the kernel does NOT redirect them. AMD ROCm, Intel oneAPI, all work as-is.
The unified compute model is an advisory overlay on top of the existing separate schedulers:
┌──────────────────────────────┐
│  Unified Compute Topology    │ ← NEW (advisory)
│  Multi-dimensional capacity  │
│  Cross-device energy model   │
└───────┬───────────────┬──────┘
        │               │
┌───────▼──────┐ ┌──────▼─────────────┐
│CPU Scheduler │ │ Accel Sched        │ ← UNCHANGED
│ CFS + EAS    │ │ CBS + Prio         │
│(Section 7.1) │ │(Section 22.1.2.4)  │
└──────────────┘ └────────────────────┘
- Existing schedulers continue to make execution decisions independently.
- The unified layer provides topology, capacity, and energy data that both schedulers and userspace runtimes can consume.
- No vendor must rewrite anything. Benefits accrue from kernel-side visibility.
22.8.3 Multi-Dimensional Compute Capacity¶
Section 7.2 defines CpuCapacity as a single scalar (0–1024). This works for
heterogeneous CPUs because all cores execute the same type of work (general-purpose
instructions) at different speeds.
Once accelerators enter the picture, capacity becomes a vector — different compute units excel at different workload types:
// umka-core/src/compute/capacity.rs
/// Multi-dimensional capacity profile for any compute unit.
///
/// Values are ABSOLUTE (device-intrinsic), not normalized to the system.
/// Each dimension is in hardware-specific units that do not change when
/// devices are hot-plugged. The kernel maintains a per-system max for
/// each dimension (updated lazily on device arrival/departure) and
/// normalizes to 0–1024 ONLY for sysfs display (compute.capacity_normalized).
///
/// Why absolute: normalizing internally means hot-plugging a faster GPU
/// would silently change every other device's capacity values, breaking
/// comparisons across snapshots and racing with concurrent readers.
///
/// Used for:
/// - Power budgeting: informed cross-device throttling (Section 7.4)
/// - Intent-based management: workload-to-device advisory (Section 7.7)
/// - Userspace runtime hints: exposed via sysfs for OpenCL/SYCL
///
/// NOT used for: actual scheduling decisions (those remain in
/// CPU scheduler and AccelScheduler respectively).
pub struct ComputeCapacityProfile {
/// Scalar integer throughput (million instructions per second, MIPS-equivalent).
/// CPU P-core: ~50000. GPU CU: ~200. NPU: 0.
pub scalar: u32,
/// Vector/SIMD throughput (GFLOPS single-precision equivalent).
/// GPU CU: ~2000. CPU P-core with AVX-512: ~300. NPU: 0.
pub vector: u32,
/// Matrix throughput (TOPS, tera-operations per second for matmul).
/// NPU: ~40. GPU tensor core: ~300. CPU AMX: ~5. CPU scalar: ~0.
pub matrix: u32,
/// Memory bandwidth (GB/s).
/// GPU HBM3: ~3000. CPU DDR5: ~100. NPU on-chip SRAM: ~500.
pub memory_bw: u32,
/// Launch overhead (inverse, microseconds to first useful work).
/// CPU: 1 (thread wakeup ~1μs). GPU: 30 (kernel launch ~30μs).
/// NPU: 200 (model load + inference setup).
/// Lower = better. Determines crossover: small tasks favor CPU.
pub launch_overhead_us: u32,
}
Example profiles on a Grace Hopper system (ARM CPU + H100 GPU):
CPU Grace core (ARM Neoverse):
scalar=45000 vector=300 matrix=2 memory_bw=100 launch_overhead_us=1
GPU H100 SM:
scalar=200 vector=2000 matrix=300 memory_bw=3000 launch_overhead_us=30
Intel Alder Lake system with discrete GPU and NPU:
CPU P-core: scalar=50000 vector=300 matrix=5 memory_bw=80 launch_overhead_us=1
CPU E-core: scalar=25000 vector=150 matrix=2 memory_bw=80 launch_overhead_us=1
iGPU EU: scalar=100 vector=600 matrix=10 memory_bw=50 launch_overhead_us=20
Discrete GPU: scalar=200 vector=2000 matrix=250 memory_bw=900 launch_overhead_us=30
NPU: scalar=0 vector=0 matrix=40 memory_bw=50 launch_overhead_us=200
RISC-V SoC with heterogeneous harts + accelerators:
Hart (RV64GC): scalar=15000 vector=0 matrix=0 memory_bw=30 launch_overhead_us=1
Hart (RV64GCV): scalar=15000 vector=200 matrix=1 memory_bw=30 launch_overhead_us=1
Custom ML hart: scalar=3000 vector=100 matrix=20 memory_bw=60 launch_overhead_us=5
Attached NPU: scalar=0 vector=0 matrix=10 memory_bw=40 launch_overhead_us=200
(These are illustrative values. Actual values are populated at device registration
time from driver-reported hardware capability queries. Sysfs normalizes to 0-1024
per dimension for display, where best-in-system = 1024.)
Key property: CPU cores already have entries (derived from CpuCapacity in Section 7.2).
Accelerators get profiles from AccelBase get_utilization (Section 22.1) extended with
a get_capacity_profile vtable entry. This is a minor KABI extension — one new
function pointer that returns static data.
// Appended to AccelBaseVTable per Section 12.1.4 versioning rules (Option<fn> = backward compat).
// Drivers compiled against older KABI versions see this field as None (vtable_size check).
/// Return the multi-dimensional compute capacity profile for this device.
///
/// The returned `ComputeCapacityProfile` describes the device's intrinsic
/// compute capabilities using absolute, hardware-specific units (not
/// system-normalized). The kernel calls this once at device registration
/// and caches the result in the `ComputeUnit` entry in the unified
/// compute topology ([Section 22.8](#unified-compute-model--unified-compute-topology)).
///
/// If the driver does not implement this entry (legacy or compat drivers),
/// the kernel estimates a profile from `AccelDeviceClass` and
/// `get_utilization()` telemetry. Drivers that provide this entry get
/// a precise profile without kernel-side guesswork.
///
/// The profile is treated as static for the lifetime of the device
/// registration. If device capabilities change (e.g., FPGA reconfiguration),
/// the driver must unregister and re-register to update the profile.
///
/// **KABI version note**: Added in KABI version 1.2. Callers must check
/// `vtable_size >= offset_of!(AccelBaseVTable, get_capacity_profile) + size_of::<Option<fn>>()`
/// before invoking; treat as `None` if not present.
pub get_capacity_profile: Option<unsafe extern "C" fn(
ctx: *mut c_void,
out_profile: *mut ComputeCapacityProfile,
) -> IoResultCode>,
22.8.4 Unified Compute Topology¶
The device registry (Section 11.4) already models all devices in one tree. The unified compute layer adds a compute view that flattens this into a map of compute units with their capabilities, power profiles, and memory domains:
// umka-core/src/compute/topology.rs
/// A compute unit in the unified topology.
/// Can be a CPU core, GPU SM, NPU engine, DSP core, etc.
pub struct ComputeUnit {
/// Device registry node ID.
pub device_id: DeviceNodeId,
/// What kind of compute unit this is.
pub unit_type: ComputeUnitType,
/// Multi-dimensional capacity profile.
pub capacity: ComputeCapacityProfile,
/// Which memory domain is local to this compute unit?
/// CPU cores → system RAM NUMA node.
/// GPU SMs → VRAM NUMA node (Section 22.2).
/// APU GPU → same NUMA node as CPU (shared memory).
pub memory_domain: MemoryDomainId,
/// Is memory shared with CPU without DMA copy?
/// true for: APU, Apple M-series, Grace Hopper, CXL-attached accelerator.
/// false for: discrete PCIe GPU (data must be explicitly transferred).
pub memory_unified_with_cpu: bool,
/// Energy model: OPP table (same format as Section 7.1.5.2).
/// GPU OPPs come from the driver via AccelBase get_utilization/set_performance_level.
pub energy_model: Option<EnergyModelRef>,
/// Current utilization (0–1024), updated periodically.
/// CPU: from PELT (Section 7.1.5.4).
/// Accelerator: from AccelBase get_utilization (Section 22.1.2.2).
pub utilization: AtomicU32,
}
#[repr(u32)]
pub enum ComputeUnitType {
/// General-purpose CPU core. Managed by CPU scheduler (Section 7.1).
CpuCore = 0,
/// GPU compute unit (SM/CU/EU). Managed by AccelScheduler (Section 22.1.2.4).
GpuCompute = 1,
/// Neural processing unit. Managed by AccelScheduler.
NpuEngine = 2,
/// Digital signal processor. Managed by AccelScheduler.
DspCore = 3,
/// FPGA reconfigurable logic. Managed by AccelScheduler.
FpgaSlot = 4,
/// Computational storage processor (Section 15.8). Managed by AccelScheduler.
CsdProcessor = 5,
/// Mixed workload — no single domain exceeds the dominance threshold.
Mixed = 6,
}
Population: The topology is built at boot and updated on hot-plug:
1. CPU cores: discovered by existing CPU topology (Section 7.1.5.10).
ComputeCapacityProfile derived from CpuCapacity + IsaCapabilities.
2. Accelerators: discovered by device registry (Section 11.4) when AccelBase driver loads.
Driver provides ComputeCapacityProfile via get_capacity_profile().
If driver doesn't implement it (legacy, compat):
→ kernel estimates from AccelDeviceClass + get_utilization().
→ NVIDIA compat driver: AccelDeviceClass::GpuCompute,
bandwidth/utilization from nvidia-smi equivalent queries.
→ No NVIDIA code change needed. Kernel reads existing telemetry.
22.8.5 Cross-Device Energy Optimization¶
The power budgeting system (Section 7.7) currently reads power per domain (CPU package, DRAM, Accelerator) and throttles independently. With unified topology, it can make informed cross-device decisions:
Current (independent throttling):
Container exceeds power.max.
→ Throttle CPU (reduce frequency).
→ Throttle GPU (reduce clock).
→ Both throttled equally. Dumb.
With unified compute awareness:
Container exceeds power.max.
→ Kernel reads workload profile:
80% of compute is matrix ops (GPU-bound).
20% is scalar (CPU, mostly waiting for GPU).
→ Informed decision:
Keep GPU at high clock (it's doing the useful work).
Aggressively throttle CPU (it's mostly idle-waiting anyway).
Save more power with less performance loss.
This requires no change to throttle mechanisms — just better information for
PowerBudgetEnforcer (Section 7.7) to decide WHICH domain to throttle.
// Extension to umka-core/src/power/budget.rs
impl PowerBudgetEnforcer {
/// When a cgroup exceeds its power budget, decide which domain to throttle.
/// Uses unified compute topology to understand where useful work is happening.
fn select_throttle_target(
&self,
cgroup: CgroupId,
excess_mw: u32,
) -> ArrayVec<ThrottleAction, MAX_POWER_DOMAINS> {
let topology = unified_compute_topology();
let workload = cgroup_workload_profile(cgroup);
// Score each domain by "usefulness" = domain's contribution to
// the cgroup's primary workload type.
// Throttle the LEAST useful domain first.
// domains is bounded by MAX_POWER_DOMAINS (no heap allocation).
let mut domains: ArrayVec<_, MAX_POWER_DOMAINS> = self.domains.iter()
.filter(|d| d.cgroup_attribution(cgroup) > 0)
.map(|d| (d, workload.usefulness_score(d, &topology)))
.collect();
// Sort: least useful first (throttle first).
domains.sort_by_key(|(_, score)| *score);
// Apply throttle actions starting from least useful domain
// until excess_mw is recovered.
self.build_throttle_plan(&domains, excess_mw)
}
}
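The `usefulness_score` used above is not specified in this section. One plausible form, assuming capacity values are normalized to 0-1024 per dimension, is a demand-weighted dot product of the workload's fraction vector with the domain's capacity; a hedged sketch under that assumption (struct names illustrative):

```rust
/// Per-mille workload fractions (subset of WorkloadProfile).
struct Fractions {
    scalar: u32,
    vector: u32,
    matrix: u32,
}

/// Capacity a power domain contributes in each dimension, normalized to
/// 0-1024 per dimension (absolute units are not comparable across
/// dimensions: MIPS vs GFLOPS vs TOPS).
struct DomainCapacity {
    scalar: u32,
    vector: u32,
    matrix: u32,
}

/// Illustrative usefulness score: weight each capacity dimension by the
/// workload's demand for it. Higher = more useful = throttle LAST.
fn usefulness_score(w: &Fractions, d: &DomainCapacity) -> u64 {
    w.scalar as u64 * d.scalar as u64
        + w.vector as u64 * d.vector as u64
        + w.matrix as u64 * d.matrix as u64
}
```

For the 80%-matrix workload from the example, a GPU domain (matrix-normalized near 1024) scores far above the CPU domain (matrix near 17), so the CPU is throttled first, matching the narrative above.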
22.8.6 Workload Profile Classification¶
Intel Thread Director (Section 7.2) classifies CPU workloads by instruction mix. Generalize this to a system-wide workload profile that covers all compute domains:
// umka-core/src/compute/classify.rs
/// System-wide workload classification for a cgroup or process.
/// Updated periodically (~1 second) from multiple sources.
///
/// Fractions are fixed-point: 0–1000 representing 0.0%–100.0%.
/// No floating-point in kernel data structures (kernel does not use FPU).
///
/// Invariant: `scalar_fraction + vector_fraction + matrix_fraction <= 1000`.
/// These three fields partition the compute demand; their sum must not exceed 1000.
/// `accel_wait_fraction` and `memory_bound_fraction` are independent blocking-time
/// metrics and are not included in the partition sum.
/// Values that would push the sum above 1000 are saturated at 1000 by the
/// classifier before storing; the dominant fraction absorbs any excess.
pub struct WorkloadProfile {
/// Fraction of compute demand that is scalar (0–1000).
/// Source: PELT utilization on CPU cores + ITD hints.
pub scalar_fraction: u32,
/// Fraction of compute demand that is vector/SIMD (0–1000).
/// Source: hardware performance counters (SIMD instruction retired).
pub vector_fraction: u32,
/// Fraction of compute demand that is matrix/tensor (0–1000).
/// Source: AMX/SME counters on CPU, utilization reports from AccelBase.
pub matrix_fraction: u32,
/// Fraction of time spent waiting for accelerator completion (0–1000).
/// Source: CPU scheduler (time in interruptible sleep waiting for accel).
/// High value = GPU-bound workload.
pub accel_wait_fraction: u32,
/// Fraction of time spent waiting for memory/I/O (0–1000).
/// Source: hardware counters (LLC miss rate, stall cycles).
pub memory_bound_fraction: u32,
/// Dominant compute domain for this workload.
/// Derived from the fractions above.
pub dominant_domain: ComputeUnitType,
}
dominant_domain derivation algorithm: The dominant_domain field is computed from
scalar_fraction, vector_fraction, and matrix_fraction using the following rule:
/// Compute the dominant compute domain from the workload fractions.
/// All fractions are per-mille (0–1000 = 0.0%–100.0%).
///
/// A domain is "dominant" only if its fraction is at least 400 (40.0%).
/// If no single domain clears the threshold, `ComputeUnitType::Mixed` is returned.
/// This prevents spurious dominant-domain assignments when work is evenly distributed.
fn compute_dominant_domain(profile: &WorkloadProfile) -> ComputeUnitType {
let candidates = [
(ComputeUnitType::CpuCore, profile.scalar_fraction),
(ComputeUnitType::GpuCompute, profile.vector_fraction),
(ComputeUnitType::NpuEngine, profile.matrix_fraction),
];
let (domain, max_frac) = candidates
.iter()
.max_by_key(|(_, f)| *f)
.copied()
.unwrap();
// Must exceed 40% (400 per-mille) to be considered dominant; otherwise Mixed.
if max_frac >= 400 {
domain
} else {
ComputeUnitType::Mixed
}
}
This function is called at the end of each profile update cycle (after EMA smoothing and
saturation) to refresh dominant_domain. Callers must not set dominant_domain directly —
the field is always derived, never policy-driven input.
Fractions invariant and validation: The sum scalar_fraction + vector_fraction + matrix_fraction
must not exceed 1000. accel_wait_fraction and memory_bound_fraction are independent
blocking-time metrics and are not included in the partition sum; each may range 0–1000
independently.
If a policy update would cause the partition sum to exceed 1000, the kernel rejects the
update with EINVAL and logs a KERN_WARNING. Callers must normalize fractions before
submitting. The validate() method enforces this invariant; it is called by the policy
consumer vtable dispatch before applying any profile:
impl WorkloadProfile {
/// Validate the fraction invariant.
///
/// Returns `Ok(())` if `scalar_fraction + vector_fraction + matrix_fraction <= 1000`
/// and all fields are within [0, 1000].
/// Returns `Err(KernelError::InvalidArgument)` otherwise.
pub fn validate(&self) -> Result<(), KernelError> {
let partition_sum = self.scalar_fraction
.saturating_add(self.vector_fraction)
.saturating_add(self.matrix_fraction);
if partition_sum > 1000 {
return Err(KernelError::InvalidArgument);
}
if self.accel_wait_fraction > 1000 || self.memory_bound_fraction > 1000 {
return Err(KernelError::InvalidArgument);
}
Ok(())
}
}
/// Behavioral profile assigned to each `AccelContext` by the accelerator
/// scheduler's classification algorithm. Drives time-slice sizing,
/// preemption priority, and colocation decisions.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum AccelContextProfile {
/// Interactive workloads: inference serving, video encode frames.
/// Short command durations, low variance. Preempts HighThroughput.
LowLatency,
/// Batch compute: training, bulk encode. Long commands, deep queues.
/// Longer time slices; yields to LowLatency.
HighThroughput,
/// Memory-bandwidth-limited: attention layers, large embedding lookups.
/// Colocated with compute-bound contexts (no resource contention).
MemoryBound,
/// Irregular submission patterns: game engines, mixed rendering/compute.
/// Extra queue headroom (10%) for burst absorption.
Bursty,
/// Default profile for contexts that don't match any specialized category.
/// Standard round-robin with CBS bandwidth enforcement.
Balanced,
}
AccelContext WorkloadProfile Classification Algorithm:
The accelerator scheduler assigns a behavioral WorkloadProfile to each AccelContext
based on runtime signals. Classification runs every 100ms (low-overhead background task,
separate from the CPU-level WorkloadProfile above which runs at sched_latency_ns).
Signals sampled per context (all available from the scheduler's per-context counters):
- avg_cmd_duration_us: Exponential moving average of command completion time.
- cmd_stddev_us: Standard deviation of completion times (variability).
- queue_depth: Average number of commands in flight.
- memory_bandwidth_gbps: DMA bandwidth used (from IOMMU counters).
- compute_utilization_pct: Fraction of device time active (from device query).
Classification rules (applied in priority order):
if avg_cmd_duration_us < 50 AND cmd_stddev_us < 10:
→ AccelContextProfile::LowLatency // Interactive (inference serving, video encode frames)
elif avg_cmd_duration_us > 5000 AND queue_depth >= 4:
→ AccelContextProfile::HighThroughput // Batch compute (training, bulk encode)
elif compute_utilization_pct < 30 AND memory_bandwidth_gbps > 50:
→ AccelContextProfile::MemoryBound // Memory-bandwidth-limited (attention layers, large embed)
elif cmd_stddev_us > avg_cmd_duration_us * 0.5:
→ AccelContextProfile::Bursty // Irregular submission patterns (game engines, mixed)
else:
→ AccelContextProfile::Balanced // Default
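The classification rules above translate directly into integer-only Rust. The `ContextSignals` struct is illustrative; the `0.5` multiplier in the Bursty rule becomes a doubling of the standard deviation to keep floating point out of kernel code:

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum AccelContextProfile {
    LowLatency,
    HighThroughput,
    MemoryBound,
    Bursty,
    Balanced,
}

/// Per-context signals sampled by the scheduler (illustrative struct;
/// fields mirror the counters listed above).
struct ContextSignals {
    avg_cmd_duration_us: u32,
    cmd_stddev_us: u32,
    queue_depth: u32,
    memory_bandwidth_gbps: u32,
    compute_utilization_pct: u32,
}

/// Apply the classification rules in priority order.
fn classify(s: &ContextSignals) -> AccelContextProfile {
    if s.avg_cmd_duration_us < 50 && s.cmd_stddev_us < 10 {
        AccelContextProfile::LowLatency
    } else if s.avg_cmd_duration_us > 5000 && s.queue_depth >= 4 {
        AccelContextProfile::HighThroughput
    } else if s.compute_utilization_pct < 30 && s.memory_bandwidth_gbps > 50 {
        AccelContextProfile::MemoryBound
    } else if 2 * s.cmd_stddev_us > s.avg_cmd_duration_us {
        // stddev > 0.5 * avg, kept in integer math
        AccelContextProfile::Bursty
    } else {
        AccelContextProfile::Balanced
    }
}
```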
Profile → scheduling policy mapping:
- LowLatency: Preempt HighThroughput workloads; shorter time slices; wake
immediately on submission.
- HighThroughput: Longer time slices; batch mode enabled; yield to LowLatency.
- MemoryBound: Colocate with compute-bound contexts (memory bandwidth and compute
don't contend on the same hardware resources).
- Bursty: Reserve 10% excess queue depth for burst absorption.
- Balanced: Standard round-robin with CBS bandwidth enforcement.
Profile stability: Profile changes are subject to a minimum hold time of 500ms to prevent thrashing. A profile change is applied only if the new profile has been consistently indicated for 500ms or 5 consecutive classifications.
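The stability rule can be modeled as a small hysteresis state machine: a candidate profile is applied only after 5 consecutive identical classifications, which at the 100ms cadence is the 500ms hold time. A sketch with illustrative names (profiles encoded as `u32` for brevity):

```rust
/// A new profile is applied only after HOLD_COUNT consecutive
/// classifications agree on it (5 × 100ms = 500ms hold time).
const HOLD_COUNT: u32 = 5;

struct ProfileHold {
    current: u32,   // active profile
    candidate: u32, // most recently classified profile
    streak: u32,    // consecutive classifications matching `candidate`
}

impl ProfileHold {
    fn new(initial: u32) -> Self {
        Self { current: initial, candidate: initial, streak: 0 }
    }

    /// Feed one 100ms classification result; returns the active profile.
    fn observe(&mut self, classified: u32) -> u32 {
        if classified == self.current {
            // Back in agreement: reset any pending change.
            self.candidate = classified;
            self.streak = 0;
        } else if classified == self.candidate {
            self.streak += 1;
            if self.streak >= HOLD_COUNT {
                self.current = classified;
                self.streak = 0;
            }
        } else {
            // A different candidate: restart the streak.
            self.candidate = classified;
            self.streak = 1;
        }
        self.current
    }
}
```

A context that flaps between two profiles every other sample never accumulates a streak, so it stays on its current profile, which is exactly the thrashing the hold time is meant to prevent.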
CPU-level WorkloadProfile classification: The scheduler updates WorkloadProfile every
sched_latency_ns (default: 6ms on server, 4ms on desktop) using a
PELT-style exponential moving average:
1. Read hardware counters (per-CPU, sampled at context switch):
   - `scalar_fraction`: `(retired_instructions - simd_retired - amx_retired) / total_retired × 1000`
   - `vector_fraction`: `simd_retired / total_retired × 1000`
   - `matrix_fraction`: `amx_retired / total_retired × 1000` (or AccelBase utilization report for GPU/NPU)
   - `memory_bound_fraction`: `stall_cycles_memory / total_cycles × 1000`
   - `accel_wait_fraction`: `time_in_accel_wait / total_wall_time × 1000`
2. EMA smoothing: `new = (alpha × sample) + ((1 - alpha) × old)`, where alpha = 1/4 (fast response).
3. Saturation: if `scalar + vector + matrix > 1000`, the dominant fraction absorbs the excess.
4. Dominant domain: call `compute_dominant_domain(profile)` and store the result in `dominant_domain`.
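The EMA smoothing and saturation rules above, in integer arithmetic (a sketch; per-mille units as in `WorkloadProfile`):

```rust
/// EMA with alpha = 1/4: new = sample/4 + 3*old/4, in per-mille units.
fn ema_update(old: u32, sample: u32) -> u32 {
    (sample + 3 * old) / 4
}

/// Saturation: if scalar + vector + matrix exceeds 1000, the dominant
/// fraction absorbs the excess so the partition invariant holds.
fn saturate(scalar: u32, vector: u32, matrix: u32) -> (u32, u32, u32) {
    let sum = scalar + vector + matrix;
    if sum <= 1000 {
        return (scalar, vector, matrix);
    }
    let excess = sum - 1000;
    let mut parts = [scalar, vector, matrix];
    // Index of the dominant (largest) fraction absorbs the excess.
    let dom = (0..3).max_by_key(|&i| parts[i]).unwrap();
    parts[dom] = parts[dom].saturating_sub(excess);
    (parts[0], parts[1], parts[2])
}
```

After saturation, a profile built from these fractions always passes `WorkloadProfile::validate()`.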
On architectures without per-instruction-class counters (e.g., some ARM cores),
vector_fraction and matrix_fraction are estimated from the accelerator
utilization report (Section 22.1) and ITD (Intel Thread Director) hints where available.
Where this data is used:
- Power budgeting (Section 7.7): which domain to throttle (Section 22.8.5 above).
- Intent-based management (Section 7.10): when `intent.efficiency = 80` (prefer efficiency) and the workload is matrix-dominant, the optimizer suggests moving from GPU to NPU (lower power per matrix op).
- Userspace runtimes: exposed via sysfs for consumption by OpenCL/SYCL/oneAPI runtimes that make device selection decisions.
/sys/fs/cgroup/<group>/compute.profile
# # Read-only. Current workload classification (0-1000 = 0.0%-100.0%):
# # scalar: 150
# # vector: 50
# # matrix: 700
# # accel_wait: 600
# # memory_bound: 100
# # dominant: gpu_compute
/sys/kernel/umka/compute/topology
# # Read-only. JSON: all compute units with capacity profiles.
# # Consumed by userspace runtimes for device selection.
/sys/kernel/umka/compute/unit/<device_id>/capacity
# # Read-only. Per-unit: "scalar=1024 vector=300 matrix=150 ..."
22.8.7 Unified Cgroup Compute Budget (Optional)¶
An optional cgroup knob that expresses total compute need abstractly, leaving device selection to the kernel:
/sys/fs/cgroup/<group>/compute.weight
# # Proportional share of total system compute (across ALL devices).
# # Default: 0 (disabled — use existing cpu.weight + accel.compute.weight).
# # When set: kernel adjusts cpu.weight and accel.compute.weight internally
# # to optimize for the cgroup's workload profile.
#
# # Example: two cgroups, both compute.weight=100.
# # Cgroup A is GPU-bound → kernel gives A more GPU time, less CPU.
# # Cgroup B is CPU-bound → kernel gives B more CPU time, less GPU.
# # Both get "equal compute" in terms of actual useful work done.
Implementation: compute.weight is an orchestration knob. The kernel's
intent optimizer (Section 7.10) reads compute.weight + compute.profile and adjusts the
existing per-domain knobs (cpu.weight, accel.compute.weight) every ~1 second.
No new scheduling fast path. No change to CPU scheduler or AccelScheduler.
When compute.weight is 0 (default): existing separate knobs work exactly as
they do on Linux. Zero overhead. Full backward compatibility.
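One way the intent optimizer could translate `compute.weight` into the per-domain knobs (purely illustrative; the actual policy is unspecified here): split the cgroup's weight between `cpu.weight` and `accel.compute.weight` in proportion to where its workload profile says the work happens.

```rust
/// Split a cgroup's compute.weight between cpu.weight and
/// accel.compute.weight based on its workload profile.
/// Illustrative policy sketch only; fractions are per-mille as in
/// WorkloadProfile, and anything non-scalar is assumed accelerator-side.
fn split_weight(compute_weight: u32, scalar_fraction: u32) -> (u32, u32) {
    let cpu_share = scalar_fraction.min(1000);
    let cpu_weight = compute_weight * cpu_share / 1000;
    (cpu_weight, compute_weight - cpu_weight)
}
```

With two cgroups both at `compute.weight=100`, a GPU-bound cgroup (scalar fraction 150) lands mostly on the accelerator knob while a CPU-bound one (scalar fraction 900) lands mostly on `cpu.weight`, matching the example above.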
22.8.8 Unified Memory Domain Tracking¶
When CPU and accelerator share physical memory (no DMA copy boundary), the memory manager should understand this for page placement:
// umka-core/src/compute/memory.rs
/// Memory domain descriptor in the unified compute topology.
pub struct MemoryDomain {
/// NUMA node ID (integrates with existing memory manager, Section 4.1).
pub numa_node: u8,
/// Which compute units have local access to this memory?
/// On an APU: both CPU cores and GPU CUs list the same domain.
/// On discrete GPU: GPU CUs list VRAM domain, CPU lists DDR domain.
/// Populated at boot/hot-plug (cold path, heap available). The count is
/// bounded by the number of compute units in the system.
pub local_compute_units: Vec<DeviceNodeId>, // heap: cold-path only, after allocator init
/// Is this domain coherent across all local compute units?
/// true: APU shared memory, CXL 3.0 coherent pool.
/// false: discrete GPU VRAM (requires explicit flush/invalidate).
pub hardware_coherent: bool,
/// Bandwidth and latency from each compute unit type.
/// Used by page placement decisions.
/// Populated at boot/hot-plug (cold path, heap available).
pub access_costs: Vec<MemoryAccessCost>, // heap: cold-path only, after allocator init
}
pub struct MemoryAccessCost {
pub from_unit: DeviceNodeId,
pub latency_ns: u32,
pub bandwidth_gbs: u32,
}
What this enables:
Discrete GPU (PCIe, separate memory):
CPU cores → DDR NUMA node 0 (latency: 80ns, BW: 50 GB/s)
GPU SMs → VRAM NUMA node 2 (latency: 100ns, BW: 900 GB/s)
CPU→VRAM: latency 500ns, BW 25 GB/s (PCIe)
→ Page migration between CPU and GPU is expensive.
→ Applications MUST explicitly manage data placement (cudaMemcpy).
→ Kernel's role: NUMA-aware allocation. Same as Linux.
APU (shared memory):
CPU cores → DDR NUMA node 0 (latency: 80ns, BW: 50 GB/s)
GPU CUs → DDR NUMA node 0 (latency: 90ns, BW: 45 GB/s) ← SAME NODE
→ No page migration needed. CPU and GPU see the same pages.
→ Kernel can optimize page placement within the shared domain
(e.g., cache-line alignment for GPU access patterns).
→ Workload migration CPU↔GPU is a scheduling decision only, no data movement.
Grace Hopper (NVLink-C2C unified memory):
CPU cores → LPDDR5X NUMA node 0 (latency: 80ns, BW: 500 GB/s)
GPU SMs → HBM3 NUMA node 1 (latency: 100ns, BW: 3000 GB/s)
CPU↔GPU: NVLink-C2C (latency: 200ns, BW: 900 GB/s, COHERENT)
→ Hardware-coherent. Kernel can migrate pages transparently.
→ Hot pages accessed by GPU → migrate to HBM3 (faster).
→ Cold pages → migrate to LPDDR5X (more capacity).
→ Same mechanism as NUMA balancing between CPU sockets, extended to GPU.
This is an extension of the existing NUMA-aware page placement (Section 4.1, Section 22.4),
not a new mechanism. The PageLocationTracker (Section 22.4) already tracks which
NUMA node pages belong to. Unified memory domains just ensure accelerator-local
memory is correctly represented as a NUMA node with proper distance/bandwidth
metadata.
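The coherence-gated migration decision described above can be sketched as follows. The access counters and the 2:1 dominance threshold are assumptions (mirroring the spirit of NUMA balancing), not the actual PageLocationTracker interface; the point is that `hardware_coherent == false` forbids transparent migration entirely.

```rust
/// Hedged sketch: page placement across a CPU/GPU memory domain pair.
/// Only hardware-coherent pairs (Grace Hopper NVLink-C2C, CXL 3.0) may
/// migrate pages transparently; discrete PCIe GPUs stay explicit.
#[derive(Debug, PartialEq)]
pub enum Placement {
    Stay,          // keep the page where it is
    Migrate(u8),   // move to this NUMA node
}

pub fn place_page(
    hardware_coherent: bool,
    current_node: u8,
    gpu_local_node: u8,   // e.g. HBM3 on Grace Hopper
    cpu_local_node: u8,   // e.g. LPDDR5X
    gpu_accesses: u64,
    cpu_accesses: u64,
) -> Placement {
    if !hardware_coherent {
        // Discrete GPU VRAM: applications manage placement (cudaMemcpy).
        return Placement::Stay;
    }
    // Same idea as NUMA balancing: migrate toward the dominant accessor.
    let target = if gpu_accesses > 2 * cpu_accesses {
        gpu_local_node
    } else if cpu_accesses > 2 * gpu_accesses {
        cpu_local_node
    } else {
        current_node
    };
    if target == current_node { Placement::Stay } else { Placement::Migrate(target) }
}

fn main() {
    // Hot page, GPU-dominant, coherent link → migrate to HBM (node 1).
    println!("{:?}", place_page(true, 0, 1, 0, 1000, 10));
}
```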
22.8.9 NVIDIA Compatibility: No Changes Required¶
The unified compute model is specifically designed to NOT require driver changes:
NVIDIA driver stack (discrete GPU, PCIe):
Userspace (closed-source, binary compat):
libcuda.so — CUDA runtime → unchanged
libnvidia-ml.so — management library → unchanged
libnvcuvid.so — video decode → unchanged
All communicate via ioctl to kernel driver → unchanged
Kernel driver (open-source nvidia.ko, ported per Section 22.1 KABI):
Implements AccelBase vtable:
get_utilization() → kernel reads GPU utilization, power, clock
submit_commands() → kernel sees command flow
set_performance_level() → kernel can request clock changes
What the unified compute layer reads (no new driver code):
1. GPU utilization % → from get_utilization() (already required by AccelBase)
2. GPU power draw mW → from get_utilization() (already required)
3. GPU clock MHz → from get_utilization() (already required)
4. Memory bandwidth → from get_utilization() or static spec data
What the unified compute layer estimates (kernel-side, no driver involvement):
5. ComputeCapacityProfile → derived from AccelDeviceClass::GpuCompute +
known GPU specs (SM count, tensor core presence, memory type).
Spec database in kernel, keyed by PCI device ID. Same approach as
Linux's GPU frequency tables.
Optional future enhancement (minor KABI extension):
6. get_capacity_profile() → driver provides precise profile.
Not required. Kernel estimate works without it.
CUDA applications continue to explicitly target the GPU. The kernel does NOT intercept CUDA calls or redirect compute. The benefit is:
- Better power budgeting (kernel knows GPU is the useful domain)
- Better cgroup fairness (compute.weight distributes across CPU+GPU)
- Better topology data for orchestrators (Kubernetes reads sysfs)
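The spec-database estimate described above could be sketched like this. The PCI IDs and capacity numbers below are placeholders, not real database entries; the real table would ship in the kernel the same way Linux ships GPU frequency tables.

```rust
/// Sketch of the kernel-side capacity estimate: a static table keyed by
/// PCI (vendor, device) ID yields a ComputeCapacityProfile with no new
/// driver code. All entries here are hypothetical placeholder data.
pub struct CapacityProfile {
    pub scalar: u32,
    pub vector: u32,
    pub matrix: u32,
    pub memory_bw: u32,
    pub launch_overhead_us: u32,
}

pub fn estimate_profile(vendor: u16, device: u16) -> Option<CapacityProfile> {
    match (vendor, device) {
        // Hypothetical discrete GPU with tensor cores: high vector/matrix,
        // fast VRAM, PCIe launch overhead.
        (0x10de, 0x2684) => Some(CapacityProfile {
            scalar: 200, vector: 2000, matrix: 300, memory_bw: 900, launch_overhead_us: 10,
        }),
        // Hypothetical APU iGPU: shared DDR, low launch overhead.
        (0x1002, 0x164e) => Some(CapacityProfile {
            scalar: 100, vector: 400, matrix: 50, memory_bw: 45, launch_overhead_us: 2,
        }),
        // Unknown device: caller falls back to a conservative estimate
        // derived from AccelDeviceClass alone.
        _ => None,
    }
}

fn main() {
    println!("known: {}", estimate_profile(0x10de, 0x2684).is_some());
}
```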
22.8.10 What the Kernel Does NOT Do¶
To be explicit about boundaries, the kernel does NOT:
- Automatically redirect CUDA/ROCm/oneAPI workloads between devices. Applications that explicitly target a device continue to target that device. The kernel respects explicit choices.
- Implement a compute compiler that translates CPU code to GPU kernels or vice versa. That's a userspace runtime concern (OpenCL, SYCL, Vulkan Compute).
- Require drivers to expose internal scheduling decisions. GPU drivers still schedule internally. The kernel provides cross-device orchestration data.
- Add overhead to the compute submission hot path. Command submission (submit_commands) goes through AccelScheduler exactly as before. The unified topology is a background advisory system consulted at ~1 second intervals.
- Break on systems with no accelerators. When only CPUs are present, the unified compute topology contains only CPU entries. It degrades to exactly the Section 7.2 CpuCapacity model. Zero overhead.
22.8.11 Sysfs Interface for Userspace Runtimes¶
The key practical benefit: userspace runtimes (OpenCL, SYCL, oneAPI, future CUDA alternatives) can query the kernel for topology + workload data instead of each runtime re-discovering hardware independently:
/sys/kernel/umka/compute/
topology.json # Full compute topology (all units)
unit_count # Number of compute units
/sys/kernel/umka/compute/unit/<id>/
type # "cpu_core", "gpu_compute", "npu_engine", ...
capacity # Absolute: "scalar=50000 vector=400 matrix=5 ..."
capacity_normalized # Normalized 0-1024: "scalar=1024 vector=390 ..."
memory_domain # NUMA node ID
memory_unified # "1" if shared with CPU, "0" if separate
utilization # Current utilization (0-1024)
energy_model # OPP table (freq, capacity, power)
/sys/fs/cgroup/<group>/
compute.profile # Workload classification (read-only)
compute.weight # Unified compute budget (optional, default 0)
Use case: A SYCL runtime deciding between CPU and GPU for a kernel launch:
1. Read /sys/kernel/umka/compute/unit/*/capacity
2. Read /sys/kernel/umka/compute/unit/*/memory_unified
3. Know: GPU has matrix=800, memory_unified=1 (APU).
4. Decision: matrix workload + shared memory → GPU (no copy cost).
vs. on discrete GPU: memory_unified=0 → compare launch overhead
+ data transfer cost vs GPU throughput gain.
Today on Linux, each runtime does its own hardware discovery via vendor-specific APIs (nvml, rocm-smi, level-zero). The kernel provides no unified view. UmkaOS's sysfs topology eliminates redundant discovery and gives runtimes data the kernel already has (NUMA distances, power state, utilization).
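A runtime-side sketch of the use case above: parse the `capacity` string format and apply the unified-memory decision rule. File I/O is omitted (the strings would come from the sysfs files listed above), and the `matrix > 100` threshold is an assumption for illustration.

```rust
/// Userspace-runtime sketch: consume the kernel's capacity string
/// ("scalar=1024 vector=390 matrix=800 ...") and decide CPU vs GPU for a
/// matrix kernel launch, using memory_unified to account for copy cost.
use std::collections::HashMap;

pub fn parse_capacity(s: &str) -> HashMap<String, u32> {
    s.split_whitespace()
        .filter_map(|kv| {
            let (k, v) = kv.split_once('=')?;
            Some((k.to_string(), v.parse().ok()?))
        })
        .collect()
}

/// true → launch on the GPU unit; false → stay on the CPU.
pub fn prefer_gpu(gpu_capacity: &str, memory_unified: bool) -> bool {
    let caps = parse_capacity(gpu_capacity);
    let matrix = caps.get("matrix").copied().unwrap_or(0);
    // Unified memory (APU) removes the transfer cost, so a modest matrix
    // advantage suffices. On discrete GPUs the real decision would also
    // weigh launch_overhead_us and data transfer cost.
    matrix > 100 && memory_unified
}

fn main() {
    println!("{}", prefer_gpu("scalar=50 vector=400 matrix=800", true));
}
```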
22.8.12 Linux Compatibility¶
No existing Linux interfaces are affected. All new interfaces are additive:
Existing (preserved):
/sys/devices/system/cpu/cpuN/* — CPU topology, unchanged
/sys/class/drm/card0/* — GPU sysfs, unchanged
/dev/nvidia*, /dev/dri/* — device nodes, unchanged
sched_setattr(), ioctl(GPU_SUBMIT, ...) — syscalls, unchanged
New (additive):
/sys/kernel/umka/compute/* — unified compute topology
/sys/fs/cgroup/<group>/compute.profile — workload classification
/sys/fs/cgroup/<group>/compute.weight — optional unified budget
Applications unaware of new interfaces see standard Linux behavior.
22.8.13 Convergence Path: Accelerators as Peer Kernel Nodes¶
The unified compute topology (Section 22.8) treats accelerators as opaque compute units behind AccelBase vtables. This works Day 1 with existing proprietary firmware. But the architecture already contains the design for the next step.
Observation: every modern accelerator already has its own processor and runs its own kernel or firmware:
Device Processor Runs today Transport
───────────────────── ─────────────────── ─────────────────── ─────────
NVIDIA GPU (Ada+) RISC-V (GSP cores) Proprietary μkernel PCIe/NVLink
NVIDIA BlueField DPU ARM A78 Full Linux kernel PCIe
Intel Gaudi NPU Custom cores Firmware PCIe
AMD Instinct Embedded μctrl Firmware PCIe/xGMI
CXL memory expander Management proc Firmware CXL
Crypto coprocessor Dedicated core Firmware/RTOS PCIe/SPI
Future RISC-V accel RV64 harts Firmware PCIe/CXL/custom
The distributed kernel (Section 5.1) already solves "multiple UmkaOS instances sharing memory and capabilities across a transport." SmartNIC/DPU offload (Section 5.11) already says "a DPU is a close remote node connected via PCIe."
The convergence: any device with its own processor is a potential peer kernel node. If a vendor replaces proprietary firmware with a UmkaOS-lite instance, that device becomes a full participant in the distributed kernel fabric — its memory becomes DSM-managed, its compute is visible to the cluster scheduler, capabilities flow across the interconnect.
22.8.13.1 The Three-Stage Adoption Path¶
Naming note: The "stages" below describe the accelerator integration maturity model within Section 22.8. They are NOT the project-wide implementation phases (Phase 1-5) defined in Section 24.2 (23-roadmap.md). The mapping is: Stage A (Opaque) ships with Phase 3-4 (Real Workloads / Production Ready). Stage B (Advisory) ships with Phase 4-5 (Production Ready / Ecosystem). Stage C (Peer) targets Phase 5+ (Ecosystem and beyond, vendor-driven).
Stage A — Opaque (Day 1, drop-in Linux replacement):
┌──────────────┐ AccelBase vtable ┌──────────────┐
│ UmkaOS │ ──── (ioctl) ────► │ Proprietary │
│ (host CPU) │ │ firmware │
└──────────────┘ └──────────────┘
Kernel submits commands, reads telemetry. Device is a black box.
Works with existing NVIDIA, AMD, Intel stacks. No vendor changes.
Stage B — Advisory topology (this section):
Same as Stage A, plus:
- Kernel builds multi-dimensional capacity profiles.
- Workload classification drives power budgeting and intent optimization.
- Sysfs exposes topology data for userspace runtimes.
Still opaque device firmware. Benefits from kernel-side intelligence.
Stage C — Peer kernel node (vendor adoption):
┌──────────────┐ Section 5.1 distributed ┌──────────────┐
│ UmkaOS │ ──── kernel ──────► │ UmkaOS │
│ (host CPU) │ (PCIe/NVLink/CXL) │ (device) │
└──────────────┘ └──────────────┘
Device runs UmkaOS-lite. Becomes a node in the distributed fabric:
- Device memory → DSM-managed (Section 5.6). Transparent page sharing.
- Device compute → visible to cluster scheduler (Section 5.7).
- Capabilities → network-portable across host↔device (Section 5.8).
- Crash recovery → kernel restart on device, state preserved (Section 11.7).
The application API is UNCHANGED between Stage A and Stage C.
A CUDA app, an OpenCL app, a custom accelerator app — all work at every stage.
Stage C just makes the device more deeply integrated and manageable.
22.8.13.2 Transport Unification¶
The architecture has a unified peer protocol
(Section 5.1)
with pluggable transport bindings. All transports (RDMA, PCIe BAR, CXL, NVLink,
TCP, USB, HiperSockets) implement the unified ClusterTransport trait
(Section 5.10), bound per peer via PeerNode.transport: Arc<dyn ClusterTransport>.
All accelerator-to-host and accelerator-to-accelerator traffic uses the same
trait; each interconnect type has its own implementation:
// umka-core/src/transport/mod.rs
/// PCIe doorbell + status mailbox for low-latency host↔device signaling.
/// Mapped from a dedicated BAR region (typically BAR2 or a sub-range of BAR0).
/// The mailbox provides interrupt-free command submission notification and
/// completion polling for latency-sensitive paths.
pub struct PcieMailbox {
/// MMIO address of the doorbell register (write-only from host).
/// Writing a queue index + sequence number triggers device-side processing.
pub doorbell: *mut u32,
/// MMIO address of the status register (read-only from host).
/// Device writes completion status here; host polls or uses MSI-X.
pub status: *const u32,
/// MSI-X vector for mailbox completion interrupts (None = polling mode).
pub msix_vector: Option<u16>,
}
/// PCIe BAR-mapped transport. Host↔device on same machine.
/// For DPUs, discrete GPUs, add-in accelerators.
pub struct PcieBarTransport {
device: DeviceNodeId,
bar_base: PhysAddr,
bar_size: u64,
mailbox: Option<PcieMailbox>,
}
impl ClusterTransport for PcieBarTransport {
fn fetch_page(&self, remote_addr: u64, local_addr: PhysAddr, size: u32)
-> Result<(), TransportError>; // DMA read from BAR (~1-5 μs)
fn push_page(&self, local_addr: PhysAddr, remote_addr: u64, size: u32)
-> Result<(), TransportError>; // DMA write to BAR (~1-5 μs)
fn atomic_cas(&self, remote_addr: u64, expected: u64, desired: u64)
-> Result<u64, TransportError>; // MMIO atomic if hw supports, else doorbell
fn atomic_faa(&self, remote_addr: u64, addend: u64)
-> Result<u64, TransportError>;
fn fence(&self) -> Result<(), TransportError>; // SFENCE + BAR read-back
fn send(&self, msg: &[u8]) -> Result<(), TransportError>; // Write to BAR2 ring
fn send_reliable(&self, msg: &[u8], timeout_ms: u32) -> Result<(), TransportError>;
fn poll_recv(&self, buf: &mut [u8]) -> Option<usize>;
fn transport_name(&self) -> &'static str { "pcie-bar" }
fn supports_one_sided(&self) -> bool { true }
fn is_coherent(&self) -> bool { false }
}
/// NVLink / NVLink-C2C transport. Chip-to-chip.
/// For GPU↔GPU or CPU↔GPU (Grace Hopper).
pub struct NvLinkTransport {
device: DeviceNodeId,
link_id: u32,
coherent: bool,
bandwidth_gbs: u32,
}
impl ClusterTransport for NvLinkTransport {
// When coherent (Grace Hopper C2C): direct load/store, fence is no-op.
// When non-coherent: GPU membar.sys for fence.
fn transport_name(&self) -> &'static str { "nvlink" }
fn supports_one_sided(&self) -> bool { true }
fn is_coherent(&self) -> bool { self.coherent }
// ... all other methods implemented ...
}
/// CXL 3.0 shared memory transport.
/// For CXL-attached accelerators, memory expanders, composable infrastructure.
pub struct CxlPeerTransport {
device: DeviceNodeId,
cxl_port: u32,
bandwidth_gbs: u32,
}
impl ClusterTransport for CxlPeerTransport {
// Hardware-coherent: fence is no-op, is_coherent returns true.
// fetch_page/push_page: direct memcpy from/to shared region (~0.2-0.4 μs).
// atomic_cas/faa: hardware CAS/FAA via shared memory (~0.1-0.3 μs).
fn transport_name(&self) -> &'static str { "cxl" }
fn supports_one_sided(&self) -> bool { true }
fn is_coherent(&self) -> bool { true }
// ... all other methods implemented ...
}
Fence semantics are part of the ClusterTransport trait:
- RDMA: RC QP in-order delivery for writes; IBV_SEND_FENCE for reads/atomics.
- PCIe: SFENCE + read-back from BAR (flush posted writes).
- NVLink/CXL (coherent): hardware-coherent, fence is a no-op.
- NVLink (non-coherent): GPU membar.sys instruction.
- TCP: implicit (TCP is ordered).
TransportError Enum:
/// Errors from transport operations (used by ClusterTransport trait).
#[repr(u32)]
pub enum TransportError {
/// Connection to remote node lost (link down, node crashed).
ConnectionLost = 0,
/// Operation timed out (no response within deadline).
Timeout = 1,
/// Invalid remote address (unmapped, out of range).
InvalidAddress = 2,
/// Permission denied (rkey mismatch, capability revoked).
PermissionDenied = 3,
/// Device error (NIC failure, PCIe error, CXL protocol error).
DeviceError = 4,
/// RDMA fabric down (all QPs to this peer in error state).
FabricDown = 5,
}
Transport Hot-Unplug Handling:
When a device backing a ClusterTransport is removed (PCIe hot-unplug, NVLink failure,
CXL device removal):
- The device registry emits DeviceEvent::Removed for the transport device.
- All in-flight operations on that transport return TransportError::ConnectionLost.
- The distributed kernel protocol (if active) downgrades the affected node:
  - DSM pages owned by that node become read-only on all other nodes.
  - Capabilities issued by that node are marked suspect (cannot be renewed).
- If the device was a GPU running UmkaOS-lite (a Stage C peer, Section 22.8.13.1), its compute units are removed from the unified topology and its memory domains are marked unavailable.
- Processes with active contexts on the removed device receive SIGBUS.
Key property: The distributed kernel protocol (DSM coherence, capability exchange, heartbeat, cluster join) is transport-agnostic. It sends messages and reads/writes remote memory. Whether that goes over RDMA, PCIe BAR, NVLink, or CXL is an implementation detail. The protocol layer doesn't change.
This means a GPU running UmkaOS-lite joins the distributed kernel using the same protocol as a remote server — just over NVLink instead of RDMA. The cluster scheduler, DSM directory, and capability system see it as another node.
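The transport-agnostic property can be illustrated with a reduced sketch: the protocol layer is written once against a trait object and never names the interconnect. The trait below is a cut-down stand-in for the full ClusterTransport in Section 5.10 (only the methods this example needs, and a simplified error type), and MockTransport is a test double, not a real binding.

```rust
/// Reduced stand-in for Section 5.10's ClusterTransport (assumption:
/// real trait uses TransportError, not String).
pub trait ClusterTransport {
    fn send_reliable(&self, msg: &[u8], timeout_ms: u32) -> Result<(), String>;
    fn is_coherent(&self) -> bool;
    fn transport_name(&self) -> &'static str;
}

/// Protocol-layer cluster-join message: identical code whether the peer
/// is a remote server over RDMA or a GPU over NVLink.
pub fn send_cluster_join(t: &dyn ClusterTransport, node_id: u64) -> Result<&'static str, String> {
    let msg = node_id.to_le_bytes();
    t.send_reliable(&msg, 1000)?;
    Ok(t.transport_name())
}

/// Test double standing in for any real transport binding.
pub struct MockTransport;
impl ClusterTransport for MockTransport {
    fn send_reliable(&self, _msg: &[u8], _timeout_ms: u32) -> Result<(), String> { Ok(()) }
    fn is_coherent(&self) -> bool { true }
    fn transport_name(&self) -> &'static str { "mock" }
}

fn main() {
    println!("{:?}", send_cluster_join(&MockTransport, 7));
}
```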
22.8.13.3 Any Accelerator, Any Interconnect¶
This model is deliberately generic. It applies to any device with a processor, regardless of function:
Device type Compute capacity profile (Section 22.6.3 fields) When it becomes a peer node
─────────────────── ──────────────────────────────────────── ─────────────────────────────
GPU vector=2000, matrix=300, scalar=200 Vendor ships UmkaOS firmware
NPU matrix=40, vector=0, scalar=0 Vendor ships UmkaOS firmware
DPU/SmartNIC scalar=5000, vector=100, memory_bw=200 Already runs Linux → easy port
Crypto coprocessor scalar=1000, vector=0, matrix=0 Vendor ships UmkaOS firmware
FPGA variable (depends on bitstream) FPGA shell runs UmkaOS
DSP vector=500, scalar=2000, matrix=0 Vendor ships UmkaOS firmware
CSD (comp. storage) scalar=3000, memory_bw=500 NVMe controller runs UmkaOS
Future RISC-V accel scalar+vector (implementation defined) Naturally runs UmkaOS (RISC-V)
All device types express their capacity using the five ComputeCapacityProfile
fields defined in Section 22.8 (scalar, vector, matrix, memory_bw,
launch_overhead_us). Specialized workload categories (inference, network
offload, crypto, signal processing) map to combinations of these base dimensions.
For example, NPU inference throughput is captured by matrix (the dominant
operation), and DPU network offload is captured by scalar + memory_bw.
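The mapping can be sketched as constructors over the five base dimensions. The scalar/vector/matrix/memory_bw numbers come from the table above; the launch_overhead_us values and the `dominant()` helper are illustrative assumptions.

```rust
/// Sketch: specialized device roles expressed as combinations of the five
/// ComputeCapacityProfile base dimensions. Capacity numbers follow the
/// table above; launch overheads are placeholder assumptions.
pub struct CapacityProfile {
    pub scalar: u32,
    pub vector: u32,
    pub matrix: u32,
    pub memory_bw: u32,
    pub launch_overhead_us: u32,
}

impl CapacityProfile {
    /// NPU: inference throughput is captured by the matrix dimension.
    pub fn npu() -> Self {
        Self { scalar: 0, vector: 0, matrix: 40, memory_bw: 0, launch_overhead_us: 50 }
    }
    /// DPU/SmartNIC: network offload is scalar + memory_bw.
    pub fn dpu() -> Self {
        Self { scalar: 5000, vector: 100, matrix: 0, memory_bw: 200, launch_overhead_us: 5 }
    }
    /// Which compute dimension dominates (drives scheduler affinity).
    pub fn dominant(&self) -> &'static str {
        let dims = [("scalar", self.scalar), ("vector", self.vector), ("matrix", self.matrix)];
        dims.iter().max_by_key(|(_, v)| *v).unwrap().0
    }
}

fn main() {
    println!("{} {}", CapacityProfile::npu().dominant(), CapacityProfile::dpu().dominant());
}
```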
RISC-V accelerators are the most natural fit: UmkaOS has RISC-V as a first-class target architecture (Section 2.22), so a RISC-V-based accelerator can run the same kernel binary (with a different device tree and minimal board support).
The architecture doesn't need to predict which device types will exist. It only needs to provide:
1. A generic compute unit model (Section 22.8) — works for any device.
2. A transport-agnostic distributed kernel protocol (Section 5.1) — works over any interconnect.
3. An adoption path that doesn't require vendors to change anything until they choose to.
Mixed-Coherence Cluster Optimization:
In a system with both coherent (CXL, NVLink-C2C) and non-coherent (RDMA, PCIe BAR)
transports, the DSM protocol can skip invalidation messages for node pairs connected
by coherent transports. The is_coherent() method on ClusterTransport enables
per-pair optimization:
- Coherent pair (e.g., CPU ↔ CXL GPU): fence() is a no-op. No invalidation messages needed — hardware maintains coherence automatically.
- Non-coherent pair (e.g., CPU ↔ remote RDMA node): standard DSM invalidation protocol applies (explicit messages + RDMA fencing).
- Mixed cluster: The DSM directory tracks coherence per node-pair. Invalidation fanout skips coherent pairs, reducing message traffic in heterogeneous clusters.
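The per-pair fanout reduction amounts to a filter over the sharer set. A minimal sketch, with simplified types standing in for the DSM directory's per-pair coherence record:

```rust
/// Sketch: compute which sharers need an explicit DSM invalidation
/// message. Each entry is (node_id, pair_is_coherent); coherent pairs
/// are skipped because hardware maintains coherence automatically.
pub fn invalidation_targets(sharers: &[(u64, bool)]) -> Vec<u64> {
    sharers
        .iter()
        .filter(|(_, coherent)| !coherent)
        .map(|(id, _)| *id)
        .collect()
}

fn main() {
    // Node 1 is CXL-coherent with the writer; nodes 2 and 3 are RDMA peers.
    println!("{:?}", invalidation_targets(&[(1, true), (2, false), (3, false)]));
}
```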
22.8.14 Performance Impact¶
Systems without accelerators:
Unified topology contains only CPU entries.
Overhead: one additional struct per CPU core (~64 bytes).
Runtime overhead: zero. Advisory system has nothing extra to advise on.
Systems with accelerators (steady state):
Topology update: ~1μs per accelerator per second (read get_utilization).
Workload classification: ~2μs per cgroup per second (read perf counters).
Cross-device energy optimization: ~1μs per cgroup per power budget tick.
Total: ~4μs per second per cgroup. Fraction of a percent.
Same data is already being read by AccelScheduler (Section 22.1.2.4) and
PowerBudgetEnforcer (Section 7.4). Unified topology reuses those readings.
Marginal overhead: near zero.
Compute submission hot path: UNCHANGED.
submit_commands() → AccelScheduler → driver. No new code in this path.
Benefit: better power budgeting decisions (Section 22.6.5) save more power than
the microseconds spent on workload classification.
The AI/ML Policy Framework has been moved to Section 23.1 for clarity. Its section number (§23.1) and all anchor links are unchanged.